r 甲基化样本验证
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了r 甲基化样本验证相关的知识,希望对你有一定的参考价值。
#Get actual genotypes from sample. Most likely you'll need imputed genotypes for this to work. I am showing how to merge genotypes by chromosome here
for chr in {1..22}
do
plink --bfile /home/maihofer/mrsvetsagcta/genotypes/mrsc_bgn__chr"$chr" --extract barfield.snplist --make-bed --out mrsc_seyma_"$chr"
done
ls mrsc_seyma_* | grep .bim | sed 's/.bim//g' | awk '{print $1".bed",$1".bim",$1".fam"}' > seyma_mrs.list
plink --merge-list seyma_mrs.list --make-bed --out mrsc_seyma0
#Make sure that the genotyping rate is high in markers used.
plink --bfile mrsc_seyma0 --geno 0.05 --make-bed --out mrsc_seyma
#So now we need to assign ACTUAL allele calls to the phony methylation allele calls (Right now the EWAS SNP calls are called as A/B instead of A C G T)
#We do this by calculating allele frequencies in the GWAS data of the methylation subjects
#Then it's just a matter of matching allele frequency between EWAS and GWAS data
#Calculate AFs in GWAS sample, in just the methylation subjects.
plink_win64/plink --noweb --bfile mrsc_seyma --keep methylation_subjects.txt --freq --out mrsc_freqs_subs
#Calculate AFs in EWAS sample
plink_win64/plink --noweb --file seyma_barfield --freq --out seyma_freqs
#To the person doing this. Line up the two allele frequency files in excel to get the new allele codes. Only take ones with MAF < 30% and low difference after alignment.
#i.e. in the GWAS there is T C and T has 30% frequency. In the EWAS there is A B and B has 30% frequency. Therefore B should be assigned to T and A should be assigned to C.
#Save a file called 'purported alleles' that has a row for each SNP, with SNP name, A , B , and the actual alleles they are assigned to. E.g. in the last example this would be
# rs1234 A B C T
#Update EWAS alleles based on matching MAFs
plink_win64/plink --noweb --file seyma_barfield --update-alleles purported_alleles.txt --extract purported_alleles.txt --make-bed --out seyma_updated
#Merge EWAS and GWAS
plink_win64/plink --noweb --bfile mrsc_seyma --bmerge seyma_updated.bed seyma_updated.bim seyma_updated.fam --make-bed --out alldata
plink_win64/plink --noweb --bfile alldata --make-bed -extract purported_alleles.txt --out alldata2
#Do IBD (in GWAS PI hat 1 is a perfect match, but this is imperfect, so look for a .8 match or so to identify a matching sample)
plink_win64/plink --bfile alldata2 --indep-pairwise 50 5 0.2 --out seymamarkers_ibd
plink_win64/plink --noweb --bfile alldata2 --maf 0.05 --extract seymamarkers_ibd.prune.in --genome --min 0.5 --out alldata_ibd
#Does ibd work in mrs alone??
#plink_win64/plink --noweb --bfile mrsc_seyma --genome --min 0.7 --out ibdtestmrs
impute_median <- function(x)
{
med <- median(x,na.rm=T)
x[is.na(x)] <- med
return(x)
}
##Read in a .csv file consisting of Barfield CpGs, one row per CpG, one column per sample.
d1 <- read.csv('MRS_Long_barfieldProbes.csv',header=T,nr=20000)
d1$CpG <- d1$X
## Convert CpG names to SNP names
probenames <- read.csv("barfield_probes.csv",header=T,stringsAsFactors=F,na.strings=c(NA,"."))
#Do this by selecting nearest downstream or upstream snp
probenames$SNP <- probenames$US_SNP_ID
nearest <- which(probenames$MinDist_US > probenames$MinDist_DS | is.na(probenames$SNP))
probenames[nearest,]$SNP <- probenames[nearest,]$DS_SNP_ID
#Subset the probe file to just the CpG and SNP name
probenames2 <- subset(probenames,select=c(CpG,SNP))
#Remove duplicated SNPs
#(duplicated removes the second file in a list.
#The better way to do this would be to take the closer stuff, i.e. just sort the original file by distance prior to running this step)
probenames3 <- probenames2[!duplicated(probenames$SNP),]
#Merge the dataset with the SNP names
d2 <- merge(d1,probenames3,by="CpG")
#Use the SNP names as row names
row.names(d2) <- d2$SNP
#Remove the CpG and SNP name columns from the data, don't need them anymore
d2 <- d2[,!(names(d2) %in% c("CpG","X","SNP"))]
#Impute the median over missing data
dat_imputed <- t(apply(d2, 1, impute_median))
#Call genotypes.
#This is a lazy way that assumes the allele is not monomorphic and is distributed over the full range of possible methylation values
#it's not right, but it works!!
get_genotypes2 <- function(x)
{
genotype <- rep(NA, length(x))
gt0 <- which(x < .33)
gt1 <- which(x >= .33 & x <= .66)
gt2 <- which(x > .66)
genotype[gt0] <- "A A"
genotype[gt1] <- "A B"
genotype[gt2] <- "B B"
return(as.character(genotype))
}
dat_cl <- as.data.frame(apply(dat_imputed,1,get_genotypes2),stringsAsFactors=F)
row.names(dat_cl) <- colnames(dat_imputed)
#Data will have column out transposed. Rename rows of data to subject methylation IDs
dat_cl$methid <- row.names(dat_cl)
#Make a .map file
mapfile <- data.frame(0,colnames(dat_cl)[colnames(dat_cl) != "methid"],0,0)
write.table(mapfile,'seyma_barfield.map',col.names=F,quote=F,row.names=F)
##Make a .ped file.
#Read a file in that will be used to link EWAS IDs to GWAS IDs. This file should have columns for: methylation, FID,IID,Mother IID(just set to 0), father IID (just set to 0), gender (set all to 1 if you dont know) and phenotype (set all to 1 if you dont know) .
#If you don't know any of these, even the purported GWAS ID, it's fine
plinkin <- read.csv('seyma_preplink.csv',header=T,stringsAsFactors=F)
plinkin$methid <- paste("X",plinkin$methid,sep="") #I just do this because my methylation IDs had an X added to them by R
plinkin2 <- subset(plinkin,select=c(methid,FID,IID,M,F,G,P)) #My file also had extra columns, I subset to the PLINK ones
#Export a PLINK PED file containing the added columns and the 'called' rs ids
datexp <- merge(plinkin2,dat_cl,by=c("methid"))
datexp2 <- datexp[,-1]
write.table(datexp2,"seyma_barfield.ped",quote=F,col.names=F,row.names=F,na="0 0",sep="\t")
以上是关于r 甲基化样本验证的主要内容,如果未能解决你的问题,请参考以下文章
Uanle TCGA数据挖掘——预后相关的甲基化位点及构建重要基因的风险模型