r 甲基化样本验证

Posted 2021-05-23

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了r 甲基化样本验证相关的知识，希望对你有一定的参考价值。


 
#Get actual genotypes from sample. Most likely you'll need imputed genotypes for this to work. I am showing how to merge genotypes by chromosome here
for chr in {1..22}
do
plink --bfile /home/maihofer/mrsvetsagcta/genotypes/mrsc_bgn__chr"$chr" --extract barfield.snplist --make-bed --out mrsc_seyma_"$chr"
done

ls mrsc_seyma_* | grep .bim | sed 's/.bim//g' | awk '{print $1".bed",$1".bim",$1".fam"}' > seyma_mrs.list

plink --merge-list seyma_mrs.list --make-bed --out mrsc_seyma0

#Make sure that the genotyping rate is high in markers used.
plink --bfile mrsc_seyma0 --geno 0.05 --make-bed --out mrsc_seyma

#So now we need to assign ACTUAL allele calls to the phony methylation allele calls (Right now the EWAS SNP calls are called as A/B instead of A C G T)
#We do this by calculating allele frequencies in the GWAS data of the methylation subjects
#Then it's just a matter of matching allele frequency between EWAS and GWAS data 

#Calculate AFs in GWAS sample, in just the methylation subjects. 
plink_win64/plink --noweb --bfile mrsc_seyma --keep methylation_subjects.txt --freq --out mrsc_freqs_subs


#Calculate AFs in EWAS sample
 plink_win64/plink --noweb --file seyma_barfield --freq --out seyma_freqs

#To the person doing this. Line up the two allele frequency files in excel to get the new allele codes. Only take ones with MAF < 30% and low difference after alignment.
#i.e. in the GWAS there is T C and T has 30% frequency. In the EWAS there is A B and B has 30% frequency. Therefore B should be assigned to T and A should be assigned to C.
#Save a file called 'purported alleles' that has a row for each SNP, with SNP name, A , B , and the actual alleles they are assigned to. E.g. in the last example this would be
# rs1234 A B C T


#Update EWAS alleles based on matching MAFs
 plink_win64/plink --noweb --file seyma_barfield --update-alleles purported_alleles.txt --extract purported_alleles.txt --make-bed --out seyma_updated

#Merge EWAS and GWAS
 plink_win64/plink --noweb --bfile mrsc_seyma --bmerge seyma_updated.bed seyma_updated.bim seyma_updated.fam --make-bed --out alldata
 plink_win64/plink --noweb --bfile alldata --make-bed -extract purported_alleles.txt  --out alldata2

 #Do IBD (in GWAS PI hat 1 is a perfect match, but this is imperfect, so look for a .8 match or so to identify a matching sample)
  plink_win64/plink --bfile alldata2 --indep-pairwise 50 5 0.2 --out seymamarkers_ibd
 plink_win64/plink --noweb --bfile alldata2  --maf 0.05 --extract  seymamarkers_ibd.prune.in --genome --min 0.5 --out  alldata_ibd
 
 


#Does ibd work in mrs alone??
#plink_win64/plink --noweb --bfile mrsc_seyma --genome --min 0.7 --out ibdtestmrs

impute_median <- function(x)
{
	med <- median(x,na.rm=T)
	x[is.na(x)] <- med
	return(x)
}

##Read in a .csv file consisting of Barfield CpGs, one row per CpG, one column per sample.
d1 <- read.csv('MRS_Long_barfieldProbes.csv',header=T,nr=20000)
d1$CpG <- d1$X

## Convert CpG names to SNP names
probenames <- read.csv("barfield_probes.csv",header=T,stringsAsFactors=F,na.strings=c(NA,"."))

#Do this by selecting nearest downstream or upstream snp
probenames$SNP <- probenames$US_SNP_ID
nearest <- which(probenames$MinDist_US > probenames$MinDist_DS | is.na(probenames$SNP))
probenames[nearest,]$SNP <- probenames[nearest,]$DS_SNP_ID

#Subset the probe file to just the CpG and SNP name
probenames2 <- subset(probenames,select=c(CpG,SNP))

#Remove duplicated SNPs 
#(duplicated removes the second file in a list. 
#The better way to do this would be to take the closer stuff, i.e. just sort the original file by distance prior to running this step)

probenames3 <- probenames2[!duplicated(probenames$SNP),]

#Merge the dataset with the SNP names
d2 <- merge(d1,probenames3,by="CpG")

#Use the SNP names as row names
row.names(d2) <- d2$SNP

#Remove the CpG and SNP name columns from the data, don't need them anymore
d2 <- d2[,!(names(d2) %in% c("CpG","X","SNP"))]

#Impute the median over missing data
dat_imputed <- t(apply(d2, 1, impute_median))

#Call genotypes.
#This is a lazy way that assumes the allele is not monomorphic and is distributed over the full range of possible methylation values
#it's not right, but it works!!

get_genotypes2 <- function(x)
{
	genotype <- rep(NA, length(x))
	gt0 <- which(x < .33)
	gt1 <- which(x >= .33 & x <= .66)
	gt2 <- which(x > .66)
	genotype[gt0] <- "A A"
	genotype[gt1] <- "A B"
	genotype[gt2] <- "B B"
 
	return(as.character(genotype))

}


dat_cl <-  as.data.frame(apply(dat_imputed,1,get_genotypes2),stringsAsFactors=F)
row.names(dat_cl) <- colnames(dat_imputed)

#Data will have column out transposed. Rename rows of data to subject methylation IDs
dat_cl$methid <- row.names(dat_cl)

#Make a .map file
mapfile <- data.frame(0,colnames(dat_cl)[colnames(dat_cl) != "methid"],0,0)

write.table(mapfile,'seyma_barfield.map',col.names=F,quote=F,row.names=F)

##Make a .ped file. 

#Read a file in that will be used to link EWAS IDs to GWAS IDs. This file should have columns for: methylation, FID,IID,Mother IID(just set to 0), father IID (just set to 0), gender (set all to 1 if you dont know) and phenotype (set all to 1 if you dont know) .
#If you don't know any of these, even the purported GWAS ID, it's fine
plinkin <- read.csv('seyma_preplink.csv',header=T,stringsAsFactors=F)
plinkin$methid <- paste("X",plinkin$methid,sep="") #I just do this because my methylation IDs had an X added to them by R

plinkin2 <- subset(plinkin,select=c(methid,FID,IID,M,F,G,P)) #My file also had extra columns, I subset to the PLINK ones

#Export a PLINK PED file containing the added columns and the 'called' rs ids
datexp <- merge(plinkin2,dat_cl,by=c("methid")) 
datexp2 <- datexp[,-1]

write.table(datexp2,"seyma_barfield.ped",quote=F,col.names=F,row.names=F,na="0 0",sep="\t")

以上是关于r 甲基化样本验证的主要内容，如果未能解决你的问题，请参考以下文章

甲基化数据QC：使用甲基化数据计算样本间的相关性

Uanle TCGA数据挖掘——预后相关的甲基化位点及构建重要基因的风险模型

r 在甲基化年龄分析的背景下，展示拦截术语如何从抽样方法中调整估计偏差

使用IGV简单绘制甲基化分布图

甲基化测序技术及其在老年疾病和衰老模型中的应用

DNA甲基化及其测量方法（转）