CNV Detection from Whole-Exome Sequencing (WES) Data: A Review of Methods and a Roundup of Software
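Summary (based on a limited reading of the literature):
1. Most methods are read-depth based: the targets are split into intervals and depth is summarized per interval (FPKM, or an FPKM-like measure computed from the number of bases covering the region).
2. Most tools use control samples as the reference baseline, but the reference itself can have large variance. That variance may come from technical bias or from real CNVs, and the algorithms cannot remove it. Most control-based algorithms are also likely to be inaccurate for common CNVs. For a CNV with a population frequency of 50%, for example, the reference may end up treating the single-copy state as the normal diploid baseline, so a truly diploid sample may then be called as a gain (for example, as three copies).
3. Exome coverage depth is usually corrected for GC content and mappability. For GC correction, the normalized depth of a window w is generally its raw depth divided by the depth (typically the mean or median) of the windows that have the same GC content.
4. CNV calling is usually preceded by quality control that removes heavily biased intervals, taking into account per-interval coverage, overall sample coverage, intervals with extreme GC content, and so on.
5. Common denoising methods are PCA and SVD (generally the top k components are removed as noise). Applying these methods can also remove the signal of common CNVs, which is why some tools perform poorly on common CNVs. Some tools do take common CNVs into account. One of the main optimizations in CLAMMS, for example, is that it also models the characteristics of common CNVs, and for batch-effect removal it uses metrics produced by Picard rather than the depth files. CODEX2, when no normal controls are available (it still needs a batch of samples analyzed together), can detect CNVs that are common across all samples, and it performs well on the data in its paper.
6. Common CNV calling algorithms are HMM and CBS, and newer methods also use machine learning. Other approaches are used less often, and I do not fully understand them. Calls usually span multiple exons. Some tools reach single-exon resolution, but they are few and their recall is not great (deletions are relatively easier to detect).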
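The GC correction described in point 3 can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any of the tools listed here: the function name `gc_normalize`, the 5% GC bin width, the use of the median as the per-bin summary, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(depths, gc_fractions, bin_pct=5):
    """Return each window's depth divided by the median depth of its GC bin.

    depths       -- raw read depth per window
    gc_fractions -- GC content per window, as a fraction in [0, 1]
    bin_pct      -- GC bin width in percentage points (assumed default)
    """
    if len(depths) != len(gc_fractions):
        raise ValueError("depths and gc_fractions must have the same length")

    # Assign each window to a GC bin using integer percentages,
    # which avoids floating-point surprises at bin edges.
    bins = [round(gc * 100) // bin_pct for gc in gc_fractions]

    # Collect the raw depths observed in each GC bin.
    depths_by_bin = defaultdict(list)
    for b, d in zip(bins, depths):
        depths_by_bin[b].append(d)
    bin_median = {b: median(ds) for b, ds in depths_by_bin.items()}

    # A bin whose median depth is 0 carries no usable signal; mark it NaN.
    return [
        d / bin_median[b] if bin_median[b] > 0 else float("nan")
        for b, d in zip(bins, depths)
    ]


# Toy data (hypothetical): high-GC windows have about half the depth
# of mid-GC windows, purely because of GC bias.
raw_depth = [100, 110, 90, 50, 55, 45]
gc_content = [0.40, 0.41, 0.42, 0.70, 0.71, 0.72]
normalized = gc_normalize(raw_depth, gc_content)
print([round(x, 2) for x in normalized])
```

After normalization both groups of windows are centered on 1.0, so the depth difference caused by GC content no longer looks like a copy-number change. Real tools often also rescale by the overall median depth and smooth across sparsely populated GC bins, which this sketch leaves out.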
Paper titles:
1. CopyDetective: Detection threshold-aware copy number variant calling in whole-exome sequencing data.
2. Detection of copy-number variations from NGS data using read depth information: a diagnostic performance evaluation.
3. Copy Number Variation Detection Using Total Variation.
4. A highly sensitive and specific workflow for detecting rare copy-number variants from exome sequencing data.
5. Copy number variation profiling in pharmacogenes using panel-based exome resequencing and correlation to human liver expression.
6. A machine-learning approach for accurate detection of copy number variants from exome sequencing.
7. Atlas-CNV: a validated approach to call single-exon CNVs in the eMERGESeq gene panel.
8. CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing.
9. Clinical analysis of germline copy number variation in DMD using a non-conjugate hierarchical Bayesian model.
10. Preprocessing Sequence Coverage Data for More Precise Detection of Copy Number Variations.
11. Integrative DNA copy number detection and genotyping from sequencing and array-based platforms.
12. WISExome: a within-sample comparison approach to detect copy number variations in whole exome sequencing data.
13. Anaconda: AN automated pipeline for somatic COpy Number variation Detection and Annotation from tumor exome sequencing data.
14. ExCNVSS: A Noise-Robust Method for Copy Number Variation Detection in Whole Exome Sequencing Data.
15. Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN
16. Homozygous and hemizygous CNV detection from exome sequencing data in a Mendelian disease cohort.
17. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing
18. CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data.
19. A Sparse Model Based Detection of Copy Number Variations From Exome Sequencing Data.
20. DeAnnCNV: a tool for online detection and annotation of copy number variations from whole-exome sequencing data.
21. CopywriteR: DNA copy number detection from off-target sequence data.
22. Allele-specific copy-number discovery from whole-genome and whole-exome sequencing.
23. CODEX: a normalization and copy number variation detection method for whole exome sequencing.
24. Combinatorial approach to estimate copy number genotype using whole-exome sequencing data.
25. Assessing copy number from exome sequencing and exome array CGH based on CNV spectrum in a large clinical cohort.
26. Detection of internal exon deletion with exon Del.
27. cnvCapSeq: detecting copy number variation in long-range targeted resequencing data.
28. Inferring copy number and genotype in tumour exome data
29. cnvOffSeq: detecting intergenic copy number variation using off-target exome sequencing data.
30. Identification of copy number variants from exome sequence data.
31. PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data.
32. EXCAVATOR: detecting copy number variants from whole-exome sequencing data
33. CoNVEX: copy number variation estimation in exome sequencing data using HMM.
34. Improving detection of copy-number variation by simultaneous bias correction and read-depth segmentation
35. Modeling read counts for CNV detection in exome sequencing data
36. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth.
37. An exome sequencing pipeline for identifying and genotyping common CNVs associated with disease with application to psoriasis.
38. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling.
39. Copy number variation detection and genotyping from exome sequence data
40. CONTRA: copy number analysis for targeted resequencing
41. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate
42. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing
43. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data
44. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV.
45. CNV-seq, a new method to detect copy number variation using high-throughput sequencing
Inferring a tumor's evolutionary history (its "evolutionary tree") from exome data of cancer cells
#######################################################
#######################################################
####### #######
####### CNA and SNA input #######
####### #######
#######################################################
#######################################################
library(Canopy)
data("MDA231")
projectname = MDA231$projectname ## name of project
R = MDA231$R; R ## mutant allele read depth (for SNAs)
X = MDA231$X; X ## total depth (for SNAs)
WM = MDA231$WM; WM ## observed major copy number (for CNA regions)
Wm = MDA231$Wm; Wm ## observed minor copy number (for CNA regions)
epsilonM = MDA231$epsilonM ## standard deviation of WM, pre-fixed here
epsilonm = MDA231$epsilonm ## standard deviation of Wm, pre-fixed here
## whether CNA regions harbor specific CNAs (only needed for overlapping CNAs)
C = MDA231$C; C
Y = MDA231$Y; Y ## whether SNAs are affected by CNAs
#######################################################
#######################################################
####### #######
####### MCMC sampling #######
####### #######
#######################################################
#######################################################
K = 3:6 # number of subclones
numchain = 20 # number of chains with random initiations
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM,
                          epsilonm = epsilonm, C = C, Y = Y, K = K,
                          numchain = numchain, simrun = 100000, writeskip = 200,
                          projectname = projectname, cell.line = TRUE,
                          plot.likelihood = TRUE)
save.image(file = paste(projectname, '_postmcmc_image.rda', sep = ''),
           compress = 'xz')
#######################################################
#######################################################
####### #######
####### BIC to determine number of subclones #######
####### #######
#######################################################
#######################################################
library(Canopy)
projectname = 'MDA231'
load(paste(projectname, '_postmcmc_image.rda', sep = ''))
burnin = 100
thin = 10
# If pdf = TRUE, a pdf will be generated.
bic = canopy.BIC(sampchain = sampchain, projectname = projectname, K = K,
                 numchain = numchain, burnin = burnin, thin = thin, pdf = TRUE)
optK = K[which.max(bic)]
#######################################################
#######################################################
####### #######
####### posterior tree evaluation #######
####### #######
#######################################################
#######################################################
post = canopy.post(sampchain = sampchain, projectname = projectname, K = K,
                   numchain = numchain, burnin = burnin, thin = thin,
                   optK = optK, C = C, post.config.cutoff = 0.05)
samptreethin = post[[1]] # list of all post-burnin and thinning trees
samptreethin.lik = post[[2]] # likelihoods of trees in samptree
config = post[[3]]
config.summary = post[[4]]
print(config.summary)
# first column: tree configuration
# second column: posterior configuration probability in the entire tree space
# third column: posterior configuration likelihood in the subtree space
# note: if modes of posterior probabilities aren't obvious, run sampling longer.
#######################################################
#######################################################
####### #######
####### Tree output and plot #######
####### #######
#######################################################
#######################################################
# choose the configuration with the highest posterior likelihood
config.i = config.summary[which.max(config.summary[,3]),1]
cat('Configuration', config.i, 'has the highest posterior likelihood.\n')
output.tree = canopy.output(post, config.i, C)
pdf.name = paste(projectname, '_config_highest_likelihood.pdf', sep = '')
canopy.plottree(output.tree, pdf = TRUE, pdf.name = pdf.name)
# plot posterior tree with second configuration
output.tree = canopy.output(post, 1, C)
canopy.plottree(output.tree, pdf = TRUE,
                pdf.name = paste(projectname, '_second_config.pdf', sep = ''))