A Review of Methods and a Roundup of Software for Detecting CNVs from Whole-Exome Sequencing (WES) Data

Posted 2023-05-16

Summary (based on limited reading):

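1. Most methods are read-depth based: the target is split into intervals and depth is summarized per interval, using FPKM or an FPKM-like measure computed from the number of bases in the region.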

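2. Most tools use control samples as a reference baseline, but the reference itself can have large variance. That variance may come from technical bias or from real CNVs, and the algorithms cannot remove it during processing. Most control-based algorithms are also likely to be inaccurate for common CNVs. For a CNV with a population frequency of 50%, for example, the reference may end up treating the single-copy state as the normal diploid state, so truly diploid samples may be called as having three copies.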

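3. Exome coverage depth is usually corrected for GC content and mappability. For GC correction, the normalized depth of a window w is typically the raw depth of that window divided by the depth of the windows that share its GC content (a summary such as their median).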

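4. CNV calling is usually preceded by quality control that removes strongly biased intervals, based on criteria such as per-interval coverage, overall sample coverage, and extreme GC content.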

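5. Common denoising methods are PCA and SVD (usually the top k components are removed as noise). Applying them can also remove the signal of common CNVs, which is why some tools perform poorly on common CNVs. Some tools do take common CNVs into account. One of the main improvements in CLAMMS, for example, is that it also models the characteristics of common CNVs, and for batch-effect removal it uses metrics produced by Picard rather than the depth files. CODEX2 can detect common CNVs across all samples when no normal controls are available (it still needs a set of samples analyzed together), and it performs well on the data in its paper.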

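6. The most common CNV calling algorithms are HMM and CBS. Newer methods also use machine learning, while other approaches are used less often and are harder to follow. Calls usually span several exons. A few tools reach single-exon resolution, but they are rare and their recall is not great (deletions are comparatively easier to detect).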
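
The GC correction described in item 3 can be sketched in base R as follows. This is a minimal illustration, not the procedure of any particular tool: the function name `gc_normalize`, the 5% GC bin width, and the use of the median as the per-bin summary are all assumptions made for the example.

```r
# Hypothetical helper: ratio-style GC correction of per-window depth.
# depth: raw depth per window; gc: GC fraction (0-1) per window.
gc_normalize <- function(depth, gc, bin_width = 0.05) {
  stopifnot(length(depth) == length(gc), bin_width > 0)
  # Windows whose GC fraction falls into the same bin count as "same GC".
  gc_bin <- floor(gc / bin_width)
  # Median depth of all windows that share a GC bin (assumed summary).
  bin_median <- ave(depth, gc_bin, FUN = median)
  # Normalized depth = raw depth / depth of same-GC windows.
  ratio <- depth / bin_median
  # Bins with zero median depth cannot be corrected.
  ratio[bin_median == 0] <- NA_real_
  ratio
}

# Toy data: six windows, the low-GC windows are covered more deeply.
depth <- c(100, 120, 110, 40, 50, 60)
gc    <- c(0.31, 0.32, 0.33, 0.61, 0.62, 0.63)
norm_depth <- gc_normalize(depth, gc)
print(round(norm_depth, 2))
```

After the correction both GC groups are centered on 1, so the remaining differences between windows no longer reflect the GC-driven difference in coverage.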

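The effect mentioned in item 5, where removing the top components also removes common CNV signal, can be reproduced with a small base R sketch. The helper name `remove_top_components` and the noise-free toy matrix are assumptions for illustration. Real tools work on noisy, normalized depth and choose k in their own ways.

```r
# Hypothetical helper: remove the top k singular components from a
# samples x intervals matrix of normalized depth.
remove_top_components <- function(depth_mat, k = 1) {
  stopifnot(is.matrix(depth_mat), k >= 1, k < min(dim(depth_mat)))
  # Center each interval (column) so components describe variation across samples.
  centered <- sweep(depth_mat, 2, colMeans(depth_mat))
  s <- svd(centered)
  # Part of the signal explained by the first k components.
  top <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
  # The residual is what a depth-based caller would see after denoising.
  centered - top
}

# Toy matrix: 6 samples x 4 intervals with copy ratio 1 everywhere, except
# that half of the samples carry a deletion (ratio 0.5) over intervals 2-3.
depth_mat <- matrix(1, nrow = 6, ncol = 4)
depth_mat[1:3, 2:3] <- 0.5
denoised <- remove_top_components(depth_mat, k = 1)
print(round(denoised, 3))
```

In this toy case the common deletion is itself the dominant component, so removing one component leaves a residual that is essentially zero and the deletion is no longer visible.
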
Paper titles:
1. CopyDetective: Detection threshold-aware copy number variant calling in whole-exome sequencing data.

2. Detection of copy-number variations from NGS data using read depth information: a diagnostic performance evaluation.

3. Copy Number Variation Detection Using Total Variation.

4. A highly sensitive and specific workflow for detecting rare copy-number variants from exome sequencing data.

5. Copy number variation profiling in pharmacogenes using panel-based exome resequencing and correlation to human liver expression.

6. A machine-learning approach for accurate detection of copy number variants from exome sequencing.

7. Atlas-CNV: a validated approach to call single-exon CNVs in the eMERGESeq gene panel.

8. CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing.

9. Clinical analysis of germline copy number variation in DMD using a non-conjugate hierarchical Bayesian model.

10. Preprocessing Sequence Coverage Data for More Precise Detection of Copy Number Variations.

11. Integrative DNA copy number detection and genotyping from sequencing and array-based platforms.

12. WISExome: a within-sample comparison approach to detect copy number variations in whole exome sequencing data.

13. Anaconda: AN automated pipeline for somatic COpy Number variation Detection and Annotation from tumor exome sequencing data.

14. ExCNVSS: A Noise-Robust Method for Copy Number Variation Detection in Whole Exome Sequencing Data.

15. Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN

16. Homozygous and hemizygous CNV detection from exome sequencing data in a Mendelian disease cohort.

17. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing

18. CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data.

19. A Sparse Model Based Detection of Copy Number Variations From Exome Sequencing Data.

20. DeAnnCNV: a tool for online detection and annotation of copy number variations from whole-exome sequencing data.

21. CopywriteR: DNA copy number detection from off-target sequence data.

22. Allele-specific copy-number discovery from whole-genome and whole-exome sequencing.

23. CODEX: a normalization and copy number variation detection method for whole exome sequencing.

24. Combinatorial approach to estimate copy number genotype using whole-exome sequencing data.

25. Assessing copy number from exome sequencing and exome array CGH based on CNV spectrum in a large clinical cohort.

26. Detection of internal exon deletion with exon Del.

27. cnvCapSeq: detecting copy number variation in long-range targeted resequencing data.

28. Inferring copy number and genotype in tumour exome data

29. cnvOffSeq: detecting intergenic copy number variation using off-target exome sequencing data.

30. Identification of copy number variants from exome sequence data.

31. PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data.

32. EXCAVATOR: detecting copy number variants from whole-exome sequencing data

33. CoNVEX: copy number variation estimation in exome sequencing data using HMM.

34. Improving detection of copy-number variation by simultaneous bias correction and read-depth segmentation

35. Modeling read counts for CNV detection in exome sequencing data

36. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth.

37. An exome sequencing pipeline for identifying and genotyping common CNVs associated with disease with application to psoriasis.

38. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling.

39. Copy number variation detection and genotyping from exome sequence data

40. CONTRA: copy number analysis for targeted resequencing

41. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate

42. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing

43. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data

44. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV.

45. CNV-seq, a new method to detect copy number variation using high-throughput sequencing

Reconstructing a tumor's "evolutionary tree" history from exome data of cancer cells

#######################################################
#######################################################
#######                                         #######
#######            CNA and SNA input            #######
#######                                         #######
#######################################################
#######################################################

library(Canopy)
data("MDA231")
projectname = MDA231$projectname ## name of project
R = MDA231$R; R ## mutant allele read depth (for SNAs)

X = MDA231$X; X ## total depth (for SNAs)
WM = MDA231$WM; WM ## observed major copy number (for CNA regions)
Wm = MDA231$Wm; Wm ## observed minor copy number (for CNA regions)
epsilonM = MDA231$epsilonM ## standard deviation of WM, pre-fixed here
epsilonm = MDA231$epsilonm ## standard deviation of Wm, pre-fixed here
## whether CNA regions harbor specific CNAs (only needed for overlapping CNAs)
C = MDA231$C; C
Y = MDA231$Y; Y ## whether SNAs are affected by CNAs

#######################################################
#######################################################
#######                                         #######
#######              MCMC sampling              #######
#######                                         #######
#######################################################
#######################################################

K = 3:6 # number of subclones
numchain = 20 # number of chains with random initiations
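# note: some Canopy releases name the iteration arguments max.simrun / min.simrun
# instead of simrun; check ?canopy.sample for the installed version
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM,
                          epsilonm = epsilonm, C = C, Y = Y, K = K,
                          numchain = numchain, simrun = 100000, writeskip = 200,
                          projectname = projectname, cell.line = TRUE,
                          plot.likelihood = TRUE)
# save the whole workspace so the later steps can be run in a fresh session
save.image(file = paste(projectname, '_postmcmc_image.rda', sep = ''),
           compress = 'xz')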

#######################################################
#######################################################
#######                                         #######
#######  BIC to determine number of subclones   #######
#######                                         #######
#######################################################
#######################################################
library(Canopy)
projectname = 'MDA231'
load(paste(projectname, '_postmcmc_image.rda', sep = ''))
burnin = 100
thin = 10
# If pdf = TRUE, a pdf will be generated.
bic = canopy.BIC(sampchain = sampchain, projectname = projectname, K = K,
                 numchain = numchain, burnin = burnin, thin = thin, pdf = TRUE)
# pick the number of subclones with the largest BIC value
optK = K[which.max(bic)]

#######################################################
#######################################################
#######                                         #######
#######        posterior tree evaluation        #######
#######                                         #######
#######################################################
#######################################################

post = canopy.post(sampchain = sampchain, projectname = projectname, K = K,
                   numchain = numchain, burnin = burnin, thin = thin,
                   optK = optK, C = C, post.config.cutoff = 0.05)
samptreethin = post[[1]] # list of all post-burnin and thinning trees
samptreethin.lik = post[[2]] # likelihoods of trees in samptree
config = post[[3]]
config.summary = post[[4]]
print(config.summary)
# first column: tree configuration
# second column: posterior configuration probability in the entire tree space
# third column: posterior configuration likelihood in the subtree space
# note: if modes of posterior probabilities aren't obvious, run sampling longer.

#######################################################
#######################################################
#######                                         #######
#######          Tree output and plot           #######
#######                                         #######
#######################################################
#######################################################

# choose the configuration with the highest posterior likelihood
config.i = config.summary[which.max(config.summary[, 3]), 1]
cat('Configuration', config.i, 'has the highest posterior likelihood.\n')
output.tree = canopy.output(post, config.i, C)
pdf.name = paste(projectname, '_config_highest_likelihood.pdf', sep = '')
canopy.plottree(output.tree, pdf = TRUE, pdf.name = pdf.name)
# plot the posterior tree for another configuration; the index (1 here) must be
# one of the configurations listed in config.summary, so adjust it as needed
output.tree = canopy.output(post, 1, C)
canopy.plottree(output.tree, pdf = TRUE,
                pdf.name = paste(projectname, '_second_config.pdf', sep = ''))
