2019-04-28samtools

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了2019-04-28samtools相关的知识,希望对你有一定的参考价值。

参考技术A samtools的说明文档: http://samtools.sourceforge.net/samtools.shtml
samtools是一个用于操作sam和bam文件的工具合集。包含有许多命令。以下是常用命令的介绍

view命令的主要功能是:将sam文件转换成bam文件;然后对bam文件进行各种操作,比如数据的排序(不属于本命令的功能)和提取(这些操作 是对bam文件进行的,因而当输入为sam文件的时候,不能进行该操作);最后将排序或提取得到的数据输出为bam或sam(默认的)格式。

bam文件优点:bam文件为二进制文件,占用的磁盘空间比sam文本文件小;利用bam二进制文件的运算速度快。

view命令中,对sam文件头部的输入(-t或-T)和输出(-h)是单独的一些参数来控制的。
默认情况下不加 region,则是输出所有的 region.

提取比对到参考序列上的比对结果

提取paired reads中两条reads都比对到参考序列上的比对结果,只需要把两个4+8的值12作为过滤参数即可

提取没有比对到参考序列上的比对结果

提取bam文件中比对到caffold1上的比对结果,并保存到sam文件格式

提取scaffold1上能比对到30k到100k区域的比对结果

根据fasta文件,将 header 加入到 sam 或 bam 文件中

sort对bam文件进行排序。

将2个或2个以上的已经sort了的bam文件融合成一个bam文件。融合后的文件不需要则是已经sort过了的。

必须对bam文件进行默认情况下的排序后,才能进行index。否则会报错。

建立索引后将产生后缀为.bai的文件,用于快速的随机处理。很多情况下需要有bai文件的存在,特别是显示序列比对情况下。比如samtool的tview命令就需要;gbrowse2显示reads的比对图形的时候也需要。

以下两种命令结果一样

$ samtools faidx genome.fasta

$ samtools faidx genome.fasta scffold_10 > scaffold_10.fasta</pre>

Usage: samtools tview <aln.bam> [ref.fasta]

Usage: samtools flagstat <in.bam>

11945742 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates
7536364 + 0 mapped (63.09%:-nan%)

11945742 + 0 paired in sequencing

5972871 + 0 read1

5972871 + 0 read2

6412042 + 0 properly paired (53.68%:-nan%)

6899708 + 0 with itself and mate mapped

636656 + 0 singletons (5.33%:-nan%)

469868 + 0 with mate mapped to a different chr

243047 + 0 with mate mapped to a different chr (mapQ>=5)</pre>

<>$ samtools reheader <in.header.sam> <in.bam></pre>

http://www.hudsonalpha.org/gsl/information/software/bam2fastq
tar zxf bam2fastq-1.1.0.tgz
make
$ ./bam2fastq <in.bam></pre>

samtools还有个非常重要的命令mpileup,以前为pileup。该命令用于生成bcf文件,再使用bcftools进行SNP和Indel的分析。bcftools是samtool中附带的软件,在samtools的安装文件夹中可以找到。

最常用的参数有2: -f 来输入有索引文件的fasta参考序列; -g 输出到bcf格式。用法和最简单的例子如下

samtools mpileup -gSDf genome.fasta abc.bam > abc.bcf
$ samtools mpileup -guSDf genome.fasta abc.bam |
bcftools view -cvNg - > abc.vcf</pre>

">scaffold_1 2841 A 11 ,,,...,.... BHIGDGIJ?FF
scaffold_1 2842 C 12 , !!!!!!!!
scaffold_1 2849 A 11 c$,...,..... !0000000000
scaffold_1 2850 A 10 ,...,..... 353333333</pre>

-6 Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairs in variant calling.
-B Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.
-b FILE List of input BAM files, one file per line [null]
-C INT Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0]
-d INT At a position, read maximally INT reads per input BAM. [250]
-E Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt specificity a little bit.
-f FILE The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by razip. [null]
-l FILE BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null]
-M INT cap mapping quality at INT [60]
-q INT Minimum mapping quality for an alignment to be used [0]
-Q INT Minimum base quality for a base to be considered [13]
-r STR Only generate pileup in region STR [all sites]

输出参数
-D Output per-sample read depth (require -g/-u)
-g Compute genotype likelihoods and output them in the binary call format (BCF).
-S Output per-sample Phred-scaled strand bias P-value (require -g/-u)
-u Similar to -g except that the output is uncompressed BCF, which is preferred for piping.

Options for Genotype Likelihood Computation (for -g or -u):
-e INT Phred-scaled gap extension sequencing error probability. Reducing INT leads to longer indels. [20]
-h INT Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100]
-I Do not perform INDEL calling
-L INT Skip INDEL calling if the average per-sample depth is above INT. [250]
-o INT Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40]
-P STR Comma dilimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all]</pre>

$ bcftools view -cvNg abc.bcf > snp_indel.vcf</pre>

">Tag Format Description
AF1 double Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele
DP int Raw read depth (without quality filtering)
DP4 int[4] # high-quality reference forward bases, ref reverse, alternate for and alt rev bases
FQ int Consensus quality. Positive: sample genotypes different; negative: otherwise
MQ int Root-Mean-Square mapping quality of covering reads
PC2 int[2] Phred probability of AF in group1 samples being larger (,smaller) than in group2
PCHI2 double Posterior weighted chi^2 P-value between group1 and group2 samples
PV4 double[4] P-value for strand bias, baseQ bias, mapQ bias and tail distance bias
QCHI2 int Phred-scaled PCHI2
RP int # permutations yielding a smaller PCHI2
CLR int Phred log ratio of genotype likelihoods with and without the trio/pair constraint
UGT string Most probable genotype configuration without the trio constraint
CGT string Most probable configuration with the trio constraint</pre>

">Input/Output Options:
-A Retain all possible alternate alleles at variant sites. By default, the view command discards unlikely alleles.
-b Output in the BCF format. The default is VCF.
-D FILE Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null]
-F Indicate PL is generated by r921 or before (ordering is different).
-G Suppress all individual genotype information.
-l FILE List of sites at which information are outputted [all sites]
-N Skip sites where the REF field is not A/C/G/T
-Q Output the QCALL likelihood format
-s FILE List of samples to use. The first column in the input gives the sample names and the second gives the ploidy, which can only be 1 or 2. When the 2nd column is absent, the sample ploidy is assumed to be 2. In the output, the ordering of samples will be identical to the one in FILE. [null]
-S The input is VCF instead of BCF.
-u Uncompressed BCF output (force -b).

Consensus/Variant Calling Options:
-c Call variants using Bayesian inference. This option automatically invokes option -e.
-d FLOAT When -v is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0]

-e Perform max-likelihood inference only, including estimating the site allele frequency, testing Hardy-Weinberg equlibrium and testing associations with LRT.
-g Call per-sample genotypes at variant sites (force -c)
-i FLOAT Ratio of INDEL-to-SNP mutation rate [0.15]
-p FLOAT A site is considered to be a variant if P(ref|D)
-t FLOAT Scaled muttion rate for variant calling [0.001]
-T STR Enable pair/trio calling. For trio calling, option -s is usually needed to be applied to configure the trio members and their ordering. In the file supplied to the option -s, the first sample must be the child, the second the father and the third the mother. The valid values of STR are ‘pair’, ‘trioauto’, ‘trioxd’ and ‘trioxs’, where ‘pair’ calls differences between two input samples, and ‘trioxd’ (‘trioxs’) specifies that the input is from the X chromosome non-PAR regions and the child is a female (male). [null]
-v Output variant sites only (force -c)

Contrast Calling and Association Test Options:
-1 INT Number of group-1 samples. This option is used for dividing the samples into two groups for contrast SNP calling or association test. When this option is in use, the following VCF INFO will be outputted: PC2, PCHI2 and QCHI2. [0]
-U INT Number of permutations for association test (effective only with -1) [0]
-X FLOAT Only perform permutations for P(chi^2)</pre>

<> _ if /DP4=(\d+),(\d+),(\d+),(\d+)/ && ( 4)>=10 && ( 4)/( 2+ 4)>=0.8' snp_indel.vcf > snp_indel.final.vcf</pre>

$ samtools input.sorted.bam output.bam

http://www.chenlianfu.com/?p=1399
http://en.wikipedia.org/wiki/SAMtools
http://www.htslib.org/doc/samtools-1.1.html
http://sourceforge.net/projects/samtools/files/ http://samtools.sourceforge.net/samtools.shtml ** </pre>

(基因分析黑科技)基于基数排序算法实现的samtools sort

        samtools sort的默认内排序算法是来自klib中ksort.c里的ks_mergesort。其实有很大改进提速空间。本人的小伙伴xubl实现了一个基数排序算法,本人在此做了代码整理,既有独立的bamRadisSort,同时也实现了与samtools的无缝整合(可将本人github下的bam_sort.c替换原生samtools的bam_sort.c然后再编译samtools,这样编译出来的samtools sort就是基于基数排序),编译后相比samtools 至少提速30%以上,结果无差别。


        代码已开源在github上,有兴趣的可下载编译后测试下,测试的同学注意下内存分配问题,原则上内存足够的话,内存排序速度是要远大于外排序的。



https://github.com/xiongxu/bamRadixSort


        

以上是关于2019-04-28samtools的主要内容,如果未能解决你的问题,请参考以下文章

[转帖]深入理解 MySQL—锁事务与并发控制

hdu2222 (AC自动机模板)