StringTie Usage Explained

Posted 2021-02-03 emanlee


StringTie

Reference links:

https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#input

https://www.cnblogs.com/adawong/articles/7977314.html

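Parameter overview

Basic usage of StringTie: stringtie <aligned_reads.bam> [options]*

Here aligned_reads.bam is the input file, which must be sorted by genomic position. HISAT2 output has to be converted into a sorted BAM file with samtools sort before it can be used as input.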
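
The two steps just described can be sketched in Python as follows. The helper only builds the command lines and does not run them. The function names and the file names (sample.sam, sample.sorted.bam, sample.gtf) are hypothetical; the StringTie flags used (-o, -p, -G) are the ones listed in this article, and -o is the output flag of samtools sort.

```python
import shlex


def sort_command(hisat2_out, sorted_bam):
    """Command line that coordinate-sorts an alignment file with samtools."""
    return ["samtools", "sort", "-o", sorted_bam, hisat2_out]


def stringtie_command(sorted_bam, out_gtf, threads=1, guide=None):
    """Basic call of the form: stringtie <aligned_reads.bam> [options]*"""
    cmd = ["stringtie", sorted_bam, "-o", out_gtf, "-p", str(threads)]
    if guide is not None:
        # Optional reference annotation (GTF/GFF3) to guide the assembly.
        cmd += ["-G", guide]
    return cmd


if __name__ == "__main__":
    steps = (
        sort_command("sample.sam", "sample.sorted.bam"),
        stringtie_command("sample.sorted.bam", "sample.gtf",
                          threads=4, guide="ref_ann.gff"),
    )
    for cmd in steps:
        # Print a shell-safe version of each command for inspection.
        print(shlex.join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) once samtools and StringTie are installed; it is left as a dry run here so the sketch has no external dependencies.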

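Other optional parameters:

-h/--help   Print help information.

-v  Turn on verbose mode and print details of what the program is doing.

-o [<path/>]<out.gtf> Set the path and file name of the output GTF file for the transcripts assembled by StringTie. A full path can be given here, in which case directories are created as needed. By default, StringTie writes the GTF to standard output.

-p <int>    Number of threads (CPUs) used for transcript assembly. Default: 1

-G <ref_ann.gff>    Use a reference annotation file (GTF/GFF3 format) to guide the assembly. The output contains both the known transcripts that are expressed and novel transcripts. Options -B, -b, -e and -C require this option (details below).

--rf    Stranded library type fr-firststrand (most commonly dUTP-based protocols; others include NSR and NNSR).

--fr    Stranded library type fr-secondstrand (e.g. Ligation, Standard SOLiD).

-l <label>  Use <label> as the prefix for the names of the output transcripts. Default: STRG

-f <0.0-1.0>    Set the minimum isoform abundance of predicted transcripts as a fraction of the most abundant transcript assembled at a given locus. Lower-abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts. Default: 0.1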
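
The thresholding rule behind -f can be illustrated with a small Python sketch. It only shows the idea of a cutoff relative to the most abundant isoform at a locus; StringTie's actual filtering is more involved. The function name, the abundance values, and the choice of an inclusive comparison are assumptions made for illustration.

```python
def filter_isoforms(abundances, min_fraction=0.1):
    """Keep isoforms whose abundance is at least min_fraction of the locus maximum.

    abundances maps a transcript name to its abundance at one locus.
    """
    if not abundances:
        return {}
    top = max(abundances.values())
    return {
        name: value
        for name, value in abundances.items()
        if value >= min_fraction * top  # inclusive cutoff is an assumption
    }


# Made-up abundances for three isoforms at one locus.
locus = {"STRG.1.1": 40.0, "STRG.1.2": 6.0, "STRG.1.3": 2.5}
kept = filter_isoforms(locus)
print(sorted(kept))  # ['STRG.1.1', 'STRG.1.2']
```

With the default of 0.1 the cutoff at this locus is 4.0, so the isoform at 2.5 is dropped while the one at 6.0 is kept.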

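-m <int>    Minimum length allowed for predicted transcripts. Default: 200

-A <gene_abund.tab> Output file for gene abundances (tab-delimited format).

-C <cov_refs.gtf>   Output file containing all transcripts from the reference annotation file that are fully covered by reads. (Requires -G.)

-a <int>    Junctions that don't have spliced reads that align across them with at least this number of bases on both sides are filtered out. Default: 10

-j <float>  Junction coverage: at least this many spliced reads must align across a junction. The number can be fractional, because some reads align in more than one place. A read that aligns in n places contributes 1/n to the junction coverage. Default: 1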
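
The fractional counting described for -j can be worked through in a few lines of Python. This is a sketch of the arithmetic only; the function name and the read data are hypothetical.

```python
from fractions import Fraction


def junction_coverage(hit_counts):
    """Sum 1/n over the spliced reads that cross one junction.

    hit_counts holds, for each read, the number of places n it aligns to.
    """
    return sum(Fraction(1, n) for n in hit_counts)


# Two uniquely mapped reads, one read mapped in 2 places, one in 4 places.
reads = [1, 1, 2, 4]
coverage = junction_coverage(reads)
print(float(coverage))   # 2.75
print(coverage >= 1)     # True: the junction meets the default threshold of 1
```

Using Fraction keeps the sum exact, which avoids floating-point surprises when the total sits right at the threshold.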

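-t  Disable trimming at the ends of the assembled transcripts. By default, StringTie adjusts the start and/or stop coordinates of a predicted transcript based on sudden drops in coverage of the assembled transcript.

-c <float>  Minimum read coverage allowed for predicted transcripts. A transcript with coverage below this threshold is not included in the output. Default: 2.5

-g <int>    Minimum gap. Reads that are mapped closer than this distance are merged together in the same processing bundle. Default: 50 (bp)

-B  Enable output of the Ballgown input table files (*.ctab), which contain coverage data for the reference transcripts given with the -G option. (See the Ballgown documentation for a description of these files.)
    If the -o option gives a full path for the output transcript file, the *.ctab files are written to the same directory as the output GTF file.
    
-b <path>   Write the *.ctab files to the given path instead of the directory specified by the -o option.
    Note: using -e together with -B/-b is recommended, unless novel transcripts are still wanted in the StringTie GTF output.
    
-e  Limit the processing of read alignments to estimating and outputting only the assembled transcripts that match the reference transcripts given with the -G option. With this option, assembled transcripts that do not match a reference transcript are skipped, which greatly speeds up processing.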
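
How -e, -B and -b fit together can be sketched as below. The helper builds the command line without running it and works out where the *.ctab files would be expected. It assumes, following the StringTie manual, that -b enables *.ctab output in the same way as -B; the function names and paths are hypothetical.

```python
from pathlib import PurePosixPath


def ballgown_command(bam, guide, out_gtf, ctab_path=None):
    """Build a StringTie call that only estimates reference transcripts (-e)."""
    cmd = ["stringtie", bam, "-e", "-G", guide, "-o", out_gtf]
    if ctab_path is None:
        cmd.append("-B")            # *.ctab files go next to the output GTF
    else:
        cmd += ["-b", ctab_path]    # *.ctab files go to the given path
    return cmd


def ctab_directory(out_gtf, ctab_path=None):
    """Directory in which the *.ctab files are expected to appear."""
    if ctab_path is not None:
        return str(PurePosixPath(ctab_path))
    return str(PurePosixPath(out_gtf).parent)


print(ballgown_command("s1.sorted.bam", "ref_ann.gff", "ballgown/s1/s1.gtf"))
print(ctab_directory("ballgown/s1/s1.gtf"))            # ballgown/s1
print(ctab_directory("ballgown/s1/s1.gtf", "tables"))  # tables
```

Giving each sample its own output directory keeps the *.ctab files of different samples from overwriting one another.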

-M <0.0-1.0>    Set the maximum fraction of multi-location-mapped reads allowed at a given locus. Default: 0.95
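-x <seqid_list> Ignore all read alignments on the specified reference sequences, so that no transcripts are assembled from them. <seqid_list> can be a single reference sequence name (e.g. -x chrM) or a comma-separated list of sequence names (e.g. -x 'chrM,chrX,chrY'). This can speed up StringTie, especially when the mitochondrial genome is excluded: its genes can have very high coverage in some cases, yet they may be of no interest for a particular RNA-Seq analysis.

--merge Transcript merge mode. In this mode, StringTie takes a list of GTF/GFF files from all samples as input and merges/assembles their transcripts into a non-redundant set of transcripts. This mode is used in the new differential analysis pipeline to generate a global, unified set of transcripts across multiple RNA-Seq samples.
    If the -G option (reference annotation file) is given, StringTie assembles the transfrags from the input GTF files together with the reference transcripts. (Author's note: "transfrags" probably refers to transcript fragments that are stitched into larger transcripts.)

The following additional options can be used in this mode:
-G <guide_gff>  Reference annotation file (GTF/GFF3)
-o <out_gtf>    Path and name of the merged output GTF file (default: standard output)
-m <min_len>    Minimum input transcript length to include in the merge (default: 50)
-c <min_cov>    Minimum input transcript coverage to include in the merge (default: 0)
-F <min_fpkm>   Minimum input transcript FPKM to include in the merge (default: 0)
-T <min_tpm>    Minimum input transcript TPM to include in the merge (default: 0)
-f <min_iso>    Minimum isoform fraction (default: 0.01)
-i  Keep merged transcripts with retained introns (default: these are not kept unless there is strong evidence for them)
-l <label>  Name prefix for output transcripts (default: MSTRG)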
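
The input filters of merge mode (-m, -c, -F, -T) can be pictured as a simple predicate on each input transcript. The Python sketch below is an illustration under stated assumptions, not StringTie's implementation: the InputTranscript class, the example values, and the inclusive comparisons are all hypothetical, while the default thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class InputTranscript:
    transcript_id: str
    length: int    # transcript length in bases
    cov: float     # per-base coverage
    fpkm: float
    tpm: float


def passes_merge_filters(t, min_len=50, min_cov=0.0, min_fpkm=0.0, min_tpm=0.0):
    """True if the transcript clears every merge-mode input threshold."""
    return (
        t.length >= min_len
        and t.cov >= min_cov
        and t.fpkm >= min_fpkm
        and t.tpm >= min_tpm
    )


# Made-up input transcripts from two samples.
inputs = [
    InputTranscript("STRG.1.1", 1200, 8.0, 3.1, 5.4),
    InputTranscript("STRG.2.1", 40, 2.0, 0.9, 1.2),    # shorter than 50
    InputTranscript("STRG.3.1", 900, 0.4, 0.2, 0.3),
]
kept = [t.transcript_id for t in inputs if passes_merge_filters(t, min_tpm=1.0)]
print(kept)  # ['STRG.1.1']
```

With the defaults, only the length filter removes anything here; raising min_tpm to 1.0 also drops the low-expression transcript.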

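Input files

Here aligned_reads.bam is the input file, which must be sorted by genomic position. TopHat's output file accepted_hits.bam can be used as input directly, whereas HISAT2 output must first be converted into a sorted BAM file with samtools sort.

Every spliced read alignment in the input BAM file (i.e. an alignment that spans at least one junction) must carry the XS tag, which indicates the genomic strand that produced the RNA from which the read was sequenced. Alignments produced by TopHat and by HISAT2 (run with --dta, which tailors the alignments for downstream transcript assemblers) already include the XS tag. Other read mappers may not add it, so check for the tag before moving on to the next step.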
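
One way to perform that check is to scan SAM-format text for spliced alignments without the tag. The sketch below treats an alignment as spliced when its CIGAR string contains an N operation and looks for an optional field starting with XS:A:. The function name and the three example records are made up for illustration.

```python
def spliced_without_xs(sam_lines):
    """Return the names of spliced alignments that lack an XS:A: tag."""
    missing = []
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:            # not a complete alignment record
            continue
        cigar = fields[5]
        if "N" not in cigar:            # no skipped region, so not spliced
            continue
        optional_tags = fields[11:]
        if not any(tag.startswith("XS:A:") for tag in optional_tags):
            missing.append(fields[0])
    return missing


seq, qual = "A" * 50, "I" * 50
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    f"r1\t0\tchr1\t100\t60\t20M500N30M\t*\t0\t0\t{seq}\t{qual}\tXS:A:+\tNH:i:1",
    f"r2\t0\tchr1\t900\t60\t25M300N25M\t*\t0\t0\t{seq}\t{qual}\tNH:i:1",
    f"r3\t0\tchr1\t2000\t60\t50M\t*\t0\t0\t{seq}\t{qual}\tNH:i:1",
]
print(spliced_without_xs(example))  # ['r2']
```

On real data the lines could come from the text output of samtools view; an empty result suggests the spliced alignments carry the strand information StringTie expects.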

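Note: always run HISAT2 with the --dta option, otherwise the results will suffer.

Optionally, a reference annotation file in GTF/GFF3 format can be provided to StringTie. In that case, StringTie prefers to use the "known" genes from the annotation file, and for those that are expressed it computes coverage, TPM and FPKM values. It also produces additional transcripts that are not present in the annotation file. Note that if option -e is not used, a reference transcript must be fully covered by reads to be included in StringTie's output. In that case, other transcripts that StringTie assembles from the data and that are not in the annotation file are output as well.

Note: if you are analyzing a well-annotated genome, such as human, mouse or another model organism, providing an annotation file is strongly recommended.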

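Output files

The main output files are:

1. GTF file: the assembled transcripts

2. Tab file: gene abundances

3. GTF file: the transcripts from the reference annotation file that are fully covered by reads

4. *.ctab files: input files for downstream differential expression analysis with Ballgown

5. GTF file: the merged GTF file produced in merge mode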

GTF file: the assembled transcripts

seqname: Chromosome, contig, or scaffold.
source: The source of the GTF file.
feature: Feature type, e.g. exon, transcript, mRNA, 5'UTR.
start: Start position, using a 1-based index.
end: End position, using a 1-based index.
score: Confidence score for the assembled transcript. This field is currently not used, and StringTie reports a constant value of 1000 if the transcript has a connection to a read alignment bundle.
strand: '+' for the forward strand, '-' for the reverse strand.
frame: Frame or phase of CDS features. StringTie does not use this field and simply records a ".".
attributes:
- gene_id: A unique identifier for a single gene and its child transcripts and exons, based on the alignments' file name.
- transcript_id: A unique identifier for a single transcript and its child exons, based on the alignments' file name.
- exon_number: A unique identifier for a single exon, starting from 1, within a given transcript.
- reference_id: The transcript_id in the reference annotation (optional) that the instance matched.
- ref_gene_id: The gene_id in the reference annotation (optional) that the instance matched.
- ref_gene_name: The gene_name in the reference annotation (optional) that the instance matched.
- cov: The average per-base coverage for the transcript or exon.
- FPKM: Fragments per kilobase of transcript per million read pairs. This is the number of pairs of reads aligning to this feature, normalized by the total number of fragments sequenced (in millions) and the length of the transcript (in kilobases).
- TPM: Transcripts per million. This is the number of transcripts from this particular gene normalized first by gene length, and then by sequencing depth (in millions) in the sample. TPM was defined by B. Li and C. Dewey.
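
The nine-column layout above can be read with a short Python sketch. The parser assumes tab-separated columns and attributes written as key "value"; pairs, which is the usual GTF convention. The function name and the example line (including its cov, FPKM and TPM values) are made up for illustration.

```python
import re

# Matches one attribute of the form: key "value"
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')

COLUMNS = ["seqname", "source", "feature", "start", "end",
           "score", "strand", "frame"]


def parse_gtf_line(line):
    """Split one GTF line into its eight fixed columns plus an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields[:8]))
    record["start"] = int(record["start"])   # 1-based coordinates
    record["end"] = int(record["end"])
    record["attributes"] = dict(ATTRIBUTE_RE.findall(fields[8]))
    return record


example = (
    "chr1\tStringTie\ttranscript\t11869\t14409\t1000\t+\t.\t"
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; '
    'cov "3.2"; FPKM "1.5"; TPM "2.7";'
)
rec = parse_gtf_line(example)
print(rec["feature"], rec["start"], rec["end"])
print(rec["attributes"]["transcript_id"], float(rec["attributes"]["TPM"]))
```

Attribute values stay as strings in the dictionary, so numeric ones such as cov, FPKM and TPM need an explicit float() conversion where they are used.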

Tab file: gene abundances

If StringTie is run with the -A <gene_abund.tab> option, it returns a file containing gene abundances.

Column 1 / Gene ID: The gene identifier comes from the reference annotation provided with the -G option. If no reference is provided this field is replaced with the name prefix for output transcripts (-l).
Column 2 / Gene Name: This field contains the gene name in the reference annotation provided with the -G option. If no reference is provided this field is populated with '-'.
Column 3 / Reference: Name of the reference sequence that was used in the alignment of the reads. Equivalent to the 3rd column in the .SAM alignment.
Column 4 / Strand: '+' denotes that the gene is on the forward strand, '-' the reverse strand.
Column 5 / Start: Start position of the gene (1-based index).
Column 6 / End: End position of the gene (1-based index).
Column 7 / Coverage: Per-base coverage of the gene.
Column 8 / FPKM: Normalized expression level in FPKM units (see previous section).
Column 9 / TPM: Normalized expression level in TPM units (see previous section).
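
A table with these nine columns can be loaded with Python's csv module, as in the sketch below. It assumes the file starts with a header row whose names match the column titles listed above; the function name and the two sample rows are made up for illustration.

```python
import csv
import io

# Made-up content in the layout described above (tab-delimited, with header).
SAMPLE = (
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n"
    "STRG.1\t-\tchr1\t+\t100\t900\t12.5\t3.0\t6.0\n"
    "STRG.2\t-\tchr1\t-\t2000\t3500\t40.0\t9.0\t18.0\n"
)


def load_gene_abundance(handle):
    """Read a gene abundance table and convert the numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("Start", "End"):
            row[key] = int(row[key])
        for key in ("Coverage", "FPKM", "TPM"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


genes = load_gene_abundance(io.StringIO(SAMPLE))
top = max(genes, key=lambda gene: gene["TPM"])
print(top["Gene ID"], top["TPM"])  # STRG.2 18.0
```

For a real file, io.StringIO(SAMPLE) would be replaced by an open file handle, for example open("gene_abund.tab", newline="").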