htseq-count
Posted 一周一paper,一周一技术
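HTSeq is a Python package for processing high-throughput sequencing data.
HTSeq provides many utility classes, so users who are comfortable with Python can write their own data-processing scripts.
HTSeq also ships two ready-made scripts that process data directly: htseq-qa (data quality assessment) and htseq-count (read counting).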
Usage: htseq-count [options] <alignment_file> <gff_file>
<alignment_file> :
Contains the aligned reads in SAM format.
Make sure to use a splicing-aware aligner such as TopHat.
To read from standard input, use - as <alignment_file>.
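The usage line can be sketched as a small Python helper that only assembles the argument list and does not run anything. The helper name `build_htseq_count_command` and the file names `sample.bam` and `genes.gtf` are assumptions made for illustration; the flags are the ones listed in the options of this post.

```python
import shlex


def build_htseq_count_command(alignment_file, gff_file, fmt="sam",
                              order="name", stranded="yes", mode="union"):
    """Assemble an htseq-count argument list (hypothetical helper, runs nothing)."""
    return [
        "htseq-count",
        "-f", fmt,        # input format: sam or bam
        "-r", order,      # sort order of paired-end data: name or pos
        "-s", stranded,   # yes, no or reverse
        "-m", mode,       # overlap resolution mode
        alignment_file,   # use "-" to read from standard input
        gff_file,
    ]


# Hypothetical file names, unstranded library, BAM input.
cmd = build_htseq_count_command("sample.bam", "genes.gtf", fmt="bam", stranded="no")
print(shlex.join(cmd))
# htseq-count -f bam -r name -s no -m union sample.bam genes.gtf
```

Passing such a list to `subprocess.run` would launch the tool, provided htseq-count is installed and on the PATH.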
[options] :
- -f <format>, --format=<format>
Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam.
- -r <order>, --order=<order>
For paired-end data, the alignments have to be sorted either by read name or by alignment position. If your data is not sorted, use the sort function of samtools to sort it. Use this option, with name or pos for <order>, to indicate how the input data has been sorted. The default is name.
If name is indicated, htseq-count expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment has not yet been seen are kept in a buffer in memory until the mate is found. Strictly speaking, the latter also works with unsorted data, but sorting ensures that most alignment mates appear close to each other in the data, so the buffer is much less likely to overflow.
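The buffering idea described for pos can be illustrated with a toy sketch. This is not HTSeq's implementation: the function `pair_mates` and the record layout (read name, alignment label) are assumptions, and multiply-aligned reads are ignored. It only shows why mates that lie far apart in the input inflate the buffer.

```python
def pair_mates(records):
    """Pair up mates from (read_name, alignment) tuples.

    An alignment whose mate has not been seen yet waits in a dict keyed by
    read name. Returns the pairs and the peak number of waiting alignments.
    """
    waiting = {}
    pairs = []
    peak = 0
    for name, alignment in records:
        if name in waiting:
            # Mate found: release the buffered alignment.
            pairs.append((name, waiting.pop(name), alignment))
        else:
            waiting[name] = alignment
            peak = max(peak, len(waiting))
    return pairs, peak


# Mates adjacent (as in name-sorted input) versus mates far apart.
adjacent = [("r1", "a"), ("r1", "b"), ("r2", "c"), ("r2", "d")]
scattered = [("r1", "a"), ("r2", "c"), ("r3", "e"),
             ("r1", "b"), ("r2", "d"), ("r3", "f")]

print(pair_mates(adjacent)[1])   # 1
print(pair_mates(scattered)[1])  # 3
```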
- -s <yes/no/reverse>, --stranded=<yes/no/reverse>
Whether the data is from a strand-specific assay (default: yes).
For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.
If your RNA-Seq data was not made with a strand-specific protocol, the default setting (yes) causes about half of the reads to be lost. Hence, make sure to set the option --stranded=no unless you have strand-specific data!
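The strand rules above can be written down as a small predicate. This is a sketch of the rules as stated in the text, not code taken from HTSeq; the function name `strand_matches` and its parameters are assumptions.

```python
def flip(strand):
    """Return the opposite strand symbol."""
    return "-" if strand == "+" else "+"


def strand_matches(read_strand, feature_strand, stranded="yes", is_second_mate=False):
    """Decide whether a read's strand is compatible with a feature's strand."""
    if stranded not in ("yes", "no", "reverse"):
        raise ValueError("stranded must be yes, no or reverse")
    if stranded == "no":
        return True
    # The second mate of a pair is expected on the opposite strand.
    effective = flip(read_strand) if is_second_mate else read_strand
    # stranded=reverse inverts the rule for every read.
    if stranded == "reverse":
        effective = flip(effective)
    return effective == feature_strand


# Unstranded single-end library: reads fall on both strands of a "+" feature.
reads = ["+", "-", "+", "-"]
for setting in ("yes", "no", "reverse"):
    kept = sum(strand_matches(r, "+", stranded=setting) for r in reads)
    print(setting, kept)
# yes 2
# no 4
# reverse 2
```

With stranded=yes only half of these toy reads are kept, which is the loss the text warns about.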
- -a <minaqual>, --minaqual=<minaqual>
Skip all reads with alignment quality lower than the given minimum value (default: 10; the default was 0 until version 0.5.4).
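As an illustration of this threshold, the sketch below reads the mapping quality from the fifth column of a SAM record (MAPQ) and applies the same cut-off. The function name `passes_minaqual` and the example records are assumptions; htseq-count performs this check internally.

```python
def passes_minaqual(sam_line, minaqual=10):
    """Return True if the record's MAPQ (5th SAM column) is at least minaqual."""
    fields = sam_line.rstrip("\n").split("\t")
    mapq = int(fields[4])
    return mapq >= minaqual


def make_record(name, mapq):
    """Build a minimal, made-up SAM record with the 11 mandatory columns."""
    return "\t".join([name, "0", "chr1", "100", str(mapq),
                      "4M", "*", "0", "0", "ACGT", "IIII"])


records = [make_record("r1", 50), make_record("r2", 3), make_record("r3", 10)]
kept = [line.split("\t")[0] for line in records if passes_minaqual(line)]
print(kept)  # ['r1', 'r3']
```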
- -t <feature type>, --type=<feature type>
Feature type (3rd column in the GFF file) to be used; all features of other types are ignored. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is exon.
- -i <id attribute>, --idattr=<id attribute>
GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identify the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is gene_id.
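How the feature type and the ID attribute work together can be sketched with a minimal GTF reader. This is not HTSeq's parser: the function names and the example lines are assumptions, and the attribute parsing is simplified (it would break on values that contain semicolons).

```python
from collections import defaultdict


def parse_gtf_attributes(attribute_field):
    """Turn 'gene_id "G1"; transcript_id "T1";' into a dict (simplified)."""
    attrs = {}
    for part in attribute_field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def group_features(gtf_lines, feature_type="exon", id_attribute="gene_id"):
    """Keep lines of one feature type and group their intervals by feature ID."""
    groups = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[2] != feature_type:  # 3rd column is the feature type
            continue
        feature_id = parse_gtf_attributes(cols[8])[id_attribute]
        groups[feature_id].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return dict(groups)


# Made-up annotation lines in GTF layout.
gtf = [
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1";',
    '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tsrc\texon\t1000\t1100\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
]
features = group_features(gtf)
print(sorted(features))     # ['G1', 'G2']
print(len(features["G1"]))  # 2
```

The two exon lines that share gene_id G1 end up as parts of one feature, while the gene line is ignored because its type is not exon.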
- -m <mode>, --mode=<mode>
Mode to handle reads overlapping more than one feature. Possible values for <mode> are union, intersection-strict and intersection-nonempty (default: union).
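The three modes can be sketched with plain sets, following the definition in the HTSeq documentation: for each position of a read, take the set of features overlapping that position, then combine those sets. The function `assign_read` and the labels `no_feature` and `ambiguous` are illustrative assumptions (the exact special-counter names vary between HTSeq versions).

```python
def assign_read(position_sets, mode="union"):
    """Resolve a read to one feature given the feature set at each position."""
    if mode == "union":
        features = set().union(*position_sets)
    elif mode == "intersection-strict":
        features = set.intersection(*position_sets) if position_sets else set()
    elif mode == "intersection-nonempty":
        nonempty = [s for s in position_sets if s]
        features = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError("unknown mode: " + mode)
    if not features:
        return "no_feature"
    if len(features) > 1:
        return "ambiguous"
    return next(iter(features))


# A read that hangs over the edge of gene A into unannotated sequence.
overhang = [{"A"}, {"A"}, set()]
# A read inside gene A that partly overlaps gene B as well.
shared = [{"A"}, {"A", "B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(overhang, mode), assign_read(shared, mode))
# union A ambiguous
# intersection-strict no_feature A
# intersection-nonempty A A
```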
- -o <samout>, --samout=<samout>
Write out all SAM alignment records into an output SAM file called <samout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag 'XF').
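A file written this way can be summarised by reading the XF field back. The sketch below assumes the tag is stored as a string-typed optional field (`XF:Z:<value>`) after the 11 mandatory SAM columns; the function name `tally_xf` and the example records are made up for illustration.

```python
from collections import Counter


def tally_xf(sam_lines):
    """Count how many records carry each XF assignment."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        for field in fields[11:]:  # optional fields start at column 12
            if field.startswith("XF:Z:"):
                counts[field[len("XF:Z:"):]] += 1
                break
    return counts


mandatory = ["0", "chr1", "100", "50", "4M", "*", "0", "0", "ACGT", "IIII"]
annotated = [
    "@HD\tVN:1.0",
    "\t".join(["r1"] + mandatory + ["NH:i:1", "XF:Z:G1"]),
    "\t".join(["r2"] + mandatory + ["XF:Z:G1"]),
    "\t".join(["r3"] + mandatory + ["XF:Z:G2"]),
]
print(tally_xf(annotated).most_common())  # [('G1', 2), ('G2', 1)]
```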
- -q, --quiet
Suppress progress report and warnings.