Introduction to genome annotation file formats (GTF/GFF)

Posted 2023-05-13


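The format lineage is GFF2 -> GTF -> GFF3. According to the Ensembl documentation, GTF (General Transfer Format, also commonly expanded as Gene Transfer Format) is identical to GFF version 2.

In other words, GTF is essentially GFF version 2. The practical difference lies in column 9, where GTF uses stricter attribute conventions such as gene_id and transcript_id.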

Each line consists of nine tab-separated columns (each number below is one column):

1. seqname - (chromosome name) Name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.

2. source - (which software produced it) Name of the program that generated this feature, or the data source (database or project name).

3. feature - (transcript, exon, intron, etc.) Feature type name, e.g. Gene, Variation, Similarity.

4. start - (start position) Start position of the feature, with sequence numbering starting at 1.

5. end - (end position) End position of the feature, with sequence numbering starting at 1. Both start and end are inclusive.

6. score - A floating point value, or '.' when no score is available.

7. strand - (forward or reverse strand) Defined as + (forward) or - (reverse).

8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on. GFF3 calls this column phase.

9. attribute - (properties, such as the encoded protein) A semicolon-separated list of tag-value pairs, providing additional information about each feature.

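Example (each `------>` stands for a tab; the leading seqname column is omitted here, so only columns 2 to 9 are shown):

transcribed_pseudogene ------> gene ------> 11869 ------> 14409 ------> . ------> + ------> . ------> gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";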
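
As a rough illustration of the nine-column layout, the following minimal Python sketch splits one GTF line on tabs and parses the attribute column into a dictionary. The names `GtfRecord`, `parse_attributes` and `parse_gtf_line` are hypothetical helpers written for this example, and the seqname value "1" is an assumption, because the example line in the text does not show column 1. The attribute parser is deliberately simple and would mis-split a quoted value that itself contains a semicolon.

```python
from typing import NamedTuple


class GtfRecord(NamedTuple):
    """One GTF line, following the nine columns described in the text."""
    seqname: str
    source: str
    feature: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    score: str      # kept as text because it may be "."
    strand: str
    frame: str      # "0", "1", "2" or "."
    attributes: dict


def parse_attributes(text: str) -> dict:
    """Parse GTF-style attributes of the form: tag "value"; tag "value";"""
    attributes = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing semicolon
        tag, _, value = item.partition(" ")
        attributes[tag] = value.strip().strip('"')
    return attributes


def parse_gtf_line(line: str) -> GtfRecord:
    """Split a single GTF line on tabs and convert the coordinates to int."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return GtfRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, parse_attributes(attrs))


line = "\t".join([
    "1",  # ASSUMPTION: seqname is not shown in the example from the text
    "transcribed_pseudogene", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; '
    'gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";',
])
record = parse_gtf_line(line)
print(record.feature, record.start, record.end, record.attributes["gene_name"])
```

Keeping score and frame as strings avoids having to special-case the '.' placeholder; a stricter parser could convert them to optional numeric types instead.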

References:

https://www.biostars.org/p/99462/

http://www.ensembl.org/info/website/upload/gff.html

scikit-bio: extract genomic features from a gff3 file

[Title]: scikit-bio extract genomic features from gff3 file [Posted]: 2016-07-11 07:59:43 [Question]:

Can scikit-bio extract, from a genome fasta file, the genomic features described in a gff3 file?

Example:

genome.fasta

>sequence1
ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA

annotation.gff3

#gff-version 3
sequence1   source  gene    1   78  .   +   .   ID=gene1
sequence1   source  mRNA    1   78  .   +   .   ID=transcript1;parent=gene1
sequence1   source  CDS 1   6   .   +   0   ID=CDS1;parent=transcript1
sequence1   source  CDS 73  78  .   +   0   ID=CDS2;parent=transcript1

The desired sequence for the mRNA feature (transcript1) is the concatenation of its two child CDS features. In this case, that is 'ATGGAGCTATGA'.
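
To make the expected result concrete, here is a minimal sketch in plain Python (it does not use scikit-bio) that concatenates the CDS children of a transcript. The helpers `parse_gff3`, `reverse_complement` and `spliced_cds` are hypothetical names written for this illustration. The sketch assumes tab-separated columns and lowercases attribute keys so that both the `parent` key used in the question and the `Parent` key defined by the GFF3 specification are matched.

```python
GENOME = {
    "sequence1": "ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA",
}

# The annotation from the question, written with explicit tab separators.
GFF3 = (
    "#gff-version 3\n"
    "sequence1\tsource\tgene\t1\t78\t.\t+\t.\tID=gene1\n"
    "sequence1\tsource\tmRNA\t1\t78\t.\t+\t.\tID=transcript1;parent=gene1\n"
    "sequence1\tsource\tCDS\t1\t6\t.\t+\t0\tID=CDS1;parent=transcript1\n"
    "sequence1\tsource\tCDS\t73\t78\t.\t+\t0\tID=CDS2;parent=transcript1\n"
)


def parse_gff3(text):
    """Return a list of feature dicts; comment and blank lines are skipped."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")
        attributes = {}
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key.lower()] = value  # accept "parent" and "Parent"
        features.append({"seqid": seqid, "type": ftype, "start": int(start),
                         "end": int(end), "strand": strand, "attributes": attributes})
    return features


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def spliced_cds(genome, features, transcript_id):
    """Concatenate all CDS features whose parent is the given transcript."""
    parts = [f for f in features
             if f["type"] == "CDS"
             and transcript_id in f["attributes"].get("parent", "").split(",")]
    if not parts:
        raise ValueError(f"no CDS features found for {transcript_id!r}")
    parts.sort(key=lambda f: f["start"])
    # GFF3 coordinates are 1-based and inclusive, hence start - 1 for slicing.
    seq = "".join(genome[f["seqid"]][f["start"] - 1:f["end"]] for f in parts)
    if parts[0]["strand"] == "-":
        seq = reverse_complement(seq)
    return seq


result = spliced_cds(GENOME, parse_gff3(GFF3), "transcript1")
print(result)  # ATGGAGCTATGA
```

For minus-strand transcripts the sketch joins the pieces in ascending genomic order and then reverse-complements the whole string, which is equivalent to reverse-complementing each piece and joining them in transcript order.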

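[Comments]:

As of scikit-bio 0.5.0, reading gff3 files is not supported. If this is a feature you would like to see added to the project, please consider filing a feature request on the issue tracker: github.com/biocore/scikit-bio/issues

[Answer 1]: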

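This feature has since been added to scikit-bio, but the version available in bioconda did not yet support it (as of 2017-12-15). The format file for gff3 exists in the Github repository.

You can clone the repo and install it locally as follows: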

$ git clone https://github.com/biocore/scikit-bio.git
$ cd scikit-bio
$ python setup.py install

Following the example given in that file, the code below should work:

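import io
from skbio.metadata import IntervalMetadata
from skbio.io import read

# Load the annotation file from the question into an in-memory text buffer.
# The "with" block closes the file handle once the content has been read.
with open("annotation.gff3", "r") as handle:
    gff = io.StringIO(handle.read())

# Keep only the features annotated on the sequence named "sequence1".
im = read(gff, format='gff3', into=IntervalMetadata, seq_id="sequence1")

print(im)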

For me, this raises a FormatIdentificationWarning, but the entries are reported correctly:

4 interval features
-------------------
Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'gene', 'score': '.', 'strand': '+', 'ID': 'gene1'})
Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'mRNA', 'score': '.', 'strand': '+', 'ID': 'transcript1', 'parent': 'gene1'})
Interval(interval_metadata=<140154121000104>, bounds=[(0, 6)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS1', 'parent': 'transcript1'})
Interval(interval_metadata=<140154121000104>, bounds=[(72, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS2', 'parent': 'transcript1'})

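In the example in the scikit-bio source code, the GFF3 and FASTA data are concatenated in the input string passed to the read function. Perhaps that approach would solve this problem. Also, I am not 100% sure how to use the returned intervals to extract the features.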
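
One possible way to use such intervals, sketched here in plain Python rather than with scikit-bio objects: the `bounds` values printed above are 0-based and half-open, so they map directly onto Python slices. The list `cds_bounds` below is copied by hand from the two CDS entries in the printed output, and `slice_bounds` is a hypothetical helper written for this example, not a scikit-bio function.

```python
sequence = "ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA"

# Bounds of the two CDS intervals, copied from the printed output above.
# Each feature carries a list of (start, end) pairs: 0-based, end-exclusive.
cds_bounds = [
    [(0, 6)],    # CDS1
    [(72, 78)],  # CDS2
]


def slice_bounds(seq, bounds):
    """Join the sub-sequences covered by one feature's (start, end) pairs."""
    return "".join(seq[start:end] for start, end in bounds)


# Concatenate the CDS pieces in genomic order (forward strand assumed).
transcript_cds = "".join(slice_bounds(sequence, b) for b in sorted(cds_bounds))
print(transcript_cds)  # ATGGAGCTATGA
```

Note the coordinate shift relative to the GFF3 file: the CDS at 73 to 78 (1-based, inclusive) appears as (72, 78) in the interval bounds, so no further offset is needed when slicing.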
