统计细菌基因组ORF

Posted orange1002

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了统计细菌基因组ORF相关的知识,希望对你有一定的参考价值。

提取细菌基因组ORF思路:

1.通过FNA文件得到细菌基因组序列

2.分正负链和三个相位共6种情况统计ORF

3.写入文件

转载请保留出处!

统计细菌基因组ORF

贴上Python代码(版本:3.6)

 1 # -*- coding: utf-8 -*-
 2 """
 3 Created on Thu Dec 14 13:19:00 2017
 4 
 5 @author: zxzhu
 6 """
 7 
 8 import re
 9 def N2M(sequence):                    #正负链转换
10     hash = {\'A\': \'T\', \'T\': \'A\', \'C\': \'G\', \'G\': \'C\',\'N\':\'N\'}
11     sequence = \'\'.join([hash[i] for i in sequence])     
12     return sequence[::-1]
13 
14 def translate(seq):                    #将序列转换为起始,终止,其他密码子
15     pa1 = re.compile(r\'TAA|TAG|TGA\')
16     after_trans = \'\'
17     for i in range(0,len(seq),3):
18         if seq[i:i+3]==\'ATG\':
19             after_trans+=\'I\'
20         elif pa1.match(seq[i:i+3]):
21             after_trans+=\'T\'
22         else:
23             after_trans+=\'O\'
24     return after_trans
25 
26 def get_orf(seq,length=90):
27     pa2 = re.compile(r\'I[IO]+?T\')   #匹配模式:起始1非终止1~N终止1
28     trans_seq = translate(seq)
29     m = pa2.finditer(trans_seq)     #所有匹配结果的迭代
30     index = []
31     orf = []
32     for i in m:
33         index.append(i.span())     #序列起始,终止位置
34     for i in index:
35         orf_start = i[0]*3
36         orf_end = i[1]*3
37         #print(orf_start,orf_end)
38         if orf_end - orf_start >= length:   #不小于90bp
39             orf.append(seq[orf_start:orf_end])
40     return orf
41 
42 def Seq2AA(sequence,hash):    #翻译为AA序列
43     AA=\'\'                                 
44     for i in range(0, len(sequence) - 3, 3):
45         AA += hash[sequence[i:i + 3]]
46     return AA
47 
48 def main(fna,length=90):
49     fn = open(fna)
50     pa = re.compile(r\'\\s+\')
51     hash_seq = {}  # CDS hash,CDS2sequence
52     result1 = open(\'orf_seq.txt\',\'w\')
53     result2 = open(\'orf_AA.txt\',\'w\')
54     start = [0,1,2]       #相位
55     strain = \'+-\'         #正负链
56     hash_AA = {}  # AA hash,sequence2AA
57     with open(\'AA.txt\', \'r\') as f:                         #AA.txt 为密码子表
58         for line in f:
59             line = line.strip()
60             if line:
61                 line = pa.split(line)
62                 hash_AA[line[0]] = line[1]      #AA hash
63     
64     for line in fn:                             #获取序列
65         line = line.strip()
66         if line.startswith(\'>\'):
67             A = pa.split(line)[0].replace(\'>\', \'\')
68             hash_seq[A] = \'\'
69         else:
70             hash_seq[A] += line
71     
72     for key in hash_seq.keys():             #分+-链,3个相位统计ORF
73         seq = hash_seq[key]
74         for r in strain:
75             if r == \'-\':
76                 seq = N2M(seq)
77             for s in start:
78                 seq = seq[s:]
79                 #trans_seq = translate(seq)
80                 orf = get_orf(seq)
81                 for i in orf:
82                     if \'N\' not in i:      #去除N
83                         AA =Seq2AA(i,hash_AA) 
84                         result1.write(\'>\'+key+\'\\t\'+r+\'\\t\'+str(s)+\'\\n\'+i+\'\\n\')
85                         result2.write(\'>\'+key+\'\\t\'+r+\'\\t\'+str(s)+\'\\n\'+AA+\'\\n\')
86     fn.close()
87     result1.close()
88     result2.close()
89 
90     
91 fna = \'GCA_000160075.2_ASM16007v2_genomic.fna\'
92 main(fna)

NCBI可以找ORF,很方便。码一下:ORFfinder

以上是关于统计细菌基因组ORF的主要内容,如果未能解决你的问题,请参考以下文章

请教ORF 和 gene 的区别

DNA sequence open reading frames (ORFs) | DNA序列的开放阅读框ORF预测

如何在genbank查找某个细菌的基因序列?

基因组文库名词解释

CRISPR/Cas9|InParanoid|orthoMCL|PanOCT|pan genome|meta genome|Core gene|CVTree3|

易基因:全基因组ChIP-seq分析揭示细菌转录因子PhoB的基因内结合位点|mBio