Extract a specific range of FASTA sequences from a file

Posted 2023-03-24


Asked: 2020-04-05 11:08:37

Question:

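I am trying to extract sequences from a specific range. The command I am using only extracts the first n sequences of a FASTA file (here, the first 2000):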

awk "/^>/ {n++} n>2000 {exit} {print}" Name.faa > Name_2k_cds.faa

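If I want to extract the sequences in a specific range, for example 2000 to 3000, how can I do that? Is there a simple edit to my existing code?

Thanks!

Comments:

Welcome to SO. Please post samples of your input and expected output in the question. Kudos for showing your own effort in the question, though. Could you please check my answer and let me know whether it helps?

Answer 1: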

You can try this:

sed -n '2000,3000p' Name.faa > Name_2k_to_3k_cds.faa

Explanation:

sed -n       # suppress automatic printing of pattern space
'2000,3000p' # print only line 2000 to 3000
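Note that sed addresses are line numbers, so this selects lines 2000 to 3000 of the file regardless of where FASTA records begin. A small sketch with a scaled-down range shows the effect; the toy file demo.faa and its contents are made up for illustration:

```shell
# Toy multi-line FASTA file (hypothetical data, for illustration only)
printf '>seq1\nAAAA\nCCCC\n>seq2\nGGGG\n>seq3\nTTTT\n' > demo.faa

# Prints lines 2 to 4 of the file, not records 2 to 4
sed -n '2,4p' demo.faa > demo_lines.faa
cat demo_lines.faa
```

On this toy file the result starts in the middle of seq1 and ends with the bare header of seq2, so the approach only matches record boundaries if every record occupies a known, fixed number of lines.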

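Comments:

IMHO, if I understand the question correctly, the OP does not want to print by line number. The OP wants to count the lines that start with > and print the records whose count falls between 2000 and 3000.

Answer 2: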

Could you please try the following.

awk '/^>/{n++} n>=2000 && n<=3000; n==3000{exit}' Name.faa > Name_2k_cds.faa

Explanation: adding a detailed explanation of the above code.

awk '                             ##Start the awk program here.
/^>/{n++}                         ##If a line starts with >, increment the record counter n.
n>=2000 && n<=3000                ##If n is greater than or equal to 2000 AND less than or equal to 3000, print the line.
n==3000{                          ##If n is 3000, leave the program; there is no need to read the whole Input_file since only records 2000 to 3000 are needed.
  exit                            ##Use exit to stop processing.
}
' Name.faa > Name_2k_cds.faa      ##Name the Input_file and redirect the output to another file.

Answer 3:

A small addition to the solution proposed by @RavinderSingh13:

awk '/^>/{n++} n>=2000 && n<=3000; n==3001{exit}' Name.faa > Name_2k_cds.faa

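This ensures that sequence 3000 itself is also written to the new file. The original solution outputs the header of sequence 3000 but not its sequence, because it exits as soon as that header has been read.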
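The same idea can be sketched with the range passed in as awk variables, so the bounds are not hard-coded. This is only an illustration under stated assumptions: the variable names start and end, the toy file demo.faa, and the output name demo_2_to_3.faa are made up here, and the range is scaled down to records 2 to 3.

```shell
# Toy multi-line FASTA file (hypothetical data, for illustration only)
printf '>seq1\nAAAA\nCCCC\n>seq2\nGGGG\n>seq3\nTTTT\nTTAA\n>seq4\nACGT\n' > demo.faa

awk -v start=2 -v end=3 '
  /^>/ { n++ }        # count header lines, i.e. records
  n > end { exit }    # stop at the first header past the wanted range
  n >= start          # print every line of records start..end
' demo.faa > demo_2_to_3.faa

cat demo_2_to_3.faa
```

Checking n > end before printing plays the same role as n==3001{exit} above: the program stops at the first header after the range, so the last requested record is kept in full, including any sequence lines that span several lines.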
