Learn to Write Perl Step by Step

Posted 2021-04-17 生信人



This installment practices arrays and hashes, along with regex matching and capturing.

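1. Extract the GFF records for a list of gene IDs (2D array + regex matching + hash)

GFF file format (tab-separated; the script reads the feature type from column 3 and the ID=... attribute from column 9)

List of gene IDs to extract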

gene0

gene1

Code

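#!/usr/bin/perl

use strict;

use warnings;

die "Usage: $0 <gff> <id_list> <out>\n" unless @ARGV==3;

my $gff=$ARGV[0];  # input GFF file

my $id=$ARGV[1];  # file with one gene ID per line

my $out=$ARGV[2];  # output file

open (IN1, "<", $id)|| die "cannot open $id:$!";

open (OUT1, ">", $out)|| die "cannot open $out:$!";

open (IN2, "<", $gff)|| die "cannot open $gff:$!";

my @ID;  # gene IDs to extract

my @GFF;  # 2D array: one row per GFF line

my %hash;  # gene ID => array of GFF rows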

while(<IN1>){

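chomp;  # remove the trailing newline

next if(/^\#/);  # skip lines that start with #; ^ anchors the match at the start of the line

next if(/^$/);  # skip empty lines; $ anchors the match at the end of the line

push @ID, $_;  # append this gene ID to the 1D array @ID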

}

while(<IN2>){

chomp;

next if(/^\#/);

next if(/^$/);

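my @tmp=split/\t/;  # split defaults to whitespace; here we split on \t and store the fields of this GFF line in a 1D array

push @GFF, [@tmp];  # store a copy of @tmp as one row of the 2D array @GFF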

}

my $geneID;

for(my $i=0;$i<=$#GFF;$i++){

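if($GFF[$i][2]=~/gene/){  # column 3 is the feature type; a gene row starts a new group

($geneID)=$GFF[$i][8]=~/ID=([^;]+)/;

# Regex capture: [^;] matches any character except a semicolon, and the parentheses around [^;]+ mark the part to extract. The parentheses around $geneID put the match in list context, so the captured text is assigned to $geneID. No ; is required after the ID, so an ID that is the last attribute also matches.

}

next unless defined $geneID;  # skip rows that come before the first gene

push @{$hash{$geneID}},[@{$GFF[$i]}];  # each gene ID maps to an array of rows; @{$GFF[$i]} dereferences row $i of the 2D array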

}

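for(my $i=0;$i<=$#ID;$i++){

next unless exists $hash{$ID[$i]};  # skip IDs that do not occur in the GFF

my @array=@{$hash{$ID[$i]}};  # all annotation rows for this ID

foreach (@array){  # example of printing a 2D array

print OUT1 join("\t",@$_)."\n";  # separate the columns with \t

}

}

close OUT1;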
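
The grouping step can be tried without any input files. The following is a minimal sketch rather than the author's script: the three GFF lines are made-up sample data, and the sub name group_by_gene is hypothetical. It shows the same list-context capture and the same hash whose values are arrays of rows.

```perl
use strict;
use warnings;

# Made-up GFF3 lines (9 tab-separated columns); not taken from the original data.
my @lines = (
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene0;Name=demo0",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=rna0;Parent=gene0",
    "chr1\tsrc\tgene\t1500\t2000\t.\t-\t.\tID=gene1",
);

# Group every row under the most recently seen gene ID.
# Returns a reference to a hash: gene ID => array of rows (each row an array ref).
sub group_by_gene {
    my @rows = @_;
    my (%by_gene, $gene_id);
    for my $line (@rows) {
        my @f = split /\t/, $line;
        if ($f[2] eq 'gene') {
            # In list context the match returns the captured text.
            ($gene_id) = $f[8] =~ /ID=([^;]+)/;
        }
        next unless defined $gene_id;    # rows before the first gene
        push @{ $by_gene{$gene_id} }, [@f];
    }
    return \%by_gene;
}

my $by_gene = group_by_gene(@lines);
for my $id (sort keys %$by_gene) {
    printf "%s: %d row(s)\n", $id, scalar @{ $by_gene->{$id} };
}
```

Unlike the script, this sketch compares the feature type with eq 'gene', so a type such as pseudogene would not start a new group; which behavior is wanted depends on the annotation file.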

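2. Regex matching

Dataset

A FASTA file of sequences assembled by LACHESIS. From each sequence ID, work out how many contigs each Lachesis_group contains and how long it is, then output the total length of all groups.

Code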

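#!/usr/bin/perl -w

use strict;

use Bio::SeqIO;

my $fa=$ARGV[0];  # input FASTA file

my $out=$ARGV[1];  # output table

open (OUT, ">", $out)|| die "cannot open $out:$!";

my %hashorder;  # group number => sequence length

my %hashorderN;  # group number => number of contigs

my $order=0;  # total length of all groups

my $orderN=0;  # total number of contigs in all groups

my $ina = Bio::SeqIO->new(-file =>$fa, -format => 'fasta');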

while(my $obj = $ina->next_seq()){

my $id = $obj->id;

if($id=~/Lachesis_group(\d+)\|(\d+)_contigs__length_(\d+)/){

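# Matching uses =~. \d+ matches one or more digits; the text captured by the first pair of parentheses is available as $1, and $2 and $3 hold the second and third captures.

$hashorder{$1}=$3;

$hashorderN{$1}=$2;

}

}  # end of the while loop over the sequences

print OUT "Group\tSequence Number\tSequence Length (bp)\n";  # header line; \t separates the columns with tabs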

foreach my $key(sort {$a<=>$b} keys %hashorder){  # sort the hash keys in ascending numeric order

$order+= $hashorder{$key};

$orderN+= $hashorderN{$key};

print OUT "Lachesis Group$key\t$hashorderN{$key}\t$hashorder{$key}\n";

}

print OUT "Total Sequences Ordered\t$orderN\t$order\n";
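
The capture logic can also be checked without BioPerl or a FASTA file. This is a small sketch under stated assumptions: the sequence IDs are invented to fit the pattern used in the script, and parse_group_id is a hypothetical helper name.

```perl
use strict;
use warnings;

# Invented sequence IDs that follow the pattern the script expects.
my @ids = (
    'Lachesis_group0|12_contigs__length_3400',
    'Lachesis_group1|7_contigs__length_1200',
    'unplaced_contig_42',    # does not match and is skipped
);

# Return (group number, contig count, length), or an empty list if no match.
sub parse_group_id {
    my ($id) = @_;
    if ($id =~ /Lachesis_group(\d+)\|(\d+)_contigs__length_(\d+)/) {
        return ($1, $2, $3);
    }
    return;
}

my ($total_contigs, $total_length) = (0, 0);
for my $id (@ids) {
    my ($group, $contigs, $length) = parse_group_id($id);
    next unless defined $group;
    $total_contigs += $contigs;
    $total_length  += $length;
    print "Lachesis Group$group\t$contigs\t$length\n";
}
print "Total Sequences Ordered\t$total_contigs\t$total_length\n";
```

The | in the pattern is escaped as \| because an unescaped | means alternation in a regex.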

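3. Exercise

If you would like to practice, feel free to leave your answer in the comments!

Print the species name given in parentheses on each line, and count how many times each name occurs.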

GCA_001310045.1 Carica papaya (papaya)

GCA_001611255.1 Oryza sativa Indica Group (rice)

GCA_000236765.1 Sorghum bicolor (sorghum)

GCF_000715135.1 Nicotiana tabacum (common tobacco)

GCA_000147395.2 Oryza glaberrima (rice)

Expected output:

papaya 1

rice 2

sorghum 1

common tobacco 1
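
One possible approach, offered as a sketch and not as the reference answer: capture the text inside the final pair of parentheses, count it in a hash, and keep a separate array so the names print in order of first appearance. The sub name count_common_names is hypothetical, and the records are embedded in the script instead of being read from a file.

```perl
use strict;
use warnings;

my @records = (
    'GCA_001310045.1 Carica papaya (papaya)',
    'GCA_001611255.1 Oryza sativa Indica Group (rice)',
    'GCA_000236765.1 Sorghum bicolor (sorghum)',
    'GCF_000715135.1 Nicotiana tabacum (common tobacco)',
    'GCA_000147395.2 Oryza glaberrima (rice)',
);

# Returns (\@order, \%count): names in first-seen order, and name => occurrences.
sub count_common_names {
    my @lines = @_;
    my (%count, @order);
    for my $line (@lines) {
        # Capture the text inside the parentheses at the end of the line.
        next unless $line =~ /\(([^()]+)\)\s*$/;
        # Record the name the first time it is seen, then count it.
        push @order, $1 unless $count{$1}++;
    }
    return (\@order, \%count);
}

my ($order, $count) = count_common_names(@records);
print "$_ $count->{$_}\n" for @$order;
```

A plain hash does not preserve insertion order in Perl, which is why the sketch keeps the @order array alongside %count.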
