利用R语言及perl语言提取基因CDS区实战

Posted 医知圈

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了利用R语言及perl语言提取基因CDS区实战相关的知识,希望对你有一定的参考价值。

医知圈



最近挺忙,好久没写了,今天来给大家介绍下利用R语言及perl语言提取基因CDS区。为何要提取CDS区,因为我们在做基因进化分析时,特别是病毒相关的,都要用到,而手工提取太麻烦,今天我给大学分享一下R语言及perl语言代码脚本提取。

利用R语言及perl语言提取基因CDS区实战

R语言代码如下 :


## ---------------------------------------------------------------------

## 1. A TOY EXAMPLE

## ---------------------------------------------------------------------


library(Biostrings)


## A chromosome of length 30:

x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC")


## 2 transcripts on 'x':

tx1 <- IRanges(1, 8)            # 1 exon

tx2 <- c(tx1, IRanges(12, 30))  # 2 exons

transcripts <- IRangesList(tx1=tx1, tx2=tx2)

extractTranscriptSeqs(x, transcripts)


## By default, all the exons are considered to be on the plus strand.

## We can use the 'strand' argument to tell extractTranscriptSeqs()

## to extract them from the minus strand.


## Extract all the exons from the minus strand:

extractTranscriptSeqs(x, transcripts, strand="-")


## Note that, for a transcript located on the minus strand, the exons

## should typically be ordered by descending position on the reference

## genome in order to reflect their rank in the transcript:

extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-")


## Extract the exon of the 1st transcript from the minus strand:

extractTranscriptSeqs(x, transcripts, strand=c("-", "+"))


## Extract the 2nd exon of the 2nd transcript from the minus strand:

extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-")))


## ---------------------------------------------------------------------

## 2. A REAL EXAMPLE

## ---------------------------------------------------------------------


## Load a genome:

library(BSgenome.Hsapiens.UCSC.hg19)

genome <- BSgenome.Hsapiens.UCSC.hg19


## Load a TxDb object:

txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",

                         package="GenomicFeatures")

txdb <- loadDb(txdb_file)


## Check that 'txdb' is based on the hg19 assembly:

txdb


## Extract the exon ranges grouped by transcript from 'txdb':

transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)


## Extract the transcript sequences from the genome:

tx_seqs <- extractTranscriptSeqs(genome, transcripts)

tx_seqs


## A sanity check:

stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts)))))


## Note that 'tx_seqs' can also be obtained with:

extractTranscriptSeqs(genome, txdb, use.names=TRUE)


## ---------------------------------------------------------------------

## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES

## ---------------------------------------------------------------------


cds <- cdsBy(txdb, by="tx", use.names=TRUE)

cds_seqs <- extractTranscriptSeqs(genome, cds)

cds_seqs


## A sanity check:

stopifnot(identical(width(cds_seqs), unname(sum(width(cds)))))


## Note that, alternatively, the CDS sequences can be obtained from the

## transcript sequences by removing the 5' and 3' UTRs:

tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)

stopifnot(identical(tx_lens$tx_name, names(tx_seqs)))  # sanity

## Keep the rows in 'tx_lens' that correspond to a sequence in 'cds_seqs'

## and put them in the same order as in 'cds_seqs':

m <- match(names(cds_seqs), names(tx_seqs))

tx_lens <- tx_lens[m, ]

utr5_width <- tx_lens$utr5_len

utr3_width <- tx_lens$utr3_len

cds_seqs2 <- narrow(tx_seqs[m],

                    start=utr5_width+1L, end=-(utr3_width+1L))

stopifnot(identical(as.character(cds_seqs2), as.character(cds_seqs)))


## ---------------------------------------------------------------------

## 4. TRANSLATE THE CDS SEQUENCES

## ---------------------------------------------------------------------


prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve")


## Note that, by default, translate() uses The Standard Genetic Code to

## translate codons into amino acids. However, depending on the organism,

## a different genetic code might be needed to translate CDS sequences

## located on the mitochodrial chromosome. For example, for vertebrates,

## the following code could be used to correct 'prot_seqs':

SGC1 <- getGeneticCode("SGC1")

chrM_idx <- which(all(seqnames(cds) == "chrM"))

prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1,

                                 if.fuzzy.codon="solve")

perl语言代码如下 :

#!/usr/bin/perl

use strict;

use warnings;


my @allcdss=(2,8,12,16,18,25,28,35,80,90,102,110,120,125);

my @allcuts;

{

    local $/ = 100;

    my ($min,$max) = (0,0);

    my $left = '';

    while (defined (my $line = <DATA>)) {

        $line =~ s/ //g;

        $min  = $max - length($left);

        $max  = $max + length($line);

        $line = $left . $line;

        

        print "$min $max ";

        my @allcdss_tmp  = grep { $_ > $min && $_ <= $max } @allcdss;

        if (@allcdss_tmp) {

            my $template = cut2fmt($min,@allcdss_tmp);

            print $template," ";

            my @fields   = unpack($template,$line);

            if ($allcdss_tmp[-1] != $max) {

                $left = pop @fields;

            }

            push @allcuts,@fields;

        } else {

            $left = $line;

            next;

        }

        

        last if $allcdss_tmp[-1] == $allcdss[-1];

    }

}

print "@allcuts ";


sub cut2fmt {

    my $minpos = shift;

    my(@positions) = @_;

    print "@positions ";

    my $template   = ($positions[0] != $minpos) ? 'x' . ($positions[0] - $minpos - 1) : '';

    my $cut_string = 1;

    

    my $lastpos = shift @positions;

    foreach my $place (@positions) {

        if ($cut_string % 2 == 1) {

            my $dist   = $place - $lastpos;

            $template .= ($dist > 0) ? 'A' . ($dist + 1) : '';

        } else {

            my $dist   = $place - $lastpos;

            $template .= ($dist > 0) ? 'x' . ($dist - 1) : '';

        }

        $lastpos = $place;

        $cut_string++;

    }

       

    $template  .= 'A*';

    return $template;

}


利用R语言及perl语言提取基因CDS区实战

                               


以上是关于利用R语言及perl语言提取基因CDS区实战的主要内容,如果未能解决你的问题,请参考以下文章

利用Perl或者R语言对TCGA肿瘤数据lincRNA提取

R语言绘制火山图(volcano plot)实战:为差异表达基因(DEGs)添加颜色基于显著性阈值进行点的颜色美化为选定基因添加标签

R语言DESeq2基因差异表达分析

R语言dplyr包的slice函数提取数据集的数据行实战

R语言数据挖掘实战系列

R语言实战-如何分析QQ群记录2-每日信息数量图