构建miRNA-seq数据分析环境
Posted 生信技能树
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了构建miRNA-seq数据分析环境相关的知识,希望对你有一定的参考价值。
最近在生信技能树分享了几个miRNA的靶向基因的查询工具,分别是:
-
自学miRNA-seq分析第八讲~miRNA-mRNA表达相关下游分析 | 生信菜鸟团 -
自学miRNA-seq分析第七讲~miRNA样本配对mRNA表达量获取 | 生信菜鸟团 -
自学miRNA-seq分析第六讲~miRNA表达量差异分析 | 生信菜鸟团 -
自学miRNA-seq分析第五讲~miRNA表达量获取 | 生信菜鸟团 -
自学miRNA-seq分析第四讲~测序数据比对 | 生信菜鸟团 -
自学miRNA-seq分析第三讲~公共测序数据下载 | 生信菜鸟团 -
自学miRNA-seq分析第二讲~学习资料的搜集 | 生信菜鸟团 -
自学miRNA-seq分析第一讲~文献选择与解读 | 生信菜鸟团
首先下载miRBase的miRNA参考序列文件
mkdir -p ~/reference/miRNA
cd ~/reference/miRNA
wget ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz ## 38589 reads
wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.zip ## 48885 reads
wget ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.zip
wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3
unzip hairpin.fa.zip
unzip mature.fa.zip
grep sapiens mature.fa |wc -l # 2656
grep sapiens hairpin.fa |wc # 1917
## Homo sapiens
perl -alne '{if(/^>/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;s/U/T/g if !/>/;print }' hairpin.fa > hairpin.human.fa
perl -alne '{if(/^>/){if(/Homo/){$tmp=1}else{$tmp=0}};next if $tmp!=1;s/U/T/g if !/>/;print }' mature.fa > mature.human.fa
# 这里值得一提的是miRBase数据库下载的序列,居然都是用U表示的,也就是说就是miRNA序列,而不是转录成该miRNA的基因序列,而我们测序的都是基因序列。
5 .9M Mar 12 2018 hairpin .fa
1 .5M Apr 29 15 :09 hairpin .fa .gz
1 .5M Apr 29 15 :11 hairpin .fa .zip
263 K Apr 29 15 :13 hairpin .human .fa
523 K Apr 29 15 :11 hsa .gff3
3 .7M Mar 12 2018 mature .fa
786 K Apr 29 15 :09 mature .fa .zip
196 K Apr 29 15 :13 mature .human .fa
然后使用conda配置好软件环境
conda create -n miRNA
conda activate miRNA
conda install -y -c bioconda hisat2 bowtie samtools fastp bowtie2 fastx_toolkit fastqc
# sra-toolkits,subreads 也可以一并下载
# conda search fastx_toolkit
# 耗费约1.2G的磁盘空间
针对miRBase数据库下载的序列构建bowtie索引
-
http://bowtie-bio.sourceforge.net/index.shtml -
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
bowtie-build mature.human.fa hsa-mature-bowtie-index
bowtie-build hairpin.human.fa hsa-hairpin-bowtie-index
4 .2M Apr 29 15 :42 hsa-hairpin-bowtie-index .1 .ebwt
20 K Apr 29 15 :42 hsa-hairpin-bowtie-index .2 .ebwt
17 K Apr 29 15 :42 hsa-hairpin-bowtie-index .3 .ebwt
39 K Apr 29 15 :42 hsa-hairpin-bowtie-index .4 .ebwt
4 .2M Apr 29 15 :42 hsa-hairpin-bowtie-index .rev .1 .ebwt
20 K Apr 29 15 :42 hsa-hairpin-bowtie-index .rev .2 .ebwt
4 .2M Apr 29 15 :41 hsa-mature-bowtie-index .1 .ebwt
7 .1K Apr 29 15 :41 hsa-mature-bowtie-index .2 .ebwt
24 K Apr 29 15 :41 hsa-mature-bowtie-index .3 .ebwt
15 K Apr 29 15 :41 hsa-mature-bowtie-index .4 .ebwt
4 .2M Apr 29 15 :41 hsa-mature-bowtie-index .rev .1 .ebwt
7 .1K Apr 29 15 :41 hsa-mature-bowtie-index .rev .2 .ebwt
下载测试数据
GSM1470353 control-CM, experiment1
GSM1470354 ET1-CM, experiment1
GSM1470355 control-CM, experiment2
GSM1470356 ET1-CM, experiment2
GSM1470357 control-CM, experiment3
GSM1470358 ET1-CM, experiment3
SRR1542714
SRR1542715
SRR1542716
SRR1542717
SRR1542718
SRR1542719
ftp: //ftp.sra.ebi.ac.uk /vol1/srr/SRR154/004/SRR1542714;ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/ 004/SRR1542714/SRR1542714.fastq.gz
# 从14到19
# 可以对这6个样本,分开prefetch
~/biosoft/sratoolkit/sratoolkit .2 .8 .2 -1-centos_linux64/bin/prefetch SRR1542714
mkdir -p ~/reference/miRNA
cd ~/reference/miRNA
# step1 : download raw data
mkdir miRNA_test && cd miRNA_test
echo {14..19} |sed 's/ /\n/g' | while read id; \
do (~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch SRR15427 ${id} ) ;\
done
# 下面是另外一个课题,可以参考比较
echo {44..59} |sed 's/ /\n/g' | while read id; \
do (~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch SRR77077 ${id} ) ;\
done
# 另外,这样的下载方式,在中国大陆会非常慢!!!
# 建议换成aspera哈
33 M Apr 29 16 :07 SRR1542714 .sra
31 M Apr 29 16 :07 SRR1542715 .sra
50 M Apr 29 16 :07 SRR1542716 .sra
24 M Apr 29 16 :07 SRR1542717 .sra
52 M Apr 29 16 :07 SRR1542718 .sra
34 M Apr 29 16 :07 SRR1542719 .sra
dump=/home/yb77613/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump
echo {14..19} |sed 's/ /\n/g' | while read id; \
do ( $dump -O ./ --gzip --split-3 SRR15427 ${id} ) ;\
done
50 M Apr 29 16 :14 SRR1542714 .fastq .gz
47 M Apr 29 16 :15 SRR1542715 .fastq .gz
75 M Apr 29 16 :15 SRR1542716 .fastq .gz
35 M Apr 29 16 :15 SRR1542717 .fastq .gz
81 M Apr 29 16 :16 SRR1542718 .fastq .gz
50 M Apr 29 16 :16 SRR1542719 .fastq .gz
质控和清洗测序数据
ls *gz | while read id
do
echo $id
# fastqc $id
zcat $id| fastq_quality_filter -v -q 20 -p 80 -Q33 -i - -o tmp ;
fastx_trimmer -v -f 1 -l 27 -m 15 -i tmp -Q33 -z -o ${id%%.*}_clean.fq.gz ;
# fastqc ${id%%.*}_clean.fq.gz
done
-
fastq_quality_filter 代表根据质量过滤序列 -
fastx_trimmer 代表缩短FASTQ或FASTQ文件中的读数,根据质量修剪(剪切)序列。
#运行一次查看是否可以成功运行
#fastq_quality_filter 代表根据质量过滤序列
#-v,即verbose,报告序列的数目
#-q,即quality,保留碱基所要求的最小质量评分,低于此值将会去除
#-p,即percent,即序列中超过最小质量评分的碱基数目的最小百分率,低于这个比率将删除
#-Q33,即phred33,默认是按照phred64作为参考的
#-i,即infile,代表输入文件,注意不能是压缩文件,可以是FASTA/FASTQ都行。默认是STDIN,即标准输入
#-o,即outfile,代表输出文件(注意不是output dir即输出目录,此处输出是一个文件而不是文件夹),默认是STDOUT即标准输出,指定则输出到指定文件
fastq_quality_filter -v -q 20 -p 80 -Q33 -i SRR1542714. 1.fastq -o tmp #生成第一步处理的临时文件
# 中间文件是 tmp ,到时候可以删除掉。
#fastx_trimmer 代表缩短FASTQ或FASTQ文件中的读数,根据质量修剪(剪切)序列。
#-v,即verbose,报告序列的数目
#-f,即first,代表保留第几个碱基,默认是保留第一个
#-l,即last,代表保留最后的碱基,默认是整个reads
#-i,即infile,代表输入文件,注意不能是压缩文件,可以是FASTA/FASTQ都行。默认是STDIN,即标准输入
#-z,即compress,代表的是使用Gzip压缩输出
#-o,即outfile,代表输出文件(注意不是output dir即输出目录,此处输出是一个文件而不是文件夹),默认是STDOUT即标准输出,指定则输出到指定文件
fastx_trimmer -v -f 1 -l 27 -i tmp -Q33 -z -o SRR1542714.1_clean.fq.gz #生成最后的文件
39 M Apr 29 16 :28 SRR1542714_clean .fq .gz
37 M Apr 29 16 :28 SRR1542715_clean .fq .gz
49 M Apr 29 16 :29 SRR1542716_clean .fq .gz
18 M Apr 29 16 :29 SRR1542717_clean .fq .gz
68 M Apr 29 16 :30 SRR1542718_clean .fq .gz
30 M Apr 29 16 :30 SRR1542719_clean .fq .gz
比对 miRBase数据库下载的序列 +定量
mature=/home/yb77613/reference/miRNA/hsa-mature-bowtie-index
hairpin=/home/yb77613/reference/miRNA/hsa-hairpin-bowtie-index
ls *_clean.fq.gz | while read id
do
echo $id
bowtie -n 0 -m1 --best --strata $mature $id -S ${id}_matrue.sam
bowtie -n 0 -m1 --best --strata $hairpin $id -S ${id}_hairpin.sam
done
ls *.sam| while read id ; do (samtools sort -O bam -@ 5 -o $(basename ${id} ".sam").bam ${id}); done
rm *.sam
SRR1542714_clean.fq.gz
# reads processed: 1520320
# reads with at least one reported alignment: 331645 (21.81%)
# reads that failed to align: 1145386 (75.34%)
# reads with alignments suppressed due to -m: 43289 (2.85%)
Reported 331645 alignments
# reads processed: 1520320
# reads with at least one reported alignment: 331482 (21.80%)
# reads that failed to align: 1008271 (66.32%)
# reads with alignments suppressed due to -m: 180567 (11.88%)
Reported 331482 alignments
29 M Apr 29 20 :15 SRR1542714_clean .fq .gz_hairpin .bam
28 M Apr 29 20 :15 SRR1542714_clean .fq .gz_matrue .bam
27 M Apr 29 20 :15 SRR1542715_clean .fq .gz_hairpin .bam
26 M Apr 29 20 :15 SRR1542715_clean .fq .gz_matrue .bam
36 M Apr 29 20 :15 SRR1542716_clean .fq .gz_hairpin .bam
35 M Apr 29 20 :15 SRR1542716_clean .fq .gz_matrue .bam
13 M Apr 29 20 :15 SRR1542717_clean .fq .gz_hairpin .bam
13 M Apr 29 20 :15 SRR1542717_clean .fq .gz_matrue .bam
49 M Apr 29 20 :15 SRR1542718_clean .fq .gz_hairpin .bam
48 M Apr 29 20 :15 SRR1542718_clean .fq .gz_matrue .bam
22 M Apr 29 20 :15 SRR1542719_clean .fq .gz_hairpin .bam
22 M Apr 29 20 :15 SRR1542719_clean .fq .gz_matrue .bam
ls *.bam |xargs -i samtools index {}
ls *.bam| while read id ; do (samtools idxstats ${id} > ${id}.txt ); done
# samtools view matrue.bam |cut -f 3 |sort |uniq -c
比对参考基因组+定量
index=/home/yb77613/reference/index/bowtie/hg38
ls *_clean.fq.gz | while read id
do
echo $id
bowtie2 -p 4 -x $index -U $id | samtools sort -@ 4 -o ${id}_genome.bam -
done
SRR1542715_clean .fq .gz
1520320 reads; of these:
1520320 (100 .00%) were unpaired; of these:
123178 (8 .10%) aligned 0 times
333700 (21 .95%) aligned exactly 1 time
1063442 (69 .95%) aligned >1 times
91 .90% overall alignment rate
34 M Apr 29 20 :30 SRR1542714_clean .fq .gz_genome .bam
31 M Apr 29 20 :30 SRR1542715_clean .fq .gz_genome .bam
42 M Apr 29 20 :31 SRR1542716_clean .fq .gz_genome .bam
15 M Apr 29 20 :31 SRR1542717_clean .fq .gz_genome .bam
57 M Apr 29 20 :32 SRR1542718_clean .fq .gz_genome .bam
25 M Apr 29 20 :32 SRR1542719_clean .fq .gz_genome .bam
gtf= /home/yb77613/reference/miRNA/hsa.gff3
bin_featureCounts= "/home/yb77613/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts";
$bin_featureCounts -T 4 -F gff -M -t miRNA -g Name -a $gtf -o all.counts.mature.txt *genome* 1>counts.mature.log 2>& 1
$bin_featureCounts -T 4 -F gff -M -t miRNA_primary_transcript -g Name -a $gtf -o all.counts.hairpin.txt *genome* 1>counts.hairpin.log 2>& 1
比较两个定量方式的表达矩阵差异
hsa-miR-12136 chr1 632668 632685 - 18 26 21 50 6 78 19
hsa-miR-34a-5p chr1 9151735 9151756 - 22 3009 9494 3467 4136 5299 7097
hsa-miR-34a-3p chr1 9151693 9151714 - 22 88 153 100 86 126 132
SRR1542714_clean .fq .gz_matrue .bam .txt :hsa-miR-12136 18 104 0
SRR1542715_clean .fq .gz_matrue .bam .txt :hsa-miR-12136 18 182 0
SRR1542716_clean .fq .gz_matrue .bam .txt :hsa-miR-12136 18 109 0
SRR1542717_clean .fq .gz_matrue .bam .txt :hsa-miR-12136 18 44 0
SRR1542718_clean .fq .gz_matrue .bam .txt :hsa-miR-12136 18 235 0
SRR1542719_clean .fq .gz_matrue .bam .txt :hsa-miR-12136 18 92 0
SRR1542714_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-5p 22 1745 0
SRR1542715_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-5p 22 5825 0
SRR1542716_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-5p 22 2015 0
SRR1542717_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-5p 22 2605 0
SRR1542718_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-5p 22 3081 0
SRR1542719_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-5p 22 4413 0
SRR1542714_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-3p 22 25 0
SRR1542715_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-3p 22 69 0
SRR1542716_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-3p 22 21 0
SRR1542717_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-3p 22 27 0
SRR1542718_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-3p 22 37 0
SRR1542719_clean .fq .gz_matrue .bam .txt :hsa-miR-34a-3p 22 57 0
老规矩,我们的拉群小助手会协助大家进入一个miRNA-seq数据分析交流群哈, 跟我们之前的其它群类似:
还是老规矩,18.8元进群,一个简单的门槛,隔绝那些营销号!同时,我们也会在群里共享一些miRNA-seq数据分析相关资料,仅此而已,考虑清楚哦!
支付18.8元入学习群
烦请备注姓名学校单位信息
以上是关于构建miRNA-seq数据分析环境的主要内容,如果未能解决你的问题,请参考以下文章