RNA-Seq analysis: RPKM, FPKM, and TPM, calculation and comparison

Posted 2023-04-28

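In high-throughput sequencing, measuring gene expression is a central task: it is the basis of differential expression analysis and of transcriptome analysis in general. As with qPCR, expression is quantified in relative terms.
The number of reads counted within a gene region depends on gene length and on sequencing depth.

Within one sample, the longer a gene is, the more fragments random fragmentation produces from it, the more likely it is to be sequenced, and the more reads map to it.

Across samples, the deeper a sample is sequenced, the more often the same gene is sampled, and the more reads map to it.
From these two points it follows that the longer a gene is and the deeper the sequencing, the more reads fall within it. To compare the expression of different genes, the counts therefore have to be normalized.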

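In the figure above, comparing rep3 with rep1, every gene has a higher count in rep3, which indicates that rep3 was sequenced more deeply than rep1. Comparing gene B with gene A, gene B has the higher count in every replicate, which here reflects that gene B is longer than gene A.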

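RPKM: Reads Per Kilobase Million (reads per kilobase of exon per million mapped reads)
Normalize for sequencing depth first, then for gene length.
Formula: RPKM = total exon reads / (mapped reads (millions) × exon length (kb))
total exon reads: all reads in a sample that map to the exons of a given gene.
mapped reads (millions): the total number of mapped reads in the sample, in millions.
exon length (kb): the length of the gene (the summed length of its exons, in kilobases).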

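As shown in the figure above, for gene A in rep1: RPKM = 10/(3.5 × 2) = 1.43. The toy example scales per 10 reads instead of per million, so the 35 total reads in rep1 give a scaling factor of 3.5.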
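
The formula can be sketched in Python as follows. This is a minimal illustration, not code from the original post: the function name `rpkm` and the `per` parameter are assumptions introduced here. `per` defaults to one million; the toy call uses `per=10` on the assumption that the figure's example scales per 10 reads, which is what yields 1.43 from 10 reads, 35 total reads, and a 2 kb gene. The second call uses hypothetical numbers at a more realistic depth.

```python
def rpkm(gene_reads, total_mapped_reads, gene_length_kb, per=1_000_000):
    """RPKM = gene_reads / ((total_mapped_reads / per) * gene_length_kb)."""
    scaling_factor = total_mapped_reads / per  # the "per million" factor when per = 1e6
    rpm = gene_reads / scaling_factor          # depth-normalized value (reads per million)
    return rpm / gene_length_kb                # then length-normalized: RPKM


# Toy numbers from the example: 10 reads on a 2 kb gene, 35 reads in total.
# per=10 is an assumption about how the toy example was scaled.
toy = rpkm(10, 35, 2.0, per=10)
print(round(toy, 2))  # 1.43

# Hypothetical realistic case: 500 reads, 20 million mapped reads, 2 kb gene.
realistic = rpkm(500, 20_000_000, 2.0)
print(realistic)  # 12.5
```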

FPKM：Fragments Per Kilobase Million
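RPKM is for single-end (SE) RNA-seq.
FPKM is very similar to RPKM, but for paired-end (PE) RNA-seq.
The figure below shows the difference between reads and fragments, and why RPKM is used for SE data and FPKM for PE data.

For PE data, if both reads of a pair map, the pair counts as one fragment; if one read maps and its mate does not, the mapped read alone counts as one fragment.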
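
The counting rule for paired-end data can be sketched like this. It is a simplified illustration under stated assumptions: each read pair is represented only by two booleans saying whether read 1 and read 2 mapped, and the function name `count_reads_and_fragments` is hypothetical. Real pipelines read this information from alignment files.

```python
def count_reads_and_fragments(pairs):
    """pairs: iterable of (read1_mapped, read2_mapped) booleans, one tuple per read pair."""
    reads = 0
    fragments = 0
    for read1_mapped, read2_mapped in pairs:
        mapped = int(read1_mapped) + int(read2_mapped)
        reads += mapped        # read-level counting: every mapped read counts once
        if mapped > 0:
            fragments += 1     # fragment-level counting: one per pair with at least one mapped read
    return reads, fragments


# Hypothetical pairs: two fully mapped, one with a single mapped read, one unmapped.
pairs = [(True, True), (True, True), (True, False), (False, False)]
reads, fragments = count_reads_and_fragments(pairs)
print(reads, fragments)  # 5 3
```

Counting fragments rather than reads keeps a fully mapped pair from contributing twice to the same gene.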

TPM: Transcripts Per Million
The main difference between TPM and RPKM/FPKM is the order of operations.
TPM normalizes for gene length first, then for sequencing depth.

This is the reverse of the order used by RPKM and FPKM.

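My understanding: because the order of normalization differs, the TPM "pie" (the per-sample total) is the same size in every sample, whereas the RPKM "pie" is not.
StatQuest: With TPM, everyone gets the same sized pie. Since RNA-seq is all about comparing relative proportions of reads, this metric seems more appropriate.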

Reference: https://www.jianshu.com/p/879db8f94a34

39count_rpkm_fpkm_TPM

References: https://f1000research.com/articles/4-1521/v1

https://www.biostars.org/p/171766/

http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

It used to be that when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). However, TPM (Transcripts Per Million) is now becoming quite popular.

============================fpkm====================================

rate = geneA_count / geneA_length    ## geneA_length in kilobases

fpkm = rate / (sum(gene*_count) / 10^6)

That is: fpkm = 10^6 * (geneA_count / geneA_length) / sum(gene*_count)    ## sum(gene*_count): the sum of the raw, unnormalized counts of all genes

============================TPM====================================

rate = geneA_count / geneA_length

tpm = rate / (sum(rate) / 10^6)

That is: tpm = 10^6 * (geneA_count / geneA_length) / sum(rate)    ## sum(rate): the sum of the length-normalized rates of all genes

====================================================================
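
One consequence of the two formulas above is that TPM can be obtained from FPKM by rescaling each value by the sample's FPKM total, because the shared depth factor cancels. The sketch below checks this numerically. The function names and the count and length values are hypothetical, chosen only for illustration; lengths are assumed to be in kilobases.

```python
def fpkm(counts, lengths_kb):
    total_count = sum(counts)  # raw, unnormalized sum of all gene counts
    return [c / l / (total_count / 1e6) for c, l in zip(counts, lengths_kb)]


def tpm(counts, lengths_kb):
    rates = [c / l for c, l in zip(counts, lengths_kb)]  # length-normalized rates
    total_rate = sum(rates)
    return [r / (total_rate / 1e6) for r in rates]


def fpkm_to_tpm(fpkm_values):
    total_fpkm = sum(fpkm_values)
    return [f / total_fpkm * 1e6 for f in fpkm_values]


# Hypothetical counts and gene lengths (kb) for four genes in one sample.
counts = [1200, 2500, 800, 50]
lengths_kb = [2.0, 4.0, 1.0, 10.0]

direct = tpm(counts, lengths_kb)
converted = fpkm_to_tpm(fpkm(counts, lengths_kb))
same = all(abs(a - b) < 1e-6 for a, b in zip(direct, converted))
print(same)  # True
```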

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:

Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:

Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
Divide the RPK values by the “per million” scaling factor. This gives you TPM.
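
The three TPM steps above can be followed one at a time in a short Python sketch. The gene names, counts, and lengths are hypothetical values used only to make the steps concrete.

```python
# Hypothetical read counts and gene lengths (in kilobases) for one sample.
counts = {"A": 12, "B": 25, "C": 8, "D": 0}
lengths_kb = {"A": 2.0, "B": 4.0, "C": 1.0, "D": 10.0}

# Step 1: divide each read count by the gene length in kb (reads per kilobase, RPK).
rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}

# Step 2: sum all RPK values and divide by 1,000,000 to get the "per million" scaling factor.
scaling_factor = sum(rpk.values()) / 1_000_000

# Step 3: divide each RPK value by the scaling factor to get TPM.
tpm = {gene: rpk[gene] / scaling_factor for gene in rpk}

for gene, value in tpm.items():
    print(gene, round(value, 1))
print("total:", round(sum(tpm.values())))  # total: 1000000
```

Because the scaling factor is built from the RPK values themselves, the TPM values of a sample always sum to one million, up to floating-point rounding.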

So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.

When you use TPM, the sum of all TPMs in each sample is the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.

Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in Sample 2 is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the sum of the TPMs in both samples always adds up to the same number (so the denominator required to calculate the proportions is the same, regardless of what sample you are looking at).

With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.
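
A small numerical sketch of this point follows. The two samples are hypothetical and constructed so that gene A has the same RPKM in both. "Proportion" is interpreted here as gene A's share of the sample's normalized total (the "pie"), which is an assumption about what the passage means.

```python
def rpkm(counts, lengths_kb):
    per_million = sum(counts) / 1_000_000        # depth scaling factor from raw counts
    return [c / per_million / l for c, l in zip(counts, lengths_kb)]


def tpm(counts, lengths_kb):
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    per_million = sum(rpk) / 1_000_000           # scaling factor from length-normalized counts
    return [r / per_million for r in rpk]


lengths_kb = [2.0, 4.0, 1.0]   # genes A, B, C (hypothetical)
sample1 = [10, 20, 5]          # hypothetical counts
sample2 = [20, 10, 40]

rpkm1, rpkm2 = rpkm(sample1, lengths_kb), rpkm(sample2, lengths_kb)
tpm1, tpm2 = tpm(sample1, lengths_kb), tpm(sample2, lengths_kb)

# Gene A has the same RPKM in both samples, but the RPKM totals differ,
# so the shared value hides a different share of each sample's pie.
share1 = rpkm1[0] / sum(rpkm1)
share2 = rpkm2[0] / sum(rpkm2)
print(round(rpkm1[0]), round(rpkm2[0]))      # equal RPKM for gene A
print(round(sum(rpkm1)), round(sum(rpkm2)))  # unequal totals
print(round(share1, 2), round(share2, 2))    # 0.33 0.19

# TPM totals are the same in both samples, so the TPM values compare directly.
print(round(tpm1[0]), round(tpm2[0]))
```

In this construction the TPM of gene A differs between the samples by exactly the ratio of the two shares, which is the information the equal RPKM values hide.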
