Manifold learning-based methods for analyzing single-cell RNA-sequencing data
Posted by beckygogogo
https://doi.org/10.1016/j.coisb.2017.12.008
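Published by Yale University in December 2017, this paper covers improved dimensionality reduction and denoising of single-cell data, based on manifold learning from machine learning.
Manifold learning:
Assume the data are sampled uniformly from a low-dimensional manifold embedded in a high-dimensional Euclidean space. Manifold learning recovers that low-dimensional structure from the high-dimensional samples: it finds the low-dimensional manifold in the high-dimensional space and computes the corresponding embedding map, which gives dimensionality reduction or data visualization. The idea is to look past the observed phenomena to the essence of things and find the underlying rules that generate the data.
Common manifold learning (MFL) methods include PCA, MDS, and diffusion maps; the figure below briefly summarizes the strengths and weaknesses of the different methods.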
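To make the idea concrete, here is a minimal sketch of a diffusion map, one of the methods listed above, applied to a toy curve (a helix in 3-D whose only real degree of freedom is the parameter t). This is not code from the paper. The function name `diffusion_map`, the Gaussian kernel, the bandwidth `sigma`, and the toy data are all my own assumptions for illustration.

```python
import numpy as np

def diffusion_map(X, sigma=0.5, n_components=2):
    """Toy diffusion map (illustrative sketch, not the paper's implementation)."""
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Gaussian affinities; the bandwidth sigma is an assumed choice.
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    # Symmetric normalization D^{-1/2} K D^{-1/2}; it has the same
    # eigenvalues as the row-stochastic Markov matrix D^{-1} K.
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of the Markov matrix.
    psi = eigvecs / np.sqrt(d)[:, None]
    # Drop the trivial first component (eigenvalue 1, constant vector).
    keep = slice(1, n_components + 1)
    return psi[:, keep] * eigvals[keep], eigvals

# Toy data: a 1-D curve (helix) embedded in 3-D, plus a little noise.
t = np.linspace(0.0, 3.0 * np.pi, 200)
rng = np.random.default_rng(0)
X = np.column_stack([np.cos(t), np.sin(t), t]) + 0.01 * rng.normal(size=(200, 3))

embedding, eigvals = diffusion_map(X)
# How well the first diffusion component tracks the hidden parameter t
# (absolute value, because the sign of an eigenvector is arbitrary).
corr = abs(np.corrcoef(embedding[:, 0], t)[0, 1])
print(embedding.shape, corr)
```

On this toy curve the first non-trivial component follows the position along the helix, which is the "low-dimensional manifold" hiding in the 3-D coordinates.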
Keywords of this paper:
MFL (Manifold models can also be useful for analyzing data generated from disparate dynamics or profiles as the data can be modeled with several disconnected manifolds)
DPT (a pseudotime trajectory through the data to describe a latent axis of development or cell state transition)
DPT method (to find a major axis of variability in the data, DPT defines a distance from a source cell to all other cells over a modified transition operator that includes only non-trivial diffusion components. This produces trajectories of nonlinear variation across a dataset)
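The DPT description above (a distance from a source cell over a transition operator with the trivial component removed) can be sketched in a few lines of NumPy. This is my own simplified reading, not code from the paper: the function name `dpt_from_root`, the Gaussian kernel, the bandwidth `sigma`, the density normalization step, and the toy helix data are assumptions for illustration.

```python
import numpy as np

def dpt_from_root(X, root, sigma=0.5):
    """Simplified DPT-style distance from one root cell (illustrative sketch)."""
    # Gaussian kernel on pairwise squared distances (sigma is assumed).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    # Density normalization, then a symmetric transition operator T.
    q = K.sum(axis=1)
    W = K / np.outer(q, q)
    d = W.sum(axis=1)
    T = W / np.sqrt(np.outer(d, d))
    # Remove the trivial component: the eigenvector of T with eigenvalue 1.
    psi0 = np.sqrt(d / d.sum())
    T_tilde = T - np.outer(psi0, psi0)
    # Sum over all walk lengths: T~ + T~^2 + ... = (I - T~)^{-1} - I.
    n = X.shape[0]
    M = np.linalg.inv(np.eye(n) - T_tilde) - np.eye(n)
    # Distance from the root = norm of the difference between rows of M.
    return np.linalg.norm(M - M[root], axis=1)

# Toy data: cells laid out along a helix, ordered by a hidden parameter t.
t = np.linspace(0.0, 3.0 * np.pi, 200)
rng = np.random.default_rng(0)
X = np.column_stack([np.cos(t), np.sin(t), t]) + 0.01 * rng.normal(size=(200, 3))

pseudotime = dpt_from_root(X, root=0)   # cell 0 is the assumed source cell
cell_order = np.argsort(pseudotime)     # cells sorted along the trajectory
```

The root cell has to be chosen by hand here; on real data that choice (and the kernel bandwidth) matters a lot.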
The paper uses MFL as the second step of its scRNA-seq analysis workflow:
gene selection,
manifold learning,
cell organization,
dimensionality reduction and visualization,
density estimation and clustering.
The first three steps together are referred to as pseudotime methods.
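The first three steps listed above can be walked through on simulated data. This is only a toy sketch under my own assumptions: the expression matrix is synthetic, gene selection is a plain variance ranking, the manifold learning step is replaced by PCA (the simplest method in the list of MFL methods), and cells are organized by ranking them along the first component. None of the variable names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_informative = 300, 100, 20

# Synthetic data: a hidden developmental time drives the first 20 genes,
# the remaining genes are pure noise.
latent = rng.uniform(0.0, 1.0, n_cells)
expr = rng.normal(0.0, 0.3, size=(n_cells, n_genes))
slopes = np.linspace(2.0, 5.0, n_informative) * rng.choice([-1, 1], n_informative)
expr[:, :n_informative] += latent[:, None] * slopes

# Step 1, gene selection: keep the genes with the highest variance.
top = np.argsort(expr.var(axis=0))[::-1][:n_informative]
selected = expr[:, top]

# Step 2, manifold learning: PCA via SVD stands in for the MFL method here.
centered = selected - selected.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc1 = U[:, 0] * S[0]

# Step 3, cell organization: rank cells along the leading component
# and rescale the ranks to [0, 1] as a pseudotime.
pseudotime = np.argsort(np.argsort(pc1)) / (n_cells - 1)
```

The direction of the first component is arbitrary, so the pseudotime may run backwards relative to the hidden time; real pseudotime methods fix this by choosing a root cell.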
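The figure below lays out the paper's analysis workflow clearly, and it is super pretty too. I think I need more practice before I can make figures like that; walking through the analysis logic is more my thing, haha:
Each plot has its own subtitle, which is enough to understand what the authors are doing.
Among these,
the core algorithms of the paper are in the figure below: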
Comparison of pseudotime methods. Pseudotime methods (four kinds of method) may generally be broken down into three stages: gene selection, manifold learning, and cell organization.
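From this comparison, the authors point out a limitation of the existing methods:
A current limitation of these methods is their reliance to varying degrees on assumptions about the underlying shape of the data (e.g. a tree, bifurcating trajectory, etc.). (The geometry assumed for the underlying shape of the data has a big effect on the later cell typing.)
DPT, the last of the methods compared, is described as providing two significant advantages over other pseudotemporal techniques. First, working directly on a diffusion map does not require any greedy computational steps (classic hierarchical clustering is greedy at every step, meaning each step is locally optimal rather than globally optimal for the tree). Second and most importantly, because DPT operates directly on the diffusion space, it features the least coarse graining or over-fitting of data into low-dimensional assumptions (DPT works on the whole diffusion space rather than on a bifurcating or tree structure, so it involves the least coarse-graining and over-fitting when going to a low-dimensional space).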
Validation at the end of the paper:
Validation of three dimensionality reduction analyses, plus Jaccard index similarity validation on simulated data points in a Jaccard graph, which I mentioned in a previous blog post.
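As a reminder of what the Jaccard index measures, here is a minimal sketch. I am not sure exactly how the paper builds its Jaccard graph, so this only shows the basic index on two made-up neighbor sets; the function name `jaccard_index` and the cell labels are hypothetical.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical example: the neighbors of one cell in two different embeddings.
neighbors_embedding_1 = {"cell_2", "cell_3", "cell_5"}
neighbors_embedding_2 = {"cell_3", "cell_5", "cell_8", "cell_9"}

# 2 shared neighbors out of 5 distinct neighbors in total.
score = jaccard_index(neighbors_embedding_1, neighbors_embedding_2)
print(score)
```

A score of 1 means the two neighborhoods are identical and 0 means they share nothing, so averaging it over all cells gives a rough measure of how similar two embeddings are.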
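The whole paper is a narrative introduction to the algorithms, with no formulas or code posted. In my humble opinion, the important machine learning ideas in it are manifold learning, pseudotime methods, and the dimensionality reduction methods derived from MFL.
Here is a CSDN blog post on MFL; the author explains it really well.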
https://blog.csdn.net/chl033/article/details/6107042