Analyzing single-cell data in Python: a module for doublet removal

Posted 2023-03-21


Reference A

Hi everyone. Last time we introduced the R package DoubletFinder for removing doublets. Is there a comparable Python module? There is: Scrublet. This time we will use it to remove doublets.

Single-Cell Remover of Doublets
Python code for identifying doublets in single-cell RNA-seq data
Given a raw (unnormalized) UMI count matrix counts_matrix, with cells as rows and genes as columns, Scrublet computes a doublet score for each cell.

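The Scrublet object's scrub_doublets() method simulates doublets from the observed data and uses a k-nearest-neighbor classifier to compute a continuous doublet_score (between 0 and 1) for each transcriptome. The scores are then thresholded automatically to produce predicted_doublets, a boolean array that is True for predicted doublets and False otherwise.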
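To make the idea concrete, here is a minimal NumPy sketch of "simulate doublets, then score with nearest neighbors". It is a simplified illustration only, not Scrublet's actual implementation: the function names simulate_doublets and knn_doublet_scores are made up for this example, the toy data is synthetic, and the score here is just the raw fraction of simulated neighbors.

```python
import numpy as np


def simulate_doublets(counts, sim_doublet_ratio=2.0, rng=None):
    """Create synthetic doublets by summing the counts of random cell pairs."""
    rng = np.random.default_rng(rng)
    n_cells = counts.shape[0]
    n_sim = int(n_cells * sim_doublet_ratio)
    parents = rng.integers(0, n_cells, size=(n_sim, 2))
    return counts[parents[:, 0]] + counts[parents[:, 1]]


def knn_doublet_scores(observed, simulated, n_neighbors):
    """Score each observed cell by the fraction of its nearest neighbours
    (among observed + simulated cells) that are simulated doublets."""
    n_obs = observed.shape[0]
    combined = np.vstack([observed, simulated]).astype(float)
    # Library-size normalise and log-transform so that distances reflect
    # expression profiles rather than total counts.
    totals = combined.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    combined = np.log1p(combined / totals * 1e4)
    is_simulated = np.arange(combined.shape[0]) >= n_obs

    obs = combined[:n_obs]
    # Squared Euclidean distance from every observed cell to every cell.
    d2 = (
        (obs ** 2).sum(axis=1)[:, None]
        - 2.0 * obs @ combined.T
        + (combined ** 2).sum(axis=1)[None, :]
    )
    # A cell must not count as its own neighbour.
    d2[np.arange(n_obs), np.arange(n_obs)] = np.inf
    nearest = np.argpartition(d2, n_neighbors, axis=1)[:, :n_neighbors]
    return is_simulated[nearest].mean(axis=1)


rng = np.random.default_rng(0)
# Toy data: two cell types with disjoint marker genes (100 cells x 20 genes).
type_a = rng.poisson(lam=[5.0] * 10 + [0.1] * 10, size=(50, 20))
type_b = rng.poisson(lam=[0.1] * 10 + [5.0] * 10, size=(50, 20))
counts_matrix = np.vstack([type_a, type_b])

sim = simulate_doublets(counts_matrix, sim_doublet_ratio=2.0, rng=1)
doublet_scores = knn_doublet_scores(counts_matrix, sim, n_neighbors=5)
```

The dense distance matrix keeps the sketch short but only suits small inputs. Real data needs the library itself.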
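Best practices:
1. When working with data from multiple samples, run Scrublet on each sample separately. Scrublet is designed to detect technical doublets formed by the random co-encapsulation of two cells, so it may perform poorly on a merged dataset whose cell-type proportions are not representative of any single sample.
2. Check that the doublet score threshold is reasonable (ideally, as in this example, it separates the two peaks of the simulated-doublet score histogram) and adjust it manually if necessary. An example is shown later in this post.
3. Visualize the doublet predictions in a 2-D embedding (e.g. UMAP or t-SNE). Predicted doublets should mostly co-localize (possibly in several clusters). If they do not, you may need to adjust the doublet score threshold, or change the preprocessing parameters to better resolve the cell states present in the data.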

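Next, let's see how to use it.
Step 1: import the required modules

Step 2: read in the count matrix (in the format described above) and estimate the doublet rate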

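This step includes:
Initialize Scrublet object
The relevant parameters are:
expected_doublet_rate: the expected fraction of transcriptomes that are doublets, typically 0.05-0.1. Results are not particularly sensitive to this parameter.
sim_doublet_ratio: the number of doublets to simulate, relative to the number of observed transcriptomes. It should be high enough that every doublet state is well represented by simulated doublets. Setting it too high is computationally expensive. The default is 2, although values as low as 0.5 give very similar results on the datasets that have been tested.
n_neighbors: the number of neighbors used to build the KNN classifier of observed transcriptomes and simulated doublets. The default, round(0.5 * sqrt(n_cells)), generally works well.
Run the default pipeline, which includes:
Doublet simulation
Normalization, gene filtering, rescaling, PCA
Doublet score calculation
Doublet score threshold detection and doublet calling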
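The steps above can be sketched roughly as follows. This is a hedged example rather than the original post's code: the wrapper names default_n_neighbors and run_scrublet, the matrix_path argument, and the choice of 0.06 for expected_doublet_rate are assumptions made for illustration. The Scrublet calls themselves (scr.Scrublet and scrub_doublets) are the ones the text describes.

```python
import math


def default_n_neighbors(n_cells):
    """Default neighbour count described above: round(0.5 * sqrt(n_cells))."""
    return int(round(0.5 * math.sqrt(n_cells)))


def run_scrublet(matrix_path, expected_doublet_rate=0.06, sim_doublet_ratio=2.0):
    """Load a genes x cells Matrix Market file and run Scrublet's default pipeline."""
    # Imported here so the heavy dependencies are only needed when this runs.
    import scipy.io
    import scrublet as scr

    # 10x-style .mtx files store genes as rows, so transpose to cells x genes.
    counts_matrix = scipy.io.mmread(matrix_path).T.tocsc()

    # Initialize the Scrublet object with the parameters discussed above.
    scrub = scr.Scrublet(
        counts_matrix,
        expected_doublet_rate=expected_doublet_rate,
        sim_doublet_ratio=sim_doublet_ratio,
        n_neighbors=default_n_neighbors(counts_matrix.shape[0]),
    )

    # Default pipeline: simulate doublets, preprocess, score, call doublets.
    doublet_scores, predicted_doublets = scrub.scrub_doublets()
    return scrub, doublet_scores, predicted_doublets
```

Passing n_neighbors explicitly is redundant with the library default and is done here only to show where the formula enters. A call would look like run_scrublet("matrix.mtx"), where the file name is a placeholder.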

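Plot the doublet score histograms for the observed transcriptomes and the simulated doublets
The simulated-doublet histogram is typically bimodal. The left mode corresponds to "embedded" doublets, formed by two cells with similar gene expression. The right mode corresponds to "neotypic" doublets, formed by cells with distinct gene expression. Scrublet can only detect neotypic doublets, which is also true of the R package DoubletFinder.
To separate singlets from doublets we must set a doublet score threshold, ideally at the minimum between the two modes of the simulated-doublet histogram. scrub_doublets() tries to find this point automatically, and does so well in this example. If automatic threshold detection does not work well, you can adjust the threshold with the call_doublets() function. For example:
scrub.call_doublets(threshold=0.25)
Next, we plot the histogram of the doublet score distribution: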
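In Scrublet the plot comes from the Scrublet object's plot_histogram() method. To show what "the minimum between the two modes" means numerically, here is a small library-free sketch in NumPy. The helper valley_threshold and the toy scores are invented for illustration, it assumes a clearly bimodal input, and it is not the thresholding routine Scrublet uses internally.

```python
import numpy as np


def valley_threshold(sim_scores, n_bins=50):
    """Return the score at the emptiest histogram bins between the two modes.

    Simplified illustration that assumes a clearly bimodal distribution.
    """
    counts, edges = np.histogram(sim_scores, bins=n_bins, range=(0.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    first = int(np.argmax(counts))
    # Second mode: a tall bin that is also far away from the first mode.
    distance = np.arange(n_bins) - first
    second = int(np.argmax(counts * distance ** 2))
    lo, hi = sorted((first, second))
    segment = counts[lo : hi + 1]
    # Take the middle of the lowest bins so the cut sits inside the gap.
    lowest = np.flatnonzero(segment == segment.min())
    valley = lo + int(lowest[len(lowest) // 2])
    return float(centers[valley])


rng = np.random.default_rng(0)
# Toy bimodal scores standing in for simulated-doublet scores.
sim_scores = np.clip(
    np.concatenate([rng.normal(0.1, 0.03, 500), rng.normal(0.7, 0.05, 500)]),
    0.0,
    1.0,
)
# Toy observed scores, mostly singlet-like.
obs_scores = np.clip(rng.normal(0.1, 0.03, 1000), 0.0, 1.0)

threshold = valley_threshold(sim_scores)
# Same rule as a manual call: a cell is a doublet if its score exceeds the cut.
predicted_doublets = obs_scores > threshold
```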

Compute a 2-D embedding to visualize the results (t-SNE works the same way)
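Scrublet's example notebook builds and draws the embedding with scr.get_umap, scrub.set_embedding and scrub.plot_embedding. Once 2-D coordinates exist, the co-localization mentioned in the best practices can also be checked numerically instead of only by eye. The sketch below assumes you already have an (n_cells, 2) coordinate array and the boolean predictions. The helper doublet_neighbor_fraction is hypothetical, not part of Scrublet, and the data is synthetic.

```python
import numpy as np


def doublet_neighbor_fraction(embedding, predicted_doublets, k=10):
    """For each predicted doublet, return the fraction of its k nearest
    neighbours in the embedding that are also predicted doublets."""
    embedding = np.asarray(embedding, dtype=float)
    predicted_doublets = np.asarray(predicted_doublets, dtype=bool)
    # Dense pairwise squared distances: fine for a sketch, O(n^2) memory.
    diffs = embedding[:, None, :] - embedding[None, :, :]
    d2 = (diffs ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    fraction = predicted_doublets[nearest].mean(axis=1)
    return fraction[predicted_doublets]


rng = np.random.default_rng(0)
# Toy embedding: 80 singlets around the origin, 20 doublets in a far cluster.
singlets = rng.normal(0.0, 1.0, size=(80, 2))
doublets = rng.normal(10.0, 0.5, size=(20, 2))
embedding = np.vstack([singlets, doublets])
predicted_doublets = np.array([False] * 80 + [True] * 20)

fractions = doublet_neighbor_fraction(embedding, predicted_doublets, k=5)
```

Values close to 1 mean the predicted doublets sit together. Many low values suggest the threshold or the preprocessing needs another look.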

10X single-cell (10X spatial transcriptomics): multi-sample batch effect removal with RCA2


Data processing can also be carried out with Seurat. Here is an example of how you can combine an RCA analysis with data preprocessed in Seurat.

Using the same 10x data as before, we generate a Seurat object and perform an initial analysis:

To run RCA, no further processing steps are needed. However, we also want to compare the RCA result to the Seurat-based clustering, so we first continue with a Seurat-based analysis:

Based on the elbow plot (not shown here), we use 20 PCs for further analysis.

We generate a UMAP of the data stored in the Seurat object using the umap R package:

We use the RCA function createRCAObject to generate an RCA object from the raw and, optionally, the normalized data stored in our Seurat object.

Next, we can compute the projection, cluster the data, and estimate the most likely cell type for each cell as above:

Using the RCA cell type labels, RCA and Seurat clusters, we generate two new UMAPs whose coordinates are based on the PCs derived from HVGs and that are colored according to RCA clusters and cell type labels.

The RCA clusters show a high concordance to the Seurat clusters shown in the previous UMAP.

For greater convenience the results of RCA can be saved within the Seurat object for further analysis.

Also, a UMAP reduction based on the projection space can be added to the Seurat object:

RNA velocity describes the rate of gene expression change for an individual gene at a given time point based on the ratio of its spliced and unspliced messenger RNA (mRNA). Here, we describe how one can use the scvelo package, in Python, to visualize RNA velocity on the RCA generated result.

To transfer spliced RNA counts to scvelo, first transpose the raw RCA data matrix to get a cells x genes matrix, and export it to a CSV file.

In addition, export the RCA projection and UMAP embeddings to respective CSV files too.

Create an iPython notebook in the same folder and import the required packages as below.

Then, create a Scanpy object using the raw counts from the CSV file.

Populate the PCA slot in the Scanpy object with the projection data from RCA.

Populate the UMAP slot in the Scanpy object with the UMAP coordinates from RCA.

Load the unspliced loom object generated by velocyto.

Then, merge the spliced and unspliced objects together as described below:

As recommended by the scvelo tutorial, perform the following steps to compute RNA velocity:
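Put together, the Python steps from creating the Scanpy object through the velocity computation might look like the sketch below. It is an assumption-laden outline, not the original vignette's code: the function name compute_rna_velocity and its four path arguments are placeholders, the CSV files are assumed to hold one row per cell in the same order, and the preprocessing parameter values are common scvelo tutorial choices that you may need to change.

```python
def compute_rna_velocity(counts_csv, projection_csv, umap_csv, loom_path):
    """Rebuild the RCA result as an AnnData object and run scvelo on it."""
    # Imported here so the heavy dependencies are only needed when this runs.
    import pandas as pd
    import scanpy as sc
    import scvelo as scv

    # Cells x genes raw counts exported from RCA (first column = cell names).
    adata = sc.read_csv(counts_csv)

    # RCA projection and UMAP coordinates, assumed to have one row per cell
    # in the same order as the count matrix.
    projection = pd.read_csv(projection_csv, index_col=0)
    umap = pd.read_csv(umap_csv, index_col=0)
    adata.obsm["X_pca"] = projection.to_numpy()
    adata.obsm["X_umap"] = umap.to_numpy()

    # Loom file produced by velocyto.
    ldata = scv.read(loom_path, cache=True)

    # Keep only the cells and genes that are present in both objects.
    merged = scv.utils.merge(adata, ldata)

    # Standard scvelo steps; the parameter values are assumptions.
    scv.pp.filter_and_normalize(merged, min_shared_counts=20, n_top_genes=2000)
    scv.pp.moments(merged, n_pcs=30, n_neighbors=30)
    scv.tl.velocity(merged, mode="stochastic")
    scv.tl.velocity_graph(merged)
    return merged
```

With the returned object, scv.pl.velocity_embedding_stream(merged, basis="umap") draws the velocity streamlines on the RCA UMAP.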

It is possible that not all barcodes had sufficient quality of both spliced and unspliced reads, and thus some cells may have been discarded during the merging process. To ensure your cell type labels are still maintained, export the merged data observations from the merged scvelo object to a CSV file.

In R, load this CSV file, extract the RCA labels, and keep only the labels of cells that scvelo retained in the merged data.

Note: If your cell names have underscores in them, scanpy will automatically split the cell name into barcode and sample_batch.

In this case, replace the last line of the above block of code with the following:

Now export these cluster labels to a CSV file.

Back in the scvelo notebook, load this RCA cluster annotation table and set it as the observation slot of your merged data.
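When attaching the labels, it helps to align them by cell name rather than by row position, since merging may have dropped or reordered cells. The pandas sketch below shows one way to do that. The helper align_labels, the column name rca_cluster, and the toy barcodes are assumptions for illustration, and it adds the labels as a column instead of replacing the whole observation table.

```python
import pandas as pd


def align_labels(obs, labels, label_column="rca_cluster"):
    """Return cluster labels reordered to match the rows of `obs`.

    Raises if any cell kept after merging has no label.
    """
    aligned = labels[label_column].reindex(obs.index)
    missing = aligned.index[aligned.isna()]
    if len(missing) > 0:
        raise KeyError(f"{len(missing)} cells have no label, e.g. {missing[0]}")
    return aligned


# Toy stand-ins for merged.obs and the annotation table exported from R.
obs = pd.DataFrame(index=["AAAC-1", "AAAG-1"])
labels = pd.DataFrame(
    {"rca_cluster": ["T cell", "B cell", "Monocyte"]},
    index=["AAAG-1", "AAAC-1", "TTTG-1"],
)
obs["rca_cluster"] = align_labels(obs, labels)
```

With a real object the last line becomes merged.obs["rca_cluster"] = align_labels(merged.obs, labels), where labels is read with pd.read_csv(..., index_col=0).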
