2019-05-21-多基因联合建树软件astral方法

Posted 2023-05-05

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了2019-05-21-多基因联合建树软件astral方法相关的知识，希望对你有一定的参考价值。

参考技术A astral是基于java开发的根据一组无根的genetrees建立speciestree。

运行astral不需要安装，但是需要在java环境下运行。

astral没有图形界面，需要在命令行下运行。

运行后可以看到astral的options。如果运行没有错误，说明安装成功了。

-o 输出

输入文件是含有所有genetrees的Newick格式的文件。输入的genetree被当做unrooted tree,不管他们是否有根。astral的输出也是被当做unrooted tree。输入的genetree支持多分支。

输出的结果是Newick格式，可以用很多软件查看。

astral测量branch length 是用coalscent units。不是我们通常认为的boostrap value

-q参数
得到的是quartet score 和branch length 和 branch support values。0.9表示genetree产生的quartet tree的90%存在于species tree中。为树打分的命令如下：

与simulated_14taxon.gene.tre进行比较后，会为物种树simulated_14taxon.default.tre打分。

表示来自genetree的4803个quartet trees存在于species trees中。4803个quartet trees占所有的quartet trees的47.98%。这个数据集的ILS水平很高。导致这个结果，也就是genetree和species tree的不一致性很高。

当你得到一个species tree或者用-q参数为树打分，你将会得到每一个分支的branch length 和local posterior support 。除了这些默认的参数，还可以输出其他branch 信息。无根树的每一个branch都有四组。分别是first child (L), the second child (R), the sister group (S), and everything else (O)。两两配对，可以得到三种拓扑结构。其中一种就是当前树的拓扑结构。剩下的就是可选的两种拓扑结构。astral可以不仅仅得到当前树的local posterior probability，也能计算剩下的两种拓扑结构的。-t 参数

命令如下

阅读几个分支给出的所有值，并理解他们。

用Yule prior model 计算speciestree 的branch length的local posterior probabilities 和branch lengths。Yule process的物种形成速率（in coalscent units）默认值设置为0.5，导致quartet 频率在[1/3,1]之间是平稳的。（并不理解）用-c的选项可以调节hyper-parameter。

astral 可以不通过bootstrapping输出branch support value.这种support比bootstrapping更加可靠（在作者的数据下）。尽管，你可能还是想得到bootstrapping。astral可以进行multi-locus bootstrapping。为了开展multi-locus bootstrapping，astral需要访问每一个gene的boostrap replicate trees。
例如：
你需要提供所有gene tree bootstrap replicates的位置。在测试数据中进行bootstrapping。
1.进入test_data目录
2.解压called song_mammals.424genes.bs-trees.zip.
3.然后运行

然后会run100次bootstrapping。

1.-i 包括所有的MLgenetrees（就像不计算bootstrap也要输入的）
2.-b 告诉astral 需要计算bootstrap value。-b 后面的文件 bs-files 包含了genetree bootstrap files的文件路径，一行一个gene。例如：
424genes/100/raxmlboot.gtrgamma/RAxML_bootstrap.allbs

1.100 bootstrapped replicate trees，每一个都是对一组bootstrap gene trees进行running astral 的结果。
2.A greedy consensus of the 100 bootstrapped replicate trees; this tree has support values drawn on branches based on the bootstrap replicate trees. Support values show the percentage of bootstrap replicates that contain a branch.
3.The “main” ASTRAL tree; this is the results of running ASTRAL on the best_ml input gene trees. This main tree also includes support values, which are again drawn based on the 100 bootstrap replicate trees.（不懂）
注意：support value以百分数的形式展示。而local posterior probabilities是0-1之间的数。当astral 计算bootstrapping时，它会持续输出每一个重复的bootstrapped astral tree.因此，如果replicate 被输入成100，它将会输出100个数，然后，输出100 bootstrapped trees 的greedy consensus。（不懂）最后，它会开展主要的分析（-i参数的文件）然后计算主要树的branch support。这个示例中就是102trees。

默认值是100，-r 参数可以设置任何数量的重复。但是要保证你的genetree的bootstrap file 的bootstrap replicates 要多于你的-r参数后面的设置。

astral 开展site-only的resampling，可以用-g参数。

这时候我们需要更多的genetree replicates。如果是-g -r 100，对于某些gene那可能需要150 replicates。因为在genes resampled的时候，一些gene抽到的概率会比其他的gene更多。

astral展开gene-only bootstrapping 用--gene-only的option。这个只要one inputfile。用-i 参数就可以了，对于这个就不要使用-b参数。

由于引导涉及一个随机的过程，我们可以提供一个seed number给astral 保证重复性。seed number 可以有-s进行设置。默认的参数是692.

astral 有exact 和heuristic 的version。当taxa的数目较少的时候，exact version 会节约时间。但是分类不能超过37个。

-x参数就是开启exact version。大约30秒。同样的，我们可以使用默认的heuristic启发式搜索法

这就只有1秒，那么他们的运行结果有何不同呢？其实是一致的

The default primate dataset we used in the previous step had 424 genes and 14 taxa. Since we have a relatively large number of gene trees, we could reasonably expect the exact and heuristic versions to generate identical output. The key point here is that as the number of genes increases, the probability that each bipartition of the species tree appears in at least one input gene tree increases. Thus, with 424 genes all bipartitions from the species tree are in at least one input gene tree, and therefore, the exact and the heuristic versions are identical.

We tried hard to find a subset of genes in the biological primates dataset where the exact and the heuristic versions did not match. We couldn't! So we had to resort to simulations. We simulated a 14-taxon dataset with extreme levels of ILS (average 87% RF between gene trees and the species tree). Now, with this simulated dataset, if you take only 10 genes, something interesting happens.
运行：

这时得分会有一点不同，topology也会不同。因此，在极端的情况下（ILS水平较高，genetree错误较多或者较分类来说可用的genetrees较少比如14类群只有10个gene，较之前的424gene就是较少）。那么就可以观察到两种算法的差异。

为了expand search space ，运行：

这里的-e参数用于输入一组extra trees 用于扩展astral的搜索空间。这个文件为10个simulated genes提供了200 bootstrap replicates 。-f 用于当input tree 有species labels代替gene label 的时候。

大数据集（>500taxa）增加memory available to java。
run

-m: 移除含有少于指定叶子数量的gene。对于需要一定分类级别的taxon occupancy 是有用的。后面设置数量。
-k completed : To build the set X (and not to score the species tree), ASTRAL internally completes the gene trees. To see these completed gene trees, run this option. This option is usable only when you also have -o（不懂）
-k bootstrapped 和-k bootstraps_norun:these options output the bootstrap replicate inputs to ASTRAL. These are useful if you want to run ASTRAL separately on each bootstrap replicate on a cluster.
-k searchspace_norun:输出search space然后退出。
----polylimit：
--samplingrounds：For multi-individual datasets, this option controls how many rounds of individual sampling is used in building the constraint set. Adjust to reduce/increase the search space for multi-individual datasets
文章参考：[ https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#running-on-a-multi-individual-datasets]

借助NMF的力量对单细胞RNA和单细胞ATAC进行联合分析

参考技术A

联合分析 scRNA-seq 和 scATAC-seq 的流程类似于整合多个 scRNA-seq 数据集的流程，都依赖于联合矩阵分解和分位数归一化。主要区别在于： （1）scATAC-seq数据需要处理成基因级别的值； (2) 仅对 scRNA-seq 数据进行基因选择； (3) 下游分析可以使用基因水平和基因间信息 。

为了联合分析 scRNA 和 scATAC-seq 数据， 首先需要将 scATAC-seq 数据（一种全基因组表观基因组测量）转换为与来自 scRNA-seq 的基因表达数据相当的基因水平计数 。大多数以前的单细胞研究都使用了一种受传统bulk ATAC-seq 分析启发的方法：识别染色质可及峰，然后将与每个基因重叠的所有峰相加。这种策略也很有吸引力，因为 10X CellRanger pipeline是一种常用的商业软件包，可以自动输出这样的峰值计数。然而，分析发现这种策略不太理想，因为：（1）使用所有细胞执行峰值调用，这会偏向稀有细胞群； (2) 基因可及性通常比特定调控元件更分散，因此可能会被峰值调用算法遗漏； (3) 峰值之外的读数信息被丢弃，进一步减少了已经稀疏的测量中的数据量。发现最简单的策略似乎效果很好，而不是求和峰值计数：计算每个细胞中每个基因的基因体和启动子区域（通常为上游 3 kb）内的 ATAC-seq 读数的总数。

必须首先使用 sort 命令行实用程序按染色体、开始和结束位置对 Fragments.tsv 进行排序。 -k 选项让用户在某个列上对文件进行排序；包括多个 -k 选项允许同时按多列排序。 -k 后面的 n 代表“数字排序”。这里排序的 .bed 文件顺序首先由字典染色体顺序定义（使用参数 -k1,1），然后是升序整数起始坐标顺序（使用参数 -k2,2n），最后是升序整数结束坐标顺序（使用参数-k3,3n）。

Important flags of bedmap command are as follows:

We can then use LIGER’s makeFeatureMatrix function to calculate accessibility counts for gene body and promoter individually. This function takes the output from bedmap and efficiently counts the number of fragments overlapping each gene and promoter. We could count the genes and promoters in a single step, but choose to calculate them separately in case it is necessary to look at gene or promoter accessibility individually in downstream analyses.

Next, these two count matrices need to be re-sorted by gene symbol. We then add the matrices together, yielding a single matrix of gene accessibility counts in each cell.

5. Once the gene-level scATAC-seq counts are generated, the read10X function from LIGER can be used to read scRNA-seq count matrices output by CellRanger. You can pass in a directory (or a list of directories) containing raw outputs (for example, “/Sample_1/outs/filtered_feature_bc_matrix”) to the parameter sample.dirs. Next, a vector of names to use for the sample (or samples, corresponding to sample.dirs) should be passed to parameter sample.names as well. LIGER can also use data from any other protocol, as long as it is provided in a genes x cells R matrix format.

6. We can now create a LIGER object with the createLiger function. We also remove unneeded variables to conserve memory.

7. Preprocessing steps are needed before running iNMF. Each dataset is normalized to account for differences in total gene-level counts across cells using the normalize function. Next, highly variable genes from each dataset are identified and combined for use in downstream analysis. Note that by setting the parameter datasets.use to 2, genes will be selected only from the scRNA-seq dataset (the second dataset) by the selectGenes function. We recommend not using the ATAC-seq data for variable gene selection because the statistical properties of the ATAC-seq data are very different from scRNA-seq, violating the assumptions made by the statistical model we developed for selecting genes from RNA data. Finally, the scaleNotCenter function scales normalized datasets without centering by the mean, giving the nonnegative input data required by iNMF.

8. We next perform joint matrix factorization (iNMF) on the normalized and scaled RNA and ATAC data. This step calculates metagenes–sets of co-expressed genes that distinguish cell populations–containing both shared and dataset-specific signals. The cells are then represented in terms of the “expression level” of each metagene, providing a low-dimensional representation that can be used for joint clustering and visualization. To run iNMF on the scaled datasets, we use the optimizeALS function with proper hyperparameter settings.

To run iNMF on the scaled datasets, use optimizeALS function with proper hyperparameters setting:

Important parameters are as follows:

9. Using the metagene factors calculated by iNMF, we then assign each cell to the factor on which it has the highest loading, giving joint clusters that correspond across datasets. We then perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. To perform this analysis, typing in:

Important parameters of quantile_norm are as follows:

10. The quantile_norm function gives joint clusters that correspond across datasets, which are often completely satisfactory and sufficient for downstream analyses. However, if desired, after quantile normalization, users can additionally run the Louvain algorithm for community detection, which is widely used in single-cell analysis and excels at merging small clusters into broad cell classes. This can be achieved by running the louvainCluster function. Several tuning parameters, including resolution, k, and prune control the number of clusters produced by this function. For this dataset, we use a resolution of 0.2, which yields 16 clusters (see below).

11. In order to visualize the clustering results, the user can use two dimensionality reduction methods supported by LIGER: t-SNE and UMAP. We find that often for datasets containing continuous variation such as cell differentiation, UMAP better preserves global relationships, whereas t-SNE works well for displaying discrete cell types, such as those in the brain. The UMAP algorithm (called by the runUMAP function) scales readily to large datasets. The runTSNE function also includes an option to use FFtSNE, a highly scalable implementation of t-SNE that can efficiently process huge datasets. For the BMMC dataset, we expect to see continuous lineage transitions among the differentiating cells, so we use UMAP to visualize the data in two dimensions:

12. We can then visualize each cell, colored by cluster or dataset.

13. LIGER employs the Wilcoxon rank-sum test to identify marker genes that are differentially expressed in each cell type using the following settings. We provide parameters that allow the user to select which datasets to use (data.use) and whether to compare across clusters or across datasets within each cluster (compare.method). To identify marker genes for each cluster combining scATAC and scRNA profiles, typing in:

Important parameters of runWilcoxon are as follows:

14. The number of marker genes identified by runWilcoxon varies and depends on the datasets used. The function outputs a data frame that the user can then filter to select markers which are statistically and biologically significant. For example, one strategy is to filter the output by taking markers which have padj (Benjamini-Hochberg adjusted p-value) less than 0.05 and logFC (log fold change between observations in group versus out) larger than 3:

You can then re-sort the markers by its padj value in ascending order and choose the top 100 for each cell type. For example, we can subset and re-sort the output for Cluster 1 and take the top 20 markers by typing these commands:

15. We also provide functions to check these markers by visualizing their expression and gene loadings across datasets. You can use the plotGene to visualize the expression or accessibility of a marker gene, which is helpful for visually confirming putative marker genes or investigating the distribution of known markers across the cell sequenced. Such plots can also confirm that divergent datasets are properly aligned.

For instance, we can plot S100A9, which the Wilcoxon test identified as a marker for Cluster 1, and MS4A1, a marker for Cluster 4:

These plots indicate that S100A9 and MS4A1 are indeed specific markers for Cluster 1 and Cluster 4, respectively, with high values in these cell groups and low values elsewhere. Furthermore, we can see that the distributions are strikingly similar between the RNA and ATAC datasets, indicating that LIGER has properly aligned the two data types.

16. A key advantage of using iNMF instead of other dimensionality reduction approaches such as PCA is that the dimensions are individually interpretable. For example, a particular cell type is often captured by a single dimension of the space. Furthermore, iNMF identifies both shared and dataset-specific features along each dimension, giving insight into exactly how corresponding cells across datasets are both similar and different. The function plotGeneLoadings allows visual exploration of such information. It is recommended to call this function into a PDF file due to the large number of plots produced.

Alternatively, the function can return a list of plots. For example, we can visualize the factor loading of Factor 7 typing in:

These plots confirm that the expression and accessibility of these genes show clear differences. CCR6 shows nearly ubiquitous chromatin accessibility but is expressed only in clusters 2 and 4. The accessibility is highest in these clusters, but the ubiquitous accessibility suggests that the expression of CCR6 is somewhat decoupled from its accessibility, likely regulated by other factors. Conversely, NCF1 shows high expression in clusters 1, 3, 4, 9 and 11, despite no clear enrichment in chromatin accessibility within these clusters. This may again indicate decoupling between the expression and chromatin accessibility of NCF1. Another possibility is that the difference is due to technical effects–the gene body of NCF1 is short (~15KB), and short genes are more difficult to capture in scATAC-seq than in scRNA-seq because there are few sites for the ATAC-seq transposon to insert.

17. Single-cell measurements of chromatin accessibility and gene expression provide an unprecedented opportunity to investigate epigenetic regulation of gene expression. Ideally, such investigation would leverage paired ATAC-seq and RNA-seq from the same cells, but such simultaneous measurements are not generally available. However, using LIGER, it is possible to computationally infer “pseudo-multi-omic” profiles by linking scRNA-seq profiles–using the jointly inferred iNMF factors–to the most similar scATAC-seq profiles. After this imputation step, we can perform downstream analyses as if we had true single-cell multi-omic profiles. For example, we can identify putative enhancers by correlating the expression of a gene with the accessibility of neighboring intergenic peaks across the whole set of single cells.

Again, for convenience, we have prepared the pre-processed peak-level count data which is ready to use. The data can be downloaded here .

You can also follow the following tutorial to start from the beginning.

To achieve this, we first need a matrix of accessibility counts within intergenic peaks. The CellRanger pipeline for scATAC-seq outputs such a matrix by default, so we will use this as our starting point. The count matrix, peak genomic coordinates, and source cell barcodes output by CellRanger are stored in a folder named filtered_peak_matrix in the output directory. The user can load these and convert them into a peak-level count matrix by typing these commands:

18. The peak-level count matrix is usually large, containing hundreds of thousands of peaks. We next filter this set of peaks to identify those showing cell-type-specific accessibility. To do this, we perform the Wilcoxon rank-sum test and pick those peaks which are differentially accessible within a specific cluster. Before running the test, however, we need to: (1) subset the peak-level count matrix to include the same cells as the gene-level counts matrix; (2) replace the original gene-level counts matrix in the LIGER object by peak-level counts matrix; and (3) normalize peak counts to sum to 1 within each cell.

Now we can perform the Wilcoxon test:

19. We can now use the results of the Wilcoxon test to retain only peaks showing differential accessibility across our set of joint clusters. Here we kept peaks with Benjamini-Hochberg adjusted p-value < 0.05 and log fold change > 2.

20. Using this set of differentially accessible peaks, we now impute a set of “pseudo-multi-omic” profiles by inferring the intergenic peak accessibility for scRNA-seq profiles based on their nearest neighbors in the joint LIGER space. LIGER provides a function named imputeKNN that performs this task, yielding a set of profiles containing both gene expression and chromatin accessibility measurements for the same single cells:

Important parameters of imputeKNN are as follows:

21. Now that we have both the (imputed) peak-level counts matrix and the (observed) gene expression counts matrix for the same cells, we can evaluate the relationships between pairs of genes and peaks, linking genes to putative regulatory elements. We use a simple strategy to identify such gene-peak links: Calculate correlation between gene expression and peak accessibility of all peaks within 500 KB of a gene, then retain all peaks showing statistically significant correlation with the gene. The linkGenesAndPeaks function performs this analysis:

Important parameters of linkGenesAndPeaks are as follows:

22. The output of this function is a sparse matrix with peak names as rows and gene symbols as columns, with each element indicating the correlation between peak i and gene j (or 0 if the gene and peak are not significantly linked). For example, we can subset the results for marker gene S100A9, which is a marker gene of Cluster 1 identified in the previous section, and rank these peaks by their correlation:

We also provide a function to transform the peaks-gene correlation matrix into an Interact Track supported by UCSC Genome Browser for visualizing the calculated linkage between genes and correlated peaks. To do this, tying in:

Important parameters of makeInteractTrack are as follows:

The output of this function will be a UCSC Interact Track file named ‘Interact_Track.bed’ containing linkage information of the specified genes and correlated peaks stored in given directory. The user can then upload this file as a custom track using this page https://genome.ucsc.edu/cgi-bin/hgCustom and display it in the UCSC Genome browser.

As an example, the three peaks most correlated to S100A9 expression are shown below in the UCSC genome browser. One of the peaks overlaps with the TSS of S100A8, a neighboring gene that is co-expressed with S100A9, while another peak overlaps with the TSS of S100A9 itself. The last peak, chr1:153358896-153359396, does not overlap with a gene body and shows strong H3K27 acetylation across ENCODE cell lines, indicating that this is likely an intergenic regulatory element.

If we plot the accessibility of this peak and the expression of S100A9, we can see that the two are indeed very correlated and show strong enrichment in clusters 1 and 3. Thus, the intergenic peak likely serves as a cell-type-specific regulator of S100A9.

生活很好，有你更好

以上是关于2019-05-21-多基因联合建树软件astral方法的主要内容，如果未能解决你的问题，请参考以下文章