r 环境未知与MAGs

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了r 环境未知与MAGs相关的知识,希望对你有一定的参考价值。

# Environmnetal unknowns in Tom's MAGs

We will screen Tom's MAGs for environmental unknowns. 

## Preparing TOM's data

```bash
mkdir /scratch/eus_vs_mags
cd eus_vs_mags
wget http://files.metagenomics.eu/TARA_non_redundant_MAGs_AA.fa.gz
wget http://files.metagenomics.eu/TARA_MAGs_v3_metadata.txt
wget http://files.metagenomics.eu/orf2mag.tsv.gz
wget http://files.metagenomics.eu/tara_delmont_taxids.csv
```

## Preparing profiles for MMSEQS2

We need the ffindex files with the HHM profiles, and convert them to MMSEQS2 format. First we copy the HHM profiles to **scratch** and we convert them:

```bash
cd /scratch/eus_vs_mags

cp /bioinf/projects/megx/UNKNOWNS/2017_11/classification/results/ffindex_data/marine_hmp_db_03112017_eu_hhm.ffdata marine_hmp_db_03112017_eu_hhm

cp /bioinf/projects/megx/UNKNOWNS/2017_11/classification/results/ffindex_data/marine_hmp_db_03112017_eu_hhm.ffindex marine_hmp_db_03112017_eu_hhm.index

~/opt/MMseqs2_uv/build/bin/mmseqs convertprofiledb marine_hmp_db_03112017_eu_hhm marine_hmp_db_03112017_eu_hhm_profileDB --threads 16
```

## Searching the MAGs

We will create a DB for MMSEQS2 using the AA sequences from Tom and do a profile search:

```bash
mmseqs createdb TARA_non_redundant_MAGs_AA.fa.gz magsDB
mmseqs search magsDB marine_hmp_db_03112017_eu_hhm_profileDB mag_results tmp -c 0.8 --cov-mode 0 -e 1e-5 --threads 64
mmseqs convertalis magsDB marine_hmp_db_03112017_eu_hhm_profileDB mag_results mag_results.m8 --format-mode 2
```
## Parse and plot the results

For plotting use the script [here](https://gist.github.com/genomewalker/5b314838d226572c2fcb8ff9cd861c2d#file-eu_mags-r)
library(tidyverse)
# Let's read the files
m_vs_u <- read_tsv("mags_eu_results.m8", col_names = FALSE, progress = TRUE) %>%
  select(X1, X2, X3, X11) %>%
  rename(gene_callers_id = X1, unk_id = X2, id = X3, evalue = X11) %>%
  mutate(gene_callers_id = as.character(gene_callers_id),  unk_id = as.character(unk_id)) %>%
  filter(evalue <= 1e-25)
m_genes <- read_tsv("orf2mag.tsv.gz", col_names = TRUE, progress = TRUE) %>%
  select(gene_callers_id, contig) %>%
  mutate(gene_callers_id = as.character(gene_callers_id))
mag_cdata <- read_tsv("TARA_MAGs_v3_metadata.txt", col_names = TRUE)
mag_tax <- read_csv("tara_delmont_taxids.csv", col_names = FALSE) %>%
  separate(X3,into = c("domain","phylum","class", "order", "family", "genus"), sep = ";", fill = "right", extra = "drop") %>%
  rename(MAG=X1)

# Some data massaging
# Combine tables and clean MAG names
m_vs_u_mag <- m_vs_u %>%
  left_join(m_genes) %>%
  tidyr::extract(contig, into = paste("V", 1:4, sep = ""), regex = "([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)") %>%
  tidyr::unite(col = MAG, V1,V2,V3,V4, sep = "_", remove = TRUE) 

mag_n_orfs <- m_genes %>%
  tidyr::extract(contig, into = paste("V", 1:4, sep = ""), regex = "([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)") %>%
  unite(col = MAG, V1,V2,V3,V4, sep = "_", remove = TRUE) %>%
  group_by(MAG) %>%
  count() %>%
  rename(n_orf = n)

# Count proportion of ORFs in each MAG
m_vs_u_mag_n <- m_vs_u_mag %>%
  select(MAG, unk_id) %>%
  unique() %>%
  group_by(MAG) %>%
  count() %>%
  ungroup() %>%
  #tidyr::complete(uclass, MAG, fill = list(n = 0)) %>%
  left_join(mag_n_orfs) %>%
  mutate(prop = n/n_orf) %>%
  left_join(mag_tax)

# Get the 25 more abundants
m_vs_u_mag_n_25 <- m_vs_u_mag_n %>%
  dplyr::top_n(25) %>%
  .$MAG


m_vs_u_mag_n %>% group_by(phylum) %>% count

# Plot them
ggplot(m_vs_u_mag_n , aes(MAG, n, color = phylum)) +
  geom_bar(stat = "identity") +
  theme_light() +
  xlab("") +
  ylab("# hits") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Calculate histograms
all <- ggplot(m_vs_u_mag_n, aes(prop)) +
  geom_histogram(alpha = 0.5, fill = "#2E3239") +
  theme_light() +
  ylab("Frequency") +
  xlab("Proportion of unknown ORFs") +
  scale_x_continuous(labels = scales::percent)

eupc <- ggplot(m_vs_u_mag_n %>% filter(uclass == "eupc"), aes(prop)) +
  geom_histogram(alpha = 0.5, fill = "#2E3239") +
  theme_light() +
  ylab("Frequency") +
  xlab("Proportion of env. unknowns ORFs") +
  scale_x_continuous(labels = scales::percent)

gupc <- ggplot(m_vs_u_mag_n %>% filter(uclass == "gupc"), aes(prop)) +
  geom_histogram(alpha = 0.5, fill = "#2E3239") +
  theme_light() +
  ylab("Frequency") +
  xlab("Proportion of gen. unknowns ORFs") +
  scale_x_continuous(labels = scales::percent)

# Plot them
ggpubr::ggarrange(all, gupc, eupc, nrow = 1, ncol = 3)

m_vs_u_mag %>%
  group_by(MAG, uclass) %>%
  count %>% View

以上是关于r 环境未知与MAGs的主要内容,如果未能解决你的问题,请参考以下文章

将具有未知列数的数据导入R?

R中随机森林中的二元分类或未知类

Apriori 中 R 中的未知项目标签

R - 修改未知函数

ifelse 按 r 中的列位置 - 列名未知 [重复]

打印机*****所需的驱动程序未知,登陆之前,请与系统管理员联系