---
title: "Single vs multidomain proteins in refseq"
output:
  html_document:
    df_print: paged
editor_options:
  chunk_output_type: console
---

How many single-domain versus multidomain proteins are there in the non-redundant RefSeq database?

Step 1: Remove the first column, which holds the genome name

Step 2: Run the [HMMsearch parsing script](https://github.com/ChiaraVanni/unknown_protein_clusters/blob/master/scripts/pfam_annotation/hmmsearch-parser.sh)

Step 3: Count the number of single-domain versus multidomain proteins.


# Step 1: Remove first column with genome name
```{bash}
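# Work in the directory that holds the Pfam hmmsearch results.
cd /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/annotation

# Turn the first "|" on each line into a tab so the genome name becomes its own
# field, then blank that field. awk rebuilds the line with single spaces.
sed -e 's/|/\t/' refseq_pfam_hmm_res.tsv | awk '{$1=""}1' > refseq_pfam_hmm_res_ed.tsv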
```
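As a sketch of what Step 1 does, the toy example below strips the genome name with a single awk call. The sample line and the function name `strip_genome` are made up for illustration, and the sketch assumes the genome name is joined to the protein ID by the first `|` on the line. Unlike the sed/awk pipeline, this variant leaves the remaining tabs untouched and produces no leading blank.

```bash
# Hypothetical helper: delete everything up to and including the first "|".
strip_genome() {
  awk '{ sub(/^[^|]*\|/, "") } 1'
}

# Made-up input line: genome|protein <TAB> Pfam accession <TAB> e-value
printf 'GCF_000001|WP_000001.1\tPF00001\t1e-10\n' | strip_genome
```

Lines without a `|` pass through unchanged, so it is worth checking a few rows of the real file before relying on this.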

# Step 2: Run [HMMsearch parsing script](https://github.com/ChiaraVanni/unknown_protein_clusters/blob/master/scripts/pfam_annotation/hmmsearch-parser.sh)

```{bash}
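# Arguments: input table, then two numeric cutoffs (presumably e-value and coverage; see the script for their exact meaning).
/scratch/mschecht/unknown_protein_clusters/scripts/pfam_annotation/hmmsearch-parser.sh /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/annotation/refseq_pfam_hmm_res_ed.tsv 1e-5 0.4 > refseq_pfam_hmm_res_ed_parsed.tsv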
```

# Step 3: Count number of single versus multidomain proteins with this [script](https://gist.github.com/mschecht/2c2b6147048895d388618f4ee963631e)
```{bash}
./singlevsmulti.sh refseq_pfam_hmm_res_ed_parsed.tsv
```
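The counting step can be approximated with a few lines of awk. This is a minimal sketch, not the linked script: it assumes the protein ID sits in column 3 of the parsed table (the column the architecture command further down groups on), and the function name `count_single_multi` and the toy rows are hypothetical.

```bash
# Sketch only: classify each protein by how many hit rows it has.
# Assumption: column 3 of the parsed table is the protein ID.
count_single_multi() {
  awk '{ hits[$3]++ }
       END {
         for (p in hits) {
           if (hits[p] == 1) { single++ } else { multi++ }
         }
         printf "single\t%d\nmulti\t%d\n", single, multi
       }' "$1"
}

# Hypothetical toy table: P1 has two hits, P2 has one.
toy=$(mktemp)
printf '%s\n' \
  'PF_A x P1 . . . . 10 50' \
  'PF_B x P1 . . . . 60 90' \
  'PF_C x P2 . . . . 5 40' > "$toy"
count_single_multi "$toy"
rm -f "$toy"
```

Note that this sketch counts rows, so a protein hit twice by the same domain is reported as multidomain. Whether that matches the linked script depends on how it handles repeats.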


No. of single domain proteins = 15622490

No. of multidomain proteins = 6031644

```{r}
# Ratio of multidomain to single-domain proteins (about 0.386)
6031644/15622490
```
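The ratio above compares the two groups with each other. If the share of multidomain proteins among all proteins counted here is more useful, it can be derived from the same two numbers. The snippet is a small sketch; the variable names are illustrative.

```bash
# Counts reported above.
single=15622490
multi=6031644

# Share of multidomain proteins among all counted proteins.
share=$(awk -v s="$single" -v m="$multi" 'BEGIN { printf "%.3f", m / (s + m) }')
echo "multidomain share: $share"
```

With these counts the share comes out at roughly 28%.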

```{bash}
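# Collapse the hits into one domain architecture per protein: "protein domainA|domainB|...".
# Assumes $3 = protein ID, $1 = domain name, $8/$9 = hit coordinates; sort -s keeps the coordinate order.
awk '{print $3,$1,$8,$9}' refseq_pfam_hmm_res_ed_parsed.tsv | sort -k1,1 -k3,3g -k4,4g | awk '{print $1"\t"$2}' | sort -s -k1,1 | awk 'BEGIN { getline; id=$1; l1=$1;l2=$2;} { if ($1 != id) { print l1,l2; l1=$1;l2=$2;} else { l2=l2"|"$2} id=$1; } END { print l1,l2; }' > refseq_pfam_archit.tsv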
```
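To see what the architecture command produces, here is a compact sketch of the same idea on toy data. It assumes the same column layout (domain name in column 1, protein ID in column 3, hit coordinates in columns 8 and 9), uses `-n` instead of `-g` because the toy coordinates are plain integers, and the function name `domain_architecture` is hypothetical.

```bash
# Sketch only: order hits by protein and position, then join domain names with "|".
domain_architecture() {
  sort -k3,3 -k8,8n -k9,9n "$1" |
    awk '{
      if ($3 != id) {
        if (NR > 1) print id "\t" arch   # flush the previous protein
        id = $3; arch = $1
      } else {
        arch = arch "|" $1
      }
    }
    END { if (NR > 0) print id "\t" arch }'
}

# Hypothetical toy table, deliberately out of positional order for P1.
toy=$(mktemp)
printf '%s\n' \
  'PF_B x P1 . . . . 60 90' \
  'PF_A x P1 . . . . 10 50' \
  'PF_C x P2 . . . . 5 40' > "$toy"
domain_architecture "$toy"
rm -f "$toy"
```

Sorting once on the protein ID and the coordinates keeps the domains in positional order, so no second sort is needed before grouping.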


