---
title: "Single vs multidomain proteins in RefSeq"
output:
  html_document:
    df_print: paged
editor_options:
  chunk_output_type: console
---
How many single domain versus multidomain proteins are in the non-redundant RefSeq database?

Step 1: Remove the first column with the genome name

Step 2: Run the [HMMsearch parsing script](https://github.com/ChiaraVanni/unknown_protein_clusters/blob/master/scripts/pfam_annotation/hmmsearch-parser.sh)

Step 3: Count the number of single domain versus multidomain proteins

# Step 1: Remove first column with genome name
```{bash}
cd /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/annotation
sed -e 's/|/\t/' refseq_pfam_hmm_res.tsv | awk '{$1=""}1' > refseq_pfam_hmm_res_ed.tsv
```
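To see what Step 1 does to a single record, here is a minimal sketch on a made-up line. The helper name `strip_genome_prefix` and the sample record are hypothetical, and the `genome|protein` layout is an assumption based on the commands above. The `\t` escape in the replacement relies on GNU sed.

```bash
# Hypothetical helper wrapping the two commands from Step 1.
strip_genome_prefix() {
  # Turn the first "|" into a tab (the "\t" escape assumes GNU sed),
  # then blank the first field; awk rebuilds the line with single spaces.
  sed -e 's/|/\t/' | awk '{$1=""}1'
}

# Made-up record: genome|protein <tab> Pfam accession <tab> e-value
printf 'GCF_000001|WP_0001\tPF00001\t1e-10\n' | strip_genome_prefix
# prints " WP_0001 PF00001 1e-10" (note the leading space)
```

Because awk rebuilds the record after `$1` is blanked, the output is space-separated and starts with a space. Any downstream tool that splits on whitespace is unaffected by this.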
# Step 2: Run [HMMsearch parsing script](https://github.com/ChiaraVanni/unknown_protein_clusters/blob/master/scripts/pfam_annotation/hmmsearch-parser.sh)
```{bash}
/scratch/mschecht/unknown_protein_clusters/scripts/pfam_annotation/hmmsearch-parser.sh /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/annotation/refseq_pfam_hmm_res_ed.tsv 1e-5 0.4 > refseq_pfam_hmm_res_ed_parsed.tsv
```
# Step 3: Count number of single versus multidomain proteins with this [script](https://gist.github.com/mschecht/2c2b6147048895d388618f4ee963631e)
```{bash}
./singlevsmulti.sh refseq_pfam_hmm_res_ed_parsed.tsv
```
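The counting script itself lives in the linked gist and is not reproduced here. As a rough sketch of one way such a count could work, the hypothetical function below assumes the protein ID is in column 3 of the parsed table and treats a protein with exactly one hit as single domain. Both are assumptions, not the gist's actual logic.

```bash
# Hypothetical stand-in for the counting step (not the gist's script).
# Assumes column 3 holds the protein ID and each row is one domain hit.
count_single_multi() {
  awk '{ hits[$3]++ }
       END {
         for (p in hits) {
           if (hits[p] == 1) single++; else multi++
         }
         printf "No. of single domain proteins = %d\n", single
         printf "No. of multidomain proteins = %d\n", multi
       }' "$@"
}

# Toy input: protA and protC have one hit each, protB has two.
printf 'PF00001 x protA\nPF00002 x protB\nPF00003 x protB\nPF00004 x protC\n' \
  | count_single_multi
```

Counting hits rather than distinct Pfam accessions means a protein with two copies of the same domain is counted as multidomain. Whether that matches the original script would need checking against the gist.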
No. of single domain proteins = 15622490

No. of multidomain proteins = 6031644
```{r}
# Ratio of multidomain to single domain proteins
6031644/15622490
```
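The ratio above compares multidomain proteins to single domain proteins. If the share of multidomain proteins among all annotated proteins is wanted instead, the denominator is the sum of both counts. A minimal sketch, with `multidomain_share` as a hypothetical helper name:

```bash
# Hypothetical helper: percentage of multidomain proteins among all
# annotated proteins. Arguments: <single count> <multidomain count>
multidomain_share() {
  awk -v s="$1" -v m="$2" 'BEGIN { printf "%.1f\n", 100 * m / (s + m) }'
}

# Counts reported above
multidomain_share 15622490 6031644
```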
```{bash}
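# Build one domain architecture per protein: domains ordered by coordinates and joined with "|"
# (assumes column 1 = Pfam domain, column 3 = protein, columns 8 and 9 = domain coordinates)
# sort -s keeps the coordinate order within each protein; a plain sort -k1,1 breaks ties on the whole line, i.e. alphabetically by domain
awk '{print $3,$1,$8,$9}' refseq_pfam_hmm_res_ed_parsed.tsv | sort -k1,1 -k3,4g | awk '{print $1"\t"$2}' | sort -s -k1,1 | awk 'BEGIN { getline; id=$1; l1=$1;l2=$2;} { if ($1 != id) { print l1,l2; l1=$1;l2=$2;} else { l2=l2"|"$2} id=$1; } END { print l1,l2; }' > refseq_pfam_archit.tsv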
```
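The architecture step can be checked on a toy table. The sketch below is a simplified variant, not the exact pipeline above: it sorts once by protein, then start, then end, and collapses the domains per protein in a single awk pass. The function name `build_architectures`, the toy rows, and the column layout (column 1 = domain, column 3 = protein, columns 8 and 9 = coordinates) are assumptions.

```bash
# Hypothetical, simplified variant of the architecture step.
# Assumed columns: 1 = Pfam domain, 3 = protein, 8 = start, 9 = end.
build_architectures() {
  awk '{print $3,$1,$8,$9}' "$@" \
    | sort -k1,1 -k3,3g -k4,4g \
    | awk '{
        if ($1 != id) {               # new protein: flush the previous one
          if (NR > 1) print id, arch
          id = $1; arch = $2
        } else {                      # same protein: append the next domain
          arch = arch "|" $2
        }
      }
      END { if (NR > 0) print id, arch }'
}

# Toy rows: in protA, PF00002 sits before PF00001 along the sequence.
printf '%s\n' \
  'PF00001 x protA x x x x 50 90' \
  'PF00002 x protA x x x x 1 40' \
  'PF00003 x protB x x x x 5 60' \
  | build_architectures
# prints:
# protA PF00002|PF00001
# protB PF00003
```

The toy case places the alphabetically later domain first along the sequence, so the output shows whether the architecture follows position rather than accession order.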