系统生物学大肠杆菌蛋白互作网络分析(系统与合成生物学第三次作业)

Posted Dream of Grass

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了系统生物学大肠杆菌蛋白互作网络分析(系统与合成生物学第三次作业)相关的知识,希望对你有一定的参考价值。

摘要

下载大肠杆菌蛋白互作网络(Ecoli PPI network)数据,使用Python对大肠杆菌蛋白互作网络进行筛选,并使用Cytoscape进行圆形布局可视化。此外,还绘制度分布函数并用幂函数进行拟合。

大肠杆菌蛋白互作网络

数据下载

我们从string(https://cn.string-db.org/)中下载ppi数据。

  • 大肠杆菌蛋白互作网络下载地址:https://stringdb-static.org/download/protein.links.v11.5/511145.protein.links.v11.5.txt.gz
  • 蛋白质信息下载地址:https://stringdb-static.org/download/protein.info.v11.5/511145.protein.info.v11.5.txt.gz

数据处理

将蛋白质的原始id转换为蛋白质名,并筛选结合分数大于800的蛋白质对。

import pandas as pd
import networkx as nx

# 处理Ecoli的ppi数据
filepath = "D:/000大三下/系统与合成生物学/"
link_filename = '511145.protein.links.v11.5.txt'
info_filename = '511145.protein.info.v11.5.txt'
ppi_link = pd.read_csv(filepath + link_filename, sep=' ', header=0)
ppi_info = pd.read_csv(filepath + info_filename, sep='\\t', header=0)
# 筛选结合分数大于800的protein pair
ppi_link = ppi_link[ppi_link['combined_score'] > 800]
# 把stiring_protein_id替换成preferred_name
ppi = pd.merge(ppi_link, ppi_info[['#string_protein_id', 'preferred_name']], left_on='protein1',
               right_on='#string_protein_id')
ppi = pd.merge(ppi, ppi_info[['#string_protein_id', 'preferred_name']], left_on='protein2',
               right_on='#string_protein_id')
ppi = ppi[['preferred_name_x', 'preferred_name_y', 'combined_score']]
ppi.columns = ['protein1', 'protein2', 'combined_score']
# 保存
ppi.to_csv(filepath + 'Ecoli_ppi.csv', index=False, header=1)

处理后的数据如下:

表1 处理后的蛋白互作网络
protein1protein2combined_score
thrAdapA945
dapBdapA999
dapDdapA966
dapEdapA948
bamCdapA945
lysAdapA984
asddapA994
dapFdapA968
metLdapA938
lysCdapA858

圆形布局可视化蛋白互作网络

图1 蛋白互作网络全貌
图2 蛋白互作网络微观

计算中心性指标

使用networkx,计算Degree CentralityCloseness CentralityBetweenness CentralityEigenvector Centrality。计算结束后对其求和并降序排列。

net = ppi[['protein1', 'protein2']]
net.drop_duplicates()
G = nx.from_pandas_edgelist(net, "protein1", "protein2")
# 计算中心性
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)
# 将指标转换为DataFrame格式
centrality_df = pd.DataFrame('Degree Centrality': degree_centrality,
                              'Closeness Centrality': closeness_centrality,
                              'Betweenness Centrality': betweenness_centrality,
                              'Eigenvector Centrality': eigenvector_centrality)
# 求和并排序
centrality_df['Sum'] = centrality_df.sum(axis=1)
centrality_df = centrality_df.sort_values(by='Sum', ascending=False)
# 保存
centrality_df.to_csv(filepath + 'Ecoli_ppi_centrality.csv')
表2 蛋白互作网络中心性指标
ProteinDegree CentralityCloseness CentralityBetweenness CentralityEigenvector CentralitySum
rpoB0.0325830.2749430.0185490.1050430.431118
rpsL0.0292220.2608880.0114650.1155070.417082
rplB0.0292220.2614920.002940.1216120.415266
rpmA0.0274110.2634950.0069030.1173470.415155
rplM0.0292220.259260.0021290.1216990.41231
rplE0.0289630.2578170.0023430.1222010.411324
rpsB0.0279290.2609830.0026640.1178170.409392
rplD0.0281870.2562630.0010580.1224610.407969
rplV0.0279290.2556280.0011120.1211780.405846
rplC0.027670.2557190.001020.121370.405779

绘制度分布函数并用幂函数拟合

使用python进行绘制并拟合。幂指数为-0.599。

# 计算度分布函数
# 即每个度数对应的节点数除以总节点数。
degree_sequence = sorted([d for n, d in G.degree()], reverse=True)
degree_count = np.zeros(max(degree_sequence)+1)
for degree in degree_sequence:
    degree_count[degree] += 1
degree_distribution = degree_count/sum(degree_count)

# 绘制度分布函数的双对数轴图
x = np.arange(1, len(degree_distribution)+1)
y = degree_distribution
# plt.loglog(x, y, color = 'chocolate', marker = '.')
plt.plot(x, y, color = 'chocolate', marker = '.')
# 使用幂函数拟合,计算幂指数的值
def power_law(x, a, b):
    return a * x ** (b)

popt, pcov = curve_fit(power_law, x, y)
a, b = popt
# plt.loglog(x, power_law(x, a, b), 'c--', label='fit: y=%5.3f*x^(%5.3f)' % (a, b))
plt.plot(x, power_law(x, a, b), 'c--', label='fit: y=%5.3f*x^(%5.3f)' % (a, b))
plt.xlabel('Degree)')
plt.ylabel('Frequency)')
# plt.xlabel('log(Degree)')
# plt.ylabel('log(Frequency)')
plt.legend()
plt.savefig(filepath+'Degree_distribution.png')
plt.show()
print('幂指数的值为:', b)







图3 横纵坐标对数转换前图4 横纵坐标对数转换后

完整代码

# _*_ coding: utf-8 _*_
# @Time : 2023-03-28 23:03 
# @Author : YingHao Zhang(池塘春草梦)
# @Version:v0.1
# @File : network.py
# @desc : 系统与合成生物学作业。读取Ecoli的ppi数据,将蛋白质的原始id转换为蛋白质名,并计算四种中心性指标。输入:ppi的link和info文件。输出:①处理后的ppi网络(包括蛋白质对和结合分数);②蛋白质的四种中心性指标和四种中心性指标的和。绘制度分布曲线,并用幂函数拟合。

import pandas as pd
import networkx as nx
import time

start_time = time.time()
# 处理Ecoli的ppi数据
filepath = "D:/000大三下/系统与合成生物学/"
link_filename = '511145.protein.links.v11.5.txt'
info_filename = '511145.protein.info.v11.5.txt'
ppi_link = pd.read_csv(filepath + link_filename, sep=' ', header=0)
ppi_info = pd.read_csv(filepath + info_filename, sep='\\t', header=0)
# 筛选结合分数大于800的protein pair
ppi_link = ppi_link[ppi_link['combined_score'] > 800]
# 把stiring_protein_id替换成preferred_name
ppi = pd.merge(ppi_link, ppi_info[['#string_protein_id', 'preferred_name']], left_on='protein1',
               right_on='#string_protein_id')
ppi = pd.merge(ppi, ppi_info[['#string_protein_id', 'preferred_name']], left_on='protein2',
               right_on='#string_protein_id')
ppi = ppi[['preferred_name_x', 'preferred_name_y', 'combined_score']]
ppi.columns = ['protein1', 'protein2', 'combined_score']
# 保存
ppi.to_csv(filepath + 'Ecoli_ppi.csv', index=False, header=1)
net = ppi[['protein1', 'protein2']]
net.drop_duplicates()
G = nx.from_pandas_edgelist(net, "protein1", "protein2")
# 计算中心性
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)
# 将指标转换为DataFrame格式
centrality_df = pd.DataFrame('Degree Centrality': degree_centrality,
                              'Closeness Centrality': closeness_centrality,
                              'Betweenness Centrality': betweenness_centrality,
                              'Eigenvector Centrality': eigenvector_centrality)
# 求和并排序
centrality_df['Sum'] = centrality_df.sum(axis=1)
centrality_df = centrality_df.sort_values(by='Sum', ascending=False)
# 保存
centrality_df.to_csv(filepath + 'Ecoli_ppi_centrality.csv')

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# 计算度分布函数
# 即每个度数对应的节点数除以总节点数。
degree_sequence = sorted([d for n, d in G.degree()], reverse=True)
degree_count = np.zeros(max(degree_sequence)+1)
for degree in degree_sequence:
    degree_count[degree] += 1
degree_distribution = degree_count/sum(degree_count)

# 绘制度分布函数的双对数轴图
x = np.arange(1, len(degree_distribution)+1)
y = degree_distribution
# plt.loglog(x, y, color = 'chocolate', marker = '.')
plt.plot(x, y, color = 'chocolate', marker = '.')
# 使用幂函数拟合,计算幂指数的值
def power_law(x, a, b):
    return a * x ** (b)

popt, pcov = curve_fit(power_law, x, y)
a, b = popt
# plt.loglog(x, power_law(x, a, b), 'c--', label='fit: y=%5.3f*x^(%5.3f)' % (a, b))
plt.plot(x, power_law(x, a, b), 'c--', label='fit: y=%5.3f*x^(%5.3f)' % (a, b))
plt.xlabel('Degree)')
plt.ylabel('Frequency)')
# plt.xlabel('log(Degree)')
# plt.ylabel('log(Frequency)')
plt.legend()
plt.savefig(filepath+'Degree_distribution.png')
plt.show()

print('幂指数的值为:', b)

end_time = time.time()
run_time = end_time - start_time
print("Done (/▽\)")
print("代码运行时间为:分秒".format(int(run_time / 60), round(run_time % 60)))

以上是关于系统生物学大肠杆菌蛋白互作网络分析(系统与合成生物学第三次作业)的主要内容,如果未能解决你的问题,请参考以下文章

原核生物蛋白质翻译的起始过程?

什么是生物芯片,有什么用处?

基因差异表达分析方法

高中生物 在线等! 求教各位高手!

安捷伦推出2100生物分析仪手机应用程序——BioAnalyzer Buddy(生物分析仪好伙伴)

BioGRID 互作数据库