Fetching Compound Data by CID in Python (Using the Official PubChem API)
Posted by Xavier Jiezou
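Introduction
This tool fetches compound data from PubChem by CID, using the PubChem PUG REST API. Because up to 300 CIDs are sent in a single request, data for more than a thousand CIDs can be retrieved in about 2 to 3 seconds.
Download
I have packaged the program as a standalone Windows executable that works right after download: pubchem-1.0.2-win64.zip
Demo
If you are not a developer, simply download the packaged program and use it; you do not need to read any further (this line is the divider). If you run into problems, please contact me.
Installation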
pip install requests
Usage
- Clone the repository.
git clone https://github.com/XavierJiezou/python-pubchem-api.git
- Change into the repository root.
cd python-pubchem-api
- Copy your list of CIDs into cid.txt, one CID per line.
- Run the command:
python pubchem.py
- The results are saved to data.json and data.csv.
- You can also change the variable self.property_list in pubchem.py, choosing names from the compound property table below:
self.property_list = [
    'IUPACName',
    'IsomericSMILES',
    'MolecularFormula',
    'MolecularWeight',
    'HBondDonorCount',
    'HBondAcceptorCount'
]
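Related
Compound property table
Several properties can be requested at once by putting a comma-separated list of property tags in the URL. Valid output formats for the property table are XML, ASNT/B, JSON(P), CSV, and TXT (TXT only for a single property). Available properties include:
Property | Description |
---|---|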
MolecularFormula | Molecular formula. |
MolecularWeight | The molecular weight is the sum of all atomic weights of the constituent atoms in a compound, measured in g/mol. In the absence of explicit isotope labelling, averaged natural abundance is assumed. If an atom bears an explicit isotope label, 100% isotopic purity is assumed at this location. |
CanonicalSMILES | Canonical SMILES (Simplified Molecular Input Line Entry System) string. It is a unique SMILES string of a compound, generated by a “canonicalization” algorithm. |
IsomericSMILES | Isomeric SMILES string. It is a SMILES string with stereochemical and isotopic specifications. |
InChI | Standard IUPAC International Chemical Identifier (InChI). It does not allow for user selectable options in dealing with the stereochemistry and tautomer layers of the InChI string. |
InChIKey | Hashed version of the full standard InChI, consisting of 27 characters. |
IUPACName | Chemical name systematically determined according to the IUPAC nomenclatures. |
Title | The title used for the compound summary page. |
XLogP | Computationally generated octanol-water partition coefficient or distribution coefficient. XLogP is used as a measure of hydrophilicity or hydrophobicity of a molecule. |
ExactMass | The mass of the most likely isotopic composition for a single molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. |
MonoisotopicMass | The mass of a molecule, calculated using the mass of the most abundant isotope of each element. |
TPSA | Topological polar surface area, computed by the algorithm described in the paper by Ertl et al. |
Complexity | The molecular complexity rating of a compound, computed using the Bertz/Hendrickson/Ihlenfeldt formula. |
Charge | The total (or net) charge of a molecule. |
HBondDonorCount | Number of hydrogen-bond donors in the structure. |
HBondAcceptorCount | Number of hydrogen-bond acceptors in the structure. |
RotatableBondCount | Number of rotatable bonds. |
HeavyAtomCount | Number of non-hydrogen atoms. |
IsotopeAtomCount | Number of atoms with enriched isotope(s). |
AtomStereoCount | Total number of atoms with tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration]. |
DefinedAtomStereoCount | Number of atoms with defined tetrahedral (sp3) stereo. |
UndefinedAtomStereoCount | Number of atoms with undefined tetrahedral (sp3) stereo. |
BondStereoCount | Total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration]. |
DefinedBondStereoCount | Number of bonds with defined planar (sp2) stereo. |
UndefinedBondStereoCount | Number of bonds with undefined planar (sp2) stereo. |
CovalentUnitCount | Number of covalently bound units. |
Volume3D | Analytic volume of the first diverse conformer (default conformer) for a compound. |
XStericQuadrupole3D | The x component of the quadrupole moment (Qx) of the first diverse conformer (default conformer) for a compound. |
YStericQuadrupole3D | The y component of the quadrupole moment (Qy) of the first diverse conformer (default conformer) for a compound. |
ZStericQuadrupole3D | The z component of the quadrupole moment (Qz) of the first diverse conformer (default conformer) for a compound. |
FeatureCount3D | Total number of 3D features (the sum of FeatureAcceptorCount3D, FeatureDonorCount3D, FeatureAnionCount3D, FeatureCationCount3D, FeatureRingCount3D and FeatureHydrophobeCount3D). |
FeatureAcceptorCount3D | Number of hydrogen-bond acceptors of a conformer. |
FeatureDonorCount3D | Number of hydrogen-bond donors of a conformer. |
FeatureAnionCount3D | Number of anionic centers (at pH 7) of a conformer. |
FeatureCationCount3D | Number of cationic centers (at pH 7) of a conformer. |
FeatureRingCount3D | Number of rings of a conformer. |
FeatureHydrophobeCount3D | Number of hydrophobes of a conformer. |
ConformerModelRMSD3D | Conformer sampling RMSD in Å. |
EffectiveRotorCount3D | Effective number of rotors of a compound, a measure of its conformational flexibility. |
ConformerCount3D | The number of conformers in the conformer model for a compound. |
Fingerprint2D | Base64-encoded PubChem Substructure Fingerprint of a molecule. |
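As a minimal sketch of how the comma-separated property tags end up in a request URL, the helper below assembles one from a list of CIDs and property names. The function name build_property_url and the ALLOWED_FORMATS set are hypothetical, introduced only for illustration; the URL layout follows the pattern used in pubchem.py, and JSONP is left out because it needs an extra callback parameter.

```python
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"
# Output formats from the paragraph above; "ASNT/B" is written out as ASNT and ASNB.
# JSONP is omitted here (assumption: it would also need a callback parameter).
ALLOWED_FORMATS = {"XML", "ASNT", "ASNB", "JSON", "CSV", "TXT"}


def build_property_url(cids, properties, fmt="JSON"):
    """Return a property-table URL for the given CIDs (hypothetical helper)."""
    fmt = fmt.upper()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {fmt}")
    if not cids or not properties:
        raise ValueError("need at least one CID and one property")
    # TXT output is only valid when a single property is requested.
    if fmt == "TXT" and len(properties) != 1:
        raise ValueError("TXT output is limited to a single property")
    cid_str = ",".join(str(int(c)) for c in cids)  # int() rejects non-numeric CIDs
    prop_str = ",".join(properties)
    return f"{BASE_URL}/{cid_str}/property/{prop_str}/{fmt}"


if __name__ == "__main__":
    print(build_property_url([1, 2, 3], ["MolecularFormula", "MolecularWeight"]))
```

Validating the format before sending anything keeps a malformed request from costing a network round trip.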
Property API
Fetch properties by CID.
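A request of this kind follows the same URL pattern that pubchem.py builds, for example https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/property/MolecularFormula,MolecularWeight/JSON. The sketch below shows one way to read the JSON that comes back. It assumes the PropertyTable / Properties layout that pubchem.py relies on; index_properties is a hypothetical helper, and the sample payload is hand-written for illustration rather than captured from a live response.

```python
def index_properties(payload):
    """Map each CID to its property record (hypothetical helper)."""
    records = payload.get("PropertyTable", {}).get("Properties", [])
    return {record["CID"]: record for record in records}


# Hand-written sample in the shape pubchem.py expects; value types are an assumption.
sample = {
    "PropertyTable": {
        "Properties": [
            {"CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16"},
        ]
    }
}

by_cid = index_properties(sample)
print(by_cid[2244]["MolecularFormula"])  # C9H8O4
```

Indexing by CID makes later lookups independent of the order in which records were returned.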
Synonym API
Fetch synonyms by CID.
Example:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1,2,3,4,5/synonyms/JSON
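The response to a synonym request can be reduced to a simple CID-to-names mapping. The sketch below assumes the InformationList / Information layout that pubchem.py reads, including the case it handles where an entry has no Synonym key. The helper name synonyms_by_cid is hypothetical and the sample payload uses placeholder names, not real PubChem data.

```python
def synonyms_by_cid(payload):
    """Map each CID to its list of synonyms, using [] when none are present."""
    info = payload.get("InformationList", {}).get("Information", [])
    return {item["CID"]: item.get("Synonym", []) for item in info}


# Placeholder payload in the shape pubchem.py expects (not real PubChem output).
sample = {
    "InformationList": {
        "Information": [
            {"CID": 1, "Synonym": ["placeholder name A", "placeholder name B"]},
            {"CID": 2},  # an entry without synonyms
        ]
    }
}

names = synonyms_by_cid(sample)
print(names[1])
print(names[2])
```

Defaulting to an empty list keeps every CID present in the result, which simplifies writing one CSV row per compound.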
Packaging
git clone https://github.com/XavierJiezou/python-pubchem-api.git
cd python-pubchem-api
pip install pipenv
pipenv install
pipenv shell
pip install requests
pip install pyinstaller
pyinstaller -F -i favicon.ico pubchem.py
Source code
import os, csv, json, requests


class PubchemCrawlFast():
    def __init__(self, cid_path, out_path):
        """Initialization function.

        Args:
            cid_path (str): Input file path of cid list
            out_path (str): Output file path of crawled data
        """
        self.cid_path = cid_path
        self.out_path = out_path
        self.property_list = [
            'IUPACName',
            'IsomericSMILES',
            'MolecularFormula',
            'MolecularWeight',
            'HBondDonorCount',
            'HBondAcceptorCount'
        ]

    def get_cid_list(self):
        """Get the cid list from the local file, or from stdin if the file is missing
        """
        if os.path.exists(self.cid_path):
            with open(self.cid_path) as f:
                # Skip blank lines so they do not end up in the request URL
                self.cid_list = [i.strip() for i in f if i.strip()]
        else:
            self.cid_list = []
            cid = input('Please input the CID list below: \n')
            while cid != '':
                self.cid_list.append(cid)
                cid = input()
        self.length = len(self.cid_list)

    def get_property_from_cid(self):
        """Get the property from cid
        """
        limit = 300  # number of CIDs sent per request
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        property_str = ','.join(self.property_list)
        return_type = 'json'
        self.prp = []
        for i in range(limit, self.length+limit, limit):
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/property/{property_str}/{return_type}'
            res = requests.get(url).json()
            self.prp += res['PropertyTable']['Properties']

    def get_synonyms_from_cid(self):
        """Get the synonym from cid
        """
        limit = 300  # number of CIDs sent per request
        api = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/'
        return_type = 'json'
        self.syn = []
        for i in range(limit, self.length+limit, limit):
            cid_str = ','.join(self.cid_list[i-limit:i])
            url = f'{api}{cid_str}/synonyms/{return_type}'
            res = requests.get(url).json()
            self.syn += res['InformationList']['Information']
        # Some compounds have no synonyms; give them an empty list
        for i in range(len(self.syn)):
            if 'Synonym' not in self.syn[i]:
                self.syn[i]['Synonym'] = []

    def save_as_csv(self, data):
        """Save the crawled data in CSV format
        """
        # splitext keeps paths such as './data.json' intact
        csv_name = os.path.splitext(self.out_path)[0]+'.csv'
        header_list = ['CID']+self.property_list+['Synonym']
        # with open(csv_name, 'w') as f:
        #     f.write(','.join(header_list)+'\n')
        # with open(csv_name, 'a') as f:
        #     for item in data:
        #         line = ['"'+str(item[each])+'"' for each in header_list]
        #         f.write(','.join(line)+'\n')
        with open(csv_name, 'w', newline='') as f:
            writer = csv.DictWriter(f, header_list)
            writer.writeheader()
            writer.writerows(data)

    def __main__(self):
        print('Getting CID list: ')
        self.get_cid_list()
        print('CID list acquisition is complete!')
        print('--------------------------------------------')
        print('Querying property list: ')
        self.get_property_from_cid()
        print('Property list query is complete!')
        print('--------------------------------------------')
        print('Querying synonym: ')
        self.get_synonyms_from_cid()
        print('Synonym query is complete!')
        print('--------------------------------------------')
        # Merge the property and synonym records pairwise
        dt = {
            'InfoList': {
                'Info': [dict(d1, **d2) for d1, d2 in zip(self.prp, self.syn)]
            }
        }
        json_str = json.dumps(dt, indent=2)
        print('The data is being written to the JSON file: ')
        with open(self.out_path, 'w') as f:
            f.write(json_str)
        print('Finished writing the JSON file! ')
        print('--------------------------------------------')
        print('The data is being written to the CSV file: ')
        self.save_as_csv(dt['InfoList']['Info'])
        print('Finished writing the CSV file! ')
        os.system('pause')  # Windows only: keeps the console window open


if __name__ == '__main__':
    PubchemCrawlFast('cid.txt', 'data.json').__main__()
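The script above splits the CID list into batches of 300 and then pairs property and synonym records by position with zip. As an optional variation, not part of the original program, the sketch below isolates the batching step and merges the two record lists by CID instead of by position, so a missing or reordered record would not shift the pairing. The names chunked and merge_by_cid are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_by_cid(properties, synonyms):
    """Attach each compound's synonyms to its property record, matched on CID."""
    syn_index = {item["CID"]: item.get("Synonym", []) for item in synonyms}
    return [dict(record, Synonym=syn_index.get(record["CID"], []))
            for record in properties]


if __name__ == "__main__":
    # 650 CIDs split into batches of 300 gives batches of 300, 300 and 50.
    cids = [str(n) for n in range(1, 651)]
    print([len(batch) for batch in chunked(cids, 300)])
```

Each batch can then be joined with commas and placed in the request URL, exactly as the two query methods do.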