Predicting Molecular Solubility with AutoGL, an Automated Graph Machine Learning Framework

Posted by AIQuantum



Original title: Python package for Automated Graph Learning #DeepLearning #GraphLearning #Chemoinformatics


Original link:

https://iwatobipen.wordpress.com/2021/01/04/python-package-for-automated-graph-learning-deeplearning-graphlearning-chemoinformatics/


Translator: Taki



Much real-world data can be represented as graphs, which makes graph-based deep learning (graph learning, GL) a very interesting field. In chemistry, a molecule can be abstracted as a graph, so GL has also attracted the interest of chemoinformatics researchers.
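As a minimal, dependency-free illustration of that idea (a hypothetical toy example, not part of the original post): a molecular graph is just node labels plus an edge list, here ethanol with hydrogens omitted.

```python
# Toy molecular graph: atoms are nodes, bonds are edges (ethanol, CCO,
# hydrogens omitted). No chemistry library needed for the illustration.
atoms = ['C', 'C', 'O']           # node labels
bonds = [(0, 1), (1, 2)]          # undirected edges (bond list)

def degree(node, edges):
    """Count how many bonds touch a given atom."""
    return sum(node in e for e in edges)

degrees = [degree(i, bonds) for i in range(len(atoms))]
print(degrees)  # [1, 2, 1] -- the central carbon has two neighbours
```

Graph neural networks operate on exactly this kind of structure, with per-atom feature vectors in place of the bare element labels.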


I have previously posted about GL topics using Deep Graph Library and torch_geometric. Both packages are very useful in the chemoinformatics field.


Today I would like to introduce a new GL package named AutoGL, which lets researchers and developers quickly get started with autoML on graph datasets and tasks. The package can be found at the following URL:

https://github.com/THUMNLab/AutoGL


To set up AutoGL, you first need to install torch_geometric (PyG). To use the current version of AutoGL, the PyG version must be lower than 1.6.1.
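For reference, an install sketch consistent with the version constraint above. The exact torch/PyG wheel combination depends on your CUDA and PyTorch setup, and the package pins here are assumptions, not an official recipe:

```shell
# Assumed commands -- adjust torch/CUDA wheels to your environment.
# PyG must be below 1.6.1 for this AutoGL release:
pip install "torch-geometric<1.6.1"
# AutoGL itself, from PyPI or from source:
pip install autogl  # or: pip install git+https://github.com/THUMNLab/AutoGL.git
```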


The package provides automated node_classification and graph_classification methods. In chemoinformatics a molecule is treated as a graph, so I was interested in graph_classification.


So I tried using the package to predict a molecular property and uploaded the code to GitHub. For my test I used molecular solubility data.


import os

import torch
from rdkit import Chem
from rdkit.Chem import RDConfig
from torch_geometric.data import Data, DataLoader, Dataset
from torch_geometric.data import InMemoryDataset
from autogl.solver import AutoNodeClassifier
from autogl.solver import AutoGraphClassifier
from autogl.module.feature import BaseFeatureEngineer
from autogl.module.feature import BaseFeatureAtom
from autogl.datasets import utils
import molutil  # the author's helper module providing mol2vec

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


class ChemDataset(InMemoryDataset):
    def __init__(self, datalist) -> None:
        super().__init__()
        self.data, self.slices = self.collate(datalist)


# the SOL_classification property of the RDKit solubility data has three classes
sol_cls_dict = {'(A) low': 0, '(B) medium': 1, '(C) high': 2}

trainpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
testpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath)]
test_mols = [m for m in Chem.SDMolSupplier(testpath)]

# convert molecules to torch_geometric Data objects and attach class labels
train_X = [molutil.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

test_X = [molutil.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

trainData = ChemDataset(train_X)
testData = ChemDataset(test_X)

utils.graph_random_splits(trainData, train_ratio=0.4, val_ratio=0.4)
utils.graph_random_splits(testData, train_ratio=0.0, val_ratio=0.0)

config = {
    'models': {'gin': None},
    'feature': [{'name': 'NxLargeCliqueSize'}],
    'hpo': {'name': 'anneal', 'max_evals': 10},
    'ensemble': {'name': 'voting', 'size': 2},
    'trainer': [
        # trainer hp space
        {'parameterName': 'max_epoch', 'type': 'INTEGER',
         'maxValue': 20, 'minValue': 10, 'scalingType': 'LINEAR'},
        {'parameterName': 'batch_size', 'type': 'INTEGER',
         'maxValue': 128, 'minValue': 32, 'scalingType': 'LOG'},
        {'parameterName': 'early_stopping_round', 'type': 'INTEGER',
         'maxValue': 30, 'minValue': 10, 'scalingType': 'LINEAR'},
        {'parameterName': 'lr', 'type': 'DOUBLE',
         'maxValue': 1e-3, 'minValue': 1e-4, 'scalingType': 'LOG'},
        {'parameterName': 'weight_decay', 'type': 'DOUBLE',
         'maxValue': 5e-3, 'minValue': 5e-4, 'scalingType': 'LOG'},
    ],
}

solver = AutoGraphClassifier.from_config(config)


solver.fit(
    trainData,
    time_limit=720,
    train_split=0.9,
    val_split=0.1,
    cross_validation=True,
    cv_split=10,
)
lb = solver.get_leaderboard()
print('best single model:\n', lb.get_best_model(0))

'''
Output:
best single model:
 <class 'torch.optim.adam.Adam'>-0.0008968685743489439-15-15-AutoGIN(
  (model): GIN(
    (convs): ModuleList(
      (0): GINConv(nn=Sequential(
        (0): Linear(in_features=75, out_features=25, bias=True)
        (1): ELU(alpha=1.0)
        (2): Linear(in_features=25, out_features=25, bias=True)
        (3): ELU(alpha=1.0)
        (4): Linear(in_features=25, out_features=25, bias=True)
      ))
      (1): GINConv(nn=Sequential(
        (0): Linear(in_features=25, out_features=37, bias=True)
        (1): ELU(alpha=1.0)
        (2): Linear(in_features=37, out_features=37, bias=True)
        (3): ELU(alpha=1.0)
        (4): Linear(in_features=37, out_features=37, bias=True)
      ))
      (2): GINConv(nn=Sequential(
        (0): Linear(in_features=37, out_features=52, bias=True)
        (1): ELU(alpha=1.0)
        (2): Linear(in_features=52, out_features=52, bias=True)
        (3): ELU(alpha=1.0)
        (4): Linear(in_features=52, out_features=52, bias=True)
      ))
      (3): GINConv(nn=Sequential(
        (0): Linear(in_features=52, out_features=23, bias=True)
        (1): ELU(alpha=1.0)
        (2): Linear(in_features=23, out_features=23, bias=True)
        (3): ELU(alpha=1.0)
        (4): Linear(in_features=23, out_features=23, bias=True)
      ))
    )
    (bns): ModuleList(
      (0): BatchNorm1d(25, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): BatchNorm1d(37, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): BatchNorm1d(52, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): BatchNorm1d(23, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (fc1): Linear(in_features=24, out_features=8, bias=True)
    (fc2): Linear(in_features=8, out_features=3, bias=True)
  )
)-cpu|num_layers-6-hidden-[25, 37, 52, 23, 8]-dropout-0.4706157382603818-act-elu-eps-False-mlp_layers-3_cv5_idx0
'''

lb.show()
'''
                                                 name       acc
10                                           ensemble  0.766990
5   <class 'torch.optim.adam.Adam'>-0.000896868574...  0.757282
6   <class 'torch.optim.adam.Adam'>-0.000650776957...  0.737864
9   <class 'torch.optim.adam.Adam'>-0.000540590597...  0.689320
4   <class 'torch.optim.adam.Adam'>-0.000290357625...  0.669903
1   <class 'torch.optim.adam.Adam'>-0.000231124807...  0.650485
2   <class 'torch.optim.adam.Adam'>-0.000378524042...  0.650485
0   <class 'torch.optim.adam.Adam'>-0.000716295596...  0.631068
8   <class 'torch.optim.adam.Adam'>-0.000196322231...  0.601942
7   <class 'torch.optim.adam.Adam'>-0.000172943198...  0.582524
3   <class 'torch.optim.adam.Adam'>-0.000664233713...  0.563107
'''

pred = solver.predict(
    testData,
    inplaced=False,
    inplace=False,
    use_ensemble=True,
    use_best=True,
)

from sklearn.metrics import accuracy_score
accuracy_score(testData.data.y.numpy(), pred)
'''
Output -- the prediction accuracy on the test set, showing overfitting:
0.365758
'''
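accuracy_score used above is simply the fraction of predictions that match the labels; a quick pure-Python equivalent for intuition (an illustration, not the sklearn implementation):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([0, 1, 2, 1], [0, 2, 2, 1]))  # 0.75
```

With three solubility classes, random guessing would sit near 0.33, so a test accuracy of 0.37 against a ~0.77 validation leaderboard score is a strong sign of overfitting.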


AutoGL offers several ways to define an automatic solver: from a configuration file, from a config dict, and ad hoc. In this test I defined the solver with from_config. According to the readme.md in the original repository, the graph algorithms currently supported for graph classification tasks are GIN and TopKPool; GCN is not supported.


My test results seem to indicate that my model overfits.


In summary, AutoGL is an interesting and practical package for automated graph learning. Since it wraps torch_geometric, users can optimize models with very little effort.


However, if a user can define a model directly with torch_geometric, then the optuna optimization tool seems to be another route to automated GL optimization.



GitHub link:

https://gist.github.com/iwatobipen/827d3921826607663dd50018be903ee7

AutoGL code links:

https://github.com/THUMNLab/AutoGL

http://mn.cs.tsinghua.edu.cn/autogl/

AutoGL documentation:

https://autogl.readthedocs.io/en/latest/index.html

Survey of deep learning models on graphs:

https://arxiv.org/abs/1812.04202







