Predicting Molecular Solubility with the AutoGL Automated Graph Machine Learning Framework
Posted by AIQuantum
Original title: Python package for Automated Graph Learning #DeepLearning #GraphLearning #Chemoinformatics
Original link:
https://iwatobipen.wordpress.com/2021/01/04/python-package-for-automated-graph-learning-deeplearning-graphlearning-chemoinformatics/
Translator: Taki
Many kinds of data can be represented as graphs, which makes graph-based deep learning (graph learning, GL) a very interesting field. In chemistry, a molecule can be abstracted as a graph, so GL has also attracted the interest of cheminformatics researchers.
I have previously posted about GL using the Deep Graph Library and torch_geometric. Both packages are very useful in cheminformatics.
Today I would like to introduce a new GL package named AutoGL, which lets researchers and developers quickly run autoML on graph datasets and tasks. The package is available at:
https://github.com/THUMNLab/AutoGL
To set up AutoGL, torch_geometric (PyG) must be installed first. The current version of AutoGL requires a PyG version below 1.6.1.
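The version constraint can be checked before installation. A minimal sketch of comparing dotted version strings in pure Python (no PyG required; the function name is my own, not part of AutoGL):

```python
def version_tuple(v: str) -> tuple:
    """Parse a dotted version string like '1.6.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split('.'))

def pyg_is_compatible(installed: str, ceiling: str = '1.6.1') -> bool:
    """AutoGL (at the time of writing) requires torch_geometric < 1.6.1."""
    return version_tuple(installed) < version_tuple(ceiling)

print(pyg_is_compatible('1.6.0'))  # True
print(pyg_is_compatible('1.6.1'))  # False
```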
The package provides automated node classification and graph classification methods. In cheminformatics, molecules are treated as graphs, so I was interested in graph classification.
I therefore tried to predict a molecular property with the package and uploaded the code to GitHub. I used molecular solubility data for my test.
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from autogl.solver import AutoNodeClassifier
from autogl.solver import AutoGraphClassifier
from autogl.module.feature import BaseFeatureEngineer
from autogl.module.feature import BaseFeatureAtom
from autogl.datasets import utils
import os
from rdkit import Chem
from rdkit.Chem import RDConfig
import molutil  # helper module from the author's gist
from torch_geometric.data import Data, DataLoader, Dataset
from torch_geometric.data import InMemoryDataset

class ChemDataset(InMemoryDataset):
    def __init__(self, datalist) -> None:
        super().__init__()
        self.data, self.slices = self.collate(datalist)
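The `molutil` module comes from the author's gist and is not shown here; its `mol2vec` function converts an RDKit molecule into a `torch_geometric.data.Data` object. Judging from the model dump later in this post (`in_features=75`), each atom is encoded as a fixed-length feature vector. A hypothetical, RDKit-free sketch of the one-hot block typically used in such an encoding (vocabulary and function name are my own illustration):

```python
# Hypothetical illustration: one-hot encode an atom symbol over a fixed
# vocabulary, the typical first block of a graph-NN atom feature vector.
ATOM_SYMBOLS = ['C', 'N', 'O', 'S', 'F', 'Cl', 'Br', 'I', 'P', 'Unknown']

def one_hot_symbol(symbol: str) -> list:
    """Return a one-hot list; unseen symbols map to the 'Unknown' slot."""
    idx = ATOM_SYMBOLS.index(symbol) if symbol in ATOM_SYMBOLS else len(ATOM_SYMBOLS) - 1
    return [1 if i == idx else 0 for i in range(len(ATOM_SYMBOLS))]

print(one_hot_symbol('N'))  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```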
sol_cls_dict = {'(A) low': 0, '(B) medium': 1, '(C) high': 2}
trainpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
testpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath)]
test_mols = [m for m in Chem.SDMolSupplier(testpath)]
train_X = [molutil.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
test_X = [molutil.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
trainData = ChemDataset(train_X)
testData = ChemDataset(test_X)
utils.graph_random_splits(trainData, train_ratio=0.4, val_ratio=0.4)
utils.graph_random_splits(testData, train_ratio=0.0, val_ratio=0.0)
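`graph_random_splits` assigns each graph to a train/validation/test partition by the given ratios; with `train_ratio=0.4, val_ratio=0.4` the remaining 20% becomes the test fold. A minimal stdlib sketch of that kind of ratio split over indices (illustrative only, not the AutoGL internals):

```python
import random

def ratio_split(n: int, train_ratio: float, val_ratio: float, seed: int = 0):
    """Shuffle indices 0..n-1 and cut them into train/val/test by ratio."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = ratio_split(100, 0.4, 0.4)
print(len(train), len(val), len(test))  # 40 40 20
```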
config = {
    'models': {'gin': None},
    'feature': [{'name': 'NxLargeCliqueSize'}],
    'hpo': {'name': 'anneal', 'max_evals': 10},
    'ensemble': {'name': 'voting', 'size': 2},
    'trainer': [
        # trainer hyperparameter search space
        {'parameterName': 'max_epoch', 'type': 'INTEGER', 'maxValue': 20, 'minValue': 10, 'scalingType': 'LINEAR'},
        {'parameterName': 'batch_size', 'type': 'INTEGER', 'maxValue': 128, 'minValue': 32, 'scalingType': 'LOG'},
        {'parameterName': 'early_stopping_round', 'type': 'INTEGER', 'maxValue': 30, 'minValue': 10, 'scalingType': 'LINEAR'},
        {'parameterName': 'lr', 'type': 'DOUBLE', 'maxValue': 1e-3, 'minValue': 1e-4, 'scalingType': 'LOG'},
        {'parameterName': 'weight_decay', 'type': 'DOUBLE', 'maxValue': 5e-3, 'minValue': 5e-4, 'scalingType': 'LOG'},
    ]
}
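In the trainer search space, `scalingType` controls how the HPO samples each range: LINEAR samples uniformly, while LOG samples uniformly in log space, which suits parameters like `lr` that span orders of magnitude. A sketch of the difference (illustrative, not AutoGL's actual sampler):

```python
import math
import random

def sample_linear(lo: float, hi: float, rng: random.Random) -> float:
    """Uniform on [lo, hi]: every value is equally likely."""
    return rng.uniform(lo, hi)

def sample_log(lo: float, hi: float, rng: random.Random) -> float:
    """Uniform in log space: every order of magnitude is equally likely."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

rng = random.Random(42)
lr = sample_log(1e-4, 1e-3, rng)            # e.g. learning rate, LOG scale
epochs = round(sample_linear(10, 20, rng))  # e.g. max_epoch, LINEAR scale
print(1e-4 <= lr <= 1e-3, 10 <= epochs <= 20)  # True True
```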
solver = AutoGraphClassifier.from_config(config)
solver.fit(trainData,
           time_limit=720,
           train_split=0.9,
           val_split=0.1,
           cross_validation=True,
           cv_split=10,
           )
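With `cross_validation=True` and `cv_split=10`, fitting uses 10-fold cross-validation (the `_cv5_idx0` suffix in the leaderboard entry below reflects fold bookkeeping). The partitioning idea behind k-fold CV, as a stdlib sketch (not AutoGL's implementation):

```python
def k_fold_indices(n: int, k: int):
    """Split indices 0..n-1 into k interleaved folds; each fold serves
    once as the validation set while the rest form the training set."""
    folds = [list(range(n))[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in k_fold_indices(10, 5):
    print(len(train), len(val))  # 8 2, on each of the 5 folds
```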
lb = solver.get_leaderboard()
print('best single model:\n', solver.get_leaderboard().get_best_model(0))
'''
The output:
best single model:
<class 'torch.optim.adam.Adam'>-0.0008968685743489439-15-15-AutoGIN(
(model): GIN(
(convs): ModuleList(
(0): GINConv(nn=Sequential(
(0): Linear(in_features=75, out_features=25, bias=True)
(1): ELU(alpha=1.0)
(2): Linear(in_features=25, out_features=25, bias=True)
(3): ELU(alpha=1.0)
(4): Linear(in_features=25, out_features=25, bias=True)
))
(1): GINConv(nn=Sequential(
(0): Linear(in_features=25, out_features=37, bias=True)
(1): ELU(alpha=1.0)
(2): Linear(in_features=37, out_features=37, bias=True)
(3): ELU(alpha=1.0)
(4): Linear(in_features=37, out_features=37, bias=True)
))
(2): GINConv(nn=Sequential(
(0): Linear(in_features=37, out_features=52, bias=True)
(1): ELU(alpha=1.0)
(2): Linear(in_features=52, out_features=52, bias=True)
(3): ELU(alpha=1.0)
(4): Linear(in_features=52, out_features=52, bias=True)
))
(3): GINConv(nn=Sequential(
(0): Linear(in_features=52, out_features=23, bias=True)
(1): ELU(alpha=1.0)
(2): Linear(in_features=23, out_features=23, bias=True)
(3): ELU(alpha=1.0)
(4): Linear(in_features=23, out_features=23, bias=True)
))
)
(bns): ModuleList(
(0): BatchNorm1d(25, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): BatchNorm1d(37, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): BatchNorm1d(52, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): BatchNorm1d(23, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(fc1): Linear(in_features=24, out_features=8, bias=True)
(fc2): Linear(in_features=8, out_features=3, bias=True)
)
)-cpu|num_layers-6-hidden-[25, 37, 52, 23, 8]-dropout-0.4706157382603818-act-elu-eps-False-mlp_layers-3_cv5_idx0
'''
lb.show()
'''
name acc
10 ensemble 0.766990
5 <class 'torch.optim.adam.Adam'>-0.000896868574... 0.757282
6 <class 'torch.optim.adam.Adam'>-0.000650776957... 0.737864
9 <class 'torch.optim.adam.Adam'>-0.000540590597... 0.689320
4 <class 'torch.optim.adam.Adam'>-0.000290357625... 0.669903
1 <class 'torch.optim.adam.Adam'>-0.000231124807... 0.650485
2 <class 'torch.optim.adam.Adam'>-0.000378524042... 0.650485
0 <class 'torch.optim.adam.Adam'>-0.000716295596... 0.631068
8 <class 'torch.optim.adam.Adam'>-0.000196322231... 0.601942
7 <class 'torch.optim.adam.Adam'>-0.000172943198... 0.582524
3 <class 'torch.optim.adam.Adam'>-0.000664233713... 0.563107
'''
pred = solver.predict(testData,
inplaced=False,
inplace=False,
use_ensemble=True,
use_best=True
)
from sklearn.metrics import accuracy_score
accuracy_score(testData.data.y.numpy(), pred)
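`accuracy_score` is simply the fraction of predictions matching the true labels; equivalently, in plain Python:

```python
def accuracy(y_true, y_pred) -> float:
    """Fraction of positions where the prediction equals the ground truth."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
```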
'''
The prediction accuracy on the test data; the model is overfitting:
0.365758
'''
AutoGL offers several ways to define an automated solver: from a config file, from a config dict, or ad hoc. In this test I used from_config to define the solver. According to the readme.md in the original repository, the graph algorithms currently supported for graph classification are GIN and TopKPool; GCN is not supported.
My test results suggest that the model is overfitting.
In summary, AutoGL is an interesting and practical package for automated graph learning. Since it wraps torch_geometric, users can optimize models very easily.
However, if users can define models directly with torch_geometric, then an optimization tool such as optuna seems to be another route to automated GL optimization.
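A tool like optuna wraps exactly this kind of loop: sample hyperparameters, evaluate, keep the best. A stdlib random-search sketch over a toy objective (the objective here merely stands in for training-plus-validation of a PyG model; all names are my own illustration):

```python
import math
import random

def random_search(objective, space, n_trials: int = 50, seed: int = 1):
    """Minimize `objective` by sampling each hyperparameter from its
    (low, high, scale) range; returns (best_params, best_value)."""
    rng = random.Random(seed)
    best_params, best_value = None, float('inf')
    for _ in range(n_trials):
        params = {}
        for name, (lo, hi, scale) in space.items():
            if scale == 'log':
                params[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
            else:
                params[name] = rng.uniform(lo, hi)
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

# Toy objective: pretend validation loss is minimized near lr = 3e-4.
space = {'lr': (1e-4, 1e-3, 'log'), 'dropout': (0.1, 0.6, 'linear')}
best, loss = random_search(
    lambda p: (math.log10(p['lr']) + 3.5) ** 2 + p['dropout'] * 0.01, space)
print(best, loss)
```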
GitHub link:
https://gist.github.com/iwatobipen/827d3921826607663dd50018be903ee7
AutoGL code links:
https://github.com/THUMNLab/AutoGL
http://mn.cs.tsinghua.edu.cn/autogl/
AutoGL documentation:
https://autogl.readthedocs.io/en/latest/index.html
Survey of deep learning on graphs:
https://arxiv.org/abs/1812.04202