Use pre-trained nodes from past runs - PyTorch-BigGraph

Posted: 2021-02-05 13:34:33

【Problem description】:

After struggling with the amazing facebookresearch / PyTorch-BigGraph project and its impossible API, I managed to figure out how to run it (thanks to the stand alone simple example).

My system constraints don't allow me to train a dense (embedding) representation of all edges at once, so from time to time I need to load past embeddings and train the model on new edges together with the existing nodes. Note that the past nodes and the new edge list do not necessarily overlap.

I tried to figure out how to do this from here: see the context section, so far without success.

Below is a self-contained PBG script that turns batch_edges into a list of embedded nodes; however, I need it to use the list of pre-trained nodes past_trained_nodes.

import os
import shutil
from pathlib import Path

from torchbiggraph.config import parse_config
from torchbiggraph.converters.importers import TSVEdgelistReader, convert_input_data
from torchbiggraph.train import train
from torchbiggraph.util import SubprocessInitializer, setup_logging

DIMENSION = 4
DATA_DIR = 'data'
GRAPH_PATH = DATA_DIR + '/output1.tsv'
MODEL_DIR = 'model'


raw_config = dict(
        entity_path=DATA_DIR,
        edge_paths=[DATA_DIR + '/edges_partitioned', ],
        checkpoint_path=MODEL_DIR,
        entities={"n": {"num_partitions": 1}},
        relations=[{"name": "doesnt_matter", "lhs": "n", "rhs": "n", "operator": "complex_diagonal"}, ],
        dynamic_relations=False, dimension=DIMENSION, global_emb=False, comparator="dot",
        num_epochs=7, num_uniform_negs=1000, loss_fn="softmax", lr=0.1, eval_fraction=0.,)

batch_edges = [["A", "B"], ["B", "C"],  ["C", "D"], ["D", "B"], ["B", "D"]]


# I want the model to use these pre-trained nodes. Notice that node A exists and F does not.
# I don't have all past nodes, as some are gained from the data.
past_trained_nodes = {'A': [0.5, 0.3, 1.5, 8.1], 'F': [3, 0.6, 1.2, 4.3]}


# Clean up any leftovers from previous runs.
shutil.rmtree(DATA_DIR, ignore_errors=True)
shutil.rmtree(MODEL_DIR, ignore_errors=True)

os.makedirs(DATA_DIR, exist_ok=True)
with open(GRAPH_PATH, 'w') as f:
    for edge in batch_edges:
        f.write('\t'.join(edge) + '\n')

setup_logging()
config = parse_config(raw_config)
subprocess_init = SubprocessInitializer()
input_edge_paths = [Path(GRAPH_PATH)]

convert_input_data(config.entities, config.relations, config.entity_path, config.edge_paths,
                   input_edge_paths, TSVEdgelistReader(lhs_col=0, rel_col=None, rhs_col=1),
                   dynamic_relations=config.dynamic_relations, )

train(config, subprocess_init=subprocess_init)

How can I use my pre-trained nodes in the current model?

Thanks in advance!

【Comments】:

【Answer 1】:

Since torchbiggraph is file-based, you can modify the saved files to load pre-trained embeddings and add new nodes. I wrote a function that does this:

import json
import h5py

def pretrained_and_new_nodes(pretrained_nodes,new_nodes,entity_name,data_dir,embeddings_path):
    """
    pretrained_nodes: 
        A dictionary of nodes and their embeddings 
    new_nodes:
        A list of new nodes; each new node must have an embedding in pretrained_nodes.
        If no new nodes, use []
    entity_name: 
        The entity's name, for example, WHATEVER_0
    data_dir: 
        The path to the files that record graph nodes and edges 
    embeddings_path: 
        The path to the .h5 file of embeddings
    """
    with open('%s/entity_names_%s.json' % (data_dir,entity_name),'r') as source:
        nodes = json.load(source)
    dist = {item: ind for ind, item in enumerate(nodes)}
    
    if len(new_nodes) > 0:
        # modify both the node names and the node count 
        extended = nodes.copy()
        extended.extend(new_nodes)
        with open('%s/entity_names_%s.json' % (data_dir,entity_name),'w') as source:
            json.dump(extended,source)
        with open('%s/entity_count_%s.txt' % (data_dir,entity_name),'w') as source:
            source.write('%i' % len(extended))
    
    if len(new_nodes) == 0:
        # if no new nodes are added, we won't bother creating a new .h5 file, but just modify the original one
        with h5py.File(embeddings_path,'r+') as source:

            for node,embedding in pretrained_nodes.items():
                if node in nodes:
                    source['embeddings'][dist[node]] = embedding
    else:
        # if there are new nodes, then we must create a new .h5 file 
        # see https://stackoverflow.com/a/47074545/8366805
        with h5py.File(embeddings_path,'r+') as source:
            embeddings = list(source['embeddings'])
            optimizer = list(source['optimizer'])
        for node,embedding in pretrained_nodes.items():
            if node in nodes:
                embeddings[dist[node]] = embedding
        # append new nodes in order
        for node in new_nodes:
            if node not in pretrained_nodes:
                raise ValueError('new node %s has no embedding in pretrained_nodes' % node)
            embeddings.append(pretrained_nodes[node])
        # write a new .h5 file for the embedding 
        with h5py.File(embeddings_path,'w') as source:
            source.create_dataset('embeddings',data=embeddings,)
            optimizer = [item.encode('ascii') for item in optimizer]
            source.create_dataset('optimizer',data=optimizer)

After you have trained a model (say, the simple example you linked in your post), suppose you want to change the learned embedding of node A to [0.5, 0.3, 1.5, 8.1]. In addition, you want to add a new node F to the graph with the embedding [3, 0.6, 1.2, 4.3] (this newly added node F has no connections to the other nodes). You can run my function:

past_trained_nodes = {'A': [0.5, 0.3, 1.5, 8.1], 'F': [3, 0.6, 1.2, 4.3]}
pretrained_and_new_nodes(pretrained_nodes=past_trained_nodes,
                         new_nodes=['F'],
                         entity_name='WHATEVER_0',
                         data_dir='data/example_1',
                         embeddings_path='model_1/embeddings_WHATEVER_0.v7.h5')

After running the function, you can inspect the modified embeddings file embeddings_WHATEVER_0.v7.h5:

filename = "model_1/embeddings_WHATEVER_0.v7.h5" 

with h5py.File(filename, "r") as source:
    embeddings = list(source['embeddings'])

embeddings

You will see that A's embedding has been changed and F's embedding has been appended (the order of the embeddings matches the order of the nodes in entity_names_WHATEVER_0.json).
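If it helps, here is a small, hypothetical sanity check (the paths and the entity name WHATEVER_0 follow the example above and may differ in your setup) that pairs each node name with its row in the embeddings file, so you can confirm that ordering:

import json
import h5py

# Assumed example paths; adjust the data dir and checkpoint version to your own run.
with open('data/example_1/entity_names_WHATEVER_0.json', 'r') as source:
    names = json.load(source)
with h5py.File('model_1/embeddings_WHATEVER_0.v7.h5', 'r') as source:
    embeddings = list(source['embeddings'])

# Each name lines up with one embedding row, in the same order as the JSON file.
for name, embedding in zip(names, embeddings):
    print(name, list(embedding))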

After modifying the files, you can use the pre-trained nodes in a new training session.
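A minimal sketch of what that new session might look like, assuming you keep the same entity_path and checkpoint_path so that PBG picks up the modified entity files and the edited checkpoint (the paths, the entity name WHATEVER and the num_epochs value below are illustrative, not taken from the question; PBG should resume from the latest checkpoint version it finds in checkpoint_path):

from torchbiggraph.config import parse_config
from torchbiggraph.train import train
from torchbiggraph.util import SubprocessInitializer, setup_logging

# Assumed paths from the example above; the new edge list for this session should
# already have been converted into data/example_1/edges_partitioned.
resume_config = dict(
    entity_path='data/example_1',
    edge_paths=['data/example_1/edges_partitioned'],
    checkpoint_path='model_1',   # the existing (edited) checkpoints live here
    entities={"WHATEVER": {"num_partitions": 1}},
    relations=[{"name": "doesnt_matter", "lhs": "WHATEVER", "rhs": "WHATEVER",
                "operator": "complex_diagonal"}],
    dynamic_relations=False, dimension=4, global_emb=False, comparator="dot",
    num_epochs=10,               # larger than the epochs already stored, so new passes actually run
    num_uniform_negs=1000, loss_fn="softmax", lr=0.1, eval_fraction=0.,
)

setup_logging()
config = parse_config(resume_config)
train(config, subprocess_init=SubprocessInitializer())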

【Comments】:

@YehosaphatSchellekens You're welcome! I'm glad I could help :)
