GraphSAGE Code Walkthrough
Posted by shiyublog
Installing Docker and Running the Program
1. requirements.txt
Problem:
```
Downloading https://files.pythonhosted.org/packages/69/cb/f5be453359271714c01b9bd06126eaf2e368f1fddfff30818754b5ac2328/funcsigs-1.0.2-py2.py3-none-any.whl
Collecting futures==3.2.0 (from -r requirements.txt (line 8))
Could not find a version that satisfies the requirement futures==3.2.0 (from -r requirements.txt (line 8)) (from versions: 0.2.python3, 0.1, 0.2, 1.0, 2.0, 2.1, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.1.5, 2.1.6, 2.2.0, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.0.4, 3.0.5, 3.1.0, 3.1.1)
No matching distribution found for futures==3.2.0 (from -r requirements.txt (line 8))
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
```
Solution:
futures==3.2.0 => futures==3.1.1
2. Install Docker CE for Ubuntu
https://docs.docker.com/install/linux/docker-ce/ubuntu/
3. Problem: write /var/lib/docker/tmp/GetImageBlob891597147: no space left on device
Solution:
```bash
sudo apt-get autoclean   # remove cached .deb files of packages that have been uninstalled
sudo apt-get clean       # to free more space, remove all cached .deb installation packages
sudo apt-get autoremove  # remove orphaned packages that were installed as dependencies and are no longer needed
```
If there is still not enough space: move the large directories under /var to a partition with free space (e.g. /home) and leave symlinks behind, so they no longer consume space under /var. Concretely:
```bash
mv /var/cache /home/lsy     # move /var/cache to /home (or any partition with space to spare)
ln -s /home/lsy/cache /var  # /var/cache now points to /home/lsy/cache and no longer uses /var space
mv /var/lib/docker /home/lsy
ln -s /home/lsy/docker /var/lib
service docker stop
service docker start
```
Docker must be restarted, otherwise you will hit:

```
OCI runtime create failed: /var/lib/docker/overlay2/c6eb60dada971e57fd5d125fb61d294870be347a2efb287862f8dfe52d99c57b/merged is not an absolute path or is a symlink: unknown
```
GraphSAGE Code Details
example_data:
1. toy-ppi-G.json — the graph data
```
{
  directed: false
  graph: { name: disjoint_union(,) }
  nodes: [
    {
      test: false
      id: 0
      features: [ ... ]
      val: false
      label: [ ... ]
    }
    {...}
    ...
  ]
  links: [
    {
      test_removed: false
      train_removed: false
      target: 800   # id of the target node (by default edges point from the smaller id to the larger)
      source: 0     # source node id; links are listed in order starting from node 0
    }
    {...}
    ...
  ]
}
```
2. toy-ppi-class_map.json — maps each node id to its class label (see item 5 below)
3. toy-ppi-feats.npy — pretrained node features
4. toy-ppi-id_map.json — a one-to-one mapping from node id to index; format: {"0": 0, "1": 1, ..., "14754": 14754}
5. toy-ppi-walks.txt — random-walk co-occurrence pairs, one pair of node ids per line
1. __init__.py
```python
from __future__ import print_function
# Even on Python 2.x, print must then be called like the Python 3.x function,
# i.e. with parentheses.

from __future__ import division
# Import true division from future Python versions.
# Without it, "/" on integers performs truncating division;
# with it, "/" performs true division and "//" performs truncating (floor) division.
```
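A minimal demo of what the division import changes (Python 2 semantics):

```python
from __future__ import division

print(7 / 2)    # 3.5 -- true division
print(7 // 2)   # 3   -- floor (truncating) division
```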
2. unsupervised_train.py
```python
if __name__ == '__main__':
    tf.app.run()
# https://blog.csdn.net/fxjzzyo/article/details/80466321
# tf.app.run() parses the command-line flags and then calls the main() function.
# If the entry point is called main(), tf.app.run() is enough;
# if it has another name, e.g. test(), pass it explicitly: tf.app.run(main=test).
```
```python
def main(argv=None):
    print("Loading training data..")
    train_data = load_data(FLAGS.train_prefix, load_walks=True)
    # load_data is defined in graphsage.utils
    print("Done loading training data..")
    train(train_data)
    # train is defined in this file: def train(train_data, test_data=None)
```
3. utils.py - func: load_data
(1) utils.py
```python
if isinstance(G.nodes()[0], int):
    def conversion(n): return int(n)
else:
    def conversion(n): return n
```
a. isinstance() checks whether an object is of a given type, similar to type().
Differences between isinstance() and type():
type() does not consider inheritance: an instance of a subclass does not count as the parent's type.
isinstance() does consider inheritance: an instance of a subclass counts as the parent's type.
isinstance() is the recommended way to check an object's type.
isinstance(object, classinfo)
Parameters:
object -- the instance to check.
classinfo -- a class (direct or indirect), a basic type, or a tuple of these.
Return value:
True if the object's type matches classinfo (or one of the types in the tuple), otherwise False.
```python
>>> a = 2
>>> isinstance(a, int)
True
>>> isinstance(a, str)
False
>>> isinstance(a, (str, int, list))   # True if a matches any type in the tuple
True
```
Difference between type() and isinstance():
```python
class A:
    pass

class B(A):
    pass

isinstance(A(), A)   # returns True
type(A()) == A       # returns True
isinstance(B(), A)   # returns True
type(B()) == A       # returns False
```
b. G.nodes()
Returns the nodes n of the graph, optionally together with the node attribute data (nodedata). https://networkx.github.io/documentation/stable/reference/classes/generated/networkx.Graph.nodes.html
Example:
```python
>>> G = nx.path_graph(3)
>>> list(G.nodes)
[0, 1, 2]
>>> list(G)
[0, 1, 2]
```
Getting the nodedata:
```python
>>> G.add_node(1, time='5pm')
>>> G.nodes[0]['foo'] = 'bar'
>>> list(G.nodes(data=True))
[(0, {'foo': 'bar'}), (1, {'time': '5pm'}), (2, {})]
>>> list(G.nodes.data())
[(0, {'foo': 'bar'}), (1, {'time': '5pm'}), (2, {})]
>>> list(G.nodes(data='foo'))
[(0, 'bar'), (1, None), (2, None)]
>>> list(G.nodes(data='time'))
[(0, None), (1, '5pm'), (2, None)]
>>> list(G.nodes(data='time', default='Not Available'))
[(0, 'Not Available'), (1, '5pm'), (2, 'Not Available')]
```
If some of your nodes have an attribute and the rest are assumed to have a default attribute value you can create a dictionary from node/attribute pairs using the default keyword argument to guarantee the value is never None:
```python
>>> G = nx.Graph()
>>> G.add_node(0)
>>> G.add_node(1, weight=2)
>>> G.add_node(2, weight=3)
>>> dict(G.nodes(data='weight', default=1))
{0: 1, 1: 2, 2: 3}
```
----------------------------
In utils.py, the code checks whether G.nodes()[0] is an int (i.e., whether nodes carry no nodedata).
If it is an int, n is converted to int; otherwise n is returned unchanged.
c. The conversion() function
```python
id_map = json.load(open(prefix + "-id_map.json"))
id_map = {conversion(k): int(v) for k, v in id_map.items()}
```
The conversion() function defined earlier is used here for id_map: the file on disk is read into memory and stored in the dict id_map.

id_map.json has the format {"0": 0, "1": 1, ..., "14754": 14754}, so when iterating over id_map, each key k is a str and each value v is an int. Since the nodes in this data file carry nodedata, G.nodes()[0] is not a plain int, so conversion is defined as def conversion(n): return n, and the returned n keeps the str type of the key k.

But why return int(n) when G.nodes()[0] is an int? Because JSON object keys are always strings: if the graph stores its node ids as ints, the string keys read from id_map.json must be cast back to int so that later lookups by node id (e.g. id_map[node]) succeed.
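A small self-contained illustration of that cast (hypothetical data, not from the repo):

```python
import json

# JSON object keys are always strings, so they are loaded as str:
id_map = json.loads('{"0": 0, "1": 1}')
graph_nodes = [0, 1]          # node ids stored as ints in the graph

conversion = int              # chosen because isinstance(graph_nodes[0], int)
id_map = {conversion(k): int(v) for k, v in id_map.items()}

assert graph_nodes[0] in id_map   # lookups by int node id now succeed
```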
4. walks: toy-ppi-walks.txt holds random-walk co-occurrence pairs — each line is a pair of node ids that appeared near each other on a random walk. With load_walks=True, load_data reads these pairs, and they serve as the positive (context) pairs of the unsupervised loss.
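For reference, load_data reads the file roughly like this (using prefix and the conversion function from the surrounding code):

```python
walks = []
if load_walks:
    with open(prefix + "-walks.txt") as fp:
        for line in fp:
            # each line: "src dst", one co-occurring node pair
            walks.append(map(conversion, line.split()))
```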
5. class_map: {"0": [0,1,...], "1": [0,1,...], ...} — what does it mean? It maps each node id to the node's label.
list(class_map.values()) is [[...], [...], ..., [...]], the list of all labels.
list(class_map.values())[0] is the label of one node; inspecting it tells us which label format the dataset uses: a list of 0/1 indicators (multi-label, as in PPI) or a single class index (single-label).
```python
if isinstance(list(class_map.values())[0], list):
    def lab_conversion(n): return n
else:
    def lab_conversion(n): return int(n)
```
6. Why does lab_conversion return n when the value is a list, while conversion in (1) returns int(n) when the node id is an int? Because the two functions convert different things: conversion fixes the keys of id_map (JSON keys are always strings and must be cast back when node ids are ints), whereas lab_conversion fixes the values of class_map — a multi-label list can be used as-is, while a scalar label may arrive as a string and is cast to int.
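A hypothetical example of the two class_map layouts this code handles:

```python
# Multi-label (e.g. PPI): values are already lists of 0/1 indicators,
# so lab_conversion is the identity.
class_map_multi = {"0": [0, 1, 1], "1": [1, 0, 0]}

# Single-label: each value is a single class index, which JSON may deliver
# as a string, hence the int(n) cast.
class_map_single = {"0": "2", "1": "5"}
```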
7. Why does load_data do the following?
# Remove all nodes that do not have val/test annotations
# (necessary because of networkx weirdness with the Reddit data)
What are val/test for? Every node is expected to carry boolean 'val' and 'test' attributes marking which split it belongs to; nodes missing these annotations cannot be assigned to a split, so they are removed.
8. broken_count counts the nodes that were removed because they lack the val or test annotation.
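The removal loop in utils.py looks roughly like this:

```python
broken_count = 0
for node in G.nodes():
    if not 'val' in G.node[node] or not 'test' in G.node[node]:
        G.remove_node(node)
        broken_count += 1
print("Removed {:d} nodes that lacked proper annotations".format(broken_count))
```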
9.
# Make sure the graph has edge train_removed annotations
# (some datasets might already have this..)
What is train_removed? An edge attribute: an edge is marked train_removed = True when either endpoint is a val or test node, so that such edges can be excluded when the training adjacency is built (see construct_adj in minibatch.py below).
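The annotation loop in utils.py is roughly:

```python
for edge in G.edges():
    if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
            G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
        G[edge[0]][edge[1]]['train_removed'] = True
    else:
        G[edge[0]][edge[1]]['train_removed'] = False
```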
neigh_samplers.py
```python
class UniformNeighborSampler(Layer):
    """
    Uniformly samples neighbors.
    Assumes that adj lists are padded with random re-sampling
    """
    def __init__(self, adj_info, **kwargs):
        super(UniformNeighborSampler, self).__init__(**kwargs)
        self.adj_info = adj_info

    def _call(self, inputs):
        ids, num_samples = inputs
        adj_lists = tf.nn.embedding_lookup(self.adj_info, ids)
        adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
        adj_lists = tf.slice(adj_lists, [0, 0], [-1, num_samples])
        return adj_lists
```
1. tf.nn.embedding_lookup(self.adj_info, ids) gathers, for each id, the corresponding row of the adjacency table adj_info, producing one padded neighbor list per input node.
(The original post contained an ASCII diagram here.) The idea: tf.random_shuffle only shuffles along the first dimension, so the code transposes the neighbor matrix, shuffles the neighbor slots, transposes back, and then slices off the first num_samples columns.
After the slice we have effectively drawn num_samples neighbors at random for every node, keeping all of their attributes.
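A NumPy sketch of the same transpose → shuffle → transpose → slice pipeline (toy data, not from the repo):

```python
import numpy as np

adj_lists = np.array([[ 1,  2,  3,  4,  5],    # padded neighbor slots of node 0
                      [ 6,  7,  8,  9, 10],    # ... of node 1
                      [11, 12, 13, 14, 15]])   # ... of node 2
num_samples = 2

shuffled = adj_lists.T.copy()        # neighbor slots become rows
np.random.shuffle(shuffled)          # shuffle along the first axis (the slots)
shuffled = shuffled.T                # back to one row per node
sampled = shuffled[:, :num_samples]  # keep num_samples random neighbors per node
print(sampled.shape)                 # (3, 2)
```

Note that, exactly like tf.random_shuffle here, one shared permutation of the neighbor slots is applied to all rows.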
models.py
1. namedtuple
https://www.cnblogs.com/chenlin163/p/7259061.html
```python
# SAGEInfo is a namedtuple that specifies the parameters
# of the recursive GraphSAGE layers
SAGEInfo = namedtuple("SAGEInfo",
    ['layer_name',     # name of the layer (to get feature embedding etc.)
     'neigh_sampler',  # callable neigh_sampler constructor
     'num_samples',
     'output_dim'      # the output (i.e., hidden) dimension
    ])
```
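A quick usage sketch (the sampler field is left as None purely for illustration):

```python
from collections import namedtuple

SAGEInfo = namedtuple("SAGEInfo",
                      ['layer_name', 'neigh_sampler', 'num_samples', 'output_dim'])

info = SAGEInfo("node", None, 25, 128)
print(info.layer_name, info.num_samples)   # fields accessible by name: node 25
print(info == ("node", None, 25, 128))     # still behaves like a plain tuple: True
```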
supervised_train.py
```python
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')
```
Only two layers are used: S1 = 25 samples in the first layer, S2 = 10 in the second.
```python
if FLAGS.model == 'graphsage_mean':
    # Create model
    sampler = UniformNeighborSampler(adj_info)
    layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                   SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

    model = SampleAndAggregate(placeholders,
                               features,
                               adj_info,
                               minibatch.deg,
                               layer_infos=layer_infos,
                               model_size=FLAGS.model_size,
                               identity_dim=FLAGS.identity_dim,
                               logging=True)
```
So layer_infos holds the information for these two layers.
The dim flags are defined as:
```python
flags.DEFINE_integer(
    'dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer(
    'dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
```
Since concat is used here (the self embedding is concatenated with the aggregated neighbor embedding), the final output dimension is 2 × dim.
Questions:
1. When FLAGS.model == 'graphsage_mean', why is SampleAndAggregate created without an aggregator_type argument?
See models.py: the default is "mean", so it does not need to be specified.
models.py
1. What exactly does model_size = small / big change? See aggregators.py: small uses hidden_dim = 512; big uses hidden_dim = 1024.
2. tf.get_variable https://blog.csdn.net/u012223913/article/details/78533910?locationNum=8&fps=1
Fetches an existing variable (the name and the other parameters, such as the initializer, must all match), or creates a new one if it does not exist.
Any initialization method can be used; no explicit value has to be given.
```python
W = tf.get_variable(name, shape=None, dtype=tf.float32, initializer=None,
                    regularizer=None, trainable=True, collections=None)
```
tf.get_variable() is recommended because:
(1) Initialization is more convenient, e.g. with xavier_initializer:
1 W = tf.get_variable("W", shape=[784, 256], initializer=tf.contrib.layers.xavier_initializer())
(2) It makes variable sharing convenient.
tf.get_variable() checks whether a variable with the same name already exists in the current scope, which is what enables sharing; tf.Variable always creates a brand-new variable.
Note that tf.get_variable() must be used together with reuse and tf.variable_scope() — see the sketch under item 4 (tf.variable_scope) below.
3. class SampleAndAggregate:
```python
'''
Args:
    - placeholders: Stanford TensorFlow placeholder object.
    - features: Numpy array with node features.
        NOTE: Pass a None object to train in featureless mode (identity features for nodes)!
    - adj: Numpy array with adjacency lists (padded with random re-samples)
    - degrees: Numpy array with node degrees.
    - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all
        the recursive layers. See SAGEInfo definition above.
    - concat: whether to concatenate during recursive iterations
    - aggregator_type: how to aggregate neighbor information
    - model_size: one of "small" and "big"
    - identity_dim: Set to positive int to use identity features (slow and cannot generalize,
        but better accuracy)
'''
```
What is adj and how is it computed? It is the padded adjacency-list matrix built by construct_adj() in minibatch.py (covered below): row i holds exactly max_degree neighbor indices of node i, subsampled or padded by random re-sampling.
What is identity_dim? When set to a positive int, every node gets a trainable "identity" embedding of that dimension that is used as (or concatenated to) its features — it is not the number of attribute kinds.
get_shape().as_list(): https://blog.csdn.net/m0_37393514/article/details/82226754
models.py
class Model(object):
1. self.__class__.__name__.lower()
```python
if not name:
    name = self.__class__.__name__.lower()
```
self.__class__.__name__.lower(): https://stackoverflow.com/questions/36367736/use-name-as-attribute
```python
class MyClass:
    def __str__(self):
        return str(self.__class__)
```

```python
>>> instance = MyClass()
>>> print(instance)
__main__.MyClass
```
That is because the string version of the class includes the module that it is defined in. In this case, it is defined in the module that is currently being executed, the shell, so it shows up as __main__.MyClass. If we use self.__class__.__name__, however:
```python
class MyClass:
    def __str__(self):
        return self.__class__.__name__

instance = MyClass()
print(instance)
```
it outputs:
MyClass
The __name__ attribute of the class does not include the module.
Note: The __name__ attribute gives the name originally given to the class. Any copies will keep the name. For example:
```python
class MyClass:
    def __str__(self):
        return self.__class__.__name__

SecondClass = MyClass

instance = SecondClass()
print(instance)
```
output:
MyClass
That is because the __name__ attribute is defined as part of the class definition. Using SecondClass = MyClass is just assigning another name to the class. It does not modify the class or its name in any way.
2. allowed_kwargs = {'name', 'logging', 'model_size'}
What do name, logging, and model_size refer to?
name: String, defines the variable scope of the layer.
logging: Boolean, switches TensorFlow histogram logging on/off.
model_size: "small" or "big"; see aggregators.py: small → hidden_dim = 512, big → hidden_dim = 1024.
3. *args and **kwargs in Python
https://blog.csdn.net/anhuidelinger/article/details/10011013
```python
def foo(*args, **kwargs):
    print('args = ', args)
    print('kwargs = ', kwargs)
    print('---------------------------------------')

if __name__ == '__main__':
    foo(1, 2, 3, 4)
    foo(a=1, b=2, c=3)
    foo(1, 2, 3, 4, a=1, b=2, c=3)
    foo('a', 1, None, a=1, b='2', c=3)

# Output:
# args =  (1, 2, 3, 4)
# kwargs =  {}

# args =  ()
# kwargs =  {'a': 1, 'c': 3, 'b': 2}

# args =  (1, 2, 3, 4)
# kwargs =  {'a': 1, 'c': 3, 'b': 2}

# args =  ('a', 1, None)
# kwargs =  {'a': 1, 'c': 3, 'b': '2'}
```
1. As shown, these are Python's variadic parameters.
*args collects any number of positional arguments into a tuple;
**kwargs collects keyword arguments into a dict.
When both are used, *args must come before **kwargs in the parameter list.
A call like foo(a=1, b='2', c=3, 'a', 1, None) fails with "SyntaxError: non-keyword arg after keyword arg", because positional arguments may not follow keyword arguments.
2. When to use **kwargs:
Using **kwargs and default values is easy. Sometimes, however, you shouldn't be using **kwargs in the first place.
In this case, we're not really making best use of **kwargs.
```python
class ExampleClass(object):
    def __init__(self, **kwargs):
        self.val = kwargs.get('val', "default1")
        self.val2 = kwargs.get('val2', "default2")
```
The above is a "why bother?" declaration. It is the same as
```python
class ExampleClass(object):
    def __init__(self, val="default1", val2="default2"):
        self.val = val
        self.val2 = val2
```
When you're using **kwargs, you mean that a keyword is not just optional, but conditional. There are more complex rules than simple default values.
When you're using **kwargs, you usually mean something more like the following, where simple defaults don't apply.
```python
class ExampleClass(object):
    def __init__(self, **kwargs):
        self.val = "default1"
        self.val2 = "default2"
        if "val" in kwargs:
            self.val = kwargs["val"]
            self.val2 = 2 * self.val
        elif "val2" in kwargs:
            self.val2 = kwargs["val2"]
            self.val = self.val2 / 2
        else:
            raise TypeError("must provide val= or val2= parameter values")
```
3. logging = kwargs.get('logging', False) — default value: False
https://stackoverflow.com/questions/1098549/proper-way-to-use-kwargs-in-python
You can pass a default value to get() for keys that are not in the dictionary:
```python
self.val2 = kwargs.get('val2', "default value")
```
However, if you plan on using a particular argument with a particular default value, why not use named arguments in the first place?
```python
def __init__(self, val2="default value", **kwargs):
```
4. tf.variable_scope()
https://blog.csdn.net/IB_H20/article/details/72936574
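A minimal sketch of variable sharing with tf.variable_scope and reuse (TF 1.x, the API this repo uses):

```python
import tensorflow as tf

with tf.variable_scope("agg"):
    w1 = tf.get_variable("w", shape=[2, 3],
                         initializer=tf.contrib.layers.xavier_initializer())

with tf.variable_scope("agg", reuse=True):
    w2 = tf.get_variable("w")   # fetches the existing "agg/w" rather than creating a new one

assert w1 is w2
```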
layers.py
1. _LAYER_UIDS[layer_name] is set to 1 the first time a layer_name appears and incremented on each later appearance; in other words it counts how many times each layer_name has occurred, and the count serves as a distinct id.
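The helper in layers.py looks roughly like this:

```python
_LAYER_UIDS = {}

def get_layer_uid(layer_name=''):
    """Assign unique layer IDs by counting occurrences of each layer_name."""
    if layer_name not in _LAYER_UIDS:
        _LAYER_UIDS[layer_name] = 1
        return 1
    else:
        _LAYER_UIDS[layer_name] += 1
        return _LAYER_UIDS[layer_name]
```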
2. What is self.sparse_inputs = False? A flag recording whether the layer's inputs are sparse tensors (e.g. sparse one-hot features); when True, the layer has to use sparse dropout and sparse matrix multiplication instead of the dense ops.
models.py
1. masked_softmax_cross_entropy? See metrics.py:
```python
# Cross entropy error
if self.categorical:
    self.loss += metrics.masked_softmax_cross_entropy(
        self.outputs,
        self.placeholders['labels'],
        self.placeholders['labels_mask'])
```
```python
def masked_logit_cross_entropy(preds, labels, mask):
    """Logit cross-entropy loss with masking."""
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=preds, labels=labels)
    loss = tf.reduce_sum(loss, axis=1)
    mask = tf.cast(mask, dtype=tf.float32)
    mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.]))
    loss *= mask
    return tf.reduce_mean(loss)
```
Why mask? Only the nodes of the current split (e.g. the training nodes) should contribute to the loss. The mask zeroes out the per-node losses of all other nodes and renormalizes the remaining weights, so the mean is effectively taken only over the masked-in nodes.
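A toy illustration of the rescaling, mirroring the code above (assumed numbers):

```python
import tensorflow as tf

loss = tf.constant([0.5, 1.0, 2.0])                  # per-node losses
mask = tf.cast(tf.constant([1, 0, 1]), tf.float32)   # node 1 is in another split

mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.]))  # -> [0.5, 0.0, 0.5]
masked_loss = tf.reduce_mean(loss * mask)            # node 1 contributes nothing

with tf.Session() as sess:
    print(sess.run(masked_loss))                     # ~0.4167
```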
aggregators.py
class MeanAggregator:
1. glorot:
```python
with tf.variable_scope(self.name + name + '_vars'):
    self.vars['neigh_weights'] = glorot([neigh_input_dim, output_dim],
                                        name='neigh_weights')
    self.vars['self_weights'] = glorot([input_dim, output_dim],
                                       name='self_weights')
    if self.bias:
        self.vars['bias'] = zeros([self.output_dim], name='bias')
```
glorot is defined in inits.py (from .inits import glorot, zeros) and performs the weight initialization.
glorot_uniform(seed=None)
Glorot uniform initialization, also known as Xavier uniform initialization: parameters are drawn from a uniform distribution on [-limit, limit], with limit = sqrt(6 / (fan_in + fan_out)), where fan_in is the number of input units of the weight tensor and fan_out the number of output units.
```python
def glorot(shape, name=None):
    """Glorot & Bengio (AISTATS 2010) init."""
    init_range = np.sqrt(6.0 / (shape[0] + shape[1]))
    initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range,
                                dtype=tf.float32)
    return tf.Variable(initial, name=name)
```
minibatch.py
```python
def construct_adj(self):
    # Row i holds max_degree neighbor indices for node i; the value
    # len(id2idx) (one past the last real index) acts as the padding id,
    # and the extra final row belongs to that dummy padding node.
    adj = len(self.id2idx) * np.ones((len(self.id2idx) + 1, self.max_degree))
    deg = np.zeros((len(self.id2idx),))

    for nodeid in self.G.nodes():
        # training adjacency: skip val/test nodes entirely
        if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:
            continue
        # keep only edges that are not marked train_removed
        neighbors = np.array([self.id2idx[neighbor]
                              for neighbor in self.G.neighbors(nodeid)
                              if (not self.G[nodeid][neighbor]['train_removed'])])
        deg[self.id2idx[nodeid]] = len(neighbors)
        if len(neighbors) == 0:
            continue
        if len(neighbors) > self.max_degree:
            # too many neighbors: subsample without replacement
            neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
        elif len(neighbors) < self.max_degree:
            # too few neighbors: pad by sampling with replacement
            neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
        adj[self.id2idx[nodeid], :] = neighbors
    return adj, deg
```
1. Graph.neighbors()
Returns a list of the nodes connected to the node n.
Parameters: n (node) – A node in the graph.
Returns: nlist – A list of nodes that are adjacent to n.
Return type: list
Raises: NetworkXError – If the node n is not in the graph.
Notes
It is usually more convenient (and faster) to access the adjacency dictionary as G[n]:
```python
>>> G = nx.Graph()   # or DiGraph, MultiGraph, MultiDiGraph, etc
>>> G.add_edge('a', 'b', weight=7)
>>> G['a']
{'b': {'weight': 7}}
```
Examples
```python
>>> G = nx.Graph()   # or DiGraph, MultiGraph, MultiDiGraph, etc
>>> G.add_path([0, 1, 2, 3])
>>> G.neighbors(0)
[1]
```
2. np.random.choice(a, size=None, replace=True, p=None)
Arguments as used in the code: a = neighbors (the pool to sample from); size = max_degree (how many samples to draw); replace = whether to sample with replacement (True lets the same neighbor be drawn more than once) — it does not mean "replace the original matrix".
Return value: samples — a single item or an ndarray of the generated random samples.
(1) Why replace=False when the node has more neighbors than max_degree, but replace=True when it has fewer? With more neighbors than slots, we subsample without replacement so that the max_degree chosen neighbors are all distinct; with fewer neighbors, the row still has to be filled up to max_degree entries, which is only possible by sampling with replacement (some neighbors repeat). Either way each row of adj ends up with exactly max_degree entries — see the toy check below.
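A toy check of the two branches (assuming max_degree = 4):

```python
import numpy as np

few = np.array([10, 11, 12])                        # fewer neighbors than max_degree
padded = np.random.choice(few, 4, replace=True)     # pads by repeating some ids
print(padded)                                       # e.g. [11 10 11 12]

many = np.arange(20)                                # more neighbors than max_degree
subset = np.random.choice(many, 4, replace=False)   # all four are distinct
print(subset)                                       # e.g. [ 3 17  5  9]
```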