Week 4, Lesson 03: Guided Paper Reading + GAT

Posted by oldmao_2001



This post is organized from the Deep Eye (深度之眼) course "GNN Core Competency Training Program" (《GNN核心能力培养计划》).

Paper 1: Skim Reading

Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning
The main takeaway from this paper: why GCNs are usually not very deep.

Abstract: structural analysis

Many interesting problems in machine learning are being revisited with new deep learning tools. For graph-based semisupervised learning, a recent important development is graph convolutional networks (GCNs), which nicely integrate local vertex features and graph topology in the convolutional layers.
A transition from machine learning in general to GCNs, highlighting their key property: graph convolution integrates local vertex features and graph topology.

Although the GCN model compares favorably with other state-of-the-art methods, its mechanisms are not clear and it still requires considerable amount of labeled data for validation and model selection.
A turn: the current problems with GCNs are raised: the mechanism is not well understood (poor interpretability), and the model still needs a considerable amount of labeled data for validation and model selection.

In this paper, we develop deeper insights into the GCN model and address its fundamental limits.
A transition sentence; the paper's work is then introduced in two points.

First, we show that the graph convolution of the GCN model is actually a special form of Laplacian smoothing, which is the key reason why GCNs work, but it also brings potential concerns of oversmoothing with many convolutional layers.
Point 1: they reveal what graph convolution really is (a special form of Laplacian smoothing); because of this, GCNs with many convolutional layers suffer from over-smoothing (i.e. they identify the root cause of the problem).

Second, to overcome the limits of the GCN model with shallow architectures, we propose both co-training and self-training approaches to train GCNs.
Point 2: they propose two training approaches, co-training and self-training, to overcome the limits of shallow GCN architectures.

Our approaches significantly improve GCNs in learning with very few labels, and exempt them from requiring additional labels for validation. Extensive experiments on benchmarks have verified our theory and proposals.
The advantages and novelty of the proposed methods.

Introduction

The introduction opens with motivation and background, then prior work on unsupervised and semi-supervised learning, then work related to GCNs. Next comes the paper's own line of thought:
In this paper, we demystify the GCN model for semisupervised learning.
The general statement: what we did.

In particular, we show that the graph convolution of the GCN model is simply a special form of Laplacian smoothing, which mixes the features of a vertex and its nearby neighbors.
The specifics: what exactly that means, "we show…".

The smoothing operation makes the features of vertices in the same cluster similar, thus greatly easing the classification task, which is the key reason why GCNs work so well.
Further analysis of the above explains why GCNs work so well: the smoothing operation (graph convolution) makes vertices in the same cluster end up with similar embeddings.

However, it also brings potential concerns of over-smoothing.
The problem is raised.

If a GCN is deep with many convolutional layers, the output features may be oversmoothed and vertices from different clusters may become indistinguishable.
A concrete description of the problem: too many convolutional layers over-smooth the features, reducing the distinguishability of embeddings of vertices from different clusters.

Also, adding more layers to a GCN will make it much more difficult to train.
The final blow: more convolutional layers also make the model much harder to train.

However, a shallow GCN model such as the two-layer GCN used in (Kipf and Welling 2017) has its own limits.
Another turn: shallow GCNs have their own limitations too.

Besides that it requires many additional labels for validation, it also suffers from the localized nature of the convolutional filter. When only few labels are given, a shallow GCN cannot effectively propagate the labels to the entire data graph.
First, they need many additional labels for validation; second, with only a few labels, it is hard for the label information to propagate across the entire graph.

As illustrated in Fig. 1, the performance of GCNs drops quickly as the training size shrinks, even for the one with 500 additional labels for validation.
With a larger training set the performance holds up, but as the training data shrinks, the accuracy of the GCN without a validation set drops sharply (the green line at the bottom).
[Figure 1 from the paper: GCN accuracy as the training set size shrinks]
The concrete approaches are filled in later in the paper.

Preliminaries and Related Works

The second section, Preliminaries and Related Works, covers graph-based semi-supervised learning and then focuses on GCNs.
It gives the concrete propagation rule. From Equation 4,

$$H^{(l+1)}=\sigma\left(\tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}H^{(l)}\Theta^{(l)}\right)\tag{4}$$

we can see that $\tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}$ is exactly the Laplacian smoothing term.
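To make the smoothing term concrete, here is a minimal NumPy sketch on a made-up 4-node graph (my own toy example, not from the paper) that builds $\tilde D^{-\frac{1}{2}}\tilde A\tilde D^{-\frac{1}{2}}$ and applies one GCN layer as in Equation 4:

```python
import numpy as np

# Toy undirected graph with 4 nodes (made up for illustration, not from the paper)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_tilde = A + np.eye(len(A))              # A~ = A + I (add self-loops)
d = A_tilde.sum(axis=1)                   # degrees of A~
S = A_tilde / np.sqrt(np.outer(d, d))     # D~^{-1/2} A~ D~^{-1/2}, the smoothing operator

H0 = np.random.randn(len(A), 8)           # input features H^(0)
Theta = np.random.randn(8, 16)            # layer weights Theta^(0)
H1 = np.maximum(S @ H0 @ Theta, 0.0)      # Eq. 4 with sigma = ReLU
print(H1.shape)                           # (4, 16)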
It then covers semi-supervised classification with GCNs; see Equation 6 of the paper:

$$L:=-\sum_{i\in V_l}\sum_{f=1}^{F}Y_{if}\ln Z_{if}\tag{6}$$

$i\in V_l$ means the loss is computed only over labeled vertices (hence semi-supervised); the rest is a cross-entropy term, where $Y_{if}$ is the ground truth and $Z_{if}$ the prediction. $F$ is the output dimension, equal to the number of classes.
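A minimal PyTorch sketch of this masked cross-entropy, assuming the GCN's softmax output `Z`, integer `labels`, and a boolean `train_mask` marking $V_l$; all shapes and names are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: Z is the GCN's softmax output for all N vertices,
# labels holds class indices, train_mask marks the labeled set V_l.
N, num_classes = 100, 7
Z = torch.rand(N, num_classes).softmax(dim=1)   # predictions Z_if (rows sum to 1)
labels = torch.randint(0, num_classes, (N,))    # ground-truth classes Y_i
train_mask = torch.zeros(N, dtype=torch.bool)
train_mask[:20] = True                          # pretend only 20 vertices are labeled

# Eq. 6: cross-entropy summed over labeled vertices only
loss = F.nll_loss(torch.log(Z[train_mask]), labels[train_mask], reduction="sum")
print(loss.item())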

Analysis

Here is an interesting part: to prove the importance of the Laplacian smoothing term, the authors remove it from Equation 4, turning the GCN into a fully-connected network (FCN):

$$H^{(l+1)}=\sigma\left(H^{(l)}\Theta^{(l)}\right)\tag{7}$$

This is equivalent to ignoring the graph structure entirely: each vertex is classified from its own features, with no neighbor aggregation at all. The results collapse:
[Table 1 from the paper: FCN vs. GCN classification performance]
This shows that the Laplacian smoothing term matters a great deal.
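The difference between Equations 4 and 7 is just the presence of the smoothing operator; a two-line sketch (random tensors standing in for real data) makes the contrast explicit:

```python
import torch

num_nodes, d_in, d_out = 4, 8, 16
S = torch.rand(num_nodes, num_nodes)      # stand-in for D~^{-1/2} A~ D~^{-1/2}
H = torch.randn(num_nodes, d_in)          # node features H^(l)
Theta = torch.randn(d_in, d_out)          # weights Theta^(l)

H_gcn = torch.relu(S @ H @ Theta)         # Eq. 4: smooth, then linear map
H_fcn = torch.relu(H @ Theta)             # Eq. 7: smoothing term dropped (plain FCN)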
Then, by comparing the original Laplacian smoothing formula (Equation 9 in the paper) with the GCN propagation formula, a conclusion is drawn:
The Laplacian smoothing computes the new features of a vertex as the weighted average of itself and its neighbors’. Since vertices in the same cluster tend to be densely connected, the smoothing makes their features similar, which makes the subsequent classification task much easier. As we can see from Table 1, applying the smoothing only once has already led to a huge performance gain.

Next, why a 2-layer GCN is better than a 1-layer one (two layers smooth more than one):
Multi-layer Structure. We can also see from Table 1 that while the 2-layer FCN only slightly improves over the 1-layer FCN, the 2-layer GCN significantly improves over the 1-layer GCN by a large margin. This is because applying smoothing again on the activations of the first layer makes the output features of vertices in the same cluster more similar and further eases the classification task.

The paper then asks whether more layers is always better:
A natural question is how many convolutional layers should be included in a GCN?
Certainly not the more the better. On the one hand, a GCN with many layers is difficult to train. On the other hand, repeatedly applying Laplacian smoothing may mix the features of vertices from different clusters and make them indistinguishable.
Here the authors run an experiment:
[Figure from the paper: results of GCNs with different numbers of layers]
By inspection, two layers give the best classification performance.
After the experiment, the authors also prove it theoretically:
In the following, we will prove that by repeatedly applying Laplacian smoothing many times, the features of vertices within each connected component of the graph will converge to the same values.
In the screenshot below, the part underlined in red corresponds to the Laplacian smoothing term, m is the number of times the smoothing is repeated, and w is a parameter; on the right-hand side you can see that the result is that all vertices end up with the same values.
[Screenshot of the paper's theorem: repeated Laplacian smoothing makes features within a connected component converge]
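The convergence is easy to reproduce numerically. Below is a small sketch on a hypothetical 5-node connected graph; for simplicity it uses the row-normalized operator $\tilde D^{-1}\tilde A$ (the theorem also covers the symmetric variant, up to per-node scaling). As m grows, the per-column spread across vertices goes to zero, i.e. all vertices converge to the same feature values.

```python
import numpy as np

# Hypothetical 5-node connected graph; repeated Laplacian smoothing demo.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(len(A))                       # add self-loops
P = A_tilde / A_tilde.sum(axis=1, keepdims=True)   # row-normalized D~^{-1} A~

X = np.random.randn(len(A), 3)                     # random node features
for m in [1, 2, 10, 100]:
    Xm = np.linalg.matrix_power(P, m) @ X
    # spread of each feature across vertices shrinks toward 0 as m grows
    print(m, np.ptp(Xm, axis=0).round(4))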
Besides the proof, the authors add a further remark:
Since label propagation only uses the graph information while GCNs utilize both structural and vertex features, it reflects the inability of the GCN model in exploring the global graph structure.
Label propagation (the blue line in Fig. 1) uses only the graph information during propagation, whereas the GCN uses both vertex features and graph structure; its convolutional filter is localized, so it mainly captures local information and is weak at exploring the global graph structure.

Solutions

Advantages and disadvantages:
The advantages are: 1) the graph convolution – Laplacian smoothing helps making the classification problem much easier; 2) the multi-layer neural network is a powerful feature extractor.
The disadvantages are: 1) the graph convolution is a localized filter, which performs unsatisfactorily with few labeled data; 2) the neural network needs considerable amount of labeled data for validation and model selection.

Paper 2: Skim Reading

DeepGCNs: Can GCNs Go as Deep as CNNs?
This paper borrows the ResNet idea from computer vision and introduces residual connections into GCNs, so that GCNs can go deeper.

Abstract

It starts with CNNs and notes that CNNs can go deep.
Convolutional Neural Networks (CNNs) achieve impressive performance in a wide variety of fields. Their success benefited from a massive boost when very deep CNN models were able to be reliably trained.

CNNs do not handle non-Euclidean data well, hence GCNs are introduced; it also mentions that concepts from CNNs can be carried over to GCNs.
Despite their merits, CNNs fail to properly address problems with non-Euclidean data. To overcome this challenge, Graph Convolutional Networks (GCNs) build graphs to represent non-Euclidean data, borrow concepts from CNNs, and apply them in training.

A turn: the problems with GCNs are raised.

The left side of Figure 1 in the paper shows that without residual connections, the deeper the GCN, the worse the performance (vanishing gradients).
The right side shows that once residual connections are added, performance takes off.
[Figure 1 from the paper: training deep GCNs with and without residual connections]
GCNs show promising results, but they are usually limited to very shallow models due to the vanishing gradient problem (see Figure 1).

Hence, without the help of residual connections, state-of-the-art GCNs are usually only 3-4 layers deep.

To solve this, the authors adapt residual/dense connections and dilated convolutions from CNNs to GCN architectures, addressing the problem that GCNs cannot go deep.
In this work, we present new ways to successfully train very deep GCNs. We do this by borrowing concepts from CNNs, specifically residual/dense connections and dilated convolutions, and adapting them to GCN architectures.

Finally, the authors make the case for their idea:
Extensive experiments show the positive effect of these deep GCN frameworks. Finally, we use these new concepts to build a very deep 56-layer GCN, and show how it significantly boosts performance (+3.7% mIoU over state-of-the-art) in the task of point cloud semantic segmentation. We believe that the community can greatly benefit from this work, as it opens up many opportunities for advancing GCN-based research.

3.2. Residual Learning for GCNs

For an introduction to ResNet and DenseNet, see zhuanlan.zhihu.com/p/37189203. So what form do residual connections take in a GCN? See Equation 3 of the paper:

$$\begin{aligned}\mathcal{G}_{l+1}&=\mathcal{H}(\mathcal{G}_{l},\mathcal{W}_{l})\\&=\mathcal{F}(\mathcal{G}_{l},\mathcal{W}_{l})+\mathcal{G}_{l}=\mathcal{G}_{l+1}^{res}+\mathcal{G}_{l}\end{aligned}\tag{3}$$
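A minimal sketch of what Equation 3 might look like in code, assuming a generic graph-convolution layer $\mathcal{F}$ with matching input/output dimensions; the simple smoothing-plus-linear layer below is a stand-in for illustration, not the paper's actual DeepGCNs implementation:

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """Toy stand-in for F(G_l, W_l): smoothing with a precomputed operator s, then a linear map."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, s):
        return torch.relu(self.linear(s @ x))

class ResGCNBlock(nn.Module):
    """Eq. 3: G_{l+1} = F(G_l, W_l) + G_l, a vertex-wise residual (skip) connection."""
    def __init__(self, dim):
        super().__init__()
        self.f = SimpleGCNLayer(dim)

    def forward(self, x, s):
        return self.f(x, s) + x                    # add the block input back

# Usage on dummy data: stacking such blocks keeps gradients flowing even when deep.
s = torch.eye(5)                                   # stand-in smoothing operator
x = torch.randn(5, 16)
out = ResGCNBlock(16)(x, s)
print(out.shape)                                   # torch.Size([5, 16])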

3.3. Dense Connections in GCNs

Following the DenseNet approach instead:

$$\begin{aligned}\mathcal{G}_{l+1}&=\mathcal{H}(\mathcal{G}_{l},\mathcal{W}_{l})\\&=\mathcal{T}(\mathcal{F}(\mathcal{G}_{l},\mathcal{W}_{l}),\mathcal{G}_{l})\\&=\mathcal{T}(\mathcal{F}(\mathcal{G}_{l},\mathcal{W}_{l}),\cdots,\mathcal{F}(\mathcal{G}_{0},\mathcal{W}_{0}),\mathcal{G}_{0})\end{aligned}\tag{4}$$
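A sketch of the dense variant, taking the aggregation $\mathcal{T}$ to be vertex-feature concatenation as in DenseNet; the layer widths ("growth rate") and the simple smoothing-plus-linear convolution are my own assumptions for illustration:

```python
import torch
import torch.nn as nn

class DenseGCNBlock(nn.Module):
    """Eq. 4 with T taken as vertex-feature concatenation (DenseNet style):
    each layer sees the concatenation of the input and all earlier layers' outputs."""
    def __init__(self, in_dim, growth, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(in_dim + i * growth, growth) for i in range(num_layers)]
        )

    def forward(self, x, s):                       # s: precomputed smoothing operator
        feats = x
        for layer in self.layers:
            new = torch.relu(layer(s @ feats))     # F(.) on the densely connected input
            feats = torch.cat([feats, new], dim=1) # T(.): concatenate with everything so far
        return feats

# Usage: output width = in_dim + num_layers * growth
out = DenseGCNBlock(16, 8, 3)(torch.randn(5, 16), torch.eye(5))
print(out.shape)                                   # torch.Size([5, 40])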
