• [论文精读]Semi-Supervised Classification with Graph Convolutional Networks


    论文原文:[1609.02907] Semi-Supervised Classification with Graph Convolutional Networks (arxiv.org)

    论文代码:GitHub - tkipf/gcn: Implementation of Graph Convolutional Networks in TensorFlow

    英文是纯手打的!论文原文的summarizing and paraphrasing。可能会出现难以避免的拼写错误和语法错误,若有发现欢迎评论指正!文章偏向于笔记,谨慎食用!

    目录

    1. 省流版

    1.1. 心得

    1.2. 论文框架图

    2. 论文逐段阅读

    2.1. Abstract

    2.2. Introduction

    2.3. Fast approximate convolutions on graphs

    2.3.1. Spectral graph convolutions

    2.3.2. Layer-wise linear model

    2.4. Semi-supervised node classification

    2.4.1. Example

    2.4.2. Implementation

    2.5. Related work

    2.5.1. Graph-based semi-supervised learning

    2.5.2. Neural networks on graphs

    2.6. Experiments

    2.6.1. Datasets

    2.6.2. Experimental set-up

    2.6.3. Baselines

    2.7. Results

    2.7.1. Semi-supervised node classifiication

    2.7.2. Evaluation of propagation model

    2.7.3. Training time per epoch

    2.8. Discussion

    2.8.1. Semi-supervised model

    2.8.2. Limitations and future work

    2.9. Conclusion

    3. 知识补充

    3.1. Skip-gram

    3.2. Bag of words

    3.3. Spectral graph convolutions

    4. Reference List


    1. 省流版

    1.1. 心得

    (1)怎么开头我就不知道在说什么啊这个论文感觉表述不是很清晰?

    (2)数学部分推理很清晰

    (3)要是模型展示更清晰就好了

    1.2. 论文框架图

    2. 论文逐段阅读

    2.1. Abstract

            ①Their convolution is based on localized first-order approximation

            ②They encode node features and local graph structure in hidden layers

    2.2. Introduction

            ①The authors think adopting Laplacian regularization in the loss function helps to label:

    \mathcal{L}=\mathcal{L}_0+\lambda\mathcal{L}_{\text{reg}},\quad\\\\\text{with}\quad\mathcal{L}_{\text{reg}}=\sum_{i,j}A_{ij}\|f(X_i)-f(X_j)\|^2=f(X)^\top\Delta f(X)

    where \mathcal{L}_0 represents supervised loss with labeled data, 

    f\left ( \right ) is a differentiable function,

    \lambda denotes weight,

    X denotes matrix with combination of node feature vectors,

    \triangle =D-A represents the unnormalized graph Laplacian,

    A is adjacency matrix,

    D is degree matrix.

            ②The model trains labeled nodes and is able to learn labeled and unlabeled nodes

            ③GCN achieves higher accuracy and efficiency than others

    2.3. Fast approximate convolutions on graphs

            ①GCN (undirected graph):

    H^{(l+1)}=\sigma\Big(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\Big)

    where \tilde{A} denotes self-connection adjacency matrix, which means \tilde{A}=A+I_{N},

    I_{N} denotes identity matrix,

    \tilde{D} denotes self-connection degree matrix,

    W^{(l)} represents the trainable weight matrix in l-th layer,

    H^{(l)} denotes the activation matrix in l-th layer,

    \sigma \left ( \right ) represents activation function

    2.3.1. Spectral graph convolutions

            ①Spectral convolutions on graphs:

    g_{\theta }\star x=Ug_{\theta }U^{T}x

    where the filter g_{\theta }=diag\left ( \theta \right ),

    U comes from normalized graph Laplacian L=I_{N}-D^{-\frac12}AD^{-\frac12}=U\Lambda U^{T}and is the matrix of L's eigenvectors,

    \Lambda denotes a diagonal matrix with eigenvalues.

            ②However, it is too time-consuming to compute matrix especially for large graph. Ergo, approximating it in K-th order by Chebyshev polynomials:

    g_{\theta^{\prime}}(\Lambda)\approx\sum_{k=0}^K\theta_k^{\prime}T_k(\tilde{\Lambda})

    where \tilde{\Lambda}=\frac{2}{\lambda_{\max}}\Lambda-I_{N},

    {\theta }' denotes Chebyshev coefficients vector,

    recursive Chebyshev polynomials are T_{k}\left ( x \right )=2xT_{k-1}(x)-T_{k-2}(x) with baseline T_{0}(x)=1 and T_{1}(x)=x

            ③Then get new function:

    g_{\theta'}\star x\approx\sum_{k=0}^{K}\theta'_kT_k(\tilde{L})x

    where \tilde{L}=\frac{2}{\lambda_{\max}}L-I_{N}(U\Lambda U^\top)^k=U\Lambda^kU^\top.

            ④Through this approximation method, time complexity reduced from O\left ( n^{2} \right ) to O\left ( E \right )

    2.3.2. Layer-wise linear model

            ①Then, the authors stack the function above to build multiple conv layers and set K=1\lambda _{max}\approx 2

            ②They simplified 2.3.1. ③ to:

    g_{\theta'}\star x\approx\theta'_0x+\theta'_1\left(L-I_N\right)x=\theta'_0x-\theta'_1D^{-\frac{1}{2}}AD^{-\frac{1}{2}}x

    where \theta'_0 and \theta'_1 are free parameters 

            ③Nevertheless, more parameters bring more overfitting problem. It leads the authors change the expression to:

    g_\theta\star x\approx\theta\left(I_N+D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\right)x

    where they define \theta=\theta_0^{\prime}=-\theta_1^{\prime} ,

    eigenvalues are in \left [ 0,2 \right ] .

    But keep using it may cause exploding/vanishing gradients or numerical instabilities.

            ④Then they adjust I_{N}+D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\rightarrow\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}

            ⑤The convolved signal matrix \begin{aligned}Z&\in\mathbb{R}^{N\times F}\end{aligned}:

    Z=\tilde{D}^{-\frac12}\tilde{A}\tilde{D}^{-\frac12}X\Theta

    where C denotes input channels, namely feature dimensionality of each node,

    F denotes the number of filters or feature maps,

    \Theta\in\mathbb{R}^{C\times F} represents matrix of filter parameters

    2.4. Semi-supervised node classification

            The whole model shows:

    2.4.1. Example

            ①They take 2 layers model

            ②Forward propagation:

    Z=f(X,A)=\text{softmax}\Big(\hat{A}\, \text{ReLU}\Big(\hat{A}XW^{(0)}\Big)W^{(1)}\Big)

    where W^{(0)}\in\mathbb{R}^{C\times H} is a weight matrix with H feature maps, it crosses input layer to hidden layer,

    W^{(1)}\in\mathbb{R}^{H\times F} is a weight matrix crossing hidden layer to output layer,

    ReLU is for every row

            ③Cross entropy:

    \mathcal{L}=-\sum_{l\in\mathcal{Y}_L}\sum_{f=1}^FY_{lf}\ln Z_{lf}

    where \mathcal{Y}_{L} is the set of node indices of which nodes with labels

            ④They adopt mini-batch stochastic gradient descent

    2.4.2. Implementation

            ①Framework: TensorFlow

            ②Method: sparse-dense matrix multiplications

            ③Computational complexity: O\left ( \left | E \right |CHF \right )

    2.5. Related work

    2.5.1. Graph-based semi-supervised learning

    (1)Previous graph representations:

            ①Explicit graph Laplacian regularization: such as deep semi-supervised embedding, manifold regularization, label propagation 

            ②Graph-embedding

    (2)Recent works:

            ①DeepWalk

            ②LINE

            ③node2vec

    2.5.2. Neural networks on graphs

            ①Briefly introduced other works

            ②Their approach is based on spectral graph convolutional neural networks

    2.6. Experiments

    They tested their model in:

            ①Semi-supervised document classification in citation networks

            ②Semi-supervised entity classification in a bipartite graph extracted from a knowledge graph

            ③Evaluation of various graph propagation models

            ④Run-time analysis on random graphs

    bipartite  adj.两部分组成的;有两个部分的

    2.6.1. Datasets

            ①Datasets table:

    where documents represented by nodes, citation links represented by edges;

    also, Label\, rate=\frac{training\, \, labeled\, \, nodes}{total\, \, nodes};

    NELL is a bipartite graph dataset including 55,864 relation nodes and 9,891 entity nodes

            ②Citation networks: sparse bag-of-words feature vectors and list of citation links for and between documents are contained. Besides, undirected edges are citation links and 20 labels for one class only

            ③NELL: edges are labeled, directed connections. Additionally, one-hot representation is adopted in each node

            ④Random graphs: randomly genetate graph with N nodes and 2N edges

    2.6.2. Experimental set-up

            ①Labeled sample test set: 1,000

            ②They usually adopt 2 layers, but further test extra 10 layers in appendix b

            ③Epoch: under 200

            ④Optimizer: Adam

            ⑤Stop training when there is no decreasing of loss for 10 consecutive epochs

            ⑥Hidden layer includes 32 units

            ⑦Omit dropout and regularization

    2.6.3. Baselines

            ①They train local classifier with labeled data and then train unlabeled with a random node ordering for 10 iterations

            ②L2 regularization parameter and aggregation operator selected by performance of validation set

    2.7. Results

    2.7.1. Semi-supervised node classifiication

    (1)Comparison of accuracy table with mean accuracy of 100 times in random ordering:

    (2)Parameters for Citeseer, Cora and Pubmed

            ①Dropout rate: 0.5

            ②L2 regularization: 5e-4

            ③Hidden units number: 16

    (3)Parameters for NELL

            ①Dropout rate: 0.1

            ②L2 regularization: 1e-5

            ③Hidden units number: 64

    2.7.2. Evaluation of propagation model

            They compared variants of their propagation model on citation network datasets, where GCN is the bold one:

    2.7.3. Training time per epoch

            Mean training time per epoch of 100 epochs. Each epoch contains forward propagation, cross entropy computing, backward propagation:

    where * denotes memory explosion

    2.8. Discussion

    2.8.1. Semi-supervised model

            ①The authors attribute low efficiency of graph-Laplacian regularization to only encoding similar nodes

            ②In addition, they think skip-gram is hard to optimize

            ③Then, they reckon renormalized propagation is better than na¨ıve 1 st-order model and higher-order graph convolutional models with Chebyshev polynomials

    2.8.2. Limitations and future work

            ①Lack of memory. Slightly alleviated by mini-batch stochastic gradient descent

            ②The model does not support directed edge or edge features. (后面说的那句话不是很能理解,翻译成中文是“通过将原始有向图表示为具有原始图中表示边缘的附加节点的无向二部图,可以同时处理有向边和边缘特征”。难道说原本是一个普通的图,把每条边都变成一个节点放在另一边,然后和原先的节点合起来变成二部图?但是很奇怪啊)

            ③Weight of self-connection might vary in different graphs:

    \tilde{A}=A+\lambda I_{N}

    2.9. Conclusion

            They proposed a novel and efficient semi-supervised graph convolutional network

    3. 知识补充

    3.1. Skip-gram

    相关链接:理解 Word2Vec 之 Skip-Gram 模型 - 知乎 (zhihu.com)

    3.2. Bag of words

    相关链接:使用 Python 的 NLP 中的单词袋简介 |什么是 BoW? (mygreatlearning.com)

    3.3. Spectral graph convolutions

    相关链接:谱图卷积 - 知乎 (zhihu.com)

    4. Reference List

    Kipf, T. & Welling, M. (2017) 'Semi-Supervised Classification with Graph Convolutional Networks', ICLR 2017, doi: https://doi.org/10.48550/arXiv.1609.02907

  • 相关阅读:
    SSM框架之MyBatis入门(Maven工程实现全查功能,快速入门,适合小白)
    收一个快手协议下单算法
    JNI学习5.jstring的处理
    QT连接数据库
    聚观早报 | 美团打车接入腾讯出行;腾讯音乐将在港交所上市
    【Autosar 存储栈Memery Stack 4.Tc397的Flash编程】
    CS信息系统建设和服务能力资质办理指南
    java面向对象----封装 && 构造器
    3.51 什么是平坦式原理图?什么是层次式电路设计?它的优点有哪些?
    私域运营如何提高收益?
  • 原文地址:https://blog.csdn.net/Sherlily/article/details/133867185