[Paper Close Reading] Graph Attention Networks


    Paper: [1710.10903] Graph Attention Networks (arxiv.org)

    Code: https://github.com/PetarV-/GAT

    The English here is typed entirely by hand and is a summarizing and paraphrasing of the original paper. Unavoidable spelling and grammar mistakes may slip in; if you spot any, feel free to point them out in the comments! This post leans toward personal notes, so read with caution!

    Contents

    1. TL;DR

    1.1. Takeaways

    1.2. Paper framework diagram

    2. Section-by-section close reading

    2.1. Abstract

    2.2. Introduction

    2.3. GAT architecture

    2.3.1. Graph attention layer

    2.3.2. Comparisons to related work

    2.4. Evaluation

    2.4.1. Datasets

    2.4.2. State-of-the-art method

    2.4.3. Experimental setup

    2.4.4. Results

    2.5. Conclusions

    3. Supplementary knowledge

    3.1. Spectral and non-spectral approaches for GNN

    3.2. Spectral domain and frequency domain

    3.3. t-SNE

    4. Reference List


    1. TL;DR

    1.1. Takeaways

    (1) The Introduction seems to already cover the related work?

    (2) Big praise for the Datasets table; it saves me from having to summarize it myself.

    1.2. Paper framework diagram

    2. Section-by-section close reading

    2.1. Abstract

            ①They proposed graph attention networks (GATs), which are suitable for both inductive and transductive problems

            ②There is no need for costly matrix operations (such as matrix inversion) or for knowing the graph structure upfront

            ③They tested their model on the Cora, Citeseer and Pubmed citation network datasets, as well as on a protein-protein interaction dataset

    upfront  adj. paid in advance; frank; honest; candid  adv. in advance; paid beforehand

    2.2. Introduction

            ①CNNs have been widely used in machine translation, image classification and semantic segmentation. However, they cannot be applied to non-grid, i.e. irregularly structured, data such as social/telecommunication/biological networks, 3D meshes and brain connectomes. Graph structures describe such data more accurately

            ②Early works adopted recursive neural networks to process directed acyclic graphs

            ③They reviewed spectral and non-spectral approaches to graph processing

            ④Allowing inputs of variable size, the attention mechanism has been successfully used in NLP

            ⑤The attention mechanism can be parallelized across node-neighbor pairs, assigns different weights to different neighbors, and can be used in inductive learning

    acyclic  adj. containing no cycles; non-cyclic; non-periodic

    reminiscent  adj. nostalgic; reminding one of (a person or thing); recalling the past  n. one who reminisces

    2.3. GAT architecture

    2.3.1. Graph attention layer

            ①Input matrix:

    \mathbf{h}=\{\vec{h}_{1},\vec{h}_{2},\ldots,\vec{h}_{N}\},\vec{h}_{i}\in\mathbb{R}^{F},\mathbf{h}\in \mathbb{R}^{F\times N}

    where N denotes the number of nodes, F denotes the number of features

            ②A shared weight matrix \mathbf{W}\in\mathbb{R}^{F'\times F} first transforms each node's features into higher-level features; attention coefficients are then computed as:

    e_{ij}=a(\mathbf{W}\vec{h_i},\mathbf{W}\vec{h_j})

    where a:\mathbb{R}^{F'}\times \mathbb{R}^{F'}\rightarrow \mathbb{R} is an attention mechanism;

    j is a node in the neighborhood of node i;

    e_{ij} indicates the importance of node j's features to node i.

            ③Normalize the coefficients across each node's neighborhood:

    \alpha_{ij}=\text{softmax}_j(e_{ij})=\frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})}

    where \mathcal{N}_i denotes the neighborhood of node i, taken to be its first-order neighbors (including i itself).

            ④Expanding the attention function a(\cdot) explicitly:

    \alpha_{ij}=\frac{\exp\left(\text{LeakyReLU}\left(\vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_j]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\text{LeakyReLU}\left(\vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_k]\right)\right)}

    where a(\cdot) is a single-layer feedforward neural network parametrized by the weight vector \vec{\mathbf{a}}\in\mathbb{R}^{2F^{\prime}};

    LeakyReLU is used with a negative slope of 0.2;

    \| denotes concatenation.

            ⑤Applying a nonlinearity \sigma to obtain the output node features:

    \vec{h}_i'=\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\mathbf{W}\vec{h}_j\right)
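    Steps ②–⑤ amount to a single attention head. Below is a minimal NumPy sketch of that computation (an illustrative reimplementation, not the authors' code — the inputs h, adj, W, a and the choice of ELU as \sigma are assumptions for the example):

```python
import numpy as np

def elu(x):
    # ELU nonlinearity, used here as the sigma of step ⑤
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def gat_head(h, adj, W, a, negative_slope=0.2, activation=elu):
    """One GAT attention head (illustrative sketch).

    h   : (N, F)    input node features
    adj : (N, N)    adjacency with self-loops, so N_i contains i
    W   : (F, F')   shared weight matrix
    a   : (2F',)    weight vector of the single-layer attention network
    """
    Wh = h @ W                                        # W h_i for every node, shape (N, F')
    N = Wh.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            pair = np.concatenate([Wh[i], Wh[j]])     # [W h_i || W h_j]
            e[i, j] = a @ pair                        # a^T [W h_i || W h_j]
    e = np.where(e > 0, e, negative_slope * e)        # LeakyReLU with slope 0.2
    e = np.where(adj > 0, e, -np.inf)                 # keep only j in N_i
    alpha = np.exp(e - e.max(axis=1, keepdims=True))  # softmax_j over the neighborhood
    alpha /= alpha.sum(axis=1, keepdims=True)
    out = alpha @ Wh                                  # sum_j alpha_ij W h_j
    return activation(out) if activation is not None else out
```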

            ⑥They further introduce multi-head attention, concatenating the outputs of K independent attention heads:

    \vec h'_i=\parallel _{k=1}^{K}\sigma\left(\sum\limits_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)

    where \alpha_{ij}^k denotes the normalized attention coefficients calculated by the k-th attention mechanism a^{k}, and \mathbf{W}^k is the corresponding weight matrix

            ⑦In the final (prediction) layer, averaging the heads is more sensible than concatenation, with the nonlinearity applied after the average:

    \vec h'_i=\sigma\left(\frac1K\sum_{k=1}^K\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)

            ⑧Model figure (image omitted): the left panel illustrates the attention mechanism a(\mathbf{W}\vec{h}_i,\mathbf{W}\vec{h}_j), and the right panel illustrates multi-head attention on a node's neighborhood with K=3 heads
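    Building on the hypothetical gat_head sketch above, ⑥ and ⑦ combine K heads as follows (again an assumption-laden sketch, not the reference implementation):

```python
import numpy as np

def gat_multi_head(h, adj, Ws, a_vecs, final_layer=False):
    """K-head GAT layer: Ws and a_vecs hold one (W^k, a^k) pair per head."""
    if final_layer:
        # ⑦ prediction layer: average the raw head outputs; the caller then
        # applies the output nonlinearity (softmax or logistic sigmoid)
        heads = [gat_head(h, adj, W, a, activation=None) for W, a in zip(Ws, a_vecs)]
        return np.mean(heads, axis=0)              # shape (N, F')
    # ⑥ hidden layer: apply sigma inside each head, then concatenate along features
    heads = [gat_head(h, adj, W, a) for W, a in zip(Ws, a_vecs)]
    return np.concatenate(heads, axis=-1)          # shape (N, K * F')
```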

    2.3.2. Comparisons to related work

    (1)Their improvements:

            ①There is no need for eigendecomposition or other time-consuming matrix operations, and the computations of the K attention heads can be parallelized

            ②GAT allows assigning different importance weights to different neighbors

            ③GAT works on directed graphs by simply omitting \alpha_{ij} when the edge j\rightarrow i is not present

            ④It is directly applicable to inductive learning

            ⑤GraphSAGE samples a fixed-size neighborhood and cannot process the whole neighborhood, whereas GAT can

            ⑥Compared with MoNet, which uses the node's structural properties, GAT uses node features for the similarity computation

    on par with  equal to, on the same level as

    par  n. face value (of a stock); (golf) the standard number of strokes; an average amount or normal level; a standard (especially of someone's work or health)  adj. at face value; average, normal  vt. (golf) to score par on

    2.4. Evaluation

            Datasets information (summary table from the paper, image omitted):

    2.4.1. Datasets

    (1)Transductive learning

            ①In the three citation datasets (Cora, Citeseer, Pubmed), nodes represent documents, undirected edges represent citations, and node features are elements of a bag-of-words representation of the document

            ②Training set: 20 nodes per class

    (2)Inductive learning

            ①Pre-processing: provided by Hamilton et al. (GraphSAGE)

    2.4.2. State-of-the-art method

    (1)Transductive learning

            Comparison table (image omitted):

    (2)Inductive learning

            Comparison table (image omitted):

    where Const-GAT uses a constant attention mechanism, i.e. it assigns the same weight to every neighbor

    (3)Summary

            They also report a per-node shared MLP classifier that does not use the graph structure

    2.4.3. Experimental setup

    (1)Transductive learning

            ①They adopted a two-layer model. The first layer uses K=8 attention heads with F'=8 features per head, followed by an ELU nonlinearity. The second layer uses a single attention head computing C features (C being the number of classes), followed by a softmax.

            ②Moreover, L2 regularization is applied with \lambda =0.0005

            ③Dropout rate: 0.6

    (2)Inductive learning

            ①They chose a three-layer model. The first two layers use K=4 attention heads with F'=256 features per head, each followed by an ELU. The third layer uses K=6 attention heads, whose outputs are averaged and followed by a logistic sigmoid.

            ②The training set is large enough that L2 regularization and dropout are unnecessary

            ③Skip connections are applied across the intermediate attention layer

    (3)Summary

            ①Initialization: Glorot

            ②Optimizer: Adam SGD

            ③Learning rate: 0.01 for Pubmed, and 0.005 for others

            ④Early stopping strategy: patience of 100 epochs
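            The setup above can be condensed into a hypothetical configuration summary (the dictionary and its key names are mine; the values are the ones listed in 2.4.3):

```python
# Hypothetical hyperparameter summary of the two reported settings.
GAT_CONFIGS = {
    "transductive": {                  # Cora / Citeseer / Pubmed
        "num_layers": 2,
        "attention_heads": [8, 1],     # K per layer; the output head computes C class scores
        "features_per_head": 8,        # F' in the first layer
        "hidden_activation": "elu",    # softmax on the output layer
        "l2_lambda": 5e-4,
        "dropout": 0.6,
        "optimizer": "adam",
        "learning_rate": 0.005,        # 0.01 for Pubmed
        "weight_init": "glorot",
        "early_stopping_patience": 100,
    },
    "inductive": {                     # protein-protein interaction dataset
        "num_layers": 3,
        "attention_heads": [4, 4, 6],  # the 6 output heads are averaged
        "features_per_head": 256,      # F' in the first two layers
        "hidden_activation": "elu",    # logistic sigmoid on the output layer
        "l2_lambda": 0.0,              # training set large enough to skip regularization
        "dropout": 0.0,
        "skip_connections": True,      # across the intermediate attention layer
        "optimizer": "adam",
        "learning_rate": 0.005,
        "weight_init": "glorot",
        "early_stopping_patience": 100,
    },
}
```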

    2.4.4. Results

            ①For a fair comparison, they tuned the competing models to settings comparable to GAT

            ②t-SNE visualization of the transformed feature representations, colored by the 7 class labels (figure omitted):

    where the features come from the first layer of a GAT model trained on the Cora dataset

    2.5. Conclusions

            ①They reiterate that GAT is computationally efficient, parallelizable, and free of costly matrix operations.

            ②"A particularly interesting research direction is a thorough analysis of model interpretability using the attention mechanism"? Huh??? People have been saying this from 2018 all the way to 2023 and interpretability still has no conclusive results

            ③Taking edge features into account is a feasible extension

    3. Supplementary knowledge

    3.1. Spectral and non-spectral approaches for GNN

    3.2. Spectral domain and frequency domain

    (1)Spectral domain: mainly used in GNNs; applies a (graph) Fourier transform along the spatial/graph dimension

    (2)Frequency domain: mainly used in signal and image processing; applies a Fourier transform along the temporal dimension
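    To make the "spectral domain" concrete: a node signal is moved into the spectral domain by projecting it onto the eigenvectors of the graph Laplacian (standard graph signal processing, independent of GAT). A minimal NumPy sketch:

```python
import numpy as np

def graph_fourier_transform(adj, x):
    """Graph Fourier transform of a node signal x using the unnormalized Laplacian L = D - A."""
    degree = np.diag(adj.sum(axis=1))              # D: diagonal degree matrix
    laplacian = degree - adj                       # L = D - A
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvectors form the graph Fourier basis
    x_hat = eigvecs.T @ x                          # spectral coefficients of x
    return eigvals, x_hat                          # eigvals play the role of "graph frequencies"
```

    Early spectral GNN approaches define graph convolutions in this eigenbasis, which is exactly the costly eigendecomposition that GAT avoids.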

    3.3. t-SNE
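    t-SNE is the dimensionality-reduction method used in 2.4.4 to visualize the learned node representations in 2D. A minimal usage sketch with scikit-learn (the `features` array below is random placeholder data, not the paper's actual first-layer features):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the (N, D) transformed node representations; 2708 is the number of Cora nodes.
features = np.random.rand(2708, 64)
embedding_2d = TSNE(n_components=2, perplexity=30, init="pca",
                    random_state=0).fit_transform(features)
# embedding_2d has shape (N, 2); scatter-plot it and color each point by its class label.
```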

    4. Reference List

    Velickovic, P. et al. (2018) 'Graph Attention Networks', ICLR 2018. doi: https://doi.org/10.48550/arXiv.1710.10903
