论文信息
论文标题:Rethinking and Scaling Up Graph Contrastive Learning: An Extremely Efficient Approach with Group Discrimination
论文作者:Yizhen Zheng, Shirui Pan, Vincent Cs Lee, Yu Zheng, Philip S. Yu
论文来源:2022,NeurIPS
论文地址:download
论文代码:download
1 Introduction
GCL 需要大量的 Epoch 在数据集上训练,本文的启发来自 GCL 的代表性工作 DGI 和 MVGRL,因为 Sigmoid 函数存在的缺陷,因此,本文提出 Group Discrimination (GD) ,并基于此提出本文的模型 Graph Group Discrimination (GGD)。
Graph ContrastiveLearning 和 Group Discrimination 的区别:
- GD directly discriminates a group of positive nodes from a group of negative nodes.
- GCL maximise the mutual information (MI) between an anchor node and its positive counterparts, sharing similar semantic information while doing the opposite for negative counterparts.
贡献:
- 1) We re-examine existing GCL approaches (e.g., DGI and MVGRL), and we introduce a novel and efficient self-supervised GRL paradigm, namely, Group Discrimination (GD).
- 2) Based on GD, we propose a new self-supervised GRL model, GGD, which is fast in training and convergence, and possess high scalability.
- 3) We conduct extensive experiments on eight datasets, including an extremely large dataset, ogbn-papers100M with billion edges.
2 Rethinking Representative GCL Methods
本节以经典的 DGI 、MVGRL 为例子,说明了互信息最大化并不是对比学习的贡献因素,而是一个新的范式,群体歧视(group discrimination)。
2.1 Rethinking GCL Methods
回顾一下 DGI :
代码:
class DGI(nn.Module):
def __init__(self, g, in_feats, n_hidden, n_layers, activation, dropout):
super(DGI, self).__init__()
self.encoder = Encoder(g, in_feats, n_hidden, n_layers, activation, dropout)
self.discriminator = Discriminator(n_hidden)
self.loss = nn.BCEWithLogitsLoss()
def forward(self, features):
positive = self.encoder(features, corrupt=False)
negative = self.encoder(features, corrupt=True)
summary = torch.sigmoid(positive.mean(dim=0))
positive = self.discriminator(positive, summary)
negative = self.discriminator(negative, summary)
l1 = self.loss(positive, torch.ones_like(positive))
l2 = self.loss(negative, torch.zeros_like(negative))
return l1 + l2
本文研究 DGI 结论:一个 Sigmoid 函数不适用于权重被 Xavier 初始化的 GNN 生成的 summary vector,且 summary vector 中的元素非常接近于相同的值。
接着尝试将 Summary vector 中的数值变换成不同的常量 (from 0 to 1):
结论:
-
- 将 summary vector 中的数值变成 0,求解相似度时导致所有的 score 变成 0,也就是 postive 项的损失函数变成 负无穷,无法优化;
- summary vector 设置其他值,导致 数值不稳定;
DGI 的简化:
① 将 summary vector 设置为 单位向量(缩放对损失不影响);
② 去掉 Discriminator (Bilinear :先做线性变换,再求内积相似度)的权重向量;【双线性层的
Bilinear :
实验:替换
替换形式为:
其中,
DGI 真正所做的是区分正确拓扑生成的一组节点和损坏拓扑生成的节点,如 Figure 1 所示。可以这么理解,DGI 是使用一个固定的向量
Note:方差大的稍微大一点的 method ,就是容易被诋毁。
3 Methodology
整体框架:
组成部分:
-
- Siamese Network :模仿 MVGRL 的架构;
- Data Augmentation :提供相似意义信息,带来的是时间成本;【dropout edge、feature mask】
- Loss function :
Eq.3 " role="presentation">;
首先:固定 GNN encoder、MLP predict 的参数,获得初步的节点表示
其次:MVGRL 多视图对比工作给本文深刻的启发,所以考虑引入全局信息 :
最后:得到局部表示和全局表示的聚合
4 Experiments
4.1 Datasets
4.2 Result
节点分类
训练时间 和 内存消耗
5 Future Work
For example, can we extend the current binary Group Discrimination scheme (i.e., classifying nodes generated with different topology) to discrimination among multiple groups?
__EOF__