• CS224W Colab 2 Notes


Contents

1. Querying dataset attributes in PyG (Question 1: number of classes and features; Questions 2 and 3)

2. Using the OGB package (Question 4: How many features are in the ogbn-arxiv graph?)

3. Implementing and understanding a GNN framework for node classification

① Using the torch.nn.ModuleList() and torch.nn.LogSoftmax() classes

② Using PyG's GCNConv() and BatchNorm() classes

③ The role/structure of the GCN class here

④ torch.flatten(input, start_dim=0, end_dim=-1): merges dimensions start_dim through end_dim and returns the result

⑤ Understanding the train function, def train(model, data, train_idx, optimizer, loss_fn)

⑥ The loss_fn function

⑦ The out.argmax(...) call in the test(...) method: see the code comments

4. Graph-level prediction

① How the framework is built

② DataLoader and batch

③ Tips in the train function


1. Querying dataset attributes in PyG (Question 1: number of classes and features; Questions 2 and 3)

# Question 1: What is the number of classes and number of features in the ENZYMES dataset?
def get_num_classes(pyg_dataset):
    # TODO: Implement a function that takes a PyG dataset object
    # and returns the number of classes for that dataset.

    num_classes = 0

    ############# Your code here ############
    ## (~1 line of code)
    ## Note
    ## 1. Colab autocomplete functionality might be useful.
    #########################################
    num_classes = pyg_dataset.num_classes

    return num_classes

def get_num_features(pyg_dataset):
    # TODO: Implement a function that takes a PyG dataset object
    # and returns the number of features for that dataset.

    num_features = 0

    ############# Your code here ############
    ## (~1 line of code)
    ## Note
    ## 1. Colab autocomplete functionality might be useful.
    #########################################
    num_features = pyg_dataset.num_features

    return num_features

if 'IS_GRADESCOPE_ENV' not in os.environ:
    num_classes = get_num_classes(pyg_dataset)
    num_features = get_num_features(pyg_dataset)
    print("{} dataset has {} classes".format(name, num_classes))
    print("{} dataset has {} features".format(name, num_features))
# Question 2: What is the label of the graph with index 100 in the ENZYMES dataset?
def get_graph_class(pyg_dataset, idx):
    # TODO: Implement a function that takes a PyG dataset object,
    # an index of a graph within the dataset, and returns the class/label
    # of the graph (as an integer).

    label = -1

    ############# Your code here ############
    ## (~1 line of code)
    #########################################
    label = pyg_dataset[idx].y.item()

    return label

# Here pyg_dataset is a dataset for graph classification
if 'IS_GRADESCOPE_ENV' not in os.environ:
    graph_0 = pyg_dataset[0]
    print(graph_0)
    idx = 100
    label = get_graph_class(pyg_dataset, idx)
    print('Graph with index {} has label {}'.format(idx, label))
# Question 3: How many edges does the graph with index 200 have?
print(pyg_dataset[200].num_edges)

def get_graph_num_edges(pyg_dataset, idx):
    # TODO: Implement a function that takes a PyG dataset object,
    # the index of a graph in the dataset, and returns the number of
    # edges in the graph (as an integer). You should not count an edge
    # twice if the graph is undirected. For example, in an undirected
    # graph G, if two nodes v and u are connected by an edge, this edge
    # should only be counted once.

    num_edges = 0

    ############# Your code here ############
    ## Note:
    ## 1. You can't return the data.num_edges directly
    ## 2. We assume the graph is undirected
    ## 3. Look at the PyG dataset built in functions
    ## (~4 lines of code)
    #########################################
    # num_edges counts every undirected edge twice (once per direction),
    # so halve it; integer division keeps the result an int.
    num_edges = pyg_dataset[idx].num_edges // 2

    return num_edges

if 'IS_GRADESCOPE_ENV' not in os.environ:
    idx = 200
    num_edges = get_graph_num_edges(pyg_dataset, idx)
    print('Graph with index {} has {} edges'.format(idx, num_edges))

2. Using the OGB package (Question 4: How many features are in the ogbn-arxiv graph?)

# Question 4
import os

import torch
import pandas as pd
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

if 'IS_GRADESCOPE_ENV' not in os.environ:
    dataset_name = 'ogbn-arxiv'
    # Load the dataset and transform it to sparse tensor
    dataset = PygNodePropPredDataset(name=dataset_name,
                                     transform=T.ToSparseTensor(), root="./Arxiv")
    print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))

    # Extract the graph
    data = dataset[0]
    print(data)

def graph_num_features(data):
    # TODO: Implement a function that takes a PyG data object,
    # and returns the number of features in the graph (as an integer).

    num_features = 0

    ############# Your code here ############
    ## (~1 line of code)
    num_features = data.num_features
    #########################################

    return num_features

if 'IS_GRADESCOPE_ENV' not in os.environ:
    num_features = graph_num_features(data)
    print('The graph has {} features'.format(num_features))

3. Implementing and understanding a GNN framework for node classification

① Using the torch.nn.ModuleList() and torch.nn.LogSoftmax() classes
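A minimal sketch of how these two classes are typically used (the layer count and sizes below are made up for illustration):

import torch

# torch.nn.ModuleList registers a variable number of sub-modules so that
# their parameters are visible to the optimizer (a plain Python list is not).
layers = torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(3)])

# torch.nn.LogSoftmax applies log(softmax(x)) along the given dimension;
# its output pairs with torch.nn.functional.nll_loss.
log_softmax = torch.nn.LogSoftmax(dim=-1)

x = torch.randn(4, 16)       # 4 samples, 16 features
for layer in layers:
    x = layer(x)
print(log_softmax(x).shape)  # torch.Size([4, 16])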

② Using PyG's GCNConv() and BatchNorm() classes
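Similarly, a small sketch for PyG's GCNConv and BatchNorm on a toy graph (the dimensions are illustrative):

import torch
from torch_geometric.nn import GCNConv, BatchNorm

# A toy graph: 3 nodes with 8 features each; the 2 undirected edges are
# listed in both directions, as PyG expects.
x = torch.randn(3, 8)
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)

conv = GCNConv(in_channels=8, out_channels=4)  # one graph convolution
bn = BatchNorm(4)                              # PyG wrapper around BatchNorm1d

out = bn(conv(x, edge_index))
print(out.shape)  # torch.Size([3, 4])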

③ The role/structure of the GCN class here:

The __init__(...) method defines the operations shown in the figure (from the Colab handout), and the forward(...) method calls them in order, mapping the graph data (feature matrix X, adjacency adj_t, etc.) to a prediction for each of the N nodes.

④ torch.flatten(input, start_dim=0, end_dim=-1): merges dimensions start_dim through end_dim into a single dimension and returns the result (with the default arguments, a 1-D tensor).

Example: take a 3-D tensor and flatten dimensions 0 through 1:

import numpy as np
import torch

x = np.arange(27)
x = np.reshape(x, (3, 3, 3))
x = torch.from_numpy(x)
print('before flatten', x)  # shape (3, 3, 3)
x = torch.flatten(x, start_dim=0, end_dim=1)
print('after flatten', x)   # shape (9, 3): [[0, 1, 2], [3, 4, 5], ..., [24, 25, 26]]

Interpreting the result:
First, removing a dimension increases the number of top-level objects: if the variable is a 3×3 matrix, removing dimension 0 turns it into 3 separate three-element vectors (from 1 object to 3).
Back to this example: collapsing dimension 0 removes the grouping between x[0], x[1] and x[2] (the outermost brackets disappear), and collapsing dimension 1 works the same way. This leaves the individual rows [0, 1, 2], [3, 4, 5], ... as separate "variables"; packing them back into one tensor gives the output above, which is in fact still 2-dimensional (shape (9, 3)), not 1-dimensional.

⑤ Understanding the train function, def train(model, data, train_idx, optimizer, loss_fn):

Purpose: run one epoch of training (one pass over all the training data).

Steps: model.train() puts the module in training mode (it sets self.training, which makes operations such as dropout run conditionally; F.dropout reads this flag); then zero the optimizer's gradients; feed data into the model to get the output out; compute the loss from out and the labels with loss_fn; finally backpropagate and let the optimizer update the model parameters.

⑥ The loss_fn function:

Its second argument (the target) must be a 1-D tensor, not 2-D: internally the function treats the target as having one dimension less than the input, so the [N, 1] label tensor has to be reduced with torch.flatten(...) first.
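A small check of this point with made-up shapes: F.nll_loss expects a 1-D target of class indices, while the ogbn-arxiv labels are stored as [N, 1]:

import torch
import torch.nn.functional as F

out = torch.log_softmax(torch.randn(5, 3), dim=-1)  # 5 nodes, 3 classes
y = torch.randint(0, 3, (5, 1))                     # labels stored as [N, 1]

# F.nll_loss(out, y) would raise a shape error: the target must be 1-D.
loss = F.nll_loss(out, torch.flatten(y))            # flatten [5, 1] -> [5]
print(loss.item())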

⑦ The out.argmax(...) call in the test(...) method: see the comments in the code.
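Before the full code, a toy illustration of that argmax call:

import torch

out = torch.tensor([[0.1, 0.7, 0.2],
                    [0.5, 0.2, 0.3]])      # n=2 nodes, m=3 classes
y_pred = out.argmax(dim=-1, keepdim=True)  # index of the max along the class dim
print(y_pred)                              # tensor([[1], [0]]), shape [2, 1]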

# Imports used below; GCNConv/BatchNorm come from PyG, Evaluator from OGB.
import copy
import torch
import torch.nn.functional as F
import pandas as pd
from torch_geometric.nn import GCNConv, BatchNorm
from ogb.nodeproppred import Evaluator

class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout, return_embeds=False):
        # TODO: Implement a function that initializes self.convs,
        # self.bns, and self.softmax.
        super(GCN, self).__init__()

        # A list of GCNConv layers
        self.convs = None
        # A list of 1D batch normalization layers
        self.bns = None
        # The log softmax layer
        self.softmax = None

        ############# Your code here ############
        ## Note:
        ## 1. You should use torch.nn.ModuleList for self.convs and self.bns
        ## 2. self.convs has num_layers GCNConv layers
        ## 3. self.bns has num_layers - 1 BatchNorm1d layers
        ## 4. You should use torch.nn.LogSoftmax for self.softmax
        ## 5. The parameters you can set for GCNConv include 'in_channels' and
        ## 'out_channels'. For more information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv
        ## 6. The only parameter you need to set for BatchNorm1d is 'num_features'
        ## For more information please refer to the documentation:
        ## https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html
        ## (~10 lines of code)
        #########################################
        self.convs = torch.nn.ModuleList()
        self.bns = torch.nn.ModuleList()
        self.softmax = torch.nn.LogSoftmax(dim=-1)
        tmp1 = input_dim
        tmp2 = hidden_dim
        for i in range(num_layers - 1):
            self.convs.append(GCNConv(tmp1, tmp2))
            self.bns.append(BatchNorm(tmp2))
            tmp1 = tmp2
        self.convs.append(GCNConv(hidden_dim, output_dim))

        # Probability of an element getting zeroed
        self.dropout = dropout
        # Skip classification layer and return node embeddings
        self.return_embeds = return_embeds

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, adj_t):
        # TODO: Implement a function that takes the feature tensor x and
        # edge_index tensor adj_t and returns the output tensor as
        # shown in the figure.
        out = None

        ############# Your code here ############
        ## Note:
        ## 1. Construct the network as shown in the figure
        ## 2. torch.nn.functional.relu and torch.nn.functional.dropout are useful
        ## For more information please refer to the documentation:
        ## https://pytorch.org/docs/stable/nn.functional.html
        ## 3. Don't forget to set F.dropout training to self.training
        ## 4. If return_embeds is True, then skip the last softmax layer
        ## (~7 lines of code)
        #########################################
        out = x
        for i in range(len(self.bns)):
            out = self.convs[i](out, adj_t)
            out = self.bns[i](out)
            out = torch.nn.functional.relu(out)
            out = torch.nn.functional.dropout(out, p=self.dropout, training=self.training)
        out = self.convs[-1](out, adj_t)
        out = self.softmax(out) if (not self.return_embeds) else out

        return out

def train(model, data, train_idx, optimizer, loss_fn):
    # TODO: Implement a function that trains the model by
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    ############# Your code here ############
    ## Note:
    ## 1. Zero grad the optimizer
    ## 2. Feed the data into the model
    ## 3. Slice the model output and label by train_idx
    ## 4. Feed the sliced output and label to loss_fn
    ## (~4 lines of code)
    optimizer.zero_grad()
    out = model(data.x, data.adj_t)
    loss = loss_fn(out[train_idx], torch.flatten(data.y[train_idx]))
    #########################################

    loss.backward()
    optimizer.step()

    return loss.item()

# Test function here
@torch.no_grad()
def test(model, data, split_idx, evaluator, save_model_results=False):
    # TODO: Implement a function that tests the model by
    # using the given split_idx and evaluator.
    model.eval()

    # The output of model on all data
    out = None

    ############# Your code here ############
    ## (~1 line of code)
    ## Note:
    ## 1. No index slicing here
    #########################################
    out = model(data.x, data.adj_t)

    # argmax picks, along the given dimension dim, the index of the maximum;
    # keepdim=True keeps the reduced dimension (with size 1).
    # out is an n*m matrix, where m is the number of classes.
    # We want an n*1 prediction matrix, so argmax runs over the class
    # dimension: argmax(a[i][0], ..., a[i][m-1]) for each row i.
    y_pred = out.argmax(dim=-1, keepdim=True)

    train_acc = evaluator.eval({
        'y_true': data.y[split_idx['train']],
        'y_pred': y_pred[split_idx['train']],
    })['acc']
    valid_acc = evaluator.eval({
        'y_true': data.y[split_idx['valid']],
        'y_pred': y_pred[split_idx['valid']],
    })['acc']
    test_acc = evaluator.eval({
        'y_true': data.y[split_idx['test']],
        'y_pred': y_pred[split_idx['test']],
    })['acc']

    if save_model_results:
        print("Saving Model Predictions")
        data = {}
        # Reshape the 2-D y_pred to 1-D, move it to the CPU,
        # detach it from the autograd graph, and convert to numpy.
        data['y_pred'] = y_pred.view(-1).cpu().detach().numpy()
        df = pd.DataFrame(data=data)
        # Save locally as csv
        df.to_csv('ogbn-arxiv_node.csv', sep=',', index=False)

    return train_acc, valid_acc, test_acc

# Please do not change the args
if 'IS_GRADESCOPE_ENV' not in os.environ:
    args = {
        'device': device,
        'num_layers': 3,
        'hidden_dim': 256,
        'dropout': 0.5,
        'lr': 0.01,
        'epochs': 100,
    }

if 'IS_GRADESCOPE_ENV' not in os.environ:
    model = GCN(data.num_features, args['hidden_dim'],
                dataset.num_classes, args['num_layers'],
                args['dropout']).to(device)
    evaluator = Evaluator(name='ogbn-arxiv')

if 'IS_GRADESCOPE_ENV' not in os.environ:
    # reset the parameters to initial random value
    model.reset_parameters()

    optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
    loss_fn = F.nll_loss

    best_model = None
    best_valid_acc = 0

    for epoch in range(1, 1 + args["epochs"]):
        loss = train(model, data, train_idx, optimizer, loss_fn)
        result = test(model, data, split_idx, evaluator)
        train_acc, valid_acc, test_acc = result
        if valid_acc > best_valid_acc:
            best_valid_acc = valid_acc
            best_model = copy.deepcopy(model)
        print(f'Epoch: {epoch:02d}, '
              f'Loss: {loss:.4f}, '
              f'Train: {100 * train_acc:.2f}%, '
              f'Valid: {100 * valid_acc:.2f}% '
              f'Test: {100 * test_acc:.2f}%')

if 'IS_GRADESCOPE_ENV' not in os.environ:
    best_result = test(best_model, data, split_idx, evaluator, save_model_results=True)
    train_acc, valid_acc, test_acc = best_result
    print(f'Best model: '
          f'Train: {100 * train_acc:.2f}%, '
          f'Valid: {100 * valid_acc:.2f}% '
          f'Test: {100 * test_acc:.2f}%')

4. Graph-level prediction

① How the framework is built:

A graph-level prediction pipeline consists of: first use the node-level GNN to compute node embeddings, then pool them into a single embedding per graph, and finally apply a suitable linear or non-linear transformation.

The coding of the __init__ and forward functions therefore follows exactly these steps; a small pooling sketch is given below.
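As a minimal sketch of the pooling step (toy tensors, not the Colab's data): global_mean_pool averages the node embeddings of each graph, using the batch vector to assign nodes to graphs.

import torch
from torch_geometric.nn import global_mean_pool

x = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])    # 3 node embeddings
batch = torch.tensor([0, 0, 1])   # nodes 0-1 belong to graph 0, node 2 to graph 1

out = global_mean_pool(x, batch)
print(out)  # tensor([[2., 3.], [5., 6.]]) -- one row per graph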

② DataLoader and batch:

The tutorial first wraps the OGB dataset in a DataLoader and sets a batch size. Is DataLoader, then, the mechanism for batching? Yes: each iteration of PyG's DataLoader collates batch_size individual graphs into one Batch object, as the sketch below illustrates.
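A short sketch of that behavior, assuming the ogbg-molhiv dataset used in the Colab (note the import path depends on the PyG version):

# In newer PyG versions DataLoader lives in torch_geometric.loader;
# in older ones it was torch_geometric.data.
from torch_geometric.loader import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset

dataset = PygGraphPropPredDataset(name='ogbg-molhiv')
loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch = next(iter(loader))
# A Batch merges the 32 graphs into one big disconnected graph;
# batch.batch maps every node to the index (0..31) of its source graph.
print(batch.num_graphs)   # 32
print(batch.batch.shape)  # one entry per node in the mini-batch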

③ Tips in the train function:

A. Batch: one batch behaves like a small dataset containing the information of "32" (the batch size) graphs.

B. Python list comprehensions: related syntactic sugar includes dict comprehensions, etc.

C. A tensor can be indexed with a Python list, a LongTensor, or a bool array (see the comments in the train function); a quick demo follows below.
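A quick demonstration of point C with toy tensors:

import torch

y = torch.tensor([[10.], [20.], [30.]])
mask = torch.tensor([True, False, True])  # bool mask, like is_labeled

print(y[[0, 2]])                # index with a Python list
print(y[torch.tensor([0, 2])])  # index with a LongTensor
print(y[mask])                  # index with a bool tensor -- same rows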

### GCN to predict graph property
# Imports used below; AtomEncoder and Evaluator come from the OGB package.
import copy
import torch
import pandas as pd
from tqdm import tqdm
from torch_geometric.nn import global_mean_pool
from ogb.graphproppred import Evaluator
from ogb.graphproppred.mol_encoder import AtomEncoder

class GCN_Graph(torch.nn.Module):
    def __init__(self, hidden_dim, output_dim, num_layers, dropout):
        super(GCN_Graph, self).__init__()

        # Load encoders for Atoms in molecule graphs
        self.node_encoder = AtomEncoder(hidden_dim)

        # Node embedding model
        # Note that the input_dim and output_dim are set to hidden_dim
        self.gnn_node = GCN(hidden_dim, hidden_dim,
                            hidden_dim, num_layers, dropout, return_embeds=True)

        self.pool = None

        ############# Your code here ############
        ## Note:
        ## 1. Initialize self.pool as a global mean pooling layer
        ## For more information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        #########################################
        self.pool = global_mean_pool

        # Output layer
        self.linear = torch.nn.Linear(hidden_dim, output_dim)

    def reset_parameters(self):
        self.gnn_node.reset_parameters()
        self.linear.reset_parameters()

    def forward(self, batched_data):
        # TODO: Implement a function that takes as input a
        # mini-batch of graphs (torch_geometric.data.Batch) and
        # returns the predicted graph property for each graph.
        #
        # NOTE: Since we are predicting graph level properties,
        # your output will be a tensor with dimension equaling
        # the number of graphs in the mini-batch

        # Extract important attributes of our mini-batch
        x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
        embed = self.node_encoder(x)

        out = None

        ############# Your code here ############
        ## Note:
        ## 1. Construct node embeddings using existing GCN model
        ## 2. Use the global pooling layer to aggregate features for each individual graph
        ## For more information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        ## 3. Use a linear layer to predict each graph's property
        ## (~3 lines of code)
        out = self.gnn_node(embed, edge_index)
        out = self.pool(out, batch)
        out = self.linear(out)
        #########################################

        return out

def train(model, device, data_loader, optimizer, loss_fn):
    # TODO: Implement a function that trains your model by
    # using the given optimizer and loss_fn.
    model.train()
    loss = 0

    for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
        batch = batch.to(device)
        # One batch is effectively a small dataset containing the
        # information of "32" (batch size) graphs.
        if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
            # Skip batches that consist of a single isolated node
            # or that contain only one graph.
            pass
        else:
            ## ignore nan targets (unlabeled) when computing training loss.
            is_labeled = batch.y == batch.y  # normally all True; shape [32, 1]

            ############# Your code here ############
            ## Note:
            ## 1. Zero grad the optimizer
            ## 2. Feed the data into the model
            ## 3. Use `is_labeled` mask to filter output and labels
            ## 4. You may need to change the type of label to torch.float32
            ## 5. Feed the output and label to the loss_fn
            ## (~3 lines of code)
            optimizer.zero_grad()
            out = model(batch)
            # Python list comprehension (other syntactic sugar: dict comprehensions, etc.)
            tmp_index = [index for index in range(is_labeled.shape[0]) if is_labeled[index]]
            loss = loss_fn(out[tmp_index], batch.y[tmp_index].type(torch.float32))
            # Other solutions index directly with is_labeled; tensor indices can
            # be Python lists, LongTensors, or bool arrays.
            #########################################

            loss.backward()
            optimizer.step()

    return loss.item()

# The evaluation function
def eval(model, device, loader, evaluator, save_model_results=False, save_file=None):
    model.eval()
    y_true = []
    y_pred = []

    for step, batch in enumerate(tqdm(loader, desc="Iteration")):
        batch = batch.to(device)
        if batch.x.shape[0] == 1:
            pass
        else:
            with torch.no_grad():
                pred = model(batch)
            y_true.append(batch.y.view(pred.shape).detach().cpu())
            y_pred.append(pred.detach().cpu())

    y_true = torch.cat(y_true, dim=0).numpy()
    y_pred = torch.cat(y_pred, dim=0).numpy()
    input_dict = {"y_true": y_true, "y_pred": y_pred}

    if save_model_results:
        print("Saving Model Predictions")
        # Create a pandas dataframe with two columns: y_pred | y_true
        data = {}
        data['y_pred'] = y_pred.reshape(-1)
        data['y_true'] = y_true.reshape(-1)
        df = pd.DataFrame(data=data)
        # Save to csv
        df.to_csv('ogbg-molhiv_graph_' + save_file + '.csv', sep=',', index=False)

    return evaluator.eval(input_dict)

if 'IS_GRADESCOPE_ENV' not in os.environ:
    model = GCN_Graph(args['hidden_dim'],
                      dataset.num_tasks, args['num_layers'],
                      args['dropout']).to(device)
    evaluator = Evaluator(name='ogbg-molhiv')

if 'IS_GRADESCOPE_ENV' not in os.environ:
    model.reset_parameters()

    optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
    loss_fn = torch.nn.BCEWithLogitsLoss()

    best_model = None
    best_valid_acc = 0

    for epoch in range(1, 1 + args["epochs"]):
        print('Training...')
        loss = train(model, device, train_loader, optimizer, loss_fn)

        print('Evaluating...')
        train_result = eval(model, device, train_loader, evaluator)
        val_result = eval(model, device, valid_loader, evaluator)
        test_result = eval(model, device, test_loader, evaluator)

        train_acc, valid_acc, test_acc = (train_result[dataset.eval_metric],
                                          val_result[dataset.eval_metric],
                                          test_result[dataset.eval_metric])
        if valid_acc > best_valid_acc:
            best_valid_acc = valid_acc
            best_model = copy.deepcopy(model)
        print(f'Epoch: {epoch:02d}, '
              f'Loss: {loss:.4f}, '
              f'Train: {100 * train_acc:.2f}%, '
              f'Valid: {100 * valid_acc:.2f}% '
              f'Test: {100 * test_acc:.2f}%')

    End...

• Original article: https://blog.csdn.net/qq_41661919/article/details/126426970