• Personal notes on the ogbg-molhiv dataset


    How does the graph property prediction in cs224w_colab2.py actually work?

    dataset.meta_info.T
    Out[2]:
    num tasks                  1
    eval metric                rocauc
    download_name              hiv
    version                    1
    url                        http://snap.stanford.edu/ogb/data/graphproppre...
    add_inverse_edge           True
    data type                  mol
    has_node_attr              True
    has_edge_attr              True
    task type                  binary classification
    num classes                2
    split                      scaffold
    additional node files      None
    additional edge files      None
    binary                     False
    Name: ogbg-molhiv, dtype: object

    Referring to the output above: num tasks = 1 here. Does num_tasks only apply to graph property prediction tasks?
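    A minimal sketch of where these fields come from, assuming the standard ogb package is installed: num_tasks is the number of prediction targets per graph, so each ogbg-molhiv molecule carries a single binary label.

    from ogb.graphproppred import PygGraphPropPredDataset

    dataset = PygGraphPropPredDataset(name='ogbg-molhiv')  # downloads the data on first use

    print(dataset.num_tasks)    # 1 -> one binary target per molecule
    print(dataset.task_type)    # 'binary classification'
    print(dataset.eval_metric)  # 'rocauc'
    print(dataset[0])           # a single molecule as a PyG Data object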

    model = GCN_Graph(args['hidden_dim'],
                      dataset.num_tasks, args['num_layers'],
                      args['dropout']).to(device)
    train_loader.dataset.data.edge_index.shape
    Out[10]: torch.Size([2, 2259376])
    train_loader.dataset.data.edge_attr.shape
    Out[12]: torch.Size([2259376, 3])
    type(train_loader.dataset.data.node_stores)
    Out[26]: list

    train_loader.dataset.data.node_stores[0]['y'].shape
    Out[46]: torch.Size([41127, 1])
    train_loader.dataset.data.node_stores[0]['y'].sum()
    Out[47]: tensor(1443)   # sum of the values in y
    torch.unique(train_loader.dataset.data.node_stores[0]['y'], return_counts=True)
    Out[58]: (tensor([0, 1]), tensor([39684,  1443]))   # only two classes, 0 and 1
    self.node_encoder.atom_embedding_list
    Out[62]:
    ModuleList(
      (0): Embedding(119, 256)
      (1): Embedding(5, 256)
      (2): Embedding(12, 256)
      (3): Embedding(12, 256)
      (4): Embedding(10, 256)
      (5): Embedding(6, 256)
      (6): Embedding(6, 256)
      (7): Embedding(2, 256)
      (8): Embedding(2, 256)
    )
    list(enumerate(data_loader))
    Out[82]:
    [(0,
      DataBatch(edge_index=[2, 1734], edge_attr=[1734, 3], x=[807, 9], y=[32, 1], num_nodes=807, batch=[807], ptr=[33])),
     ...]   # many more batches follow
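    Two things stand out above: the nine embedding tables line up with the nine integer atom-feature columns of x ([num_nodes, 9]), and AtomEncoder embeds each column separately and sums the results into one hidden_dim vector per atom. A minimal re-implementation sketch of that forward pass (AtomEncoderSketch is a hypothetical name; the vocabulary sizes are copied from the printout):

    import torch.nn as nn

    class AtomEncoderSketch(nn.Module):
        def __init__(self, hidden_dim=256, vocab_sizes=(119, 5, 12, 12, 10, 6, 6, 2, 2)):
            super().__init__()
            # one embedding table per atom-feature column, as in the ModuleList above
            self.atom_embedding_list = nn.ModuleList(
                nn.Embedding(v, hidden_dim) for v in vocab_sizes)

        def forward(self, x):                    # x: [num_nodes, 9] integer atom features
            out = 0
            for i, emb in enumerate(self.atom_embedding_list):
                out = out + emb(x[:, i])         # sum the nine per-feature embeddings
            return out                           # [num_nodes, hidden_dim]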
    x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
    embed = self.node_encoder(x)            # the encoder maps the original 9-dim integer features to 256-dim; self.node_encoder = AtomEncoder(hidden_dim)
    out = self.gnn_node(embed, edge_index)  # the GCN turns embed (= X) and edge_index (the list of node pairs) into node embeddings
    out = self.pool(out, batch)
    batch.unique(return_counts=True)
    Out[94]:
    (tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
             18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
            device='cuda:0'),   # graph ids 0-31, i.e. 32 subgraphs (molecules) in this batch; the pooling API below shows how they are aggregated
     tensor([30, 18, 21, 26, 12, 20, 17, 18, 36, 11, 31, 22, 21, 26, 22, 21, 21, 63,
             15, 18, 18, 29, 18, 40, 41, 19, 19, 30, 12, 21, 19, 23],   # the number of nodes (atoms) in each molecule
            device='cuda:0'))
    batch.shape
    Out[95]: torch.Size([758])
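    For context, here is a sketch of how such a loader is typically built. The batch_size=32 and shuffle flag are assumptions inferred from the shapes above (y=[32, 1] implies 32 graphs per batch); PyG's DataLoader concatenates the molecules of a batch into one big graph and records each node's graph id in batch:

    from torch_geometric.loader import DataLoader

    split_idx = dataset.get_idx_split()   # scaffold split: 'train' / 'valid' / 'test' indices
    train_loader = DataLoader(dataset[split_idx['train']], batch_size=32, shuffle=True)

    for batch in train_loader:
        print(batch.num_graphs)    # 32 molecules in this mini-batch
        print(batch.batch.shape)   # total number of atoms, e.g. torch.Size([758])
        break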

    def global_mean_pool(x: Tensor, batch: Optional[Tensor],
                         size: Optional[int] = None) -> Tensor:
        dim = -1 if x.dim() == 1 else -2   # here x.dim() == 2, so dim = -2 (the node dimension)
        # dim() -> int: returns the number of dimensions of the tensor
        if batch is None:
            # keepdim=x.dim() <= 2 is a boolean expression: the reduced dim is kept iff x has at most 2 dims
            return x.mean(dim=dim, keepdim=x.dim() <= 2)
        size = int(batch.max().item() + 1) if size is None else size
        return scatter(x, batch, dim=dim, dim_size=size, reduce='mean')
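    The puzzling keepdim=x.dim() <= 2 simply passes the boolean result of the comparison as the keepdim argument. A quick check:

    import torch

    x = torch.randn(4, 16)                          # 4 node embeddings; x.dim() == 2
    pooled = x.mean(dim=-2, keepdim=x.dim() <= 2)   # keepdim=True in this case
    print(pooled.shape)                             # torch.Size([1, 16]): one pooled embedding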

    From the torch_scatter documentation: this package is a small extension library of highly optimized sparse update (scatter and segment) operations for PyTorch, which are missing in the main package. Scatter and segment operations can be roughly described as reduce operations based on a given "group-index" tensor. Segment operations require the "group-index" tensor to be sorted, whereas scatter operations are not subject to this requirement.

    So scatter takes the embeddings of many nodes and reduces them to one embedding per subgraph, i.e. one embedding per chemical molecule.
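    A tiny hand-worked version of that scatter-mean, using only plain torch (index_add_ stands in here for torch_scatter's optimized kernel):

    import torch

    x = torch.tensor([[1.0], [3.0], [5.0], [7.0]])   # 4 node embeddings (feature dim 1)
    batch = torch.tensor([0, 0, 1, 1])               # nodes 0,1 -> graph 0; nodes 2,3 -> graph 1

    size = int(batch.max()) + 1                      # number of graphs
    sums = torch.zeros(size, x.size(1)).index_add_(0, batch, x)
    counts = torch.zeros(size).index_add_(0, batch, torch.ones(len(batch)))
    means = sums / counts.unsqueeze(1)
    print(means)                                     # tensor([[2.], [6.]]): one mean per graph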

    def forward(self, batched_data):
        # TODO: Implement a function that takes as input a
        # mini-batch of graphs (torch_geometric.data.Batch) and
        # returns the predicted graph property for each graph.
        #
        # NOTE: Since we are predicting graph level properties,
        # your output will be a tensor with dimension equaling
        # the number of graphs in the mini-batch
        x, edge_index, batch = batched_data.x, batched_data.edge_index, batched_data.batch
        embed = self.node_encoder(x)            # AtomEncoder maps the 9-dim integer features to 256-dim
        out = self.gnn_node(embed, edge_index)  # the GCN produces node embeddings
        out = self.pool(out, batch)             # aggregate node embeddings into one vector per graph
        out = self.linear(out)                  # one score per graph
        ############# Your code here ############
        ## Note:
        ## 1. Construct node embeddings using existing GCN model
        ## 2. Use the global pooling layer to aggregate features for each individual graph
        ## For more information please refer to the documentation:
        ## https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#global-pooling-layers
        ## 3. Use a linear layer to predict each graph's property
        ## (~3 lines of code)
        #########################################
        return out

    out.shape
    Out[122]: torch.Size([32, 1])
    out
    Out[121]:
    tensor([[-0.4690],
            [-1.0285],
            [-0.4614],
            ...

    After the linear layer, the model returns one raw score (a logit) per graph, from which the loss function derives the class probability. The forward pass of forward() ends here: in training, op = model(batch) enters the model's forward() first, and the loss computation and backpropagation run afterwards:
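    Putting the pieces together, here is a self-contained sketch of the graph-level model. The GCNConv stack is an assumption standing in for the assignment's own GCN module, and GCNGraphSketch is a hypothetical name, but the encoder -> GNN -> pool -> linear flow matches the forward() above:

    import torch.nn as nn
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv, global_mean_pool
    from ogb.graphproppred.mol_encoder import AtomEncoder

    class GCNGraphSketch(nn.Module):
        def __init__(self, hidden_dim=256, num_tasks=1, num_layers=2, dropout=0.5):
            super().__init__()
            self.node_encoder = AtomEncoder(hidden_dim)     # 9 integer features -> hidden_dim
            self.convs = nn.ModuleList(                     # stand-in for the assignment's GCN
                GCNConv(hidden_dim, hidden_dim) for _ in range(num_layers))
            self.dropout = dropout
            self.pool = global_mean_pool                    # nodes -> one vector per graph
            self.linear = nn.Linear(hidden_dim, num_tasks)  # one logit per graph

        def forward(self, batched_data):
            x, edge_index, batch = (batched_data.x,
                                    batched_data.edge_index,
                                    batched_data.batch)
            out = self.node_encoder(x)
            for conv in self.convs:
                out = F.relu(conv(out, edge_index))
                out = F.dropout(out, p=self.dropout, training=self.training)
            out = self.pool(out, batch)                     # [num_graphs, hidden_dim]
            return self.linear(out)                         # [num_graphs, num_tasks]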
    def train(model, device, data_loader, optimizer, loss_fn):
        # TODO: Implement a function that trains your model by
        # using the given optimizer and loss_fn.
        model.train()  # sets the module in training mode; here data_loader.dataset.data is
                       # Data(num_nodes=1049163, edge_index=[2, 2259376], edge_attr=[2259376, 3], x=[1049163, 9], y=[41127, 1])
        loss = 0
        for step, batch in enumerate(tqdm(data_loader, desc="Iteration")):
            batch = batch.to(device)
            if batch.x.shape[0] == 1 or batch.batch[-1] == 0:
                pass
            else:
                ## ignore nan targets (unlabeled) when computing training loss.
                is_labeled = batch.y == batch.y  # NaN != NaN, so this is False exactly where y is unlabeled (NaN)
                ############# Your code here ############
                ## Note:
                ## 1. Zero grad the optimizer
                ## 2. Feed the data into the model
                ## 3. Use `is_labeled` mask to filter output and labels
                ## 4. You may need to change the type of label to torch.float32
                ## 5. Feed the output and label to the loss_fn
                ## (~3 lines of code)
                optimizer.zero_grad()
                op = model(batch)  # calls the model's forward() first; the loss and backward pass follow
                ...                # loss computation, gradient update, etc. come after this
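    A plausible completion of the elided "~3 lines" plus the update step, assuming loss_fn is a BCEWithLogitsLoss-style criterion that expects raw logits and float targets:

    # inside the else-branch of the loop above
    optimizer.zero_grad()
    out = model(batch)                           # [num_graphs, num_tasks] logits
    loss = loss_fn(out[is_labeled],              # is_labeled masks out NaN (unlabeled) targets
                   batch.y[is_labeled].float())  # cast labels to float32
    loss.backward()                              # backpropagate
    optimizer.step()                             # update parameters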

    Corrections are welcome if there are any mistakes above! The attachment is the complete .py file for this assignment.

• Original article: https://blog.csdn.net/wangxiaojie6688/article/details/132787951