Memory-Associated Differential Learning: Paper and Code Walkthrough

Paper sources:

Paper PDF:

Memory-Associated Differential Learning paper

Paper code:

Memory-Associated Differential Learning code

Paper walkthrough:

1.Abstract

Conventional Supervised Learning approaches focus on the mapping from input features to output labels. After training, the learnt models alone are adapted onto testing features to predict testing labels in isolation, with training data wasted and their associations ignored. To take full advantage of the vast number of training data and their associations, we propose a novel learning paradigm called Memory-Associated Differential (MAD) Learning. We first introduce an additional component called Memory to memorize all the training data. Then we learn the differences of labels as well as the associations of features in the combination of a differential equation and some sampling methods. Finally, in the evaluating phase, we predict unknown labels by inferencing from the memorized facts plus the learnt differences and associations in a geometrically meaningful manner. We gently build this theory in unary situations and apply it on Image Recognition, then extend it into Link Prediction as a binary situation, in which our method outperforms strong state-of-the-art baselines on three citation networks and the ogbl-ddi dataset.

2.Introduction

Figure 1: The difference between Conventional Supervised Learning and MAD Learning. The former learns the mapping from features to labels in the training data and applies this mapping to testing data, while the latter learns the differences and associations among data and infers testing labels from memorized training data.

3.Related Works

Instead of treating External Memory as a way to add more learnable parameters that store uninterpretable hidden states, we try to memorize the facts as they are, and then learn the differences and associations between them.

Most of the experiments in this article are designed to solve the Link Prediction problem, in which we predict whether a pair of nodes in a graph are likely to be connected, how much weight their edge bears, or what attributes their edge should have.

Although our method is derived from a different perspective, we point out that Matrix Factorization can be seen as a simplification of MAD Learning with no memory and no sampling.
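To see this concretely (my own reading, not a derivation given in the paper): if the memory term r0 is dropped and the single reference u0 is placed at the origin of the position space, the binary MAD estimate ŷ|r0 = r0 + (f(u) − f(u0)) · g1(v) from Section 4.5 below collapses to ŷ = f(u) · g1(v), an inner product of two learned embeddings, which is exactly the Matrix Factorization score.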

4.Proposed Approach

4.1 Memory-Associated Differential Learning

By applying the Mean Value Theorem for Definite Integrals [Comenetz, 2002], we can estimate the unknown y from the known y0 if x0 is close enough to x:
ŷ|y0 = y0 + (x − x0) · y′(x)
In such a way, we connect the current prediction task y to the past fact y0, which can be stored in external memory, and convert the learning of our target function y(x) into the learning of a differential function y′(x), which in general is more accessible than the former.
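To make the estimate concrete, here is a tiny numerical sketch I put together (not the paper's code; the function y = x², its derivative and all values are made up purely for illustration):

import torch

# toy example: the true function is y = x**2, so y'(x) = 2x
x0, y0 = torch.tensor(3.0), torch.tensor(9.0)   # memorized fact (x0, y0)
x = torch.tensor(3.1)                           # query close to x0
g = lambda t: 2 * t                             # stands in for the learnt differential function

y_hat = y0 + (x - x0) * g(x)                    # y_hat|y0 = y0 + (x - x0) * y'(x)
print(y_hat.item())                             # 9.62, close to the true value 3.1**2 = 9.61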

4.2 Inferencing from Multiple References

To get a steady and accurate estimation of y, we can sample n references x1, x2, …, xn to get n estimations ŷ|y1, ŷ|y2, …, ŷ|yn and combine them with an aggregator such as the mean:
ŷ = mean(ŷ|y1, ŷ|y2, …, ŷ|yn) = (1/n) · Σi ŷ|yi
Here we adopt a function Softmin, derived from Softmax, which rescales the input d-dimensional array v so that every element of v lies in the range [0, 1] and all of them sum to 1:
Softmin(v)i = exp(−vi) / Σj exp(−vj)
By applying Softmin we get the aggregated estimation:
ŷ = Σi Softmin(d)i · ŷ|yi,  where di is the distance between the query and the i-th reference
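The repository below realizes this aggregation literally as torch.softmax(-dist, dim=1). A stripped-down sketch with made-up distances and estimations (the comparison against the plain mean is my own addition):

import torch

dist = torch.tensor([0.1, 0.5, 2.0, 5.0])            # distances to 4 references (made up)
estimations = torch.tensor([1.2, 1.0, 0.3, -0.8])     # their individual estimations (made up)

mean_estimate = estimations.mean()                    # plain mean aggregator
weights = torch.softmax(-dist, dim=0)                 # Softmin(d) = Softmax(-d)
softmin_estimate = (weights * estimations).sum()      # distance-aware aggregation
print(mean_estimate.item(), softmin_estimate.item())  # close references dominate the latter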
Figure 2: (a) Memory-Associated Differential Learning infers labels from memorized ones following the first-order Taylor series approximation: y ≈ y0 + ∆x · y′(x). (b) In binary MAD Learning, when v = v0 holds, ∂r/∂u|(u,v) simplifies to ∂r/∂u|v, since it is the slight change of r obtained by moving u slightly away from u0 while v is held fixed.

4.3 Soft Sentinels and Uncertainty

We introduce a mechanism on top of Softmin named the Soft Sentinel. A Soft Sentinel is a dummy element mixed into the array of estimations that carries no information (e.g. its logit is 0) but has a set distance (e.g. 0).

The estimation after adding k Soft Sentinels at distance 1 is:
ŷ = Σi [ exp(−di) / (Σj exp(−dj) + k · e⁻¹) ] · ŷ|yi
When Soft Sentinels are involved, only estimations given by close-enough references retain most of their impact on the final result, so that unreliable estimations are suppressed.
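citations.py below implements Soft Sentinels exactly this way: k extra entries with logit 0 and distance 1 are appended before the Softmin, so references much farther than 1 end up with negligible weight. A stripped-down, single-example sketch (batch dimension omitted, numbers made up):

import torch

k = 8                                              # number of Soft Sentinels
logits = torch.tensor([1.2, 1.0, 0.3, -0.8])       # estimations from 4 references (made up)
dist = torch.tensor([0.1, 0.5, 2.0, 5.0])          # distances to those references (made up)

logits = torch.cat((logits, torch.zeros(k)))       # each sentinel has logit 0
dist = torch.cat((dist, torch.ones(k)))            # each sentinel sits at distance 1
weights = torch.softmax(-dist, dim=0)              # Softmin over distances plus sentinels
y_hat = (weights * logits).sum()
print(y_hat.item())                                # distant references are now largely ignored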

4.4 Other Details
For the sake of flexibility and performance, we usually do not use the input features x directly, but first convert x into a position f(x).

To adapt to this situation, we generally wrap the memory with an adaptor function m, such as a one-layer MLP, getting ŷ|y0 = m(y0) + (f(x) − f(x0)) · g(x), where g(x) stands for the gradient.
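In citations.py the adaptor m is simply nn.Linear(1, 1) (self.adapt inside recall()). Below is a minimal sketch of the resulting single-reference estimate; the feature dimension and the two separate linear encoders f and g are assumptions chosen for illustration, not the paper's exact configuration:

import torch
import torch.nn as nn

in_feats, dim = 1433, 32          # assumed feature and position dimensions
f = nn.Linear(in_feats, dim)      # position encoder f(x)
g = nn.Linear(in_feats, dim)      # differential function g(x)
m = nn.Linear(1, 1)               # adaptor wrapping the memorized label y0

def estimate(x, x0, y0):
    # y_hat|y0 = m(y0) + (f(x) - f(x0)) . g(x)
    return m(y0.unsqueeze(-1)).squeeze(-1) + ((f(x) - f(x0)) * g(x)).sum(-1)

x, x0 = torch.randn(in_feats), torch.randn(in_feats)
y0 = torch.tensor(1.0)            # memorized label of the reference x0
print(estimate(x, x0, y0))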

When the encodings of nodes are dynamic and no features are provided, we usually adopt Random Mode in the training phase for efficiency and Dynamic NN Mode in the evaluation phase for performance.
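In citations.py, Random Mode is a plain torch.randint draw of reference indices, while Dynamic NN Mode corresponds to the nns() method, which picks each node's nearest neighbours by feature distance. A condensed contrast of the two (the feats tensor and sizes are made up):

import torch

n_nodes, n_samples = 100, 8
feats = torch.randn(n_nodes, 16)               # hypothetical node positions/features

# Random Mode: cheap, used during training
random_refs = torch.randint(0, n_nodes, (n_nodes, n_samples))

# Dynamic NN Mode: nearest neighbours by feature distance, used during evaluation
dists = (feats.unsqueeze(1) - feats.unsqueeze(0)).norm(dim=-1)     # (n_nodes, n_nodes)
nn_refs = dists.topk(1 + n_samples, largest=False).indices[:, 1:]  # drop self (distance 0)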

4.5 Binary MAD Learning
We model the relationship between a pair of nodes in a graph by extending MAD Learning into binary situations.
ŷ|r0 = r0 + (u − u0) · ∂r/∂u|(u,v)   when v = v0
ŷ|r0 = r0 + (v − v0) · ∂r/∂v|(u,v)   when u = u0
Therefore, we may further assume ∂r/∂u|(u,v) = g1(v) if v = v0 and ∂r/∂v|(u,v) = g2(u) if u = u0, making

ŷ|r0 = r0 + (u − u0) · g1(v)   when v = v0
ŷ|r0 = r0 + (v − v0) · g2(u)   when u = u0
Here g1(·) is the destination differential function and g2(·) is the source differential function. If the edge is undirected, these two functions can be shared.
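Putting the two directions together, here is a single-edge sketch of what the batched forward() in citations.py computes (Soft Sentinels and the nn.Linear(1, 1) adaptor are omitted, the memorized values r0 are passed in directly, a single differential function g is shared as for undirected edges, and all tensors are made up):

import torch
import torch.nn as nn

dim, in_feats, n_refs = 32, 1433, 8
f = nn.Linear(in_feats, dim)        # position encoder
g = nn.Linear(in_feats, dim)        # shared differential function (undirected edges)

def score(u_feat, v_feat, ref_u, ref_v, r0_u, r0_v):
    # ref_u, ref_v: (n_refs, in_feats) features of the reference nodes u0 and v0
    # r0_u: memorized r(u0, v) for each reference u0; r0_v: memorized r(u, v0)
    logits1 = ((f(u_feat) - f(ref_u)) * g(v_feat)).sum(-1) + r0_u  # g(v)·(u-u0)+r0, v=v0
    logits2 = ((f(v_feat) - f(ref_v)) * g(u_feat)).sum(-1) + r0_v  # g(u)·(v-v0)+r0, u=u0
    logits = torch.cat((logits1, logits2))
    dist = torch.cat((f(u_feat) - f(ref_u), f(v_feat) - f(ref_v))).norm(dim=-1)
    return torch.sigmoid((torch.softmax(-dist, dim=0) * logits).sum())

u, v = torch.randn(in_feats), torch.randn(in_feats)
refs_u, refs_v = torch.randn(n_refs, in_feats), torch.randn(n_refs, in_feats)
r0_u, r0_v = torch.zeros(n_refs), torch.ones(n_refs)
print(score(u, v, refs_u, refs_v, r0_u, r0_v))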

5.Experiments

In the training phase, we sample arbitrary pairs of nodes to construct negative samples [Grover and Leskovec, 2016] and compare the scores between connected pairs and negative samples with Cross-Entropy as the loss function:
L = −(1/y) · Σi log py(i) − (1/n) · Σi log(1 − pn(i))

where y is the number of positive samples and n the number of negative samples, py(i) is the predicted probability of the i-th positive sample and pn(i) that of the i-th negative sample. In the evaluating phase, we record the scores not only in Dynamic NN Mode but also in Random Mode.
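The loss in citations.py further below is written in exactly this form, with a small constant added for numerical stability. A minimal sketch with made-up probabilities:

import torch

p_pos = torch.tensor([0.9, 0.8, 0.95])   # predicted probabilities of positive samples (made up)
p_neg = torch.tensor([0.1, 0.3])         # predicted probabilities of negative samples (made up)

loss = (-torch.log(1e-5 + p_pos).mean()          # push positive pairs towards 1
        - torch.log(1e-5 + 1 - p_neg).mean())    # push negative pairs towards 0
print(loss.item())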

We have these three experimental settings to examine the contribution of Softmin and Soft Sentinels:

mean. Estimations are aggregated by the mean function.
softmin. Estimations given by different references are summed up, weighted by the results of Softmin applied to the distances.
sentinel. Estimations of softmin with 8 Soft Sentinels at distance 1 added.

As shown in Figure 4(b), there is not much difference between mean and softmin. But when mixed with Soft Sentinels, MAD Learning performs better and converges faster.

We repeat that MAD Learning does not predict directly. From another point of view, this experiment implies that indirect references can be as beneficial as direct information.

6.Discussion

By extending it from a scalar to a vector, MAD Learning can be used for graphs with featured edges.

We also point out that MAD Learning can learn relations in heterogeneous graphs where nodes belong to different types (usually represented by encodings of different lengths). The only requirement is that the positions of the source nodes should match the gradients of the destination nodes, and vice versa.

7. Conclusion

In this work, we explore a novel learning paradigm which is flexible, effective and interpretable. The outstanding results, especially on Link Prediction, open the door for several research directions:

  1. The most important part of MAD Learning is memory. However, MAD Learning has to index the whole training data for random access. In Link Prediction, we implement memory as a dense adjacency matrix, which results in a huge occupation of space. Ways to shrink the memory and improve the utilization of space should be investigated in the future.

  2. Based on memory as the ground truth, MAD Learning appends some difference as the second part. We implement this difference simply as the product of distance and differential function, but we believe there exist different ways to model it.

  3. The third part of MAD Learning is the similarity, which is used to assign weights to estimations given by different references. We reuse distance to compute the similarity, but decoupling it with other embeddings and other measurements such as the inner product is also worth exploring.

  4. In this work, we deliberately do not combine direct information, in order to focus only on MAD Learning. Since MAD Learning takes another, parallel route to prediction, we believe integrating MAD Learning and Conventional Supervised Learning is also a promising direction.

Code walkthrough:

For the MAD module I mainly referred to a senior student's article, Memory-Associated Differential Learning论文Link Prediction源码解读 (I'm still quite a beginner =。=).
The analysis below focuses on citations.py:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import dgl
import dgl.nn
from sklearn import metrics

# Hyperparameters
g_data_name = 'pubmed'  # cora | citeseer | pubmed
g_toy = False   # the toy-related branches below appear to be the author's own debugging code
g_dim = 32      # 32-dimensional positions, as in the paper
n_samples = 8   # 8 references per prediction
total_epoch = 200
lr = 0.005
if g_toy:
    g_dim = g_dim // 2
elif g_data_name == 'pubmed':
    # changed hyperparameters for link prediction on pubmed
    g_dim, n_samples, total_epoch, lr = 64, 64, 2000, 0.001


def gpu(x):
    return x.cuda() if torch.cuda.is_available() else x


def cpu(x):
    return x.cpu() if torch.cuda.is_available() else x


def ip(x, y):
    # batched inner product: unsqueeze(-2) adds a dimension in the second-to-last
    # position and unsqueeze(-1) in the last, so the matrix product contracts the
    # feature dimension; the two trailing singleton dimensions are then squeezed away
    return (x.unsqueeze(-2) @ y.unsqueeze(-1)).squeeze(-1).squeeze(-1)


class MAD(nn.Module):
    def __init__(self, in_feats, n_nodes, node_feats,
                 n_samples, mem, feats, gather2neighbor=False):
        super(self.__class__, self).__init__()
        self.n_nodes = n_nodes
        self.node_feats = node_feats
        self.n_samples = n_samples
        self.mem = mem
        self.feats = feats
        self.gather2neighbor = gather2neighbor
        self.f = gpu(nn.Linear(in_feats, node_feats))
        self.g = (None if gather2neighbor
                  else gpu(nn.Linear(in_feats, node_feats)))
        # self.g plays the role of g(u) and g(v) in the paper: g(u) is the partial
        # derivative of r(u, v) with respect to u at v_0, i.e. the small change of
        # r(u, v) obtained by moving u slightly away from u_0 while v is held fixed
        # at v_0 (the same idea as in calculus).
        self.adapt = gpu(nn.Linear(1, 1))
        self.nn = None

    def nns(self, src, dst):    # debugging code: nearest-neighbour reference selection
        if self.nn is None:
            n = self.n_samples
            self.nn = gpu(torch.empty((self.n_nodes, n), dtype=int))
            for perm in DataLoader(range(self.n_nodes), 64, shuffle=False):
                self.nn[perm] = (
                    self.feats[perm].unsqueeze(1) - self.feats.unsqueeze(0)
                ).norm(dim=-1).topk(1 + n, largest=False).indices[..., 1:]
        return self.nn[src], self.nn[dst]

    def recall(self, src, dst):
        # Called from forward() as self.recall(mid0, dst.unsqueeze(1)), so src = mid0
        # with shape (1024, 8) and dst has shape (1024, 1). self.mem is the adjacency
        # matrix of the training edges; indexing it with two tensors triggers
        # broadcasting: the single index in each row of dst is paired with the 8
        # reference indices in the same row of src, giving 1024 x 8 coordinate pairs,
        # so self.mem[src, dst] fetches 1024 x 8 memorized facts r_0.
        if self.mem is None:
            return 0
        return self.adapt((0.0 + self.mem[src, dst]).unsqueeze(-1)).squeeze(-1)
        # self.mem is a boolean tensor; adding 0.0 turns it into floats. The r_0(u_0, v_0)
        # values are not used directly but passed through self.adapt (nn.Linear(1, 1)),
        # i.e. a learnable rescaling of r_0 that the paper does not discuss explicitly.

    def forward(self, src, dst):
        # Called as mad(train_src[perm], train_dst[perm]),
        # so src = train_src[perm] and dst = train_dst[perm].
        n = src.shape[0]    # number of edges in this batch
        feats = self.feats  # node features
        g = self.f if self.gather2neighbor else self.g
        # mid0 and mid1 are (n, n_samples) tensors of random reference node ids drawn
        # from [0, n_nodes); the commented-out line would use nearest neighbours instead
        mid0 = torch.randint(0, self.n_nodes, (n, self.n_samples))
        mid1 = torch.randint(0, self.n_nodes, (n, self.n_samples))
        # mid0, mid1 = self.nns(src, dst)
        srcdiff = self.f(feats[src]).unsqueeze(1) - self.f(feats[mid0])
        # feats[src] has shape (1024, 1433) and becomes (1024, 32) after self.f;
        # feats[mid0] has shape (1024, 8, 1433) and becomes (1024, 8, 32).
        # unsqueeze(1) turns the former into (1024, 1, 32) so that broadcasting
        # subtracts each source position from its 8 reference positions, i.e. the
        # (u - u_0) of Section 3.5 with 1024 x 8 references; srcdiff is (1024, 8, 32).
        logits1 = (ip(srcdiff, g(feats[dst]).unsqueeze(1))
                   + self.recall(mid0, dst.unsqueeze(1)))
        # logits1 implements g(v)·(u - u_0) + r_0 for v = v_0, one batch at a time:
        # g(feats[dst]) has shape (1024, 32) and is unsqueezed to (1024, 1, 32) so that
        # ip() can broadcast it against srcdiff; inside ip() the shapes become
        # (1024, 8, 1, 32) and (1024, 1, 32, 1), and the (1, 32) x (32, 1) matrix product
        # is exactly the element-wise product of g(v) and (u - u_0) summed up, giving
        # (1024, 8). recall(mid0, dst.unsqueeze(1)) adds the 1024 x 8 memorized r_0 values.
        dstdiff = self.f(feats[dst]).unsqueeze(1) - self.f(feats[mid1])
        # the same construction on the destination side gives (v - v_0) for 8 references
        logits2 = (ip(dstdiff, g(feats[src]).unsqueeze(1))
                   + self.recall(src.unsqueeze(1), mid1))
        # logits2 implements g(u)·(v - v_0) + r_0 for u = u_0
        logits = torch.cat((logits1, logits2), dim=1)
        dist = torch.cat((srcdiff, dstdiff), dim=1).norm(dim=2)
        # srcdiff and dstdiff are both (1024, 8, 32), so the concatenation is (1024, 16, 32);
        # norm(dim=2) computes the 2-norm (the default p) over the position dimension,
        # giving the distances used by Softmin, shape (1024, 16)
        logits = torch.cat((logits, gpu(torch.zeros(n, self.n_samples))), dim=1)
        # 8 Soft Sentinels at distance 1: each sentinel has logit 0, hence zeros
        dist = torch.cat((dist, gpu(torch.ones(n, self.n_samples))), dim=1)
        # each Soft Sentinel has distance 1, hence ones
        return torch.sigmoid(ip(logits, torch.softmax(-dist, dim=1)))
        # softmax(-dist) is Softmin over the distances; ip() forms the weighted sum of
        # the estimations and sigmoid turns it into a probability


dataset = (dgl.data.CoraGraphDataset() if g_data_name == 'cora'
           else dgl.data.CiteseerGraphDataset() if g_data_name == 'citeseer'
           else dgl.data.PubmedGraphDataset())
# DGL's built-in datasets; usage is documented at
# https://docs.dgl.ai/tutorials/blitz/2_dglgraph.html
graph = dataset[0]
# take the first (and only) graph
src, dst = graph.edges()
# all edges of the graph, as sequences of source and destination node ids
node_features = gpu(graph.ndata['feat'])
# node features
node_labels = gpu(graph.ndata['label'])
# node labels
train_mask = graph.ndata['train_mask']
# 1-D boolean mask over all nodes; True marks nodes belonging to the training set
valid_mask = graph.ndata['val_mask']
# boolean mask for the validation set
test_mask = graph.ndata['test_mask']
# boolean mask for the test set
n_nodes = graph.num_nodes()
# number of nodes
n_features = node_features.shape[1]
# shape[1] is the number of columns, i.e. the node feature dimension
n_labels = int(node_labels.max().item() + 1)
# item() extracts the value of a single-element tensor without changing its type;
# the number of label classes is obtained as max label value + 1

adj = gpu(torch.zeros((n_nodes, n_nodes), dtype=bool))
adj[src, dst] = 1
adj[dst, src] = 1
# build an all-zero adjacency matrix, then insert the edges symmetrically
if g_toy:   # toy mode uses the whole graph; this appears to be the author's own testing code
    mem = None
    train_src = gpu(src)
    train_dst = gpu(dst)
    mlp = gpu(nn.Linear(g_dim, n_labels))
    params = list(mlp.parameters())
    print('mlp params:', sum(p.numel() for p in params))
    mlp_opt = optim.Adam(params, lr=lr)
else:
    n = src.shape[0]
    # shape[0] is the size of the first dimension, i.e. the number of edges
    perm = torch.randperm(n)
    # randomly shuffle the edge indices
    val_num = int(0.05 * n)     # number of validation edges
    test_num = int(0.1 * n)     # number of test edges
    train_src = gpu(src[perm[val_num + test_num:]])
    train_dst = gpu(dst[perm[val_num + test_num:]])
    # the training set is whatever remains after removing the validation and test edges
    val_src = gpu(src[perm[:val_num]])
    val_dst = gpu(dst[perm[:val_num]])
    # validation split
    test_src = gpu(src[perm[val_num:val_num + test_num]])
    test_dst = gpu(dst[perm[val_num:val_num + test_num]])
    # test split
    train_src, train_dst = (torch.cat((train_src, train_dst)),
                            torch.cat((train_dst, train_src)))
    # torch.cat concatenates two tensors; each split is symmetrized
    val_src, val_dst = (torch.cat((val_src, val_dst)),
                        torch.cat((val_dst, val_src)))
    test_src, test_dst = (torch.cat((test_src, test_dst)),
                          torch.cat((test_dst, test_src)))
    mem = gpu(torch.zeros((n_nodes, n_nodes), dtype=bool))
    mem[train_src, train_dst] = 1
    # the memory is the (symmetrized) adjacency matrix of the training edges

total_aucs = []
total_aps = []
for run in range(10):
    torch.manual_seed(run)  # fix the random seed of this run
    mad = MAD(              # build the MAD model
        in_feats=n_features,
        n_nodes=n_nodes,
        node_feats=g_dim,
        n_samples=n_samples,
        mem=mem,
        feats=node_features,
        gather2neighbor=g_toy,
    )
    params = list(mad.parameters())
    print('params:', sum(p.numel() for p in params))
    # list() turns the parameter generator into a list; all learnable
    # parameters of the network live in parameters()
    opt = optim.Adam(params, lr=0.01)   # Adam optimizer
    best_aucs = [0, 0]
    best_aps = [0, 0]
    best_accs = [0, 0]
    for epoch in range(1, total_epoch + 1):
        mad.train()     # switch MAD to training mode
        for perm in DataLoader(range(train_src.shape[0]), batch_size=1024, shuffle=True):
            # shuffle the training edge indices and iterate over them in batches of 1024
            opt.zero_grad()     # clear the gradients left over from the previous step
            p_pos = mad(train_src[perm], train_dst[perm])
            # calls MAD.forward on a batch of positive edges
            neg_src = gpu(torch.randint(0, n_nodes, (perm.shape[0], )))
            neg_dst = gpu(torch.randint(0, n_nodes, (perm.shape[0], )))
            # torch.randint draws node ids uniformly to build negative edges
            idx = ~(mem[neg_src, neg_dst])
            # ~ is the NOT operator: a randomly generated "negative" edge may actually
            # be a training edge, so such pairs are filtered out
            p_neg = mad(neg_src[idx], neg_dst[idx])
            loss = (-torch.log(1e-5 + 1 - p_neg).mean()
                    - torch.log(1e-5 + p_pos).mean())
            loss.backward()
            opt.step()
        if epoch % 10:
            continue
        if g_toy:
            with torch.no_grad():
                embed = mad.f(node_features)
            for i in range(100):
                mlp.train()
                mlp_opt.zero_grad()
                logits = mlp(embed)
                loss = F.cross_entropy(logits[train_mask], node_labels[train_mask])
                loss.backward()
                mlp_opt.step()
            with torch.no_grad():
                logits = mlp(embed)
                _, indices = torch.max(logits[valid_mask], dim=1)
                labels = node_labels[valid_mask]
                v_acc = torch.sum(indices == labels).item() * 1.0 / len(labels)
                _, indices = torch.max(logits[test_mask], dim=1)
                labels = node_labels[test_mask]
                t_acc = torch.sum(indices == labels).item() * 1.0 / len(labels)
            if v_acc > best_accs[0]:
                best_accs = [v_acc, t_acc]
                print(epoch, 'acc:', v_acc, t_acc)
            continue
        with torch.no_grad():
            mad.eval()
            aucs = []
            aps = []
            for src, dst in ((val_src, val_dst), (test_src, test_dst)):
                p_pos = mad(src, dst)
                n = src.shape[0]
                perm = torch.randperm(n * 2)
                neg_src = torch.cat((src, gpu(torch.randint(0, n_nodes, (n, )))))[perm]
                neg_dst = torch.cat((gpu(torch.randint(0, n_nodes, (n, ))), dst))[perm]
                idx = ~(adj[neg_src, neg_dst])
                neg_src = neg_src[idx][:n]
                neg_dst = neg_dst[idx][:n]
                p_neg = mad(neg_src, neg_dst)
                y_true = cpu(torch.cat((p_pos * 0 + 1, p_neg * 0)))
                y_score = cpu(torch.cat((p_pos, p_neg)))
                fpr, tpr, _ = metrics.roc_curve(y_true, y_score, pos_label=1)
                # roc_curve returns the false positive rate and the true positive rate
                # as the decision threshold varies
                auc = metrics.auc(fpr, tpr)
                ap = metrics.average_precision_score(y_true, y_score)
                aucs.append(auc)
                aps.append(ap)
            if aucs[0] > best_aucs[0]:
                best_aucs = aucs
                print(epoch, 'auc:', aucs)
            if aps[0] > best_aps[0]:
                best_aps = aps
                print(epoch, 'ap:', aps)
    print(run, 'best auc:', best_aucs)
    print(run, 'best ap:', best_aps)
    print(run, 'best acc (toy):', best_accs)
    total_aucs.append(best_aucs[1])
    total_aps.append(best_aps[1])
total_aucs = torch.tensor(total_aucs)
total_aps = torch.tensor(total_aps)
print('auc mean:', total_aucs.mean().item(), 'std:', total_aucs.std().item())
print('ap mean:', total_aps.mean().item(), 'std:', total_aps.std().item())

Finally, the last statement of MAD.forward (the Softmin-weighted aggregation followed by the sigmoid) is analyzed in detail in the senior student's article linked above; I really would not have worked it out on my own.
