RecSys Series Part 7: The 3 Variants of Boltzmann Machines for Collaborative Filtering

RecSys Series

Update: This article is part of a series where I explore recommendation systems in academia and industry. Check out the full series: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, and Part 7.

One of the best AI-related books that I read last year is Terrence Sejnowski’s “The Deep Learning Revolution.” The book explains how deep learning went from being an obscure academic field to an impactful technology in the information era. The author, Terry Sejnowski is one of the pioneers of deep learning who, together with Geoffrey Hinton, created Boltzmann machines: a deep learning network that has remarkable similarities to learning in the brain.

I recently listened to a podcast on Eye on AI where Terrence discussed machines dreaming, the birth of the Boltzmann machines, the inner workings of the brain, and the process of recreating them in neural networks. In particular, he and Geoff Hinton invented the Boltzmann machine with a physics-inspired architecture:

  • Each unit has a probability of producing an output that varies with the amount of input it is given.

  • They gave the network input and then kept track of the activity patterns within the network. For each connection, they kept track of the correlation between the input and the output. Then in order to be able to learn, they got rid of the inputs and let the network run free, which is called the sleep phase.

  • The learning algorithm is intuitive: They subtracted the sleep phase correlation from the wake learning phase and then adjusted the weights accordingly. With a big enough dataset, this algorithm can effectively learn arbitrary mappings between input and output.

The Boltzmann machine analogy turns out to be a good insight into what’s happening in the human brain during sleep. In cognitive science, there’s a concept called replay, where the hippocampus plays back our memories and experiences to the cortex, and then the cortex integrates them into the semantic knowledge base that we have about the world.

That’s a long-winded way to say that I have been interested in exploring Boltzmann machines for a while. And I was quite ecstatic to see their applications in the context of recommendation systems!

In this post and those to follow, I will be walking through the creation and training of recommendation systems, as I am currently working on this topic for my Master’s thesis.

  • Part 1 provided a high-level overview of recommendation systems, how they are built, and how they can be used to improve businesses across industries.

  • Part 2 provided a helpful review of the ongoing research initiatives concerning the strengths and application scenarios of these models.

  • Part 3 provided a couple of research directions that might be relevant to the recommendation system scholar community.

  • Part 4 provided the nitty-gritty mathematical details of 7 variants of matrix factorization that can be constructed: ranging from the use of clever side features to the application of Bayesian methods.

  • Part 5 provided the architecture design of 5 variants of multi-layer perceptron based collaborative filtering models, which are discriminative models that can interpret the features in a non-linear fashion.

  • Part 6 provided a master class on six variants of autoencoder-based collaborative filtering models, which are generative models that excel at learning the underlying feature representation.

In Part 7, I explore the use of Boltzmann Machines for collaborative filtering. More specifically, I will dissect three principled papers that incorporate Boltzmann Machines into their recommendation architecture. But first, let’s walk through a primer on Boltzmann Machine and its variants.

A Primer on Boltzmann Machine and Its Variants

According to its inventor:

“A Boltzmann Machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets composed of binary vectors. The learning algorithm is very slow in networks with many layers of feature detectors, but it can be made much faster by learning one layer of feature detectors at a time.”

To unpack this further, Hinton states that we can use Boltzmann machines to tackle two different sets of computational problems:

  1. Search Problem: Boltzmann machines have fixed weights on the connections, which are used as the cost function of an optimization procedure.

  2. Learning Problem: Given a set of binary data vectors, our goal is to find the weights on the connections to optimize the training process. Boltzmann machines update the weights’ values by solving many iterations of the search problem.

A Restricted Boltzmann Machine (RBM) is a specific type of Boltzmann machine with two layers of units. As illustrated below, the first layer consists of visible units, and the second layer consists of hidden units. In this restricted architecture, there are no connections between units within a layer.

Figure: Manish Nayak — An Intuitive Introduction of RBM (https://medium.com/datadriveninvestor/an-intuitive-introduction-of-restricted-boltzmann-machine-rbm-14f4382a0dbb)

The visible units in the model correspond to the observed components, and the hidden units represent the dependencies between these observed components. The goal is to model a joint probability of visible and hidden units: p(v, h). Because there are no connections between hidden units, the learning is effective as all hidden units are conditionally independent, given the visible units.

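For reference, the standard binary RBM makes this precise with an energy function (a textbook formulation rather than anything specific to one paper):

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i W_{ij} h_j, \qquad p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}$$

where Z is the partition function that normalizes the distribution, aᵢ and bⱼ are the visible and hidden biases, and Wᵢⱼ are the connection weights. The conditional independence mentioned above follows directly from the absence of visible-visible and hidden-hidden terms in the energy.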

A Deep Belief Network (DBN) is a multi-layer learning architecture that uses a stack of RBMs to extract a deep hierarchical representation of the training data. In such a design, the hidden layer of each sub-network serves as the visible layer for the upcoming sub-network.

Figure: Himanshu Singh — Deep Belief Networks: An Introduction (https://medium.com/analytics-army/deep-belief-networks-an-introduction-1d52bb867a25)

When learning through a DBN, the RBM in the bottom layer is first trained by feeding the original data into its visible units. Its parameters are then fixed, and the hidden units of this RBM are used as the input to the RBM in the second layer. The learning process continues until it reaches the top of the stacked sub-networks, and finally, a suitable model is obtained to extract features from the input. Since the learning process is unsupervised, it is common to add a supervised learning network at the end of the DBN to use it in a supervised task such as classification or regression (the Logistic Regression layer in the image above).

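As a rough Python sketch, this greedy layer-wise procedure boils down to a loop like the one below (it reuses the RBM class defined later in this post; the contrastive-divergence training step is elided, and the layer sizes and `training_set` tensor are illustrative assumptions):

```python
layer_sizes = [n_vis, 256, 100]  # visible -> hidden 1 -> hidden 2 (assumed sizes)
data = training_set              # assumed (n_samples, n_vis) binary tensor
rbms = []
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    rbm = RBM(n_in, n_out)
    # ... train `rbm` on `data` with contrastive divergence, then fix its parameters ...
    rbms.append(rbm)
    # The hidden activations of this RBM become the visible data for the next one
    data, _ = rbm.sample_h(data)
```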

Okay, it’s time to review the different Boltzmann Machine-based recommendation frameworks!

1 — Restricted Boltzmann Machines for Collaborative Filtering

Recall that in the classic collaborative filtering setting, we attempt to model the ratings (user-item interaction) matrix X with dimension n x d, where n is the number of users and d is the number of items. An entry xᵢⱼ (row i, column j) corresponds to user i’s rating for item j. In the MovieLens dataset (which has been used in all of my previous posts), xᵢⱼ ∈ {0, 1, 2, 3, 4, 5} (where 0 represents a missing rating).

  • For example, xᵢⱼ = 2 means that user i has given movie j the rating 2 out of 5. On the other hand, xᵢⱼ = 0 means that the user has not rated the movie j.

  • The rows of X encode each user’s preference over all movies, and the columns of X encode each item’s ratings received by all users.

Formally speaking, we define prediction and inference in the collaborative filtering context as follows:

  • Prediction: Given the observed ratings X, predict x_{im} (the rating that user i would give to a new query movie m).

  • Inference: Compute the probability p(x_{im} = k | Xₒ), where Xₒ denotes the non-zero entries of X and k ∈ {0, 1, 2, 3, 4, 5}.

Figure: The RBM architecture proposed in “Restricted Boltzmann Machines for Collaborative Filtering.”

Salakhutdinov, Mnih, and Hinton framed the task of computing p(x_{im} = k | Xₒ) as inference on an underlying RBM with trained parameters. The dataset is sub-divided into rating matrices, where a user’s ratings are one-hot encoded into a matrix V such that vⱼᵏ = 1 if the user rates movie j with rating k. The figure above illustrates the RBM graph:

  • V is a 5 x d matrix that corresponds to one-hot encoded integer ratings of the user.

  • h is a F x 1 vector of binary hidden variables, where F is the number of hidden variables.

  • W is a d x F x 5 tensor that encodes adjacency between ratings and hidden features. Its entry Wⱼcᵏ corresponds to the edge potential between rating k of the movie j and the hidden feature c.

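To make the encoding concrete, here is a small hypothetical helper (the function name and the 0-for-missing convention are mine) that builds V from a user’s raw rating vector:

```python
import numpy as np

def one_hot_ratings(user_ratings, K=5):
    """Encode a user's rating vector (0 = missing) into the K x d matrix V."""
    d = len(user_ratings)
    V = np.zeros((K, d), dtype=np.float32)
    for j, k in enumerate(user_ratings):
        if k > 0:
            V[k - 1, j] = 1.0  # v_j^k = 1 if the user rated movie j with rating k
    return V

V = one_hot_ratings([5, 0, 3, 1])  # movies 0, 2, 3 are rated; movie 1 is missing
```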

The whole user-item interaction matrix is a collection of V(s), where each V corresponds to a single user’s ratings. Because each user can have different missing values, each user has a unique RBM graph. In each RBM graph, the edges connect ratings and hidden features but do not appear for items with missing ratings. The paper treats W as a set of edge potentials that are tied across all such RBM graphs.

In the training phase, RBM characterizes the relationship between the ratings and hidden features using conditional probabilities p(vⱼᵏ = 1 | h) and p(hₐ = 1 | V):

Equation 1:

$$p(v_j^k = 1 \mid \mathbf{h}) = \frac{\exp\left(b_j^k + \sum_{a=1}^{F} h_a W_{ja}^k\right)}{\sum_{l=1}^{5} \exp\left(b_j^l + \sum_{a=1}^{F} h_a W_{ja}^l\right)}$$

Equation 2:

$$p(h_a = 1 \mid V) = \sigma\left(b_a + \sum_{j=1}^{d} \sum_{k=1}^{5} v_j^k W_{ja}^k\right)$$

Here bⱼᵏ and bₐ are the visible and hidden biases, and σ is the logistic sigmoid.

After getting these probabilities, there are two extra steps to compute p(vₒᵏ = 1 | V):

  1. Compute the distribution of each hidden feature in h based on observed ratings V and the edge potentials W (p(hₐ = 1 | V) for each a).

  2. Compute p(vₒᵏ = 1 | V) based on the edge potentials W and the distribution of p(hₐ = 1 | V).

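A hedged NumPy sketch of these two steps, with tensor shapes assumed for illustration (V is 5 x d as above, W is d x F x 5, b_h is a length-F hidden bias, and b_v is a d x 5 visible bias):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_rating_distribution(V, W, b_h, b_v, o):
    """Return p(v_o^k = 1 | V) for each rating k of a query movie o."""
    # Step 1: p(h_a = 1 | V) for every hidden feature a (equation 2)
    p_h = sigmoid(b_h + np.einsum('kj,jak->a', V, W))
    # Step 2: softmax over the ratings k of the query movie o (equation 1),
    # plugging in the hidden-unit probabilities in place of binary samples
    scores = b_v[o] + np.einsum('a,ak->k', p_h, W[o])
    return np.exp(scores) / np.exp(scores).sum()
```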

In the optimization phase, W is optimized via the marginal likelihood of V, p(V). The gradient ∇Wⱼₐᵏ is computed using contrastive divergence, which approximates the gradient based on Gibbs sampling:

Equation 3:

$$\nabla W_{ja}^k = \epsilon \left( \langle v_j^k h_a \rangle_{\text{data}} - \langle v_j^k h_a \rangle_T \right)$$

The expectation ⟨·⟩_T represents the distribution of samples from running the Gibbs sampler, initialized at the data, for T full steps. T is typically set to 1 at the beginning of learning and increased as the learning converges. When running the Gibbs sampler, the RBM reconstructs (as seen in equation 1) the distribution over the non-missing ratings. The approximate gradients of contrastive divergence can then be averaged over all n users.

The PyTorch code of the RBM model class is given below for illustration purpose:

```python
import torch


class RBM:

    def __init__(self, n_vis, n_hid):
        """
        Initialize the parameters (weights and biases) we optimize during the training process
        :param n_vis: number of visible units
        :param n_hid: number of hidden units
        """
        # Weights used for the probability of the visible units given the hidden units
        self.W = torch.randn(n_hid, n_vis)  # torch.randn: normal distribution with mean = 0, variance = 1
        # Bias for the probability of the visible units given the hidden units (p_v_given_h)
        self.v_bias = torch.randn(1, n_vis)  # fake dimension for the batch = 1
        # Bias for the probability of the hidden units given the visible units (p_h_given_v)
        self.h_bias = torch.randn(1, n_hid)  # fake dimension for the batch = 1

    def sample_h(self, x):
        """
        Sample the hidden units
        :param x: the dataset
        """
        # Probability h is activated given the value v: sigmoid(Wx + a)
        # torch.mm computes the product of 2 tensors;
        # W.t() takes the transpose because W is laid out for p_v_given_h
        wx = torch.mm(x, self.W.t())
        # Expand the bias to the mini-batch
        activation = wx + self.h_bias.expand_as(wx)
        # Calculate the probability p_h_given_v
        p_h_given_v = torch.sigmoid(activation)
        # Bernoulli sampling: whether the hidden unit is activated or not (0 or 1)
        return p_h_given_v, torch.bernoulli(p_h_given_v)

    def sample_v(self, y):
        """
        Sample the visible units
        :param y: the dataset
        """
        # Probability v is activated given the value h: sigmoid(Wy + b)
        wy = torch.mm(y, self.W)
        # Expand the bias to the mini-batch
        activation = wy + self.v_bias.expand_as(wy)
        # Calculate the probability p_v_given_h
        p_v_given_h = torch.sigmoid(activation)
        # Bernoulli sampling: predict whether a user loves the movie or not (0 or 1)
        return p_v_given_h, torch.bernoulli(p_v_given_h)

    def train(self, v0, vk, ph0, phk):
        """
        Perform the contrastive divergence algorithm to optimize the weights that minimize the energy,
        which maximizes the log-likelihood of the model
        """
        # Approximate the gradients with the CD algorithm
        self.W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
        # Sum over the batch dimension (0) so the bias updates keep their shapes
        # (accumulated with +=, matching the weight update above)
        self.v_bias += torch.sum((v0 - vk), 0)
        self.h_bias += torch.sum((ph0 - phk), 0)
```

For my PyTorch implementation, I designed the RBM architecture with a hidden layer of 100 units activated by a non-linear sigmoid function. Other hyper-parameters include a batch size of 512 and 50 epochs.

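Below is a minimal sketch of how this class might be driven by a CD-k training loop with those hyper-parameters. The `training_set` tensor, the -1 convention for unrated movies, and k = 10 Gibbs steps are assumptions for illustration:

```python
import torch

n_users, n_vis = training_set.shape  # assumed (n_users, n_vis) float tensor
rbm = RBM(n_vis=n_vis, n_hid=100)

for epoch in range(50):
    train_loss, n_batches = 0.0, 0
    for start in range(0, n_users, 512):
        v0 = training_set[start:start + 512]  # original data, kept fixed
        vk = v0.clone()                       # chain state, updated by Gibbs steps
        ph0, _ = rbm.sample_h(v0)
        for _ in range(10):                   # k = 10 Gibbs steps
            _, hk = rbm.sample_h(vk)
            _, vk = rbm.sample_v(hk)
            vk[v0 < 0] = v0[v0 < 0]           # freeze the missing ratings
        phk, _ = rbm.sample_h(vk)
        rbm.train(v0, vk, ph0, phk)
        train_loss += torch.mean(torch.abs(v0[v0 >= 0] - vk[v0 >= 0])).item()
        n_batches += 1
    print(f"epoch {epoch + 1}: reconstruction loss {train_loss / n_batches:.4f}")
```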

2 — Explainable Restricted Boltzmann Machines for Collaborative Filtering

Explanations for recommendations can have multiple benefits, including effectiveness (helping users to make the right decisions), efficiency (assisting users to make faster decisions), and transparency (revealing the reasoning behind the recommendations). In the case of RBM, which assigns a low-dimensional set of features to items in a latent space, it is difficult to interpret these learned features. Therefore, a massive challenge is to choose an interpretable technique with moderate prediction accuracy for RBM.

Abdollahi and Nasraoui designed an RBM model for a collaborative filtering recommendation system that suggests items that are explainable while maintaining accuracy. The paper’s scope is limited to recommendations where no additional source of data is used in explanations, and where explanations for recommended items can be generated from the ratings given to these items by the active user’s neighbors only.

Figure: An example of a user-based neighbor-style explanation for a recommended item, as proposed in “Explainable RBM for CF.”

The main idea is that if many neighbors have rated the recommended item, then this could provide a basis upon which to explain the recommendations, using neighborhood-style explanation mechanisms. For user-based neighbor-style explanations, such as the one shown in the figure above, the Explainability Score of item i for user u is defined as:

Equation 4:

$$\text{Explainability Score}(u, i) = \frac{\sum_{x \in N_k(u)} r_{x,i}}{|N_k(u)| \cdot R_{\max}}$$

Here N_k(u) is the set of user u’s k neighbors, r_{x, i} is the rating of x on item i, and R_max is the maximum rating value of N_k(u) on i. Cosine similarity defines the neighborhood. Without loss of information, r_{x, i} is 0 for missing ratings, indicating that user x does not contribute to the user-based neighbor-style explanation of item i for user u. Therefore, the Explainability Score is between 0 and 1. Item i is explainable for user u only if its explainability score is larger than 0; when no explanation can be made, the score is 0.

The TensorFlow code of the RBM model class is given below for illustration purpose:

```python
import numpy as np
import tensorflow as tf


def rbm(movies_df):
    """
    Implement the RBM architecture in TensorFlow (TF1-style graph)
    :param movies_df: data frame that stores movies information
    :return: variables to be used during TensorFlow training
    """
    hiddenUnits = 100  # Number of hidden units
    visibleUnits = len(movies_df)  # Number of visible units (unique movies)

    # Create respective placeholder variables for storing visible and hidden layer biases and weights
    vb = tf.placeholder("float", [visibleUnits])  # Visible unit biases (one per unique movie)
    hb = tf.placeholder("float", [hiddenUnits])  # Hidden unit biases (one per feature)
    W = tf.placeholder("float", [visibleUnits, hiddenUnits])  # Weights that connect the hidden and visible layers

    # Pre-process the input data
    v0 = tf.placeholder("float", [None, visibleUnits])
    _h0 = tf.nn.sigmoid(tf.matmul(v0, W) + hb)
    h0 = tf.nn.relu(tf.sign(_h0 - tf.random_uniform(tf.shape(_h0))))

    # Reconstruct the pre-processed input data (sigmoid and ReLU activation functions are used)
    _v1 = tf.nn.sigmoid(tf.matmul(h0, tf.transpose(W)) + vb)
    v1 = tf.nn.relu(tf.sign(_v1 - tf.random_uniform(tf.shape(_v1))))
    h1 = tf.nn.sigmoid(tf.matmul(v1, W) + hb)

    # Set RBM training parameters
    alpha = 0.1  # Learning rate
    w_pos_grad = tf.matmul(tf.transpose(v0), h0)  # Positive gradients
    w_neg_grad = tf.matmul(tf.transpose(v1), h1)  # Negative gradients

    # Calculate the contrastive divergence to maximize
    CD = (w_pos_grad - w_neg_grad) / tf.to_float(tf.shape(v0)[0])

    # Create methods to update the weights and biases
    update_w = W + alpha * CD
    update_vb = vb + alpha * tf.reduce_mean(v0 - v1, 0)
    update_hb = hb + alpha * tf.reduce_mean(h0 - h1, 0)

    # Set the error function (RMSE)
    err = v0 - v1
    err_sum = tf.sqrt(tf.reduce_mean(err * err))

    # Initialize the current and previous weights and biases
    cur_w = np.zeros([visibleUnits, hiddenUnits], np.float32)  # Current weights
    cur_vb = np.zeros([visibleUnits], np.float32)  # Current visible unit biases
    cur_hb = np.zeros([hiddenUnits], np.float32)  # Current hidden unit biases
    prv_w = np.zeros([visibleUnits, hiddenUnits], np.float32)  # Previous weights
    prv_vb = np.zeros([visibleUnits], np.float32)  # Previous visible unit biases
    prv_hb = np.zeros([hiddenUnits], np.float32)  # Previous hidden unit biases

    return (v0, W, vb, hb, update_w, prv_w, prv_vb, prv_hb,
            update_vb, update_hb, cur_w, cur_vb, cur_hb, err_sum)
```

For my TensorFlow implementation, I designed the RBM architecture with a hidden layer of 100 units activated by a non-linear sigmoid function. Other hyper-parameters include a batch size of 512 and 50 epochs. I also showed a sample recommendation list for a hypothetical user, with explainability scores included.

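A hedged sketch of how the TF1-style graph above might be run in a session, where `train_matrix` is an assumed (n_users, n_movies) NumPy array of ratings:

```python
import tensorflow as tf

(v0, W, vb, hb, update_w, prv_w, prv_vb, prv_hb,
 update_vb, update_hb, cur_w, cur_vb, cur_hb, err_sum) = rbm(movies_df)

with tf.Session() as sess:
    for epoch in range(50):
        for start in range(0, len(train_matrix), 512):
            batch = train_matrix[start:start + 512]
            # One CD step: feed the previous parameters, fetch the updated ones
            cur_w, cur_vb, cur_hb = sess.run(
                [update_w, update_vb, update_hb],
                feed_dict={v0: batch, W: prv_w, vb: prv_vb, hb: prv_hb})
            prv_w, prv_vb, prv_hb = cur_w, cur_vb, cur_hb
        err = sess.run(err_sum, feed_dict={v0: train_matrix, W: cur_w,
                                           vb: cur_vb, hb: cur_hb})
        print(f"epoch {epoch + 1}: RMSE {err:.4f}")
```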

3 — Neural Autoregressive Distribution Estimator for Collaborative Filtering

One issue with the RBM model is that it suffers from inaccuracy and impractically long training times, since: (1) training is intractable, and (2) variational approximation or Markov Chain Monte Carlo is required. Uria, Cote, Gregor, Murray, and Larochelle proposed the Neural Autoregressive Distribution Estimator (NADE), which is a tractable distribution estimator for high-dimensional binary vectors. The estimator computes the conditional probability of each element, given the elements to its left in the binary vector, where all conditionals share the same parameters. The probability of the binary vector can then be obtained by taking the product of these conditionals. NADE can be optimized efficiently via back-propagation, instead of the expensive inference required to handle latent variables as in the case of RBM.

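In symbols, NADE factorizes the joint distribution of a D-dimensional binary vector x autoregressively:

$$p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i \mid \mathbf{x}_{<i})$$

where each conditional is computed by a feed-forward network whose parameters are shared across all conditionals.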

As shown in the NADE diagram below:

  • In the input layer, units with value 0 are shown in black, while units with value 1 are shown in white. The dashed border represents a layer pre-activation.

  • The outputs x^_0 give predictive probabilities for each dimension of a vector x_0, given elements earlier in some order.

  • There is no path of connections between an output and the value being predicted, or elements of x_0 later in the ordering.

  • Arrows connected together correspond to connections with shared parameters.

Figure: Illustration of a NADE model, as shown in “Neural Autoregressive Distribution Estimation.”

Zheng, Tang, Ding, and Zhou proposed CF-NADE, inspired by the RBM-CF and NADE models, which models the distribution of user ratings. Suppose we have four movies: m1 (rating 5), m2 (rating 3), m3 (rating 4), and m4 (rating 2). More specifically, the procedure goes as follows:

  1. The probability that the user gives m1 5-star conditioned on nothing.

  2. The probability that the user gives m2 3-star conditioned on giving m1 5-star.

  3. The probability that the user gives m3 4-star conditioned on giving m1 5-star and m2 3-star.

  4. The probability that the user gives m4 2-star conditioned on giving m1 5-star, m2 3-star, and m3 4-star.

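Multiplying these four conditionals gives the joint probability of this user’s ratings:

$$p(\mathbf{r}) = p(r_{m_1}{=}5)\; p(r_{m_2}{=}3 \mid r_{m_1}{=}5)\; p(r_{m_3}{=}4 \mid r_{m_1}{=}5, r_{m_2}{=}3)\; p(r_{m_4}{=}2 \mid r_{m_1}{=}5, r_{m_2}{=}3, r_{m_3}{=}4)$$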

Mathematically speaking, CF-NADE models the joint probability of the rating vector r by the chain rule as:

Equation 5:

$$p(\mathbf{r}) = \prod_{i=1}^{D} p\left(r_{m_{o_i}} \mid \mathbf{r}_{m_{o_{<i}}}\right)$$

  • D is the number of items that the user has rated.

  • o is a D-tuple in the set of permutations of (1, 2, …, D).

  • mᵢ ∈ {1, 2, …, M} is the index of the i-th rated item.

  • rᵘ = (rᵘ_{m_{o₁}}, rᵘ_{m_{o₂}}, …, rᵘ_{m_{oD}}) denotes the training case for user u.

  • rᵘ_{m_{oᵢ}} ∈ {1, 2, …, K} denotes the rating that the user gave to item m_{oᵢ}.

  • rᵘ_{m_{o<ᵢ}} denotes the first i − 1 elements of rᵘ indexed by o.

To expand on the process of getting the conditionals in equation 5, CF-NADE first computes the hidden representation of dimension H given rᵘ_{m_{o<ᵢ}} as follows:

Equation 6:

$$\mathbf{h}\left(\mathbf{r}_{m_{o_{<i}}}\right) = g\left(\mathbf{c} + \sum_{j < i} \mathbf{W}^{r_{m_{o_j}}}_{:,\, m_{o_j}}\right)$$

  • g is the activation function.

  • Wᵏ is the connection matrix associated with rating k.

  • Wᵏ_{:,j} is the j-th column of Wᵏ and Wᵏ_{i,j} is an interaction parameter between the i-th hidden unit and item j with rating k.

  • c is the bias term.

Using this hidden representation from equation 6, CF-NADE then computes sᵏ_{m_{oᵢ}}(r_{m_{o_{<i}}}), the score indicating the preference that the user gives rating k to item m_{oᵢ}, given the previous ratings r_{m_{o_{<i}}}:

Equation 7:

$$s^k_{m_{o_i}}\left(\mathbf{r}_{m_{o_{<i}}}\right) = \sum_{j=1}^{k} \left( \mathbf{V}^j_{m_{o_i},:}\; \mathbf{h}\left(\mathbf{r}_{m_{o_{<i}}}\right) + b^j_{m_{o_i}} \right)$$

Vʲ and bʲ are the connection matrix and bias term associated with rating j; they contribute to the score sᵏ for every rating k greater than or equal to j. Using this score from equation 7, the conditionals in equation 5 can be modeled as:

Equation 8:

$$p\left(r_{m_{o_i}} = k \mid \mathbf{r}_{m_{o_{<i}}}\right) = \frac{\exp\left(s^k_{m_{o_i}}\right)}{\sum_{l=1}^{K} \exp\left(s^l_{m_{o_i}}\right)}$$

CF-NADE is optimized via minimization of the negative log-likelihood of p(r) in equation 5:

Equation 9:

$$-\log p(\mathbf{r}) = -\sum_{i=1}^{D} \log p\left(r_{m_{o_i}} \mid \mathbf{r}_{m_{o_{<i}}}\right)$$

Ideally, the order of movies (represented by the notation o) should follow the timestamps of the ratings. However, the paper shows that drawing the ordering at random can yield good performance.

The Keras code of the CF-NADE model class is given below for illustration purpose:

```python
from tensorflow.keras import initializers, regularizers
from tensorflow.keras.layers import Layer


class NADE(Layer):

    def __init__(self, hidden_dim, activation, W_regularizer=None, V_regularizer=None,
                 b_regularizer=None, c_regularizer=None, bias=False, args=None, **kwargs):
        self.init = initializers.get('uniform')
        self.bias = bias
        self.activation = activation
        self.hidden_dim = hidden_dim
        self.W_regularizer = regularizers.get(W_regularizer)
        self.V_regularizer = regularizers.get(V_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.c_regularizer = regularizers.get(c_regularizer)
        self.args = args
        super(NADE, self).__init__(**kwargs)

    def build(self, input_shape):
        """
        Build the NADE architecture
        :param input_shape: shape of the input, (batch, items, rating values)
        """
        self.input_dim1 = input_shape[1]  # number of items
        self.input_dim2 = input_shape[2]  # number of rating values
        # W: connections between the one-hot ratings and the hidden units (equation 6)
        self.W = self.add_weight(shape=(self.input_dim1, self.input_dim2, self.hidden_dim),
                                 initializer=self.init, name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer)
        if self.bias:
            # c: bias of the hidden units (equation 6)
            self.c = self.add_weight(shape=(self.hidden_dim,), initializer=self.init,
                                     name='{}_c'.format(self.name), regularizer=self.c_regularizer)
        if self.bias:
            # b: bias of the rating scores (equation 7)
            self.b = self.add_weight(shape=(self.input_dim1, self.input_dim2), initializer=self.init,
                                     name='{}_b'.format(self.name), regularizer=self.b_regularizer)
        # V: connections between the hidden units and the rating scores (equation 7)
        self.V = self.add_weight(shape=(self.hidden_dim, self.input_dim1, self.input_dim2),
                                 initializer=self.init, name='{}_V'.format(self.name),
                                 regularizer=self.V_regularizer)
        super(NADE, self).build(input_shape)
```
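For orientation, wiring the layer into a model might look roughly like the following. Note that the snippet above omits the layer’s call() method (which would implement equations 6 to 8), so this is only a structural sketch; num_items and the loss choice are assumptions:

```python
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

input_layer = Input(shape=(num_items, 5))  # one-hot ratings per user (assumed shape)
output_layer = NADE(hidden_dim=100, activation='tanh')(input_layer)
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```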

For my Keras implementation, I designed the NADE architecture with a hidden layer of 100 units, optimized via Adam with a learning rate of 0.001. Other hyper-parameters include a batch size of 512 and 50 epochs.

Model Evaluation

You can check out all three Boltzmann Machines-based recommendation models that I built at this repository: https://github.com/khanhnamle1994/transfer-rec/tree/master/Boltzmann-Machines-Experiments.

  • The dataset is MovieLens 1M, similar to the three previous experiments that I have done using Matrix Factorization, Multilayer Perceptron, and Autoencoders. The goal is to predict the ratings that a user will give to a movie, in which the ratings are between 1 to 5.

  • The evaluation metric is Root Mean Squared Error (RMSE) in this setting (the formula is given right after this list). In other words, I want to minimize the delta between the predicted rating and the actual rating.

  • The result table is at the bottom of my repo’s README: the explainable RBM model has the lowest RMSE and shortest training time, while the NADE model has the highest RMSE and longest training time.

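For reference, the RMSE over a test set T of held-out (user, item) pairs is:

$$\text{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left(\hat{r}_{ui} - r_{ui}\right)^2}$$

where r̂ᵤᵢ is the predicted rating and rᵤᵢ the actual one.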

Conclusion

In this post, I have discussed the nuts and bolts of Boltzmann Machines and their use in collaborative filtering. I also walked through 3 different papers that use architectures inspired by Boltzmann Machines for the recommendation framework: (1) Restricted Boltzmann Machines, (2) Explainable Restricted Boltzmann Machines, and (3) Neural Autoregressive Distribution Estimator.

There are a couple of other papers worth mentioning that I haven’t had time to cover in detail:

  • Georgiev and Nakov used RBMs to jointly model both: (1) the correlations between a user’s voted items and (2) the correlations between the users who voted for a particular item, to improve the accuracy of the recommendation system.

  • Hu et al. used RBM in group-based recommendation systems to model group preferences by jointly modeling collective features and group profiles.

  • Truyen et al. used Boltzmann machines to extract both: (1) the relation between a rated item and its rating (thanks to the connections between the hidden layer and the softmax layer) and (2) the correlations between rated items (thanks to the connections between the softmax layer units).

  • Gunawardana and Meek used Boltzmann machines not only for modeling correlation between users and items but also for integrating content information. More specifically, the model parameters are tied with the content information.

Stay tuned for the next blog post of this series that explores the various types of evaluation metrics in the context of recommendation systems.

If you would like to follow my work on Recommendation Systems, Deep Learning, and Data Science Journalism, you can check out my Medium and GitHub, as well as other projects at https://jameskle.com/. You can also tweet at me on Twitter, email me directly, or find me on LinkedIn. Sign up for my newsletter to receive my latest thoughts on machine learning in research and production right at your inbox!

Originally published at https://towardsdatascience.com/recsys-series-part-7-the-3-variants-of-boltzmann-machines-for-collaborative-filtering-4c002af258f9
