A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0

Training discrete actor networks with TensorFlow 2.0 is easy once you know how to do it, but also rather different from implementations in TensorFlow 1.0. As the 2.0 version was only released in September 2019, most examples that circulate on the web are still designed for TensorFlow 1.0. In a related article — in which we also discuss the mathematics in more detail — we already treated the continuous case. Here, we use a simple multi-armed bandit problem to show how we can implement and update an actor network in the discrete setting [1].

A bit of mathematics

We use the classical policy gradient algorithm REINFORCE, in which the actor is represented by a neural network known as the actor network. In the discrete case, the network output is simply the probability of selecting each of the actions. So, if the set of actions is defined by A and an action by a ∈ A, then the network outputs are the probabilities p(a), ∀a ∈ A. The input layer contains the state s or a feature array ϕ(s), followed by one or more hidden layers that transform the input, with the output being the probabilities for each action that might be selected.

The policy π is parameterized by θ, which in deep reinforcement learning represents the neural network weights. After each action we take, we observe a reward v. Computing the gradients for θ and using learning rate α, the update rule typically encountered in textbooks looks as follows [2,3]:

θ ← θ + α · v · ∇_θ log π_θ(a|s)

When applying backpropagation updates to neural networks we must slightly modify this update rule, but the procedure follows the same lines. Although we might update the network weights manually, we typically prefer to let TensorFlow (or whatever library you use) handle the update. We only need to provide a loss function; the computer handles the calculation of gradients and other fancy tricks such as customized learning rates. In fact, the sole thing we have to do is add a minus sign, as we perform gradient descent rather than ascent. Thus, the loss function — known as the log loss or cross-entropy loss function [4] — looks like this:

L(θ) = −v · log π_θ(a|s)

TensorFlow 2.0 implementation

Now let's move on to the actual implementation. If you have some experience with TensorFlow, you likely first compile your network with model.compile and then perform model.fit or model.train_on_batch to fit the network to your data. As TensorFlow 2.0 requires a loss function to have exactly two arguments (y_true and y_pred), we cannot use these methods, since we need the action, state and reward as input arguments. The GradientTape functionality — which did not exist in TensorFlow 1.0 [5] — conveniently solves this problem. After storing a forward pass through the actor network on a 'tape', it can perform automatic differentiation in a backward pass later on.

We start by defining our cross-entropy loss function:
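A minimal sketch of what such a loss function can look like is given below; the function name and arguments are illustrative (the exact implementation is in the linked GitHub repository), and the small constant inside the logarithm is an assumption added to avoid log(0):

```python
import tensorflow as tf


def cross_entropy_loss(action_probabilities, action, reward):
    """Pseudo-loss for REINFORCE: minus the log probability of the chosen action, weighted by the reward."""
    # Probability that the network assigned to the action actually taken
    probability_action = action_probabilities[0, action]
    # Small constant inside the log avoids log(0) when a probability collapses
    log_probability = tf.math.log(probability_action + 1e-8)
    # Minus sign: we minimize the loss, i.e. perform gradient descent rather than ascent
    return -reward * log_probability
```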

In the next step, we use the .trainable_variables attribute to retrieve the network weights. Subsequently, tape.gradient calculates all the gradients for you by simply plugging in the loss value and the trainable variables. With optimizer.apply_gradients we update the network weights using a selected optimizer. As mentioned earlier, it is crucial that the forward pass (in which we obtain the action probabilities from the network) is included in the GradientTape. The code to update the weights is as follows:
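A sketch of such an update step is shown below, assuming the cross_entropy_loss function from the previous snippet; the actor_network, optimizer, state, action and reward arguments are placeholders for objects defined elsewhere:

```python
import tensorflow as tf


def update_actor(actor_network, optimizer, state, action, reward):
    """One REINFORCE update: forward pass on the tape, then compute and apply gradients."""
    with tf.GradientTape() as tape:
        # Forward pass must be recorded on the tape
        action_probabilities = actor_network(state)
        # Custom pseudo-loss defined earlier
        loss = cross_entropy_loss(action_probabilities, action, reward)
    # Gradients of the loss w.r.t. the trainable network weights
    gradients = tape.gradient(loss, actor_network.trainable_variables)
    # Apply the gradients with the selected optimizer
    optimizer.apply_gradients(zip(gradients, actor_network.trainable_variables))
```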

Multi-armed bandit

In the multi-armed bandit problem, we are able to play several slot machines with unique pay-off properties [6]. Each machine i has a mean payoff μ_i and a standard deviation σ_i, which are unknown to the player. At every decision moment you play one of the machines and observe the reward. After sufficient iterations and exploration, you should be able to fairly accurately estimate the mean reward of each machine. Naturally, the optimal policy is to always play the slot machine with the highest expected payoff.

Using Keras, we define a dense actor network. It takes a fixed state (a tensor with value 1) as input. There are two hidden layers with five ReLU units each. The network outputs the probabilities of playing each slot machine. The bias weights are initialized such that each machine has equal probability at the beginning. Finally, the chosen optimizer is Adam with its default learning rate of 0.001.
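The snippet below sketches one way to build such a network with Keras; the zero initialization of the output layer is an assumption used here to obtain equal initial probabilities, and num_machines is an illustrative name:

```python
import tensorflow as tf
from tensorflow.keras import layers

num_machines = 4  # number of slot machines, i.e. actions

# Dense actor network: fixed dummy state in, action probabilities out
actor_network = tf.keras.Sequential([
    layers.Dense(5, activation="relu", input_shape=(1,)),  # first hidden layer
    layers.Dense(5, activation="relu"),                    # second hidden layer
    layers.Dense(num_machines, activation="softmax",
                 kernel_initializer="zeros",
                 bias_initializer="zeros"),                # equal probabilities at the start
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Fixed state: a tensor with value 1
state = tf.constant([[1.0]])
```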

We test four settings with differing mean payoffs. For simplicity we set all standard deviations equal. The figures below show the learned probabilities for each slot machine, testing with four machines. As expected, the policy learns to play the machine(s) with the highest expected payoff. Some exploration naturally persists, especially when payoffs are close together. A bit of fine-tuning and you surely will do a lot better during your next Vegas trip.

[Figure: learned probabilities of playing each slot machine for the four payoff settings]

Key points

  • We define a pseudo-loss to update actor networks. For discrete control, the pseudo-loss function is simply the negative log probability multiplied by the reward signal, also known as the log loss or cross-entropy loss function.

  • Common TensorFlow 2.0 functions only accept loss functions with exactly two arguments. The GradientTape does not have this restriction.

  • Actor networks are updated in three steps: (i) define a custom loss function, (ii) compute the gradients for the trainable variables and (iii) apply the gradients to update the weights of the actor network (a combined sketch follows this list).

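As a usage example, the sketch below ties the three steps together for the bandit setting, reusing the actor_network, optimizer, state and update_actor objects from the earlier snippets; the payoff values are made up for illustration:

```python
import numpy as np

# Hypothetical payoff parameters; the agent does not know these values
mean_payoffs = [1.0, 1.5, 2.0, 3.0]
payoff_std = 1.0

for episode in range(10000):
    # Sample an action (slot machine) from the current policy
    probabilities = actor_network(state).numpy()[0].astype(np.float64)
    probabilities /= probabilities.sum()  # guard against floating-point rounding
    action = np.random.choice(num_machines, p=probabilities)

    # Observe a noisy reward from the chosen machine
    reward = float(np.random.normal(mean_payoffs[action], payoff_std))

    # (i) loss, (ii) gradients, (iii) weight update, all inside update_actor
    update_actor(actor_network, optimizer, state, action, reward)
```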

This article is partially based on my method paper: ‘Implementing Actor Networks for Discrete Control in TensorFlow 2.0’ [1]

The GitHub code (implemented using Python 3.8 and TensorFlow 2.3) can be found at: www.github.com/woutervanheeswijk/example_discrete_control

Translated from: https://towardsdatascience.com/a-minimal-working-example-for-discrete-policy-gradients-in-tensorflow-2-0-d6a0d6b1a6d7
