Reinforcement Learning with Code 【Code 1. Tabular Q-learning】

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundation of Reinforcement Learning.
The code follows Mofan's reinforcement learning course.

Contents

  • Reinforcement Learning with Code 【Code 1. Tabular Q-learning】
    • 1.1 Problem and result
    • 1.2 Environment
    • 1.3 Tabular Q-learning Algorithm
    • 1.4 Run this main
    • 1.5 Check the Q table
    • Reference

1.1 Problem and result

Consider the problem in which a little mouse (denoted by the red block) wants to avoid the traps (denoted by the black blocks) and get the cheese (denoted by the yellow circle), as the figure shows.

[Figure: a 4×4 grid maze — the red block is the mouse (agent), the black blocks are the traps, and the yellow circle is the cheese (goal).]

This chapter aims to implement the tabular Q-learning algorithm to solve this problem.

1.2 Environment

We use Python's tkinter package to build the environment that the agent interacts with.

import numpy as np
import time
import sys
import tkinter as tk
# if sys.version_info.major == 2:  # check whether the Python version is 2
#     import Tkinter as tk
# else:
#     import tkinter as tk

UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width


class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        # action space
        self.action_space = ['up', 'down', 'right', 'left']
        self.n_actions = len(self.action_space)
        # build the GUI
        self.title('Maze env')
        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))  # window size "width x height"
        self._build_maze()

    def _build_maze(self):
        # create the background canvas
        self.canvas = tk.Canvas(self, bg='white',
                                height=MAZE_H * UNIT,
                                width=MAZE_W * UNIT)
        # create grids
        for c in range(UNIT, MAZE_W * UNIT, UNIT):  # draw the column separators
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(UNIT, MAZE_H * UNIT, UNIT):  # draw the row separators
            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)

        # create origin: the center of the first grid
        origin = np.array([UNIT/2, UNIT/2])

        # hell1
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - (UNIT/2 - 5), hell1_center[1] - (UNIT/2 - 5),
            hell1_center[0] + (UNIT/2 - 5), hell1_center[1] + (UNIT/2 - 5),
            fill='black')
        # hell2
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - (UNIT/2 - 5), hell2_center[1] - (UNIT/2 - 5),
            hell2_center[0] + (UNIT/2 - 5), hell2_center[1] + (UNIT/2 - 5),
            fill='black')

        # create oval: the terminal circle (cheese)
        oval_center = origin + np.array([UNIT * 2, UNIT * 2])
        self.oval = self.canvas.create_oval(
            oval_center[0] - (UNIT/2 - 5), oval_center[1] - (UNIT/2 - 5),
            oval_center[0] + (UNIT/2 - 5), oval_center[1] + (UNIT/2 - 5),
            fill='yellow')

        # create red rect: the agent's red square, starting in the top-left grid
        self.rect = self.canvas.create_rectangle(
            origin[0] - (UNIT/2 - 5), origin[1] - (UNIT/2 - 5),
            origin[0] + (UNIT/2 - 5), origin[1] + (UNIT/2 - 5),
            fill='red')

        # pack all: display the canvas
        self.canvas.pack()

    def get_state(self, rect):
        # convert the coordinate observation to a state tuple,
        # using the uniformed center as the state, such as
        # |(1,1)|(2,1)|(3,1)|...
        # |(1,2)|(2,2)|(3,2)|...
        # |(1,3)|(2,3)|(3,3)|...
        # |...
        x0, y0, x1, y1 = self.canvas.coords(rect)
        x_center = (x0 + x1) / 2
        y_center = (y0 + y1) / 2
        state = ((x_center - (UNIT/2)) / UNIT + 1, (y_center - (UNIT/2)) / UNIT + 1)
        return state

    def reset(self):
        self.update()
        self.after(500)  # delay 500 ms
        self.canvas.delete(self.rect)  # delete the old rectangle
        origin = np.array([UNIT/2, UNIT/2])
        self.rect = self.canvas.create_rectangle(
            origin[0] - (UNIT/2 - 5), origin[1] - (UNIT/2 - 5),
            origin[0] + (UNIT/2 - 5), origin[1] + (UNIT/2 - 5),
            fill='red')
        # return observation
        return self.get_state(self.rect)

    def step(self, action):
        # one interaction between the agent and the environment
        s = self.get_state(self.rect)  # get the agent's coordinate
        base_action = np.array([0, 0])
        reach_boundary = False
        if action == self.action_space[0]:    # up
            if s[1] > 1:
                base_action[1] -= UNIT
            else:  # hit the boundary: reward = -1 and stay in place
                reach_boundary = True
        elif action == self.action_space[1]:  # down
            if s[1] < MAZE_H:
                base_action[1] += UNIT
            else:
                reach_boundary = True
        elif action == self.action_space[2]:  # right
            if s[0] < MAZE_W:
                base_action[0] += UNIT
            else:
                reach_boundary = True
        elif action == self.action_space[3]:  # left
            if s[0] > 1:
                base_action[0] -= UNIT
            else:
                reach_boundary = True

        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent

        s_ = self.get_state(self.rect)  # next state

        # reward function
        if s_ == self.get_state(self.oval):     # reach the terminal
            reward = 1
            done = True
            s_ = 'success'
        elif s_ == self.get_state(self.hell1):  # reach the block
            reward = -1
            s_ = 'block_1'
            done = False
        elif s_ == self.get_state(self.hell2):
            reward = -1
            s_ = 'block_2'
            done = False
        else:
            reward = 0
            done = False
        if reach_boundary:
            reward = -1

        return s_, reward, done

    def render(self):
        time.sleep(0.15)
        self.update()


if __name__ == '__main__':
    def test():
        for t in range(10):
            s = env.reset()
            print(s)
            while True:
                env.render()
                a = 'right'
                s, r, done = env.step(a)
                print(s)
                if done:
                    break

    env = Maze()
    env.after(100, test)  # call the function test after a 100 ms delay
    env.mainloop()

An important part of this environment is the design of the reward function, which is as follows:

\text{reward} = \left\{ \begin{aligned} & 1, && \text{if the agent reaches the cheese} \\ & -1, && \text{if the agent reaches a trap or the boundary} \\ & 0, && \text{otherwise} \end{aligned} \right.
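
The same rule, written as a standalone function, makes the design explicit. This is a minimal sketch: the state labels 'success', 'block_1', 'block_2' and the reach_boundary flag follow the step method of the environment above.

def reward_fn(next_state, reach_boundary):
    # mirror of the reward logic in Maze.step; the boundary penalty overrides the default case
    if reach_boundary:                        # the agent tried to leave the grid and stayed in place
        return -1
    if next_state == 'success':               # reached the cheese
        return 1
    if next_state in ('block_1', 'block_2'):  # stepped into a trap
        return -1
    return 0                                  # every other move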

We now explain the main functions of the class Maze.

  • First, the function _build_maze draws the initial maze layout.
    In this example the state of each block is derived from the center coordinate of the corresponding grid cell.
  • Second, the function get_state converts the pixel coordinates of a grid cell into a numerical state representation such as $(1,1), (1,2), \cdots$
  • Third, the function reset renews the state, which means placing the mouse back in the starting grid.
  • Then, the function step lets the agent interact with the environment for one step and returns the reward after the action (see the usage sketch after this list).
  • Finally, the function render controls how the window is updated.
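
Putting the pieces together, a minimal interaction loop looks like the sketch below, assuming the Maze class above is saved as maze_env_custom.py (the module name used again in section 1.4):

import random
from maze_env_custom import Maze  # assumption: the Maze class above lives in this module

def run_one_episode():
    s = env.reset()                          # state tuple such as (1.0, 1.0)
    while True:
        env.render()
        a = random.choice(env.action_space)  # 'up' / 'down' / 'right' / 'left'
        s, r, done = env.step(a)
        print(s, r, done)
        if done:                             # only reaching the cheese ends the episode
            break
    env.destroy()

env = Maze()
env.after(100, run_one_episode)  # schedule on the tkinter event loop, as in the test above
env.mainloop()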

1.3 Tabular Q-learning Algorithm

import numpy as np
import pandas as pd


class QLearningTable():
    def __init__(self, actions, learning_rate=0.05, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # action list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy  # epsilon-greedy update policy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append the new state to the q table, using the coordinate as the observation
            # (DataFrame.append is no longer valid in recent pandas, so use pd.concat)
            self.q_table = pd.concat([
                self.q_table,
                pd.DataFrame(
                    data=np.zeros((1, len(self.actions))),
                    columns=self.q_table.columns,
                    index=[state])
            ])

    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection: epsilon-greedy algorithm
        if np.random.uniform() < self.epsilon:
            # choose the best action
            state_action = self.q_table.loc[observation, :]
            # some actions may have the same value; randomly choose one of them
            # (state_action == np.max(state_action) generates a bool mask)
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # choose a random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'success':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update

We store the Q-table as a pandas DataFrame. The explanations of the functions are as follows.

  • First, the function check_state_exist checks whether a state already exists in the Q-table; if not, we append it. This way a state is added to the table only once it has actually been visited.
  • Second, the function choose_action follows the $\epsilon$-greedy algorithm

\pi(a|s) = \left\{ \begin{aligned} & 1 - \frac{\epsilon}{|\mathcal{A}(s)|}\big(|\mathcal{A}(s)|-1\big), && \text{for the greedy action} \\ & \frac{\epsilon}{|\mathcal{A}(s)|}, && \text{for the other } |\mathcal{A}(s)|-1 \text{ actions} \end{aligned} \right.
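
Note that e_greedy = 0.9 in QLearningTable is the probability of acting greedily, so it plays the role of $1-\epsilon$ in this formula (i.e., $\epsilon = 0.1$). A direct translation of the distribution into code could look like the sketch below; epsilon_greedy is a hypothetical helper, not part of the class above.

import numpy as np

def epsilon_greedy(q_values, epsilon):
    # sample an action index from pi(a|s) above;
    # epsilon here is the exploration parameter of the formula, i.e. 1 - e_greedy
    n = len(q_values)
    probs = np.full(n, epsilon / n)            # epsilon/|A(s)| for every action
    greedy = int(np.argmax(q_values))
    probs[greedy] = 1 - epsilon / n * (n - 1)  # 1 - (epsilon/|A(s)|)(|A(s)|-1)
    return np.random.choice(n, p=probs)        # probs sums to 1 by construction

# e.g. epsilon_greedy(np.array([0.0, 0.3, 0.1, 0.0]), epsilon=0.1)
# returns index 1 with probability 0.925 and each other index with probability 0.025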

  • Third, the function learn updates the q-value as the Q-learning algorithm prescribes:

\text{Q-learning}: \left\{ \begin{aligned} q_{t+1}(s_t,a_t) &= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[ q_t(s_t,a_t) - \big( r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1},a) \big) \Big] \\ q_{t+1}(s,a) &= q_t(s,a), \quad \text{for all } (s,a) \neq (s_t,a_t) \end{aligned} \right.
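
To see the update rule in isolation, here is a one-line implementation with a hand-computed transition (a sketch; $\alpha = 0.05$ and $\gamma = 0.9$ match learning_rate and reward_decay in QLearningTable):

def q_update(q_sa, reward, max_q_next, alpha=0.05, gamma=0.9):
    # one tabular Q-learning step, written exactly in the form of the equation above
    return q_sa - alpha * (q_sa - (reward + gamma * max_q_next))

# the agent moves onto the cheese from a state with q(s,a) = 0 and receives r = +1:
print(q_update(q_sa=0.0, reward=1.0, max_q_next=0.0))  # 0.05 = 0 - 0.05 * (0 - 1)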

1.4 Run this main

Run this main script to put all the code together.

from maze_env_custom import Maze
from RL_brain import QLearningTable

MAX_EPISODE = 30


def update():
    for episode in range(MAX_EPISODE):
        # initial observation: the state tuple returned by get_state, e.g. (1.0, 1.0)
        observation = env.reset()

        while True:
            # fresh env
            env.render()

            # RL chooses an action based on the observation ['up', 'down', 'right', 'left']
            action = RL.choose_action(str(observation))

            # RL takes the action and gets the next observation and reward
            observation_, reward, done = env.step(action)

            # RL learns from this transition
            RL.learn(str(observation), action, reward, str(observation_))

            # swap observation
            observation = observation_

            # break the while loop at the end of this episode
            if done:
                break

        # show q_table
        print(RL.q_table)
        print('\n')

    # end of game
    print('game over')
    env.destroy()


if __name__ == "__main__":
    env = Maze()
    RL = QLearningTable(env.action_space)
    env.after(100, update)
    env.mainloop()

1.5 Check the Q table

After a long run we can check the Q-table to judge whether the learning is reasonable. The Q-table is as follows:

                  up      down     right          left
(1.0, 1.0) -0.226208  0.000963  0.000000 -9.750000e-02
(1.0, 2.0)  0.000024  0.005773  0.000000 -5.000000e-02
(2.0, 1.0) -0.050000  0.000000  0.000000  5.247904e-07
(2.0, 2.0)  0.000000 -0.050000 -0.050000  0.000000e+00
block_2     0.000000  0.000000  0.000000  1.793534e-04
(2.0, 4.0) -0.097500 -0.050000  0.336315  2.916072e-03
(1.0, 4.0)  0.002162 -0.140781  0.112337 -5.000000e-02
(1.0, 3.0)  0.000008  0.033479 -0.050000 -9.739821e-02
block_1     0.000000  0.097500  0.000000  0.000000e+00
(4.0, 2.0)  0.000000  0.006525 -0.050000 -5.000000e-02
success     0.000000  0.000000  0.000000  0.000000e+00
(3.0, 1.0) -0.050000 -0.047750  0.000000  0.000000e+00
(3.0, 4.0)  0.722610 -0.050000  0.000000  1.298347e-02
(4.0, 1.0) -0.050000  0.000101 -0.050000  0.000000e+00
(4.0, 3.0)  0.000000  0.000000  0.000000  1.426250e-01

For example, at the starting cell (1.0, 1.0), moving up or left makes the mouse hit the boundary and receive reward $-1$; hence the corresponding action values in the Q-table are negative.
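
To read the learned policy directly off the table, one can take the greedy action per row. This is a sketch run after training, assuming the RL object from the main script in section 1.4:

greedy_policy = RL.q_table.idxmax(axis=1)  # the column (action) with the largest q-value per state
print(greedy_policy)
RL.q_table.to_csv('q_table.csv')           # optionally persist the table for later inspection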


Reference

Zhao Shiyu's course: Mathematical Foundation of Reinforcement Learning
Mofan's Reinforcement Learning course
