Contents
- I. Structure and Basic Principles of ADP
- 1. Basic Structure of ADP
- 2. Basic Principles of ADP
- 2.1 Critic Network
- 2.2 Actor Network
- II. Critic-Actor (Actor-Critic) Network Design and Updates
- 1. Critic Network Design
- 2. Actor Network Design
- III. Implementation Example Based on the MATLAB Neural Network Toolbox
Adaptive Dynamic Programming (ADP) approximates the true solution of the dynamic programming problem through successive iterations, thereby converging toward the optimal control solution for nonlinear systems.
I. Structure and Basic Principles of ADP
1. Basic Structure of ADP
Consider a discrete-time nonlinear dynamic system:
$$
x(k+1) = f[x(k), u(k), k], \qquad k = 0, 1, \ldots \tag{1}
$$

where $x \in \mathbb{R}^n$ is the system state vector, $u \in \mathbb{R}^m$ is the control action, and $f$ is the system function. The performance-index (cost) function associated with this system at time $k$ is usually taken to be the quadratic cost:

$$
J[x(k), k] = \sum_{i=k}^{\infty} \gamma^{\,i-k} \bigl( x(i)^T Q x(i) + u(i)^T R u(i) \bigr) \tag{2}
$$

where $Q \in \mathbb{R}^{n \times n}$ is a positive-definite state weighting matrix, $R \in \mathbb{R}^{m \times m}$ is a positive-definite control weighting matrix, and $\gamma$ with $0 < \gamma \le 1$ is the discount factor, which places more weight on near-term cost. The goal of dynamic programming is to choose a control sequence $u(i),\ i = k, k+1, \ldots$ that minimizes the cost $J$ defined in Eq. (2).
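To make Eq. (2) concrete, here is a minimal MATLAB sketch that evaluates the one-step utility and a truncated discounted cost. The weights, discount factor, horizon, and placeholder policy are illustrative assumptions; the dynamics are borrowed from the example in Section III.

```matlab
% One-step utility U(k) and a truncated evaluation of the discounted
% cost J in Eq. (2). All numeric values here are illustrative.
Q = eye(2);  R = 0.5;                  % example weights (scalar R for m = 1)
gamma = 0.95;                          % discount factor, 0 < gamma <= 1
U = @(x, u) x' * Q * x + u' * R * u;   % utility at a single time step

N = 100;  J = 0;  x = [1; -1];         % truncation horizon and initial state
for i = 0:N-1
    u = 0;                             % placeholder policy; see Section II
    J = J + gamma^i * U(x, u);         % accumulate gamma^(i-k) * U(i)
    x = [0 0.1; 0.3 -1] * x + [0; 0.5] * u;   % dynamics from Section III
end
```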
The basic structure of ADP is shown in Figure 1 (the dashed lines indicate the paths along which the networks are updated):

2. Basic Principles of ADP
2.1 Critic Network
The output $\hat J$ of the critic network is an estimate of the function $J$ given by Eq. (2). This estimate is obtained by minimizing, over time, the following error:
$$
\|E_c\| = \sum_k E_c(k) = \frac{1}{2} \sum_k \bigl[ \hat J(k) - U(k) - \gamma \hat J(k+1) \bigr]^2 \tag{3}
$$

where $\hat J(k) = \hat J[x(k), u(k), k, W_c]$ and $W_c$ denotes the parameters of the critic network. The function $U(k) = x(k)^T Q x(k) + u(k)^T R u(k)$ is exactly the utility appearing in Eq. (2); note that $U(k)$ is the utility at the single time step $k$, not the accumulation from $k$ to infinity. When $E_c(k) = 0$ for all $k$, Eq. (3) implies

$$
\begin{aligned}
\hat J(k) &= U(k) + \gamma \hat J(k+1) \\
&= U(k) + \gamma \bigl[ U(k+1) + \gamma \hat J(k+2) \bigr] \\
&= \cdots \\
&= \sum_{i=k}^{\infty} \gamma^{\,i-k} U(i)
\end{aligned} \tag{4}
$$

which is exactly the cost function defined in Eq. (2). Therefore, minimizing the error function defined by Eq. (3) yields a trained neural network whose output $\hat J$ is an estimate of the cost function $J$ defined in Eq. (2).
2.2 Actor Network
The actor network produces the control signal $u(k) = u[x(k), k, W_a]$, where $W_a$ denotes the parameters of the actor network, and is trained with the objective of minimizing $\hat J(k)$. In other words, by training the actor to minimize the critic's output, we obtain a network that produces an optimal or near-optimal control signal.
II. Critic-Actor (Actor-Critic) Network Design and Updates
1. Critic Network Design
The critic network takes the current system state as input and outputs the cost value. Accordingly, for a system with an n-dimensional state, the critic has n input neurons, p hidden neurons, and 1 output neuron. The n inputs are the n components of the state vector, and the output is the estimate of the optimal performance index corresponding to the input state. The hidden layer of the critic uses a bipolar sigmoidal activation (other activation functions could also be used), and the output layer uses the linear function purelin. The critic network structure is shown in Figure 2.
Training the critic network consists of a forward computation and a backward error-propagation pass. The forward computation is:

$$
c_{h1j}(k) = \sum_{i=1}^{n} \hat x_i(k)\, W_{c1ij}(k), \qquad j = 1, 2, \ldots, p \tag{5}
$$

$$
c_{h2j}(k) = \frac{1 - e^{-c_{h1j}(k)}}{1 + e^{-c_{h1j}(k)}}, \qquad j = 1, 2, \ldots, p \tag{6}
$$

$$
\hat J(k) = \sum_{j=1}^{p} c_{h2j}(k)\, W_{c2j}(k) \tag{7}
$$

where $c_{h1j}(k)$ is the input of the j-th hidden node of the critic network and $c_{h2j}(k)$ is its output. A minimal MATLAB sketch of this forward pass is given below.
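The variable names (Wc1, Wc2, ch1, ch2) and layer sizes in this sketch are illustrative assumptions of mine; the bipolar sigmoid in Eq. (6) is equivalent to tanh(z/2).

```matlab
% Critic forward pass, Eqs. (5)-(7): n inputs -> p hidden (bipolar
% sigmoid) -> 1 linear (purelin) output.
n = 2;  p = 8;                      % layer sizes (illustrative)
Wc1 = 0.1 * randn(n, p);            % input-to-hidden weights W_c1(i,j)
Wc2 = 0.1 * randn(p, 1);            % hidden-to-output weights W_c2(j)
x = [1; -1];                        % state x(k)

ch1 = Wc1' * x;                                % Eq. (5): hidden inputs
ch2 = (1 - exp(-ch1)) ./ (1 + exp(-ch1));      % Eq. (6): bipolar sigmoid
Jhat = Wc2' * ch2;                             % Eq. (7): critic output
```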
The critic network is likewise trained by gradient descent, minimizing the error defined by:

$$
\|E_c\| = \sum_k E_c(k) = \frac{1}{2} \sum_k e_c^2(k) \tag{8}
$$

$$
e_c(k) = \hat J(k) - U(k) - \gamma \hat J(k+1) \tag{9}
$$

The critic weight updates are derived as follows:
① $W_{c2}$ (hidden-to-output weight matrix):
$$
\begin{aligned}
\Delta W_{c2j}(k) &= l_c(k) \left[ -\frac{\partial E_c(k)}{\partial W_{c2j}(k)} \right] \\
&= l_c(k) \left[ -\frac{\partial E_c(k)}{\partial \hat J(k)} \frac{\partial \hat J(k)}{\partial W_{c2j}(k)} \right] \\
&= -l_c(k)\, e_c(k)\, c_{h2j}(k)
\end{aligned} \tag{10}
$$

$$
\Delta W_{c2}(k) = -l_c(k)\, e_c(k)\, c_{h2}^T(k) \tag{11}
$$

$$
W_{c2}(k+1) = W_{c2}(k) + \Delta W_{c2}(k) \tag{12}
$$
② $W_{c1}$ (input-to-hidden weight matrix):
$$
\begin{aligned}
\Delta W_{c1ij}(k) &= l_c(k) \left[ -\frac{\partial E_c(k)}{\partial W_{c1ij}(k)} \right] \\
&= l_c(k) \left[ -\frac{\partial E_c(k)}{\partial \hat J(k)} \frac{\partial \hat J(k)}{\partial c_{h2j}(k)} \frac{\partial c_{h2j}(k)}{\partial c_{h1j}(k)} \frac{\partial c_{h1j}(k)}{\partial W_{c1ij}(k)} \right] \\
&= -l_c(k)\, e_c(k)\, W_{c2j}(k)\, \frac{1}{2} \bigl[ 1 - c_{h2j}^2(k) \bigr]\, \hat x_i(k)
\end{aligned} \tag{13}
$$

$$
\Delta W_{c1}(k) = -\frac{1}{2}\, l_c(k)\, e_c(k)\, \hat x^T(k) \times \left\{ W_{c2}^T(k) \otimes \bigl[ 1 - c_{h2}(k) \otimes c_{h2}(k) \bigr] \right\} \tag{14}
$$

where $\otimes$ denotes the element-wise (Hadamard) product, consistent with the element form in Eq. (13).

$$
W_{c1}(k+1) = W_{c1}(k) + \Delta W_{c1}(k) \tag{15}
$$
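Continuing the forward-pass sketch above, the updates in Eqs. (8)-(15) translate almost line for line into MATLAB. The next-state evaluation, utility weights, and placeholder control below are assumptions for illustration:

```matlab
% Critic update, Eqs. (8)-(15), continuing the forward-pass sketch above.
lc = 0.02;  gamma = 0.95;            % learning rate l_c(k) and discount
Q = eye(n);  R = 0.5;  u = 0;        % utility weights, placeholder control

xn = [0 0.1; 0.3 -1] * x + [0; 0.5] * u;       % next state x(k+1)
ch2n = (1 - exp(-Wc1' * xn)) ./ (1 + exp(-Wc1' * xn));
Jhat_next = Wc2' * ch2n;                        % critic estimate at x(k+1)

Uk = x' * Q * x + u' * R * u;                   % one-step utility U(k)
ec = Jhat - Uk - gamma * Jhat_next;             % Eq. (9)

dWc2 = -lc * ec * ch2;                          % Eqs. (10)-(11)
dWc1 = -0.5 * lc * ec * x * (Wc2 .* (1 - ch2.^2))';   % Eqs. (13)-(14)
Wc2 = Wc2 + dWc2;                               % Eq. (12)
Wc1 = Wc1 + dWc1;                               % Eq. (15)
```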
2. Actor Network Design
The actor network has n input neurons, q hidden neurons, and m output neurons. The n inputs are the n components of the state vector $x(k)$ at time k, and the m outputs are the m components of the control vector $u(k)$ corresponding to the input state $x(k)$. The hidden layer of the actor network uses a bipolar sigmoidal activation, and the output layer uses the linear function purelin. The actor network structure is shown in Figure 3:

Training the actor network likewise consists of a forward computation and a backward error-propagation pass. The forward computation is:
$$
a_{h1j}(k) = \sum_{i=1}^{n} \hat x_i(k)\, W_{a1ij}(k), \qquad j = 1, 2, \ldots, q \tag{16}
$$

$$
a_{h2j}(k) = \frac{1 - e^{-a_{h1j}(k)}}{1 + e^{-a_{h1j}(k)}}, \qquad j = 1, 2, \ldots, q \tag{17}
$$

$$
u_j(k) = \sum_{i=1}^{q} a_{h2i}(k)\, W_{a2ij}(k), \qquad j = 1, 2, \ldots, m \tag{18}
$$

where $a_{h1j}(k)$ is the input of the j-th hidden node of the actor network and $a_{h2j}(k)$ is its output. The actor network is trained, again by gradient descent, with the objective of minimizing $\hat J(k)$; a sketch of the forward pass is given below.
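As with the critic, the names (Wa1, Wa2, ah1, ah2) and layer sizes below are illustrative assumptions:

```matlab
% Actor forward pass, Eqs. (16)-(18): n inputs -> q hidden (bipolar
% sigmoid) -> m linear (purelin) outputs.
n = 2;  q = 8;  m = 1;              % layer sizes (illustrative)
Wa1 = 0.1 * randn(n, q);            % input-to-hidden weights W_a1(i,j)
Wa2 = 0.1 * randn(q, m);            % hidden-to-output weights W_a2(i,j)
x = [1; -1];                        % state x(k)

ah1 = Wa1' * x;                               % Eq. (16): hidden inputs
ah2 = (1 - exp(-ah1)) ./ (1 + exp(-ah1));     % Eq. (17): bipolar sigmoid
u = Wa2' * ah2;                               % Eq. (18): control u(k)
```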
The gradient-descent update is

$$
\Delta W_a = l_a(k) \left[ -\frac{\partial \hat J(k)}{\partial W_a(k)} \right] = -l_a(k)\, \frac{\partial \hat J(k)}{\partial u(k)} \frac{\partial u(k)}{\partial W_a(k)} \tag{19}
$$

$$
\frac{\partial \hat J(k)}{\partial u(k)} = \frac{\partial U(k)}{\partial u(k)} + \gamma\, \frac{\partial \hat J(k+1)}{\partial u(k)} \tag{20}
$$

Here the value of $\partial U(k)/\partial u(k)$ depends on how the utility function is defined, which in turn depends on the environment being controlled. If the utility is defined as the quadratic form
$$
U(k) = x^T(k) A x(k) + u^T(k) B u(k) \tag{21}
$$

where A and B are the n-dimensional and m-dimensional identity matrices, respectively, then $\partial U(k)/\partial u(k) = 2u(k)$, and hence:
$$
\begin{aligned}
\frac{\partial \hat J(k)}{\partial u(k)} &= \frac{\partial U(k)}{\partial u(k)} + \gamma\, \frac{\partial \hat J(k+1)}{\partial u(k)} \\
&= 2u(k) + \gamma\, \frac{\partial \hat J(k+1)}{\partial u(k)}
\end{aligned} \tag{22}
$$
The full derivation of the actor weight updates runs to several pages and is omitted here; a hedged sketch of the resulting update is given below.
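Since the derivation is omitted in the source, the following is only a sketch of one common way to realize Eqs. (19)-(22): the term $\partial \hat J(k+1)/\partial u(k)$ is obtained by backpropagating through the critic at $x(k+1)$ and through the known input matrix B of the Section III example. It continues the sketches above, and uses $2Ru(k)$ in place of $2u(k)$ to account for $R = 0.5I$.

```matlab
% Actor update sketch for Eqs. (19)-(22); continues the sketches above.
la = 0.02;                           % actor learning rate l_a(k)
B = [0; 0.5];                        % input matrix of the Section III example

% dJhat(k+1)/du(k) = (dx(k+1)/du(k))' * dJhat(k+1)/dx(k+1) = B' * dJdx
dJdx = Wc1 * (Wc2 .* 0.5 .* (1 - ch2n.^2));   % critic gradient at x(k+1)
dJdu = 2 * R * u + gamma * (B' * dJdx);       % Eq. (22), with dU/du = 2Ru

dWa2 = -la * ah2 * dJdu';                     % Eq. (19) applied to W_a2
dWa1 = -la * x * ((Wa2 * dJdu) .* 0.5 .* (1 - ah2.^2))';  % and to W_a1
Wa2 = Wa2 + dWa2;
Wa1 = Wa1 + dWa1;
```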
III. Implementation Example Based on the MATLAB Neural Network Toolbox
Example: consider the following discrete-time linear system:
$$
x_{k+1} = A x_k + B u_k \tag{23}
$$

where $x_k = [x_{1k}, x_{2k}]^T$ and $u \in \mathbb{R}^1$, with

$$
A = \begin{bmatrix} 0 & 0.1 \\ 0.3 & -1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix},
$$

and initial state $x_0 = [1, -1]^T$. The cost index is that of Eq. (2), with utility $U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$, where $Q = I$, $R = 0.5I$, and $I$ is the identity matrix.
Both policy iteration and value iteration were implemented with neural networks. In this example the critic and actor are three-layer BP networks, each with a 2-8-1 structure. At every iteration step, the critic and actor are trained for 80 steps with learning rate α = 0.02, until the network training error falls below 10⁻⁵. A self-contained sketch of such a training loop is given below.
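This is a minimal value-iteration-style sketch under stated assumptions, not the author's released code: the layer sizes, learning rate, inner-loop count, and 10⁻⁵ stopping threshold follow the text, while γ, the outer horizon, and the weight initialization are my choices.

```matlab
% End-to-end sketch of the Section III example (variable names are mine).
A = [0 0.1; 0.3 -1];  B = [0; 0.5];           % system matrices
Q = eye(2);  R = 0.5;  gamma = 0.95;          % cost weights; gamma assumed
lc = 0.02;  la = 0.02;                        % learning rate alpha = 0.02
Wc1 = 0.1*randn(2,8);  Wc2 = 0.1*randn(8,1);  % 2-8-1 critic
Wa1 = 0.1*randn(2,8);  Wa2 = 0.1*randn(8,1);  % 2-8-1 actor
sig = @(z) (1 - exp(-z)) ./ (1 + exp(-z));    % bipolar sigmoid

x = [1; -1];                                  % initial state x0
for k = 1:100                                 % roll the system forward (horizon assumed)
    for it = 1:80                             % up to 80 training steps per k
        ah2 = sig(Wa1' * x);   u  = Wa2' * ah2;      % actor forward, Eqs. (16)-(18)
        ch2 = sig(Wc1' * x);   Jh = Wc2' * ch2;      % critic at x(k), Eqs. (5)-(7)
        xn   = A*x + B*u;
        ch2n = sig(Wc1' * xn); Jn = Wc2' * ch2n;     % critic at x(k+1)

        ec   = Jh - (x'*Q*x + u'*R*u) - gamma*Jn;    % Eq. (9)
        dWc2 = -lc * ec * ch2;                       % Eqs. (10)-(12)
        dWc1 = -0.5*lc*ec * x * (Wc2 .* (1 - ch2.^2))';   % Eqs. (13)-(15)
        Wc2  = Wc2 + dWc2;   Wc1 = Wc1 + dWc1;

        dJdx = Wc1 * (Wc2 .* 0.5 .* (1 - ch2n.^2));  % critic gradient at x(k+1)
        dJdu = 2*R*u + gamma * (B' * dJdx);          % Eq. (22) with R = 0.5I
        Wa2  = Wa2 - la * ah2 * dJdu';               % actor descent on Jhat(k)
        Wa1  = Wa1 - la * x * ((Wa2 * dJdu) .* 0.5 .* (1 - ah2.^2))';

        if abs(ec) < 1e-5, break; end                % the text's 1e-5 stop criterion
    end
    x = A*x + B*u;                                   % apply the trained control
end
```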


This concludes the mathematical derivations and the worked example for adaptive dynamic programming.