Introduction to Policy Gradient Methods: The Deterministic Policy Gradient Theorem

Introduction

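The previous section introduced the actor-critic (AC) method. Its core idea is to combine policy-based and value-based methods, so that the policy can be improved immediately after a single state transition. This section stays within the AC framework and introduces the deterministic policy gradient theorem.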

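Review: the policy gradient theorem

The article "Introduction to Policy Gradient Methods: Monte Carlo Policy Gradient (REINFORCE)" gave the expectation form of the policy gradient theorem:
$\nabla \mathcal J(\theta) = \mathbb E_{S_t \sim \rho^{\pi_{\theta}};A_t \sim \pi_{\theta}}[\nabla \log \pi(A_t \mid S_t;\theta)q_{\pi_{\theta}}(S_t,A_t)]$
Here the state $S_t$ at time $t$ follows the state distribution $\rho^{\pi_{\theta}}$, and the action $A_t$ at time $t$ is sampled from the policy $\pi(\cdot \mid S_t;\theta)$ evaluated at the state $S_t$.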

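Deterministic policy gradient

Form of the deterministic policy gradient

Where there is a deterministic policy gradient, there is naturally also a stochastic policy gradient. The derivation given for the policy gradient theorem covered the stochastic case. The main difference between the two is whether the action $A_t$ at time $t$ is produced by a deterministic policy or sampled from a stochastic one.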

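The deterministic policy was introduced in the earliest article, on the Markov reward process (MRP): in a given state the agent can execute only one fixed action. Viewed as a conditional probability, such a policy assigns probability 1 to that action and 0 to every other action, so $\pi(A_t \mid S_t;\theta)$ is a constant with respect to $\theta$. A constant has zero gradient, so $\nabla \pi(A_t \mid S_t;\theta) = 0$, which carries no information for updating $\theta$.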

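To keep taking gradients with respect to the parameter $\theta$, we introduce a separate symbol for the deterministic policy: $\mu(S_t;\theta)$, abbreviated $\mu_{\theta}$.

$\mu(S_t;\theta)$ is a function of the state $S_t$ and the parameter $\theta$, not a conditional probability;
$\mu(S_t;\theta)$ directly outputs the action, that is, $A_t = \mu(S_t;\theta)$.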
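
As a minimal sketch of the two points above, the snippet below writes a deterministic policy as an ordinary function of the state and the parameters. The linear form and the names `mu` and `grad_mu` are assumptions made for illustration; the article does not prescribe any particular parameterization.

```python
def mu(s, theta):
    """Hypothetical linear deterministic policy: maps a scalar state to one action."""
    w, b = theta
    return w * s + b


def grad_mu(s, theta):
    """Gradient of mu with respect to theta = (w, b), derived by hand for the linear form."""
    return (s, 1.0)


theta = (0.5, -1.0)
state = 2.0

# Calling the policy twice in the same state returns the same action:
# there is no sampling step, unlike a stochastic policy pi(a | s; theta).
action_first = mu(state, theta)
action_second = mu(state, theta)
```

Unlike a probability that is stuck at 0 or 1, the function `mu` has a well-defined, generally non-zero gradient with respect to `theta`, which is what the rest of the derivation relies on.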

$\mathcal J(\theta)$ still denotes the expected return $G_0$ from the initial state under $\mu_{\theta}$, that is, $\mathbb E_{\mu_{\theta}}[G_0]$. The deterministic policy gradient $\nabla \mathcal J(\theta)$ is therefore:
$\nabla \mathcal J(\theta) = \nabla \mathbb E_{\mu_{\theta}}[G_0] = \mathbb E \left[\sum_{t=0}^{+\infty}\gamma^t \nabla \mu(S_t;\theta)[\nabla_{a}q_{\mu_{\theta}}(S_t,a)]|_{a=\mu(S_t;\theta)} \right]$
A more general form is:
$\nabla \mathcal J(\theta) = \mathbb E_{S_t \sim \rho^{\mu_{\theta}}} \left[\nabla \mu(S_t;\theta) \nabla_{a}q_{\mu_{\theta}}(S_t,a) |_{a=\mu(S_t;\theta)}\right]$

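Compared with the expectation form of the policy gradient theorem in the review above, there are three main differences:

The expectation is taken only over the state distribution $S_t \sim \rho^{\mu_{\theta}}$, with no distribution over actions;
Inside the expectation, the gradient is taken not only with respect to $\theta$ in $\mu(S_t;\theta)$ but also with respect to the action $a$ in the state-action value function $q_{\mu_{\theta}}(S_t,a)$;
There is no $\log$ term.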
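
The product of the two gradients is the chain rule applied to $q(s,\mu(s;\theta))$. The sketch below checks this numerically under a strong simplifying assumption: a one-step problem in which $q$ is known in closed form and does not itself depend on $\theta$. The quadratic `q`, the linear `mu`, and the sample states are hypothetical choices, not taken from the article.

```python
def mu(s, theta):
    # Hypothetical deterministic policy: the action is proportional to the state.
    return theta * s


def q(s, a):
    # Hypothetical one-step action value: the best action is a = 2 * s.
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    # Derivative of q with respect to the action, derived by hand.
    return -2.0 * (a - 2.0 * s)


def dmu_dtheta(s, theta):
    # Derivative of mu with respect to theta, derived by hand.
    return s


def objective(theta, states):
    # J(theta): average of q(s, mu(s; theta)) over the given states.
    return sum(q(s, mu(s, theta)) for s in states) / len(states)


def dpg_gradient(theta, states):
    # Average of (d mu / d theta) * (d q / d a), evaluated at a = mu(s; theta).
    terms = [dmu_dtheta(s, theta) * dq_da(s, mu(s, theta)) for s in states]
    return sum(terms) / len(states)


states = [-1.0, 0.5, 2.0]  # stand-in for samples from the state distribution
theta = 0.3
eps = 1e-6

analytic = dpg_gradient(theta, states)
# Central finite difference of the objective, used only as a reference value.
numeric = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
```

The two values agree up to floating-point error. No action is sampled and no $\log$ appears anywhere in the computation.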

With these differences in mind, we now derive the deterministic policy gradient.

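Derivation of the deterministic policy gradient

The derivation closely parallels that of the policy gradient theorem, and the two are worth reading side by side.
Once the policy function $\pi(a \mid s;\theta)$ becomes the deterministic policy $\mu(s;\theta)$, the most important change is that the state value function $V_{\mu_{\theta}}(s)$ equals the state-action value function $q_{\mu_{\theta}}(s,\mu(s;\theta))$, because the action is uniquely determined by the state:
$V_{\mu_{\theta}}(s) = q_{\mu_{\theta}}(s,\mu(s;\theta)), s \in \mathcal S$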
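
One way to see this, written for a discrete action set as a simplifying assumption: the usual relation between the two value functions averages over actions, and a deterministic policy puts all of its probability on $a = \mu(s;\theta)$, so the sum collapses to a single term:
$V_{\mu_{\theta}}(s) = \sum_{a}\pi(a \mid s;\theta)\,q_{\mu_{\theta}}(s,a) = 1 \cdot q_{\mu_{\theta}}(s,\mu(s;\theta)) + \sum_{a \neq \mu(s;\theta)} 0 \cdot q_{\mu_{\theta}}(s,a) = q_{\mu_{\theta}}(s,\mu(s;\theta))$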

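By the Bellman expectation equation, $q_{\mu_{\theta}}(s,a)$ evaluated at $a = \mu(s;\theta)$ expands as follows, where $r(s,\mu(s;\theta))$ is called the reward function:
$q_{\mu_{\theta}}(s,\mu(s;\theta)) = r(s,\mu(s;\theta)) + \gamma \sum_{s',r}P(s',r \mid s,\mu(s;\theta))V_{\mu_{\theta}}(s'), s \in \mathcal S$
Simplifying further, the variable $r$ in the second term can be eliminated by summing (or integrating) the joint probability over $r$, which gives:
$q_{\mu_{\theta}}(s,\mu(s;\theta)) = r(s,\mu(s;\theta)) + \gamma \sum_{s'}P(s' \mid s,\mu(s;\theta))V_{\mu_{\theta}}(s'), s \in \mathcal S$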
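
The elimination of $r$ is the usual marginalization of a joint distribution. Because $V_{\mu_{\theta}}(s')$ does not depend on $r$, it can be pulled out of the inner sum (written here for discrete rewards; an integral would replace the sum over $r$ in the continuous case):
$\sum_{s',r}P(s',r \mid s,a)V_{\mu_{\theta}}(s') = \sum_{s'}V_{\mu_{\theta}}(s')\sum_{r}P(s',r \mid s,a) = \sum_{s'}P(s' \mid s,a)V_{\mu_{\theta}}(s')$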

Take the gradient of both $V_{\mu_{\theta}}(s)$ and $q_{\mu_{\theta}}(s,\mu(s;\theta))$ with respect to $\theta$:
$\nabla V_{\mu_{\theta}}(s) = \nabla q_{\mu_{\theta}}(s,\mu(s;\theta))$
Because the gradient is with respect to $\theta$, and $\theta$ enters through $\mu(s;\theta)$ as well as through $V_{\mu_{\theta}}(s')$, both the chain rule and the product rule are needed:
$\nabla q_{\mu_{\theta}}(s,\mu(s;\theta)) = \nabla_{a} r(s,a)|_{a=\mu(s;\theta)}\cdot\nabla\mu(s;\theta) + \gamma \sum_{s'}\left\{\nabla_{a} P(s' \mid s,a)|_{a = \mu(s;\theta)} \cdot \nabla \mu(s;\theta) \cdot V_{\mu_{\theta}}(s') + P(s' \mid s,\mu(s;\theta)) \nabla V_{\mu_{\theta}}(s') \right\}$
Collecting the terms that contain $\nabla \mu(s;\theta)$:
$\nabla q_{\mu_{\theta}}(s,\mu(s;\theta)) = \nabla\mu(s;\theta) \cdot \left[\nabla_{a}r(s,a) + \gamma\sum_{s'}\nabla_{a}P(s' \mid s,a) \cdot V_{\mu_{\theta}}(s')\right]_{a=\mu(s;\theta)} + \gamma \sum_{s'}P(s'\mid s,\mu(s;\theta))\nabla V_{\mu_{\theta}}(s')$
Moreover, since $V_{\mu_{\theta}}(s')$ does not depend on $a$:

$\begin{aligned} \nabla_{a} r(s,a) + \gamma \sum_{s'} \nabla_{a} P(s' \mid s,a) \cdot V_{\mu_{\theta}}(s') & = \nabla_{a} r(s,a) + \nabla_{a} \left[\gamma \sum_{s'} P(s' \mid s,a) \cdot V_{\mu_{\theta}}(s')\right] \\ & = \nabla_{a} \left[r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \cdot V_{\mu_{\theta}}(s')\right] \\ & = \nabla_{a} q_{\mu_{\theta}}(s,a) \end{aligned}$

It follows that:
$\nabla q_{\mu_{\theta}}(s,\mu(s;\theta)) = \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} + \gamma \sum_{s'}P(s' \mid s,\mu(s;\theta)) \cdot \nabla V_{\mu_{\theta}}(s')$

and finally:
$\nabla V_{\mu_{\theta}}(s) = \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} + \gamma \sum_{s'}P(s' \mid s,\mu(s;\theta)) \cdot \nabla V_{\mu_{\theta}}(s')$
This is a recursion from $\nabla V_{\mu_{\theta}}(s)$ to $\nabla V_{\mu_{\theta}}(s')$. The same recursion relates $\nabla V_{\mu_{\theta}}(s')$ to $\nabla V_{\mu_{\theta}}(s'')$:
$\nabla V_{\mu_{\theta}}(s') = \nabla \mu(s';\theta) \cdot \nabla_{a'}q_{\mu_{\theta}}(s',a')|_{a' = \mu(s';\theta)} + \gamma \sum_{s''}P(s'' \mid s',\mu(s';\theta)) \cdot \nabla V_{\mu_{\theta}}(s'')$

Substituting $\nabla V_{\mu_{\theta}}(s')$ back into $\nabla V_{\mu_{\theta}}(s)$:
$\nabla V_{\mu_{\theta}}(s) = \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} + \gamma \sum_{s'}P(s' \mid s,\mu(s;\theta)) \cdot \left\{\nabla \mu(s';\theta) \cdot \nabla_{a'}q_{\mu_{\theta}}(s',a')|_{a' = \mu(s';\theta)} + \gamma \sum_{s''}P(s'' \mid s',\mu(s';\theta)) \cdot \nabla V_{\mu_{\theta}}(s'') \right\}$

Expanding the braces gives a sum of three terms:
$\begin{aligned} & \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} \\ + \; & \gamma \sum_{s'}P(s' \mid s,\mu(s;\theta)) \cdot \nabla \mu(s';\theta) \cdot \nabla_{a'}q_{\mu_{\theta}}(s',a')|_{a' = \mu(s';\theta)} \\ + \; & \gamma \sum_{s'}P(s' \mid s,\mu(s;\theta)) \cdot \gamma \sum_{s''}P(s'' \mid s',\mu(s';\theta)) \cdot \nabla V_{\mu_{\theta}}(s'') \end{aligned}$
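The last term can be expanded further in the same way, so we first look for a pattern in the first two terms.
The first term can be read as the gradient contribution at the state reached from $s$ under the deterministic policy $\mu(s;\theta)$ after 0 state transitions, which is $s$ itself.
Going from $s$ to $s$ in zero steps means that no transition has taken place, so the corresponding transition probability $P(s \mid s,\mu(s;\theta)) = 1$ always holds, and $s$ is the only possible resulting state. The first term can therefore be rewritten as: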

$\begin{aligned} \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} & = 1 \times \sum_{s} 1 \times \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} \\ & = \gamma^{0} \sum_{s} P(s \mid s,\mu(s;\theta)) \cdot \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} \end{aligned}$

Comparing the first and second terms again:

$\begin{aligned} & \gamma^0 \sum_{s} P(s \mid s,\mu(s;\theta)) \cdot \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} \\ & \gamma \sum_{s'}P(s' \mid s,\mu(s;\theta)) \cdot \nabla \mu(s';\theta) \cdot \nabla_{a'}q_{\mu_{\theta}}(s',a')|_{a' = \mu(s';\theta)} \end{aligned}$

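The pattern is now clear. If the state $s^{(N)}$ is reached from $s$ after $N$ state transitions, the corresponding term is:

$\gamma^N \sum_{s^{(N)}} P(s^{(N)} \mid s,\mu(s;\theta)) \cdot \nabla \mu(s^{(N)};\theta) \cdot \nabla_{a^{(N)}}q_{\mu_{\theta}}(s^{(N)},a^{(N)})|_{a^{(N)} = \mu(s^{(N)};\theta)}$

Here $P(s^{(N)} \mid s,\mu(s;\theta))$ is shorthand for the probability of reaching $s^{(N)}$ from $s$ in $N$ transitions under $\mu_{\theta}$, that is, the product of the one-step transition probabilities summed over all intermediate states.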

Therefore, $\nabla \mathcal J(\theta) = \nabla V_{\mu_{\theta}}(s_0)$ can be written as follows, starting the recursion at the initial state $s_0$:

$\begin{aligned} \nabla \mathcal J(\theta) & = \nabla V_{\mu_{\theta}}(s_0) \\ & = \nabla \mu(s_0;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s_0,a)|_{a = \mu(s_0;\theta)} + \gamma \sum_{s'}P(s' \mid s_0,\mu(s_0;\theta)) \cdot \nabla V_{\mu_{\theta}}(s') \\ & = \nabla \mu(s_0;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s_0,a)|_{a = \mu(s_0;\theta)} + \gamma \sum_{s'}P(s' \mid s_0,\mu(s_0;\theta)) \cdot \nabla \mu(s';\theta) \cdot \nabla_{a'}q_{\mu_{\theta}}(s',a')|_{a' = \mu(s';\theta)} + \gamma \sum_{s'}P(s' \mid s_0,\mu(s_0;\theta)) \cdot \gamma \sum_{s''}P(s'' \mid s',\mu(s';\theta)) \cdot \nabla V_{\mu_{\theta}}(s'') \\ & = \dots \\ & = \sum_{N = 0}^{+\infty} \gamma^{N} \sum_{s^{(N)}} P(s^{(N)} \mid s_0,\mu(s_0;\theta)) \cdot \nabla \mu(s^{(N)};\theta) \cdot \nabla_{a^{(N)}}q_{\mu_{\theta}}(s^{(N)},a^{(N)})|_{a^{(N)} = \mu(s^{(N)};\theta)} \end{aligned}$

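As before, a state distribution can be introduced. Define the symbol

$P_r\{s_0 \to s,k,\mu\}$

as the probability of reaching state $s$ from the initial state $s_0$ after $k$ state transitions under the deterministic policy $\mu_{\theta}$. Writing $\eta(s) = \sum_{N=0}^{+\infty}\gamma^{N} P_r\{s_0 \to s,N,\mu\}$ for the discounted total, the gradient becomes:

$\begin{aligned} \nabla \mathcal J(\theta) = \nabla V_{\mu_{\theta}}(s_0) & = \sum_{s \in \mathcal S} \sum_{N = 0}^{+\infty} \gamma^{N} \times P_r\{s_0 \to s,N,\mu\} \times \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} \\ & = \sum_{s \in \mathcal S} \eta(s) \times \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)} \end{aligned}$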
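
To make $\eta(s)$ concrete, the sketch below accumulates $\gamma^{N} P_r\{s_0 \to s,N,\mu\}$ over $N$ for a small chain. The three-state transition matrix `P_MU` (the dynamics with the deterministic action already plugged in), the discount factor, and the truncation at a finite number of steps are all hypothetical choices made for illustration.

```python
GAMMA = 0.9

# Hypothetical transition matrix under a fixed deterministic policy:
# P_MU[s][s_next] = P(s_next | s, mu(s; theta)). Each row sums to 1.
P_MU = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.25, 0.0, 0.75],
]


def step_distribution(dist, transition):
    """Push a distribution over states through one transition (dist @ transition)."""
    n = len(transition)
    return [sum(dist[s] * transition[s][s_next] for s in range(n)) for s_next in range(n)]


def discounted_visitation(s0, transition, gamma, n_steps=500):
    """Truncated eta(s) = sum over N of gamma^N * Pr{s0 -> s, N, mu}."""
    n = len(transition)
    dist = [1.0 if s == s0 else 0.0 for s in range(n)]  # N = 0: still at s0
    eta = [0.0] * n
    discount = 1.0
    for _ in range(n_steps):
        eta = [e + discount * d for e, d in zip(eta, dist)]
        dist = step_distribution(dist, transition)
        discount *= gamma
    return eta


eta = discounted_visitation(0, P_MU, GAMMA)
total = sum(eta)                 # close to 1 / (1 - GAMMA)
rho = [e / total for e in eta]   # eta rescaled into a probability distribution
```

The entries of `eta` sum to approximately $1/(1-\gamma)$, because the $N$-step probabilities sum to 1 for every $N$. Dividing by that total yields a proper probability distribution over states.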

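Normalizing $\eta(s)$ into a state distribution $\rho^{\mu_{\theta}}(s) = \eta(s) / \sum_{s'}\eta(s')$ gives an expression similar to the policy gradient theorem:
$\nabla \mathcal J(\theta) \propto \sum_{s \in \mathcal S}\rho^{\mu_{\theta}}(s) \times \nabla \mu(s;\theta) \cdot \nabla_{a}q_{\mu_{\theta}}(s,a)|_{a = \mu(s;\theta)}$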

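Finally, writing the state distribution as $S_t \sim \rho^{\mu_{\theta}}$ turns the sum into an expectation (the constant of proportionality only rescales the gradient and can be absorbed into the step size):
$\nabla \mathcal J(\theta) = \mathbb E_{S_t \sim \rho^{\mu_{\theta}}} \left[\nabla \mu(S_t;\theta) \nabla_{a}q_{\mu_{\theta}}(S_t,a) |_{a=\mu(S_t;\theta)}\right]$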
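
As a numerical sanity check of the derivation, the sketch below compares the sum $\sum_{s}\eta(s)\,\nabla\mu(s;\theta)\,\nabla_{a}q_{\mu_{\theta}}(s,a)|_{a=\mu(s;\theta)}$ (the form before normalizing $\eta$) against a finite-difference gradient of $V_{\mu_{\theta}}(s_0)$. Everything about the environment is an assumption made for illustration: a two-state MDP with a scalar continuous action, the dynamics `p_next1`, the `reward` function, the constants, and a tabular policy with one parameter per state.

```python
import math

GAMMA = 0.9
S0 = 0                 # initial state
BIAS = [0.0, 0.5]      # hypothetical per-state offsets in the dynamics
TARGET = [1.0, -0.5]   # hypothetical per-state preferred actions


def p_next1(s, a):
    # P(s' = 1 | s, a); P(s' = 0 | s, a) is one minus this value.
    return 1.0 / (1.0 + math.exp(-(a + BIAS[s])))


def reward(s, a):
    return float(s) - 0.1 * (a - TARGET[s]) ** 2


def mu(s, theta):
    # Tabular deterministic policy: d mu(s) / d theta[k] is 1 if k == s, else 0.
    return theta[s]


def values(theta, sweeps=2000):
    # Iterative policy evaluation of V_mu for both states.
    v = [0.0, 0.0]
    for _ in range(sweeps):
        p = [p_next1(s, mu(s, theta)) for s in (0, 1)]
        v = [reward(s, mu(s, theta)) + GAMMA * ((1 - p[s]) * v[0] + p[s] * v[1])
             for s in (0, 1)]
    return v


def dq_da(s, a, v):
    # d/da [ r(s, a) + gamma * sum_s' P(s' | s, a) V(s') ] with V held fixed.
    p = p_next1(s, a)
    return -0.2 * (a - TARGET[s]) + GAMMA * p * (1 - p) * (v[1] - v[0])


def eta(theta, steps=2000):
    # eta(s) = sum over N of gamma^N * Pr{S0 -> s, N, mu}, truncated.
    p = [p_next1(s, mu(s, theta)) for s in (0, 1)]
    dist = [1.0 if s == S0 else 0.0 for s in (0, 1)]
    total, discount = [0.0, 0.0], 1.0
    for _ in range(steps):
        total = [t + discount * d for t, d in zip(total, dist)]
        dist = [dist[0] * (1 - p[0]) + dist[1] * (1 - p[1]),
                dist[0] * p[0] + dist[1] * p[1]]
        discount *= GAMMA
    return total


def dpg_gradient(theta):
    v, e = values(theta), eta(theta)
    # With a tabular policy, the sum over states keeps one term per parameter.
    return [e[k] * dq_da(k, mu(k, theta), v) for k in (0, 1)]


def finite_difference(theta, eps=1e-5):
    grad = []
    for k in (0, 1):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((values(up)[S0] - values(down)[S0]) / (2 * eps))
    return grad


theta = [0.2, -0.3]
analytic = dpg_gradient(theta)
numeric = finite_difference(theta)
```

In this toy setting the two gradients agree up to finite-difference error. This is a check on a hand-built example, not a practical algorithm: real problems do not give access to the dynamics or to the exact $q_{\mu_{\theta}}$.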

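In essence, the derivation of the deterministic policy gradient theorem closely mirrors that of the policy gradient theorem; the two differ only in their starting point.
Because $\mu(S_t;\theta)$ is a deterministic policy, no expectation over actions is taken. There is therefore no need to multiply and divide by the action probability, as the policy gradient theorem does with $\pi(A_t \mid S_t;\theta)$, in order to obtain a $\log$ term.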
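
For comparison, the step in the stochastic derivation that produces the $\log$ term is the identity below, written for a discrete action set with $\pi(a \mid s;\theta) > 0$ as a simplifying assumption. It has no counterpart in the deterministic case, where no sum over actions weighted by $\pi$ appears:
$\sum_{a}\nabla\pi(a \mid s;\theta)\,q_{\pi_{\theta}}(s,a) = \sum_{a}\pi(a \mid s;\theta)\frac{\nabla\pi(a \mid s;\theta)}{\pi(a \mid s;\theta)}q_{\pi_{\theta}}(s,a) = \mathbb E_{a \sim \pi_{\theta}}\left[\nabla\log\pi(a \mid s;\theta)\,q_{\pi_{\theta}}(s,a)\right]$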

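References:
Liu Quan and Huang Zhigang (eds.), 深度强化学习原理、算法pytorch实战 (Deep Reinforcement Learning: Principles, Algorithms, and PyTorch in Practice)
深度强化学习-确定性策略梯度算法推导 (Deep Reinforcement Learning: Derivation of the Deterministic Policy Gradient Algorithm)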


Original article: https://blog.csdn.net/qq_34758157/article/details/126033456