(This was actually a course assignment 🤣 I figure algorithm-engineer interviews might ask about it, so I'm archiving it here.)
Problem: prove the convergence of Value Iteration
Please prove the convergence of the Value Iteration algorithm in the discounted-reward setting, and analyze the performance gap between the resulting policy and the optimal policy.
Background:
- Policy Iteration: start from some policy, compute its value function, take the greedy action argmax_a Q(s,a) under that value function as the new policy, compute the new policy's value function, and so on.
- Value Iteration: repeatedly apply the Bellman operator to the value function, i.e., new V(s) = max_a [r(s,a) + γ E_{s'}[V(s')]] (a minimal code sketch follows after this list).
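To make this concrete, here is a minimal tabular Value Iteration sketch in Python/NumPy. It is only an illustration: the function name, the array layout `P[s, a, s']` / `R[s, a]`, and the stopping tolerance are my own assumptions, not part of the assignment.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Tabular Value Iteration.

    P[s, a, s'] : transition probabilities, R[s, a] : expected rewards
    (this array layout is an assumption made for this sketch).
    """
    n_states, _, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * P @ V      # shape (n_states, n_actions)
        V_new = Q.max(axis=1)      # greedy maximization over actions
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    greedy_policy = (R + gamma * P @ V).argmax(axis=1)
    return V, greedy_policy
```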
0 Definitions
First, we define:
- Discrete state space $\mathcal{S}$, discrete action space $\mathcal{A}$.
- Value function: $V \in \mathbb{R}^{|\mathcal{S}|}$, i.e., a value $V(s)$ for each state $s \in \mathcal{S}$.
- Bellman operator: for all $s \in \mathcal{S}$,
$$(\mathcal{T}V)(s) = \max_{a \in \mathcal{A}} \Big[ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, V(s') \Big],$$
where $\gamma \in [0,1)$ is the discount factor.

The Value Iteration algorithm applies the Bellman operator repeatedly: $V_{k+1} = \mathcal{T} V_k$, starting from an arbitrary $V_0$.
We have defined the state space, the action space, the value function, and the Bellman operator. The VI algorithm simply keeps applying the Bellman operator to the value function, hoping it eventually converges to the optimal value function; that optimal value function must then be a solution of the Bellman optimality equation, i.e., a fixed point $\mathcal{T}V^* = V^*$.
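As a tiny worked example of one Bellman backup (all numbers made up): take two states $s_1, s_2$ and two actions with $\gamma = 0.9$; in $s_1$, action $a_1$ gives reward $1$ and stays in $s_1$, while $a_2$ gives reward $0$ and moves to $s_2$ deterministically. With $V_0 = (1, 2)$,
$$(\mathcal{T}V_0)(s_1) = \max\{\, r(s_1,a_1) + \gamma V_0(s_1),\ r(s_1,a_2) + \gamma V_0(s_2) \,\} = \max\{1 + 0.9 \times 1,\ 0 + 0.9 \times 2\} = 1.9 .$$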
1 The Bellman operator is a Contraction Mapping
Then, we need to prove that the Bellman operator is a contraction mapping. The definition of a contraction mapping (under the infinity norm) is as follows:
$$\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma\, \|V_1 - V_2\|_\infty ,$$
where $\gamma \in [0,1)$ is the contraction factor, and where the infinity norm is $\|V\|_\infty = \max_{s \in \mathcal{S}} |V(s)|$.
In words: a contraction mapping is one under which the infinity-norm distance between two value functions (the maximum absolute difference over all states) gets multiplied by at most $\gamma < 1$ after the operator is applied.
Let's prove that the Bellman operator $\mathcal{T}$ satisfies this. For every state $s$,
$$
\begin{aligned}
\big|(\mathcal{T}V_1)(s) - (\mathcal{T}V_2)(s)\big|
&= \Big|\max_{a}\Big[r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_1(s')\Big] - \max_{a}\Big[r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_2(s')\Big]\Big| \\
&\le \max_{a}\Big|\gamma \sum_{s'} P(s' \mid s,a)\big(V_1(s') - V_2(s')\big)\Big| \\
&\le \gamma \max_{a} \sum_{s'} P(s' \mid s,a)\, \|V_1 - V_2\|_\infty
= \gamma\, \|V_1 - V_2\|_\infty .
\end{aligned}
$$
The first inequality uses $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$, and the last equality uses $\sum_{s'} P(s' \mid s,a) = 1$. Taking the maximum over $s$ on the left-hand side gives $\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$. So the proof that the Bellman operator is a contraction mapping is just this chain of bounds.
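As a sanity check (not part of the proof), here is a small numerical experiment suggesting the contraction inequality on a randomly generated tabular MDP; every size and number below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Made-up random MDP: P[s, a, :] is a probability distribution over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman(V):
    """One application of the Bellman optimality operator."""
    return (R + gamma * P @ V).max(axis=1)

# Check ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf on random value functions.
for _ in range(1000):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.max(np.abs(bellman(V1) - bellman(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    assert lhs <= rhs + 1e-12
```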
2 The Contraction Mapping converges to the optimal value function
Since the Bellman operator is a contraction mapping, we can use the Banach fixed point theorem to prove that the iteration $V_{k+1} = \mathcal{T}V_k$ converges to a unique fixed point, and that this fixed point is the optimal value function $V^*$. We argue this in two steps.
The first step is to prove that the fixed point is unique. Suppose, for contradiction, that there are two distinct fixed points $V_1^*$ and $V_2^*$. Since the contraction maps each of them to itself,
$$\|V_1^* - V_2^*\|_\infty = \|\mathcal{T}V_1^* - \mathcal{T}V_2^*\|_\infty \le \gamma\, \|V_1^* - V_2^*\|_\infty ,$$
so the distance between them is not shrunk by the factor $\gamma < 1$, which is a contradiction unless $\|V_1^* - V_2^*\|_\infty = 0$. Hence the fixed point is unique.
The second step is to prove that the fixed point is actually reached: the iterates $V_k = \mathcal{T}^k V_0$ converge to it from any initial $V_0$. Note first that the optimal value function $V^*$ satisfies the Bellman optimality equation $\mathcal{T}V^* = V^*$, so by the first step it is the unique fixed point. Applying the contraction property $k$ times,
$$\|V_k - V^*\|_\infty = \|\mathcal{T}V_{k-1} - \mathcal{T}V^*\|_\infty \le \gamma\, \|V_{k-1} - V^*\|_\infty \le \dots \le \gamma^k\, \|V_0 - V^*\|_\infty .$$
So, if we choose a large enough number of iterations $k$, the error $\|V_k - V^*\|_\infty$ becomes arbitrarily small (since $\gamma < 1$, $\gamma^k \to 0$); that is, Value Iteration converges to the optimal value function.
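Similarly, here is a small self-contained numerical illustration of the geometric rate $\gamma^k$ guaranteed by the argument above, again on a made-up random MDP (the helper name `bellman` is my own choice).

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)      # made-up random transition kernel
R = rng.random((n_states, n_actions))  # made-up rewards

def bellman(V):
    return (R + gamma * P @ V).max(axis=1)

# Approximate the fixed point V* by applying the operator many times.
V_star = np.zeros(n_states)
for _ in range(10_000):
    V_star = bellman(V_star)

# Check the Banach bound ||V_k - V*||_inf <= gamma^k * ||V_0 - V*||_inf.
V = rng.normal(size=n_states)          # arbitrary starting point V_0
err0 = np.max(np.abs(V - V_star))
for k in range(1, 21):
    V = bellman(V)
    assert np.max(np.abs(V - V_star)) <= gamma**k * err0 + 1e-10
```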
References
- A blog post on Zhihu: https://zhuanlan.zhihu.com/p/419208786
- Many thanks! A very clear blog post 😊🌹