• RL 基础 | Value Iteration 的收敛性证明



    (其实是专业课作业🤣 感觉算法岗面试可能会问,来存一下档)

    问题:证明 Value Iteration 收敛性

    Please prove the convergence of Value Iteration algorithm in discounted reward setting, and analyze the performance between resulting policy and optimal policy. 证明 Value Iteration 的收敛性。

    背景:

    • Policy Iteration:先得到一个 policy,然后算出它的 value function,基于该 value function 取 max_a Q(s,a) 作为新 policy,再算新 policy 的 value function…
    • Value Iteration:不停对 value function 使用贝尔曼算子,新 V = max_a [r(s,a) + γV(s')]。

    0 Definitions - 定义

    First, we define

    • Discrete state space S={s1,,sn}, discrete action space A={a1,,an}.
    • Value function: V(s)=Eaπ(a|s)[r(s,a)+γEsp(s|s,a)V(s)] .
    • Bellman operator: for all sS ,

    BπV(s)Eaπ(a|s)[r(s,a)+γEsp(s|s,a)V(s)]BV(s)maxa[r(s,a)+γEsp(s|s,a)V(s)]

    The Value Iteration algorithm is to conduct Bellman operator B on the value function V(s). We want to prove that the sequence will converge to the optimal value function V, which should be the of the Bellman equation BV=V .

    我们定义了状态空间、动作空间、值函数(value function)和贝尔曼算子(Bellman operator)。VI 算法,就是对 value function 一直使用 Bellman operator,希望最后能收敛到最优的 value function。这样,该最优 value function 必然是 Bellman equation BV=V 的不动点。

    1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射

    Then, we need to prove that Bellman Operator is a Contraction Mapping. The definition of Contraction Mapping is as follows:

    d(f(x1),f(x2))γd(x1,x2),

    where d is a measurement over a metric space, γ(0,1), and f is the Contraction Mapping. In our case, the definition of Contraction Mapping is as follows:

    |f(x1)f(x2)|γ|x1x2|,

    where || is the infinite norm (return the max value over all dimensions of a vector).

    接下来,证明贝尔曼算子是压缩映射(Contraction Mapping)。在这里,压缩映射的定义是,两个变量的无穷范数(向量各维之差的最大绝对值)会被 f 映射压缩 γ 倍。

    Let's prove that Bellman Operator B is a Contraction Mapping. The proof is as follows.

    |BV1(s)BV2(s)|=|maxa[r(s,a)+γEs p(s|s,a)V1(s)]maxa[r(s,a)+γEsV2(s)]||maxa[r(s,a)+γEsV1(s)r(s,a)γEsV2(s)]|=|γmaxaEs(V1(s)V2(s))|γmaxa|Es(V1(s)V2(s))|=γmaxa|sp(s|s,a)[V1(s)V2(s)]|γmaxasp(s|s,a)|V1(s)V2(s)|=γsp(s|s,a(s))|V1(s)V2(s)|γsp(s|s,a(s))maxs|V1(s)V2(s)|=γ|V1V2|

    贝尔曼算子是压缩映射的证明:一连串放缩,如上。

    2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数

    If Bellman Operator is a Contraction Mapping, then we can use the Banach fixed point theorem to prove that, the optimal value function V is the unique fixed point of the Bellman Operator B, and the sequence B(B(BV)) converges to V.

    如果贝尔曼算子是压缩映射,则可以用巴拿赫不动点定理(Banach fixed point theorem),证明 V 是 Bellman Equation BV=V 的唯一不动点,并且一定会收敛到该不动点。

    The first step is to prove that the fixed point V is unique. Use proof by contradiction. Suppose there are two converged value V1,V2, then we have |BV1BV2|=|V1V2|. However, Bellman operator B is a Contraction Mapping, so there is |BV1BV2|γ|V1V2|, which contradicts the previous one. Thus, there can be only one converged value, and it is obviously the fixed point V.

    第一步,证明收敛到的不动点是唯一的。使用反证法,假设有两个不动点,那么它们俩的压缩映射都映射到原地(不动点嘛),距离没有 γ 倍压缩,矛盾,从而得证。

    The second step is to prove that the fixed point V really exists. If we can prove that {V,BV,B2V,} is a Cauchy sequence, the fixed point V exists. Take two value Ba,Bb from the sequence, where ab. Use the triangle inequality of || as a measurement, we can get

    |BaBb||BaBa1|+|Ba1Bb||BaBa1|+|Ba1Ba2|+|Ba2Bb||BaBa1|++|Bb+1Bb|γb|BVV|i=0ab1γiγb1γ|BVV|.

    So, if we can choose a large enough b, the value of |BaBb| can be less than any ϵ>0. Thus, there must exists an NN, such that |BaBb|<ϵ, for all a,b>N . The sequence is a Cauchy sequence, and the fixed point V exists.

    第二步,证明不动点 V 存在。去证 {V,BV,B2V,} 是柯西序列(Cauchy sequence),如果是柯西序列,那么就存在收敛的不动点。柯西序列的定义是,对于序列里的 xa,xb,存在正整数 N 使得对于任意 a,b>N,度量 d(xa,xb)<ϵ,其中 ϵ 是任意小的正实数。先使用三角不等式,将 xaxb 的形式拆成 (xaxa1)+(xa1xa2)++(xb+1xb),然后使用压缩映射的性质,放缩成 γ 的等比序列求和,即可得到,它小于任意的小实数。

    Reference - 参考资料



  • 相关阅读:
    【管理运筹学】背诵手册(四)| 整数规划
    【深度学习】吴恩达课程笔记(二)——浅层神经网络、深层神经网络
    修改Tomcat的默认端口号
    Banana Pi BPI-W3 RK3588平台驱动调试篇 [ PCIE篇一 ] - PCIE的开发指南
    Tomcat 源码分析 (Tomcat请求处理流程) (五)
    AugMixDataset的一些示例
    如何用python做游戏(简单易上手版)【送 源码】
    windows本地搭建mmlspark分布式机器平台流程
    【C++】 内存管理
    python-oauth2实现开放接口
  • 原文地址:https://www.cnblogs.com/moonout/p/17783506.html