Return: the discounted cumulative reward, which measures how good the situation is from the current time step onward; future rewards matter less than the current reward.
U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots
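As a concrete illustration, here is a minimal Python sketch of the discounted return; the reward sequence and discount factor are made-up values, not from the notes.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    u = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Hypothetical reward sequence observed from time t onward.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```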
Action-Value Function: approximates U_t, conditioned on the action taken.
Q_\pi(s_t, a_t) = \mathbb{E} [U_t \mid s_t, a_t]
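Since Q_π is a conditional expectation, one simple way to estimate it is Monte Carlo averaging over sampled returns. A sketch, where `sample_return` is a hypothetical function that takes action a_t in state s_t, follows π afterwards, and returns one sampled U_t:

```python
def mc_estimate_q(sample_return, s_t, a_t, n=1000):
    """Monte Carlo estimate of Q_pi(s_t, a_t) = E[U_t | s_t, a_t]."""
    total = 0.0
    for _ in range(n):
        # Each call rolls out the policy from (s_t, a_t) and returns one U_t.
        total += sample_return(s_t, a_t)
    return total / n
```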
Optimal action-value function:
Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t)
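If Q* were available, acting greedily with respect to it gives the best action. A sketch for a discrete action set, with `q_star` a hypothetical callable standing in for Q*:

```python
def greedy_action(q_star, state, actions):
    """Pick a* = argmax_a Q*(state, a) over a discrete action set."""
    return max(actions, key=lambda a: q_star(state, a))
```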
State-Value Function: approximates U_t, conditioned on the policy (averaging over the actions it would take).
V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)]
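For a discrete action set this expectation is a probability-weighted sum. A sketch, assuming hypothetical callables `pi(s)` (a dict of action probabilities under the policy) and `q_pi(s, a)`:

```python
def state_value(pi, q_pi, s_t):
    """V_pi(s_t) = sum_a pi(a | s_t) * Q_pi(s_t, a) for a discrete action set."""
    return sum(prob * q_pi(s_t, a) for a, prob in pi(s_t).items())
```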
Value-based learning. Scores each candidate action in the current state.
Deep Q network (DQN) for approximating Q*(s, a).
Learn the network parameters using temporal-difference (TD) learning.
Policy-based learning. Outputs the probability of selecting each action (see the policy-network sketch below).
Policy network for approximating π(a | s).
Learn the network parameters using policy gradient.
Actor-critic method. (Policy network + value network. Example: AlphaGo.)
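A minimal policy-network sketch (PyTorch is an assumption; the notes do not specify a framework): a small MLP whose softmax output approximates π(a | s). The layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """MLP that maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Softmax turns the scores into pi(a | s), a valid probability distribution.
        return torch.softmax(self.net(state), dim=-1)
```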
Model-Free
Value-based learning
DQN
Goal: maximize the return.
Question: if we know the optimal action-value function Q*, what is the best action? It is the action that maximizes Q*: a* = argmax_a Q*(s, a).
Challenge: Q* is unknown.
Solution: approximate Q* with a neural network Q.
Train the network with temporal-difference (TD) learning.
Definition: Optimal action-value function.
Q^*(s_t, a_t) = \max_\pi \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]
DQN: approximate Q*(s, a) with a deep Q network.
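A sketch of a DQN and one TD update on a single transition (again assuming PyTorch; the architecture and hyperparameters are illustrative, not from the notes). The TD target r + γ · max_a Q(s', a) is regressed onto the current estimate Q(s, a):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """MLP approximating Q*(s, a): outputs one Q-value per discrete action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def td_update(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One temporal-difference step on a single transition (s, a, r, s_next)."""
    q_sa = q_net(s)[a]  # current estimate Q(s, a)
    with torch.no_grad():
        # TD target: reward plus discounted best value at the next state.
        target = r + gamma * (1.0 - done) * q_net(s_next).max()
    loss = (q_sa - target) ** 2  # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```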