The key idea is that rather than having to tell the algorithm the right output y for every single input, all you have to do is specify a reward function that tells it when it's doing well and when it's doing poorly.
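As a minimal sketch of the contrast with supervised labels, a reward function can be as simple as a number attached to each state; the six-state layout and the reward values below are illustrative assumptions, not values from the text.

```python
# A hypothetical reward function for a six-state line world.
# Instead of labeling the "right" action for every state (supervised learning),
# we only say how good it is to be in each state.
TERMINAL_REWARDS = {1: 100.0, 6: 40.0}  # assumed terminal states and rewards

def reward(state: int) -> float:
    """Reward for being in `state`; zero everywhere except the terminal states."""
    return TERMINAL_REWARDS.get(state, 0.0)
```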
The first step is r^0.
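If r here denotes the discount factor, as it does in the Q(s, a) update later in this section (an assumption, since the symbol is not defined at this point), this says that the reward of the first step is weighted by r^0 = 1, the second by r^1, and so on, giving the discounted return:

Return = r^0 · R(s₁) + r^1 · R(s₂) + r^2 · R(s₃) + …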
Select the orientation according to the first two tables. For example, π(2) is left while π(5) is right, where the number denotes the state.
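As a sketch, a policy over a handful of discrete states is just a lookup from state to action; only π(2) = left and π(5) = right come from the text, and the rest of the table is assumed for illustration.

```python
# A deterministic policy: maps each state number to an action.
# Only states 2 and 5 are given in the text; the others are assumed.
policy = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}

def pi(state: int) -> str:
    """Return the action the policy picks in `state`."""
    return policy[state]

print(pi(2))  # "left"
print(pi(5))  # "right"
```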
The following update will be applied iteratively:
Q(s, a) = R(s) + r · max Q(s′, a′)

where R(s) is the reward of the current state s, s′ is the state reached after taking action a in s, the max is taken over all actions a′ available from s′, and r here is the discount factor.
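A minimal tabular sketch of applying this update repeatedly until the values settle; the six-state world, its rewards, and the discount value r = 0.5 are illustrative assumptions, not details given in the text.

```python
# Tabular Q computed by repeatedly applying Q(s,a) = R(s) + r * max_a' Q(s',a').
STATES = [1, 2, 3, 4, 5, 6]
ACTIONS = ["left", "right"]
TERMINAL = {1, 6}
DISCOUNT = 0.5  # the "r" in the equation above (assumed value)

def reward(s):
    return {1: 100.0, 6: 40.0}.get(s, 0.0)

def next_state(s, a):
    return s - 1 if a == "left" else s + 1

# Start from an arbitrary Q, then sweep until the values converge.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(100):
    for s in STATES:
        for a in ACTIONS:
            if s in TERMINAL:
                Q[(s, a)] = reward(s)  # no future reward after a terminal state
            else:
                s2 = next_state(s, a)
                Q[(s, a)] = reward(s) + DISCOUNT * max(Q[(s2, a2)] for a2 in ACTIONS)

print(Q[(5, "right")])  # estimated value of moving right from state 5
```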
Sometimes the agent actually ends up accidentally slipping and going in the opposite direction; the environment is stochastic, so the same action does not always lead to the same next state.
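One way to model that slipping is a transition that reverses the chosen action with a small probability; the 10% misstep probability and the left/right action set below are assumed for illustration.

```python
import random

MISSTEP_PROB = 0.1  # assumed probability of slipping

def step(state: int, action: str) -> int:
    """Move in the chosen direction, but with probability MISSTEP_PROB slip the other way.
    Terminal-state handling is omitted in this sketch."""
    if random.random() < MISSTEP_PROB:
        action = "left" if action == "right" else "right"  # slipped: opposite direction
    return state - 1 if action == "left" else state + 1
```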
Every state variable is continuous: rather than being one of a small set of discrete states, each component of the state (for example a position or a velocity) can take any real value.
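For illustration, a continuous state is just a vector of real numbers; the particular components below are assumed.

```python
import numpy as np

# A continuous state: every component is a real number, not one of a few discrete states.
state = np.array([0.3, -1.2, 0.05, 2.4])  # e.g. x, y, x-velocity, y-velocity (assumed layout)
```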
Q is initialized to random values at first; we will train the model to find a better estimate of Q.
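One common way to turn that into training data, sketched here under the assumption that Q is approximated by some model such as a neural network (the text does not spell this out), is to start from the randomly initialized estimate and repeatedly fit it to targets built from the update above. The helper below only constructs those targets from stored (s, a, R(s), s′) tuples; its names are placeholders.

```python
DISCOUNT = 0.5  # assumed value for the discount factor "r"
ACTIONS = ["left", "right"]

def build_targets(experience, q_estimate):
    """experience: list of stored tuples (s, a, R(s), s_next).
    q_estimate(s, a): the current (initially random) guess of Q.
    Returns one training target y = R(s) + r * max_a' Q(s', a') per tuple."""
    targets = []
    for s, a, r_s, s_next in experience:
        best_next = max(q_estimate(s_next, a2) for a2 in ACTIONS)
        targets.append(r_s + DISCOUNT * best_next)
    return targets
```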
For example, ε = 0.05: with probability 0.05 we pick an action at random to keep exploring, and with probability 0.95 we pick the action that maximizes our current estimate of Q(s, a). If we choose ε poorly, learning may take 100 times as long.
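A minimal sketch of that ε-greedy choice; the q_estimate function and the action names are placeholder assumptions.

```python
import random

EPSILON = 0.05
ACTIONS = ["left", "right"]

def choose_action(state, q_estimate):
    """With probability EPSILON pick a random action (explore);
    otherwise pick the action with the highest estimated Q (exploit)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_estimate(state, a))
```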
The idea of mini-batch gradient descent is to not use all 100 million training examples on every single iteration through this loop. Instead, we pick a smaller number, call it m′ = 1,000, say. On every step, instead of using all 100 million examples, we pick some subset of m′ = 1,000 examples.
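One simple way to sketch this is to sample m′ random example indices on every step and compute the gradient on just those rows; the sampling helper below is an assumption about how the subset is picked, not the course's exact procedure.

```python
import random

def minibatch_indices(m: int, m_prime: int = 1000):
    """Randomly pick m_prime of the m training-example indices for one gradient step."""
    return random.sample(range(m), m_prime)

# On each iteration, compute the gradient only on the examples at these indices,
# e.g. X[idx] and y[idx], instead of on all m examples.
# idx = minibatch_indices(m=100_000_000)
```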