Temporal-Difference Learning
2018-05-25
Introduction
- TD learning is a combination of Monte Carlo and dynamic programming ideas.
- Like Monte Carlo, it learns directly from raw experience, without a model of the environment.
- Like dynamic programming, it updates estimates based in part on other learned estimates (bootstrapping).
1. TD Prediction
- Both TD and Monte Carlo (MC) methods use experience to solve the prediction problem.
- Experience generated by following a policy $\pi$ is used to update the estimate $V(S_t)$.
- Monte Carlo must wait until the end of the episode, when the return $G_t$ is known, before it can update $V(S_t)$.
1.1 TD prediction
- TD methods need to wait only until the next time step.
- Target : $R_{t+1} + \gamma V(S_{t+1})$
- This one-step bootstrapping method is called TD(0), with update $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$; see the sketch below.
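To make the TD(0) update concrete, here is a minimal tabular sketch in Python. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` function are illustrative assumptions, not part of the original notes.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction: estimate V ~ v_pi for a fixed policy.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
    """
    V = collections.defaultdict(float)   # V(s), initialized to 0 for all states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # follow the given policy pi
            next_state, reward, done = env.step(action)  # observe R_{t+1}, S_{t+1}
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])      # TD(0) update
            state = next_state
    return V
```

Note that the value estimate is updated at every time step, without waiting for the episode to finish.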
1.2 Monte Carlo와 TD의 Target 비교
- Monte Carlo target : $G_t$
- Bellman equation : $v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]$
- TD method target : $R_{t+1} + \gamma V(S_{t+1})$, which uses the current estimate $V$ in place of the true $v_{\pi}$
1.3 TD error
- Both Monte Carlo and TD methods perform sample updates: they look ahead to a sample successor state (or state-action pair) to compute the update.
- TD error : the difference between the current estimate of $S_t$ and the better estimate formed at time $t+1$: $\delta_{t} = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ (a short derivation follows below).
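A short derivation from Sutton and Barto (valid as long as $V$ does not change during the episode) connects the two targets compared above: the Monte Carlo error is a sum of TD errors.

$$G_t - V(S_t) = R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) = \delta_t + \gamma\,(G_{t+1} - V(S_{t+1})) = \cdots = \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k$$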
2. Advantages of TD Prediction Methods
- Do not require a model of the environment: no need to know the reward function or transition probabilities.
- Can be implemented in an online, fully incremental fashion.
- Some applications have very long episodes, and Monte Carlo, which must wait until the end of an episode, is not suitable for them.
3. TD Control
- Use TD prediction methods for the control problem.
- TD control also faces the need to trade off exploration and exploitation.
- Two approaches:
- On-policy
- Off-policy
3.1 TD Control - on-policy
- Learn an action-value function rather than a state-value function.
- On-policy methods estimate $q_\pi(s,a)$ for the current behavior policy $\pi$.
- Update is done after every transition $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$
- Target : $R_{t+1} + \gamma Q(S_{t+1},A_{t+1})$
3.1 On-policy TD control : Sarsa
- The resulting update is $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$.
- The full Sarsa control algorithm is given in the box in Sutton and Barto; a code sketch follows below.
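Since the algorithm box is not reproduced here, the following is a minimal sketch of tabular Sarsa with an $\epsilon$-greedy behavior policy. The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the `epsilon_greedy` helper are illustrative assumptions.

```python
import random
import collections

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon explore, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: uses the action actually selected at the next state.
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

Because the target is built from the action the behavior policy actually takes next, Sarsa learns $q_\pi$ for its own ($\epsilon$-greedy) policy.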
3.2 Off-policy TD control : Q-learning
- “One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning” (Watkins, 1989).
- Target : $R_{t+1} + \gamma \max_{a}Q(S_{t+1},a)$
- Q-learning also learns an action-value function, but the learned $Q$ directly approximates the optimal action-value function $q_*$, independent of the policy being followed.
- Q-learning takes the maximum over the next actions at the next state, rather than the value of the action actually taken: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_{a}Q(S_{t+1},a) - Q(S_t,A_t)]$. A code sketch follows below.
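Here is a minimal sketch of the corresponding tabular Q-learning loop, under the same assumed environment interface as the Sarsa sketch above; the only change is the max over next actions in the target.

```python
import random
import collections

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: maximum over next actions at the next state,
            # regardless of which action the behavior policy will actually take.
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```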
3.3 Expected Sarsa
- Expected Sarsa is just like Q-learning, except that instead of the maximum over next state-action pairs it uses the expected value under the current policy.
- Each next action's value is weighted by how likely that action is under the current policy: target $R_{t+1} + \gamma \sum_{a}\pi(a \mid S_{t+1})Q(S_{t+1},a)$. A code sketch follows below.
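As a concrete example, this sketch computes the Expected Sarsa target for an $\epsilon$-greedy policy derived from the current action-value table; the `(state, action)`-keyed dictionary `Q` and the action list are illustrative assumptions.

```python
def expected_sarsa_target(Q, next_state, actions, reward, gamma=1.0, epsilon=0.1):
    """Target R_{t+1} + gamma * sum_a pi(a|S_{t+1}) * Q(S_{t+1}, a)
    for an epsilon-greedy policy pi derived from Q."""
    q_next = [Q[(next_state, a)] for a in actions]
    best = max(q_next)
    n_best = sum(1 for q in q_next if q == best)
    expected_value = 0.0
    for q in q_next:
        prob = epsilon / len(actions)            # exploration probability of every action
        if q == best:
            prob += (1.0 - epsilon) / n_best     # greedy mass, split among tied best actions
        expected_value += prob * q
    return reward + gamma * expected_value
```

If the policy were fully greedy ($\epsilon = 0$), this expectation would reduce to the Q-learning target.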
3.4 Maximization Bias and Double Learning
- Q-learning : the target policy is greedy.
- Sarsa : the target policy is often $\epsilon$-greedy.
- A maximum over estimated values can lead to a significant positive bias (maximization bias).
3.4 Maximization Bias
- Example
- A single state $s$ with many actions $a$.
- Every true action value $q(s,a)$ is 0.
- Because of estimation uncertainty, some estimated values $Q(s,a)$ lie above 0 and some below.
- The maximum of the true values is 0, but the maximum of the estimated values is positive; see the simulation sketch below.
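This example can be checked with a tiny simulation (illustrative numbers: 10 actions, 5 noisy zero-mean reward samples per action): the average maximum of the sample-mean estimates comes out clearly positive even though every true value is 0.

```python
import random

def average_max_estimate(num_actions=10, samples_per_action=5, trials=10000):
    """All true action values are 0; each estimate is the sample mean of noisy rewards."""
    total = 0.0
    for _ in range(trials):
        estimates = []
        for _ in range(num_actions):
            rewards = [random.gauss(0.0, 1.0) for _ in range(samples_per_action)]
            estimates.append(sum(rewards) / samples_per_action)
        total += max(estimates)    # maximum of the estimates, not of the true values
    return total / trials

print(average_max_estimate())      # clearly positive (roughly 0.7 with these numbers)
```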
3.4 Avoiding maximization bias?
- The bias arises from using the same samples both to determine the maximizing action and to estimate its value.
- Divide the plays into two sets and learn two independent estimates.
- $Q_1(a)$, $Q_2(a)$ : each an estimate of the true value $q(a)$.
- Either estimate can be used to pick the maximizing action:
- $A^*=\operatorname{argmax}_{a}Q_1(a)$
- $A^*=\operatorname{argmax}_{a}Q_2(a)$
3.4 Double learning
- Use one estimate, $Q_1$, to determine the maximizing action: $A^*=\operatorname{argmax}_{a}Q_1(a)$.
- Use the other estimate, $Q_2$, to provide the estimate of its value:
- $Q_2(A^*) = Q_2(\operatorname{argmax}_{a}Q_1(a))$
- This estimate is then unbiased, in the sense that $\mathbb{E}[Q_2(A^*)] = q(A^*)$.
- The roles of the two estimates can also be reversed:
- $Q_1(A^*) = Q_1(\operatorname{argmax}_{a}Q_2(a))$
- This is the idea of double learning.
3.4 Double Q-learning
- Update rule of Double Q-learning:
$Q_1(S_t,A_t)\leftarrow Q_1(S_t,A_t) +\alpha[R_{t+1} + \gamma Q_2(S_{t+1}, \operatorname{argmax}_{a} Q_1(S_{t+1},a))- Q_1(S_t,A_t)]$
- The two approximate value functions are treated symmetrically: on each step the roles of $Q_1$ and $Q_2$ are swapped with probability 0.5.
3.4 Complete algorithm for Double Q-learning
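The complete algorithm box is not reproduced here; below is a minimal sketch of tabular Double Q-learning consistent with the update rule above. The environment interface and the $\epsilon$-greedy behavior over $Q_1 + Q_2$ are illustrative assumptions.

```python
import random
import collections

def double_q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q1 = collections.defaultdict(float)
    Q2 = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5, swap the roles of the two estimates.
            if random.random() < 0.5:
                A, B = Q1, Q2
            else:
                A, B = Q2, Q1
            # A selects the maximizing next action, B evaluates it.
            best = max(actions, key=lambda a: A[(next_state, a)])
            target = reward + gamma * B[(next_state, best)] * (not done)
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```

Separating action selection ($A$) from action evaluation ($B$) is exactly what removes the maximization bias described above.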
Summary : Temporal-Difference Learning
- We divided the overall problem into a prediction problem and a control problem.
- Prediction problem : TD methods are alternatives to Monte Carlo methods.
- Control problem : both follow the idea of generalized policy iteration (GPI).
- This is the idea that the approximate policy and value function should interact so that both move toward their optimal values.
- Prediction problem : predict the return for the current policy.
- Control problem : improve the policy (e.g., $\epsilon$-greedy) with respect to the current value function.
- Two kinds of TD control : on-policy and off-policy.
- On-policy : Sarsa
- Off-policy : Q-learning, Expected Sarsa
References
- Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition (in progress), MIT Press, Cambridge, MA, 2017.
https://wonseokjung.github.io/