Temporal-Difference Learning
2018-05-25
Introduction
- TD learning is a combination of Monte Carlo and dynamic programming ideas.
- Like Monte Carlo, it learns directly from raw experience, without a model of the environment.
- Like dynamic programming, it updates estimates based in part on other learned estimates (bootstrapping).
1. TD Prediction
- Both TD and Monte Carlo (MC) methods use experience to solve the prediction problem.
- Experience generated by following a policy $\pi$ is used to update the estimate $V(S_t)$.
- Monte Carlo must wait until the end of the episode, when the return $G_t$ is known, before it can update $V(S_t)$.
1.1 TD prediction
- TD methods need to wait only until the next time step.
- Target : $R_{t+1} + \gamma V(S_{t+1})$
- This one-step bootstrapping method is called TD(0), with update $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$; see the sketch below.
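To make the TD(0) update concrete, here is a minimal tabular sketch in Python. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` function are illustrative assumptions, not part of the original notes.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction: estimate V ~ v_pi for a fixed policy.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
    """
    V = collections.defaultdict(float)   # V(s), initialized to 0 for all states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # follow the given policy pi
            next_state, reward, done = env.step(action)  # observe R_{t+1}, S_{t+1}
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])      # TD(0) update
            state = next_state
    return V
```

Note that the value estimate is updated at every time step, without waiting for the episode to finish.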
1.2 Monte Carlo와 TD의 Target 비교
- Monte Carlo target : $G_t$
- Bellman equation : $v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]$
- TD method target : $R_{t+1} + \gamma V(S_{t+1})$, which uses the current estimate $V$ in place of the true $v_{\pi}$
1.3 TD error
- Both Monte Carlo and TD methods perform sample updates: they look ahead to a sample successor state (or state-action pair) to compute the update.
- TD error : the difference between the current estimate of $S_t$ and the better estimate formed at time $t+1$: $\delta_{t} = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ (a short derivation follows below).
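A short derivation from Sutton and Barto (valid as long as $V$ does not change during the episode) connects the two targets compared above: the Monte Carlo error is a sum of TD errors.

$$G_t - V(S_t) = R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) = \delta_t + \gamma\,(G_{t+1} - V(S_{t+1})) = \cdots = \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k$$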
2. Advantages of TD Prediction Methods
- Do not require a model of the environment: no need to know the reward function or transition probabilities.
- Can be implemented in an online, fully incremental fashion.
- Some applications have very long episodes, and Monte Carlo, which must wait until the end of an episode, is not suitable for them.
3. TD Control
- Use TD prediction methods for the control problem.
- TD control also faces the need to trade off exploration and exploitation.
- Two approaches:
- On-policy
- Off-policy
3.1 TD Control - on-policy
- Learn an action-value function rather than a state-value function.
- On-policy methods estimate $q_\pi(s,a)$ for the current behavior policy $\pi$.
- Update is done after every transition $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$
- Target : $R_{t+1} + \gamma Q(S_{t+1},A_{t+1})$
3.1 On-policy TD control : Sarsa
- The resulting update is $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$.
- The full Sarsa control algorithm is given in the box in Sutton and Barto; a code sketch follows below.
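Since the algorithm box is not reproduced here, the following is a minimal sketch of tabular Sarsa with an $\epsilon$-greedy behavior policy. The environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the `epsilon_greedy` helper are illustrative assumptions.

```python
import random
import collections

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon explore, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target: uses the action actually selected at the next state.
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

Because the target is built from the action the behavior policy actually takes next, Sarsa learns $q_\pi$ for its own ($\epsilon$-greedy) policy.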
3.2 Off-policy TD control : Q-learning
- “One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning” (Watkins, 1989).
- Target : $R_{t+1} + \gamma \max_{a}Q(S_{t+1},a)$
- Q-learning also learns an action-value function, but the learned $Q$ directly approximates the optimal action-value function $q_*$, independent of the policy being followed.
- Q-learning takes the maximum over the next actions at the next state, rather than the value of the action actually taken: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_{a}Q(S_{t+1},a) - Q(S_t,A_t)]$. A code sketch follows below.
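Here is a minimal sketch of the corresponding tabular Q-learning loop, under the same assumed environment interface as the Sarsa sketch above; the only change is the max over next actions in the target.

```python
import random
import collections

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: maximum over next actions at the next state,
            # regardless of which action the behavior policy will actually take.
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```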
3.3 Expected Sarsa
- Expected Sarsa is just like Q-learning, except that instead of the maximum over next state-action pairs it uses the expected value under the current policy.
- Each next action's value is weighted by how likely that action is under the current policy: target $R_{t+1} + \gamma \sum_{a}\pi(a \mid S_{t+1})Q(S_{t+1},a)$. A code sketch follows below.
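As a concrete example, this sketch computes the Expected Sarsa target for an $\epsilon$-greedy policy derived from the current action-value table; the `(state, action)`-keyed dictionary `Q` and the action list are illustrative assumptions.

```python
def expected_sarsa_target(Q, next_state, actions, reward, gamma=1.0, epsilon=0.1):
    """Target R_{t+1} + gamma * sum_a pi(a|S_{t+1}) * Q(S_{t+1}, a)
    for an epsilon-greedy policy pi derived from Q."""
    q_next = [Q[(next_state, a)] for a in actions]
    best = max(q_next)
    n_best = sum(1 for q in q_next if q == best)
    expected_value = 0.0
    for q in q_next:
        prob = epsilon / len(actions)            # exploration probability of every action
        if q == best:
            prob += (1.0 - epsilon) / n_best     # greedy mass, split among tied best actions
        expected_value += prob * q
    return reward + gamma * expected_value
```

If the policy were fully greedy ($\epsilon = 0$), this expectation would reduce to the Q-learning target.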
3.4 Maximization Bias and Double Learning
- Q-learning : the target policy is greedy.
- Sarsa : the target policy is often $\epsilon$-greedy.
- A maximum over estimated values can lead to a significant positive bias (maximization bias).
3.4 Maximization Bias
- Example
- A single state $s$ with many actions $a$.
- Every true action value $q(s,a)$ is 0.
- Because of estimation uncertainty, some estimated values $Q(s,a)$ lie above 0 and some below.
- The maximum of the true values is 0, but the maximum of the estimated values is positive; see the simulation sketch below.
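This example can be checked with a tiny simulation (illustrative numbers: 10 actions, 5 noisy zero-mean reward samples per action): the average maximum of the sample-mean estimates comes out clearly positive even though every true value is 0.

```python
import random

def average_max_estimate(num_actions=10, samples_per_action=5, trials=10000):
    """All true action values are 0; each estimate is the sample mean of noisy rewards."""
    total = 0.0
    for _ in range(trials):
        estimates = []
        for _ in range(num_actions):
            rewards = [random.gauss(0.0, 1.0) for _ in range(samples_per_action)]
            estimates.append(sum(rewards) / samples_per_action)
        total += max(estimates)    # maximum of the estimates, not of the true values
    return total / trials

print(average_max_estimate())      # clearly positive (roughly 0.7 with these numbers)
```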
3.4 Avoiding maximization bias?
- The bias arises from using the same samples both to determine the maximizing action and to estimate its value.
- Divide the plays into two sets and learn two independent estimates.
- $Q_1(a)$, $Q_2(a)$ : each an estimate of the true value $q(a)$.
- Either estimate can be used to pick the maximizing action:
- $A^*=\operatorname{argmax}_{a}Q_1(a)$
- $A^*=\operatorname{argmax}_{a}Q_2(a)$
3.4 Double learning
- Use one estimate, $Q_1$, to determine the maximizing action: $A^*=\operatorname{argmax}_{a}Q_1(a)$.
- Use the other estimate, $Q_2$, to provide the estimate of its value:
- $Q_2(A^*) = Q_2(\operatorname{argmax}_{a}Q_1(a))$
- This estimate is then unbiased, in the sense that $\mathbb{E}[Q_2(A^*)] = q(A^*)$.
- The roles of the two estimates can also be reversed:
- $Q_1(A^*) = Q_1(\operatorname{argmax}_{a}Q_2(a))$
- This is the idea of double learning.
3.4 Double Q-learning
- Update rule of Double Q-learning:
$Q_1(S_t,A_t)\leftarrow Q_1(S_t,A_t) +\alpha[R_{t+1} + \gamma Q_2(S_{t+1}, \operatorname{argmax}_{a} Q_1(S_{t+1},a))- Q_1(S_t,A_t)]$
- The two approximate value functions are treated symmetrically: on each step the roles of $Q_1$ and $Q_2$ are swapped with probability 0.5.
3.4 Complete algorithm for Double Q-learning
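The complete algorithm box is not reproduced here; below is a minimal sketch of tabular Double Q-learning consistent with the update rule above. The environment interface and the $\epsilon$-greedy behavior over $Q_1 + Q_2$ are illustrative assumptions.

```python
import random
import collections

def double_q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q1 = collections.defaultdict(float)
    Q2 = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5, swap the roles of the two estimates.
            if random.random() < 0.5:
                A, B = Q1, Q2
            else:
                A, B = Q2, Q1
            # A selects the maximizing next action, B evaluates it.
            best = max(actions, key=lambda a: A[(next_state, a)])
            target = reward + gamma * B[(next_state, best)] * (not done)
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```

Separating action selection ($A$) from action evaluation ($B$) is exactly what removes the maximization bias described above.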
Summary : Temporal-Difference Learning
- We divided the overall problem into a prediction problem and a control problem.
- Prediction problem : TD methods are alternatives to Monte Carlo methods.
- Control problem : both follow the idea of generalized policy iteration (GPI).
- This is the idea that the approximate policy and value function should interact so that both move toward their optimal values.
- Prediction problem : predict the return for the current policy.
- Control problem : improve the policy (e.g., $\epsilon$-greedy) with respect to the current value function.
- Two kinds of TD control : on-policy and off-policy.
- On-policy : Sarsa
- Off-policy : Q-learning, Expected Sarsa
References
- Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition (in progress), MIT Press, Cambridge, MA, 2017.
https://wonseokjung.github.io/