n-step bootstrapping - 3

2018-03-02

5.4 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Importance Sampling 없이 off-policy 방법을 사용하려면 어떻게 해야할까?

One-step TD 일때는 Q-learning을 통해 가능하다는 것을 배웠다.

하지만 n-step과 같이 스텝이 하나가 아닌 여러 스텝이라면?

tree-backup algorithm을 사용하여 n-step에서 importance sampling 없이 update를 해보자

이 알고리즘의 아이디어는 3-step-tree back diagram에서 나왔다.

아래의 diagram을 보면 3개의 sample state, 3개의 reward, 2개의 sample action이 있다.

tree-backup diagram 1

이것들은 맨 처음 state-action pair $S_t, A_t$ 에서 발생되어진 random variable이다.

action의 초록색 node들은 아직 선택되어지지 않은 action이다.

선택되지 않은 action들에 대해서는 sample data가 아직 없다. 그렇기 때문에 bootstrappin을 하여 update를 위한 target을 만든다.

Tree-backup이 backup diagram과 다른점은,

Tree-backup는 선택되지 않은 action(초록색 node)의 estimate 값도 update에 포함시켜 업데이트 한다.

One-step return (target )의 tree-backup algorithm은 다음과 같습니다.

$G_{t:t+1} \doteq R_{t+1} + \gamma \sum_{a} \pi(a\mid s_{t+1})Q_t(S_{t+1},a)$ $=\delta_t ^{'} +Q_{t-1}(S_t,A_t)-Q_{t-1}(S_t,A_t)$

$\delta_t ^{'}$ 의 정의는 아래와 같다.

$\delta_t ^{'} \doteq R_{t+1} + \gamma \sum_{a} \pi(a\mid S_{t+1}) Q(S_{t+1}, a) - Q_{t-1}(S_t,A_t)$

이와같이 n-step tree-backup algorithm은 recursive하게 정의할 수 있다.

TD error는 다음과 같이 정의할 수 있다.

$G_{t:t+n} \doteq R_{t+1}+ \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})Q_t(S_{t+1},a)+\gamma\pi (A_{t+1} \mid S_{t+1})G_{t+1:t+n}$ \

$=\delta_t ^{'} + Q_{t-1}(S_t,A_t) - \gamma \pi (A_{t+1}\mid S_{t+1})Q_t(S_{t+1},A_{t+1})+\gamma\pi (A_{t+1} \mid S_{t+1})G_{t+1:t+n}$ $Q_{t-1}(S_t,A_t) + \delta_t ^{'} + \gamma \pi (A_{t+1}\mid S_{t+1}))(G_{t+1:t+n} - Q_t(S_{t+1},A_{t+1})$ $Q_{t-1}(S_t,A_t) + \delta_t ^{'} + \gamma \pi (A_{t+1}\mid S_{t+1})\delta_{t+1} ^{'} \gamma^2 \pi (A_{t+1}\mid S_{t+1})+ ...$ $=Q_{t-1}(S_t,A_t) +\sum_{k=t}^\min(t+n-1,T-1) \delta_t ^{'} \prod _{i=t+1}^k \gamma \pi(A_i \mid S_i)$

이 Target은 n-step sarsa의 action-value update로 사용할수 있다.

update식과 n-step Tree Backup 을 이용하여 q*를 estimate하는 pseudo code는 아래와 같다.

$Q_{t+n}(S_t,A_t) \doteq Q_{t+n-1}(S_t,A_t) + \alpha[G_{t:t+n}- Q_{t+n-1}(S_t,A_t) ]$

screenshot from 2018-03-03 03-02-27

이로서 one step을 bootstrapping 하는 방법이 아닌 n step을 bootstrapping하는 방법을 알아보았다.

다음에는 Planning and Learning with Tabular Methods에 대하여 알아보겠다.

Reference

Reinforcement Learning: An Introduction Richard S. Sutton and Andrew G. Barto Second Edition, in progress MIT Press, Cambridge, MA, 2017

 n-step bootstrapping - 2 Planning and Learning with Tabular Methods - 1 