Function Approximation -2

2018-07-16

Value function approximation (VFA)는 Table 형태를 neneral parameterized form으로 바꿔주는 역할을 한다.

세상의 모든 현상을 파악하려면 엄청나게 많은 계산량 또는 저장량이 필요할 것이다.

하지만 현대 인류는 한정된 용량의 뇌, 자원, 기계를 사용하고 있기에 세상의 모든 현상을 정확히 알수 없으므로 예측을 한다.

작거나 간단한 Environment에서는 모든 data를 저장하고 Optimal Policy를 찾았던 table case 대신

Value function approximation (VFA)는 Table 형태를 general parameterized form으로 바꿔주는 역할을 한다.

1. State value

State $S_t$가 function approximation box의 input으로 들어가며, function approxmiation의 weight(w)에 의해 state value를 output으로 return한다.

학습을 할때 목표값이나 정답값이 존재하여야 하며 그것을 Target이라고 한다.

이 target은 여럭지가 될 수 있다. 사람의 어떠한 상황에서 하는 행동들을 target이라고 볼 수도 있지만, 여기서는 reinforcement learning방법을 사용해 target은 agent가 경험한 data로부터 정의한다.

(예 : bootstrapping과 같이 또 다른 prediction을 target으로 사용)

자세히 알아보기에 앞서 위에서 언급한 value function approximation을 사용하여 parameterized form으로 바꿔 weight에 의한 state value $\hat{v}(S_t,w)$ 계산하는 방법을 알아보도록 하겠다.

Weight에 의한 Time step $t$에서의 State $S$는 다음과 같이 계산 가능하다.

$\hat{v}(S_t,w) =w^T x = \sum_{i=1}^{d} W_i^T X_i$

State value와 마찬가지고 State-action pair를 function approximator를 통해 approximation하는것이 가능하다.

$Q(s,a) \approx q_\pi(s,a) \approx \hat{q}(s,a,w)$ $\hat{q}(S_t,w) =w^T x(s,a) = \sum_{i=1}^{d} W_i^T X_i(s,a)$

gradient descent는 간단히 현재의 위치에서 기울기에 따라 descent하게 움직이고 여기서는 error을 최소화 하기 위해 $minus(-)$ term으로 인해 descent 한다.

$w \leftarrow w - \alpha \bigtriangledown_w Error^2_t$

Reinforcement learning에서는 descent가 아닌 reward 또는 value의 최대를 구하기 위해 $plus(+)$를 사용하여 ascent를 구하며 이것은 추후에 다룰 예정이다.

이 SGD를 다음과 같은 형태로 정의할 수 있다.

$w \leftarrow w - \alpha \bigtriangledown_w Error^2_t$

####

Value Function Approximation</center>

$w \leftarrow w - \alpha \bigtriangledown_w [Target_t - \hat{v}(S_t,w)]^2$

$w \leftarrow w - 2\alpha \bigtriangledown_w [Target_t - \hat{v}(S_t,w)] \bigtriangledown_w [Target_t - \hat{v}(S_t, w)]$

$w \leftarrow w + \alpha \bigtriangledown_w [Target_t - \hat{v}(S_t,w)] \bigtriangledown_w\hat{v}(S_t, w)$

$w \leftarrow w + \alpha \bigtriangledown_w [Target_t - \hat{v}(S_t,w)] x(S_t)$

$w \leftarrow w + \alpha \bigtriangledown_w [Target_t - \hat{q}(S_t,A_t,w)] x(S_t,A_t)$

예를 들어 Monte Carlo Algorithm으로 $\hat{v}$을 approximation 하는 pseudo code는 다음과 같다.

위의 알고리즘을 Randowalk라는 환경에서 실험해보았고, 아래의 graph와 같이 State distribution $\mu$는 Approximation 의 정확도에 영향을 미친다는 것을 알 수 있다.

$MSVE(w) \doteq \sum_{s\in S} \mu(s) [v_\pi(s) - \hat{v}(s,w)]^2$

$\mu(s)$ : state $s$에 머문 time step, 한 episode에서 머문 수 또는 continuing task에서는 얼마나 많은 진행되는 동안 각 state $s$에 머문 time step $t$

Valufe Function Approximation을 구할때 위와같이 state distribution을 계산할수 있는 $\mu$를 넣어 state $s$에 머문만큼 weight를 해준다.

References

Reinforcement Learning: An Introduction Richard S. Sutton and Andrew G. Barto Second Edition, in progress MIT Press, Cambridge, MA, 2017

https://wonseokjung.github.io/