Function Approximation

2018-07-11

Function Approximation

wonseok jung

Introduction

Reinforcement learning에서 function approximation이 어떻게 쓰이는지 알아보도록 하겠다.

On-policy의 data를 이용하여 state-value function을 구할 것이며,

이것은 Policy $\pi$ 를 이용하여 생성된 exrience를 사용하여 State value $V_{\pi}$를 approximation한 값이다.

Value fuction을 approximation은 Parameterized function으로 $w \in R^d$ 형태로 표현한다.

주어진 wieght vector $w$에 의한 state $s$의 value를 approximation한 것을 다음과 같이 표시한다.

$\hat{v}(s,w)\approx v_{\pi}(s)$

(neuralnetwork)

$\hat{v}$는 Multi-layer artificial neural network이며 $w$는 모든 layers을 연결하는 weight의 vector이다.

function approximation을 사용하여 agent가 모든 state를 알수 없는 경우인 Partially observable problem에서 Reinforcement learning 방법을 적용할 수 있다.

Introduction 정리 :

지금까지는 tabular case를 사용하여 Reinforcement learning이 어떠한 방식으로 state value와 state-action value를 계산하여 optimal policy를 찾는지 알아보았다.

지금부터 다룰 내용은 tabular case가 아닌 parameterlize structure이다.

parameterlize structure를 이용하기 위해서 function approximation을 할것이고, funtion approximation을 하므로 generalization 와 solution을 approximation하는 것이 가능하다.

여기서의 function은 neural network, decision tree등 이 될 수 있다.

input은 위의 function approximation을 통할때 function approximation의 parameter( weight ) 에 의해 계산되어지며 function을 통해 output으로 return된다.

function approximation이 기존 방식과 가장 다른점은,

state value가 table 형식으로 이루어져 특정 state의 value를 정확히 알 수 있었으며 특정 state 의 value를 업데이트할때 다른 state value에 영향을 끼치지 않았다.

하지만 function approximation를 사용하면 특정 state value를 정확하게 알수 없으며 특정 state를 update할때 다른 state value에도 영향을 미친다.

Parametric form에서는 state value의 update가 더이상 independent하지 않다.

Parameter를 사용하므로서 더이상 이전 tabular 방법과 같이 각 state의 true value를 찾는것이 더이상 가능하지 않다.

1. Deep learning

최근 Supervised learning에서 deep learning이 널리 쓰이고 있다.

Function approximator로 deep learning을 사용하기전에,

먼저 neural network를 알아보기 전에 neuron이 무엇인지 알아보도록 하겠다.

1.1 Neuron

Neuron은 아래의 그림과 같이. Cell body, Dendrites, Axon, Synaptic terminals로 나누어져 있다.

Dendrities 는 input으로 data를 받으며, Cell body는 받은 data를 합산한뒤 threshold를 지나 Axon은 Cell body에서 계산된 값을 출력한다.

Axon의 signal은 binary로 1,0이다.

1.2 Neural Network

Neuron처럼 Neural Network과 비슷하게 Input layer, hidden layer, output layer로 이루어져 있다.

Input layer은 feature vector(강화학습에서 input은 대부분 state이다.)라고 불리며

Inputer layer의 각 Input은 각 weight와 곱하고 합쳐진 뒤 출력된다.

$\sum_{i=1}^d W_iX_i$ $=\vec{W}^T\vec{X}$

이전에는 아래의 식과 같은 식으로 Estimation을 update했다.

$NewEst \leftarrow OldEst + \alpha [Target - OldEst]$

Function approximation방법은 Estimation을 update할때 weight를 update하며,

weight(parameter)의 gradient를 계산하여 $[Target-OldEst]$의 방향으로 update한다.

$\vec{W} \leftarrow \vec{W} + \alpha [Target - OldEst] \bigtriangledown_{w} OldEst$

또한, time step $t$ OldEstimtion은 에서 W vector와 X vector의 곱을 general하게 바꾸면 아래와 같이 정의할수 있다.

$OldEst_t = \vec{W}^T \vec{X}_t = \hat{V}(S_t,\vec{W}_t)$

$\hat{v}$는 time $t$일때 weight vetor $\vec{w}$에 따른 State $s$의 approximation value이다.

또한, $\vec{X_t}$는 function x 이 time step $t$ State $s$에 적용된 input state에 부합하는 feature vector 이다.

$\vec{X_t} = x(S_{t})$

그러므로, weight를 update하는 식은 다음과 같이 정의한다.

$\vec{W_{t+1}} \doteq \vec{W_t} + \alpha [Target_t - \hat{V}(S_t,\vec{W})] \bigtriangledown_w \hat{V}(S_t,\vec{W})$ $\vec{W_{t+1}} \doteq W_{t} + \alpha [R_{t+1} + \gamma \vec{W_t}^T X_{t+1} - \vec{W_t}^T X_{t}]\vec{X}$

Reference

Reinforcement Learning: An Introduction Richard S. Sutton and Andrew G. Barto Second Edition, in progress MIT Press, Cambridge, MA, 2017

https://wonseokjung.github.io/

 A3C - Asynchorous Advantage Actor Critic Network Function Approximation -2 