Part 1: To the Rainbow
Understanding DeepMind's Rainbow - 1
Understanding DeepMind's Rainbow requires quite a lot of prerequisite knowledge.
Rainbow is an algorithm derived from DeepMind's DQN: it takes the original DQN and adds the six extensions listed below.
The performance comparison graph in the Rainbow paper shows a very clear gap between Rainbow and plain DQN.
In particular, the presence or absence of multi-step learning, prioritized replay, the distributional approach, and noisy networks makes a large difference.
So, to understand Rainbow, I will go through (1) the theory and (2) a code analysis for each component, one at a time.
Order:
- DQN
- Double DQN
- Dueling DQN
- Prioritized replay memory
- Noisy networks
- Distributional RL
- Rainbow
1. DQN
1.1 DQN theory
For the theory behind DQN, see my earlier post:
https://wonseokjung.github.io//rl_paper/update/RL-PP-DQN/
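For reference, the objective DQN minimizes is the squared TD error between the network's Q-value and a bootstrapped target, where theta-minus denotes the parameters of the periodically updated target network from the DQN paper:

$$ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^{2}\right] $$

The simplified implementation below samples the transitions (s, a, r, s') from a replay buffer D, but uses a single network for both terms instead of a separate target network.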
1.2 DQN implementation
Let's implement DQN using PyTorch.
A. Import the required libraries.
import math, random
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F
- import math: standard math functions, used here for the exponential epsilon decay.
- import random: Python's random number generator, used for epsilon-greedy action selection and replay sampling.
- import gym: OpenAI's reinforcement learning environment library.
  How to use OpenAI gym:
  https://wonseokjung.github.io//reinforcementlearning/update/openai-gym/
- import torch: loads PyTorch.
- import torch.nn as nn: neural network building blocks such as nn.Module, nn.Linear, and nn.Sequential.
- import torch.optim as optim: various optimization algorithms (SGD, Adam, ...).
- import torch.autograd as autograd: automatic differentiation; used here for the Variable wrapper.
- import torch.nn.functional as F: the functional interface to layers, activations, and losses in torch.nn.
B. Jupyter notebook packages for plotting graphs
from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline
C. Using CUDA
USE_CUDA = torch.cuda.is_available()
Variable = lambda *args, **kwargs: autograd.Variable(*args, **kwargs).cuda() if USE_CUDA else autograd.Variable(*args, **kwargs)
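The Variable lambda above simply wraps a tensor and moves it to the GPU when one is available. Note that autograd.Variable and the volatile flag used later are idioms from PyTorch 0.3; on PyTorch 0.4 and later, an equivalent pattern would be a small helper like the sketch below (this is my own addition, not part of the original code):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_tensor(x):
    # Convert a numpy array (or list) to a float32 tensor on the chosen device.
    return torch.as_tensor(x, dtype=torch.float32, device=device)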
D. Replay Memory
from collections import deque
class ReplayBuffer(object):
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        state = np.expand_dims(state, 0)
        next_state = np.expand_dims(next_state, 0)
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.concatenate(state), action, reward, np.concatenate(next_state), done

    def __len__(self):
        return len(self.buffer)
- deque(maxlen=capacity): a double-ended queue with fast appends and pops; with maxlen set, the oldest transitions are discarded automatically once the buffer is full.
- np.expand_dims(state, 0): adds a batch dimension so that sampled states can later be concatenated into a single array.
- zip(*...): transposes the sampled list of (state, action, reward, next_state, done) tuples into separate tuples of states, actions, rewards, next states, and done flags.
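As a quick illustration of the buffer's interface, here is a small sketch with made-up transitions (not part of the original code), assuming CartPole's 4-dimensional observations:

buffer = ReplayBuffer(capacity=100)

# Push a few dummy transitions.
for i in range(10):
    s = np.random.randn(4)
    s_next = np.random.randn(4)
    buffer.push(s, action=0, reward=1.0, next_state=s_next, done=False)

states, actions, rewards, next_states, dones = buffer.sample(batch_size=4)
print(states.shape)   # (4, 4): a batch of 4 states, each 4-dimensional
print(len(buffer))    # 10 transitions stored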
F. Environment (CartPole)
env = gym.make("CartPole-v0")
- Load the CartPole environment.
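For reference, CartPole-v0 has a 4-dimensional observation (cart position, cart velocity, pole angle, pole angular velocity) and 2 discrete actions (push left or right), which is what the network sizes below are based on:

print(env.observation_space.shape)  # (4,)
print(env.action_space.n)           # 2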
G. Epsilon-greedy exploration
epsilon_start = 1.0
epsilon_final = 0.01
epsilon_decay = 500
epsilon_by_frame = lambda frame_idx: epsilon_final + (epsilon_start - epsilon_final) * math.exp(-1. * frame_idx / epsilon_decay)
- Epsilon decays exponentially from 1.0 down toward 0.01 as the frame index grows, so the agent explores a lot early on and mostly exploits later.
- lambda: creates an anonymous (inline) function.
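Plugging a few frame indices into the schedule shows how quickly epsilon shrinks (values rounded):

for idx in [0, 500, 2000, 10000]:
    print(idx, round(epsilon_by_frame(idx), 3))
# 0      1.0
# 500    0.374
# 2000   0.028
# 10000  0.01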
H. Plot
plt.plot([epsilon_by_frame(i) for i in range(10000)])
This plots the epsilon schedule defined above.
I. DQN Model
class DQN(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super(DQN, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions)
        )

    def forward(self, x):
        return self.layers(x)

    def act(self, state, epsilon):
        # Epsilon-greedy action selection.
        if random.random() > epsilon:
            state = Variable(torch.FloatTensor(state).unsqueeze(0), volatile=True)
            q_value = self.forward(state)
            action = q_value.max(1)[1].data[0]
        else:
            action = random.randrange(env.action_space.n)
        return action
- forward(): defines the forward pass of the network (a state goes in, one Q-value per action comes out).
- act(): selects an action with an epsilon-greedy policy; with probability epsilon a random action, otherwise the action with the highest predicted Q-value.
model = DQN(env.observation_space.shape[0], env.action_space.n)

if USE_CUDA:
    model = model.cuda()

optimizer = optim.Adam(model.parameters())
replay_buffer = ReplayBuffer(1000)
- Move the model to the GPU when CUDA is available, and use Adam as the optimizer.
- The replay memory capacity is set to 1000 transitions.
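At this point the model can already be queried for actions. A quick sanity check (my own sketch, not in the original post) looks like this:

state = env.reset()
action = model.act(state, epsilon=0.5)   # random or greedy, depending on the draw
next_state, reward, done, _ = env.step(action)
print(action, reward, done)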
J. Computing the temporal-difference loss
def compute_td_loss(batch_size):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state      = Variable(torch.FloatTensor(np.float32(state)))
    next_state = Variable(torch.FloatTensor(np.float32(next_state)), volatile=True)
    action     = Variable(torch.LongTensor(action))
    reward     = Variable(torch.FloatTensor(reward))
    done       = Variable(torch.FloatTensor(done))

    q_values      = model(state)
    next_q_values = model(next_state)

    # Q(s, a) for the actions actually taken in the batch
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    # max_a' Q(s', a'); the (1 - done) factor removes the bootstrap term at episode ends
    next_q_value = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)

    loss = (q_value - Variable(expected_q_value.data)).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
- Samples a batch from the replay buffer, computes the squared TD error, and takes one gradient step. Note that this simplified version uses the same online network for both q_values and next_q_values, i.e. there is no separate target network.
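The original DQN paper stabilizes learning with a separate target network whose weights are only synchronized every so often. A minimal sketch of how that could be added to the code above (the names target_model and update_target are my own, not from the original post):

target_model = DQN(env.observation_space.shape[0], env.action_space.n)
if USE_CUDA:
    target_model = target_model.cuda()

def update_target():
    # Copy the online network's weights into the target network.
    target_model.load_state_dict(model.state_dict())

update_target()  # start with identical weights

# Inside compute_td_loss, the bootstrap term would then use the frozen network:
#     next_q_values = target_model(next_state)
# and update_target() would be called every N frames in the training loop.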
K. Training
num_frames = 10000
batch_size = 32
gamma = 0.99
losses = []
all_rewards = []
episode_reward = 0
state = env.reset()
for frame_idx in range(1, num_frames + 1):
    epsilon = epsilon_by_frame(frame_idx)
    action = model.act(state, epsilon)

    next_state, reward, done, _ = env.step(action)
    replay_buffer.push(state, action, reward, next_state, done)

    state = next_state
    episode_reward += reward

    if done:
        state = env.reset()
        all_rewards.append(episode_reward)
        episode_reward = 0

    if len(replay_buffer) > batch_size:
        loss = compute_td_loss(batch_size)
        losses.append(loss.data[0])

    if frame_idx % 200 == 0:
        plot(frame_idx, all_rewards, losses)
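The loop above calls a plot() helper that is not defined anywhere in this post. A minimal version in the spirit of the RL-Adventure notebook this code is based on (my own sketch, using the IPython and matplotlib imports from section B) could be:

def plot(frame_idx, rewards, losses):
    clear_output(True)
    plt.figure(figsize=(20, 5))
    plt.subplot(131)
    plt.title('frame %s. reward: %s' % (frame_idx, np.mean(rewards[-10:])))
    plt.plot(rewards)
    plt.subplot(132)
    plt.title('loss')
    plt.plot(losses)
    plt.show()

It needs to be defined before running the training loop in section K.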
References
1. from anyrl.algos import DQN
   https://github.com/unixpickle/anyrl-py/blob/master/anyrl/algos/dqn.py
2. from anyrl.envs import BatchedGymEnv
   https://github.com/unixpickle/anyrl-py/blob/b43f0728400fd5c01daf4ae110c797622d0c9ddb/anyrl/envs/gym.py
3. from anyrl.envs.wrappers import BatchedFrameStack
   https://github.com/unixpickle/anyrl-py/blob/8a1ab680e56be8de435b0b8fff1fc48d7a37463a/anyrl/envs/wrappers/batched.py
4. from anyrl.models import rainbow_models
   https://github.com/unixpickle/anyrl-py/blob/8e119ca25724d537db3e620bd922f39a2ac61ea4/anyrl/models/dqn_dist.py
5. from anyrl.rollouts import BatchedPlayer, PrioritizedReplayBuffer, NStepPlayer
   5.1 BatchedPlayer:
   https://github.com/unixpickle/anyrl-py/blob/1543af96346293e90ff30b6912dab66c681c2ed1/anyrl/rollouts/players.py
   5.2 PrioritizedReplayBuffer:
   https://github.com/unixpickle/anyrl-py/blob/5e3fed0b0e249dcd01d844e4a9a47bb80b990699/anyrl/rollouts/replay.py
   5.3 NStepPlayer:
   https://github.com/unixpickle/anyrl-py/blob/1543af96346293e90ff30b6912dab66c681c2ed1/anyrl/rollouts/players.py
6. from anyrl.spaces import gym_space_vectorizer
   https://github.com/unixpickle/anyrl-py/blob/8a1ab680e56be8de435b0b8fff1fc48d7a37463a/anyrl/spaces/gym.py
7. Noisy Networks for Exploration
   https://arxiv.org/abs/1706.10295
8. RL-Adventure prioritized DQN notebook (higgsfield)
   https://github.com/higgsfield/RL-Adventure/blob/master/4.prioritized%20dqn.ipynb