Part 1: To the Rainbow
Understanding DeepMind's Rainbow - 1
Understanding DeepMind's Rainbow requires quite a lot of prerequisite knowledge.
Rainbow is an algorithm derived from DeepMind's DQN: it takes the original DQN and adds the six extensions listed below.
The performance comparison graph in the Rainbow paper shows a very clear gap between Rainbow and plain DQN.
In particular, the presence or absence of multi-step learning, prioritized replay, the distributional approach, and noisy networks makes a large difference.
So, to understand Rainbow, I will go through (1) the theory and (2) a code analysis for each component, one at a time.
Order:
- DQN
- Double DQN
- Dueling DQN
- Prioritized replay memory
- Noisy networks
- Distributional RL
- Rainbow
1. DQN
1.1 DQN theory
For the theory behind DQN, see my earlier post:
https://wonseokjung.github.io//rl_paper/update/RL-PP-DQN/
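For reference, the objective DQN minimizes is the squared TD error between the network's Q-value and a bootstrapped target, where theta-minus denotes the parameters of the periodically updated target network from the DQN paper:

$$ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^{2}\right] $$

The simplified implementation below samples the transitions (s, a, r, s') from a replay buffer D, but uses a single network for both terms instead of a separate target network.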
1.2 DQN implementation
Let's implement DQN using PyTorch.
A. Import the required libraries.
import math, random
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F
- import math: standard math functions, used here for the exponential epsilon decay.
- import random: Python's random number generator, used for epsilon-greedy action selection and replay sampling.
- import gym: OpenAI's reinforcement learning environment library.
  How to use OpenAI gym:
  https://wonseokjung.github.io//reinforcementlearning/update/openai-gym/
- import torch: loads PyTorch.
- import torch.nn as nn: neural network building blocks such as nn.Module, nn.Linear, and nn.Sequential.
- import torch.optim as optim: various optimization algorithms (SGD, Adam, ...).
- import torch.autograd as autograd: automatic differentiation; used here for the Variable wrapper.
- import torch.nn.functional as F: the functional interface to layers, activations, and losses in torch.nn.
B. Jupyter notebook packages for plotting graphs
from IPython.display import clear_output
import matplotlib.pyplot as plt
%matplotlib inline
C. Using CUDA
USE_CUDA = torch.cuda.is_available()
Variable = lambda *args, **kwargs: autograd.Variable(*args, **kwargs).cuda() if USE_CUDA else autograd.Variable(*args, **kwargs)
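The Variable lambda above simply wraps a tensor and moves it to the GPU when one is available. Note that autograd.Variable and the volatile flag used later are idioms from PyTorch 0.3; on PyTorch 0.4 and later, an equivalent pattern would be a small helper like the sketch below (this is my own addition, not part of the original code):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_tensor(x):
    # Convert a numpy array (or list) to a float32 tensor on the chosen device.
    return torch.as_tensor(x, dtype=torch.float32, device=device)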
D. Replay Memory
from collections import deque
class ReplayBuffer(object):
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        state = np.expand_dims(state, 0)
        next_state = np.expand_dims(next_state, 0)
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.concatenate(state), action, reward, np.concatenate(next_state), done

    def __len__(self):
        return len(self.buffer)
- deque(maxlen=capacity): a double-ended queue with fast appends and pops; with maxlen set, the oldest transitions are discarded automatically once the buffer is full.
- np.expand_dims(state, 0): adds a batch dimension so that sampled states can later be concatenated into a single array.
- zip(*...): transposes the sampled list of (state, action, reward, next_state, done) tuples into separate tuples of states, actions, rewards, next states, and done flags.
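As a quick illustration of the buffer's interface, here is a small sketch with made-up transitions (not part of the original code), assuming CartPole's 4-dimensional observations:

buffer = ReplayBuffer(capacity=100)

# Push a few dummy transitions.
for i in range(10):
    s = np.random.randn(4)
    s_next = np.random.randn(4)
    buffer.push(s, action=0, reward=1.0, next_state=s_next, done=False)

states, actions, rewards, next_states, dones = buffer.sample(batch_size=4)
print(states.shape)   # (4, 4): a batch of 4 states, each 4-dimensional
print(len(buffer))    # 10 transitions stored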
F. Environment (CartPole)
env = gym.make("CartPole-v0")
- Load the CartPole environment.
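For reference, CartPole-v0 has a 4-dimensional observation (cart position, cart velocity, pole angle, pole angular velocity) and 2 discrete actions (push left or right), which is what the network sizes below are based on:

print(env.observation_space.shape)  # (4,)
print(env.action_space.n)           # 2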
G. Epsilon-greedy exploration
epsilon_start = 1.0
epsilon_final = 0.01
epsilon_decay = 500
epsilon_by_frame = lambda frame_idx: epsilon_final + (epsilon_start - epsilon_final) * math.exp(-1. * frame_idx / epsilon_decay)
- Epsilon decays exponentially from 1.0 down toward 0.01 as the frame index grows, so the agent explores a lot early on and mostly exploits later.
- lambda: creates an anonymous (inline) function.
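Plugging a few frame indices into the schedule shows how quickly epsilon shrinks (values rounded):

for idx in [0, 500, 2000, 10000]:
    print(idx, round(epsilon_by_frame(idx), 3))
# 0      1.0
# 500    0.374
# 2000   0.028
# 10000  0.01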
H. Plot
plt.plot([epsilon_by_frame(i) for i in range(10000)])
This plots the epsilon schedule defined above.
I. DQN Model
class DQN(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super(DQN, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions)
        )

    def forward(self, x):
        return self.layers(x)

    def act(self, state, epsilon):
        # Epsilon-greedy action selection.
        if random.random() > epsilon:
            state = Variable(torch.FloatTensor(state).unsqueeze(0), volatile=True)
            q_value = self.forward(state)
            action = q_value.max(1)[1].data[0]
        else:
            action = random.randrange(env.action_space.n)
        return action
- forward(): defines the forward pass of the network (a state goes in, one Q-value per action comes out).
- act(): selects an action with an epsilon-greedy policy; with probability epsilon a random action, otherwise the action with the highest predicted Q-value.
model = DQN(env.observation_space.shape[0], env.action_space.n)

if USE_CUDA:
    model = model.cuda()

optimizer = optim.Adam(model.parameters())
replay_buffer = ReplayBuffer(1000)
- Move the model to the GPU when CUDA is available, and use Adam as the optimizer.
- The replay memory capacity is set to 1000 transitions.
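At this point the model can already be queried for actions. A quick sanity check (my own sketch, not in the original post) looks like this:

state = env.reset()
action = model.act(state, epsilon=0.5)   # random or greedy, depending on the draw
next_state, reward, done, _ = env.step(action)
print(action, reward, done)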
J. Computing the temporal-difference loss
def compute_td_loss(batch_size):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state      = Variable(torch.FloatTensor(np.float32(state)))
    next_state = Variable(torch.FloatTensor(np.float32(next_state)), volatile=True)
    action     = Variable(torch.LongTensor(action))
    reward     = Variable(torch.FloatTensor(reward))
    done       = Variable(torch.FloatTensor(done))

    q_values      = model(state)
    next_q_values = model(next_state)

    # Q(s, a) for the actions actually taken in the batch
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    # max_a' Q(s', a'); the (1 - done) factor removes the bootstrap term at episode ends
    next_q_value = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)

    loss = (q_value - Variable(expected_q_value.data)).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
- Samples a batch from the replay buffer, computes the squared TD error, and takes one gradient step. Note that this simplified version uses the same online network for both q_values and next_q_values, i.e. there is no separate target network.
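The original DQN paper stabilizes learning with a separate target network whose weights are only synchronized every so often. A minimal sketch of how that could be added to the code above (the names target_model and update_target are my own, not from the original post):

target_model = DQN(env.observation_space.shape[0], env.action_space.n)
if USE_CUDA:
    target_model = target_model.cuda()

def update_target():
    # Copy the online network's weights into the target network.
    target_model.load_state_dict(model.state_dict())

update_target()  # start with identical weights

# Inside compute_td_loss, the bootstrap term would then use the frozen network:
#     next_q_values = target_model(next_state)
# and update_target() would be called every N frames in the training loop.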
K. Training
num_frames = 10000
batch_size = 32
gamma = 0.99
losses = []
all_rewards = []
episode_reward = 0
state = env.reset()
for frame_idx in range(1, num_frames + 1):
    epsilon = epsilon_by_frame(frame_idx)
    action = model.act(state, epsilon)

    next_state, reward, done, _ = env.step(action)
    replay_buffer.push(state, action, reward, next_state, done)

    state = next_state
    episode_reward += reward

    if done:
        state = env.reset()
        all_rewards.append(episode_reward)
        episode_reward = 0

    if len(replay_buffer) > batch_size:
        loss = compute_td_loss(batch_size)
        losses.append(loss.data[0])

    if frame_idx % 200 == 0:
        plot(frame_idx, all_rewards, losses)
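The loop above calls a plot() helper that is not defined anywhere in this post. A minimal version in the spirit of the RL-Adventure notebook this code is based on (my own sketch, using the IPython and matplotlib imports from section B) could be:

def plot(frame_idx, rewards, losses):
    clear_output(True)
    plt.figure(figsize=(20, 5))
    plt.subplot(131)
    plt.title('frame %s. reward: %s' % (frame_idx, np.mean(rewards[-10:])))
    plt.plot(rewards)
    plt.subplot(132)
    plt.title('loss')
    plt.plot(losses)
    plt.show()

It needs to be defined before running the training loop in section K.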
References
1. from anyrl.algos import DQN
   https://github.com/unixpickle/anyrl-py/blob/master/anyrl/algos/dqn.py
2. from anyrl.envs import BatchedGymEnv
   https://github.com/unixpickle/anyrl-py/blob/b43f0728400fd5c01daf4ae110c797622d0c9ddb/anyrl/envs/gym.py
3. from anyrl.envs.wrappers import BatchedFrameStack
   https://github.com/unixpickle/anyrl-py/blob/8a1ab680e56be8de435b0b8fff1fc48d7a37463a/anyrl/envs/wrappers/batched.py
4. from anyrl.models import rainbow_models
   https://github.com/unixpickle/anyrl-py/blob/8e119ca25724d537db3e620bd922f39a2ac61ea4/anyrl/models/dqn_dist.py
5. from anyrl.rollouts import BatchedPlayer, PrioritizedReplayBuffer, NStepPlayer
   5.1 BatchedPlayer:
   https://github.com/unixpickle/anyrl-py/blob/1543af96346293e90ff30b6912dab66c681c2ed1/anyrl/rollouts/players.py
   5.2 PrioritizedReplayBuffer:
   https://github.com/unixpickle/anyrl-py/blob/5e3fed0b0e249dcd01d844e4a9a47bb80b990699/anyrl/rollouts/replay.py
   5.3 NStepPlayer:
   https://github.com/unixpickle/anyrl-py/blob/1543af96346293e90ff30b6912dab66c681c2ed1/anyrl/rollouts/players.py
6. from anyrl.spaces import gym_space_vectorizer
   https://github.com/unixpickle/anyrl-py/blob/8a1ab680e56be8de435b0b8fff1fc48d7a37463a/anyrl/spaces/gym.py
7. Noisy Networks for Exploration
   https://arxiv.org/abs/1706.10295
8. RL-Adventure prioritized DQN notebook (higgsfield)
   https://github.com/higgsfield/RL-Adventure/blob/master/4.prioritized%20dqn.ipynb