Workshop 4 – Reinforcement Learning (ML) [2 weeks]
Objectives:
Gain hands-on experience with reinforcement learning.
Familiarise yourself with some of the modern approaches practical methods used for deep
reinforcement learning.
Solve basic reinforcement learning problems using Python and Keras.
Connect theoretical knowledge and practical usage by doing it yourself.
Common objectives of all workshops: Gain hands-on experience and learn by doing!
Understand how theoretical knowledge discussed in lectures relates to practice. Develop
motivation for gaining further theoretical and practical knowledge beyond the subject
material.
Overview:
Reinforcement Learning (RL) has been making headlines the last few years and there are good reasons for
it! Extensions of the methods you will see in this workshop have been used to make computers learn how to
play Atari games by themselves (https://openai.com/blog/openai-baselines-dqn/) (see also this)
(https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning/). The recent
advances in solving most challenging board games (https://deepmind.com/research/alphago/) have been
very impressive. Until even ten years ago, many people believed that computers would never learn how to
play the game "Go" due to its combinatorial complexity. Today, AlphaGo variants are the first computer
program to defeat a professional human Go player, the first program to defeat a Go world champion, and
arguably the strongest Go player in history. It is a testament to the power of RL that AlphaGo Zero
(https://deepmind.com/blog/alphago-zero-learning-scratch/) learns to play simply by playing games against
itself, starting from completely random play.
The theoretical foundations of RL have been known for a long while as presented in lectures. Today's
successes basically come from well-engineered or designed software that runs on powerful computing
systems. Multiple heuristic algorithms and designs verified through extensive experimentation seem to be
the key methodology. Despite introducing state-of-the-art concepts, tools, and implementations, this
workshop provides only a initial starting point to the world of modern RL.
Learning more on RL requires good coding skills and a powerful computer (often with a good
CUDA-supporting graphic card) or a cloud computing account with one of the major
providers. Computer and board games have been the natural playground of modern RL.
However, applications of RL to engineering disciplines
(https://blog.insightdatascience.com/using-reinforcement-learning-to-design-a-betterrocket-engine-4dfd1770497a)
remains an under-explored and very exciting domain!
Workshop Preparation: [before you arrive to the lab]
Additional packages to install
In this workshop, we will use the Openai gym (http://gym.openai.com) package for convenience. You can
install a minimum version (https://github.com/openai/gym#installation) simply by using
pip install gym
from within the Anaconda environment. This minimal install is sufficient for our purposes.
(You may or may not need to install Build Tools for Visual Studio 2019 (right click to download install
instructions) (./files/Windows_build_tools_install_2019.pdf) if you are using Windows 10.)
Ask for help from your demonstrator in case you need it.
Section 1. Multi-armed Bandits
In a k-armed (multi-armed) bandit problem, a decision making agent repeatedly chooses one of different
actions. Each action can be interpreted as pulling one of the k-levers. After each choice, the agent receives
a reward obtained from a probability distribution that depends on the selected action. The objective is to
maximise the expected total reward over a time horizon, for example, over 1000 action selections, or time
steps. Multi-armed bandits have a variety of important applications
(https://medium.com/@CornellResearch/whats-behind-your-navigation-app-79d2754e6878) ranging from
clinical trials and routing (including navigation) to recommender systems.
As a special case of reinforcement learning, the multi-armed bandit (https://en.wikipedia.org/wiki/Multiarmed_bandit)
problem has actually only a single state. The agent still has to learn the environment
represented by the underlying probability distributions and rewards. The problem provides a nice
introduction to reinforcement learning and an opportunity to explore the fundamental exploration versus
exploitation trade-offs involved.
Hint: Example implementations online (randomly selected, not guaranteed to be correct):
https://www.analyticsvidhya.com/blog/2018/09/reinforcement-multi-armed-bandit-scratch-python/
(https://www.analyticsvidhya.com/blog/2018/09/reinforcement-multi-armed-bandit-scratch-python/)
https://peterroelants.github.io/posts/multi-armed-bandit-implementation/
(https://peterroelants.github.io/posts/multi-armed-bandit-implementation/)
https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html
(https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html)
https://towardsdatascience.com/comparing-multi-armed-bandit-algorithms-on-marketing-use-cases-
8de62a851831 (https://towardsdatascience.com/comparing-multi-armed-bandit-algorithms-onmarketing-use-cases-8de62a851831)
k
In [1]: %matplotlib notebook
import pandas as pd
import numpy as np
import time
import random
import matplotlib.pyplot as plt
import matplotlib
import keras.models
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import gym
10-armed bandit data set
Let's first (create or) load a random 10-armed data set that approximately matches the description in Section
2.3 of Sutton and Barto RL book. (http://incompleteideas.net/book/the-book-2nd.html)
Using TensorFlow backend.
--------------------------------------------------------------------
-------
ModuleNotFoundError Traceback (most recent cal
l last)
<ipython-input-1-e064bb07ced9> in <module>
11 from keras.layers import Dense
12 from keras.optimizers import Adam
---> 13 import gym
ModuleNotFoundError: No module named 'gym'
In [ ]: def gen_data(num_bandits=10, T=2000, filename='10armdata'):
## function generates a synthetic data set with given parameters
## and saves the result to files folder under the given name
# init data array
tenarm_data = np.zeros((T,num_bandits))
# random mean awards
mean_rewards = np.random.normal(size=num_bandits)
#print(mean_rewards)
#print(np.random.normal(0,1,num_bandits))
for t in range(T):
tenarm_data[t,:]=np.random.normal(mean_rewards,1,num_bandits)
np.save('./files/'+filename, tenarm_data)
#gen_data()
#tenarm_data.shape
#tenarm_data[0:10,:]
# use generated data
tenarm_datal = np.load('./files/10armdata.npy')
tenarm_datal.shape
Multi-armed Bandit Algorithms
We now implement a simple random strategy for selecting actions. The results are also random as expected.
This can be considered as pure exploration since the algorithm keeps randomly choosing actions. However,
note that we do not make proper use of the randomly collected observations yet.
In [ ]: def bandit_random(data=tenarm_datal):
# random selection bandit algorithm
num_bandits = tenarm_datal.shape[1]
T = tenarm_datal.shape[0]
# init storage arrays
selections = np.zeros(T) # sequence of lever selections
step_rewards = np.zeros(T) # sequence of step selections
cum_rewards = np.zeros(T) # sequence of cumulative rewards
# main loop
for t in range(T):
sel = random.randrange(num_bandits)
selections[t] = sel
step_rewards[t] = data[t,sel]
if t>0:
cum_rewards[t] = step_rewards[t]+cum_rewards[t-1]
else:
cum_rewards[t] = step_rewards[t]
avg_reward = np.divide(cum_rewards[1:T],np.arange(1,T)) # average
reward over steps
return (selections, step_rewards, cum_rewards, avg_reward)
(selections, step_rewards, cum_rewards, avg_reward) = bandit_random()
print(cum_rewards[-1]) # total reward
plt.figure()
plt.title('Average Reward vs Time')
plt.xlabel('Time')
plt.ylabel('Average Reward')
plt.plot(avg_reward)
plt.show()
Let us consider next a more meaningful strategy, known as -greedy algorithm. The idea is to explore with
a pre-determined fixed probability and exploit, i.e. get the maximum reward given current knowledge,
with probability . The observations are now used to estimate the values of actions by averaging. This
well-known algorithm is discussed in Section 2.7 of Sutton and Barto book
(http://incompleteideas.net/book/the-book-2nd.html) and described below:
We provide a rudimentary implementation below as a single run.
In [ ]: def bandit_epsgreedy(data=tenarm_datal, eps=0.1):
# epsilon-greedy bandit algorithm
# parameters
num_bandits = data.shape[1]
T = data.shape[0]
# init storage arrays
Q = np.zeros(num_bandits)
N = np.zeros(num_bandits)
selections = np.zeros(T) # sequence of lever selections
step_rewards = np.zeros(T) # sequence of step selections
cum_rewards = np.zeros(T) # sequence of cumulative rewards
# main loop
for t in range(T):
# pull lever
if np.random.rand() < eps:
# make a random selection
sel = random.randrange(num_bandits)
else:
# choose the best expected reward
sel = np.argmax(Q)
# update nbr of selections made
N[sel] = N[sel] + 1
# update mean reward estimate
Q[sel] = Q[sel] + (1/N[sel])*(data[t,sel]- Q[sel])
# store values
selections[t] = sel
step_rewards[t] = data[t,sel]
if t>0:
cum_rewards[t] = step_rewards[t]+cum_rewards[t-1]
else:
cum_rewards[t] = step_rewards[t]
avg_reward = np.divide(cum_rewards[1:T],np.arange(1,T)) # average
reward over steps
return (selections, step_rewards, cum_rewards, avg_reward)
(selections, step_rewards, cum_rewards, avg_reward) = bandit_epsgreedy
(eps=0.15)
print(cum_rewards[-1]) # total reward
plt.figure()
plt.title('Average Reward vs Time')
plt.xlabel('Time')
plt.ylabel('Average Reward')
plt.plot(avg_reward)
plt.show()
Next, we run the algorithm over multiple simulations, which we generate by permutating the input data. The
obtained average results are naturally less "noisy". It may take many simulations to get low-variance,
averaged results, especially if the bandits are truly random. Here we use a static dataset so we get good
averages with a small number of runs.
In [ ]: def bandit_epsgreedy_sims(datasim=tenarm_datal, epsilon=0.1, nbr_sims=
10):
# parameters
num_bandits = datasim.shape[1]
T = datasim.shape[0]
# store values
sim_cum_rewards = np.zeros((nbr_sims,T))
sim_mean_rewards = np.zeros((nbr_sims,T-1))
for s in range(nbr_sims):
(dummy,dummy, cum_rewards, avg_reward) = bandit_epsgreedy(data
=np.random.permutation(datasim),
ep
s=epsilon)
sim_cum_rewards[s,:] = cum_rewards
sim_mean_rewards[s,:] = avg_reward
return (sim_cum_rewards, sim_mean_rewards)
(sim_cum_rewards, sim_mean_rewards) = bandit_epsgreedy_sims(epsilon=0.
15, nbr_sims=10)
print('Expected total reward (over sims) = ', np.average(sim_cum_rewar
ds[:,-1]))
In [ ]: sim_avg_rewards = np.average(sim_mean_rewards, axis=0)
sim_avg_rewards.shape
plt.figure()
plt.title('Averaged Reward (over sims and time) vs Time')
plt.xlabel('Time')
plt.ylabel('Averaged Reward')
plt.plot(sim_avg_rewards)
plt.show()
Exploration vs Exploitation Trade-off
It is important to investigate the relationship between the outcome (average cumulative reward over time)
and parameter. For small , the algorithm is more greedy and chooses the best action (given knowledge
level, here Q estimate) most of the time. This is called exploitation in reinforcement learning (RL)
(https://en.wikipedia.org/wiki/Reinforcement_learning). For large , the algorithm spends more time in
exploration mode and obtains better Q estimates. This exploration vs exploitation trade-off is fundamental
to all RL approaches (https://www.coursera.org/lecture/practical-rl/exploration-vs-exploitation-3ExUr), not
just multi-armed bandits. The same concepts are also relevant to dual control
(https://en.wikipedia.org/wiki/Dual_control_theory) as well as adaptive control
(https://en.wikipedia.org/wiki/Adaptive_control).
ε ε
ε
Question 1 (20 pts) A Multi-armed bandit for CDN Optimisation
In this question, the problem of real-world data retrieval from multiple redundant sources is investigated.
This communication network problem is commonly known as the Content Distribution Network (CDN)
problem (see a relevant paper, right click to download) (./files/performance_of_CDN.pdf). An agent must
retrieve data through a network with several redundant sources available. For each retrieval, the agent
selects one source and waits until the data is retrieved. The objective of the agent is to minimize the sum of
the delays for the successive retrievals. This problem is investigated in Section 4.2 of this paper, (right click
to download) (./files/bandit.pdf) and related project (http://bandit.sourceforge.net/) as well as discussed in
this practical book (http://shop.oreilly.com/product/0636920027393.do).
We will use a subset of the publicly available (./files/license.txt) universities web latency networking data set
from the bandit project (http://bandit.sourceforge.net/), which contains retrieval delay/latency measurements
from over 700 universities' homepages in milliseconds. Let's decrease the number of bandits (columns)
randomly to 20 to make it computationally less time consuming (but you can change this later if you wish).
The rewards are the negatives of the delays.
In [ ]: univ_data = pd.read_csv('./files/univ-latencies.csv')
univ_data = -univ_data.sample(n=20, axis=1) #choose 20 columns randoml
y for computational simplicity
univ_data.head()
Answer the following by implementing and simulating well-known multi-armed bandit algorithms.
1. (8 pts) Apply -greedy algorithm to the CDN problem, i.e. universities web latency data set using
negative of latencies as rewards. Try different values to investigate exploration vs exploitation trade-off
and the best total average reward.
2. (12 pts) Implement and apply upper confidence bound (UCB) action selection algorithm to the same
data set. Compare your results and briefly discuss your findings.
Answer as text here
In [ ]: ''' Answer as code here '''
Section 2. Model-free Reinforcement Learning with
CartPole
In this section, we are going to use a simple (v1) cart-pole environment
(http://gym.openai.com/envs/CartPole-v1/) from openai gym (https://gym.openai.com/). In this classic
control problem, a pole or inverted pendulum (https://en.wikipedia.org/wiki/Inverted_pendulum) is attached
vertically to a cart, which moves along a frictionless track. The system is controlled by applying a horizontal
force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A
reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole
is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
Note that, there are detailed models of the cartpole environment, see for example this
(https://danielpiedrahita.wordpress.com/portfolio/cart-pole-control/), which can be used to develop a variety
of model-based solutions. However, we will use this simple environment to illustrate a model-free, purely
data-oriented modern reinforcement learning approach as discussed in the lectures.
First, let's have a look at the environment by taking random actions. When you run the cell below, you
should see a new window with the cartpole environment for a short duration.
In [ ]: env = gym.make('CartPole-v1')
env.reset()
for _ in range(100):
env.render()
env.step(env.action_space.sample()) # take a random action
time.sleep(2) # wait for 2 seconds
env.close()
Environment
The figure below left depicts the standard SARSA (State-Action-Reward-State'-Action') paradigm. As
discussed during lectures, the idea is for the agent to learn the environment and achieve the objective
concurrently by taking a mixture of random actions and actions that maximise the cumulative reward.
The figure below right represents an implementation within the openai gym environment.
(http://gym.openai.com/docs/#observations) In this implementation, observation corresponds to state.
The environment is described in detail in the classic publication cited below (right click to download)
(./files/Barto1983.pdf). Note that this predates deep learning so it has more historical value than practical
value, other than the model itself.
AG Barto, RS Sutton and CW Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning
Control Problem", IEEE Transactions on Systems, Man, and Cybernetics, 1983.
The states and actions are described within the source code of the environment.
(https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)
Actions are {0, 1}, corresponding to "Push cart to the left" and "Push cart to the right", respectively.
In [ ]: env = gym.make('CartPole-v0')
print(env.action_space)
env.action_space.n
In [ ]: print(env.observation_space)
The states are [Cart Position, Cart Velocity, Pole Angle, Pole Velocity At Tip]. The angle is zero when to pole
is pointing up.
In [ ]: env = gym.make('CartPole-v1')
observation = env.reset()
print(np.array([env.reset()]))
In [ ]: env = gym.make('CartPole-v1')
for episode in range(10):
observation = env.reset()
for t in range(100):
env.render()
print(observation)
action = env.action_space.sample() # take a random action
observation, reward, done, info = env.step(action) # episode e
nds when done=True based on conditions
if done:
print("Episode finished after {} timesteps".format(t+1))
break
env.close()
DQN Agent
We now create a DQN agent class, which implements the behaviour of the agent based on the famous paper
(right click to download) (./files/dqn.pdf). This implementation is based on Algorithm 1 (Deep Q-learning with
Experience Replay) on page 5.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and
Martin Riedmiller. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
In [ ]: class CartPoleDQN:
def __init__(self, environment):
# from cartpole environment
self.env = environment
self.nbr_states = self.env.observation_space.shape[0]
self.nbr_actions = self.env.action_space.n
# to store (state,action,reward)
self.memory = deque(maxlen=1000000)
self.scores=[]
# parameters
self.gamma = 0.95 # discount rate
self.epsilon = 1 # exploration rate
self.epsilon_min = 0.01
self.epsilon_decay = 0.995
self.batch_size = 16
# build the DNN model
self.build_model()
def build_model(self):
""" Define Keras DNN model """
self.model = Sequential()
self.model.add(Dense(16, input_shape=(self.nbr_states,), activ
ation='relu'))
self.model.add(Dense(4, activation='relu'))
self.model.add(Dense(self.nbr_actions, activation='linear'))
self.model.compile(loss='mean_squared_error', optimizer=Adam(l
r=0.001))
def save_model(self, filename):
self.model.save(filename+".h5")
def load_model(self, filename):
self.model = keras.models.load_model(filename)
def remember(self, state, action, reward, next_state, done):
# store the SARSA from observation
self.memory.append((state, action, reward, next_state, done))
def act(self, state):
if np.random.rand() < self.epsilon:
return self.env.action_space.sample()
else:
q_values = self.model.predict(state)[0]
return np.argmax(q_values)
def update_model(self):
if len(self.memory) < self.batch_size:
return
# sample the mini-batch
mini_batch=random.sample(self.memory, self.batch_size)
# experience replay
for state, action, reward, state_next, done in mini_batch:
if done:
q_update = reward
else:
q_update = reward + self.gamma*np.amax(self.model.pred
ict(state_next)[0])
q = self.model.predict(state)
q[0][action] = q_update
self.model.fit(state, q, batch_size=self.batch_size, verbo
se=0)
# update exploration/exploitation balance
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
else:
self.epsilon=self.epsilon_min
Experiments and Training - Agent in Environment
The interaction of the agent with the environment is simulated below. Each episode ends when control fails,
defined as (https://gym.openai.com/envs/CartPole-v1/) the pole being more than 15 degrees from vertical, or
the cart moving more than 2.4 units from the centre.
A reward of is given for every time step that the pole remains uprigh, so the score per episode is the
number of steps where the control is successful.
+1
In [ ]: env=gym.make('CartPole-v1')
dqn_agent = CartPoleDQN(environment=env)
for episode in range(50):
state = env.reset()
state = np.reshape(state, [1, dqn_agent.nbr_states]) # reshape for
keras
score = 0
done = False
while not done:
#env.render()
action = dqn_agent.act(state)
state_next, reward, done, info = env.step(action)
state_next = np.reshape(state_next, [1, dqn_agent.nbr_states])
# reshape for keras
if done:
reward = 0 # no reward for last time step
score = score + reward
dqn_agent.remember(state, action, reward, state_next, done)
state = state_next
dqn_agent.update_model()
if done:
dqn_agent.scores.append(score)
print("Episode:", episode, " Score:", score, " memory le
ngth:",
len(dqn_agent.memory), " epsilon:", dqn_agent.e
psilon)
break
env.close()
print("training completed!")
Performance Evaluation
The original criterion for success is defined as (https://gym.openai.com/envs/CartPole-v1/) getting an
average reward of greater or equal than 195.0 over 100 consecutive trials. We relax this in order to (literally)
save time. Let's consider a more modest version with an average reward >=100 over 20 consecutive trials.
In [ ]: np.average(dqn_agent.scores[30:50])
In [ ]: plt.figure()
plt.title('Score per Episode')
plt.xlabel('Episode number')
plt.ylabel('Score')
plt.plot(dqn_agent.scores)
plt.plot(np.ones((len(dqn_agent.scores),1))*100)
plt.show()
Resulting DQN Model
We now look a bit closer to the DNN underlying the DQN model we have trained.
In [ ]: dqn_agent.model.summary()
print(dqn_agent.model.get_weights())
In [ ]: ## install pydot in your environment using Anaconda Navigator or you c
an skip this step
## by commenting out the commands below!
## Plot model graph
#keras.utils.plot_model(dqn_agent.model, show_shapes=True, show_layer_
names=True, to_file='model.png')
#from IPython.display import Image
#Image(retina=True, filename='model.png')
Visual test
We save our trained model to a file for reuse, which makes a lot of sense given the time spent on training.
Then, we illustrate the performance of the model visually. Note that because we have cut the training short
to save time, the result is rather sub-optimal.
In [ ]: # we save the trained model
dqn_agent.save_model('dqn_train2')
In [ ]: # A few episodes to watch the behaviour
env = gym.make('CartPole-v1')
dqn_agent_test = CartPoleDQN(environment=env)
dqn_agent_test.load_model('dqn_train2.h5')
dqn_agent_test.epsilon = 0 # no random movements
for episode in range(5):
state = env.reset()
state = np.reshape(state, [1, dqn_agent_test.nbr_states]) # reshap
e for keras
for t in range(150): # increase to 200 to see termination
env.render()
action = dqn_agent_test.act(state)
state_next, reward, done, info = env.step(action)
state = np.reshape(state_next, [1, dqn_agent_test.nbr_states])
# reshape for keras
if done:
print("Episode finished after {} timesteps".format(t+1))
break
env.close()
Question 2 (20 pts)
This question aims to motivate you to fully understand and improve upon the example possibly suboptimal
DQN implementation above.
Hint: a good model is quite similar to what is provided so minor tweaking should work!
1. How does this DQN differ from classical Q-learning (see e.g. Section 6.5 of Sutton and Barto RL book)?
Clearly identify and discuss similarities as well as differences. In addition, briefly describe the
relationship to value iteration and dynamic programming (via Bellman's equation).
2. Improve the performance of the DQN by changing the underlying DNN architecture, training method,
and epsilon parameter and its decay which controls the exploration vs exploitation trade-off. Interpret
and discuss your findings, e.g. what works and does not work and why?
3. [Optional, no marks] Do you see a high-level similarity between epsilon parameter and diminishing step
size in gradient optimisation algorithms?
4. [Optional, no marks] Try different DNN architectures as well as approaches, e.g. A3C or beyond
(https://towardsdatascience.com/advanced-reinforcement-learning-6d769f529eb3).
Hint: example implementations from the Web (which may be incorrect):
https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288
(https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288)
https://gym.openai.com/envs/CartPole-v0/ (https://gym.openai.com/envs/CartPole-v0/)
https://github.com/rlcode/reinforcement-learning (https://github.com/rlcode/reinforcement-learning)
Answer as text here
In [ ]: ''' Answer as code here '''
In [ ]:
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。