
Workshop 4 – Reinforcement Learning (ML) [2 weeks]

Objectives:

Gain hands-on experience with reinforcement learning.

Familiarise yourself with some of the modern practical methods used for deep reinforcement learning.

Solve basic reinforcement learning problems using Python and Keras.

Connect theoretical knowledge and practical usage by doing it yourself.

Common objectives of all workshops: Gain hands-on experience and learn by doing! Understand how theoretical knowledge discussed in lectures relates to practice. Develop motivation for gaining further theoretical and practical knowledge beyond the subject material.

Overview:

Reinforcement Learning (RL) has been making headlines in the last few years, and there are good reasons for it! Extensions of the methods you will see in this workshop have been used to make computers learn how to play Atari games by themselves (https://openai.com/blog/openai-baselines-dqn/) (see also this (https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning/)). The recent advances in solving the most challenging board games (https://deepmind.com/research/alphago/) have been very impressive. Until even ten years ago, many people believed that computers would never learn how to play the game "Go" due to its combinatorial complexity. Today, AlphaGo variants are the first computer program to defeat a professional human Go player, the first program to defeat a Go world champion, and arguably the strongest Go player in history. It is a testament to the power of RL that AlphaGo Zero (https://deepmind.com/blog/alphago-zero-learning-scratch/) learns to play simply by playing games against itself, starting from completely random play.

The theoretical foundations of RL have been known for a long while, as presented in lectures. Today's successes basically come from well-engineered software that runs on powerful computing systems. Multiple heuristic algorithms and designs, verified through extensive experimentation, seem to be the key methodology. Despite introducing state-of-the-art concepts, tools, and implementations, this workshop provides only an initial starting point to the world of modern RL.

Learning more about RL requires good coding skills and a powerful computer (often with a good CUDA-supporting graphics card) or a cloud computing account with one of the major providers. Computer and board games have been the natural playground of modern RL. However, applications of RL to engineering disciplines (https://blog.insightdatascience.com/using-reinforcement-learning-to-design-a-better-rocket-engine-4dfd1770497a) remain an under-explored and very exciting domain!

Workshop Preparation: [before you arrive at the lab]

Additional packages to install

In this workshop, we will use the OpenAI Gym (http://gym.openai.com) package for convenience. You can install a minimum version (https://github.com/openai/gym#installation) simply by using

pip install gym

from within the Anaconda environment. This minimal install is sufficient for our purposes.

(You may or may not need to install Build Tools for Visual Studio 2019 (right click to download install instructions) (./files/Windows_build_tools_install_2019.pdf) if you are using Windows 10.)

Ask for help from your demonstrator in case you need it.
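
As an optional sanity check, a minimal cell such as the one below confirms that gym imports and can create the CartPole environment used later in this workshop (the printed version depends on your installation):

In [ ]: # Optional sanity check: gym imports and the CartPole environment can be created
        import gym
        print(gym.__version__)           # version string depends on your install
        env = gym.make('CartPole-v1')    # should not raise an error
        env.close()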

Section 1. Multi-armed Bandits

In a k-armed (multi-armed) bandit problem, a decision-making agent repeatedly chooses one of k different actions. Each action can be interpreted as pulling one of the k levers. After each choice, the agent receives a reward obtained from a probability distribution that depends on the selected action. The objective is to maximise the expected total reward over a time horizon, for example, over 1000 action selections, or time steps. Multi-armed bandits have a variety of important applications (https://medium.com/@CornellResearch/whats-behind-your-navigation-app-79d2754e6878) ranging from clinical trials and routing (including navigation) to recommender systems.

As a special case of reinforcement learning, the multi-armed bandit (https://en.wikipedia.org/wiki/Multi-armed_bandit) problem actually has only a single state. The agent still has to learn the environment, represented by the underlying probability distributions and rewards. The problem provides a nice introduction to reinforcement learning and an opportunity to explore the fundamental exploration versus exploitation trade-offs involved.
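
Using the standard notation of Chapter 2 of Sutton and Barto, each action a has an unknown expected reward, or action value,

$$ q_*(a) = \mathbb{E}\,[R_t \mid A_t = a], $$

and over a horizon of T steps the agent tries to maximise the expected total reward $\mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$. The sample-average estimates Q(a) computed later in this section approximate q_*(a).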

Hint: Example implementations online (randomly selected, not guaranteed to be correct):

https://www.analyticsvidhya.com/blog/2018/09/reinforcement-multi-armed-bandit-scratch-python/

https://peterroelants.github.io/posts/multi-armed-bandit-implementation/

https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html

https://towardsdatascience.com/comparing-multi-armed-bandit-algorithms-on-marketing-use-cases-8de62a851831

In [1]: %matplotlib notebook
        import pandas as pd
        import numpy as np
        import time
        import random
        import matplotlib.pyplot as plt
        import matplotlib
        import keras.models
        from collections import deque
        from keras.models import Sequential
        from keras.layers import Dense
        from keras.optimizers import Adam
        import gym

10-armed bandit data set

Let's first (create or) load a random 10-armed data set that approximately matches the description in Section 2.3 of Sutton and Barto's RL book (http://incompleteideas.net/book/the-book-2nd.html).


In [ ]: def gen_data(num_bandits=10, T=2000, filename='10armdata'):
            ## function generates a synthetic data set with the given parameters
            ## and saves the result to the files folder under the given name

            # init data array
            tenarm_data = np.zeros((T, num_bandits))
            # random mean rewards, one per bandit arm
            mean_rewards = np.random.normal(size=num_bandits)
            #print(mean_rewards)
            #print(np.random.normal(0,1,num_bandits))
            for t in range(T):
                tenarm_data[t,:] = np.random.normal(mean_rewards, 1, num_bandits)

            np.save('./files/'+filename, tenarm_data)

        #gen_data()
        #tenarm_data.shape
        #tenarm_data[0:10,:]

        # use the previously generated data
        tenarm_datal = np.load('./files/10armdata.npy')
        tenarm_datal.shape

Multi-armed Bandit Algorithms

We now implement a simple random strategy for selecting actions. The results are also random, as expected. This can be considered as pure exploration, since the algorithm keeps randomly choosing actions. However, note that we do not make proper use of the randomly collected observations yet.

In [ ]: def bandit_random(data=tenarm_datal):
            # random selection bandit algorithm

            num_bandits = data.shape[1]
            T = data.shape[0]
            # init storage arrays
            selections = np.zeros(T)    # sequence of lever selections
            step_rewards = np.zeros(T)  # sequence of step rewards
            cum_rewards = np.zeros(T)   # sequence of cumulative rewards
            # main loop
            for t in range(T):
                sel = random.randrange(num_bandits)
                selections[t] = sel
                step_rewards[t] = data[t, sel]
                if t > 0:
                    cum_rewards[t] = step_rewards[t] + cum_rewards[t-1]
                else:
                    cum_rewards[t] = step_rewards[t]

            avg_reward = np.divide(cum_rewards[1:T], np.arange(1, T))  # average reward over steps

            return (selections, step_rewards, cum_rewards, avg_reward)


        (selections, step_rewards, cum_rewards, avg_reward) = bandit_random()
        print(cum_rewards[-1])  # total reward
        plt.figure()
        plt.title('Average Reward vs Time')
        plt.xlabel('Time')
        plt.ylabel('Average Reward')
        plt.plot(avg_reward)
        plt.show()

Let us next consider a more meaningful strategy, known as the ε-greedy algorithm. The idea is to explore with a pre-determined fixed probability ε and to exploit, i.e. take the action with the maximum estimated reward given current knowledge, with probability 1 − ε. The observations are now used to estimate the values of the actions by averaging. This well-known algorithm is discussed in Section 2.7 of the Sutton and Barto book (http://incompleteideas.net/book/the-book-2nd.html) and described below:
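
The value estimates are maintained with the standard incremental sample-average update (cf. Chapter 2 of Sutton and Barto): if action a has been selected N(a) times and the newly observed reward is R, then

$$ Q(a) \leftarrow Q(a) + \frac{1}{N(a)}\bigl(R - Q(a)\bigr), $$

which is exactly the update Q[sel] = Q[sel] + (1/N[sel])*(data[t,sel] - Q[sel]) used in the implementation.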

We provide a rudimentary implementation below as a single run.

In [ ]: def bandit_epsgreedy(data=tenarm_datal, eps=0.1):
            # epsilon-greedy bandit algorithm

            # parameters
            num_bandits = data.shape[1]
            T = data.shape[0]

            # init storage arrays
            Q = np.zeros(num_bandits)   # action-value estimates
            N = np.zeros(num_bandits)   # number of times each lever has been selected
            selections = np.zeros(T)    # sequence of lever selections
            step_rewards = np.zeros(T)  # sequence of step rewards
            cum_rewards = np.zeros(T)   # sequence of cumulative rewards
            # main loop
            for t in range(T):

                # pull a lever
                if np.random.rand() < eps:
                    # make a random selection (explore)
                    sel = random.randrange(num_bandits)
                else:
                    # choose the best expected reward (exploit)
                    sel = np.argmax(Q)

                # update nbr of selections made
                N[sel] = N[sel] + 1
                # update mean reward estimate incrementally
                Q[sel] = Q[sel] + (1/N[sel])*(data[t, sel] - Q[sel])

                # store values
                selections[t] = sel
                step_rewards[t] = data[t, sel]
                if t > 0:
                    cum_rewards[t] = step_rewards[t] + cum_rewards[t-1]
                else:
                    cum_rewards[t] = step_rewards[t]

            avg_reward = np.divide(cum_rewards[1:T], np.arange(1, T))  # average reward over steps

            return (selections, step_rewards, cum_rewards, avg_reward)

        (selections, step_rewards, cum_rewards, avg_reward) = bandit_epsgreedy(eps=0.15)
        print(cum_rewards[-1])  # total reward
        plt.figure()
        plt.title('Average Reward vs Time')
        plt.xlabel('Time')
        plt.ylabel('Average Reward')
        plt.plot(avg_reward)
        plt.show()

Next, we run the algorithm over multiple simulations, which we generate by permuting the input data. The obtained average results are naturally less "noisy". It may take many simulations to get low-variance, averaged results, especially if the bandits are truly random. Here we use a static data set, so we get good averages with a small number of runs.

In [ ]: def bandit_epsgreedy_sims(datasim=tenarm_datal, epsilon=0.1, nbr_sims=10):
            # parameters
            num_bandits = datasim.shape[1]
            T = datasim.shape[0]

            # store values
            sim_cum_rewards = np.zeros((nbr_sims, T))
            sim_mean_rewards = np.zeros((nbr_sims, T-1))

            for s in range(nbr_sims):
                (dummy, dummy, cum_rewards, avg_reward) = bandit_epsgreedy(data=np.random.permutation(datasim), eps=epsilon)
                sim_cum_rewards[s,:] = cum_rewards
                sim_mean_rewards[s,:] = avg_reward

            return (sim_cum_rewards, sim_mean_rewards)

        (sim_cum_rewards, sim_mean_rewards) = bandit_epsgreedy_sims(epsilon=0.15, nbr_sims=10)
        print('Expected total reward (over sims) = ', np.average(sim_cum_rewards[:,-1]))

In [ ]: sim_avg_rewards = np.average(sim_mean_rewards, axis=0)
        sim_avg_rewards.shape
        plt.figure()
        plt.title('Averaged Reward (over sims and time) vs Time')
        plt.xlabel('Time')
        plt.ylabel('Averaged Reward')
        plt.plot(sim_avg_rewards)
        plt.show()

Exploration vs Exploitation Trade-off

It is important to investigate the relationship between the outcome (average cumulative reward over time) and the ε parameter. For small ε, the algorithm is more greedy and chooses the best action (given the current knowledge level, here the Q estimate) most of the time. This is called exploitation in reinforcement learning (RL) (https://en.wikipedia.org/wiki/Reinforcement_learning). For large ε, the algorithm spends more time in exploration mode and obtains better Q estimates. This exploration vs exploitation trade-off is fundamental to all RL approaches (https://www.coursera.org/lecture/practical-rl/exploration-vs-exploitation-3ExUr), not just multi-armed bandits. The same concepts are also relevant to dual control (https://en.wikipedia.org/wiki/Dual_control_theory) as well as adaptive control (https://en.wikipedia.org/wiki/Adaptive_control).
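
As a quick experiment, a minimal sketch (assuming the data set tenarm_datal and the function bandit_epsgreedy_sims defined above are available) sweeps a few ε values and compares the resulting average total reward:

In [ ]: # Sweep a few epsilon values and compare the average total reward over simulations
        eps_values = [0.0, 0.01, 0.05, 0.1, 0.2, 0.5]
        totals = []
        for eps in eps_values:
            (sim_cum_rewards, sim_mean_rewards) = bandit_epsgreedy_sims(epsilon=eps, nbr_sims=10)
            totals.append(np.average(sim_cum_rewards[:, -1]))  # mean total reward over the runs
            print('epsilon =', eps, ' average total reward =', totals[-1])

        plt.figure()
        plt.title('Average Total Reward vs Epsilon')
        plt.xlabel('Epsilon')
        plt.ylabel('Average Total Reward')
        plt.plot(eps_values, totals, 'o-')
        plt.show()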

Question 1 (20 pts) A Multi-armed bandit for CDN Optimisation

In this question, the problem of real-world data retrieval from multiple redundant sources is investigated. This communication network problem is commonly known as the Content Distribution Network (CDN) problem (see a relevant paper, right click to download) (./files/performance_of_CDN.pdf). An agent must retrieve data through a network with several redundant sources available. For each retrieval, the agent selects one source and waits until the data is retrieved. The objective of the agent is to minimize the sum of the delays for the successive retrievals. This problem is investigated in Section 4.2 of this paper (right click to download) (./files/bandit.pdf) and the related project (http://bandit.sourceforge.net/), as well as discussed in this practical book (http://shop.oreilly.com/product/0636920027393.do).

We will use a subset of the publicly available (./files/license.txt) universities web latency networking data set from the bandit project (http://bandit.sourceforge.net/), which contains retrieval delay/latency measurements (in milliseconds) from over 700 universities' homepages. Let's decrease the number of bandits (columns) randomly to 20 to make it computationally less time-consuming (but you can change this later if you wish). The rewards are the negatives of the delays.

In [ ]: univ_data = pd.read_csv('./files/univ-latencies.csv')
        univ_data = -univ_data.sample(n=20, axis=1)  # choose 20 columns randomly for computational simplicity
        univ_data.head()

Answer the following by implementing and simulating well-known multi-armed bandit algorithms.

1. (8 pts) Apply the ε-greedy algorithm to the CDN problem, i.e. the universities web latency data set, using the negatives of the latencies as rewards. Try different ε values to investigate the exploration vs exploitation trade-off and find the best total average reward.

2. (12 pts) Implement and apply the upper confidence bound (UCB) action selection algorithm (recalled below) to the same data set. Compare your results and briefly discuss your findings.
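
Hint: the standard UCB action-selection rule (see Section 2.7 of Sutton and Barto) chooses at time step t the action

$$ A_t = \underset{a}{\arg\max}\left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right], $$

where Q_t(a) is the current value estimate, N_t(a) is the number of times action a has been selected so far, and the constant c > 0 controls the amount of exploration.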

Answer as text here

In [ ]: ''' Answer as code here '''

Section 2. Model-free Reinforcement Learning with CartPole

In this section, we are going to use a simple (v1) cart-pole environment (http://gym.openai.com/envs/CartPole-v1/) from OpenAI Gym (https://gym.openai.com/). In this classic control problem, a pole, or inverted pendulum (https://en.wikipedia.org/wiki/Inverted_pendulum), is attached vertically to a cart, which moves along a frictionless track. The system is controlled by applying a horizontal force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

Note that there are detailed models of the cartpole environment, see for example this (https://danielpiedrahita.wordpress.com/portfolio/cart-pole-control/), which can be used to develop a variety of model-based solutions. However, we will use this simple environment to illustrate a model-free, purely data-oriented modern reinforcement learning approach, as discussed in the lectures.

First, let's have a look at the environment by taking random actions. When you run the cell below, you should see a new window with the cartpole environment for a short duration.

In [ ]: env = gym.make('CartPole-v1')
        env.reset()
        for _ in range(100):
            env.render()
            env.step(env.action_space.sample())  # take a random action

        time.sleep(2)  # wait for 2 seconds
        env.close()

Environment

The figure below left depicts the standard SARSA (State-Action-Reward-State'-Action') paradigm. As discussed during lectures, the idea is for the agent to learn the environment and achieve the objective concurrently by taking a mixture of random actions and actions that maximise the cumulative reward. The figure below right represents an implementation within the OpenAI Gym environment (http://gym.openai.com/docs/#observations). In this implementation, observation corresponds to state.

The environment is described in detail in the classic publication cited below (right click to download) (./files/Barto1983.pdf). Note that this predates deep learning, so it has more historical value than practical value, other than the model itself.

AG Barto, RS Sutton and CW Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems", IEEE Transactions on Systems, Man, and Cybernetics, 1983.

The states and actions are described within the source code of the environment (https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py).

Actions are {0, 1}, corresponding to "Push cart to the left" and "Push cart to the right", respectively.

In [ ]: env = gym.make('CartPole-v0')
        print(env.action_space)
        env.action_space.n

In [ ]: print(env.observation_space)

The states are [Cart Position, Cart Velocity, Pole Angle, Pole Velocity At Tip]. The angle is zero when the pole is pointing up.

In [ ]: env = gym.make('CartPole-v1')
        observation = env.reset()
        print(np.array([env.reset()]))

In [ ]: env = gym.make('CartPole-v1')
        for episode in range(10):
            observation = env.reset()
            for t in range(100):
                env.render()
                print(observation)
                action = env.action_space.sample()  # take a random action
                observation, reward, done, info = env.step(action)  # episode ends when done=True based on conditions
                if done:
                    print("Episode finished after {} timesteps".format(t+1))
                    break
        env.close()

DQN Agent

We now create a DQN agent class, which implements the behaviour of the agent based on the famous paper (right click to download) (./files/dqn.pdf). This implementation is based on Algorithm 1 (Deep Q-learning with Experience Replay) on page 5.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
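
As a brief recap of the experience-replay update implemented in update_model below: each stored transition (state, action, reward, next_state, done) produces a regression target for the chosen action,

$$ y = \begin{cases} r, & \text{if the episode terminated at the next state,} \\ r + \gamma \max_{a'} Q(s', a'; \theta), & \text{otherwise,} \end{cases} $$

where γ is the discount rate (gamma in the code) and Q is the Keras network. Note that this simple implementation uses a single network for both prediction and target, as in Algorithm 1 of the cited paper; later DQN variants add a separate target network and other refinements.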

In [ ]: class CartPoleDQN:
            def __init__(self, environment):

                # from the cartpole environment
                self.env = environment
                self.nbr_states = self.env.observation_space.shape[0]
                self.nbr_actions = self.env.action_space.n

                # to store (state, action, reward, next_state, done) transitions
                self.memory = deque(maxlen=1000000)
                self.scores = []

                # parameters
                self.gamma = 0.95          # discount rate
                self.epsilon = 1           # exploration rate
                self.epsilon_min = 0.01
                self.epsilon_decay = 0.995
                self.batch_size = 16

                # build the DNN model
                self.build_model()

            def build_model(self):
                """ Define Keras DNN model """
                self.model = Sequential()

                self.model.add(Dense(16, input_shape=(self.nbr_states,), activation='relu'))
                self.model.add(Dense(4, activation='relu'))
                self.model.add(Dense(self.nbr_actions, activation='linear'))

                self.model.compile(loss='mean_squared_error', optimizer=Adam(lr=0.001))

            def save_model(self, filename):
                self.model.save(filename+".h5")

            def load_model(self, filename):
                self.model = keras.models.load_model(filename)

            def remember(self, state, action, reward, next_state, done):
                # store the SARSA tuple from the observation
                self.memory.append((state, action, reward, next_state, done))

            def act(self, state):
                # epsilon-greedy action selection
                if np.random.rand() < self.epsilon:
                    return self.env.action_space.sample()
                else:
                    q_values = self.model.predict(state)[0]
                    return np.argmax(q_values)

            def update_model(self):
                # wait until there are enough samples for a mini-batch
                if len(self.memory) < self.batch_size:
                    return

                # sample the mini-batch
                mini_batch = random.sample(self.memory, self.batch_size)

                # experience replay
                for state, action, reward, state_next, done in mini_batch:
                    if done:
                        q_update = reward
                    else:
                        q_update = reward + self.gamma*np.amax(self.model.predict(state_next)[0])
                    q = self.model.predict(state)
                    q[0][action] = q_update
                    self.model.fit(state, q, batch_size=self.batch_size, verbose=0)

                # update exploration/exploitation balance
                if self.epsilon > self.epsilon_min:
                    self.epsilon *= self.epsilon_decay
                else:
                    self.epsilon = self.epsilon_min

Experiments and Training - Agent in Environment

The interaction of the agent with the environment is simulated below. Each episode ends when control fails, defined as (https://gym.openai.com/envs/CartPole-v1/) the pole being more than 15 degrees from vertical, or the cart moving more than 2.4 units from the centre.

A reward of +1 is given for every time step that the pole remains upright, so the score per episode is the number of steps during which the control is successful.

In [ ]: env = gym.make('CartPole-v1')
        dqn_agent = CartPoleDQN(environment=env)
        for episode in range(50):
            state = env.reset()
            state = np.reshape(state, [1, dqn_agent.nbr_states])  # reshape for keras
            score = 0
            done = False
            while not done:
                #env.render()
                action = dqn_agent.act(state)
                state_next, reward, done, info = env.step(action)
                state_next = np.reshape(state_next, [1, dqn_agent.nbr_states])  # reshape for keras

                if done:
                    reward = 0  # no reward for the last time step
                score = score + reward

                dqn_agent.remember(state, action, reward, state_next, done)
                state = state_next

                dqn_agent.update_model()

                if done:
                    dqn_agent.scores.append(score)
                    print("Episode:", episode, " Score:", score, " memory length:",
                          len(dqn_agent.memory), " epsilon:", dqn_agent.epsilon)
                    break
        env.close()
        print("training completed!")

Performance Evaluation

The original criterion for success is defined as (https://gym.openai.com/envs/CartPole-v1/) getting an average reward greater than or equal to 195.0 over 100 consecutive trials. We relax this in order to (literally) save time. Let's consider a more modest version: an average reward >= 100 over 20 consecutive trials.
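
To visualise progress towards this relaxed criterion, a minimal sketch (assuming the training cell above has been run and dqn_agent.scores contains at least 20 entries) plots a 20-episode moving average of the score:

In [ ]: # Moving average of the score over a sliding window of 20 episodes,
        # plotted against the relaxed success threshold of 100
        window = 20
        scores = np.array(dqn_agent.scores)
        moving_avg = np.convolve(scores, np.ones(window)/window, mode='valid')
        plt.figure()
        plt.title('Moving Average Score (window of 20 episodes)')
        plt.xlabel('Episode number')
        plt.ylabel('Average score')
        plt.plot(moving_avg)
        plt.plot(np.ones(len(moving_avg))*100)
        plt.show()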

In [ ]: np.average(dqn_agent.scores[30:50])

In [ ]: plt.figure()
        plt.title('Score per Episode')
        plt.xlabel('Episode number')
        plt.ylabel('Score')
        plt.plot(dqn_agent.scores)
        plt.plot(np.ones((len(dqn_agent.scores),1))*100)
        plt.show()

Resulting DQN Model

We now take a closer look at the DNN underlying the DQN model we have trained.

In [ ]: dqn_agent.model.summary()
        print(dqn_agent.model.get_weights())

In [ ]: ## install pydot in your environment using Anaconda Navigator, or you can skip this step
        ## by commenting out the commands below!
        ## Plot model graph
        #keras.utils.plot_model(dqn_agent.model, show_shapes=True, show_layer_names=True, to_file='model.png')
        #from IPython.display import Image
        #Image(retina=True, filename='model.png')

Visual test

We save our trained model to a file for reuse, which makes a lot of sense given the time spent on training. Then, we illustrate the performance of the model visually. Note that because we have cut the training short to save time, the result is rather sub-optimal.

In [ ]: # we save the trained model
        dqn_agent.save_model('dqn_train2')

In [ ]: # A few episodes to watch the behaviour
        env = gym.make('CartPole-v1')
        dqn_agent_test = CartPoleDQN(environment=env)
        dqn_agent_test.load_model('dqn_train2.h5')
        dqn_agent_test.epsilon = 0  # no random movements
        for episode in range(5):
            state = env.reset()
            state = np.reshape(state, [1, dqn_agent_test.nbr_states])  # reshape for keras
            for t in range(150):  # increase to 200 to see termination
                env.render()
                action = dqn_agent_test.act(state)
                state_next, reward, done, info = env.step(action)
                state = np.reshape(state_next, [1, dqn_agent_test.nbr_states])  # reshape for keras
                if done:
                    print("Episode finished after {} timesteps".format(t+1))
                    break
        env.close()

Question 2 (20 pts)

This question aims to motivate you to fully understand, and improve upon, the possibly suboptimal example DQN implementation above.

Hint: a good model is quite similar to what is provided, so minor tweaking should work!

1. How does this DQN differ from classical Q-learning (see e.g. Section 6.5 of the Sutton and Barto RL book)? Clearly identify and discuss similarities as well as differences. In addition, briefly describe the relationship to value iteration and dynamic programming (via Bellman's equation).

2. Improve the performance of the DQN by changing the underlying DNN architecture, the training method, and the epsilon parameter and its decay, which control the exploration vs exploitation trade-off. Interpret and discuss your findings, e.g. what works, what does not work, and why?

3. [Optional, no marks] Do you see a high-level similarity between the epsilon parameter and the diminishing step size in gradient optimisation algorithms?

4. [Optional, no marks] Try different DNN architectures as well as approaches, e.g. A3C or beyond (https://towardsdatascience.com/advanced-reinforcement-learning-6d769f529eb3).

Hint: example implementations from the Web (which may be incorrect):

https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288

https://gym.openai.com/envs/CartPole-v0/

https://github.com/rlcode/reinforcement-learning

Answer as text here

In [ ]: ''' Answer as code here '''

In [ ]:

