
CEG 5301 Assignment 5

AY 23/24 Sem 2

Assignment 5: Q-Learning, Double Q-learning, and DQN

Due: Apr. 19th, 2024, 23:59

Question 1. Consider a reinforcement-learning system with two states (namely, s1 and s2) and two actions (namely, a1 and a2). Suppose that a Q-learning trial has been conducted with the agent transitioning through the following state sequence by taking the actions as indicated below:

where the number following the action above an arrow indicates the reward received by the agent upon taking that action. For instance, the first arrow implies that the agent at state s1 takes action a1, which results in the agent remaining in state s1 and receiving a reward of 1. Complete the table below to show the values of the Q-function at the end of each action taken by the agent during the trial. For instance, the value of Q(s1, a1) is to be entered in the top-left empty cell of the table shown. Assume that the initial values of the Q-function are 0. Use a fixed learning rate of α = 0.5 and a discount rate of γ = 0.5.

Note: Show your detailed calculation steps for obtaining these Q-function values. There are four actions for this trial, so your answer should include four such tables, one for each action taken.
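As a quick reference for the calculation steps, the standard tabular Q-learning update with learning rate α and discount rate γ is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right].$$

Applied to the first transition described above (s1 takes a1, remains in s1, receives reward 1, with all Q-values initially 0, α = 0.5, and γ = 0.5), it gives

$$Q(s_1, a_1) \leftarrow 0 + 0.5 \left[ 1 + 0.5 \max\{Q(s_1, a_1), Q(s_1, a_2)\} - 0 \right] = 0.5 \left[ 1 + 0.5 \cdot 0 \right] = 0.5.$$

The remaining three updates follow the same pattern.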

Question 2 (Programming). Consider the grid world shown in the following figure. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. If the agent takes an action that would move it off the grid, it remains in its current position instead. The reward is -1 on all transitions except those into the black region. Stepping into the black region incurs a reward of -100 and sends the agent instantly back to the start.

1. Implement Q-learning and SARSA on this task, with exploration probability ϵ = 0.1, step size α = 0.1, and discount factor γ = 1. Choose the number of episodes sufficiently large (e.g., 500) so that a stable policy is learned. (A minimal sketch of the two update rules is given after the note below.)

• Plot a figure with two curves that show the “Sum of rewards during the episode” against “Episodes” for Q-learning and SARSA respectively.

• Plot the learned policy for each method.

Note that a script is given for your convenience, so you can focus on coding the Q-function update (search for TODO in the source code file to locate where you need to complete the code). To obtain a smoother plot, the given script performs 100 independent runs.
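For orientation only, here is a minimal sketch of the two tabular updates; the names q, state, action, reward, next_state, and next_action are assumptions and will differ from those used in the provided script:

import numpy as np

# Hypothetical tabular Q-function: q[state, action] with shape (n_states, n_actions).

def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=1.0):
    # Off-policy: bootstrap with the greedy (max) action in the next state.
    target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (target - q[state, action])

def sarsa_update(q, state, action, reward, next_state, next_action, alpha=0.1, gamma=1.0):
    # On-policy: bootstrap with the action actually chosen by the epsilon-greedy
    # behaviour policy in the next state.
    target = reward + gamma * q[next_state, next_action]
    q[state, action] += alpha * (target - q[state, action])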

2. Now run both Q-learning and SARSA for 500 training episodes using α = 0.1, γ = 1, and an ϵ-greedy policy with ϵ = 0.1 for collecting samples. Then take the greedy policy resulting from the learned Q-function of each method and run another 200 episodes following that policy (that is, stop training after 500 episodes, but test the learned policies of both methods afterward; a sketch of such a test loop is given after the notes below).

• Plot the cumulative rewards for each episode for both SARSA and Q-Learning.

Note that

• Since the greedy policy obtained from the learned Q-function is deterministic, there is no need to perform independent runs to smooth the reward curves after the first 500 episodes.

• If the agent is trained for too few episodes (say, 100 training episodes), the learned policy may not even lead the agent to the Goal.
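One possible way to run the 200 test episodes (a sketch only; env, q, and the step() signature are placeholders for whatever the provided script actually uses):

import numpy as np

def run_greedy_episode(env, q):
    # Follow the greedy policy of a learned tabular Q-function for one episode.
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = int(np.argmax(q[state]))       # greedy action, no exploration
        state, reward, done = env.step(action)  # adapt to the script's step() signature
        total_reward += reward
    return total_reward

# After the 500 training episodes:
# test_returns = [run_greedy_episode(env, q) for _ in range(200)]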

3. Change ϵ to the decaying schedule ϵ = 1/(10t + 1), where t is the episode number, for SARSA. Choose the total number of training episodes to be 500. Work on Task 2 again. Show the policy learned by SARSA and Q-learning, and plot the cumulative rewards for each episode for both SARSA and Q-learning. (A sketch of this schedule follows this item.)
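A minimal sketch of the decaying schedule, assuming the reading ϵ = 1/(10t + 1) with episodes counted from t = 1 (adjust if your hand-out specifies a different formula):

def epsilon_schedule(t):
    # Decaying exploration rate; t is the episode number, assumed to start at 1.
    return 1.0 / (10 * t + 1)

# Example: epsilon_schedule(1) = 1/11 ≈ 0.09, epsilon_schedule(500) ≈ 0.0002.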

In your report, include all the required plots for all three tasks. Comment on your simulation results. Submit your code together with the report.

Question 3 (Programming). Implement a Deep Q-Network (DQN) and a Double DQN (DDQN) to balance an inverted pendulum from the OpenAI Gym environment. In this task there is only one incomplete code file, dqn.py; once it is completed, you should be able to run the train.py file and successfully control the inverted pendulum to keep it upright after training. All other code files are complete.

Figure 1: Balancing a pendulum using DQN, DDQN

Tasks:

1. Fill in the blanks in the dqn.py file marked with TODO to develop the following agents:

a) DQN agent, whose Q-network takes 1 image and the angular velocity (i.e., the state s) together with the torque (i.e., the action a) as input, and outputs the value Q(s, a) ∈ R for the current (s, a) pair.

b) DQN agent, whose Q-network takes 4 consecutive images (i.e., the state s) together with the torque (i.e., the action a) as input, and outputs the value Q(s, a) ∈ R for the current (s, a) pair.

c) DDQN agent, whose Q-network takes 4 consecutive images as input (i.e., the state s) and outputs the values at all actions, that is, a vector Q(s, ·) ∈ R^|A| with each element corresponding to Q(s, a) for an action a.

Note: The observation from the environment we provide in the code consists of four consecutive image frames and a velocity scalar, i.e., observation = [image, velocity], where dim(image) = 4 × 42 × 42 (please refer to the step function in pendulum.py and the GenerateFrame42 class in wrappers.py for details). Note that in preprocessing, the RGB channels of the image are combined into one channel in the code. In the basic experimental setup, the Q-network takes the observation and the action torque as input and outputs a Q-value for this state-action pair (please refer to the forward function in dqn.py). A sketch of how the DQN and DDQN targets differ is given after this note.
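For orientation only, a sketch of the two target computations for the architecture in (c), where the network outputs Q(s, ·) over all discrete actions; q_net, target_net, the batched tensor shapes, and gamma = 0.99 are assumptions rather than the names and values used in dqn.py:

import torch

# Assumed shapes: q_net(next_state) and target_net(next_state) return Q-values
# of shape (batch, n_actions); reward and done are float tensors of shape (batch,).

def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the next action.
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

def double_dqn_target(reward, next_state, done, q_net, target_net, gamma=0.99):
    # Double DQN: the online network selects the action, the target network
    # evaluates it, which reduces the overestimation bias of vanilla DQN.
    with torch.no_grad():
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q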

2. Run the train.py file to train a model such that it can keep the inverted pendulum upright. Record a video showing that the inverted pendulum stays upright under each of the 3 developed agents. Submit the completed code and the video.

3. Plot a figure that shows the “mean ep 100 return” against the “used step” as shown in train.py for each agent. Play with the parameters in the train.py file to see how they affect the return curves. Comment on your exploration/discoveries in your accompanying report.

4. In the train.py file, we use an eight-dimensional vector to denote the discrete action space. Explore higher-dimensional vectors (i.e., finer discretizations of the torque) to see whether they can lead to better learning results; one possible way to build such a discretization is sketched after this list.
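A sketch of one way to build a finer torque discretization; the ±2.0 torque limit matches Gym's standard pendulum and the function name is hypothetical, so check pendulum.py and train.py for the actual range and action definition:

import numpy as np

# n_actions = 8 mirrors the eight-dimensional action space mentioned above;
# increase it to explore finer discretizations.
def make_discrete_actions(n_actions=8, max_torque=2.0):
    return np.linspace(-max_torque, max_torque, n_actions)

# Example: make_discrete_actions(16) yields 16 evenly spaced torque values.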

Notes:

• Two example videos are given to you to show the initial output and the desired output for Question 3.

• Submit a report, all completed code files, and an output video.

• You can modify the code freely as long as you complete all the tasks successfully.

• Refer to the Appendix below for tips on coding environment setup, dependency installation, etc.

Submission

• Make sure to submit one PDF report, one MP4 video, and all the Python code files. Place all your files in a folder and compress it into a zip file. Submit only one zip file for this assignment.

• Naming convention: Assignment5 YourName.zip

• Submission deadline: Apr. 19th, 2024, 23:59.

Appendix

Working Environment:

• We recommend using VS Code/PyCharm + Anaconda as your development environment, on Windows, Mac, or Linux. See the Anaconda website for installation instructions.

Installing Dependency:

• You can use Anaconda to create a Python 3 environment with the required packages listed in environment.yml:

cd DQN_DIRECTORY
conda env create -f environment.yml

• If Anaconda raises error messages, you can instead install the required Python 3 packages manually. Run the following command in CMD on Windows or a shell on Linux or macOS:

pip3 install torch pygame gym==0.26.1 opencv-python

Test Your Built Environment (Dependency):

• When testing the built environment, you can let the code idle by running the following commands in the terminal:

cd DQN_DIRECTORY
python3 train.py --idling

• If there is no error, you have installed all dependencies successfully. You can then proceed to fill in the blanks in the dqn.py file marked with TODO.

How to use:

• After completing all the blanks in dqn.py, you can run train.py either in VS Code/PyCharm or in a terminal with the following commands:

cd DQN_DIRECTORY
python3 train.py

• The USE GPU part of the code is commented out; you can enable it if you want to use a GPU.

• For more details, please refer to the readme.html file in the DQN folder.




