COM3240: Reinforcement Learning
Computer Science
Spring 2024
Assignment
March 14, 2024
1 Grid world of a “Lava” environment
A search-and-guide robot seeking artifacts navigates a grid world using four possible actions: North, South, East, and West. Its goal in every episode is to learn to guide others to the artifact with the highest value, which is represented as a high reward. However, near the artifacts there is a lava area. If the robot enters that area, it incurs a negative reward as a penalty and is returned to the fixed starting position shown in Figure 1. The robot cannot move off the grid’s edge or through the wall; attempting to do so causes it to rebound to its former location. It is assumed that the robot knows its coordinates in the space through GPS or a similar mechanism, which is not modelled here. The robot is therefore aware of its current coordinates, but at the start it does not know where the artifacts are; it has to find them by searching the area. The experiment consists of multiple episodes over which the robot is allowed to explore its environment until it finds the fixed-position target.
Figure 1: The grid world, where yellow is the starting location, blue represents the targets, red is the lava area, and black represents a wall. The grid world has two configurations: (a) with a single target and deterministic rewards and (b) with dual targets and stochastic rewards.
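To make these dynamics concrete, a minimal sketch of such a grid world in Python/NumPy is shown below. The layout, coordinates and reward values (+1 at the target, -1 in lava) are illustrative assumptions only and do not correspond to the exact coursework grid.

import numpy as np

# Illustrative 5x5 layout (an assumption, not the official coursework grid):
# 'S' start, 'G' goal/artifact, 'L' lava, 'W' wall, '.' free square.
LAYOUT = [
    "....G",
    "...LL",
    ".W...",
    ".W...",
    "S....",
]
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # North, South, East, West as (row, col) offsets

def step(state, action):
    """Apply one action; return (next_state, reward, done).
    Moving off the grid or into the wall leaves the robot where it was;
    reaching the target gives +1, stepping into lava gives -1, and both end the episode."""
    rows, cols = len(LAYOUT), len(LAYOUT[0])
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < rows and 0 <= c < cols) or LAYOUT[r][c] == "W":
        r, c = state  # rebound to the former location
    cell = LAYOUT[r][c]
    if cell == "G":
        return (r, c), 1.0, True
    if cell == "L":
        return (r, c), -1.0, True
    return (r, c), 0.0, False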
Consider the following three scenarios within the lava environment, each presenting progressively greater learning challenges:
1. Single Target, Deterministic Rewards: The environment contains a single artifact associated with a fixed reward. The transition dynamics of the environment are deterministic.
2. Dual Targets with Stochastic Rewards: Here, the environment contains two different artifacts, one of which is more valuable than the other. Due to occasional issues with its vision, the robot sometimes misjudges an artifact’s value (reward). The artifacts are placed at the target locations, and each is assigned a stochastic reward drawn from a Gaussian distribution at the start of each episode. The robot’s goal is to identify and systematically reach the target offering the larger expected reward. The transition dynamics of the environment are, again, deterministic.
3. Stochastic Transitions and Rewards: Building on the second scenario, this setting accounts for the fact that the robot sometimes slips; for instance, it takes the North action but remains in the same position. There are therefore stochastic transition dynamics on top of the stochastic rewards. The robot’s goal remains to find and reach the target with the higher expected reward. (A sketch of one way to model these stochastic elements is given below.)
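The following is a minimal sketch of how the stochastic elements of scenarios 2 and 3 might be realised. The Gaussian means and standard deviations and the slip probability are illustrative assumptions, not values prescribed by the assignment.

import numpy as np

rng = np.random.default_rng(0)

# Scenario 2: each target's reward is drawn from its own Gaussian at the start of
# every episode (the means and standard deviations here are assumptions for illustration).
TARGET_REWARDS = {"A": (5.0, 2.0), "B": (3.0, 0.5)}  # target -> (mean, std)

def sample_episode_rewards():
    return {t: rng.normal(mu, sigma) for t, (mu, sigma) in TARGET_REWARDS.items()}

# Scenario 3: with some probability the robot "slips" and stays where it is,
# regardless of the chosen action (the slip probability is likewise an assumption).
SLIP_PROB = 0.1

def slippery_step(state, action, step_fn):
    """Wrap a deterministic step function (such as step() above) with slip dynamics."""
    if rng.random() < SLIP_PROB:
        return state, 0.0, False   # the action has no effect this time step
    return step_fn(state, action)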
The assignment focuses on reinforcement learning algorithms that allow the agent to explore the environment and learn the most rewarding routes to the target in the three scenarios above, under the following setup:
1. At the start of each episode, the robot is placed at a fixed location (square) within the grid world.
2. The robot should aim to learn the location of the most precious artifact, i.e. to achieve the maximum theoretical reward. This means that if two rewards are present, it should learn to go to the higher one.
3. A positive reward is given when the robot reaches a target location, with the reward structure varying according to the scenario (deterministic or stochastic rewards).
4. The episode ends when the target is reached, when a predefined number of steps is exceeded, or when the robot steps into a lava square. A new episode then begins from step 1 and runs until one of the three terminating conditions is met. The rewarded position remains consistent across episodes.
5. If the algorithm works correctly, you should observe a decrease in the number of steps required for the robot to find the artifact (reward) across episodes.
6. At the end of a predefined number of episodes, the goal locations change, and a new set of episodes begins. We say that one run is completed. Repeat several times (at least 50 times, meaning 50 runs) to reduce stochasticity in the averages across runs. The exact number of episodes required to learn the location of the most precious artifact should be explored, but it is expected to be more than 1000.
7. Your implementations must include both SARSA and Q-Learning, with an appropriate exploration algorithm such as ϵ-greedy (a minimal sketch of both update rules with ϵ-greedy exploration is given below).
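As a rough guide, a tabular sketch of the two update rules and ϵ-greedy exploration is shown below. It assumes a Q-table such as Q = np.zeros((rows, cols, 4)), indexed by (row, col) state tuples; the learning rate, discount factor and exploration rate are placeholders that you will need to tune yourself.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise a greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[-1]))
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.95):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.95):
    """On-policy update: bootstrap from the action actually chosen in the next state."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])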
1.1 Task 1: Deterministic Rewards
Please produce the following figures/results for the report, accompanied by appropriate discussion. Please note that the figures should depict the performance of both SARSA and Q-Learning.
1. A figure of your learning curves. These should show the cumulative reward collected and/or the number of steps taken per episode, plotted against the episode number and averaged across runs.
2. Explain your selection of model hyperparameters and contextualise their importance in shaping the learning algorithm’s performance.
3. Discuss and provide evidence on how the selection of the exploration parameter affects performance. Use a comparable figure to demonstrate your point.
4. Plot the preferred movement direction at each grid location after training, ensuring the method clearly depicts these directions (one way to do this is sketched after this list).
5. Change your policy to “Softmax” (or, if you have chosen “Softmax” as your original policy, change to “epsilon-greedy”). Compare the different policies and their effects on learning; discuss and explain the different behaviours observed with each. Provide evidence for your claims in the form of a figure or table (a sketch of softmax action selection is also given after this list).
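For point 4 above, one possibility is a quiver-style plot of the greedy action in each cell. The sketch below assumes a Q-table of shape (rows, cols, 4) with actions ordered North, South, East, West; it is an illustration, not a required method.

import numpy as np
import matplotlib.pyplot as plt

def plot_greedy_directions(Q):
    """Draw an arrow in each grid cell pointing along the greedy (preferred) action.
    Assumes Q has shape (rows, cols, 4) with actions ordered North, South, East, West."""
    rows, cols, _ = Q.shape
    dx = np.array([0, 0, 1, -1])            # arrow x-components for N, S, E, W
    dy = np.array([1, -1, 0, 0])            # arrow y-components for N, S, E, W
    greedy = np.argmax(Q, axis=-1)          # shape (rows, cols)
    col_idx, row_idx = np.meshgrid(np.arange(cols), np.arange(rows))
    x = col_idx
    y = (rows - 1) - row_idx                # flip rows so row 0 is drawn at the top
    plt.quiver(x, y, dx[greedy], dy[greedy], pivot="middle")
    plt.title("Preferred action per grid cell")
    plt.show()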
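For point 5, a minimal softmax (Boltzmann) action-selection sketch is shown below; the temperature value is an assumption and should be tuned like any other hyperparameter.

import numpy as np

rng = np.random.default_rng(0)

def softmax_action(Q, state, tau=1.0):
    """Sample an action with probability proportional to exp(Q(state, a) / tau).
    A high temperature tau gives near-uniform exploration; a low tau approaches greedy selection."""
    prefs = Q[state] / tau
    prefs = prefs - prefs.max()        # subtract the maximum for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))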
1.2 Task 2: Dual Targets, Stochastic Rewards
1. Implement the reinforcement learning algorithms on the environment with two reward locations.
2. Again, you will need to optimise parameters for your new environment. Provide evidence of how you performed this optimisation.
3. Plot performance curves and compare them to the earlier cases.
4. Discuss the challenges presented by the addition of a second reward location and stochastic reward values for learning.
5. Consider (and ideally implement) modifications that aid the algorithm in maximising reward, despite the challenges identified above. Explain why these algorithmic modifications are beneficial for overcoming the scenario’s difficulties.
1.3 Task 3: Stochastic Transitions and Rewards
Apply the algorithm to this more challenging scenario. Re-optimise your parameters. How does the performance compare to the previous cases? Why do you think this is the case? Explain any changes you have implemented to improve performance.
1. Implement the reinforcement learning algorithms on the environment with two reward locations and stochastic transition probabilities.
2. Again, you will need to optimise parameters for your new environment. Provide relevant evidence of your optimisation methodology, e.g. in the form of a plot, table, etc.
3. Plot performance curves and compare them to the earlier cases.
4. Discuss what problems the addition of stochastic transitions, plus stochastic reward values, present for learning.
5. Consider (and ideally implement) any modifications that will help the algorithm avoid the problems listed above and maximise reward. Explain why you think these algorithmic modifications would help overcome the difficulties of this scenario.
2 Report
This is an individually written report of a scientific standard, i.e. in a journal-paper-like format and style. It is recommended that you use the web to find relevant journal papers and mimic their structure. Results should be clearly presented and justified. Figures should have captions, legends and readable axis labels. Your report should NOT exceed 6 pages (excluding Appendix, References and Cover Page). Additional pages will be ignored. Two-column format is recommended. The minimum font size is 11pt. Kindly note that the readability and clarity of your document play a major role in the assessment.
Your report should open with a brief introduction to reinforcement learning and the tasks being considered here. Define the update rules that you will use in the assignment and give a description of the algorithms. Brevity is key: stick to short, clear explanations of the critical points.
Including heading numbers for each of the environments will aid the clarity of your report. The piece as a whole should fit together as part of a larger narrative of how the algorithm you have programmed has evolved to cope with the increasing demands of the more complex environments.
In the report you should include:
1. A response to all points requested by the assignment (including graphs and explanations). It is suggested to adopt a similar numbering scheme to make clear that you have responded to all questions.
2. An Appendix with snippets of your code referring to the algorithm implementations, with comments/explanations.
3. A description of how your results can be reproduced; see also the “Important Note” below.
Important Note: Please make sure that your results are reproducible. Together with the assignment, please upload your code, well commented and with sufficient information (e.g. a README file), so that we can easily test how you produced the results. If your results are not reproducible by your code (or you have not provided sufficient information on how to run your code in order to reproduce the figures), the assessment cannot receive full points. If no code is uploaded, the submission is considered incomplete and will not be marked.
3 Programming Language and Restrictions
Python is the only programming language permitted for this assignment, and NumPy and Matplotlib are the only permitted libraries. The use of any other languages or libraries is strictly prohibited.
The assignment will be distributed and submitted via GitHub Classroom, which provides a template for your work. For consistency and to safeguard your progress, it is strongly recommended to use this platform throughout the code development process, rather than exclusively at the conclusion. This approach ensures that you have a comprehensive history of your development efforts and facilitates easier troubleshooting and revision.
4 Access the Assignment Template
Follow the steps below to access and complete your assignment. Given the assignment’s complexity, you are encouraged to submit your work at intermediate steps. Choose between using the Command Line or GitHub Desktop, based on your comfort level with Git.
1. Click on the assignment link provided by your instructor: https://classroom.github.com/a/kZEV9AJR.
2. You will be prompted to link your GitHub account to your name on a manual roster. Find your name on the list to proceed. Contact your instructor if you cannot find your name.
3. After linking your name, accept the assignment. GitHub Classroom will then set up your personal repository.
Then you have two options, depending on your preferred working style.
Option 1: Using the Command Line
1. Clone Your Repository: Open your command line or terminal. Clone the repository using:
git clone <your-repository-URL>
Replace <your-repository-URL> with the actual URL from GitHub.
2. Work on Your Assignment: Navigate to the project folder and start working. Remember to save your changes locally.
3. Submit Your Intermediate Steps: As you make progress, periodically submit your work using:
git add .
git commit -m "Intermediate submission"
git push
Option 2: Using GitHub Desktop
1. Install GitHub Desktop: Download and install it from https://desktop.github.com/ if you have not already done so.
2. Clone Your Repository: Open GitHub Desktop, navigate to File > Clone Repository, and choose your assignment repository. Click Clone.
3. Work on Your Assignment: Open the repository folder on your computer. Complete your assignment using your preferred editor or IDE.
4. Submit Your Intermediate Steps: Commit changes with a descriptive message and push them back to GitHub periodically by clicking Commit to main followed by Push origin.
You can push multiple times, but only the latest version of your solution will be considered. Please keep your submissions on the main/master branch. Remember to push your work after committing it; otherwise, we will not have access to your latest solution.
5 Marking
Assignments will be marked with the following breakdown: results and discussions contribute up to 70%, scientific presentation and code documentation up to 15%, and originality in modelling the task up to 15%.
A mark greater than 39% indicates an understanding of the basic concepts covered in this course. A mark greater than 69% indicates a deep knowledge of the concepts covered in this course, with evidence of independent thinking or engagement beyond the level of material directly covered during lectures/laboratory sessions.
To maximise your mark, make sure you explain your model and that results are supported by appropriate evidence (e.g. plots with captions, scientific arguments). Figures should be combined wherever appropriate to show a clear message. Any interesting additions you may wish to employ should be highlighted and well explained. Your figures should be interpretable from the caption and the image alone, though the discussion will be focussed in the main text. Your axis labels, axis ticks, and any data plotted should all be interpretable if printed on A4.
6 Submission
The deadline for uploading the assignment to the Virtual Learning Environment (Blackboard) is Friday, 17th May 2024, at 23:59. The assignment submission should be in PDF format. Your code submission must include detailed documentation (e.g., a README file) to facilitate the reproduction of your results, specifying which files to run. Additionally, you are required to have pushed both intermediate and final versions of your code to the GitHub Classroom.
7 Plagiarism and Collusion
You are permitted to use Python code developed for the lab as a basis for the code you produce for this assignment, with appropriate acknowledgement. This will not affect your mark. You may discuss this assignment with other students, but the work you submit must be your own. Do not share any code or text you write with other students. Credit will not be given to material that is copied (either unchanged or minimally modified) from published sources, including web sites. Any sources of information that you use should be cited fully. Please refer to the guidance on “Plagiarism, Collusion & Unfair Means” in the Undergraduate or MSc Handbooks [1, 2]. Note that for the links to work you need to have logged in on MUSE.