Date: 2021-10-15 10:20

COMP3702 Artificial Intelligence (Semester 2, 2021)

Assignment 3: DragonGame Reinforcement Learning

Key information:

• Due: 4pm, Monday 1 November

• This assignment will assess your skills in developing algorithms for solving Reinforcement Learning

Problems.

• Assignment 3 contributes 20% to your final grade.

• This assignment consists of two parts: (1) programming and (2) a report.

• This is an individual assignment.

• Both code and report are to be submitted via Gradescope (https://www.gradescope.com/).

• Your program (Part 1, 60/100) will be graded using the Gradescope code autograder, using testcases

similar to those in the support code provided at https://gitlab.com/3702-2021/a3-support.

• Your report (Part 2, 40/100) should fit the template provided, be in .pdf format and named according

to the format a3-COMP3702-[SID].pdf, where SID is your student ID. Reports will be graded by the

teaching team.

The DragonGame AI Environment

“Untitled Dragon Game” [1], or simply DragonGame, is a 2.5D platformer game in which the player must

collect all of the gems in each level and reach the exit portal, making use of a jump-and-glide movement

mechanic, and avoiding landing on lava tiles. DragonGame is inspired by the “Spyro the Dragon” game

series from the original PlayStation. In Assignment 3, actions may again have non-deterministic outcomes,

but in addition, the transition probabilities and reward values are unknown.

To solve a level, your AI agent must explore the environment and determine a policy (mapping from states

to actions) which collects all gems and reaches the exit while incurring the minimum expected cost, which is

equivalent to maximising the expected reward.

DragonGame as a Reinforcement Learning problem

In this assignment, you will write the components of a program to play DragonGame, with the objective of

finding a high-quality solution to the problem using various reinforcement learning algorithms. This assignment

will test your skills in defining reinforcement learning algorithms for a practical problem and understanding of

key algorithm features and parameters.

What is provided to you

We will provide supporting code in Python only, in the form of:

1. A class representing a DragonGame game map and a number of helper functions

2. A parser method to take an input file (testcase) and convert it into a DragonGame map

3. A policy visualiser

4. A simulator script to evaluate the performance of your solution

5. Testcases to test and evaluate your solution

6. A solution file template

[1] The full game title was inspired by Untitled Goose Game, an indie game developed by some Australians in 2019.

The support code can be found at: https://gitlab.com/3702-2021/a3-support. See the README.md for

more details. Autograding of code will be done through Gradescope, so that you can test your submission and

continue to improve it based on this feedback — you are strongly encouraged to make use of this feedback.

Your assignment task

Your task is to develop two reinforcement learning algorithms for computing paths (series of actions) for the

agent (i.e. the Dragon), and to write a report on your algorithms’ performance. You will be graded on both

your submitted program (Part 1, 60%) and the report (Part 2, 40%). These percentages will be scaled

to the 20% course weighting for this assessment item.

The provided support code provides a generative DragonGame environment, and your task is to submit

code implementing both of the following Reinforcement Learning algorithms:

1. Q-learning

2. SARSA

There isn’t an explicit requirement to use a particular learning type for each testcase, but the testcases are

designed to make using a specific type advantageous for that testcase. To achieve separation between Q-learning

and SARSA results, the total reward received during training is tracked in addition to the reward

received during evaluation, with separate reward targets specified for each in the testcases.
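
As an illustration of how the two algorithms differ, the following is a minimal sketch of the standard tabular update rules, not the support code's API. The function names, the dictionary-based Q-table keyed by (state, action), and the default values of alpha and gamma are all assumptions made for this example.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, terminal=False):
    """Off-policy update: bootstrap from the best action in next_state."""
    # Hypothetical Q-table: a dict mapping (state, action) -> estimated value.
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99, terminal=False):
    """On-policy update: bootstrap from the action actually chosen next."""
    next_value = 0.0 if terminal else q[(next_state, next_action)]
    target = reward + gamma * next_value
    q[(state, action)] += alpha * (target - q[(state, action)])


# Toy usage: the same transition yields different targets under each rule.
q_a = defaultdict(float, {("s1", "a"): 1.0, ("s1", "b"): 3.0})
q_b = defaultdict(float, {("s1", "a"): 1.0, ("s1", "b"): 3.0})
q_learning_update(q_a, "s0", "a", -1.0, "s1", ["a", "b"], alpha=0.5, gamma=1.0)
sarsa_update(q_b, "s0", "a", -1.0, "s1", "a", alpha=0.5, gamma=1.0)
```

In the toy usage, Q-learning bootstraps from the higher-valued action "b" in the next state, while SARSA bootstraps from the action "a" that the agent is assumed to take next, so the two Q-tables diverge after a single step.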

Once you have implemented and tested the algorithms above, you are to complete the questions listed in the

section “Part 2 - The Report” and submit the report to Gradescope.

More detail on what is required for the programming and report parts is given below.

Part 1 — The programming task

Your program will be graded using the Gradescope autograder, using testcases similar to those in the support

code provided at https://gitlab.com/3702-2021/a3-support.

Interaction with the testcases and autograder

We now provide you with some details explaining how your code will interact with the testcases and the

autograder (with special thanks to Nick Collins for his efforts making this work seamlessly, yet again!).

First, note that the Assignment 3 version of the class GameEnv (in game_env.py) differs from previous

assignments in that the transition and reward functions are now randomised and unknown to the agent.

The action outcome probabilities (for glide, supercharge, superjump actions and the ladder fall probability)

and costs/penalties (action_cost, collision_penalty, game_over_penalty) are randomised within some

fixed range based on the seed of the filename, and are all stored in private variables. Your agent does not

know these values, and therefore must interact with the environment to determine the optimal policy.

Implement your solution using the supplied solution.py template file. You are required to fill in the following

method stubs:

• __init__()

• run_training()

• select_action()

You may add to the __init__() method if required, and can add additional helper methods and classes (either in

solution.py or in files you create) if you wish. To ensure your code is handled correctly by the autograder,

you should avoid using any try-except blocks in your implementation of the above methods (as this can

interfere with our time-out handling). Also, unlike in the previous assignments, the autograder now

does not allow you to upload your own copy of game_env.py.

Refer to the documentation in solution.py for more details.
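
One common way to balance exploration and exploitation during training is epsilon-greedy action selection. The sketch below is illustrative only: the function name, its arguments and the dictionary Q-table are assumptions, and it does not reproduce the select_action() signature in solution.py.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    # q is a hypothetical dict mapping (state, action) -> estimated value;
    # unseen pairs are treated as having value 0.0.
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q.get((state, a), 0.0) for a in actions)
    # Break ties between equally good actions at random.
    best_actions = [a for a in actions if q.get((state, a), 0.0) == best]
    return rng.choice(best_actions)


# Toy usage with a seeded generator so the run is repeatable.
rng = random.Random(0)
q = {("s", "a"): 1.0, ("s", "b"): 2.0}
greedy_choice = epsilon_greedy(q, "s", ["a", "b"], epsilon=0.0, rng=rng)
explore_choice = epsilon_greedy(q, "s", ["a", "b"], epsilon=1.0, rng=rng)
```

With epsilon set to 0 the function always returns a highest-valued action, which is the behaviour typically wanted at evaluation time; larger values of epsilon trade training reward for exploration.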

Grading rubric for the programming component (total marks: 60/100)

For marking, we will use five different testcases of ascending level of difficulty to evaluate your solution.

There will be a total of 60 code marks, consisting of:

• 20 Threshold Marks

– Program runs without errors (+5 marks)

– Program approximately solves at least 1 testcase within 2× time limit (+7.5 marks)

– Program approximately solves at least 2 testcases within 2× time limit (+7.5 marks)

• 40 Testcase Marks

– 5 testcases worth 8 marks each

– A maximum of 8 marks for each testcase, with deductions for exceeding the time limit or for

falling short of the training and evaluation reward targets, proportional to the size of the

overrun or shortfall

– The code used to compute your score is in simulator.py

– Program will be terminated after 2× time limit has elapsed

Part 2 — The report

The report tests your understanding of Reinforcement Learning and the methods you have used in your code,

and contributes 40/100 of your assignment mark.

Question 1. Q-learning is closely related to the Value Iteration algorithm for Markov decision processes.

a) (5 marks) Describe two key similarities between Q-learning and Value Iteration.

b) (5 marks) Give one key difference between Q-learning and Value Iteration.

For Questions 2, 3 and 4, consider testcase a3-t5.txt, and compare Q-learning and SARSA.

Question 2.

a) (5 marks) With reference to Q-learning and SARSA, explain the difference between off-policy and

on-policy reinforcement learning algorithms.

b) (5 marks) How does the difference between off-policy and on-policy algorithms affect the way in which

Q-learning and SARSA solve testcase a3-t5.txt? Give an example of an expected difference between

the way the algorithms learn a policy.

For questions 3 and 4, you are asked to plot the solution quality at each episode, as given by the 50-step

moving average reward received by your learning agent. At time step t, the 50-step moving average reward is

the average reward earned by your learning agent in the episodes [t − 50, t], including episode restarts. If the

Q-values imply a poor quality policy, this value will be low. If the Q-values correspond to a high-value policy,

the 50-step moving average reward will be higher. We are using a moving average here because the reward is

received only occasionally and there are sources of randomness in the transitions and the exploration strategy.
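
The moving average described above can be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, the input is taken to be a list of total rewards per episode, and the window is the most recent `window` episodes (shorter at the start of training, when fewer episodes are available).

```python
def moving_average(episode_rewards, window=50):
    """Return the trailing moving average of per-episode rewards."""
    averages = []
    running_sum = 0.0
    for i, reward in enumerate(episode_rewards):
        running_sum += reward
        if i >= window:
            # Drop the episode that has just fallen out of the window.
            running_sum -= episode_rewards[i - window]
        # Early on, average over the episodes seen so far.
        averages.append(running_sum / min(i + 1, window))
    return averages


# Toy usage with a window of 2 episodes.
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
```

The returned list has one entry per episode, so it can be plotted directly against episode number.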

Question 3.

a) (5 marks) Plot the quality of the policy learned by Q-learning in testcase a3-t5.txt against episode

number for three different fixed values of the learning_rate (which is called α in the lecture notes and

in many texts and online tutorials), as given by the 50-step moving average reward (i.e. for this question,

do not adjust α over time, rather keep it the same value throughout the learning process). Your plot

should display the solution quality up to an episode count where the performance stabilises, with a

minimum of 2000 episodes (note the policy quality may still be noisy, but the algorithm’s performance

will stop increasing and its average quality will level out).

b) (5 marks) With reference to your plot, comment on the effect of varying the learning_rate.

Question 4.

a) (5 marks) Plot the quality of the learned policy against episode number under Q-learning and SARSA

in testcase a3-t5.txt, as given by the 50-step moving average reward. Your plot should display the

solution quality up to an episode count where the performance of both algorithms stabilises, with a

minimum of 2000 episodes.

b) (5 marks) With reference to your plot, compare the learning trajectory of the two algorithms, and their

final solution quality. Discuss the way the solution quality of Q-learning and SARSA changes as they

learn to solve the testcase, both as they learn and once they have stabilised.

Academic Misconduct

The University defines Academic Misconduct as involving “a range of unethical behaviours that are designed

to give a student an unfair and unearned advantage over their peers.” UQ takes Academic Misconduct very

seriously and any suspected cases will be investigated through the University’s standard policy

(https://ppl.app.uq.edu.au/content/3.60.04-student-integrity-and-misconduct). If you are found guilty,

you may be expelled from the University with no award.

It is the responsibility of the student to ensure that you understand what constitutes Academic Misconduct

and to ensure that you do not break the rules. If you are unclear about what is required, please ask.

It is also the responsibility of the student to take reasonable precautions to guard against unauthorised access

by others to his/her work, however stored in whatever format, both before and after assessment.

In the coding part of COMP3702 assignments, you are allowed to draw on publicly-accessible resources and

provided tutorial solutions, but you must make reference or attribution to their source, by doing the following:

• All blocks of code that you take from public sources must be referenced in adjacent comments in your

code.

• Please also include a list of references indicating code you have drawn on in your solution.py docstring.

However, you must not show your code to, or share your code with, any other student under any

circumstances. You must not post your code to public discussion forums (including Ed Discussion)

or save your code in publicly accessible repositories (check your security settings). You must not

look at or copy code from any other student.

All submitted files (code and report) will be subject to electronic plagiarism detection and misconduct proceedings

will be instituted against students where plagiarism or collusion is suspected. The electronic plagiarism

detection can detect similarities in code structure even if comments, variable names, formatting etc. are

modified. If you collude to develop your code or answer your report questions, you will be caught.

For more information, please consult the following University web pages:

• Information regarding Academic Integrity and Misconduct:

– https://my.uq.edu.au/information-and-services/manage-my-program/student-integrity-and-conduct/academic-integrity-and-student-conduct

– http://ppl.app.uq.edu.au/content/3.60.04-student-integrity-and-misconduct

• Information on Student Services:

– https://www.uq.edu.au/student-services/

Late submission

Students should not leave assignment preparation until the last minute and must plan their workloads to meet

advertised or notified deadlines. It is your responsibility to manage your time effectively.

Late submission of the assignment will not be accepted. Assessment items received after the due date will

receive a zero mark unless you have been approved to submit the assessment item after the due date.

In the event of exceptional circumstances, you may submit a request for an extension. You can find guidelines

on acceptable reasons for an extension here:

https://my.uq.edu.au/information-and-services/manage-my-program/exams-and-assessment/applying-extension

All requests for extension must be submitted on the UQ Application for Extension of Progressive Assessment

form at least 48 hours prior to the submission deadline.
