
School of Computing and Information Systems

The University of Melbourne

COMP30027, Machine Learning, 2019

Project 2: Short Text Location Prediction

Task: Gain understanding about using textual information to build a location classifier

Due: Stage I: Friday 24 May, 1pm UTC+10

Stage II: Friday 31 May, 1pm UTC+10

Submission: Stage I: Report (PDF) to Turnitin; test output(s) and code to LMS

Stage II: Reviews and Reflection (via Turnitin PeerMark)

Marks: The Project will be marked out of 20, and will contribute 20% of your total mark.

Groups: Groups of 1 or 2, with commensurate expectations for each (see below).

1 Overview

The goal of this Project is to build and critically analyse some supervised Machine Learning methods, with the aim of automatically identifying the location from which a textual message was sent, as one of four Australian cities. Although this is a simplification of the more general problem of geotagging, it is still a very difficult problem, which has been well studied but for which a solution remains elusive.

This aims to reinforce the largely theoretical lecture concepts surrounding learners, data representation, and evaluation, by applying them to a sophisticated problem. You will also have an opportunity to practice your general problem-solving skills, written communication skills, and creativity.

2 Deliverables

1. Stage I: the output(s) of your classifier(s), comprising predictions of labels for the test instances

2. Stage I: one or more programs, written in Python [1], which implement machine learning methods that build the model, make predictions, and evaluate where necessary

3. Stage I: an anonymous written report, of 1,100-1,350 words (for a group of one student) or 2,200-2,400 words (for a group of two students)

4. Stage II: reviews of two reports written by other students, of 200-400 words each

5. Stage II: a written reflection piece, with the same structure as a review

3 Terms of Use

The data has been collected from Twitter (https://www.twitter.com), specifically for this Project. According to Twitter’s Terms of Service, you cannot share this data for any other purpose, or reproduce it in any form, other than as isolated examples. (Which is to say, you can include a few tweets in your report.)

Please note that the dataset is a sample of actual data posted to the World Wide Web. As such, it may contain information that is in poor taste, or that could be construed as offensive. We would ask you, as much as possible, to look beyond this to the task at hand. (For example, it is generally not necessary to read individual tweets.) The opinions expressed within the data are those of the (anonymised) authors, and in no way express the official views of the University of Melbourne or any of its employees; using the data in an educative capacity does not constitute endorsement of the content contained therein.

If you object to these terms, please contact us (nj@unimelb.edu.au) as soon as possible.

[1] We will waive the Python requirement under certain circumstances.

4 Data

The data files are available via the LMS, and are described in a corresponding README.

Briefly, you will be provided with a set of training tweets, and a set of development tweets. These have been labelled with a “class” according to the location, as one of four Australian cities: Sydney, Melbourne, Brisbane, Perth [2]. There is also a set of test tweets, which will not be labelled, for whose instances you will submit predictions as part of your submission.

Your job is to come up with one or more implemented Machine Learning system(s), which are trained using (a representation of) the training dataset, and evaluated using the development dataset. You will then run the trained classifier over the test dataset, and submit the corresponding predicted labels. Three possible pre-processed data representations have been provided, which you may use or ignore according to your needs.
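As a concrete illustration of this train/develop/predict workflow, the following is a minimal sketch using scikit-learn. The file names and column names ("text", "location") are hypothetical rather than taken from the Project materials; consult the README on the LMS for the actual formats, and treat this baseline as a starting point only, not a recommended system.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.pipeline import make_pipeline

    # Hypothetical file and column names -- see the README for the real formats.
    train = pd.read_csv("train.csv")   # labelled training tweets
    dev = pd.read_csv("dev.csv")       # labelled development tweets
    test = pd.read_csv("test.csv")     # unlabelled test tweets

    # Bag-of-words features plus a linear classifier, as a simple baseline.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train["text"], train["location"])

    # Evaluate on the development set before touching the test set.
    dev_accuracy = accuracy_score(dev["location"], model.predict(dev["text"]))
    print(f"development accuracy: {dev_accuracy:.3f}")

    # Predicted labels for the unlabelled test tweets, to be submitted.
    pd.DataFrame({"prediction": model.predict(test["text"])}).to_csv(
        "test_predictions.csv", index=False)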

5 Assessment

The Project will be marked out of 20, and is worth 20% of your overall mark for the subject. The mark breakdown will be:

Ranking of your best-performing classifier    3 marks
Report                                       12 marks
Reviews                                       3 marks
Reflection                                    2 marks
TOTAL                                        20 marks

The report will be marked according to the accompanying rubric; the details of the Stage II assessment will be announced via the LMS.

The mark for the system ranking will be calculated by first determining the accuracy of each set of predictions for every group. We will then apply equal-frequency binning of the systems in the final system ranking, and assign a score to each group based on the output which occurs in the highest-ranking bin. This procedure will be applied separately for groups of 1 member, and groups of 2 members. We may assign a bonus mark to remarkable submissions.
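For intuition only, the sketch below illustrates equal-frequency (quantile) binning over a set of invented system accuracies. The number of bins, the marks per bin, and the accuracy values are all hypothetical; the actual procedure and parameters are as described above.

    import pandas as pd

    # Hypothetical best-system accuracies for eight groups (invented numbers).
    accuracies = pd.Series([0.31, 0.34, 0.29, 0.40, 0.36, 0.33, 0.38, 0.27])

    # Equal-frequency (quantile) binning: each bin holds roughly the same number
    # of systems, and a higher bin corresponds to a higher ranking mark.
    bins = pd.qcut(accuracies, q=4, labels=[0, 1, 2, 3])
    print(pd.DataFrame({"accuracy": accuracies, "ranking mark": bins}))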

Since all of the tweets exist on the World Wide Web, it is inconvenient but possible to “cheat” and identify some of the author locations for the test tweets using non-Machine Learning methods. If there is any evidence of this, the system ranking will be ignored, and you will instead receive a mark of 0 for this component. The code will not be directly assessed, but if you do not submit it, it will be assumed that you are attempting to circumvent the Machine Learning requirement, and you will receive a mark of 0 for the system ranking component.

6 Submission

All submissions will be via the LMS. Stage I submissions will open one week before the due date. Stage II submissions will be open as soon as the reports are available, immediately following the Stage I submission deadline.

[2] Obviously, there were other cities that could also have been included. However, the problem is already very difficult, and having more classes would have made it more so. Please do not read into the absence of various large urban areas; there is nothing inherently significant about the choice of class set.

7 Individual vs. Two-Person Participation

You have the option of participating as a group of one member, or in a group of two. In the case that you opt to participate individually, you will be required to enter the predictions (and accompanying code to generate them) for at least 1, and up to 4, distinct systems. Groups of two will be required to enter at least 3 and up to 4 distinct systems, of which one must be either a semi-supervised system, or a stacked ensemble system [3]; a rough illustrative sketch of the latter follows the table below. The report length requirement also differs, as detailed below:

Group size    Distinct system submissions required    Report length
1             1-4                                     1,100-1,350 words
2             3-4                                     2,200-2,400 words
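As flagged above, the following is only a rough sketch of the general idea of a stacked ensemble, using scikit-learn's built-in StackingClassifier. Per footnote [3], scikit-learn's formulation differs from how stacking is defined in this subject, so a compliant submission will require your own non-trivial implementation; the base learners and meta-classifier below are arbitrary choices for illustration.

    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Arbitrary base learners, chosen purely for illustration.
    base_learners = [
        ("nb", MultinomialNB()),
        ("svm", LinearSVC()),
    ]

    # The meta-classifier is trained on cross-validated outputs of the base
    # learners (scikit-learn's formulation, not necessarily this subject's).
    stack = StackingClassifier(estimators=base_learners,
                               final_estimator=LogisticRegression(max_iter=1000))
    # stack.fit(X_train, y_train); predictions = stack.predict(X_test)
    # where X_train / X_test are feature matrices such as bag-of-words counts.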

If you wish to form a two-student group, please contact us by Friday 10 May, indicating this. Note that once you have signed up for a given group, you will not be allowed to change groups. If you do not contact us, we will assume that you will be participating as an individual, even if you were in a two-student group for Project 1.

8 Report

The report should be 1,100-1,350 words (groups of one student) or 2,200-2,400 words (groups of two students) in length (±10%), and will include:

1. a basic description of the task;

2. a short summary of some related literature;

3. a conceptual description of what you have done, including any learners that you have used, or features that you have engineered [4];

4. an evaluation of your classifier(s) over the development tweets;

You should also aim to have a more detailed discussion, which:

5. contextualises the behaviour of the method(s), in terms of the theoretical properties we have identified in the lectures;

6. attempts some error analysis of the method(s) (a brief illustrative sketch follows this list);

7. and, summarises the principal conclusions — which is to say, what a reasonably-informed reader will have learned from your efforts;

And don't forget:

8. a bibliography, which includes related work from your literature summary.
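For items 4 and 6, one possible (purely illustrative) starting point is a per-class breakdown and confusion matrix over the development tweets. The names model, dev, and the column names below come from the hypothetical sketch in Section 4, not from the Project materials.

    from sklearn.metrics import classification_report, confusion_matrix

    # "model" and "dev" are from the hypothetical Section 4 sketch.
    dev_pred = model.predict(dev["text"])
    labels = sorted(dev["location"].unique())

    # Per-class precision/recall/F1, and which cities are confused with which.
    print(classification_report(dev["location"], dev_pred, labels=labels))
    print(confusion_matrix(dev["location"], dev_pred, labels=labels))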

Note that we are more interested in seeing evidence of you having thought about the task and determined reasons for the relative performance of different methods, rather than the raw scores of the different methods you select. This is not to say that you should ignore the relative performance of different runs over the data, but rather that you should think beyond simple numbers to the reasons that underlie them.

We will provide LaTeX and RTF style files that we would prefer that you use in writing the report. Reports are to be submitted in the form of a single PDF file. If a report is submitted in any format other than PDF, we reserve the right to return the report with a mark of 0.

Your name and student ID should not appear anywhere in the report, including the metadata (filename, etc.).

[3] scikit-learn provides some support for both of these concepts, but in a form that is ultimately quite different to how they are defined in the context of this subject. Consequently, there will be a non-trivial implementation component.

[4] Again, this should be at a conceptual level; a detailed description of the code is not appropriate for the report.

9 Reviews

During the reviewing process, you will read two submissions by other students. This is to help you contemplate some other ways of approaching the Project, and to ensure that students get some extra feedback. For each paper, you should aim to write 200-400 words total, responding to three “questions”:

• Briefly summarise what the author has done

• Indicate what you think that the author has done well, and why

• Indicate what you think could have been improved, and why

10 Changes/Updates to the Project Specifications

We will use the LMS to advertise any (hopefully small-scale) changes or clarifications in the Project specifications. Any addenda made to the Project specifications via the LMS will supersede information contained in this version of the specifications.

11 Late Submissions

Late submissions will seriously disrupt the reviewing process. You are strongly encouraged to submit by the date and time specified above. If circumstances do not permit this, then the marks will be adjusted as follows:

For each business day (or part thereof) that the report is submitted after the due date (and time) specified above, 10% will be deducted from the marks available, up until 5 business days (1 week) have passed, after which regular submissions will no longer be accepted. A late report submission will mean that your report might not participate in the reviewing process, and so you will probably receive less feedback.

There is no mechanism by which the reviews may be uploaded to the system after the deadline; consequently, it is a major hassle to accept late submissions. Any late submission of the reviews will incur a 50% penalty (i.e. 1.5 of the 3 marks), and will not be accepted more than a week after the reviewing deadline.

The reflective task will largely be nonsensical to attempt after the deadline. We will reluctantly accept late submissions at a 50% penalty (1 of the 2 marks) up until a week after the task deadline.

12 Academic Honesty

While it is acceptable to discuss the Project in general terms, excessive collaboration with students outside of your group is considered cheating. We will be vetting system submissions for originality, and will invoke the University’s Academic Misconduct policy (http://academichonesty.unimelb.edu.au/policy.html) where either inappropriate levels of collaboration or plagiarism are deemed to have taken place.

13 Important Dates

Release of training and development data              1 May 2019
Deadline for group registration                       10 May 2019
Deadline for submission of results over test data     24 May 2019 (1:00pm)
Deadline for submission of written report             24 May 2019 (1:00pm)
Deadline for submission of reviews                    31 May 2019 (1:00pm)
Deadline for submission of reflection                 31 May 2019 (1:00pm)

