School of Computing and Information Systems
The University of Melbourne
COMP30027, Machine Learning, 2019
Project 2: Short Text Location Prediction
Task: Gain understanding about using textual information to build a location classifier
Due: Stage I: Friday 24 May, 1pm UTC+10
Stage II: Friday 31 May, 1pm UTC+10
Submission: Stage I: Report (PDF) to Turnitin; test output(s) and code to LMS
Stage II: Reviews and Reflection (via Turnitin PeerMark)
Marks: The Project will be marked out of 20, and will contribute 20% of your total mark.
Groups: Groups of 1 or 2, with commensurate expectations for each (see below).
1 Overview
The goal of this Project is to build and critically analyse some supervised Machine Learning methods, with the
aim of automatically identifying the location from which a textual message was sent, as one of four Australian
cities. Although this is a simplification of the more general problem of geotagging, it is still a very difficult
problem: it has been well studied, but a solution remains elusive.
This Project aims to reinforce the largely theoretical lecture concepts surrounding learners, data representation, and
evaluation, by applying them to a sophisticated problem. You will also have an opportunity to practise your
general problem-solving skills, written communication skills, and creativity.
2 Deliverables
1. Stage I: the output(s) of your classifier(s), comprising predictions of labels for the test instances
2. Stage I: one or more programs, written in Python[1], which implement machine learning methods that build
the model, make predictions, and evaluate where necessary
3. Stage I: an anonymous written report, of 1100-1350 words (for a group of one student) or 2200-2400
words (for a group of two students)
4. Stage II: reviews of two reports written by other students, of 200-400 words each
5. Stage II: a written reflection piece, with the same structure as a review
3 Terms of Use
The data has been collected from Twitter (https://www.twitter.com), specifically for this Project. According
to Twitter’s Terms of Service, you cannot share this data for any other purpose, or reproduce it in any form,
other than as isolated examples. (Which is to say, you can include a few tweets in your report.)
Please note that the dataset is a sample of actual data posted to the World Wide Web. As such, it may contain
information that is in poor taste, or that could be construed as offensive. We would ask you, as much as possible,
to look beyond this to the task at hand. (For example, it is generally not necessary to read individual tweets.)
The opinions expressed within the data are those of the (anonymised) authors, and in no way express the official
views of the University of Melbourne or any of its employees; using the data in an educative capacity does not
constitute endorsement of the content contained therein.
If you object to these terms, please contact us (nj@unimelb.edu.au) as soon as possible.
[1] We will waive the Python requirement under certain circumstances.
4 Data
The data files are available via the LMS, and are described in a corresponding README.
Briefly, you will be provided with a set of training tweets, and a set of development tweets. These have been
labelled with a “class” according to the location, as one of four Australian cities: Sydney, Melbourne, Brisbane,
Perth[2]. There is also a set of test tweets, which will not be labelled, for whose instances you will submit
predictions as part of your submission.
Your job is to come up with one or more implemented Machine Learning system(s), which are trained using
(a representation of) the training dataset, and evaluated using the development dataset. You will then run
the trained classifier over the test dataset, and submit the corresponding predicted labels. Three possible preprocessed
data representations have been provided, which you may use or ignore according to your needs.
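As a rough illustration of the train / develop / test workflow described above, the following is a minimal sketch only. The tweets in it are invented toy examples (not from the dataset), the in-memory list format is an assumption (the real file formats are described in the README), and the choice of a multinomial Naive Bayes learner with add-one smoothing over whitespace tokens is purely illustrative, not a recommended or required system.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercased whitespace tokens; a deliberately naive representation.
    return text.lower().split()

def train_nb(instances):
    """Collect multinomial Naive Bayes counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in instances:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, instances):
    correct = sum(predict_nb(model, text) == label for text, label in instances)
    return correct / len(instances)

# Invented toy data standing in for the training, development and test sets.
train = [
    ("stuck on the tram on swanston street", "Melbourne"),
    ("coffee in a laneway near flinders street", "Melbourne"),
    ("ferry to manly past the opera house", "Sydney"),
    ("sunrise run at bondi beach", "Sydney"),
    ("citycat along the river to south bank", "Brisbane"),
    ("sunset at cottesloe beach", "Perth"),
]
dev = [
    ("tram delays on flinders street", "Melbourne"),
    ("opera house from the ferry", "Sydney"),
]
test = ["walking along south bank by the river"]  # unlabelled

model = train_nb(train)                                  # train on training data
dev_accuracy = accuracy(model, dev)                      # evaluate on development data
test_predictions = [predict_nb(model, t) for t in test]  # labels to submit
print(dev_accuracy, test_predictions)
```

The point of the sketch is the separation of roles: the model only ever sees the training labels, the development labels are used only for evaluation, and the test instances receive predictions without any evaluation on your side.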
5 Assessment
The Project will be marked out of 20, and is worth 20% of your overall mark for the subject. The mark
breakdown will be:
Ranking of your best-performing classifier     3 marks
Report                                        12 marks
Reviews                                        3 marks
Reflection                                     2 marks
TOTAL                                         20 marks
The report will be marked according to the accompanying rubric; the details of the Stage II assessment will be
announced via the LMS.
The mark for the system ranking will be calculated by first determining the accuracy of each set of predictions
for every group. We will then apply equal-frequency binning of the systems in the final system ranking, and
assign a score to each group based on the output which occurs in the highest-ranking bin. This procedure will
be applied separately for groups of 1 member, and groups of 2 members. We may assign a bonus mark to
remarkable submissions.
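The ranking procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the marking script: the number of bins, the handling of ties, the system names, and the accuracy values are all hypothetical, since none of them are specified here.

```python
def accuracy(predicted, gold):
    """Fraction of predicted labels that match the gold labels."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def equal_frequency_bins(scores, n_bins):
    """Rank systems by score and split the ranking into n_bins bins of
    (near-)equal size. Bin 0 is the highest-ranking bin. Ties are broken
    arbitrarily by sort order, which is an assumption of this sketch."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {name: (rank * n_bins) // len(ranked)
            for rank, name in enumerate(ranked)}

def best_bin_per_group(bins, owner):
    """Each group is credited with the best (lowest-index) bin reached by
    any of its submitted systems."""
    best = {}
    for system, bin_index in bins.items():
        group = owner[system]
        best[group] = min(bin_index, best.get(group, bin_index))
    return best

# Hypothetical accuracies for six submitted systems from four groups.
scores = {"g1_a": 0.31, "g1_b": 0.28, "g2_a": 0.35,
          "g3_a": 0.26, "g3_b": 0.33, "g4_a": 0.25}
owner = {name: name.split("_")[0] for name in scores}

bins = equal_frequency_bins(scores, n_bins=3)
group_bins = best_bin_per_group(bins, owner)
print(group_bins)
```

In this sketch the binning would be run once for the one-member groups and once for the two-member groups, matching the separate treatment described above.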
Since all of the tweets exist on the World Wide Web, it is inconvenient but possible to “cheat” and identify some
of the locations of the test tweets using non-Machine Learning methods. If there is any evidence of this,
the system ranking will be ignored, and you will instead receive a mark of 0 for this component. The code will
not be directly assessed, but if you do not submit it, it will be assumed that you are attempting to circumvent
the Machine Learning requirement, and you will receive a mark of 0 for the system ranking component.
6 Submission
All submissions will be via the LMS. Stage I submissions will open one week before the due date. Stage II
submissions will be open as soon as the reports are available, immediately following the Stage I submission
deadline.
[2] Obviously, there were other cities that could also have been included. However, the problem is already very difficult, and having
more classes would have made it more so. Please do not read into the absence of various large urban areas; there is nothing inherently
significant about the choice of class set.
7 Individual vs. Two-Person Participation
You have the option of participating as a group of one member, or in a group of two. In the case that you opt to
participate individually, you will be required to enter the predictions (and accompanying code to generate them)
for at least 1, and up to 4 distinct systems. Groups of two will be required to enter at least 3 and up to 4 distinct
systems, of which one must be either a semi-supervised system, or a stacked ensemble system[3]. The report
length requirement also differs, as detailed below:
Group size    Distinct system submissions required    Report length
1             1–4                                     1,100–1,350 words
2             3–4                                     2,200–2,400 words
If you wish to form a two-student group, please contact us
by Friday 10 May, indicating this. Note that once you have signed up for a given group, you will not be allowed
to change groups. If you do not contact us, we will assume that you will be participating as an individual, even
if you were in a two-student group for Project 1.
8 Report
The report should be 1,100-1,350 words (groups of one student) or 2,200-2,400 words (groups of two students)
in length (±10%) and will include:
1. a basic description of the task;
2. a short summary of some related literature;
3. a conceptual description of what you have done, including any learners that you have used, or features
that you have engineered[4];
4. an evaluation of your classifier(s) over the development tweets;
You should also aim to have a more detailed discussion, which:
5. contextualises the behaviour of the method(s), in terms of the theoretical properties we have identified in
the lectures;
6. attempts some error analysis of the method(s);
7. summarises the principal conclusions, which is to say, what a reasonably-informed reader will
have learned from your efforts.
And don’t forget:
8. A bibliography, which includes related work from your literature summary
Note that we are more interested in seeing evidence of you having thought about the task and determined
reasons for the relative performance of different methods, rather than the raw scores of the different methods
you select. This is not to say that you should ignore the relative performance of different runs over the data, but
rather that you should think beyond simple numbers to the reasons that underlie them.
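One common starting point for the error analysis mentioned above is a confusion matrix over the development set, which shows which cities are mistaken for which. The following is a minimal sketch; the gold and predicted labels in it are invented for illustration, and the helper names are hypothetical.

```python
from collections import Counter

LABELS = ["Sydney", "Melbourne", "Brisbane", "Perth"]

def confusion_counts(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))

def per_class_recall(counts, labels):
    """Proportion of each gold class that was predicted correctly."""
    recall = {}
    for g in labels:
        row_total = sum(counts[(g, p)] for p in labels)
        recall[g] = counts[(g, g)] / row_total if row_total else 0.0
    return recall

def format_matrix(counts, labels):
    """Plain-text table: rows are gold labels, columns are predictions."""
    width = max(len(label) for label in labels) + 2
    lines = [" " * width + "".join(label.rjust(width) for label in labels)]
    for g in labels:
        cells = "".join(str(counts[(g, p)]).rjust(width) for p in labels)
        lines.append(g.ljust(width) + cells)
    return "\n".join(lines)

# Invented development-set labels and predictions, for illustration only.
gold = ["Sydney", "Sydney", "Melbourne", "Melbourne", "Brisbane", "Perth"]
predicted = ["Sydney", "Melbourne", "Melbourne", "Melbourne", "Sydney", "Perth"]

counts = confusion_counts(gold, predicted)
recall = per_class_recall(counts, LABELS)
print(format_matrix(counts, LABELS))
print(recall)
```

A table like this is only the beginning of an analysis: the interesting part of the report is explaining why particular pairs of classes are confused, in terms of the properties of the learner and the representation.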
We will provide LaTeX and RTF style files that we would prefer that you use in writing the report. Reports are to
be submitted in the form of a single PDF file. If a report is submitted in any format other than PDF, we reserve
the right to return the report with a mark of 0.
Your name and student ID should not appear anywhere in the report, including the metadata (filename, etc.).
[3] scikit-learn provides some support for both of these concepts, but in ways that are ultimately quite different from how they are defined in the
context of this subject. Consequently, there will be a non-trivial implementation component.
[4] Again, this should be at a conceptual level; a detailed description of the code is not appropriate for the report.
9 Reviews
During the reviewing process, you will read two submissions by other students. This is to help you contemplate
some other ways of approaching the Project, and to ensure that students get some extra feedback. For each
paper, you should aim to write 200-400 words total, responding to three “questions”:
1. Briefly summarise what the author has done
2. Indicate what you think that the author has done well, and why
3. Indicate what you think could have been improved, and why
10 Changes/Updates to the Project Specifications
We will use the LMS to advertise any (hopefully small-scale) changes or clarifications in the Project specifications.
Any addendums made to the Project specifications via the LMS will supersede information contained in
this version of the specifications.
11 Late Submissions
Late submissions will create serious havoc with the reviewing process. You are strongly encouraged to submit
by the date and time specified above. If circumstances do not permit this, then the marks will be adjusted as
follows:
Each business day (or part thereof) that the report is submitted after the due date (and time) specified
above, 10% will be deducted from the marks available, up until 5 business days (1 week) have passed,
after which regular submissions will no longer be accepted. A late report submission will mean that your
report might not participate in the reviewing process, and so you will probably receive less feedback.
There is no mechanism by which the reviews may be uploaded to the system after the deadline; consequently,
it is a major hassle to accept late submissions. Any late submission of the reviews will incur
a 50% penalty (i.e. 1.5 of the 3 marks), and will not be accepted more than a week after the reviewing
deadline.
The reflective task will be largely nonsensical to attempt after the deadline. We will reluctantly accept
late submissions at a 50% penalty (1 of the 2 marks) up until a week after the task deadline.
12 Academic Honesty
While it is acceptable to discuss the Project with others in general terms, excessive collaboration with students outside
of your group is considered cheating. We will be vetting system submissions for originality and will invoke
the University’s Academic Misconduct policy (http://academichonesty.unimelb.edu.au/policy.html)
where either inappropriate levels of collaboration or plagiarism are deemed to have taken place.
13 Important Dates
Release of training and development data             1 May 2019
Deadline for group registration                      10 May 2019
Deadline for submission of results over test data    24 May 2019 (1:00pm)
Deadline for submission of written report            24 May 2019 (1:00pm)
Deadline for submission of reviews                   31 May 2019 (1:00pm)
Deadline for submission of reflection                31 May 2019 (1:00pm)