• QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2020-11-15 08:29

F20ML 2020-21: Coursework 2 [25%]

In this coursework, we will look at three different topics: effectively preprocessing and feature

engineering a text corpus, implementing the Perceptron Algorithm and implementing the

Gradient Descent algorithm for a Linear Classifier in order to perform binary classification.

The task we will attempt is sentiment classification of sentences from movie reviews, using a

classic corpus by Pang and Lee (2004) .


The coursework consists of two parts: i) this description, ii) an accompanying folder with

the dataset and several .py files containing the necessary scaffolding for you to implement

the missing parts in order to answer the questions below. You will NOT have to submit a

document report in .pdf or .doc but instead you will enter your responses (upload the

completed .py files in a zipped folder and a couple of plot images) in the

corresponding Test created on Vision. Read the instructions in the last Section

‘Deliverables’ below for more details.

Note: We will use the existing implementations for text preprocessing and vectorisation from

the Python library scikit-learn, and we will implement the Perceptron and Gradient Descent

Algorithms from scratch.


You have just started working in the advertisement department of a media conglomerate as

a Data Science Consultant. One of your team’s jobs is to sieve through reviews written by

users about TV shows, movies, and radio broadcasts on your online platform, and moderate

them based on polarity (positive/negative sentiment). So far the team has been doing this

job...manually! :(

The data scientist in you cannot tolerate this anymore…

You decide to take some action and build a classifier that reads each review sentence by

sentence and assigns a positive or negative class. Hm… that sounds familiar! You

immediately think of one of those classic datasets from Pang and Lee on sentiment

classification (movie reviews from rotten tomatoes) and download it.

...you have already started your favourite IDE and have imported scikit-learn and numpy!

1 Bo Pang and Lillian Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity

Summarization Based on Minimum Cuts, Proceedings of the ACL, 2004.

Structure of the Code

We strongly suggest you create a project via Spyder IDE or PyCharm. The structure of the

code should look like this:

a) The main entry point (where the __main__ block of code is implemented) is inside

sentiment_classifier.py. There we get to load the various dataset splits via

load_dataset(), create different vector representations of our inputs via

compute_features(), create an instance of our own perceptron algorithm

implementation (Perceptron class in models/perceptron.py), train via

Perceptron.train() and predict via Perceptron.predict() with it, as well

as create an instance of our own gradient descent algorithm for a linear model using

the squared loss and L2-norm regulariser, train via GD.train() and predict via


b) The dataset reader class ReviewsDataset is inside

readers/reviews_dataset.py. This is very similar (actually simpler) to what

we’ve seen in the previous coursework.

c) Our implementation of the perceptron algorithm goes inside the class Perceptron

of models/perceptron.py (surprise, surprise!). You will need to finish the

implementation of Perceptron.train() and Perceptron.predict(); see in

Part II how.

d) Our implementation of the Gradient Descent Algorithm goes inside the class GD of

models/gd.py. You will need to finish the implementation of GD.train() and

GD.predict(). It’s going to be very similar to Perceptron; Part III explains how.

Part I - Feature Engineering [7 marks]

The dataset (train/dev/test splits) should already be in place inside the source code project

directory data/. A short description of the dataset can be found here.

Load the data inside __main__ in sentiment_classifier.py using the dataset loader

provided for you (function load_dataset() in sentiment_classifier.py). Have a

look at the object structure by printing out a few examples, inspecting with the debugger and

by looking at the code (ReviewsDataset in readers/reviews_dataset.py). It

should consist of three arrays: X, X_truecased and y.

- X is a list of examples. Each example corresponds to another list containing the

words of every sentence, lower-cased.

- X_truecased is identical to X but the sentences have been “true-cased”

automatically, i.e., we have tried to recover the original case of words (e.g., beginning

of sentence, proper names, etc).

- y is a list of labels [positive, negative], one for each example.

Note: The dataset has been already tokenised by its authors, so you don’t need to use

sent_tokenize() from NLTK.

OK! Let’s start thinking of good feature representations!

We will do this in two steps: encoding features, and more advanced representation features.

Note: We will be using the term features and feature templates interchangeably; you can

consider the latter as a “function” that generates similar-looking features (e.g., all unigram

occurrences is a feature template that results in a feature vector which has the size of the

vocabulary, and binary values for the words (or unigrams) that are present in a sentence).

One standard process (which is based on our intuition and expectation of what each feature

should contribute to) is the following (use the MultinomialNB as your main classifier

without any parameters; this is an implementation of Naive Bayes that is known to work very

well for text classification tasks such as this one):

for every feature template in list_of_features do

create feature vector representation

train classifier -> predict on dev -> compute F-1 score

choose best feature template based on F-1 score

As you might realise this seems a reasonable strategy, but it doesn’t take into account the

combination of feature templates. This is where your intuition should come into the game.

Which feature templates should you try to combine (i.e., apply in your Pipeline together)

and expect to get an increase in F-1 score?

Of course, you could try all of the combinations in a Grid Search fashion (using

GridSearchCV). But that could take several hours...

Note: What we recommend and ask you to implement is the bare minimum. However, this

part is a bit more research-y, so feel free to experiment with anything else that comes to

mind and you feel capable of implementing!

[implement compute_features() in sentiment_classifier.py]

Encoding Features

1. Implement each of the following feature templates; you should only need to use

CountVectorizer, and TfIdVectorizer. Check Week 5’s Lab Sheet and the

documentation of how to use and set the corresponding parameters). Report

directly on the Vision test (Coursework 2) the Accuracy, and provide a very

brief justification why you think the Accuracy changed, and comment on any

feature combinations you try out. The bare minimum you are requested to try

are the templates in (a-e).


- The first step is the simplest vector representation and will be considered as

our baseline, i.e, the point of reference for all future experiments. Hopefully,

everything else should (in theory) have a higher score. If not, either our

intuition didn’t work (don’t be surprised how often this happens), or we have a

bug! Don’t worry about the former case, but obviously double check you don’t

have a bug first, before moving on.

- You will also have to decide on combinations of features, i.e., which feature

templates to leave “on” before going to the next one. Don’t worry about

making an exhaustive search; use your intuition. Of course, you will notice

that some of the feature templates are mutually exclusive.

a. Represent only word occurrences with binary values (baseline)

b. Remove stop words (these are the most common words in the English


c. Represent term frequencies of words.

Another refinement on top of term frequencies is to downscale weights for words that

occur in many examples in the corpus and are therefore less informative than those

that occur only in a smaller portion of the corpus. This downscaling is called tf–idf for

“Term Frequency times Inverse Document Frequency”.

d. Represent words using tf-idf.

Sometimes using low-frequency words can hurt the performance of our classifier, as

there aren’t too many examples that contain them. The solution is to completely

remove them from our vocabulary before creating any features. The question to ask

is: which ones should we remove? A common technique is to apply a threshold on

the occurrences and remove any that falls below that. E.g., if we set the

threshold=1 we will remove all words from our vocabulary that occur only once.

e. Apply a threshold to low-frequency words. Also, report the size of the

vocabulary that remains. Consider the threshold as a hyperparameter. Hint:

Since this is a fairly small dataset don’t try too big thresholds!

Representation Features

2. Most likely the following feature templates should be used in combination with the

ones you tried above. Because by design they are attempting to capture more

“context” in principle they should add to our overall performance. However, since

they increase the feature space immensely you might not have enough data in the

end to attain the results you want. How will you find out? Well, by reporting the

Accuracy, and providing a small justification why you think the Accuracy changed,

and comment on any feature combinations you try out. Report directly on the

Vision test (Coursework 2) similarly to Part I - Question 1.

a. Represent sentences as a bag of bigrams, or as a bag of trigrams instead of

unigrams, which is what all of the Encoding Features were doing.

b. Try also combinations of ngram templates.

Report Results

Time to make a decision. Which feature combination should you choose?

3. Collect all your results and present them in a table (rows represent each experiment,

column contain Accuracy score). Choose the best feature combination and briefly

justify your answer.

Part II - Perceptron Algorithm [9 marks]

[Implement the (rest of the) body of train(), init_parameters() and predict()in


Now it’s finally the time to get your hands dirty: let’s actually implement a Machine Learning

algorithm from scratch! Hint: Take some time to familiarise yourself with the scaffolding

implementation of Perceptron class, and how it’s getting initialised from main() in

sentiment_classifier.py before you start implementing!

1. You will have to implement both the training and prediction methods for the

Perceptron Algorithm according to the following pseudocode (You don’t have to

answer any part on the Vision test; we will inspect the code you will upload.)

Algorithm 1: train(train_X, train_y, dev_X, dev_y)

1: init_parameters() // initialise parameters

2: for epoch = 1 … num_epocs do // for every epoch

3: preds ← []

4: for all (x, y) ∈ train_X, train_y do

5: a ← ∑ x // compute activation for this example



wd d + b

6: y? = sign(a) // predict the label

7: preds ← preds + y? // append the predicted label

8: if ya ≤ 0 then // we made a mistake

9: w ← w + yx // update weights

10: b ← b + y // update bias

11: end if

12: end for

13: end for

Algorithm 2: init_parameters()

1: w , for all d = 1 … D // initialise weights d ← 0

2: b ← 0 // initialise bias term

Algorithm 3: predict(X)

1: preds ← []

2: for all x ∈ X do

3: a ← ∑ x



wd d + b // compute activation for this example

4: y? = sign(a) // predict the label

5: preds ← preds + y? // append the predicted label

6: return preds


Now is the time to take your algorithm for a spin… Use as input vectors (train_X and dev_X)

the best feature representation from Part I, in order to train your perceptron.

2. Perform hyperparameter tuning.

a. Plot the training and development accuracy curves (in a single plot) against

your hyperparameter values (Just upload the file).

b. Briefly explain what is going on in this plot.

Note: Think of what is missing from the Pseudocode in Algorithm 1, in order to

perform this step.

Extra Features

Randomly shuffling the training set before you start a new epoch usually helps (e.g.,

especially when dealing with irrelevant features) and makes the model converge faster.

3. (a) Implement shuffling, compute train/dev accuracy curves and add them to the plot

you computed in the question above. (Upload a new file). (b) Answer very briefly (in

one line) if it helps.

A few useful tips (Don’t skip them!):

- Use the method safe_sparse_dot from scikit-learn (we’ve done the import for

you) when computing dot products with the input vector x. The reason is that the

Vectorizer functions you will use in Part I to create feature vectors, output a sparse

vector for every example (i.e., only contains values for non-zero elements).

- Be careful when computing the labels! Remember that you should output either -1

(negative class), or +1 (positive class).

- Debug your ML algorithm! Don’t just implement the pseudocode… Here are a few

typical steps (based on our experience):

- Inspect the value of the activation and the values of the weights/bias after a

perceptron update step on a single example. Do they do the right thing?

- Try first running your algorithm for 1 epoch; does it give you any errors? Is the

accuracy reasonable?

- What is the accuracy of a majority classifier (simply count the occurrences of

each class, pick the biggest and divide by the total number)? Does your

model do better than that? (We know it should)

- Overfit your model on a very small dataset. Use as training set a very small

portion of train_vecs and make sure your accuracy is increasing. Use the

following dataset sizes=1,2,5,10,100. For sizes=1,2 (maybe bigger too) your

accuracy should be 100%. If not, probably you have a bug!

Part III - Gradient Descent Algorithm [9 marks]

[Implement the (rest of the) body of train(), init_parameters() and predict()in


Let’s try to implement a different algorithm: Gradient Descent for a Linear Model using the

Squared Loss and L2-norm Regulariser. Note: Do not expect it to be as optimised as the

scikit-learn implementation (SGDClassifier). This is the reason why we will only train on a

smaller portion of the dataset (10000 examples).

1. Compute the gradients wrt to the weight vector w and bias term b. Enter just the

resulting terms on your response (you don’t need to show the derivation steps).

2. Implement the Gradient Descent Algorithm by adapting the pseudocode found in

Slides 14 and 27 of Week 4. Note: Careful! Slide 27 has an implementation using the

Exponential Loss.

Use as input vectors (train_X and dev_X) the best feature representation from Part I, in order

to train your linear model.

3. Perform hyperparameter tuning. Answer the following questions on the Vision Test.

a. Plot the training and development accuracy curves (in a single plot) against

max_iter for 3 combinations of eta and λ. (Upload the file with the plot that

should contain exactly 6 curves).

b. Briefly explain what is going on in this plot.

c. Plot the training and development average loss curves (in a single plot)

against max_iter for the same combinations as in Part III - 3a above. The

average loss is simply taking the average of the total loss of all examples in

the dataset.

Tips: This is a bit more convoluted since you have 3 hyperparameters (welcome to

the real world of Machine Learning!): max_iter, eta (η) and lambda (λ). You could try

a full grid search but it is going to be very slow. We advise you to set max_iter to

something low (e.g., 10 or 20) and leave it there. Then try with moderate eta (e.g.,

0.0001) and no regularisation (λ=0) and progressively increase λ (don’t go too far).

Try big changes to eta (e.g., 0.001, 0.0001, 0.00001) and do not try all the

intermediate values. When experimenting do not be afraid to “kill” an experiment

early if you see the accuracy not increasing (or loss not decreasing)… Your biggest

enemy here is overshooting (take my word for it, it’s very common!). Finally, at this

stage just print out the accuracies/losses (so you can “kill” the experiment if it

misbehaves) before going to the plots.

The moment of truth...

4. Finally, use the best Perceptron model from Part II and best Linear Model and predict

on the test set. Answer the following questions (very briefly -a few lines) on the Vision


a. What are their accuracies?

b. Is the comparison fair?

c. How do the two models compare to each other?


You will have to submit the Test in ‘Assignments>Coursework 2 - Test’ on Vision. Over

there, you should (i) upload the completed source code in a .zip file: PLEASE don’t include

the dataset files! (it should be named: Coursework_2_H0XXXXXXX.ipynb, by replacing

the Xs with your Student ID) , (ii) enter your answers to the questions outlined in the

description above, (iii) upload a few of plot images as part of your answers clearly indicated

within the test. It is advisable to properly comment your code. Note: Any 3rd party source

used must be properly cited (see also below).

Important Notes!

- The assignment counts for 25% of the course assessment.

- You are permitted to discuss the coursework with your classmates and of course with

me and the teaching assistant. However, coursework reports must be written in your

own words and the accompanied code must be your own. If some text or code in the

coursework has been taken from other sources, these sources must be properly

referenced. In particular, for pieces of code you get from the web (e.g., from

StackOverflow), minimally you should provide the link where you found it as an inline

comment in the jupyter notebook. Failure to reference work that has been obtained

from other sources or to copy the words and/or code of another student is

plagiarism and if detected, this will be reported to the School's Discipline Committee.

If a student is found guilty of plagiarism, the penalty could involve voiding the course.

- You should never give hard or soft copies of your coursework report or code to

another student. You must always refuse any request from another student for a

copy of your report and/or code. Sharing a coursework report and/or code with

another student is collusion, and if detected, this will be reported to the School's

Discipline Committee. If found guilty of collusion, the penalty could involve voiding

the course. You can find more information about the University’s policy on Plagiarism

here: https://www.hw.ac.uk/uk/students/studies/examinations/plagiarism.htm

- Pay special attention to all the Labs as they should provide lots of insight as to how

to tackle most of the questions in this coursework. The Final Lab is specifically

designed for you to ask any questions related to the coursework. Also, CHECK THE

DOCUMENTATION OF scikit-learn when you are in doubt.

- The test should be submitted by 15:00 UK time on Friday 13 November. You will

find the coursework under the ‘Assessment’ section of the course Vision page. No

individual extensions are permitted under any circumstances. Students who submit

after the deadline but within 5 working days of the deadline will be award

0.7x(awarded coursework mark). Submissions that are more than 5 days late will

receive 0 marks.

- You will receive the final mark, answers, sample code solution and cohort-wide

feedback no later than 15 working days.

版权所有:编程辅导网 2018 All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com