Date: 2020-09-15 10:28

COMP4650/6490 Document Analysis

Assignment 2 – ML

In this assignment, you will build several text classification models on a provided movie review

dataset. The dataset consists of 50,000 reviews written about movies on IMDb, each labelled

with the sentiment of the review – either positive or negative. In addition there are another 50,000

reviews without labels.

In this assignment you will:

1. Become familiar with the PyTorch framework for implementing neural network-based

machine learning models.

2. Develop a better understanding of how machine learning models are trained in practice,

including partitioning of datasets and evaluation.

3. Study how changing the values of various model hyper-parameters impacts their

performance.

Throughout this assignment you will make changes to the provided code to improve the models.

In addition, you will produce an answers file with your responses to each question. Your answers

file must be a .pdf file named u1234567.pdf where u1234567 is your Uni ID. You should submit a

.zip file containing all of the code files and your answers pdf file, BUT NO DATA.

Your answers to coding questions will be marked based on the quality of your code (is it efficient,

is it readable, is it extendable, is it correct).

Your answers to discussion questions will be marked based on how convincing your

explanations are (are they sufficiently detailed, are they well-reasoned, are they backed by

appropriate evidence, are they clear).

Question 1 – A simple linear classifier (15%)

You have been provided with various code files to create, train and evaluate models on the given

dataset. To begin with, read through the code of train_vector_classifier.py and then run it. For the

vector classifier, the documents will be represented as tf-idf vectors and a simple linear classifier

(LogisticRegressor) is trained to predict classes.

Your first task is to change the various settings in this file to try and improve the validation accuracy

of the model. The options you can change are:

1. The pre-processor in preprocessor.py. You can re-use your pre-processor from assignment 1,

or develop a new one.

2. The get_vector_representation function in data_loader.py. This function specifies how lists

of tokens will be transformed into tf-idf vectors. You can find documentation for the

CountVectorizer parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

3. The settings of the LogisticRegressor model in train_vector_classifier.py. You can find

documentation for the available parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
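
The kind of settings comparison described above can be sketched as follows. This is only an illustration under assumptions: it uses scikit-learn's TfidfVectorizer and LogisticRegression directly rather than the provided get_vector_representation function and LogisticRegressor model, the toy reviews stand in for the real dataset, and build_classifier and the settings names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data standing in for the movie reviews (1 = positive, 0 = negative).
train_texts = [
    "a wonderful moving film with great acting",
    "great story and wonderful characters",
    "brilliant direction and a moving script",
    "a great and brilliant film",
    "a dull boring film with terrible acting",
    "terrible story and boring characters",
    "awful direction and a dull script",
    "a boring and awful film",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
val_texts = ["wonderful and brilliant", "dull and awful"]
val_labels = [1, 0]


def build_classifier(ngram_range=(1, 1), min_df=1, sublinear_tf=False, C=1.0):
    """Hypothetical helper: a tf-idf vectorizer followed by logistic regression."""
    vectorizer = TfidfVectorizer(
        ngram_range=ngram_range, min_df=min_df, sublinear_tf=sublinear_tf
    )
    classifier = LogisticRegression(C=C, max_iter=1000)
    return make_pipeline(vectorizer, classifier)


# Each entry changes one setting relative to the baseline.
settings = {
    "baseline": {},
    "bigrams": {"ngram_range": (1, 2)},
    "sublinear_tf": {"sublinear_tf": True},
    "weaker_regularisation": {"C": 10.0},
}

results = {}
for name, kwargs in settings.items():
    model = build_classifier(**kwargs).fit(train_texts, train_labels)
    # Pipeline.score reports mean accuracy on the held-out validation data.
    results[name] = model.score(val_texts, val_labels)

for name, accuracy in results.items():
    print(f"{name}: validation accuracy = {accuracy:.2f}")
```

Changing one setting at a time against a fixed baseline makes it easier to attribute a change in validation accuracy to a specific setting.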

Pick three changes you made that you think caused the most increase in performance, and for each

one briefly describe what effect this change has on your model and why you believe this led to the

increased performance.

Leave your code file with the best performing settings.

Question 2 – Embedding based classifier (15%)

Next you will implement an embedding-based classifier in PyTorch. The model is a simple variation

on FastText (you can read more about it here if you are interested:

https://arxiv.org/pdf/1607.01759.pdf). In models.py there is a class FastText; your task is to

implement the forward function of this class. Once you have implemented it you can then run

train_embedding_classifier.py to run your model.

For each input sentence, the model computes its output in 3 steps:

1. Embed the input IDs to create vector representations of each word in the sentence.

2. Aggregate the vectors together to create one vector that represents the entire sentence.

You can use any aggregation method you want, e.g. sum, max, average (note that average is

significantly more difficult to implement due to padding), etc.

3. Run the aggregated vector through a linear layer to compute predicted logits for each class.
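
The three steps above can be sketched as a small PyTorch module. This is a minimal sketch, not the provided FastText class: the class name MeanEmbeddingClassifier, its constructor arguments, and the assumption that padding uses token ID 0 are all hypothetical. It uses average aggregation and shows one way to handle padding, by masking padded positions before dividing by the true sentence length.

```python
import torch
import torch.nn as nn


class MeanEmbeddingClassifier(nn.Module):
    """Hypothetical embed -> masked average -> linear classifier."""

    def __init__(self, vocab_size, embed_dim, num_classes, pad_id=0):
        super().__init__()
        self.pad_id = pad_id  # assumption: padded positions hold this ID
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.linear = nn.Linear(embed_dim, num_classes)

    def forward(self, ids):
        # Step 1: embed the IDs. ids is (batch, seq_len).
        vectors = self.embedding(ids)  # (batch, seq_len, embed_dim)

        # Step 2: average over real tokens only, ignoring padding.
        mask = (ids != self.pad_id).unsqueeze(-1).to(vectors.dtype)
        summed = (vectors * mask).sum(dim=1)     # (batch, embed_dim)
        counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
        sentence_vectors = summed / counts

        # Step 3: linear layer producing one logit per class.
        return self.linear(sentence_vectors)     # (batch, num_classes)


torch.manual_seed(0)
model = MeanEmbeddingClassifier(vocab_size=10, embed_dim=4, num_classes=2)
batch = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]])  # 0 marks padding
logits = model(batch)
print(logits.shape)
```

With the mask in place, appending extra padding to a sentence leaves its logits unchanged, which is the property an unmasked mean over the sequence dimension would not have.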

Question 3 – Tuning a PyTorch model (15%)

Similarly to question 1, change the various settings of your FastText model to try and improve its

validation accuracy. Here are some things you may want to change:

1. The aggregation method used by FastText.

2. The dimension of the embedding vectors.

3. The optimizer, including learning rate, momentum or even algorithm.

4. The number of epochs trained for.

5. The batch size.

6. The preprocessor.
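
Several of the options listed above (optimizer, learning rate, momentum, epochs, batch size) can be exposed as arguments of a training loop. The following is a rough sketch under assumptions: the train function, its parameter names, and the random toy data are hypothetical and are not taken from train_embedding_classifier.py.

```python
import torch
import torch.nn as nn


def train(model, inputs, targets, *, optimizer_name="sgd", lr=0.1,
          momentum=0.9, epochs=20, batch_size=8):
    """Hypothetical training loop; returns the mean training loss per epoch."""
    if optimizer_name == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    elif optimizer_name == "adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    else:
        raise ValueError(f"unknown optimizer: {optimizer_name}")

    loss_fn = nn.CrossEntropyLoss()
    history = []
    for _ in range(epochs):
        order = torch.randperm(len(inputs))  # reshuffle every epoch
        total = 0.0
        for start in range(0, len(inputs), batch_size):
            idx = order[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(model(inputs[idx]), targets[idx])
            loss.backward()
            optimizer.step()
            total += loss.item() * len(idx)
        history.append(total / len(inputs))
    return history


# Toy problem: the class depends only on the sign of the first feature.
torch.manual_seed(0)
inputs = torch.randn(64, 5)
targets = (inputs[:, 0] > 0).long()
model = nn.Linear(5, 2)
history = train(model, inputs, targets, optimizer_name="adam", lr=0.05, epochs=30)
print(f"first epoch loss {history[0]:.3f}, last epoch loss {history[-1]:.3f}")
```

The returned per-epoch losses are the kind of values that could be plotted as a loss curve when comparing settings.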

In your answers report five different settings you tried, along with their accuracies. You may

optionally include some plots of losses over time. For each setting briefly describe why you believe

this led to increased/decreased performance.

Leave your code file with the best performing settings.

Question 4 – Comparison of models (10%)

Compare the performance of your best LogisticRegressor model to your best FastText model. Which

one performed better? You should explain what measures you are using to judge these models and

why. You should also explain why you think the LogisticRegressor/FastText performed better than

the other on this dataset.
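
As an illustration of measures that could be used for such a comparison, the sketch below computes accuracy, precision, recall and F1 for binary predictions. The function name binary_metrics and the example labels are hypothetical; which measures are appropriate for this dataset is for you to argue.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions, purely for illustration.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Both models should be scored on the same held-out examples so that the numbers are directly comparable.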

Question 5 – Implement Word2Vec (15%)

You will now train a Word2Vec model to produce embeddings for the words in the vocabulary. You

will need to complete the batch_method function of the TextDS class in data_loader.py. This

function receives as input a DataFrame, where each cell of the “ids” column is a list of the token IDs

from some movie review. Each movie review should be converted into a list of context windows,

with the centre word of each window used as the target and the other words used as input. The

same FastText model from before will be used to do the prediction. Once you have implemented the

function, you will be able to run train_word2vec.py.
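
The conversion of one review into context windows can be sketched in plain Python. This is an illustration only, not the batch_method implementation: the function name context_windows and the half_width parameter are hypothetical, it works on a plain list of token IDs rather than a DataFrame, and it assumes that positions too close to either end of the review are skipped rather than padded.

```python
def context_windows(token_ids, half_width=2):
    """Return (context, target) pairs for every full window in token_ids.

    The centre token is the target and the half_width tokens on each
    side form the input context. Positions without a full window on
    both sides are skipped (an assumption of this sketch).
    """
    windows = []
    for centre in range(half_width, len(token_ids) - half_width):
        left = token_ids[centre - half_width:centre]
        right = token_ids[centre + 1:centre + half_width + 1]
        windows.append((left + right, token_ids[centre]))
    return windows


review = [10, 11, 12, 13, 14]
for context, target in context_windows(review, half_width=1):
    print(context, "->", target)
```

Applying the function to every review and concatenating the results would give one list of inputs and one list of targets for a batch.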

Question 6 – Compare pre-trained embeddings (10%)

After train_word2vec.py has finished running it will produce a file of word embeddings, and you will

then be able to run train_pretrained_classifier.py. train_pretrained_classifier.py trains a FastText

classifier, except instead of the embeddings being learned from scratch, they will be read from the

provided file and then frozen. Make sure that your train_pretrained_classifier.py script uses the

same settings as train_embedding_classifier.py so that you can compare them.
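
One way to load pre-trained embeddings and freeze them in PyTorch is nn.Embedding.from_pretrained with freeze=True. The sketch below is a minimal illustration under assumptions: the small hand-written weight matrix stands in for the embeddings file produced by train_word2vec.py, whose format is not described here, and row 0 is assumed to be the padding vector.

```python
import torch
import torch.nn as nn

# Stand-in for embeddings read from file: one row per vocabulary entry.
pretrained = torch.tensor([
    [0.0, 0.0, 0.0],  # assumption: row 0 is the padding token
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
])

# freeze=True sets requires_grad=False on the embedding weights.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=0)
classifier = nn.Linear(pretrained.shape[1], 2)

# Only parameters that still require gradients go to the optimizer.
all_params = list(embedding.parameters()) + list(classifier.parameters())
trainable = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# One training step on a toy batch: sum-aggregate, classify, update.
ids = torch.tensor([[1, 2, 0], [2, 2, 1]])
labels = torch.tensor([0, 1])
loss = nn.CrossEntropyLoss()(classifier(embedding(ids).sum(dim=1)), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(embedding.weight.requires_grad)  # False
```

After the step the linear layer has been updated while the embedding matrix is identical to the loaded values.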

In your answers report the accuracy of the pretrained classifier with the original word2vec settings,

along with three other settings. Did your FastText model perform better with or without pre-trained

embeddings? Why do you think this is?

Leave your code file with the best performing settings.

Question 7 – Further Modification (20%)

Your final task is to implement some further modification to the system to improve performance.

Exactly what kind of modification you implement is left up to you.

Some examples of modifications you could make:

• Replace the linear classifier in FastText with a multi-layer neural network.

• Implement a position dependent word2vec model.

• Replace the aggregation in FastText with a sequence model – such as GRU or LSTM.

• Use a Transformer to perform the classification.

• Anything else you can think of.
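
As an illustration of the sequence-model option above, the following sketch replaces the aggregation step with a GRU and classifies from its final hidden state. It is one possible design under assumptions, not a reference solution: the class name GRUClassifier and its arguments are hypothetical, padding is assumed to use token ID 0 and to appear only at the end of each sequence.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class GRUClassifier(nn.Module):
    """Hypothetical embed -> GRU -> linear classifier."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, pad_id=0):
        super().__init__()
        self.pad_id = pad_id  # assumption: right-padding with this ID
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, ids):
        # True length of each sequence, at least 1 so packing never fails.
        lengths = (ids != self.pad_id).sum(dim=1).clamp(min=1)
        # Packing makes the GRU stop at the last real token of each sequence.
        packed = pack_padded_sequence(
            self.embedding(ids), lengths.cpu(),
            batch_first=True, enforce_sorted=False,
        )
        _, final_hidden = self.gru(packed)     # (num_layers, batch, hidden_dim)
        return self.linear(final_hidden[-1])   # (batch, num_classes)


torch.manual_seed(0)
gru_model = GRUClassifier(vocab_size=10, embed_dim=4, hidden_dim=8, num_classes=2)
gru_batch = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]])
gru_logits = gru_model(gru_batch)
print(gru_logits.shape)
```

Unlike sum or average aggregation, the GRU is sensitive to word order, which is the main reason to try it.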

You should implement your changes to this question in train_final_classifier.py and leave the

existing files as they are.

You can change the proportion of training/validation/testing data in __settings__.py if you wish to

use more of the data in training, though it will make the scripts take longer to run.
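
For illustration, a proportion-based split into training, validation and test sets might look like the sketch below. The contents of __settings__.py are not shown in this handout, so the function name split_dataset, its fraction arguments, and the seed are all hypothetical.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items and split them into train/validation/test lists.

    Whatever remains after the train and validation portions becomes
    the test set. A fixed seed keeps the split reproducible.
    """
    if train_frac < 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Keeping the seed fixed across experiments means every setting is evaluated on the same validation examples.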

In your answer to this question provide a list of every change that you tried along with its

corresponding accuracies.

Academic Misconduct Policy: All submitted written work and code must be your own (except for any

provided starter code, of course) – submitting work other than your own will lead to both a failure

on the assignment and a referral of the case to the ANU academic misconduct review procedures:

ANU Academic Misconduct Procedures.

