联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp2

您当前位置:首页 >> Python编程Python编程

日期:2020-07-25 11:28

COMP9414: Artificial Intelligence

Assignment 2: Sentiment Analysis

Value: 25%

This assignment is inspired by a typical real-life scenario. Imagine you have been hired as a

Data Scientist by a major airline company. Your job is to analyse the Twitter feed to determine

customer sentiment towards your company and its competitors.

In this assignment, you will be given a collection of tweets about US airlines. The tweets have been

manually labelled for sentiment. Sentiment is categorized as either positive, negative or neutral.

Important: Do not distribute these tweets on the Internet, as this breaches Twitter’s

Terms of Service.

You are expected to assess various supervised machine learning methods using a variety of features

and settings to determine what methods work best for sentiment classification in this domain. The

assignment has two components: programming to produce a collection of models for sentiment

analysis, and a report to evaluate the effectiveness of the models. The programming part involves

development of Python code for data preprocessing of tweets and experimentation of methods using

NLP and machine learning toolkits. The report involves evaluating and comparing the models

using various metrics, and comparison of the machine learning models to a baseline method.

You will use the NLTK toolkit for basic language preprocessing, and scikit-learn for feature construction

and evaluating the machine learning models. You will be given an example of how to

use NLTK and scikit-learn for this assignment (example.py). For the sentiment analysis baseline,

NLTK includes a hand-crafted (crowdsourced) sentiment analyser, VADER,1 which may perform

well in this domain because of the way it uses emojis and other features of social media text to

intensify sentiment, however the accuracy of VADER is difficult to anticipate because: (i) crowdsourcing

is in general highly unreliable, and (ii) this dataset might not include much use of emojis

and other markers of sentiment.

Data and Methods

A training dataset is a tsv (tab separated values) file containing a number of tweets, with one

tweet per line, and linebreaks within tweets removed. Each line of the tsv file has three fields:

instance number, tweet text and sentiment (positive, negative or neutral). A test dataset is a tsv

file in the same format as the training dataset except that your code should ignore the sentiment

field. Training and test datasets can be drawn from a supplied file dataset.tsv (see below).

For all models except VADER, consider a tweet to be a collection of words, where a word is a string

of at least two letters, numbers or the symbols #, @, , $ or %, delimited by a space, after removing

all other characters (two characters is the default minimum word length for CountVectorizer in

scikit-learn). URLs should be treated as a space, so delimit words. Note that deleting “junk”

characters may create longer words that were previously separated by those characters.

Use the supervised learning methods discussed in the lectures: Decision Trees (DT), Bernoulli

Naive Bayes (BNB) and Multinomial Naive Bayes (MNB). Do not code these methods: instead use

1https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109

the implementations from scikit-learn. Read the scikit-learn documentation on Decision Trees2 and

Naive Bayes,3 and the linked pages describing the parameters of the methods. Look at example.py

to see how to use CountVectorizer and train and test the machine learning algorithms, including

how to generate metrics for the models developed.

The programming part of the assignment is to produce DT, BNB and MNB models and your own

model for sentiment analysis in Python programs that can be called from the command line to train

and classify tweets read from correctly formatted tsv files. The report part of the assignment is to

analyse these models using a variety of parameters, preprocessing tools, scenarios and baselines.

Programming

You will produce and submit four Python programs: (i) DT sentiment.py (ii) BNB sentiment.py,

(iii) MNB sentiment.py and (iv) sentiment.py. The first three of these are standard models as

defined below. The last is a model that you develop following experimentation with the data. Use

the given dataset (dataset.tsv) containing 5000 labelled tweets to develop the models.

These programs, when called from the command line with two file names as arguments, the

first a training dataset and the second a test dataset, should print (to standard output), the

instance number and sentiment produced by the classifier of each tweet in the test set when

trained on the training set (one per line with a space between them) – each sentiment being the

string “positive”, “negative” or “neutral”. For example:

python3 DT sentiment.py training.tsv test.tsv > output.txt

should write to the file output.txt the instance number and sentiment of each tweet in test.tsv,

as determined by the Decision Tree classifier trained on training.tsv.

When reading in training and test datasets, make sure your code reads all the instances (some

Python readers use “excel” format, which uses double quotes as separators).

Standard Models

Train the three standard models on the supplied dataset of 5000 tweets (the whole of dataset.tsv).

For Decision Trees, use scikit-learn’s Decision Tree method with criterion set to ’entropy’ and with

random state=0. Scikit-learn’s Decision Tree method does not implement pruning, rather you

should make sure Decision Tree construction stops when a node covers fewer than 50 examples

(1% of the training set). Decision Trees are likely to lead to fragmentation, so to avoid overfitting

and reduce computation time, for all Decision Tree models use as features only the 1000 most

frequent words from the vocabulary (after preprocessing to remove “junk” characters as described

above). Write code to train and test a Decision Tree model in DT sentiment.py.

For both BNB and MNB, use scikit-learn’s implementations, but use all of the words in the

vocabulary as features. Write two Pythons programs for training and testing Naive Bayes models,

one a BNB model and one an MNB model, in BNB sentiment.py and MNB sentiment.py.

Your Model

Develop your best model for sentiment classification by varying the number and type of input

features for the learners, the parameters of the learners, and the training/test set split, as described

in your report (see below). Submit one program, sentiment.py, that trains and tests a model.

2https://scikit-learn.org/stable/modules/tree.html

3https://scikit-learn.org/stable/modules/naive bayes.html

Report

In the report, you will first evaluate the standard models, then present your own model. For

evaluating all models, report the results of training on the first 4000 tweets in dataset.tsv (the

“training set”) and testing on the remaining 1000 tweets (the “test set”), rather than using the

full dataset of 5000 tweets for training, so stopping the Decision Tree classifiers when nodes cover

less than 40 tweets rather than 50. Use the metrics (micro- and macro-accuracy, precision, recall

and F1) and classification reports from scikit-learn. Show the results in either tables or plots, and

write a short paragraph in your response to each item below. The answer to each question should

be self contained. Your report should be at most 10 pages. Do not include appendices.

1. (1 mark) Give simple descriptive statistics showing the frequency distribution for the sentiment

classes for the whole dataset of 5000 tweets. What do you notice about the distribution?

2. (2 marks) Develop BNB and MNB models from the training set using (a) the whole vocabulary,

and (b) the most frequent 1000 words from the vocabulary (as defined using CountVectorizer, after

preprocessing by removing “junk” characters). Show all metrics on the test set comparing the two

approaches for each method. Explain any similarities and differences in results.

3. (2 marks) Evaluate the three standard models with respect to the VADER baseline. Show all

metrics on the test set and comment on the performance of the baseline and of the models relative

to the baseline.

4. (2 marks) Evaluate the effect of preprocessing the input features by applying NLTK English

stop word removal then NLTK Porter stemming on classifier performance for the three standard

models. Show all metrics with and without preprocessing on the test set and explain the results.

5. (2 marks) Evaluate the effect that converting all letters to lower case has on classifier performance

for the three standard models. Show all metrics with and without conversion to lower case

on the test set and explain the results.

6. (6 marks) Describe your best method for sentiment analysis and justify your decision. Give

some experimental results for your method trained on the training set of 4000 tweets and tested

on the test set of 1000 tweets. Provide a brief comparison of your model to the standard models

and the baseline (use the results from the previous questions).

Submission

? Submit all your files using a command such as (this includes Python code and report):

give cs9414 ass2 DT*.py BNB*.py MNB*.py sentiment.py report.pdf

? Your submission should include:

– Your .py files for the specified models and your model, plus any .py “helper” files

– A .pdf file containing your report

? When your files are submitted, a test will be done to ensure that one of your Python files

runs on the CSE machine (take note of any error messages printed out)

? When running your code on CSE machines:

– Set SKLEARN SITE JOBLIB=TRUE to avoid warning messages

– Do not download NLTK in your program: CSE machines have NLTK installed

? Check that your submission has been received using the command:

9414 classrun -check ass2

Assessment

Marks for this assignment are allocated as follows:

? Programming (auto-marked): 10 marks

? Report: 15 marks

Late penalty: 5 marks per day or part-day late off the mark obtainable for up to 3

(calendar) days after the due date.

Assessment Criteria

? Correctness: Assessed on standard input tests, using calls such as:

python3 DT sentiment.py training.tsv test.tsv > output.txt

Each such test will give two files, a training dataset and a test dataset, which contain any

number of tweets (one on each line) in the correct format. The training and test datasets

can have any names, not just training.tsv and test.tsv, so read the file names from

sys.argv. The output should be a sequence of lines (one line for each tweet) giving the

instance number and classified sentiment, separated by a space and with no extra spaces or

lines. There are 2 marks allocated for correctness of each of the three standard models.

For your own method, 4 marks are allocated for correctness of your methods on test sets of

tweets that include unseen examples.

? Report: Assessed on correctness and thoroughness of experimental analysis, and clarity and

succinctness of explanations.

There are 9 marks allocated to items 1–5 as above, and 6 marks for item 6. Of these 6 marks,

2 marks are for the explanation of your choice of model, 2 marks are for the experimental

analysis of your model, and 2 marks are for the evaluation of your model in comparison to

the standard models and baseline.

Plagiarism

Remember that ALL work submitted for this assignment must be your own work and no code

sharing or copying is allowed. You may use code from the Internet only with suitable attribution

of the source in your program. Do not use public code repositories. All submitted assignments will

be run through plagiarism detection software to detect similarities to other submissions, including

from past years. You should carefully read the UNSW policy on academic integrity and plagiarism

(linked from the course web page), noting, in particular, that collusion (working together on an

assignment, or sharing parts of assignment solutions) is a form of plagiarism. There is also a new

plagiarism policy starting this term with more severe penalties.


相关文章

【上一篇】:到头了
【下一篇】:没有了

版权所有:编程辅导网 2018 All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。