CSCI446/946 - Spring Session 2019 Page 1
University of Wollongong
School of Computing and Information Technology
CSCI446/946 Big Data Analytics Spring 2019
Assignment 2 (Due: 9 October 2019, Wednesday) 20 marks
Aim
This assignment is intended to provide basic experience in conducting text analytics experiments with R. After
completing this assignment, you should know how to perform text classification, topic modeling, and sentiment
analysis.
Preliminaries
Read through the lecture notes and recommended readings on text analysis. Study all example programs therein so
that you fully understand these techniques and know how to perform them with R.
Task 1 – Text Classification (6 marks)
The 20 Newsgroups data set is a benchmark for text classification. It consists of approximately 20,000 newsgroup
documents, which have been categorised into 20 different newsgroups. Information on this dataset can be obtained
from the webpage http://qwone.com/~jason/20Newsgroups/ . Download “20news-bydate-matlab.tgz” from
this webpage and extract it to obtain the training and testing data sets. Train a Naïve Bayes classifier on the
training data set and test it on the testing data set.
In your report, you need to
1. Describe this 20 Newsgroups data set.
2. Describe how each document is represented in your implementation.
3. Describe the Naïve Bayes classifier and how you use it to classify the 20 Newsgroups data set.
4. Report the classification accuracy and plot the confusion matrix.
5. Attach your code at the end of the report.
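The training and prediction steps of a multinomial Naïve Bayes classifier can be sketched as follows. This is only an illustrative sketch in plain Python, not a required implementation. It assumes that each data line has the form "docIdx wordIdx count" and that labels are integer class ids; the toy lines, the vocabulary size of 4, and the function names (parse_triples, train_nb, predict) are made up for illustration. Laplace smoothing (alpha = 1) is one common choice, not the only one.

```python
import math
from collections import defaultdict

def parse_triples(lines):
    """Parse 'docIdx wordIdx count' lines into integer triples (assumed format)."""
    return [tuple(int(x) for x in line.split()) for line in lines if line.strip()]

def group_by_doc(triples):
    """Turn triples into a bag-of-words dict per document: doc_id -> {word_id: count}."""
    docs = defaultdict(dict)
    for doc_id, word_id, count in triples:
        docs[doc_id][word_id] = count
    return docs

def train_nb(triples, labels, vocab_size, alpha=1.0):
    """Collect the counts needed for multinomial Naive Bayes with add-alpha smoothing."""
    class_docs = defaultdict(int)                         # documents per class
    class_tokens = defaultdict(int)                       # total word count per class
    word_counts = defaultdict(lambda: defaultdict(int))   # class -> word -> count
    for c in labels.values():
        class_docs[c] += 1
    for doc_id, word_id, count in triples:
        c = labels[doc_id]
        word_counts[c][word_id] += count
        class_tokens[c] += count
    n_docs = len(labels)
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    return {"log_prior": log_prior, "word_counts": word_counts,
            "class_tokens": class_tokens, "vocab_size": vocab_size, "alpha": alpha}

def predict(doc_words, model):
    """Return the class with the highest log posterior for one bag-of-words document."""
    alpha, vocab_size = model["alpha"], model["vocab_size"]
    best_class, best_score = None, -math.inf
    for c, log_prior in model["log_prior"].items():
        denom = model["class_tokens"][c] + alpha * vocab_size
        score = log_prior
        for w, n in doc_words.items():
            score += n * math.log((model["word_counts"][c].get(w, 0) + alpha) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data (hypothetical): class 1 uses words 1-2, class 2 uses words 3-4.
train_lines = ["1 1 3", "1 2 1", "2 1 2", "3 3 2", "3 4 2", "4 4 3"]
train_labels = {1: 1, 2: 1, 3: 2, 4: 2}
model = train_nb(parse_triples(train_lines), train_labels, vocab_size=4)

test_docs = group_by_doc(parse_triples(["1 1 2", "1 2 1", "2 3 1", "2 4 2"]))
predictions = {d: predict(words, model) for d, words in test_docs.items()}
print(predictions)
```

Working in log space avoids numerical underflow when a document contains many words, which matters for a vocabulary as large as that of the 20 Newsgroups data.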
Task 2 – Topic Modeling (6 marks)
Perform LDA topic modeling on the Reuters-21578 corpus using R (or Python). NLTK already comes with
the Reuters-21578 corpus. To import this corpus, enter the following command at the Python prompt:
from nltk.corpus import reuters
R has an lda package that provides built-in functions for LDA. LDA has also been implemented in several Python
libraries, such as gensim. Either use one such package/library or implement your own LDA to perform topic
modeling on the Reuters-21578 corpus.
In your report, you need to
1. Describe the Reuters-21578 corpus.
2. Describe how each document is represented in your implementation.
3. Describe the whole procedure of applying LDA to this corpus to perform topic modeling.
4. Describe the parameter settings that you use in the LDA and explain their meanings.
5. Describe the output of your code and visualize the obtained topics in appropriate ways.
6. Attach your code at the end of the report.
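For those who choose to implement their own LDA, one possible approach is collapsed Gibbs sampling. The sketch below is a minimal illustration on a made-up six-word vocabulary, not a reference solution. The function names (lda_gibbs, top_words), the toy documents, and the values of alpha (document-topic prior), beta (topic-word prior), and n_iters are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n_dk: topic counts per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n_kw: word counts per topic
    topic_total = [0] * n_topics                              # n_k: total words per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this token from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z = k | rest) is proportional to (n_dk + alpha)(n_kw + beta) / (n_k + V*beta).
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def top_words(topic_word, vocab, k, n=3):
    """Return the n words with the highest counts in topic k."""
    ranked = sorted(range(len(vocab)), key=lambda w: topic_word[k][w], reverse=True)
    return [vocab[w] for w in ranked[:n]]

# Toy corpus (hypothetical): each document is a list of indices into vocab.
vocab = ["oil", "crude", "barrel", "bank", "rate", "loan"]
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [4, 5, 3, 5]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=len(vocab))

for k in range(2):
    print("topic", k, top_words(topic_word, vocab, k))
```

The count tables returned here correspond to the two outputs usually inspected after LDA: the per-document topic mixture and the per-topic word distribution. On the real corpus, each document would first be tokenized and mapped to word ids, and a library implementation would normally be much faster than this loop.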
Task 3 – Sentiment Analysis (8 marks)
Choose a topic that interests you, such as a movie, a celebrity, or any buzzword. Then collect 200 tweets related to this
topic. Hand-tag them as positive, neutral, or negative. Next, randomly split them into 150 tweets as the training set
and the remaining 50 as the testing set. Run one or more classifiers (such as Naïve Bayes, Maximum Entropy, or
Support Vector Machines) over these tweets to perform sentiment analysis. Report the classification accuracy and
plot the confusion matrix. When you run more than one classifier, find a method to evaluate which classifier
performs better than the others. (* Students of CSCI446 are not required to run more than one
classifier.)
In your report, you need to
1. Describe the procedure of collecting the tweets and manually tagging them.
2. Describe the statistics of the obtained data set.
3. Describe how you represent each tweet for classification.
4. For each classifier, describe its working principle, classification procedure, and parameter setting.
5. For each classifier, report the classification accuracy and plot the confusion matrix.
6. (CSCI946 only) When you run more than one classifier, report which classifier performs better than the
others and describe the methods you use to reach this conclusion.
7. Attach your code at the end of the report.
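The random 150/50 split and the accuracy and confusion-matrix computation can be sketched as below. This is an illustrative sketch only: the placeholder tweets, the fixed seed, the label names, and the function names (split_train_test, confusion_matrix, accuracy) are assumptions, and the example predictions are made up rather than produced by a real classifier.

```python
import random
from collections import Counter

def split_train_test(items, n_train, seed=42):
    """Shuffle a copy of the items with a fixed seed and split off the first n_train."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes, both in the order of classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

def accuracy(matrix):
    """Fraction of test items on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Placeholder data (hypothetical): 200 hand-tagged tweets as (text, label) pairs.
classes = ["positive", "neutral", "negative"]
tweets = [("tweet %d" % i, classes[i % 3]) for i in range(200)]
train_set, test_set = split_train_test(tweets, n_train=150)

# Made-up predictions for four test tweets, only to show the matrix layout.
true_labels = ["positive", "negative", "neutral", "positive"]
predicted = ["positive", "negative", "positive", "positive"]
matrix = confusion_matrix(true_labels, predicted, classes)
print(matrix)
print(accuracy(matrix))
```

Fixing the seed makes the split reproducible, so that every classifier is trained and tested on exactly the same tweets when several classifiers are compared.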
Submit:
Important:
1. The report must be in PDF format.
2. The report shall contain sufficient and detailed description, explanation, justification and
discussion. Marks will be deducted for a BRIEF report.
3. Sufficient annotation shall be provided in your code to make it easy to understand.
Neatly print your report and code (i.e. first the report then the code) on A4 pages with an appropriate cover sheet and
hand it in during the lecture on the 9th of October 2019. Make sure your report and code are correctly formatted and
titled. (Marks will be deducted for untidy or incorrectly formatted work.) Also, submit your report and the source
code in a zip file named A2.zip via the submit link provided on the Moodle site.
Note: Failure of your code to run may attract zero marks. Code or reports considered to be unreasonably similar due to
copying will attract zero marks. You may be requested to demonstrate and explain your program when necessary.
Marks will be awarded for correct design, implementation and style. Any request for an extension of the submission
deadline or demonstration time limit must be made to the Subject Coordinator before the submission deadline.
Supporting documentation must accompany the request for any extension. Late assignment submissions without
a granted extension will be marked, but the mark awarded will be reduced by 25% of the assignment mark for each day
(including weekends) late.
--- END ---