CS2034: Data Analytics Project: Building a sentiment classifier
Winter 2020
Out of 80 (tentatively 18% of the final grade)
In this project, you will select a dataset from the two available options. You will then train a machine
learning classifier to make predictions based on your features developed in VBA.
There are two datasets available: 1) Yelp reviews (project_yelp.txt) and 2) IMDB movie reviews
(project_imdb.txt).
Citation: [1]
Both of these datasets are in the format: <Review><TAB><Class>, where <Class> = 0 for negative
sentiment and 1 for positive sentiment.
You can select ONE of these datasets to engineer features for, using Excel and VBA. It is possible your
code might work well on both.
Submission Requirements: Submit an XLSM file <your_last_name>_project.xlsm and an accuracy.txt file
with the copied output for the LinearSVMBinary from Visual Studio (on Windows) or the custom Mac
software.
Project Requirements:
0. Import the tab-separated data into a macro-enabled Excel workbook. Give the first column a heading
called “REVIEW” and the second column, with the class labels, a heading called “SENTIMENT
CLASS” (0 marks).
1. Develop 12 VBA features, implemented as Subs, to process the text (4 marks * 12 features = 48
marks).
(0 – Poor; 1, 2 – Marginal Quality; 3 – Acceptable Quality; 4 – Good Quality)
You will write 12 features, implemented as VBA subs, to process the text in this data to be fed
into a machine learning classifier.
Each feature will have its own column. The values for each feature MUST be numeric only.
Marks will be assigned based on code quality, originality, and expected performance (i.e., the
potential for the feature to improve the classifier accuracy). The features should be reasonably
distinct from each other.
IMPORTANT: Not all of these must be complex. The TA will take into consideration the balance
of complexity across your overall project. For instance, a high-scoring project’s features might look
like: first 4 simple, next 4 of moderate complexity, last 4 original and complex, demonstrating
your creativity and ability to code in VBA.
IMPORTANT # 2: If you cannot come up with 12 features, the Instructor will assist you with
ideas.
IMPORTANT # 3: You can write more features if you want; you must clearly label the ones you
want us to mark.
2. A Sub Main that calls all of your features on the data (6 marks).
3. Overall sufficient code comments in the Module, including a comment header with separate
lines consisting of your name, the course code, “Winter 2020”, and the Instructor name. There
should be good naming of Subs + the Module that contains your features (8 marks -
subjective).
4. Overall good code organization (6 marks - subjective).
5. Good Accuracy scores (up to 12 marks, with a potential of 3 bonus for a maximum of 15
marks).
The instructor will release baseline scores for the classification task for the data. Those who
score below the baseline will get < 6 marks.
Those who match the baseline will get 6 marks.
Those who score above the baseline will get > 6 marks.
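As a rough illustration of requirements 1 to 3, a module might be laid out along the following lines. This is only a sketch under stated assumptions, not a model answer: the Sub and Function names (FeatureExclamationCount, LastReviewRow), the constants, and the column layout (REVIEW in column A, SENTIMENT CLASS in column B, features starting at column C, headings in row 1) are all hypothetical, and the header placeholders must be replaced with your own details.

```vba
' Name:       <your name>
' Course:     CS2034
' Term:       Winter 2020
' Instructor: <instructor name>

Option Explicit

' Assumed (hypothetical) layout: headings in row 1, reviews in column A.
Private Const FIRST_DATA_ROW As Long = 2
Private Const REVIEW_COL As Long = 1

' Returns the last row that contains a review on the active sheet.
Private Function LastReviewRow() As Long
    LastReviewRow = ActiveSheet.Cells(ActiveSheet.Rows.Count, REVIEW_COL).End(xlUp).Row
End Function

' Feature 1 (simple): number of exclamation marks in each review.
' Writes a numeric value into its own column (column C is an assumption).
Sub FeatureExclamationCount()
    Const FEATURE_COL As Long = 3
    Dim r As Long
    Dim reviewText As String

    ActiveSheet.Cells(1, FEATURE_COL).Value = "EXCLAMATION COUNT"
    For r = FIRST_DATA_ROW To LastReviewRow()
        reviewText = CStr(ActiveSheet.Cells(r, REVIEW_COL).Value)
        ' The drop in length after removing "!" equals the number of "!" characters.
        ActiveSheet.Cells(r, FEATURE_COL).Value = _
            Len(reviewText) - Len(Replace(reviewText, "!", ""))
    Next r
End Sub

' Calls every feature Sub on the data, one column per feature.
Sub Main()
    FeatureExclamationCount
    ' Call the remaining feature Subs here, each writing to its own column.
End Sub
```

Keeping each feature in its own Sub with its own output column means Main stays a plain list of calls, which also makes it easy to label the 12 features you want marked.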
Training your ML Classifier (Mac):
The software to do this, including instructions, will be posted on OWL in a separate file in the project
assignment dropbox.
Training your ML Classifier (Windows Only):
1. Download Visual Studio Community 2019 for Windows (left) from:
https://visualstudio.microsoft.com/
2. In the Installer setup program, only select Desktop development.
3. Register for the COMMUNITY edition of the software (this is free for education use). You
might be able to use your UWO login.
4. Download and install the ML.NET Model Builder (you can OPEN this file and it will set
everything up for you) https://marketplace.visualstudio.com/items?itemName=MLNET.07
5. Export your XLSM feature data into a new CSV file. You should be able to just copy+paste it.
6. Select New -> Project -> Console Application
7. Press “Create”
8. Right-click on “ConsoleApp1” below “Solution ConsoleApp1”, hover over “Add”, then go to
“Machine Learning” and click “Custom Scenario” (NOT “sentiment analysis”).
9. Test out your CSV file. Make sure the column to predict is the SENTIMENT CLASS label (0 –
negative sentiment or 1 – positive sentiment).
10. Train for 10 seconds under “binary-classification”
11. Then click “Evaluate” to get the accuracy score. Check what it is for the LinearSVMBinary in
the output window (if this is hidden, click View -> Output).
Dataset References:
[1] “From group to individual labels using deep features,” in Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2015,
doi: 10.1145/2783258.2783380.