联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2019-05-23 11:02

2019S1 BUSS6002 1

BUSS6002 Assignment 2

Due Date: Friday 31 May 2019

Value: 25% of the total mark

Rationale

This group assignment has been designed to help students develop valuable collaboration and

communication skills when working in a team, as well as to allow students to contextualise

their data science skills on a real-world problem from business.

Instructions

1. Required submission items via Canvas:

1. ONE written report (PDF format).

Assignments > Report Submission (Assignment 2)

2. ONE Jupyter Notebook .ipynb

Assignments > Upload Your Code File (Assignment 2)

3. ONE csv file of test results

Assignments > Test Results Submission (Assignment 2)

2. The assignment is due at 12:00pm (noon) on Friday, 31 May 2019. The late penalty

for the assignment is 5% of the assigned mark per day, starting after 12:00pm on the

due date. The closing date Friday, 07 June 2019, 12:00pm (noon) is the last date on

which an assessment will be accepted for marking.

3. As per anonymous marking policy, please include the Group ID and Student IDs of all

group members. Do NOT include names. The name of the report and code file must

follow: GroupID_BUSS6002_Assignment2_S12019, and the name of test results

must follow: GroupID_BUSS6002_Assignment2_Test_Results.csv.

4. Your answers shall be provided as a word-processed report giving full explanation

and interpretation of any results you obtain. Output without explanation will receive

zero marks. You are required to also submit your code that can reproduce your

reported results, as reproducibility is a key component to data science. Not submitting

your code will lead to a loss of 50% of the assignment mark.

5. Be warned that plagiarism between individuals is always obvious to the markers of

the assignment and can be easily detected by Turnitin.

6. Presentation of the assignment is part of the assignment. There will be 10% marks for

the presentation of your report and code submission.

7. Numbers with decimals should be reported to the third-decimal point.

Meeting Minutes

1. Each group is required to submit at least 3 meeting minutes as the appendix attached

to the final report. Templates will be provided for preparing meeting agendas and

meetings minutes. You may use the templates provided or a template you choose.

2. In case of a problem within a group we will request minutes of the previous meetings.

We can make an individual adjustment to the group mark if there is sufficient

2019S1 BUSS6002 2

evidence that a student has done very little. If the student has done nothing, we will

award a mark of zero.

Peer Review

1. If you encounter any issues with your group members, please report and discuss with

your unit coordinator as early as possible.

2. We may ask for peer review from each student within a group. The instructions about

how to do this will be released later on.

3. Each group will be awarded a group mark per the marking criteria. In special cases,

individual marks may be applied if there is dispute in a group and the quality/quantity

of contributions made by individuals are significantly different, in which cases the

unit coordinator will seek peer review reports from individuals within a group to

decide on individual marks.

Group Competition

We will allocate 10% marks for competition among groups. The winning group with the

highest test score will secure a full 10% mark, while other groups will be awarded a mark

according to their test scores relative to the best test score.

Project Description and Dataset

Today, e-commerce has revolutionized the way companies do business and how consumers

make purchasing decisions. It has become common practice for consumers to read online

reviews to inform their decision making. The increasing popularity of online reviews also

stimulates the business of “illegal” activities (e.g., fake review writing) that try to mislead

readers by producing deceptive reviews to influence readers’ opinions and decision making.

Typical examples include posting of undeserving positive reviews in order to promote a

product and/or by posting of fake negative reviews for competitors in order to damage their

reputation.

In this assignment, you are tasked with developing a data science technique that can

automatically detect fake reviews from a collection of reviews submitted to a leading online

shopping portal. You must summarise your findings in a report according to the task

instructions given below.

You are provided with two data files: review_train.csv and review_test.csv. The

files contain a collection of reviews submitted for a range of products and customers’

purchase information. Only review_train.csv contains Label values (target), where 1

indicates “fake” and 0 indicates “normal”. You should train and validate your model

building using review_train.csv and report the test score on review_test.csv.

It is your goal to identify fake reviews based on the text of reviews and other features

available in the data sets. The meanings of features are obvious according to their variable

names in the data sets. However, it may not be feasible to directly use these features

(especially reviews represented as raw text) to build a good classification model. One of your

tasks is to carefully extract or construct meaningful features as input for the modelling

process.

2019S1 BUSS6002 3

Tasks

Please note that most tasks are deliberately designed to be open ended as this is a real-world

problem.

Exploratory Data Analysis (EDA): You should conduct a thorough EDA for the given data

sets. For example, check/deal with missing data, visualise the distributions of features,

identify which features can better distinguish different target values, and perform feature

correlation analysis, etc. Carefully present your analysis and findings in your report.

Feature Engineering: You may consider feature engineering strategies to extract useful

features for your modeling task. Given that reviews are represented as raw text, you may

consider pre-processing raw text and transforming them into text features, e.g., unigrams,

bigrams, or TF-IDF. Carefully think about what features can be discriminative for

fake/normal reviews, for example, length of review, percentage of capitalized words, etc.

Justify your choice of feature engineering strategies and report your findings.

Benchmark Model: Build a logistic regression model to assess the feasibility of the project

and establish a baseline model. You may choose to split the given training data into a

training/validation set or use cross-validation to validate your model building. You need to

validate your model building against whatever hyperparameters apply to the model. For this

task, you can build your baseline model using simplistic text-based features. Document your

analysis and findings.

Improving your Model: You are requested to improve the performance of your benchmark

model as much as you can. You may consider the feature selection strategies, i.e., choosing

the top K most important features to re-build your model, or when reasonable, adding new

constructed features to re-build your model. Report your setting and comparisons with the

benchmark model.

Note: For this task, if you want to use a classification model that is not taught in this unit, you

must clearly explain the principle of the model, justify why you choose that model, and

present your analysis. For any model you choose, you need to validate your model building

against whatever hyperparameters apply.

Final Test Results: Finally, according to your analysis, decide your best model and apply the

model on the test data. You are asked to report the classification results on the test data. Save

your results into a csv file containing two columns, one for the Review indices (REVIEW_ID

from review_test.csv) and the other column for the predicted labels (0’s or 1’s). Name

your file as GroupID_BUSS6002_Assignment2_Test_Results.csv. The results will be

assessed by our markers in order to decide your group performance among the entire class

(group competition!).

Note that, we will use F1-score as the test score for group competition.

2019S1 BUSS6002 4

Presentation

The assignment material to be submitted will consist of a final report that:

1) Takes a research article form in which you shall have a number of sections

such as introduction, methodology, experiment results, findings/interpretation,

and conclusion. All references should be properly cited and take a full

bibliographical format. Here are a few examples

http://cs229.stanford.edu/proj2015/007_report.pdf

http://cs229.stanford.edu/proj2015/188_report.pdf

http://cs229.stanford.edu/proj2015/031_report.pdf

2) Details ALL steps and decisions taken by the group regarding requirements

above.

3) Demonstrates an understanding of the problem being addressed and the

relevant principles of data science techniques used.

4) Includes an executive summary to conclude your best performing model and

the recommended features for modeling.

5) Clearly and appropriately presents any relevant graphs and tables.

The report should be NOT more than 20 pages with font size no smaller than 11pt,

including everything like text, figures, tables, small sections of inserted code, etc., but

excluding the cover page and the appendix containing the meeting minutes. Think

about the best and most structured way to present your work, summarise the

procedures implemented, support your results/findings and prove the originality of

your work. You will provide your code as a separate submission to the report;

however, you may insert small sections of your code into the report when necessary.

Your code submission has no length limit, however marks are assigned for code

presentation, so make your code as concise as possible and add comments when

necessary to explain the functionality of your code segments. Make sure to remove

any unnecessary code and ensure that your code can be run without error.

Your group is required to submit at least 3 meetings minutes. Your group may use the

attached templates for preparing meeting agendas and minutes. Documentation should

include attendance, discussion points, actions decided, etc. You may use your own

form or find something online.

You, as a member of a group, may be also required to submit your peer review. Please

use the provided criteria sheet and assessment form for this purpose. You will be

advised how to use an online form when it becomes available.


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp