
Date: 2018-11-02 10:43

QBUS2820 Predictive Analytics

Semester 2, 2018

Assignment 2

Key information

Required submissions: Written report (Word or PDF format, submitted through Turnitin) and Jupyter Notebook (submitted through Ed). The group leader needs to submit both the written report and the Jupyter Notebook.

Due date: Saturday 3rd November 2018, 2pm (report and Jupyter notebook submission).

The late penalty for the assignment is 10% of the assigned mark per day, starting after 2pm on the due date. The closing date, Saturday 10th November 2018, 2pm, is the last date on which an assessment will be accepted for marking.

Weight: 30 out of 100 marks in your final grade.

Groups: You can complete the assignment in groups of up to three students. There are no

exceptions: if there are more than three you need to split the group.

Length: The main text of your report (including Task 1 and Task 2) should not exceed 20 pages. For Task 2 in particular, you should write a complete report. You may refer to Assignment 1, Task 2 for the structure of the report.

If you wish to include additional material, you can do so by creating an appendix. There is

no page limit for the appendix. Keep in mind that making good use of your audience’s time

is an essential business skill. Every sentence, table and figure has to count. Extraneous

and/or wrong material will reduce your mark no matter the quality of the assignment.

Anonymous marking: In line with the University's anonymous marking policy, please include only your student ID and group ID in the submitted report, and do NOT include your name. The file name of your report should use the following format. Replace "123" with your group SID. Example: Group123Qbus2820Assignment2S22018.

Presentation is part of the assignment. Markers might assign up to 10% of the mark for clarity of writing and presentation. Numbers with decimals should be reported to three decimal places.

Key rules:

Carefully read the requirements for each part of the assignment.

Please follow any further instructions announced on Canvas, particularly for submissions.

You must use Python for the assignment.

Reproducibility is fundamental in data analysis, so you are required to submit a Jupyter Notebook that generates your results. Unfortunately, Turnitin does not accept multiple files, so you will submit the notebook through Ed instead. Not submitting your code will lead to a loss of 50% of the assignment marks.

Failure to read information and follow instructions may lead to a loss of marks.

Furthermore, note that it is your responsibility to be informed of the University of Sydney

and Business School rules and guidelines, and follow them.

Referencing: Harvard Referencing System. (You may find the details at:

http://libguides.library.usyd.edu.au/c.php?g=508212&p=3476130)

Task 1 (35 Marks)

Part A: Logistic Regression (15 Marks)

Use Logistic Regression to predict the diagnosis of breast cancer patients on the Breast Cancer Wisconsin (Diagnostic) Dataset “wdbc.data”. See the section “About the dataset” for a detailed data description.

(a) Write Python code to load the data. For the target feature Diagnosis, recode the label M (malignant) as 1 and B (benign) as 0.

Then define and train a logistic regression model with intercept by using scikit-learn’s

LogisticRegression model, using default parameter values.

Based on the estimated parameters from your model, calculate the probability of sample ID

8510426 (20th sample) having a benign diagnosis.
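One way to approach part (a) is sketched below. It assumes that wdbc.data is a headerless CSV whose first two columns are the ID and the diagnosis, followed by the numeric features, as described in wdbc.names. The column names, the helpers load_wdbc and prob_benign, and the tiny synthetic file used for the demonstration are illustrative assumptions, not part of the assignment.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression


def load_wdbc(path_or_buffer):
    """Load the WDBC file and recode Diagnosis: M -> 1, B -> 0."""
    # Assumed layout: ID, Diagnosis, then the numeric features, no header row.
    df = pd.read_csv(path_or_buffer, header=None)
    n_features = df.shape[1] - 2
    df.columns = ["ID", "Diagnosis"] + [f"x{j}" for j in range(1, n_features + 1)]
    df["Diagnosis"] = df["Diagnosis"].map({"M": 1, "B": 0})
    return df


def prob_benign(model, df, sample_id):
    """P(benign) = P(y = 0 | x) for the row with the given ID."""
    row = df.loc[df["ID"] == sample_id].drop(columns=["ID", "Diagnosis"])
    # predict_proba columns follow model.classes_, so look up class 0.
    col = list(model.classes_).index(0)
    return model.predict_proba(row)[0, col]


# Tiny synthetic stand-in for wdbc.data (hypothetical values, 2 features only).
demo = io.StringIO(
    "1,M,20.1,30.2\n2,B,10.3,12.1\n3,M,19.5,28.7\n"
    "4,B,11.0,13.4\n5,B,9.8,11.9\n6,M,21.2,31.0\n"
)
df = load_wdbc(demo)
X = df.drop(columns=["ID", "Diagnosis"])
y = df["Diagnosis"]

model = LogisticRegression()  # default parameters, intercept included
model.fit(X, y)
p = prob_benign(model, df, sample_id=2)
print(f"P(benign) for ID 2: {p:.3f}")
```

With the real file, the same helpers would be called as load_wdbc("wdbc.data") and prob_benign(model, df, sample_id=8510426).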

(b) Based on slides 26 to 31 of Lecture 9, write your own Python code to implement the gradient ascent algorithm for logistic regression with intercept.
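For reference, a standard statement of the logistic regression log-likelihood, its gradient and the gradient ascent update is given here. The notation is an assumption and may differ in symbols or scaling from the lecture slides: x_i includes a leading 1 for the intercept, p_i is the logistic function evaluated at x_i^T β, and α is the learning rate.

```latex
\ell(\boldsymbol{\beta})
  = \sum_{i=1}^{n} \left[ y_i \, \mathbf{x}_i^\top \boldsymbol{\beta}
    - \log\left(1 + e^{\mathbf{x}_i^\top \boldsymbol{\beta}}\right) \right],
\qquad
\nabla \ell(\boldsymbol{\beta})
  = \sum_{i=1}^{n} (y_i - p_i)\, \mathbf{x}_i
  = \mathbf{X}^\top (\mathbf{y} - \mathbf{p}),
\qquad
\boldsymbol{\beta}^{(t+1)}
  = \boldsymbol{\beta}^{(t)} + \alpha \, \nabla \ell\left(\boldsymbol{\beta}^{(t)}\right)
```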

You may use the following defined logistic function.

import numpy as np

def logistic_function(reg_input):
    # Logistic (sigmoid) function: exp(z) / (1 + exp(z)).
    return np.exp(reg_input) / (1 + np.exp(reg_input))

Using the given data, write Python code that starts from the initial values β = [0, 0, …, 0] and runs the gradient ascent algorithm to maximize the log-likelihood function of logistic regression with respect to the parameters.

Find the optimal learning rate and the resulting estimate of β. Then redo task (a): calculate the probability of sample ID 8510426 (20th sample) having a benign diagnosis. Compare the results and explain the major reasons why your answers may differ from those of scikit-learn.

Now change the initial values to β = [1, 1, …, 1], redo the above tasks, and report your results and findings.
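A minimal sketch of batch gradient ascent for part (b) follows. It is not the lecture's exact algorithm. The gradient is divided by the sample size, the learning rate and iteration count are arbitrary, and the data are synthetic. All of these are assumptions made for illustration, as are the helper names log_likelihood and gradient_ascent.

```python
import numpy as np


def logistic_function(reg_input):
    return np.exp(reg_input) / (1 + np.exp(reg_input))


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of logistic regression."""
    z = X @ beta
    # log(1 + exp(z)) computed stably with logaddexp.
    return np.sum(y * z - np.logaddexp(0.0, z))


def gradient_ascent(X, y, beta_init, learning_rate=0.1, n_iter=5000):
    """Maximise the log-likelihood by plain (batch) gradient ascent."""
    beta = np.asarray(beta_init, dtype=float).copy()
    history = []
    for _ in range(n_iter):
        p = logistic_function(X @ beta)
        gradient = X.T @ (y - p)  # gradient of the log-likelihood
        # Averaged step (assumption): dividing by n keeps the step size stable.
        beta = beta + learning_rate * gradient / len(y)
        history.append(log_likelihood(beta, X, y))
    return beta, np.array(history)


# Synthetic demonstration data (hypothetical, not the WDBC data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
true_beta = np.array([-0.5, 2.0, -1.0])
X = np.column_stack([np.ones(len(features)), features])  # column of ones = intercept
y = (rng.random(200) < logistic_function(X @ true_beta)).astype(float)

beta_zero, ll_zero = gradient_ascent(X, y, beta_init=np.zeros(3))
beta_one, ll_one = gradient_ascent(X, y, beta_init=np.ones(3))
print(beta_zero, beta_one)
```

On the real data the features have very different scales, so np.exp may overflow unless the features are standardised or a smaller learning rate is used. When comparing with scikit-learn, note that LogisticRegression applies L2 regularisation by default (C=1.0), whereas this sketch maximises the unpenalised likelihood.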

About the dataset:

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names

Part B: YouTube Comment Spam Classification (20 Marks)

Some questions in Task 2 require some self-learning, e.g. exploring how to build features for text data using a bag of words. You should discuss with your group members how to approach the problem and do the necessary self-learning, which is an important ability for your future study and career.

Your goal is to build a Random Forest (RF) classifier that classifies whether a YouTube comment is spam or not.

Use the ytube_spam dataset. We have already split the data into train and test sets:

"ytube_spam_trainset.csv" and "ytube_spam_testset.csv".

General instructions:

1. "CLASS" in the data is the target variable y.

2. Use 3-fold cross validation where needed.

3. Make sure to set your random number generator seed to 0 for this question: "np.random.seed(0)".

(a) Self-study and use the following Python package:

from sklearn.feature_extraction.text import TfidfVectorizer

Build a bag of words representation of the data with:

Max 1000 features

Remove the top 1% of frequently occurring words

A word must occur at least twice to be included as a feature

Remove common English words
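The four requirements above could map onto TfidfVectorizer arguments as sketched below. The mapping is an interpretation, not a confirmed solution: max_df and min_df count the documents a term appears in, not its total occurrences, so max_df=0.99 and min_df=2 are assumptions about what "top 1%" and "at least twice" mean. The mini-corpus is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the comment text column.
comments = [
    "check out my channel please subscribe",
    "subscribe to my channel for free gifts",
    "great song love it",
    "love this song so much",
    "free gifts click the link",
]

vectorizer = TfidfVectorizer(
    max_features=1000,     # keep at most 1000 terms
    max_df=0.99,           # drop terms found in more than 99% of documents
    min_df=2,              # keep terms found in at least 2 documents
    stop_words="english",  # remove common English words
)
X_train = vectorizer.fit_transform(comments)
vocab = sorted(vectorizer.vocabulary_)
print(X_train.shape, vocab)
```

The fitted vectorizer would then be applied to the test comments with vectorizer.transform, so that both sets share the same vocabulary.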

(b) Build a random forest classifier and use cross validation to optimise the parameters of the

random forest. You need to at least optimise the number of trees in the random forest and can

explore and optimise other parameters as well.

Use the following Python packages:

from sklearn import ensemble

from sklearn.model_selection import GridSearchCV

Using the optimal parameter values selected by cross validation, re-train the RF on the full training set to produce your best performing model.

Test your best performing model on the test set. You must achieve an average score ("avg / total") of at least 0.96 for precision, recall and f1-score in the output of sklearn's classification_report.

Report the classification_report output.
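The tuning and evaluation workflow might look like the sketch below. The data are synthetic stand-ins for the TF-IDF features and the CLASS labels, and the parameter grid is illustrative only. Both are assumptions, so the scores printed here say nothing about the spam data.

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(0)

# Synthetic stand-in for the features and labels (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

param_grid = {"n_estimators": [10, 50, 100]}  # illustrative grid only
search = GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                     # 3-fold cross validation
    return_train_score=True,  # stores mean_train_score in cv_results_
    refit=True,               # re-train on the full training set with the best parameters
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_
report = classification_report(y_test, best_rf.predict(X_test), digits=3)
print(search.best_params_)
print(report)
```

Recent scikit-learn releases label the summary rows "macro avg" and "weighted avg" instead of "avg / total", so the row to report may depend on the installed version.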

(c) Based on your cross validation results from GridSearchCV, plot the "mean_test_score"

and "mean_train_score" vs number of trees on the same Figure.

If you optimised other parameters, then fix these parameters to their optimal values.
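A possible way to draw the figure for part (c) is sketched here. It assumes the grid search was run with return_train_score=True and varies only n_estimators. The data are synthetic and the grid values are arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Hypothetical data in place of the TF-IDF features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

search = GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 20, 50]},
    cv=3,
    return_train_score=True,
)
search.fit(X, y)

# One entry per grid point, in the order the grid was evaluated.
trees = [params["n_estimators"] for params in search.cv_results_["params"]]

fig, ax = plt.subplots()
ax.plot(trees, search.cv_results_["mean_test_score"], marker="o", label="mean_test_score")
ax.plot(trees, search.cv_results_["mean_train_score"], marker="o", label="mean_train_score")
ax.set_xlabel("Number of trees (n_estimators)")
ax.set_ylabel("Mean CV accuracy")
ax.legend()
```

If other parameters were tuned as well, the rows of cv_results_ would first need to be filtered to those where the other parameters equal their optimal values.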

(d) Report your random forest settings that achieve the best classification.

(e) Produce a histogram of the depths of the trees of your best performing model.
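The depth of each fitted tree can be read from the forest's estimators_ list, as in this sketch. The forest here is trained on synthetic data and stands in for the best performing model.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data and settings in place of the tuned model.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
best_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each element of estimators_ is a fitted decision tree.
depths = [tree.tree_.max_depth for tree in best_rf.estimators_]

fig, ax = plt.subplots()
ax.hist(depths, bins=range(min(depths), max(depths) + 2), align="left", rwidth=0.9)
ax.set_xlabel("Tree depth")
ax.set_ylabel("Number of trees")
```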

(f) Report the top 10 most important text features of your best performing model.
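One way to list the most important terms is to pair the forest's feature_importances_ with the vectorizer's vocabulary, as sketched below. The comments, labels and model settings are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical comments and labels (1 = spam, 0 = not spam).
comments = [
    "subscribe to my channel for free gifts",
    "check out my new video and subscribe",
    "this song is amazing",
    "love the guitar solo in this song",
    "free gift cards click here now",
    "best music video of the year",
    "visit my channel and win free prizes",
    "her voice is beautiful",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(comments)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

# Order terms by column index so they line up with feature_importances_.
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))
top = np.argsort(rf.feature_importances_)[::-1][:10]
for term, score in zip(terms[top], rf.feature_importances_[top]):
    print(f"{term:<12} {score:.3f}")
```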

Task 2 (25 Marks)

1. Problem description

Rossmann is a German drug store chain that operates over 3000 stores in 7 European

countries. In this assignment, you will use “Rossman_Sales.csv” data to forecast six weeks

of daily sales following the last period in the dataset.

Your objective in this assignment is to develop univariate forecasting models, i.e. models that use only the historical sales, to address this problem.

We focus on the sales forecasting of store 1. You can download the dataset

“Rossman_Sales.csv” from Canvas.

2. Report and requirements

a. The purpose of the report is to discuss the business context, exploratory data analysis, methodology, model diagnostics and model validation, and to present forecasts and conclusions for six weeks of daily sales following the last period in the dataset.

b. Your group must identify at least 1 simple benchmark model and at least 2 different forecasting methods or models that can be used to forecast sales.

c. The report should also include an analysis of monthly sales (with the limitation that the sample size is small at this frequency).
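As an illustration of a simple benchmark, the sketch below implements a seasonal naive forecast that repeats the last observed week over a 42-day horizon, validates it on a held-out six-week window, and aggregates the series to monthly totals. The weekly period, the RMSE metric, the helper names and the synthetic series are all assumptions. The real data would come from “Rossman_Sales.csv”.

```python
import numpy as np
import pandas as pd

HORIZON = 42  # six weeks of daily sales


def seasonal_naive(series, horizon=HORIZON, period=7):
    """Benchmark: repeat the last observed week over the forecast horizon."""
    last_cycle = series.to_numpy()[-period:]
    reps = int(np.ceil(horizon / period))
    index = pd.date_range(
        series.index[-1] + pd.Timedelta(days=1), periods=horizon, freq="D"
    )
    return pd.Series(np.tile(last_cycle, reps)[:horizon], index=index)


def rmse(actual, forecast):
    """Root mean squared error between two equally long sequences."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2)))


# Synthetic daily sales with a weekly pattern (hypothetical, not the Rossmann data).
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=210, freq="D")
weekly = np.array([5000, 4800, 4700, 4600, 5200, 5500, 0])  # zero = closed day
sales = pd.Series(weekly[np.arange(210) % 7] + rng.normal(0, 100, 210), index=dates)

# Hold out the last six weeks to validate the benchmark before forecasting ahead.
train, valid = sales[:-HORIZON], sales[-HORIZON:]
print("validation RMSE:", round(rmse(valid, seasonal_naive(train)), 3))

forecast = seasonal_naive(sales)      # six weeks beyond the last observation
monthly = sales.resample("MS").sum()  # monthly totals for the monthly analysis
```

The other forecasting methods can be scored on the same hold-out window so that they are compared against the benchmark on equal terms.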

3. Further analysis for bonus marks

The group can earn up to 2 bonus marks (in the final mark for the unit) by developing a system to automatically generate forecasts for all stores. In order to obtain the bonus marks, you should present interesting results based on this tool (use the appendix and refer to it in the main text of the report). The ability to summarise information and be concise is essential here.
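A system of this kind could be as simple as applying one forecasting function to each store in turn. The sketch below assumes long-format data with columns named Store, Date and Sales. These names are hypothetical and may not match the actual file, and the seasonal naive rule stands in for whichever model the group selects.

```python
import numpy as np
import pandas as pd


def seasonal_naive(values, horizon=42, period=7):
    """Repeat the last `period` observations over the horizon."""
    reps = int(np.ceil(horizon / period))
    return np.tile(np.asarray(values)[-period:], reps)[:horizon]


def forecast_all_stores(df, horizon=42):
    """Return one row per store and one column per forecast step."""
    forecasts = {
        store: seasonal_naive(group.sort_values("Date")["Sales"], horizon)
        for store, group in df.groupby("Store")
    }
    return pd.DataFrame.from_dict(
        forecasts, orient="index", columns=[f"h{h}" for h in range(1, horizon + 1)]
    )


# Hypothetical long-format data: columns Store, Date, Sales (names assumed).
dates = pd.date_range("2015-01-01", periods=28, freq="D")
df = pd.DataFrame(
    {
        "Store": np.repeat([1, 2, 3], len(dates)),
        "Date": np.tile(dates, 3),
        "Sales": np.arange(3 * len(dates), dtype=float),
    }
)
all_forecasts = forecast_all_stores(df)
print(all_forecasts.shape)
```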

