Date: 2023-03-27 08:41

COMP5318/COMP4318 – Machine Learning and Data Mining Semester 1, 2023

Assignment 1: Classification

Key information

Deadlines

Submission: 11:59pm, 6 April, 2023 (Thursday week 7, Sydney time)

Late submissions policy

Submissions up to 3 days late are accepted, with a penalty of 5% per day late. Assignments

more than 3 days late will not be accepted (i.e. will get 0 marks). The daily cut-off time is 11:59pm.

Marking

This assignment is worth 15 marks = 15% of your final mark.

Your code will be marked for correctness. A few marks will be allocated for style – meaningful variable

names and comments.

We will run your code. If it doesn’t run, you will get 0 marks for the parts that don’t run.

The assignment should be completed in pairs (groups of 2 students). No more than 2 students are allowed.

Submission

This assignment must be written in Python in the Jupyter Notebook environment. A Jupyter Notebook

template is provided. Your implementation should use the same suite of libraries that we have used during

the tutorials, such as scikit-learn, numpy and pandas.

The assignment will be submitted in Canvas.

Submission instructions:

Before you submit, you need to create a group. In Canvas -> “People”, select one of these two tabs:

“A1part1” or “A1part2”. Choose one of the empty groups listed and join it. Both you and your

partner must join the same group. Groups have a maximum of 2 members.

When you are ready to submit your assignment, you need to submit it on behalf of the group in

the corresponding submission box. You need to submit two versions of your code: ipynb and pdf.

Only one student from the group needs to submit, not both.

In summary:

o If you have registered your group under "A1part1", submit your ipynb code in "Submission:

Assignment 1 ipynb for A1part1 groups" and your pdf in "Submission: Assignment 1 pdf for

A1part1 groups"

o If you have registered your group under "A1part2", submit your ipynb code in

"Submission: Assignment 1 ipynb for A1part2 groups" and your pdf in "Submission:

Assignment 1 pdf for A1part2 groups"

It is important to follow the submission instructions carefully as otherwise your mark may not be recorded

correctly.

We had to create two options (A1part1 and A1part2) and two submission boxes because of the limitations

of Canvas for the number of groups it allows.

File names and student names

The submission files should be named like this: a1-SID1-SID2.ipynb (.pdf), where SID1 and SID2 are

the SIDs of the two students.

In the Jupyter Notebook, include only your SIDs (as shown in the template) and not your name. The

marking is anonymous.

Task

In this assignment you will investigate a real dataset by implementing multiple classification algorithms.

You will first pre-process the dataset by replacing missing values and normalising the dataset with a min-

max scaler. You will then evaluate the performance of multiple classification algorithms: K-Nearest

Neighbour, Logistic Regression, Naïve Bayes, Decision Tree, Support Vector Machine, Bagging, AdaBoost,

Gradient Boosting and Random Forest, using the stratified 10-fold cross-validation method. You will also

apply a grid search to find the best parameters for some of these classifiers.

1. Data loading, pre-processing and printing

The dataset for this assignment is the Breast Cancer Wisconsin dataset. It contains 699 examples described by 9

numeric attributes. There are two classes – class1, corresponding to benign breast cancer tumours, and

class2, corresponding to malignant breast cancer tumours. The features are computed from a digitized

image of a biopsy sample of breast tissue for a subject.

The dataset should be downloaded from Canvas: breast-cancer-wisconsin.csv. This file includes the

attribute (feature) headings and each row corresponds to one individual. Missing attributes in the dataset

are recorded with a ‘?’.

You will need to pre-process the dataset, before you can apply the classification algorithms. Three types of

pre-processing are required: filling in the missing values, normalisation and changing the class values. After

this is done, you need to print the first 10 rows of the pre-processed dataset.

1. Filling in the missing attribute values - The missing attribute values should be replaced with

the mean value of the column using sklearn.impute.SimpleImputer.

2. Normalising the data - Each attribute should be normalised to the range [0,1] using a

min-max scaler, sklearn.preprocessing.MinMaxScaler.

3. Changing the class values - The classes class1 and class2 should be changed to 0 and 1

respectively.

4. Print the first 10 rows of the pre-processed dataset. The feature values should be formatted to

4 decimal places using .4f, the class value is an integer. A function print_data has been provided

in the template to help you achieve this.


For example, if your normalised data looks like this:

(You need to print the first 10 rows not the first 3.)

Please note that we will test your code with another dataset, and your pre-processing should be written

with this in mind. See the “Marking Criteria” section for more detail.
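The pre-processing steps above can be illustrated on a tiny made-up table. This is only a sketch under stated assumptions: the column names and values are invented, the printed layout is a guess (the template's print_data defines the real one), and pd.to_numeric with errors="coerce" is one possible way to turn '?' into NaN (it also turns any other non-numeric token into NaN).

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: numeric features, '?' for missing values,
# class label in the last column (as the spec describes).
raw = pd.DataFrame({
    "feat_a": ["1", "5", "?", "9"],
    "feat_b": ["2", "?", "6", "10"],
    "class": ["class1", "class2", "class1", "class2"],
})

# Select columns by position, not by name, so nothing is hard-coded.
# Non-numeric tokens such as '?' become NaN.
features = raw.iloc[:, :-1].apply(pd.to_numeric, errors="coerce")
labels = raw.iloc[:, -1].map({"class1": 0, "class2": 1}).to_numpy()

# Replace NaN with the column mean, then rescale each column to [0, 1].
imputed = SimpleImputer(strategy="mean").fit_transform(features)
scaled = MinMaxScaler().fit_transform(imputed)

# Assumed output layout: features to 4 decimal places, class as an integer.
for row, label in zip(scaled, labels):
    print(",".join(f"{value:.4f}" for value in row) + f",{label}")
```

In this toy table the missing entry in feat_a is filled with the column mean 5, so after scaling the column becomes 0, 0.5, 0.5, 1.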

2. Defining functions for the classification algorithms


Part 1: Cross-validation without parameter tuning

You will now apply multiple classifiers to the pre-processed dataset, in particular: Nearest Neighbor,

Logistic Regression, Naïve Bayes, Decision Tree, Bagging, Ada Boost and Gradient Boosting. All classifiers

should use the sklearn modules from the tutorials. All random states in the classifiers should be set to

random_state=0.

You need to evaluate the performance of these classifiers using 10-fold stratified cross-validation from

sklearn.model_selection.StratifiedKFold with these options:

cvKFold=StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

You will need to pass cvKFold (the stratified folds) as an argument when calculating the cross-validation

accuracy, not cv=10 as in the tutorials. This ensures that random_state=0.

For each classifier, write a function that accepts the required input and returns the average cross-validation

score:

def exampleClassifier(X, y, [options]):
    return scores.mean()

where X contains the attribute values and y contains the class (as in the tutorial exercises).
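As a rough illustration of passing the stratified folds to the scorer, the sketch below runs cross_val_score on synthetic data. The data from make_classification, the names X_demo and y_demo, and the choice of GaussianNB are assumptions made for the example, not part of the assignment template.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the pre-processed data (hypothetical).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# The fold object carries shuffle and random_state; cv=10 alone would not.
cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold; the mean is the reported score.
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cvKFold)
print(f"{scores.mean():.4f}")
```

Passing the fold object instead of an integer fixes the shuffling seed, so repeated runs produce the same folds and the same mean accuracy.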

More specifically, the headers of the functions for the classifiers are given below:

Logistic Regression

def logregClassifier(X, y):
    return scores.mean()

It should use LogisticRegression from sklearn.linear_model.

Naïve Bayes

def nbClassifier(X, y):
    return scores.mean()

It should use GaussianNB from sklearn.naive_bayes

Decision Tree

def dtClassifier(X, y):
    return scores.mean()

It should use DecisionTreeClassifier from sklearn.tree, with information gain (the entropy criterion)

Ensembles: Bagging, Ada Boost and Gradient Boosting

def bagDTClassifier(X, y, n_estimators, max_samples, max_depth):
    return scores.mean()

def adaDTClassifier(X, y, n_estimators, learning_rate, max_depth):
    return scores.mean()

def gbClassifier(X, y, n_estimators, learning_rate):
    return scores.mean()


These functions should implement Bagging, Ada Boost and Gradient Boosting using BaggingClassifier,

AdaBoostClassifier and GradientBoostingClassifier from sklearn.ensemble. Bagging and

Ada Boost should combine decision trees and use information gain.
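To show how an ensemble can wrap an entropy-based decision tree, here is a minimal sketch on synthetic data. The helper entropy_tree, the data, and the small parameter values are all hypothetical and chosen only to keep the example fast. The base tree is passed positionally because the keyword for it changed from base_estimator to estimator in newer scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)


def entropy_tree(max_depth):
    """Decision tree that splits on information gain (entropy criterion)."""
    return DecisionTreeClassifier(
        criterion="entropy", max_depth=max_depth, random_state=0
    )


# Illustrative settings only; max_samples as an int is a number of samples.
ensembles = {
    "Bagging": BaggingClassifier(
        entropy_tree(3), n_estimators=20, max_samples=100, random_state=0
    ),
    "AdaBoost": AdaBoostClassifier(
        entropy_tree(3), n_estimators=20, learning_rate=0.5, random_state=0
    ),
    "GB": GradientBoostingClassifier(
        n_estimators=20, learning_rate=0.5, random_state=0
    ),
}

ensemble_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv_demo).mean()
    for name, model in ensembles.items()
}
for name, score in ensemble_scores.items():
    print(f"{name} average cross-validation accuracy: {score:.4f}")
```

GradientBoostingClassifier builds its own regression trees internally, which is why only Bagging and AdaBoost receive the entropy tree here.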


Part 2: Cross-validation with parameter tuning

For three other classifiers, K-Nearest Neighbour, SVM and Random Forest, we would like to find the best parameters using grid

search with 10-fold stratified cross-validation (GridSearchCV in sklearn).

The data should be split into training and test subsets using train_test_split from

sklearn.model_selection with stratification and random_state=0 (as in the tutorials but with

random_state=0).

You will need to pass cvKFold (the stratified folds) as an argument to GridSearchCV, not cv=10 as in the

tutorials. This ensures that random_state=0.
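The split-then-search workflow described above might look roughly like the following. This is a sketch on synthetic data with a deliberately small, made-up parameter grid; the variable names are assumptions, and the default test size of train_test_split is used because the spec does not state one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Stratified split keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Illustrative grid, smaller than the ones listed in the spec.
param_grid = {"n_neighbors": [1, 3, 5], "p": [1, 2]}

# The search only sees the training subset; the test subset stays held out.
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cvKFold)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_cv_accuracy = grid_search.best_score_
test_accuracy = grid_search.score(X_test, y_test)

print(best_params, f"{best_cv_accuracy:.4f}", f"{test_accuracy:.4f}")
```

With the default refit=True, GridSearchCV retrains the best configuration on the whole training subset, so score on the test subset evaluates that refitted model.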

Write the following functions:

K-Nearest Neighbour

def bestKNNClassifier(X, y):
    return (appropriate values so that the required printing can be done)

It should use the KNeighborsClassifier from sklearn.neighbors.

The grid search should consider the following values for the parameters k(n_neighbors) and p:

k = {1, 3, 5, 7, 9}

p = {1, 2}

The function should return appropriate values, so that best parameters found, the best cross-validation

accuracy and the test set accuracy can be printed when calling this function, see the next section.

SVM

def bestSVMClassifier(X, y):
    return (appropriate values so that the required printing can be done)

It should use SVC from sklearn.svm with kernel set to ‘rbf’.

The grid search should consider the following values for the parameters C and gamma:

C = {0.01, 0.1, 1, 5, 15}

gamma = {0.01, 0.1, 1, 10, 50}

The function should return appropriate values, so that best parameters found, the best cross-validation

accuracy and the test set accuracy can be printed when calling this function, see the next section.

Random Forest

def bestRFClassifier(X, y):
    return (appropriate values so that the required printing can be done)

It should use RandomForestClassifier from sklearn.ensemble with information gain and max_features

set to ‘sqrt’.

The grid search should consider the following values for the parameters n_estimators and max_leaf_nodes:

n_estimators = {10, 30, 60, 100, 150}

max_leaf_nodes = {6, 12, 18}

The function should return appropriate values, so that best parameters found, the best cross-validation

accuracy, the test set accuracy, the macro average F1 score, and the weighted average F1 score can be

printed when calling this function, see the next section.
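The difference between the macro average and the weighted average F1 score can be seen on a small hand-made example. The label vectors below are invented for illustration; f1_score from sklearn.metrics is one way to compute both averages.

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels: four examples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

# Per-class F1: class 0 = 6/7 (about 0.8571), class 1 = 0.8.
# Macro: plain mean of the per-class scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted: mean of the per-class scores weighted by class support (4 and 2).
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"{macro_f1:.4f}")     # 0.8286
print(f"{weighted_f1:.4f}")  # 0.8381
```

The two averages coincide when the classes are balanced and drift apart as the class sizes become more unequal.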

3. Running the classifiers and printing the results

Run the classifiers from the previous section on the pre-processed dataset and print the results.

For Part 1 of this assignment, set the parameters as follows (this is already done for you in the template):


#Bagging

bag_n_estimators = 60

bag_max_samples = 100

bag_max_depth = 6


#AdaBoost

ada_n_estimators = 60

ada_learning_rate = 0.5

ada_bag_max_depth = 6


#GB

gb_n_estimators = 60

gb_learning_rate = 0.5

The printing should look like this but with the correct numbers (these are random numbers):

LR average cross-validation accuracy: 0.8123

NB average cross-validation accuracy: 0.7543

DT average cross-validation accuracy: 0.6345

Bagging average cross-validation accuracy: 0.8765

AdaBoost average cross-validation accuracy: 0.7165

GB average cross-validation accuracy: 0.9054



KNN best k: 7

KNN best p: 2

KNN cross-validation accuracy: 0.7853

KNN test set accuracy: 0.5991


SVM best C: 0.0100

SVM best gamma: 10.0000

SVM cross-validation accuracy: 0.8676

SVM test set accuracy: 0.8098


RF best n_estimators: 10

RF best max_leaf_nodes: 16

RF cross-validation accuracy: 0.8600

RF test set accuracy: 0.8321

RF test set macro average F1: 0.8123

RF test set weighted average F1: 0.8261

Format all numbers to 4 decimal places using .4f, except k, p, n_estimators and max_leaf_nodes which

should be formatted as integers.
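One possible way to apply this formatting rule is a small helper that prints integers as they are and everything else with .4f. The function name format_result is hypothetical and not part of the template; the values are the placeholder numbers from the sample output above.

```python
import numbers


def format_result(label, value):
    """Format integers as-is and all other numbers to 4 decimal places."""
    # numbers.Integral also covers NumPy integer types.
    if isinstance(value, numbers.Integral):
        return f"{label}: {value}"
    return f"{label}: {value:.4f}"


print(format_result("KNN best k", 7))           # KNN best k: 7
print(format_result("SVM best C", 0.01))        # SVM best C: 0.0100
print(format_result("SVM best gamma", 10.0))    # SVM best gamma: 10.0000
```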

Academic honesty – very important

Please read the University policy on Academic Honesty very carefully:

https://sydney.edu.au/students/academic-integrity.html

Plagiarism (copying from another student, website or other sources), making your work available to

another student to copy, engaging another person to complete the assignments instead of you (for

payment or not) are all examples of academic dishonesty. Note that when there is copying between

students, both students are penalised – the student who copies and the student who makes his/her work

available for copying.

The University penalties are severe and include: 1) a permanent record of academic dishonesty on your

student file, 2) mark deduction, ranging from 0 for the assignment to Fail for the course and 3) expulsion

from the University and cancelling of your student visa.

If there is a suspected case, the investigation may take several months. Your mark will not be finalised

until the investigation is completed. This may create problems enrolling in other courses next semester

(COMP5318 is a pre-requisite for many courses) or delaying your graduation. Going through the

investigation is also very stressful.

In addition, the Australian Government passed new legislation (Prohibiting Academic Cheating Services

Bill) that makes it a criminal offence to provide or advertise academic cheating services - the provision or

undertaking of work for students which forms a substantial part of a student’s assessment task.

Do not confuse legitimate co-operation with cheating! You can discuss the assignment with other students

but your group must write your own code.

We will use similarity detection software. If you cheat, the chances that you will be caught are very high.

Do not even think about engaging in plagiarism or academic dishonesty, it is not worth it. Be smart and

don’t risk your future by engaging in plagiarism and academic dishonesty!

Marking Criteria

The marking rubric is provided in Canvas.

Please note that we will test your program on another dataset. It will have the same format as the breast

cancer dataset but a different number of features and examples, and different names of the features. You

may assume the class value will be in the last column and there will be two classes as in the breast cancer

dataset. The missing values may be anywhere, not only in a single column as in the breast cancer dataset.

Hence, do not hard-code the number of features and examples - do not set them to 699 and 9 as in the

breast cancer dataset, and do not make assumptions that the missing values will be in a column with a

specific name.

To test your code before submission, we have made available another dataset (test-before.csv), with the

correct results.

