Date: 2021-03-06 11:03

ECMM422 Machine Learning

Course Assessment 1

This course assessment (CA1) represents 40% of the overall module assessment.

This is an individual exercise and your attention is drawn to the College and University

guidelines on collaboration and plagiarism, which are available from the College website.

Note:

  • do not change the name of this notebook, i.e. the notebook file has to be named: ca1.ipynb
  • do not remove/delete any cell
  • do not add any cell (you can work on a draft notebook and only copy the function implementations here)
  • do not add your name or student code in the notebook or in the file name

Evaluation criteria:

Each question asks for one or more functions to be implemented.

Each question is awarded a number of marks.

A (hidden) unit test is going to evaluate if all desired properties of the required function(s) are

met.

If the test passes, all the associated marks are awarded; if it fails, 0 marks are awarded. The large

number of questions allows a fine grading.

Notes:

In the rest of the notebook, the term data matrix refers to a two dimensional numpy array

where instances are encoded as rows, e.g. a data matrix with 100 rows and 4 columns is to be

interpreted as a collection of 100 instances each with four features.

When a required function can be implemented directly by a library function it is intended that

the candidate should write her own implementation of the function, e.g. a function to compute

the accuracy or the cross validation.

Some questions are just a check-point, i.e. it is for you to see that you are correctly

implementing all functions. Since those check-points use functions that you have already

implemented and that have already been marked, those questions are not going to be marked

(i.e. they appear as having marks 0).

In [ ]: %matplotlib inline

import matplotlib.pyplot as plt

import numpy as np

2021/3/2 1

localhost:8888/nbconvert/html/Downloads/1.ipynb?download=false 2/16

Question 1 [marks 6]

a) Make a function data_matrix = make_data_classification(mean, std,

n_centres, inner_std, n_samples, random_seed=42) to create a data matrix

according to the following rules:

  • mean is an n-dimensional vector (say [1,1], but the function should allow vectors of any dimension)
  • n_centres is the number of centres (say 3)
  • std is the standard deviation (say 1)
  • the centres are sampled from a Normal distribution with mean mean and standard deviation std
  • from each centre, sample n_samples instances from a Normal distribution with the centre as the mean and standard deviation inner_std. So if mean=[1,1], n_centres=3 and n_samples=10, then the data matrix will be a 30 rows x 2 columns numpy array.

b) Make a function data_matrix, targets = make_data_regression(mean, std,

n_centres, inner_std, n_samples, random_seed=42) to create a data matrix

and a target vector according to the following rules:

  • the data matrix is constructed in the same way as in make_data_classification
  • the targets are the Euclidean distance between the sample and the centre of the generating Normal distribution

See Question 3 for a graphical example of the expected output.
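The generative rules above can be sketched as follows. This is only an illustration under stated assumptions: the helper name sample_clusters is hypothetical, it returns both the data matrix and the distance targets in one call, and the way the random generator is seeded and consumed is a guess (the hidden unit tests may expect a different order of random draws).

```python
import numpy as np

def sample_clusters(mean, std, n_centres, inner_std, n_samples, random_seed=42):
    """Hypothetical helper: draw centres, then points around each centre."""
    rng = np.random.RandomState(random_seed)
    mean = np.asarray(mean, dtype=float)
    dim = mean.shape[0]
    # Centres ~ Normal(mean, std), one row per centre.
    centres = rng.normal(loc=mean, scale=std, size=(n_centres, dim))
    # For each centre, n_samples points ~ Normal(centre, inner_std).
    blocks = [rng.normal(loc=c, scale=inner_std, size=(n_samples, dim))
              for c in centres]
    data_matrix = np.vstack(blocks)
    # Regression-style targets: Euclidean distance of each point to its own centre.
    own_centres = np.repeat(centres, n_samples, axis=0)
    targets = np.linalg.norm(data_matrix - own_centres, axis=1)
    return data_matrix, targets

X, y = sample_clusters(mean=[1, 1], std=1, n_centres=3, inner_std=0.1, n_samples=10)
print(X.shape, y.shape)  # (30, 2) (30,)
```

With mean=[1,1], n_centres=3 and n_samples=10 the sketch yields the 30 rows x 2 columns array described above.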

Question 2 [marks 2]

import scipy as sp

# unit test utilities: you can ignore these functions

def is_approximately_equal(test,target,eps=1e-2):
    return np.mean(np.fabs(np.array(test) - np.array(target)))<eps

def assert_test_equality(test, target):
    assert is_approximately_equal(test, target), 'Expected:\n %s \nbut got:\n %s

In [ ]:

def make_data_classification(mean, std, n_centres, inner_std, n_samples, random_
    # YOUR CODE HERE
    raise NotImplementedError()

def make_data_regression(mean, std, n_centres, inner_std, n_samples, random_seed
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

a) Make a function data_matrix, targets =

get_dataset_classification(n_samples, std, inner_std) to create a data matrix

and a target vector for a binary classification problem according to the following rules:

  • the instances from the positive class are generated according to the same rules provided for make_data_classification; so are the instances from the negative class
  • instances from the positive class have as mean the vector [10,10] and those from the negative class, the vector [-10,-10]
  • the number of centres is fixed to 3
  • the random seed is fixed to 42
  • n_samples indicates the total number of instances finally available in the output data_matrix

b) Make a function data_matrix, targets = get_dataset_regression(n_samples,

std, inner_std) to create a data matrix according to the following rules:

  • the instances are generated according to the same rules provided for make_data_regression
  • the targets are generated according to the same rules provided for make_data_regression
  • instances have as mean the vector [10,10]
  • the number of centres is fixed to 3
  • the random seed is fixed to 42
  • n_samples indicates the total number of instances finally available in the output data_matrix

Question 3 [marks 1]

Make a function plot(X,y) to display the scatter plot of a data matrix of two dimensional

instances using the array y to assign the colour to the instances.

When running

X, y = get_dataset_regression(n_samples=600, std=30, inner_std=5)

plot(X,y)

you should get something like

In [ ]:

def get_dataset_classification(n_samples, std, inner_std):
    # YOUR CODE HERE
    raise NotImplementedError()

def get_dataset_regression(n_samples, std, inner_std):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

and when running

X, y = get_dataset_classification(n_samples=600, std=30, inner_std=5)

plot(X,y)

you should get something like

Question 4 [marks 1]

Make a function classification_error(targets, preds) to compute the fraction of

times that the entries in targets do not agree with the corresponding entries in preds .

Note: do not use library functions to compute the result directly but implement your own

version.
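A minimal sketch of the quantity described here, counting disagreements by hand rather than calling a library metric. The function name mismatch_fraction is hypothetical and is not the name required by the notebook.

```python
def mismatch_fraction(targets, preds):
    """Hypothetical name: fraction of positions where targets and preds differ."""
    if len(targets) != len(preds):
        raise ValueError("targets and preds must have the same length")
    # Count the positions at which the two sequences disagree.
    n_wrong = sum(1 for t, p in zip(targets, preds) if t != p)
    return n_wrong / len(targets)

# Two of the four entries disagree.
print(mismatch_fraction([1, -1, 1, 1], [1, 1, 1, -1]))  # 0.5
```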

Question 5 [marks 2]

Make a function regression_error(targets, preds) to compute the mean squared error

between targets and preds .

Note: do not use library functions to compute the result directly but implement your own

version.
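As a small worked example of the mean squared error, the sketch below averages the squared differences between targets and predictions explicitly. The function name mean_squared_error_sketch is hypothetical.

```python
def mean_squared_error_sketch(targets, preds):
    """Hypothetical name: average of the squared differences (T_i - P_i)^2."""
    if len(targets) != len(preds):
        raise ValueError("targets and preds must have the same length")
    n = len(targets)
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / n

# Squared differences are 0, 1 and 4, so the result is 5 / 3.
print(mean_squared_error_sketch([1.0, 2.0, 3.0], [1.0, 1.0, 5.0]))
```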

Question 6 [marks 7]

Make a function make_bootstrap(data_matrix, targets) to extract a bootstrapped

replicate of an input dataset.

In [ ]:

def plot(X,y):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def classification_error(targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (T_i - P_i)^2$$

In [ ]:

def regression_error(targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

The function should return the following 6 elements (in this order):

bootstrap_data_matrix, bootstrap_targets, bootstrap_sample_ids,

oob_data_matrix, oob_targets, oob_samples_ids , where:

  • bootstrap_data_matrix : is a data matrix encoding the bootstrapped replicate of the data matrix
  • bootstrap_targets : is the corresponding bootstrapped replicate of the target vector
  • bootstrap_sample_ids : is an array containing the instance indices of the bootstrapped replicate of the data matrix
  • oob_data_matrix : is a data matrix encoding the out of bag instances
  • oob_targets : is the corresponding out of bag instances of the target vector
  • oob_samples_ids : is an array containing the instance indices of the out of bag instances
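One possible reading of the six return values is sketched here: sample n indices with replacement for the bootstrap replicate, and treat every index that was never drawn as out of bag. The name bootstrap_split and the random_seed argument are assumptions added for illustration; the required function takes only data_matrix and targets.

```python
import numpy as np

def bootstrap_split(data_matrix, targets, random_seed=None):
    """Hypothetical helper returning a bootstrap replicate and its out-of-bag part."""
    data_matrix = np.asarray(data_matrix)
    targets = np.asarray(targets)
    n = data_matrix.shape[0]
    rng = np.random.RandomState(random_seed)
    # Draw n instance indices with replacement.
    boot_ids = rng.randint(0, n, size=n)
    # Out-of-bag instances are those whose index was never drawn.
    oob_ids = np.setdiff1d(np.arange(n), boot_ids)
    return (data_matrix[boot_ids], targets[boot_ids], boot_ids,
            data_matrix[oob_ids], targets[oob_ids], oob_ids)

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
bX, by, b_ids, oX, oy, o_ids = bootstrap_split(X, y, random_seed=0)
print(bX.shape, oX.shape)
```

The bootstrap replicate always has as many rows as the input, while the number of out-of-bag rows varies from draw to draw.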

Question 7 [marks 10]

Consider the following functional blueprints: estimator = train(X_train, y_train, param)
and test(X_test, estimator) . A function of type train takes as input a data matrix
X_train , a target vector y_train and a single value param (not a list of parameters).
A function of type train outputs an object that represents an estimator. A function of type
test takes as input a data matrix X_test and the fitted object estimator , and outputs
the predicted targets.

Using this blueprint, write the specialised train and test functions for the following classifiers

and regressors (use the function signature provided in the next cell, e.g. train_ab for training

an adaboost classifier):

Classifiers:

a) k-nearest-neighbor: the parameter controls the number of neighbors (you may use

KNeighborsClassifier from scikit) [train_knn, test_knn]

b) adaboost: the parameter controls the maximal depth of the decision tree used as weak

classifier (you may use the DecisionTreeClassifier from scikit but you should provide your

own implementation of the boosting algorithm) [train_ab, test_ab]

c) random forest: the parameter controls the maximal depth of the tree (you may use the

DecisionTreeClassifier from scikit but you should provide your own implementation of

the bagging algorithm) [train_rfc, test_rfc]

Regressors:

In [ ]:

def make_bootstrap(data_matrix, targets):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

d) decision tree: the parameter controls the maximal depth of the tree (you may use the

DecisionTreeRegressor from scikit) [train_dt, test_dt]

e) svm linear: the parameter controls the regularization constant C (you may use SVR from

scikit) [train_svm_1, test_svm]

f) svm with a polynomial kernel of degree 2: the parameter controls the regularization

constant C (you may use SVR from scikit) [train_svm_2, test_svm]

g) svm with a polynomial kernel of degree 3: the parameter controls the regularization

constant C (you may use SVR from scikit) [train_svm_3, test_svm]

h) random forest: the parameter controls the maximal depth of the tree (you may use the

DecisionTreeRegressor from scikit but you should provide your own implementation of

the bagging algorithm) [train_rf, test_rf]

For the algorithms adaboost and random forest , the size of the ensemble should be fixed

to 100.

In [ ]:

# classifiers

from sklearn.neighbors import KNeighborsClassifier

def train_knn(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

def test_knn(X_test, est):
    # YOUR CODE HERE
    raise NotImplementedError()

from sklearn.tree import DecisionTreeClassifier

def train_ab(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

def test_ab(X_test, models):
    # YOUR CODE HERE
    raise NotImplementedError()

from sklearn.tree import DecisionTreeClassifier

def train_rfc(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

def test_rfc(X_test, models):
    # YOUR CODE HERE
    raise NotImplementedError()

# regressors

from sklearn.tree import DecisionTreeRegressor

def train_dt(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

def test_dt(X_test, est):

Question 8 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all

functions. Since this cell uses functions that you have already implemented and that have

already been marked, this Question is not going to be marked.

Make a dataset using

X, y = get_dataset_classification(n_samples=240, std=30, inner_std=10)

    # YOUR CODE HERE
    raise NotImplementedError()

from sklearn.svm import SVR

def train_svm_1(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

def train_svm_2(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

def train_svm_3(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

#Note: you do not need to specialise the svm test function for each degree

def test_svm(X_test, est):
    # YOUR CODE HERE
    raise NotImplementedError()

from sklearn.tree import DecisionTreeRegressor

def train_rf(X_train, y_train, param):
    # YOUR CODE HERE
    raise NotImplementedError()

def test_rf(X_test, models):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

and check that the classification error for

k-nearest-neighbor

random forest classifier

adaboost

is approximately comparable.

Question 9 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all

functions. Since this cell uses functions that you have already implemented and that have

already been marked, this Question is not going to be marked.

Make a dataset using

X, y = get_dataset_regression(n_samples=120, std=30, inner_std=10)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

and check that the regression error for these regressors

decision tree

svm with polynomial kernel of degree 2

svm with polynomial kernel of degree 3

is approximately comparable.

Question 10 [marks 10]

In [ ]:

# Just run the following code, do not modify it

X, y = get_dataset_classification(n_samples=240, std=30, inner_std=10)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

param=3

e_knn = classification_error(y_test, test_knn(X_test, train_knn(X_train, y_train

e_rfc = classification_error(y_test, test_rfc(X_test, train_rfc(X_train, y_train

e_ab = classification_error(y_test, test_ab(X_test, train_ab(X_train, y_train, p

print(e_knn, e_rfc, e_ab)

In [ ]:

# Just run the following code, do not modify it

X, y = get_dataset_regression(n_samples=120, std=30, inner_std=10)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

param=3

e_dt = regression_error(y_test, test_dt(X_test, train_dt(X_train, y_train, param

e_svm2 = regression_error(y_test, test_svm(X_test, train_svm_2(X_train, y_train,

e_svm3 = regression_error(y_test, test_svm(X_test, train_svm_3(X_train, y_train,

print(e_dt, e_svm2, e_svm3)

Make a function sizes, train_errors, test_errors =

compute_learning_curve(train_func, test_func, param, X, y, test_size,

n_steps, n_repetitions) to compute the train and test errors as mandated in the learning

curve approach.

The regressor will be trained via train_func on the problem data_matrix , targets with

parameter param . The estimate will be done averaging a number of replicates equal to

n_repetitions , i.e. the code needs to repeat the process n_repetitions times (say 10)

and average the error.

Note that a fraction of the data as indicated by test_size (say 0.3 for 30%) is going to be

reserved for testing purposes. The remaining amount of data can be used in the training phase.

The learning curve should be computed for an amount of training material that varies from a

minimum of 2 instances up to all the instances available for training.

You should use the function regression_error to compute the error.

Note: do not use library functions (e.g. learning_curve in scikit) to compute the result

directly but implement your own version.

Question 11 [marks 1]

Make a function plot_learning_curve(sizes, train_errors, test_errors) to

display the train and test error as a function of the size of the training set.

You should get something like:

Question 12 [marks 3]

Make a function estimate_asymptotic_error(sizes, train_errors, test_errors)

that returns an estimate of the asymptotic error, i.e. the error made in the limit of an infinitely

large training set.

In [ ]:

def compute_learning_curve(train_func, test_func, param, X, y, test_size, n_step
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def plot_learning_curve(sizes, train_errors, test_errors):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

Question 13 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all

functions. Since this cell uses functions that you have already implemented and that have

already been marked, this Question is not going to be marked.

When you run:

X, y = get_dataset_regression(n_samples=800, std=30, inner_std=10)

train_func, test_func = train_dt, test_dt

param=5

sizes, train_errors, test_errors = compute_learning_curve(train_func,

test_func, param, X, y, test_size=.3, n_steps=10, n_repetitions=100)

e = estimate_asymptotic_error(sizes, train_errors, test_errors)

print('Asymptotic error: %.1f'%e)

plot_learning_curve(sizes, train_errors, test_errors)

you should get something like

Question 14 [marks 6]

Make a function bias2, variance = compute_bias_variance(predictions_dict,

targets) that takes as input a dictionary of lists of predictions indexed by the instance index,
and the target vector. The function should compute the squared bias component of the error
and the variance component of the error for each instance.

As a toy example consider: predictions_dict={0:[1,1,1], 1:[1,-1], 2:

[-1,-1,-1,1]} and targets=[1,1,-1] , that is, for instance with index 0 there are 3

predictions available [1,1,1] , instead for instance with index 1 there are only 2 predictions

available [1,-1] , etc. In this case, you should get bias2=[0. , 1. , 0.25] and

variance=[0. , 1. , 0.75] .
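The toy numbers can be reproduced with the following sketch. It assumes that the squared bias is the squared difference between the mean prediction and the target, and that the variance is the mean squared deviation of the predictions from their own mean; this reading is inferred from the example values rather than stated explicitly. The function name bias_variance_per_instance is hypothetical.

```python
import numpy as np

def bias_variance_per_instance(predictions_dict, targets):
    """Hypothetical helper: per-instance squared bias and variance."""
    bias2, variance = [], []
    for idx in sorted(predictions_dict):
        preds = np.asarray(predictions_dict[idx], dtype=float)
        mean_pred = preds.mean()
        # Squared bias: squared gap between the average prediction and the target.
        bias2.append((mean_pred - targets[idx]) ** 2)
        # Variance: average squared deviation of the predictions from their mean.
        variance.append(np.mean((preds - mean_pred) ** 2))
    return np.array(bias2), np.array(variance)

predictions_dict = {0: [1, 1, 1], 1: [1, -1], 2: [-1, -1, -1, 1]}
targets = [1, 1, -1]
b, v = bias_variance_per_instance(predictions_dict, targets)
print(b, v)  # [0.   1.   0.25] [0.   1.   0.75]
```

For instance 2 the mean prediction is -0.5, so the squared bias is (-0.5 + 1)^2 = 0.25 and the variance is (3 * 0.25 + 2.25) / 4 = 0.75.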

In [ ]:

def estimate_asymptotic_error(sizes, train_errors, test_errors):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# Just run the following code, do not modify it

X, y = get_dataset_regression(n_samples=800, std=30, inner_std=10)

train_func, test_func = train_dt, test_dt

param=5

sizes, train_errors, test_errors = compute_learning_curve(train_func, test_func,

e = estimate_asymptotic_error(sizes, train_errors, test_errors)

print('Asymptotic error: %.1f'%e)

plot_learning_curve(sizes, train_errors, test_errors)

In [ ]:

def compute_bias_variance(predictions_dict, targets):
    # YOUR CODE HERE
    raise NotImplementedError()

Question 15 [marks 10]

Make a function bias2, variance = bias_variance_decomposition(train_func,

test_func, param, data_matrix, targets, n_bootstraps) to compute the bias

variance decomposition of the error of a regressor on a given problem. The regressor will be

trained via train_func on the problem data_matrix , targets with parameter param .

The estimate will be done using a number of replicates equal to n_bootstraps .

Question 16 [marks 2]

Consider the following regression problem (it does not matter that the target is only 1 and -1):

from sklearn.datasets import load_iris

def make_iris_data():
    X,y = load_iris(return_X_y=True)
    X=X[:,[0,2]]
    y[y==2]=0
    y[y==0]=-1
    return X,y

Estimate the squared bias and variance component for each instance.

Consider as regressor a linear svm and a polynomial svm with degree 3.

What is the class of the instances that have the highest bias error on average?

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def bias_variance_decomposition(train_func, test_func, param, data_matrix, targe
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# Just run the following code, do not modify it

from sklearn.datasets import load_iris

def make_iris_data():
    X,y = load_iris(return_X_y=True)
    X=X[:,[0,2]]
    y[y==2]=0
    y[y==0]=-1
    return X,y

X,y = make_iris_data()

bias2, variance = bias_variance_decomposition(train_svm_1, test_svm, param=2, da

print(np.mean(bias2[y==1]) , np.mean(bias2[y==-1]))

bias2, variance = bias_variance_decomposition(train_svm_3, test_svm, param=2, da

print(np.mean(bias2[y==1]) , np.mean(bias2[y==-1]))

Question 17 [marks 6]

Make a function bs,vs = compute_bias_variance_decomposition(train_func,

test_func, params, data_matrix, targets, n_bootstraps) to compute the average

squared bias error component and the average variance component of the error for each

parameter setting in the vector params . The regressor will be trained via train_func on the

problem data_matrix , targets with parameter param . The estimate will be done using a

number of replicates equal to n_bootstraps . To be clear, the vector bs contains the

average square bias error for each parameter in params and the vector vs contains the

average variance error for each parameter in params .

Question 18 [marks 1]

Make a function plot_bias_variance_decomposition(train_func, test_func,

params, data_matrix, targets, n_bootstraps, logscale=False) .

You should plot the individual components, i.e. the squared bias and the variance, and the total error.

You should allow the possibility to employ a logarithmic scale for the horizontal axis via the

logscale flag.

You should get something like:

Question 19 [marks 2]

Make a function find_best_param_with_bias_variance_decomposition(train_func,

test_func, params, data_matrix, targets, n_bootstraps) that uses the bias

variance decomposition analysis to determine which parameter among params achieves the

smallest estimated predictive error.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def compute_bias_variance_decomposition(train_func, test_func, params, data_matr
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def plot_bias_variance_decomposition(train_func, test_func, params, data_matrix,
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def find_best_param_with_bias_variance_decomposition(train_func, test_func, para
    # YOUR CODE HERE
    raise NotImplementedError()

Question 20 [marks 6]

When you execute the following code

data_matrix, targets = get_dataset_regression(n_samples=400, std=10, inner_std=7)

params = np.linspace(1,30,30).astype(int)

train_func, test_func = train_dt, test_dt

p = find_best_param_with_bias_variance_decomposition(train_func,

test_func, params, data_matrix, targets, n_bootstraps=60)

print('Best parameter:%s'%p)

plot_bias_variance_decomposition(train_func, test_func, params,

data_matrix, targets, n_bootstraps=50, logscale=False)

You should get something like:

The next unit tests will run your function

find_best_param_with_bias_variance_decomposition on an undisclosed dataset

using as regressors:

decision tree

svm degree 3

and 3 marks will be awarded for each correct optimal parameter identified.

Question 21 [marks 5]

Make a function conf_mtx = confusion_table(targets, preds) to output the

confusion matrix as a 2 x 2 NumPy array. Rows indicate the prediction and columns the target.

The cell element with index [0,0] should report the true positive count.

Running the following code:

from sklearn.datasets import load_iris

X,y = load_iris(return_X_y=True)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

models = train_knn(X_train, y_train, param=3)

preds = test_knn(X_test, models)

conf_mtx = confusion_table(y_test, preds)

print(conf_mtx)

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

you should obtain something similar to

[[16. 1.]

[ 0. 28.]]

Note: the exact values can differ in your run

Note: do not use library functions to compute the result directly but implement your own

version.
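The layout described for the confusion matrix (rows for predictions, columns for targets, true positives at [0,0]) can be sketched for a binary problem as follows. The name confusion_table_sketch and the positive_label argument are assumptions; the text does not say how the positive class is identified.

```python
import numpy as np

def confusion_table_sketch(targets, preds, positive_label=1):
    """Hypothetical helper: 2 x 2 table laid out as [[TP, FP], [FN, TN]]."""
    table = np.zeros((2, 2))
    for t, p in zip(targets, preds):
        row = 0 if p == positive_label else 1  # rows index the prediction
        col = 0 if t == positive_label else 1  # columns index the target
        table[row, col] += 1
    return table

targets = [1, 1, 1, -1, -1, -1, -1]
preds = [1, 1, -1, -1, -1, 1, -1]
table = confusion_table_sketch(targets, preds)
print(table)
# [[2. 1.]
#  [1. 3.]]
```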

Question 22 [marks 1]

Make a function error_from_confusion_table(confusion_table_func, targets,

preds) that takes as input the previous confusion_table function and returns the error, i.e.

the fraction of predictions that do not agree with the targets.
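Since the diagonal of the confusion matrix holds the agreeing predictions, the error is the off-diagonal mass divided by the total count. A short sketch, with the hypothetical names error_from_table and fixed_table (a stand-in that simply returns the example matrix shown for Question 21):

```python
import numpy as np

def error_from_table(confusion_table_func, targets, preds):
    """Hypothetical helper: fraction of counts lying off the diagonal."""
    table = np.asarray(confusion_table_func(targets, preds), dtype=float)
    return (table.sum() - np.trace(table)) / table.sum()

# Stand-in for a real confusion_table function: ignores its inputs and
# returns a fixed 2 x 2 table.
def fixed_table(targets, preds):
    return np.array([[16.0, 1.0], [0.0, 28.0]])

err = error_from_table(fixed_table, None, None)
print(err)  # 1 wrong prediction out of 45
```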

Question 23 [marks 12]

Make a function predictions, out_targets =

cross_validation_prediction(train_func, test_func, param, data_matrix,

targets, kfold) that estimates the predictions of a classifier trained via the function

train_func with parameter param on the problem data_matrix, targets using a kfold

cross validation strategy with the number of folds indicated by kfold .

Since the order of the instances associated with the predictions can be different from the original

order, the function is required to output also the corresponding target values in the array

out_targets (i.e. the value in position 10 in predictions corresponds to the target value

in position 10 in out_targets )

Note: do not use library functions (such as KFold or StratifiedKFold ) but implement

your own version of the cross validation.

In [ ]:

def confusion_table(targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def error_from_confusion_table(confusion_table_func, targets, preds):
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def cross_validation_prediction(train_func, test_func, param, data_matrix, targe
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 24 [marks 5]

Make a function mean_errors =

compute_errors_with_crossvalidation(train_func, test_func, params,

data_matrix, targets, kfold, n_repetitions) that returns the estimated average

error for each parameter in params . The classifier is trained via the function train_func

with parameters taken from params on the problem data_matrix, targets using a k-fold

cross validation strategy with the number of folds indicated by kfold . The error estimate is

repeated a number of times indicated in n_repetitions . The error should be computed

using the function error_from_confusion_table . The output vector mean_errors has

as many entries as there are parameters in params .

Note: do not use library functions (such as cross_val_score ) but implement your own

version of the code.

Question 25 [marks 2]

Make a function find_best_param_with_crossvalidation(train_func, test_func,

params, data_matrix, targets, kfold, n_repetitions) that uses cross validation to

determine which parameter among params achieves the smallest estimated predictive error.

Question 26 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all

functions. Since this cell uses functions that you have already implemented and that have

already been marked, this Question is not going to be marked.

You should be able to run the following code:

from sklearn.datasets import load_wine

data_matrix, targets = load_wine(return_X_y=True)

params = [3,5,7,9,11]

train_func, test_func = train_knn, test_knn

kfold = 5

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def compute_errors_with_crossvalidation(train_func, test_func, params, data_matr
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

In [ ]:

def find_best_param_with_crossvalidation(train_func, test_func, params, data_mat
    # YOUR CODE HERE
    raise NotImplementedError()

In [ ]:

# This cell is reserved for the unit tests. Do not consider this cell.

n_repetitions = 5

best_param = find_best_param_with_crossvalidation(train_func, test_func,

params, data_matrix, targets, kfold, n_repetitions)

print(best_param)

and get a value around 3.

In [ ]:

# Just run the following code, do not modify it

from sklearn.datasets import load_wine

data_matrix, targets = load_wine(return_X_y=True)

params = [3,5,7,9,11]

train_func, test_func = train_knn, test_knn

kfold = 5

n_repetitions = 5

best_param = find_best_param_with_crossvalidation(train_func, test_func, params,

print(best_param)

