Using Machine Learning Tools 2020
Assignment 2: Breast Cancer Classification
Overview
In this assignment, you will apply two classifier types, Decision Trees and Support Vector Machines, to
the problem of classifying breast cancer from a set of characteristics of the cell nuclei in an image of a
fine needle aspirate of a breast mass. The Wisconsin Breast Cancer data set is available from the
collection of example data sets in scikit-learn.
The main aims of the assignment are:
To use and compare two different classifier approaches on the same data set;
To evaluate the classifiers and their structure in a white box fashion;
To practise using pipelines and hyperparameter optimisation;
To explore a multi-dimensional feature space and handle multi-dimensional data.
This assignment relates to the following ACS CBOK areas: abstraction, design, hardware and software,
data and information, HCI and programming.
Instructions
While you are free to use whatever IDE you like to develop your code, your submission should be
formatted as a Jupyter notebook that interleaves Python code with output, commentary and analysis.
Your code must use the current stable versions of the Python libraries, not outdated versions!
All data processing must be done within the notebook after calling the load function.
Comment your code, so that its purpose is clear to the reader!
Before submitting your notebook, make sure to run all cells in your final notebook so that it
works correctly!
This assignment is divided into several tasks. Use the template notebook, which uses the same
numbering as in this document (e.g. “3.2 Display decision tree”), and enter your code, results and
written analysis under the exact number to which they belong!
Make sure to answer every question with separate answer text (“Answer: …”) in a Markdown cell
and check that you answered all sub-questions/aspects within the question. The text answers are
worth points!
Make the figures self-explanatory and unambiguous. Always include axis labels (with units where
applicable), unique colours and markers for each curve/type of data, a legend and a title. Give every
figure a number (e.g. at the start of the title), so that it can be referred to from different parts of the
text/notebook.
This is also worth points!
1 Understand the dataset (15%)
Load the Wisconsin breast cancer data set from the scikit-learn sample data collection. Read the full
documentation: https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset. Figure 1
below shows an example image of a tissue sample obtained by fine needle aspiration. A more detailed
description of the parameters can be found here:
https://minds.wisconsin.edu/bitstream/1793/59692/1/TR1131.pdf
Figure 1. Example of a fine needle aspiration tissue image showing cells and cell nuclei (dark)
(Shigematsu et al. 2011, Creative Commons 2.0 License)
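As a starting point, a minimal loading sketch, assuming the standard scikit-learn dataset API (note that in this encoding, label 0 is malignant and label 1 is benign):

    from sklearn.datasets import load_breast_cancer

    # Load the Wisconsin breast cancer data set bundled with scikit-learn.
    data = load_breast_cancer()
    X, y = data.data, data.target    # (569, 30) feature matrix, binary labels
    print(data.feature_names)        # 30 names: mean/error/worst of the 10 parameters
    print(data.target_names)         # ['malignant' 'benign'] -> labels 0 and 1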
1.1 Question: Briefly describe what each of the 10 parameters of the cell nuclei means, using the
documentation of the dataset and the example image in Figure 1. What could be the reasons
for using the mean, standard error and worst (largest) value of each of the 10 parameters?
1.2 Plot histograms of each of the 30 features, using two distributions, one for each class, in each
diagram. Use 3 figures with 10 subplots each.
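One possible layout, as a sketch: in the scikit-learn feature order, indices 0–9 are the means, 10–19 the standard errors and 20–29 the worst values, so each group of 10 maps onto one figure (the figure numbers here are placeholders):

    import matplotlib.pyplot as plt

    for g, group in enumerate(['mean', 'standard error', 'worst']):
        fig, axes = plt.subplots(2, 5, figsize=(16, 6))
        fig.suptitle(f'Figure {g + 2}. Class histograms of the {group} features')
        for i, ax in enumerate(axes.ravel()):
            f = g * 10 + i                            # feature index 0..29
            ax.hist(X[y == 0, f], bins=20, alpha=0.5, color='red', label='malignant')
            ax.hist(X[y == 1, f], bins=20, alpha=0.5, color='green', label='benign')
            ax.set_xlabel(data.feature_names[f])
            ax.set_ylabel('count')
            ax.legend(fontsize='x-small')
        fig.tight_layout()
        plt.show()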
1.3 Plot receiver-operating-characteristic (ROC) curves of the individual features into 3 figures, one
figure for each of the groups of 10.
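A sketch using each raw feature value as a classification score, with the malignant class (label 0) treated as the positive class; features where higher values indicate benign will dip below the diagonal unless the score sign is flipped:

    from sklearn.metrics import auc, roc_curve
    import matplotlib.pyplot as plt

    for g, group in enumerate(['mean', 'standard error', 'worst']):
        fig, ax = plt.subplots(figsize=(7, 7))
        for i in range(10):
            f = g * 10 + i
            fpr, tpr, _ = roc_curve(y == 0, X[:, f])   # malignant as positive class
            ax.plot(fpr, tpr, label=f'{data.feature_names[f]} (AUC {auc(fpr, tpr):.2f})')
        ax.plot([0, 1], [0, 1], 'k--', label='chance')
        ax.set_xlabel('false positive rate')
        ax.set_ylabel('true positive rate')
        ax.set_title(f'Figure {g + 5}. Per-feature ROC curves ({group} features)')
        ax.legend(fontsize='x-small')
        plt.show()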
1.4 Question: Which of the parameters seem promising based on the histograms and ROC curves?
Justify your choice, referring to the particular features in the figures that indicate a good
separation. Choose your top five candidate features.
1.5 Analysis Point: Calculate the mean of all instances of the malignant class (centre of mass in the
high-dimensional feature space) and the mean of all instances of the benign class. Save the
midpoint between those two means as the “Analysis Point”. It is a point in feature space that
lies approximately between both classes.
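A minimal sketch of this calculation (the variable name analysis_point is illustrative):

    import numpy as np

    malignant_mean = X[y == 0].mean(axis=0)             # centre of mass, malignant class
    benign_mean = X[y == 1].mean(axis=0)                 # centre of mass, benign class
    analysis_point = (malignant_mean + benign_mean) / 2  # 30-dimensional Analysis Point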
2 Train a decision tree classifier (15%)
2.1 Construct a decision tree classifier using the gini criterion and random_state=0. Below, you will
perform a hyperparameter search over max_depth and min_samples_leaf. Check the following
remaining parameters of the classifier and either keep the default value or select a different
value: min_samples_split, min_weight_fraction_leaf, max_features, max_leaf_nodes,
min_impurity_decrease, min_impurity_split and class_weight. Question: Describe each choice
briefly in one sentence.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
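A minimal construction sketch, with the remaining parameters written out explicitly at their scikit-learn defaults (min_impurity_split is deprecated in recent scikit-learn releases and is therefore omitted here); whether to keep or change each value is for you to justify:

    from sklearn.tree import DecisionTreeClassifier

    # Defaults shown explicitly so each choice can be discussed in the answer text.
    tree_clf = DecisionTreeClassifier(criterion='gini', random_state=0,
                                      min_samples_split=2,
                                      min_weight_fraction_leaf=0.0,
                                      max_features=None,
                                      max_leaf_nodes=None,
                                      min_impurity_decrease=0.0,
                                      class_weight=None)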
2.2 Build a pipeline including any pre-processing steps that you think are necessary. Question: Do
the data need to be scaled for decision tree classification? Are the different class sizes a
problem, and if so what are you doing about it?
2.3 Perform a grid search using five-fold cross validation over values of the maximum depth
(max_depth) and the minimum number of samples per leaf (min_samples_leaf). Choose the
value range yourself. Question: What is the rationale for your choice?
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
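A sketch combining 2.2 and 2.3. The hold-out split and the grid value ranges below are illustrative assumptions, not prescribed values; choose and justify your own:

    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline

    # Illustrative stratified hold-out split for later evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    # Decision trees are insensitive to monotonic feature scaling, so the
    # pipeline here contains no scaler; add steps as you see fit.
    tree_pipe = Pipeline([('tree', tree_clf)])
    tree_grid = {'tree__max_depth': range(1, 11),
                 'tree__min_samples_leaf': range(1, 21)}
    tree_search = GridSearchCV(tree_pipe, tree_grid, cv=5)
    tree_search.fit(X_train, y_train)
    print(tree_search.best_params_, tree_search.best_score_)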
3 Evaluate the decision tree classifier (20%)
3.1 Calculate the confusion matrix, precision and recall of the final classifier. Question: Based on
these metrics, what is the chance of failing to detect a sample with cancer? What are the
strengths and weaknesses of the classifier?
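A sketch assuming the tuned pipeline and hold-out split from Section 2; the malignant class (label 0) is treated as the positive class, since failing to detect cancer means misclassifying a malignant sample:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    y_pred = tree_search.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print('precision:', precision_score(y_test, y_pred, pos_label=0))
    print('recall:   ', recall_score(y_test, y_pred, pos_label=0))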
3.2 Display the decision tree using plot_tree(). Question: Describe the structure. What does each of
the entries in the first node mean? Do the features in the decision tree match the initial
candidate features from Section 1?
https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html
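A display sketch; best_estimator_ is the refitted pipeline, from which the tree step is extracted:

    from sklearn.tree import plot_tree
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(16, 8))
    plot_tree(tree_search.best_estimator_.named_steps['tree'],
              feature_names=data.feature_names,
              class_names=data.target_names,
              filled=True, ax=ax)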
3.3 Display the decision boundaries (use function predict()) together with a scatter plot of the data
using two features at a time.
Select the five most important features from the decision tree attribute
“clf.feature_importances_”. Make a figure with 4x4 subplots, where each row and column is
one of the features and the subplots are the 2D displays using the corresponding features.
For the decision boundary, use code that samples the prediction value in the plane of the
respective two features (see example below) at the Analysis Point. This means that all other
feature dimensions stay constant at the values of the Analysis Point. Use a margin of 30% of
each feature value range (max-min) around the data cloud. Use the contourf() function like
in the example and its parameters “levels” and “colors”. It is highly recommended to
program this display as a function for later reuse. Example:
https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html#sphx-glr-auto-examples-svm-plot-iris-svc-py
Plot the Analysis Point with a blue “+” marker in each diagram.
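A sketch of such a reusable function and of the 4x4 grid; the helper name plot_boundary_2d and its use_decision_function flag (which anticipates the reuse in Section 5.2) are illustrative. The layout fills the lower triangle of the grid, so each of the 10 unique feature pairs appears once:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_boundary_2d(ax, model, X, y, fi, fj, point, n=200,
                         use_decision_function=False):
        # Sample the model in the plane of features fi and fj, holding all
        # other feature dimensions fixed at the Analysis Point.
        mi = 0.3 * (X[:, fi].max() - X[:, fi].min())   # 30% margin of each range
        mj = 0.3 * (X[:, fj].max() - X[:, fj].min())
        gi, gj = np.meshgrid(
            np.linspace(X[:, fi].min() - mi, X[:, fi].max() + mi, n),
            np.linspace(X[:, fj].min() - mj, X[:, fj].max() + mj, n))
        grid = np.tile(point, (gi.size, 1))            # all dims at Analysis Point
        grid[:, fi], grid[:, fj] = gi.ravel(), gj.ravel()
        if use_decision_function:                      # Section 5.2 variant
            zz = model.decision_function(grid).reshape(gi.shape)
            ax.contourf(gi, gj, zz, cmap=plt.cm.RdBu, alpha=0.6)
        else:                                          # Section 3.3 variant
            zz = model.predict(grid).reshape(gi.shape)
            ax.contourf(gi, gj, zz, levels=[-0.5, 0.5, 1.5],
                        colors=['red', 'green'], alpha=0.4)
        ax.scatter(X[:, fi], X[:, fj], c=y, cmap=plt.cm.RdYlGn,
                   s=8, edgecolors='k', linewidths=0.2)
        ax.plot(point[fi], point[fj], 'b+', markersize=12)   # Analysis Point
        ax.set_xlabel(data.feature_names[fi])
        ax.set_ylabel(data.feature_names[fj])

    # Top five features by importance, then the 4x4 lower-triangle pair grid.
    importances = tree_search.best_estimator_.named_steps['tree'].feature_importances_
    top5 = np.argsort(importances)[::-1][:5]
    fig, axes = plt.subplots(4, 4, figsize=(16, 16))
    for r in range(4):
        for c in range(4):
            if c <= r:
                plot_boundary_2d(axes[r, c], tree_search.best_estimator_,
                                 X, y, top5[c], top5[r + 1], analysis_point)
            else:
                axes[r, c].axis('off')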
3.4 Question: Is the class differentiation well characterised by the node thresholds, or is the tree
modelling the boundary with a rigid, staircase-like pattern? Why do a few of the 2D scatterplots
show only one class as the prediction contour?
4 Train a support vector classifier with RBF kernel (15%)
4.1 Construct a support vector classifier with a radial basis function kernel. Below, you will perform
a hyperparameter search over C and gamma. Check the following remaining parameters of the
classifier and either keep the default value or select a different value: tol, class_weight and
max_iter. Question: Describe each choice briefly in one sentence.
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
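A minimal construction sketch, with tol, class_weight and max_iter written out at their scikit-learn defaults; whether to keep or change each value is for you to justify:

    from sklearn.svm import SVC

    # Defaults shown explicitly so each choice can be discussed in the answer text.
    svc_clf = SVC(kernel='rbf', tol=1e-3, class_weight=None, max_iter=-1)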
4.2 Build a pipeline including any pre-processing steps that you think are necessary. Question: Do
the data need to be scaled for support vector classification? Are the different class sizes a
problem, and if so what are you doing about it?
4.3 Perform a grid search using five-fold cross validation over values of the regularisation parameter
C and the kernel coefficient gamma. Choose the value ranges yourself. Question: What is the
rationale for your choice?
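A sketch combining 4.2 and 4.3, reusing the hold-out split from Section 2; the logarithmic grids below are illustrative assumptions, not prescribed ranges:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # SVMs are sensitive to feature scale, so a scaler precedes the classifier.
    svc_pipe = Pipeline([('scale', StandardScaler()), ('svc', svc_clf)])
    svc_grid = {'svc__C': np.logspace(-2, 3, 6),
                'svc__gamma': np.logspace(-4, 1, 6)}
    svc_search = GridSearchCV(svc_pipe, svc_grid, cv=5)
    svc_search.fit(X_train, y_train)
    print(svc_search.best_params_, svc_search.best_score_)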
5 Evaluate the support vector classifier (20%)
5.1 Calculate the confusion matrix, precision and recall of the final classifier. Question: Based on
these metrics, what is the chance of failing to detect a sample with cancer? What are the
strengths and weaknesses of this classifier?
5.2 Display the decision boundary (use function decision_function()) together with a scatter plot of
the data using the same features and figure layout as in the decision tree display for direct
comparability. This time, use a suitable colormap (parameter “cmap”) in the contourf()
function. Mark the support vectors.
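A sketch reusing the plotting function from Section 3.3 with use_decision_function=True. Note that support_vectors_ lives in the scaled space when a scaler is part of the pipeline, so the vectors are transformed back to the original feature units before plotting:

    import matplotlib.pyplot as plt

    best = svc_search.best_estimator_
    sv = best.named_steps['scale'].inverse_transform(
        best.named_steps['svc'].support_vectors_)
    fig, axes = plt.subplots(4, 4, figsize=(16, 16))
    for r in range(4):
        for c in range(4):
            if c <= r:
                ax = axes[r, c]
                plot_boundary_2d(ax, best, X, y, top5[c], top5[r + 1],
                                 analysis_point, use_decision_function=True)
                ax.scatter(sv[:, top5[c]], sv[:, top5[r + 1]], s=60,
                           facecolors='none', edgecolors='k',
                           label='support vectors')
            else:
                axes[r, c].axis('off')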
5.3 Question: What is the meaning of the support vectors? Where can we see their purpose in the
diagrams?
6 Compare the classifiers and interpret (15%)
6.1 Question: Compare the classifier structures and decision boundaries of both classifiers. Point
out similarities and differences. How do the classifiers compare outside the areas of dense
sampling in the parameter space, e.g. towards the edges of the scatterplot (extrapolation)?
6.2 Question: Generalisability: Do you see sources of bias in the two classifiers? Are the models
showing any signs of overfitting (variance error)?
6.3 Question: Table 1 from Street et al. (1993) below shows the accuracies of their classifiers for
different numbers of features and different numbers of hyperplanes used. Compare the
number of features (decision tree), selection of features and accuracy of your classifiers with
this table. Is there only one good set of features, many different sets or is there a pattern of
similar feature combinations?
Assessment
Hand in your notebook as an .ipynb file via the MyUni page. Make sure your notebook includes your
code and formatted (Markdown) text blocks explaining what you have done. Your mark will be based
on both code correctness and the quality of your comments and analysis.
The assignment is worth 35% of your overall mark for the course. It is due on Monday May 11 at
11.59pm.
Stephan Lau
April 2020
References:
Shigematsu, H., Kadoya, T., Kobayashi, Y. et al. A case of HER-2-positive recurrent breast cancer
showing a clinically complete response to trastuzumab-containing chemotherapy after primary
treatment of triple-negative breast cancer. World J Surg Onc 9, 146 (2011).
Street, W.N., Wolberg, W.H. and Mangasarian, O.L. Nuclear feature extraction for breast tumor
diagnosis. IS&T/SPIE International Symposium on Electronic Imaging: Science and Technology,
vol. 1905, 861–870 (1993).