Resit Assignment – CSC3060 “AIDA”
Release date: Friday 5th August
Deadline: 11:00pm Sunday 11th August 2019.
This version: 2019-07-04.
Introduction
This assignment re-assesses key practical and theoretical learning outcomes from the CSC3060
module.
With the exception of Section 1, this assignment must be completed in R. Section 1 can be
completed in Python (recommended), Java or R. Convenient and commonly used machine learning
packages are available for R and Python, such as “class”, “caret” and “randomForest” (in the case of
R). When you use a procedure that has an element of randomness (e.g. creating cross-validation
folds) please use the seed value 3060 (your code should give the same results each time it runs).
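As an illustration of the seeding requirement, the following minimal sketch builds reproducible cross-validation folds. It is written in Python for illustration only (in R the analogous first step is a call to set.seed(3060)); the function name make_folds and the round-robin split are assumptions of this sketch, not part of the assignment.

```python
import random

SEED = 3060  # seed value required by the assignment


def make_folds(n_items, n_folds=5, seed=SEED):
    """Return n_folds lists of item indices, shuffled reproducibly."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the shuffle independent of other code.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


folds = make_folds(150)
```

Calling make_folds(150) again returns exactly the same folds, which is the behaviour the seed requirement asks for.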
Please read carefully the information about the assessment criteria and marking process at the end of
this document.
Section 1 (5%): Creating a dataset
This section asks you to build a dataset of images composed of handwritten numbers, letters, and
punctuation marks. Each image is represented by a black & white matrix with size 20 rows by 20
columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As
such, each image can be stored in a .CSV file containing the matrix (and no headers), as in these
examples:
[Example images for the classes a, b, 1 and 3]
Figure 1: Examples of handwritten images and their matrix representation.
The goal in Section 1 is to create a dataset containing 10 images of each of the digits {0, 1, 2, 3, 4}, 10
images of each of the letters {a, b, c, d, e} and 10 images of each of the punctuation symbols {;, ,, ?, !, :}, giving 150 images in total. Each image
should be obtained by processing a hand-written symbol (preferably with a touch screen, using the
lab computers, although it is fine if you create them using the computer mouse). The quality of the
drawing is not essential, as long as each symbol can easily be read by a human. The characters
should be drawn in proportional size to each other (e.g. the comma symbol should be drawn at a
size which is proportional to the letter symbols, as in the example below):
Figure 2: Examples of handwritten images for the letter “d” and the comma symbol, drawn at
proportional sizes.
The largest symbols should fit reasonably well in the 20x20 box (i.e. do not draw a tiny character in
one corner of the 20x20 box; this will make your life easier when it comes to doing analyses!). You
can reuse any suitable digits or letters you may have already created for the first lab assignment.
You may use whatever means you prefer to obtain the images and .csv files. However, a suggestion is
to use the software GIMP (http://www.gimp.org). Using GIMP, you can create a new image with 20
by 20 points (pt), advanced options 1 pixel/pt, color space grayscale, fill with background colour. This
will give you a small white square, which you can magnify to e.g. 2000% in order to make it easier to
draw on. To draw on the image, you can select the pencil tool and adjust the brush size to (e.g.) 1
pixel. The standard file formats of GIMP are useful to save the images, but we need a more easily
readable format. One good option is to export as PGM, type ASCII. In this format, each image becomes
a text file with a header consisting of the following four lines:
P2
# CREATOR: ...
20 20
255
The third and fourth lines of the header specify the pixel array size and the maximum allowed pixel
value, respectively. (The images are greyscale, with 0 representing fully black and 255 representing
fully white. For further information about this image format, see https://en.wikipedia.org/wiki/Netpbm_format.)
The remaining lines of the file specify the pixel values, with one value on each line; the total number
of pixel values should correspond to the specified array size (i.e. 20*20=400).
For our purposes, a number < 128 represents a black pixel, while a number >= 128 represents a white
one. Such a format can be easily converted into a matrix containing ones and zeros, as presented in
Figure 1 above. You shall save each image matrix as a comma-delimited csv file consisting of a 20x20
array of 1s and 0s, following the specification above. Use the filename STUDENTNR_LABEL_INDEX.csv,
where STUDENTNR is your student number (e.g. 123456), INDEX is a number from 01 to 10 (always
two digits, with zero-padding), indexing the set of 10 images you must create for each symbol, and
LABEL is a numeric code that uniquely identifies the type of symbol.
We will use the following codes to label the 15 different types of images:
Symbol Label
For example, if your student number is 123456, then 123456_25_10.csv would be the 10th image
you created for the letter ‘e’. (As well as creating the csv files, you may also want to keep the PGM
files, in case you need to inspect the data later on).
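The conversion described above (ASCII PGM to a matrix of 1s and 0s, then to a csv file) can be sketched as follows. This is a minimal sketch, assuming a maximum pixel value of 255 and a well-formed P2 file; the function names pgm_to_matrix, csv_filename and write_matrix_csv are hypothetical helpers introduced for illustration.

```python
import csv


def pgm_to_matrix(pgm_text, threshold=128):
    """Parse ASCII PGM (P2) text into rows of 1 (black) and 0 (white)."""
    tokens = []
    for line in pgm_text.splitlines():
        # Everything after '#' is a comment, e.g. the '# CREATOR: ...' line.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("not an ASCII PGM (P2) file")
    width, height = int(tokens[1]), int(tokens[2])
    values = [int(t) for t in tokens[4:]]  # tokens[3] is the maximum pixel value
    if len(values) != width * height:
        raise ValueError("pixel count does not match the declared array size")
    # Values below the threshold are black (1); all others are white (0).
    return [
        [1 if values[r * width + c] < threshold else 0 for c in range(width)]
        for r in range(height)
    ]


def csv_filename(student_nr, label, index):
    """Build STUDENTNR_LABEL_INDEX.csv with a zero-padded two-digit index."""
    return f"{student_nr}_{label}_{index:02d}.csv"


def write_matrix_csv(matrix, path):
    """Write the matrix as comma-delimited rows with no header."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(matrix)
```

For example, csv_filename("123456", 25, 10) gives the filename 123456_25_10.csv used in the example above.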
As part of your submission, upload the csv files that you create in a directory called “images”, along
with any code you wrote to create the csv files, in a directory called “code_section1” (see submission
instructions at the end of this document).
It is important to upload the images in the correct csv format as these files will be used to verify your
calculations in the subsequent sections.
In your report, briefly explain in your own words how you created the images and obtained the
matrices from them.
Section 2 (15%): Feature engineering
Using each 20x20 matrix obtained from an image as described above, you must create an array of
characteristics that describe some features of the image. Each feature will be a number (i.e. each
feature is a numeric variable). There are 14 features in total. In the feature definitions that follow, a
pixel has 8 neighbours, which can be referred to as follows:
Features to be calculated:
Feature Index   Feature Short Name   Feature Description
label The true symbol in the image (represented by one of the 15 LABEL codes).
Note that the label is not a true feature, and should not be used as a
feature for statistical tests or during model training.
index The index of this image instance (a number from 01 to 10).
1 nr_pix The number of black pixels in the image.
2 height Number of rows containing at least one black pixel
3 width Number of columns containing at least one black pixel
4 tallness Ratio of height to width, i.e. feature 2 divided by feature 3
5 rows_with_5+ Number of rows with five or more black pixels
6 cols_with_5+ Number of columns with five or more black pixels
7 1neigh Number of black pixels with exactly 1 black neighbouring pixel
8 2neigh Number of black pixels with exactly 2 black neighbouring pixels
9 4+neigh Number of black pixels with 4 or more black neighbours
10 max_dist The maximum Euclidean distance between any 2 black pixels in the
image, measured in pixel units from the centre of each pixel. For example,
a centre pixel has a distance of 1.414 from its lower-right neighbour
(Euclidean distance: sqrt(1^2 + 1^2)) and the lower-right neighbour has
a distance of 2.828 from the top-left neighbour (sqrt(2^2 + 2^2)).
11 nr_regions Two black pixels A and B are connected if they are neighbours of each
other, or if a black pixel neighbour of A is connected to B (this definition
is actually symmetric); a connected region is a maximal set of black pixels
which are connected to each other; this feature has the number of
connected regions in the image.
12 nr_eyes In a written character, an “eye” is a region of whitespace that is
completely surrounded by lines of the character. For example, “A”
contains one eye, “B” contains two eyes, and “C” contains no eyes. A
region of whitespace is an eye if there is a ring of black pixels surrounding
it which are all connected (i.e. they form a chain of neighbours). This
feature is the number of eyes in the image.
13 [your label] Design any other feature you like, which you think may be useful for
distinguishing between symbols.
14 [your label] Design any other feature you like, which you think may be useful for
distinguishing between symbols. This should not be a simple modification
of feature 13.
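The definitions of nr_regions and nr_eyes in the table above can both be approached with a flood fill. The following is a minimal sketch in Python, for illustration only (this section must be completed in R). It assumes that black pixels are connected through all 8 neighbours, and that white pixels are connected only through their 4 edge neighbours, so that whitespace cannot leak through a diagonal gap in a ring of black pixels. That second assumption is one possible reading of the "eye" definition, not a rule stated by the assignment.

```python
from collections import deque

# All 8 neighbours (used for black pixels) and the 4 edge neighbours (used for white).
N8 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def count_components(grid, value, offsets):
    """Count connected groups of cells equal to `value` using flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1  # found an unvisited cell: start a new component
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in offsets:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == value and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return count


def pad_white(matrix):
    """Surround the image with a one-pixel border of white (0) pixels."""
    cols = len(matrix[0])
    blank = [0] * (cols + 2)
    return [blank[:]] + [[0] + list(row) + [0] for row in matrix] + [blank[:]]


def nr_regions(matrix):
    """Number of 8-connected regions of black pixels."""
    return count_components(matrix, 1, N8)


def nr_eyes(matrix):
    """Number of enclosed white regions.

    After padding, all outside whitespace forms a single component, so every
    additional white component is an eye.
    """
    return count_components(pad_white(matrix), 0, N4) - 1
```

A 3x3 ring of black pixels around one white pixel gives one region and one eye under these assumptions.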
Your task in this section is to write code to calculate each of the features above. In calculating pixel
neighbours, you can assume that the images are padded on each side with white pixels.
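As a rough guide to features 1 to 10, here is a minimal sketch in Python, for illustration only (the submitted code for this section must be in R). It assumes that the neighbour features count black neighbours, treats positions outside the image as white (equivalent to the padding described above), and returns a tallness of 0 for an empty image. The function name basic_features and the empty-image convention are assumptions of this sketch.

```python
from itertools import combinations
from math import hypot


def basic_features(matrix):
    """Compute features 1 to 10 for a matrix of 1 (black) and 0 (white)."""
    rows, cols = len(matrix), len(matrix[0])
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    height = sum(1 for s in row_sums if s > 0)
    width = sum(1 for s in col_sums if s > 0)
    black = [(r, c) for r in range(rows) for c in range(cols) if matrix[r][c] == 1]

    def black_neighbours(r, c):
        # Positions outside the image count as white, as if the image were padded.
        total = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) != (0, 0) and 0 <= nr < rows and 0 <= nc < cols:
                    total += matrix[nr][nc]
        return total

    counts = [black_neighbours(r, c) for r, c in black]
    # Largest centre-to-centre distance over all pairs of black pixels.
    max_dist = max(
        (hypot(r1 - r2, c1 - c2) for (r1, c1), (r2, c2) in combinations(black, 2)),
        default=0.0,
    )
    return {
        "nr_pix": len(black),
        "height": height,
        "width": width,
        "tallness": height / width if width else 0.0,
        "rows_with_5+": sum(1 for s in row_sums if s >= 5),
        "cols_with_5+": sum(1 for s in col_sums if s >= 5),
        "1neigh": sum(1 for n in counts if n == 1),
        "2neigh": sum(1 for n in counts if n == 2),
        "4+neigh": sum(1 for n in counts if n >= 4),
        "max_dist": max_dist,
    }
```

Checking all pairs of black pixels for max_dist is quadratic in nr_pix, which is acceptable here because a 20x20 image has at most 400 black pixels.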
Save your calculated features in a file called STUDENTNR_features.csv, where STUDENTNR is
your student number. This file will consist of 150 rows, with each row listing the comma-separated
feature values for each of your 150 images (10 images for each of 15 categories).
For example, the features for your eighth “e” image may be as follows:
25,8,14,8,4,12,8,8,1,11,0,7,1,1,1,2
The 10 rows that correspond to the 10 instances of a particular character should be grouped together in
the features file, and the order of the 10 rows should correspond to the INDEX used in the image
filenames. In other words, the 150 rows of STUDENTNR_features.csv should be sorted first by
the label and secondly by the index.
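The required row order can be sketched as below, in Python for illustration only. The sketch assumes each row is a list starting with the label and the index, that both sort numerically, and that the file has no header row (the example row above has none). The function name write_features and the label values in the usage note are placeholders.

```python
import csv
import io


def write_features(rows, fh):
    """Write feature rows sorted first by label, then by index."""
    writer = csv.writer(fh, lineterminator="\n")
    # Convert to int so that index 10 sorts after index 2, not before it.
    for row in sorted(rows, key=lambda r: (int(r[0]), int(r[1]))):
        writer.writerow(row)


# Small in-memory demonstration with shortened, made-up rows.
buffer = io.StringIO()
write_features([[25, 2, 7], [11, 10, 3], [11, 1, 4]], buffer)
```

After this call, buffer.getvalue() lists the two rows for label 11 (index 1, then index 10) followed by the row for label 25.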
If you cannot calculate a particular feature, you may use a random integer between 0 and 10 for the
feature values instead, or you can use your own manual estimate of the features’ value, provided you
clearly indicate that this is what you have done in both the report and the source code. (You will lose
marks for not calculating the feature, but you can use the random/estimated values in the analyses
that follow in the subsequent sections).
In your report, briefly describe and explain the code you have written to calculate the features above.
If you ran into difficulties, you should still explain your thought processes and your attempts to
calculate the features. In the case of features 13 and 14, you should explain your rationale for choosing
the features you did, as well as how they are calculated (i.e. you should give a justification for why you
think these features should be useful). These features should not be simple modifications of each
other, or of other features.
Finally, create a single feature file, called STUDENTNR_features.csv, whose 150 rows contain the
features calculated for all 150 images.
Section 3 (40%): Statistical analyses of feature data
In this section, you will perform statistical analyses of the feature data, in order to explore which
features are important for distinguishing between different kinds of symbols.
You shall use descriptive statistics (mean, variance, etc.), null hypothesis testing, and confidence
intervals to perform your analysis of the data. You are encouraged to provide tables, figures, and/or
graphs in the report to support your discussions and findings. When performing tests, always
consider whether multiple test correction is needed.
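To illustrate what a multiple test correction does, the following minimal Python sketch computes Bonferroni and Holm adjusted p-values. These are two common corrections chosen here as examples; the assignment does not prescribe a particular method, and the function names are assumptions of this sketch.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Taking the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted
```

An adjusted p-value is then compared with the chosen significance level in the usual way. Holm is never more conservative than Bonferroni, which is one reason it is often preferred.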
It is your responsibility to define the appropriate assumptions to run the tests, and to choose an
appropriate test according to the data characteristics and the question that you are studying. You
are not restricted to the hypothesis tests that were discussed in the lectures. Remember to always justify
the approach that you choose to employ. You may assume a significance level of 0.05 for the
analyses when running hypothesis testing.
In particular, in the report you should address each of the following subtasks, using appropriate
statistical tests, tables, graphs, etc.
1. Construct suitable histograms for the nr_pix, height, and cols_with_5+ features, for each of
the following sets of items: (a) the 50 digits, (b) the 50 letters, (c) the 50 punctuation symbols
and (d) the full set of 150 items. Briefly describe the shape of the distributions and comment
on any interesting patterns across the datasets. Visually assess the skew and normality of
the distributions.
2. Present summary statistics (e.g. mean and standard deviation) about all the features, for (a)
the 50 digits, (b) the 50 letters, (c) the 50 punctuation symbols and (d) all 150 items. Briefly
discuss the summary statistics, and whether they already suggest which features may be
useful for discriminating digits and letters. For features you feel may be interesting for
discrimination between groups, consider suitable visualisations (e.g. histogram of feature
values for the three groups; see
https://stackoverflow.com/questions/36049729/r-ggplot2-get-histogram-of-difference-between-two-groups).
State what type of variable (continuous, categorical, etc) each variable is.
3. Assume that the nr_pix variable is sampled from a population which is normally distributed.
Estimate the mean and variance of the distribution from the available data for the 150
items. Plot the theoretical normal distribution for nr_pix and compare to the corresponding
histogram.
4. Assuming that the nr_pix variable is from a normally distributed population as above, what
is the cut-off value for the nr_pix variable such that a randomly sampled image has a 5%
probability of having a nr_pix value that is above that cut-off value?
5. Certain statistical tests assume that data are normally distributed. For each of the feature
variables 1-10, identify variables with extreme skew and investigate transformations of the
variables. Which features do you choose to transform? Explain how you reach your decision
and how the transformation changes the distribution.
6. Investigate the relationship between the “height” variable and the “tallness” variable. Are
these variables linearly associated? Consider suitable visualisation. Describe a statistical test
to measure the degree of association between these two variables.
7. For every pair of features, calculate a measure of the degree of association between the
features. Present the results in a suitable graph or table, indicating which pairs of features
are significantly correlated.
8. Is the nr_pix feature useful to discriminate between the 3 different groups of letters,
numbers and punctuation symbols? (Note that here we are looking for differences between
3 different groups, so consider a statistical method that tests for statistically significant
differences between more than 2 groups). State clearly the statistical test used, the
assumptions of the test, and how the assumptions relate to your data. If you use ANOVA,
provide the full results table for the statistical test, and describe how each element of the
results table is calculated.
9. For every feature, use ANOVA or similar statistical test to test whether there is a difference
between the three groups for that feature. Present the results (i.e. F-scores) for all features
in a suitable graph or table and indicate for which features there is a significant difference
between groups.
10. Fit multiple regression models to predict nr_pix from subsets of the remaining variables.
Consider at least three regression models using different numbers of predictor variables and
compare the models in terms of their goodness-of-fit. Justify your choice of measure of
goodness-of-fit.
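For subtasks 3 and 4, the calculation can be sketched as follows, in Python for illustration only (the submitted analysis must be in R). The sketch estimates the mean and the sample variance, then takes the 95th percentile of the fitted normal distribution as the cut-off. The function name upper_cutoff is an assumption, and any input values used with it here are made up.

```python
from statistics import NormalDist, mean, variance


def upper_cutoff(values, tail_prob=0.05):
    """Return (mean, variance, cut-off) for a fitted normal distribution.

    The cut-off is the value exceeded with probability tail_prob.
    """
    mu = mean(values)
    var = variance(values)  # sample variance, with n - 1 in the denominator
    fitted = NormalDist(mu, var ** 0.5)
    return mu, var, fitted.inv_cdf(1 - tail_prob)
```

With tail_prob=0.05 the cut-off equals the mean plus roughly 1.645 standard deviations, which gives a quick sanity check on whatever value the R code produces.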
For all questions above, you shall explain your reasoning, assumptions and steps of the procedure
(including the statistical analysis) when preparing the report. If you are generating p-values for
analysing the statistical significance of some features, make sure to explain how they were obtained.
It is your task to decide and justify what the most appropriate inference to be performed in each
case is, and to discuss the results you obtained.
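Subtask 8 asks how each element of an ANOVA results table is calculated. The following minimal Python sketch, for illustration only, computes the sums of squares, degrees of freedom, mean squares and F statistic for a one-way ANOVA. The p-value is left out because it requires the F distribution (in R, for example, pf(F, df_between, df_within, lower.tail = FALSE)). The function name one_way_anova is an assumption of this sketch.

```python
def one_way_anova(groups):
    """Return the elements of a one-way ANOVA table for a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variation: group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group variation: observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between, "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }
```

For the three groups of 50 items in this assignment, the degrees of freedom would be 2 between groups and 147 within groups.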
Section 4 (40%): Machine learning
In this section, you will use the features you developed and analysed above to solve classification
problems. Specifically, you will fit classifiers to your image data, in order to build and evaluate useful
models that can predict the class labels for unseen images.
1. Using the width feature variable only, fit a logistic regression model to discriminate between
the category of “digits” and the category of “punctuation symbols”. Present the results table
for the logistic regression, including the coefficient estimates, the z-scores and associated p-values.
Briefly interpret the results of the logistic regression.
2. Using the logistic regression model you calculated above, find (a) the digit with the greatest
probability of being incorrectly classified as a punctuation symbol, and (b) the punctuation
symbol with the greatest probability of being incorrectly classified as a digit.
3. Using any 4 features that you think should be useful (justifying your choices, e.g. on the basis
of results in section 3), use logistic regression to discriminate between the “letter” and
“punctuation symbol” categories. Use 5-fold cross-validation to evaluate the accuracy of your
fitted model. Briefly interpret your results.
4. In this question, we aim to build a classifier to discriminate between the three classes of
“digit”, “letter” and “punctuation symbol”. Perform k-nearest-neighbour classification with k
= {1,3,5,9,11,13,15,17,19,21,23} using 5-fold cross-validation, and using any 3 features you
think should be suitable (justifying your choices). (Consider suitable transformations of
features, such as feature scaling).
5. Build the best k-nearest-neighbour model you can that discriminates between the three
groups, evaluated using 5-fold cross-validation. Experiment systematically with different
values of k and different sets of features, explaining and justifying your choices. Your goal is
to try to come up with the best k-nearest-neighbour model you can for classifying the three
categories of symbols.
6. Give a brief conceptual overview of the random forest method for classification. Build a
random forest model that discriminates between the three groups, evaluated using 5-fold
cross-validation. Your goal is to try to come up with the best final model you can for
classifying the three categories of symbols. Briefly compare and interpret your results for
knn and random forest.
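The k-nearest-neighbour evaluation in questions 4 and 5 can be sketched as follows, in Python for illustration only (the submitted code must be in R, where packages such as “class” or “caret” provide this). The sketch assumes Euclidean distance, feature standardisation fitted on the training folds only, and majority voting with ties going to the class seen first among the nearest neighbours. All function names are assumptions of this sketch.

```python
import random
from collections import Counter
from math import dist
from statistics import mean, pstdev


def standardise(train, test):
    """Scale both sets using the mean and standard deviation of the training set."""
    columns = list(zip(*train))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero

    def scale(rows):
        return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

    return scale(train), scale(test)


def knn_predict(train_x, train_y, query, k):
    """Predict the majority class among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_folds=5, seed=3060):
    """Return the cross-validated accuracy of k-nearest-neighbour classification."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        train_x, test_x = standardise([xs[i] for i in train_idx], [xs[i] for i in fold])
        train_y = [ys[i] for i in train_idx]
        correct += sum(
            knn_predict(train_x, train_y, query, k) == ys[i]
            for query, i in zip(test_x, fold)
        )
    return correct / len(xs)
```

Fitting the scaling on the training folds alone keeps information from the held-out fold out of the model, which gives a more honest accuracy estimate. Looping cv_accuracy over the required values of k then gives one accuracy per k to tabulate or plot.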
Assessment criteria and marking process
The most important criterion in marking is the quality and clarity of your report (approximately 65%
weighting). In your report, you should clearly demonstrate that you understand the methods used in
the assignment. Explain your reasoning, assumptions and steps of the procedures used. You should
explain and interpret your results. What are your results telling you? Are the results what you would
expect? If you ran into difficulties, explain what they were and the efforts you made to try to
overcome them.
Code has a weighting in marking of approximately 30% overall. Your code should be clear and
logically organised, and accurately calculate the values required, but code efficiency and code
sophistication are not important (this assignment does not require complex programming). If you use
freely licenced code, packages, or libraries (which is encouraged), these should be appropriately
referenced (e.g. by citing a URL in a comment). The code must be easy to use and the comments
must include information about the required steps to replicate the results that you have obtained
and are presenting in your report (transparency and replicability are essential in data analysis).
Attention to detail and following the assignment instructions accurately will also be considered in
marking (approximately 5% weighting). Each sub-task has a precise specification. Make sure you
carefully follow the instructions, and use the features and procedures specified for each task
(number of cross-validation folds, seed value, etc). Make sure you upload your
deliverable files in the specified formats.
Instructions for submission
This exercise is to be completed individually. The dataset is generated by you, and it is not
expected to be significantly similar to the data of other students. You shall deliver any generated
source code with suitable documentation/comments. Plagiarism may be severely punished.
For your report, you should use the Word document template provided. You may use LaTeX to
prepare your report if you wish, but please follow the general layout and formatting of the Word
template.
The maximum word count for each section of the report is as follows:
Section 1: 200 words
Section 2: 800 words
Section 3: 3000 words
Section 4: 2000 words
These values should be regarded as maximums only; it should be possible to give appropriate
answers with fewer words. You may include as many figures and tables in your report as you feel is
suitable, and these do not contribute to the word count. You should explain how you have
performed the analysis, but do not explain details of code in your report - use the source code
comments for that. Properly cite any sources that you have used. Use an 11-point font for the text
body.
You should upload a single zip file, containing a folder with the following directory structure:
STUDENTNR/                      # Top-level directory
    STUDENTNR_report.pdf        # The assignment report
    STUDENTNR_features.csv      # The features calculated in section 2
    code_section1/              # Directory of source code files for section 1
    code_section2/              # Directory of source code files for section 2
    code_section3/              # Directory of source code files for section 3
    code_section4/              # Directory of source code files for section 4
    images/                     # Directory of 150 image files created in section 1
Ensure that your name, student number and the module details (AIDA CSC3060) are in the header of
the submitted pdf. You must submit the assignment online, using the QOL webpage of the AIDA
module, by the specified date. It is your responsibility to ensure that the assignment is uploaded
correctly (and that the zip file is not corrupt, etc) and you should take steps to check and verify the
upload. A RAR file is not a ZIP file. By submitting this assignment you acknowledge that it is your own
work and that you are aware of university regulations regarding academic offences, including (but
not restricted to) plagiarism and collusion.
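One way to build the zip file and check that it is not corrupt is sketched below in Python. This is a minimal sketch, assuming the top-level STUDENTNR directory already exists with the layout shown above and that the zip file is written outside that directory; the function name zip_submission is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_submission(top_dir, zip_path):
    """Zip top_dir (kept as the top-level folder) and verify the archive."""
    top_dir = Path(top_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(top_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so STUDENTNR/ stays on top.
                zf.write(path, path.relative_to(top_dir.parent).as_posix())
    with zipfile.ZipFile(zip_path) as zf:
        bad_member = zf.testzip()  # returns None when every member passes its CRC check
        if bad_member is not None:
            raise ValueError(f"corrupt member in archive: {bad_member}")
        return zf.namelist()
```

The returned list of names can be compared with the directory structure above before uploading.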