MAT 3373: HOMEWORK 1
Homework Policy Basics
(1) You can find homework deadlines in the course schedule, available on
Brightspace.
(2) You can find detailed homework grading policies in the first week’s
lecture notes, available on Brightspace.
(3) You can find overall grading policies in the syllabus, available on Brightspace.
(4) Homework should be written in R Markdown format.1 You should
always submit both the raw workbook (in .Rmd format) and a compiled
version where I can see the charts (in .html or .pdf format). Marks
will be taken off if the homework is not compiled correctly. I
suggest using the set.seed() command near the top of each homework
set, and only re-compiling chunks as you need to. This will reduce
the chance of bugs popping up in previously-completed parts of the
homework.
(5) Homework solutions should appear in the same order as the questions
in the document. I will dock 3 percent on assignments with out-of-order
solutions, and I will generally not post grades of individual questions.
(6) It is possible to submit several versions of the homework. I will always
grade the last one submitted before the deadline. With this in mind,
I strongly encourage you to submit drafts of your homework
well before the deadline. In particular, you might as well submit a
draft every time you finish a question.
1See first week’s lecture notes for comments on Python people.
1
Introduction and Homework Focus
This homework is concerned with material from Chapters 1-4 of the textbook,
with one introductory question from Section 5.2. It focuses on getting
practice with programming, and on the methods of regression, logistic regression,
and k-nearest neighbours. It uses only the basic statistical tests seen
in Chapters 1-4. In particular, the homework doesn’t use the main tools for
statistical analysis of a collection of very different models, which are introduced
in Chapter 5. As such, many of the questions are either straightforward
applications of the methods (1,2,7,10) or ask you to identify and resolve an
enormous problem that appears when you try a straightforward application
(3,4,5,8,9).
Finding and fixing these sorts of enormous problems is of course important
- all of the mistakes highlighted in this homework are fairly easy to make, even
for professionals. Still, this might give the impression that machine learning is
mostly about identifying these sorts of gross violations of modelling assumptions.
In the following homework set, we’ll have access to more of the machine
learner’s standard toolkit, and we’ll do more examples where a “normal” work-
flow gives sensible results that can be improved with small tweaks and good
statistical practice.
2
1. Introduction to Data Exploration
Do Question 8 from Chapter 2 of the textbook.
2. Empirical Study: KNN for MNIST Data
Open the (small subsample of the) MNIST testing and training datasets:
MNIST_train = read.csv("mnist_train.csv")
MNIST_test = read.csv("mnist_test.csv")
Fit the knn classifier using the training data for k ∈ {3, 4, 5, 6}. Select the
value of k that minimizes the test error, and calculate the confusion matrix.
Comment on the results. If you saw a collection of new datapoints from the
same collection, would you expect the error to be larger, smaller, or about the
same as the observed test error?
3. Short Conceptual Questions
(1) Do Question 3 from Chapter 3 of the textbook.
(2) Imagine that you’re going to collect data, and have committed to doing
a one-dimensional linear regression analysis with intercept parameter
β0 known to be 0 and variance parameter σ known to be 1 (so you are
just trying to learn the slope term β1). Furthermore, you are absolutely
certain that the linear regression model is true.2
Before collecting the response variable, you need to choose the predictors.
You have the following options:
X
(1) = (−12, −9, −6, −3, 0, 3, 6, 9, 12)
X
(2) = (−1, −0.9, −0.8, . . . , 0.8, 0.9, 1)
X
(3) = (−28, −2, 76, 412).
Which of those should you choose? Why?
(3) People often use statistical models to do optimization, as follows. You
have some function f, and would like to find the largest value maxx f(x).
You try the following procedure:
(a) Collect data (X1, Y1), . . . ,(Xn, Yn).
(b) Based on this data, estimate a model ˆf.
(c) Use the maximum of the predicted values argmaxx
ˆf(x) as a prediction
for the true location of the maximum value argmaxxf(x).
2Of course this is unrealistic - but please take it seriously for this question.
3
This is a reasonable thing to try: you get an estimate ˆf for an entire
function, and you can try to use that as a surrogate for the real function
when doing optimization (or any other task). For some models, this
approach works rather well. Explain why it will almost never work well
when the model is a linear regression model on all of R.
4. Lying with Linear Regression
It is often possible to get very misleading results by deliberately choosing a
bad model with a certain structure. In this question, we’ll practice doing this
in a simple setting.3
Throughout the question, we’ll assume that every X-value is paired with
the observed Y -value Y = sin(X); there is no measurement error. We will
then fit this data to the usual linear regression model with unknown β0, β1, σ.
(1) Find a collection of X1, . . . , Xn of predictors so that the 99-percent con-
fidence interval for the slope β1 is contained in the interval (−∞, −0.95].
Display the collection of points, fit the model in R, and give the output
of the summary function applied to the fit.
(2) Find a collection of X1, . . . , Xn of predictors so that the 99-percent
confidence interval for the slope β1 is contained in the interval [0.95,∞).
Display the collection of points, fit the model in R, and give the output
of the summary function applied to the fit.
(3) Would it be possible to do part (2) of this question if I replaced the
interval [0.95,∞) by the interval [100, ∞)? Why or why not?
Note: A complete proof is not required, but will be considered for
bonus marks.
5. Simulation Study: Post-Selection Inference
Generate predictors “Pred” and response variables “Resp” according to the
following R code:
m= 80 # You can experiment with your own large value.
n=100 # You can experiment with your own large value.
Pred = matrix(rnorm(m*n,0,1), nrow = n, ncol = m)
Resp = rnorm(n,0,1)
We interpret this as n datapoints, with the j’th datapoint looking like
(X
(j)
1
, . . . , X(j)
m , Y (j)
).
(1) Do the following two steps:
3
In case it wasn’t clear: you shouldn’t actually do this! However, seeing very misleading
results can help you learn to diagnose problems with statistical analyses.
4
(a) For each i ∈ {1, 2, . . . , m}, compute the correlation of each (X
(1)
i
, . . . , X(n
i
)
with (Y
(1), . . . , Y (n)
).
Denote by i1, i2 the indices with the largest correlation in absolute
value.
(b) Fit the linear model
Y = β0 + β1Xi1 + β2Xi2 +
based on the response variable Y and the two chosen predictors.
Comment on the fit and the confidence intervals for these quantities.
(2) Split the data into two equal-sized parts - a training dataset and a
testing data set.
Repeat the above question, but use only the training dataset in part
(a) and only the testing dataset in part (b). Comment on the difference
between the results you see.
6. Theory: Consistency of K-NN Classification
We will prove that the k-nearest neighbour classification algorithm eventually
gets the right answer, at least in a simple setting.
Define the function f : [0, 1] 7→ {0, 1} by the piecewise-constant formula
f(x) = 0, x < 0.5
f(x) = 1, x ≥ 0.5.
Consider data X1, X2, . . .
i.i.d. ∼ Unif[0, 1] and let Yi = f(Yi). For integer
k ∈ {3, 5, 7, . . .} and n ∈ {k + 1, k + 2, . . .}, let ˆfk,n be the k-nearest neighbour
classifier associated with the dataset {(X1, Y1), . . . ,(Xn, Yn)}.
Prove that, for any fixed x ∈ [0, 0.5)∪(0.5, 1] and any fixed k ∈ {3, 5, 7, . . .},
we have
limn→∞
P[
ˆfk,n(x) = f(x)] = 1.
Hint 1: You don’t have to directly estimate the probability of the event
An = {
ˆfk,n(x) = f(x)} in the question. It is enough to estimate the probability
of some event Bn contained in An.
Hint 2: If you’re stuck proving this for all x, start with a concrete number
such as x = 0.75. Can you write down a formula for the probability of the
event “at least k of the points X1, . . . , Xn are in the interval (0.5, x)”? When
that event happens, what can you say about ˆfn,k(x)?
5
Hint 3: You may find the following fact useful: for any constants α, β > 0,
limn→∞
n
α
e
−βn = 0.
You may remember this as a phrase like “exponentials beat polynomials” from
a plotting section in calculus class.
7. Empirical Study: Bayes Rate of Classifiers
Open the dataset weight-height.csv. This dataset was downloaded from
Kaggle.com. Split this dataset (randomly) into a training set and a testing
dataset, each with 2500 men and 2500 women.
(1) For both genders separately, calculate the sample average weight, sample
average height, and the sample variance for the weight and height
based on this training dataset. Note: you should have eight numbers
at this point - for each of two genders, and each of two measurements,
you should both the mean and variance.
(2) For the remainder of the question, assume that the male data is exactly
bivariate Gaussian with means and variances as calculated in the previous
part of the question (and 0 covariance). Similarly, assume that
the male data is exactly bivariate Gaussian with means and variances
as calculated in the previous part of the question (and 0 covariance). 4
Based on this modelling assumption and using only the observed
height and weight, predict the gender of the 5000 elements of the testing
set according to the Bayes classifier (see page 38 of the textbook
for a definition).
Comment on the quality of this classifier. Can you think of anywhere
that the model can be improved, or does it look about as good as you
could hope for (given the data)?
Hint: You may wish to make an illustrative plot along the following
lines: let the x-axis and y-axis be the observed weight and height, and
colour each point in one of 4 colours corresponding to true/predicted
gender. You need not make exactly this plot (and indeed you will
probably need to make small adjustments to end up with a useable
picture).
4Of course, this is not really the true data-generating process. I’m making this assumption
because it is only possible to calculate a Bayes rate if you know the true data-generating
process. As you can probably guess, this means that you can almost never calculate the
true Bayes rate. Nonetheless, it is useful to think about the Bayes rate to try to understand
fundamental limits with any classification procedure.
6
(3) Compute the average Bayes error rate, using the same modelling assumption
as in the previous part of this question.5 Use a plot or other
data-summarization technique to check if the Bayes error rate seems
sensible, and explain/interpret the broad features of your plot.
8. Data Analysis: House Sizes and Collinearity
Open the file
House_Data.csv
using the command:
house = read.csv("House_Data.csv")
This dataset is a version of the dataset posted at https://www.kaggle.com/
c/house-prices-advanced-regression-techniques/overview. The data
itself is real; I just pruned various outliers to make the following data analysis
slightly easier.
In the following, I denote by Ai
, Fi and Si the lot area, first floor area and
second floor area of the i’th house. In the dataset, these can be accessed by
writing:
house$LotArea[i], house$X1stFlrSF[i], house$X2ndFlrSF[i]
We will consider the following “full” regression model
Ai = α + β1Fi + β2Si + i (8.1)
and the two “sub-”models
Ai = α + β1Fi + i (8.2)
and
Ai = α + β2Si + i
. (8.3)
Note: Equations (8.1), (8.2) and (8.3) are three different models - I’m not
writing down one model where all three of these equations hold simultaneously!
In reality none of these models are quite true, and when you fit them you will
not get the same estimates for α, β1, β2, i
in all three models.
(1) Plot Fi vs Si
. Do you notice anything about their relationship?
(2) Use the lm function to fit all three models and find p-values for the
parameters α, β1, β2.
5Again, computing the Bayes error rate requires you to know (or assume you know) the
true data-generating process.
7
(3) When you fit model (8.1), are the p-values for β1, β2 below 0.05? What
about the p-values when you fit models (8.2) and (8.3)? Comment on
any discrepancy, perhaps in light of the plot in the first part of the
question.
(4) In the first three parts of the question, I told you to plot Fi
, Si together
and then fit three models for Ai
. In light of the plot from part (1) (and
common-sense observations about the relationship between Ai
, Fi
, Si),
suggest a modelling procedure that would have produced better results.
9. Simulation Study: Bootstrap
There are several closely-related algorithms called “the boostrap.” We first
describe the bootstrap that will be studied in this question.
We have some parametric model {fθ}θ∈Θ, data X1, . . . , Xn ∈ R, and estimator
ˆθ : R
n
7→ Θ taking a dataset to a parameter value. A single bootstrap
replicate is obtained by the following algorithm:
(1) Sample X∗
1
, . . . , X∗
n uniformly and with replacement from the dataset
X1, . . . , Xn.
(2) Return Θ∗ = ˆθ(X∗
1
, . . . , X∗
n
).
Denote by θ0 ∈ Θ the “true” parameter, F the true distribution of ˆθ(X1, . . . , Xn),
and Fboot the distribution of Θ∗
.
(1) Consider the parametric model fθ = N(θ, 1) with the usual MLE ˆθ =
1
n
Pn
i=1 Xi
. Fix n = 100 and θ = 0. Do a simulation study and plot
the densities of both F and Fboot. Do they look similar? 6
(2) Consider the parametric model fθ = Unif[0, θ]. Show that the MLE is
ˆθ = max(X1, . . . , Xn).
(3) Set n = 100 and θ = 1. Do a simulation study to plot the densities
of both F and Fboot for the setup in part (2).7 Do they look similar?
If not, comment on what would “go wrong” if you used Fboot as a
surrogate for F.
Depending on how you plot initially, the densities might be “squished”
and hard to see - rescale your x- and y-axes until you get a good luck,
6
It is possible to compute F exactly, but computing Fboot is much harder. Rather than
trying to do this, you should take many samples from Fboot using the bootstrap algorithm
given at the start of the question, then plot the histogram or density plot of this sample.
7As in the previous part of the question, you can estimate F and Fboot by simulating from
the appropriate densities and plotting the empirical distribution of the densities. However,
it is not terribly difficult to compute both of these densities exactly using only tools from
MAT2371/2377. If you can do these computations, you may find it easier to plot and
compare them.
8
and comment in particular on the values in intervals that are very close
to 1, such as [0.998, 1].
10. Logistic Regression Practice
Do Parts (a,b,c,d,e) of Question 10 from Chapter 4 of the textbook.
Comment on the quality of the result.
9
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。