
MAT 3373: HOMEWORK 1

Homework Policy Basics

(1) You can find homework deadlines in the course schedule, available on Brightspace.

(2) You can find detailed homework grading policies in the first week’s lecture notes, available on Brightspace.

(3) You can find overall grading policies in the syllabus, available on Brightspace.

(4) Homework should be written in R Markdown format.1 You should always submit both the raw workbook (in .Rmd format) and a compiled version where I can see the charts (in .html or .pdf format). Marks will be taken off if the homework is not compiled correctly. I suggest using the set.seed() command near the top of each homework set, and only re-compiling chunks as you need to. This will reduce the chance of bugs popping up in previously-completed parts of the homework. (A sample setup chunk appears after this list.)

(5) Homework solutions should appear in the same order as the questions in the document. I will dock 3 percent on assignments with out-of-order solutions, and I will generally not post grades for individual questions.

(6) It is possible to submit several versions of the homework. I will always grade the last one submitted before the deadline. With this in mind, I strongly encourage you to submit drafts of your homework well before the deadline. In particular, you might as well submit a draft every time you finish a question.

1 See the first week’s lecture notes for comments on Python people.
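For item (4), a minimal sketch of such a setup chunk, using the standard R Markdown chunk syntax; the particular seed value is an arbitrary choice:

```{r setup}
# Fixing the seed means re-knitting the document reproduces the same
# random numbers, so earlier answers don't silently change.
set.seed(3373)
```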


Introduction and Homework Focus

This homework is concerned with material from Chapters 1-4 of the textbook, with one introductory question from Section 5.2. It focuses on getting practice with programming, and on the methods of regression, logistic regression, and k-nearest neighbours. It uses only the basic statistical tests seen in Chapters 1-4. In particular, the homework doesn’t use the main tools for statistical analysis of a collection of very different models, which are introduced in Chapter 5. As such, many of the questions are either straightforward applications of the methods (1, 2, 7, 10) or ask you to identify and resolve an enormous problem that appears when you try a straightforward application (3, 4, 5, 8, 9).

Finding and fixing these sorts of enormous problems is of course important: all of the mistakes highlighted in this homework are fairly easy to make, even for professionals. Still, this might give the impression that machine learning is mostly about identifying these sorts of gross violations of modelling assumptions. In the following homework set, we’ll have access to more of the machine learner’s standard toolkit, and we’ll do more examples where a “normal” workflow gives sensible results that can be improved with small tweaks and good statistical practice.


1. Introduction to Data Exploration

Do Question 8 from Chapter 2 of the textbook.

2. Empirical Study: KNN for MNIST Data

Open the (small subsample of the) MNIST testing and training datasets:

MNIST_train = read.csv("mnist_train.csv")

MNIST_test = read.csv("mnist_test.csv")

Fit the knn classifier using the training data for k ∈ {3, 4, 5, 6}. Select the value of k that minimizes the test error, and calculate the confusion matrix. Comment on the results. If you saw a collection of new datapoints from the same collection, would you expect the error to be larger, smaller, or about the same as the observed test error?
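A minimal sketch of one possible workflow, using the knn function from the class package. I am assuming here that the first column of each file is the digit label and the remaining columns are pixel values; check the actual column layout before relying on this.

library(class)  # provides knn()

MNIST_train = read.csv("mnist_train.csv")
MNIST_test = read.csv("mnist_test.csv")

# Assumed layout: column 1 = label, remaining columns = pixels.
train_X = MNIST_train[, -1]; train_y = as.factor(MNIST_train[, 1])
test_X = MNIST_test[, -1]; test_y = as.factor(MNIST_test[, 1])

# Test error for each candidate k.
ks = 3:6
test_err = sapply(ks, function(k) {
  mean(knn(train_X, test_X, cl = train_y, k = k) != test_y)
})
best_k = ks[which.min(test_err)]

# Confusion matrix for the selected k.
best_pred = knn(train_X, test_X, cl = train_y, k = best_k)
table(predicted = best_pred, actual = test_y)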

3. Short Conceptual Questions

(1) Do Question 3 from Chapter 3 of the textbook.

(2) Imagine that you’re going to collect data, and have committed to doing a one-dimensional linear regression analysis with intercept parameter β0 known to be 0 and variance parameter σ known to be 1 (so you are just trying to learn the slope term β1). Furthermore, you are absolutely certain that the linear regression model is true.2

Before collecting the response variable, you need to choose the predictors. You have the following options:

X^(1) = (−12, −9, −6, −3, 0, 3, 6, 9, 12)

X^(2) = (−1, −0.9, −0.8, . . . , 0.8, 0.9, 1)

X^(3) = (−28, −2, 76, 412).

Which of those should you choose? Why? (One way to compare the options numerically is sketched after this list.)

(3) People often use statistical models to do optimization, as follows. You have some function f, and would like to find the largest value max_x f(x). You try the following procedure:

(a) Collect data (X1, Y1), . . . , (Xn, Yn).

(b) Based on this data, estimate a model f̂.

(c) Use the maximizer of the predicted values, argmax_x f̂(x), as a prediction for the true location of the maximum value, argmax_x f(x).

2 Of course this is unrealistic, but please take it seriously for this question.


This is a reasonable thing to try: you get an estimate f̂ for an entire function, and you can try to use that as a surrogate for the real function when doing optimization (or any other task). For some models, this approach works rather well. Explain why it will almost never work well when the model is a linear regression model on all of R.
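Returning to part (2): since β0 and σ are known, the only unknown is β1, and the least-squares slope in the through-the-origin model has variance σ²/Σ xi². A small sketch comparing the three candidate designs on this basis; taking slope variance as the comparison criterion is my suggestion of one natural way to look at the problem, not the only possible answer.

# With beta0 = 0 and sigma = 1 known, Var(beta1-hat) = 1 / sum(x^2)
# for the least-squares slope; smaller is better.
X1 = c(-12, -9, -6, -3, 0, 3, 6, 9, 12)
X2 = seq(-1, 1, by = 0.1)
X3 = c(-28, -2, 76, 412)
sapply(list(X1, X2, X3), function(x) 1 / sum(x^2))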

4. Lying with Linear Regression

It is often possible to get very misleading results by deliberately choosing a bad model with a certain structure. In this question, we’ll practice doing this in a simple setting.3

Throughout the question, we’ll assume that every X-value is paired with the observed Y-value Y = sin(X); there is no measurement error. We will then fit this data to the usual linear regression model with unknown β0, β1, σ.

3 In case it wasn’t clear: you shouldn’t actually do this! However, seeing very misleading results can help you learn to diagnose problems with statistical analyses.

(1) Find a collection X1, . . . , Xn of predictors so that the 99-percent confidence interval for the slope β1 is contained in the interval (−∞, −0.95]. Display the collection of points, fit the model in R, and give the output of the summary function applied to the fit. (A sketch of this workflow appears after this list.)

(2) Find a collection X1, . . . , Xn of predictors so that the 99-percent confidence interval for the slope β1 is contained in the interval [0.95, ∞). Display the collection of points, fit the model in R, and give the output of the summary function applied to the fit.

(3) Would it be possible to do part (2) of this question if I replaced the interval [0.95, ∞) by the interval [100, ∞)? Why or why not?

Note: A complete proof is not required, but will be considered for bonus marks.
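A minimal sketch of the workflow for part (1), under one illustrative choice of points (a short stretch where sin is decreasing with slope near −1). These particular values are an assumption of mine; you should check that the resulting interval actually satisfies the requirement.

# Points clustered near pi, where sin(x) is nearly linear with slope
# close to -1, so the fitted slope is steeply negative with tiny errors.
x = seq(3.04, 3.24, by = 0.02)
y = sin(x)
plot(x, y)

fit = lm(y ~ x)
summary(fit)
confint(fit, "x", level = 0.99)  # check containment in (-Inf, -0.95]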

5. Simulation Study: Post-Selection Inference

Generate predictors “Pred” and response variables “Resp” according to the following R code:

m = 80   # You can experiment with your own large value.
n = 100  # You can experiment with your own large value.
Pred = matrix(rnorm(m*n, 0, 1), nrow = n, ncol = m)
Resp = rnorm(n, 0, 1)

We interpret this as n datapoints, with the j’th datapoint looking like (X_1^(j), . . . , X_m^(j), Y^(j)).

(1) Do the following two steps (a sketch appears after this list):



(a) For each i ∈ {1, 2, . . . , m}, compute the correlation of (X_i^(1), . . . , X_i^(n)) with (Y^(1), . . . , Y^(n)). Denote by i1, i2 the indices with the largest correlations in absolute value.

(b) Fit the linear model

Y = β0 + β1 X_{i1} + β2 X_{i2} + ε

based on the response variable Y and the two chosen predictors. Comment on the fit and the confidence intervals for these quantities.

(2) Split the data into two equal-sized parts: a training dataset and a testing dataset. Repeat the above question, but use only the training dataset in part (a) and only the testing dataset in part (b). Comment on the difference between the results you see.
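A minimal sketch of part (1), assuming the Pred and Resp objects generated by the code above:

# (a) Correlation of each predictor column with the response, then the
#     two indices with the largest correlations in absolute value.
cors = cor(Pred, Resp)
top2 = order(abs(cors), decreasing = TRUE)[1:2]

# (b) Refit using only the two selected predictors.
dat = data.frame(Y = Resp, X1 = Pred[, top2[1]], X2 = Pred[, top2[2]])
fit = lm(Y ~ X1 + X2, data = dat)
summary(fit)
confint(fit)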

6. Theory: Consistency of K-NN Classification

We will prove that the k-nearest neighbour classification algorithm eventually gets the right answer, at least in a simple setting.

Define the function f : [0, 1] → {0, 1} by the piecewise-constant formula

f(x) = 0, x < 0.5
f(x) = 1, x ≥ 0.5.

Consider data X1, X2, . . . i.i.d. ∼ Unif[0, 1] and let Yi = f(Xi). For integer k ∈ {3, 5, 7, . . .} and n ∈ {k + 1, k + 2, . . .}, let f̂_{k,n} be the k-nearest neighbour classifier associated with the dataset {(X1, Y1), . . . , (Xn, Yn)}.

Prove that, for any fixed x ∈ [0, 0.5) ∪ (0.5, 1] and any fixed k ∈ {3, 5, 7, . . .}, we have

lim_{n→∞} P[f̂_{k,n}(x) = f(x)] = 1.

Hint 1: You don’t have to directly estimate the probability of the event A_n = {f̂_{k,n}(x) = f(x)} in the question. It is enough to estimate the probability of some event B_n contained in A_n.

Hint 2: If you’re stuck proving this for all x, start with a concrete number such as x = 0.75. Can you write down a formula for the probability of the event “at least k of the points X1, . . . , Xn are in the interval (0.5, x)”? When that event happens, what can you say about f̂_{k,n}(x)?
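As a reading aid for Hint 2, here is one way the probability can be written down, sketched in LaTeX; with p = x − 0.5, the number of points landing in (0.5, x) is Binomial(n, p):

% N_n = number of the X_i landing in (0.5, x), a Binomial(n, p) variable:
\[
  \mathbb{P}[N_n \ge k]
  = \sum_{j=k}^{n} \binom{n}{j} p^{j} (1-p)^{n-j}
  = 1 - \sum_{j=0}^{k-1} \binom{n}{j} p^{j} (1-p)^{n-j}.
\]
% Each term of the subtracted sum is at most a polynomial in n times an
% exponentially-decaying factor, which is where Hint 3 comes in.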


Hint 3: You may find the following fact useful: for any constants α, β > 0,

lim_{n→∞} n^α e^{−βn} = 0.

You may remember this as a phrase like “exponentials beat polynomials” from a plotting section in calculus class.

7. Empirical Study: Bayes Rate of Classifiers

Open the dataset weight-height.csv. This dataset was downloaded from Kaggle.com. Split this dataset (randomly) into a training set and a testing dataset, each with 2500 men and 2500 women.

(1) For both genders separately, calculate the sample average weight, sample average height, and the sample variance for the weight and height based on this training dataset. Note: you should have eight numbers at this point: for each of two genders, and each of two measurements, you should have both the mean and variance.

(2) For the remainder of the question, assume that the male data is exactly bivariate Gaussian with means and variances as calculated in the previous part of the question (and 0 covariance). Similarly, assume that the female data is exactly bivariate Gaussian with means and variances as calculated in the previous part of the question (and 0 covariance).4

Based on this modelling assumption and using only the observed height and weight, predict the gender of the 5000 elements of the testing set according to the Bayes classifier (see page 38 of the textbook for a definition). (A sketch of one possible implementation appears after this list of parts.)

Comment on the quality of this classifier. Can you think of anywhere that the model can be improved, or does it look about as good as you could hope for (given the data)?

Hint: You may wish to make an illustrative plot along the following lines: let the x-axis and y-axis be the observed weight and height, and colour each point in one of 4 colours corresponding to true/predicted gender. You need not make exactly this plot (and indeed you will probably need to make small adjustments to end up with a useable picture).

4 Of course, this is not really the true data-generating process. I’m making this assumption because it is only possible to calculate a Bayes rate if you know the true data-generating process. As you can probably guess, this means that you can almost never calculate the true Bayes rate. Nonetheless, it is useful to think about the Bayes rate to try to understand fundamental limits with any classification procedure.


(3) Compute the average Bayes error rate, using the same modelling assumption as in the previous part of this question.5 Use a plot or other data-summarization technique to check if the Bayes error rate seems sensible, and explain/interpret the broad features of your plot.

5 Again, computing the Bayes error rate requires you to know (or assume you know) the true data-generating process.
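A minimal sketch of the classification step in part (2). I am assuming the file has columns Gender ("Male"/"Female"), Height and Weight, and that train and test are the two halves from the split; adjust the names to match the actual file.

# The eight training numbers from part (1): per-gender means and variances.
mu_m = colMeans(train[train$Gender == "Male", c("Height", "Weight")])
mu_f = colMeans(train[train$Gender == "Female", c("Height", "Weight")])
v_m = sapply(train[train$Gender == "Male", c("Height", "Weight")], var)
v_f = sapply(train[train$Gender == "Female", c("Height", "Weight")], var)

# With 0 covariance, each group's joint density is the product of two
# univariate normal densities; with equal priors (2500 of each), the
# Bayes classifier picks the group with the larger likelihood.
lik = function(df, mu, v) {
  dnorm(df$Height, mu["Height"], sqrt(v["Height"])) *
    dnorm(df$Weight, mu["Weight"], sqrt(v["Weight"]))
}
pred = ifelse(lik(test, mu_m, v_m) > lik(test, mu_f, v_f), "Male", "Female")
table(predicted = pred, actual = test$Gender)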

8. Data Analysis: House Sizes and Collinearity

Open the file House_Data.csv using the command:

house = read.csv("House_Data.csv")

This dataset is a version of the dataset posted at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview. The data itself is real; I just pruned various outliers to make the following data analysis slightly easier.

In the following, I denote by Ai, Fi and Si the lot area, first floor area and second floor area of the i’th house. In the dataset, these can be accessed by writing:

house$LotArea[i], house$X1stFlrSF[i], house$X2ndFlrSF[i]

We will consider the following “full” regression model

Ai = α + β1 Fi + β2 Si + εi    (8.1)

and the two “sub-”models

Ai = α + β1 Fi + εi    (8.2)

and

Ai = α + β2 Si + εi.    (8.3)

Note: Equations (8.1), (8.2) and (8.3) are three different models; I’m not writing down one model where all three of these equations hold simultaneously! In reality none of these models are quite true, and when you fit them you will not get the same estimates for α, β1, β2, εi in all three models.

(1) Plot Fi vs Si. Do you notice anything about their relationship?

(2) Use the lm function to fit all three models and find p-values for the parameters α, β1, β2. (A sketch appears after this list.)



(3) When you fit model (8.1), are the p-values for β1, β2 below 0.05? What about the p-values when you fit models (8.2) and (8.3)? Comment on any discrepancy, perhaps in light of the plot in the first part of the question.

(4) In the first three parts of the question, I told you to plot Fi, Si together and then fit three models for Ai. In light of the plot from part (1) (and common-sense observations about the relationship between Ai, Fi, Si), suggest a modelling procedure that would have produced better results.
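A minimal sketch of parts (1) and (2), using the column names given above:

house = read.csv("House_Data.csv")

# Part (1): first floor area against second floor area.
plot(house$X1stFlrSF, house$X2ndFlrSF)

# Part (2): the full model (8.1) and the sub-models (8.2) and (8.3);
# the p-values are in the coefficient tables printed by summary().
summary(lm(LotArea ~ X1stFlrSF + X2ndFlrSF, data = house))
summary(lm(LotArea ~ X1stFlrSF, data = house))
summary(lm(LotArea ~ X2ndFlrSF, data = house))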

9. Simulation Study: Bootstrap

There are several closely-related algorithms called “the bootstrap.” We first describe the bootstrap that will be studied in this question.

We have some parametric model {fθ}θ∈Θ, data X1, . . . , Xn ∈ R, and an estimator θ̂ : R^n → Θ taking a dataset to a parameter value. A single bootstrap replicate is obtained by the following algorithm:

(1) Sample X*1, . . . , X*n uniformly and with replacement from the dataset X1, . . . , Xn.

(2) Return Θ* = θ̂(X*1, . . . , X*n).

Denote by θ0 ∈ Θ the “true” parameter, by F the true distribution of θ̂(X1, . . . , Xn), and by F_boot the distribution of Θ*.
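A minimal sketch of how one might draw samples from F and F_boot for part (1) below (θ̂ is the sample mean, n = 100, θ = 0; the number of replicates B is an arbitrary choice of mine):

n = 100; B = 5000

# Samples from F: theta-hat recomputed over many fresh datasets.
F_sample = replicate(B, mean(rnorm(n, 0, 1)))

# Samples from F_boot: one fixed dataset, resampled with replacement.
X = rnorm(n, 0, 1)
Fboot_sample = replicate(B, mean(sample(X, n, replace = TRUE)))

# Compare the two distributions visually.
plot(density(F_sample))
lines(density(Fboot_sample), lty = 2)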

(1) Consider the parametric model fθ = N(θ, 1) with the usual MLE θ̂ = (1/n) Σ_{i=1}^n Xi. Fix n = 100 and θ = 0. Do a simulation study and plot the densities of both F and F_boot. Do they look similar?6

(2) Consider the parametric model fθ = Unif[0, θ]. Show that the MLE is θ̂ = max(X1, . . . , Xn).

(3) Set n = 100 and θ = 1. Do a simulation study to plot the densities of both F and F_boot for the setup in part (2).7 Do they look similar? If not, comment on what would “go wrong” if you used F_boot as a surrogate for F.

Depending on how you plot initially, the densities might be “squished” and hard to see. Rescale your x- and y-axes until you get a good look, and comment in particular on the values in intervals that are very close to 1, such as [0.998, 1].

6 It is possible to compute F exactly, but computing F_boot is much harder. Rather than trying to do this, you should take many samples from F_boot using the bootstrap algorithm given at the start of the question, then plot the histogram or density plot of this sample.

7 As in the previous part of the question, you can estimate F and F_boot by simulating from the appropriate densities and plotting the empirical distributions of the resulting samples. However, it is not terribly difficult to compute both of these densities exactly using only tools from MAT2371/2377. If you can do these computations, you may find it easier to plot and compare them.

10. Logistic Regression Practice

Do Parts (a,b,c,d,e) of Question 10 from Chapter 4 of the textbook.

Comment on the quality of the result.


