
Stat5401 Midterm Exam II - Due April 8th, 2020

Exam Instructions:

• There are 3 questions in total (25 points). Please check that you have answered all of the questions.
• Please attach all R code in your solution. You can use an R notebook to organize your results.
• Please organize your answers in a single PDF file and submit it through Canvas.
• The exam is due at 11:59 pm on April 8, 2020 (CDT). Please try to submit your work at least a few minutes before the deadline to avoid delays due to technical issues.
• This is an open-book exam. You are allowed to use your notes, the textbook, R help files, academic papers, online tutorials, etc.
• You are NOT allowed to discuss the exam with anyone else.
• General questions about the exam should be asked on the Canvas discussion board 'Midterm 2 related questions and clarifications'. Please ask questions as early as possible; I may not be able to answer last-minute questions.
• Other questions should be directed to me at lixx1766@umn.edu. If you write emails to me, please use your university email, i.e., one ending with @umn.edu. You can also send email directly through Canvas.
• All answers must be written in your own words.
• Updates are colored in red.


Question 1 (5 points)

Consider the following underlying linear regression model

    yi = β1·Xi1 + β2·Xi2 + εi,

with the standard assumptions that E(εi) = 0, Var(εi) = σ², and Cov(εi, εj) = 0 for i ≠ j. Note that we don't include the intercept in this question.

Suppose that you observe.

Answer the following questions using math, i.e., by hand:

(a) (1 point) Suppose that we fit the model

    yi = β1·Xi1 + β2·Xi2 + εi.

Write down the design matrix X with the given X1 and X2.

(b) (2 points) Fit the model yi = β1·Xi1 + εi, and let β̂1 be the least squares estimator for β1.
  • Derive the mean and variance of β̂1.
  • Is β̂1 an unbiased estimator for β1?

(c) (2 points) Suppose now that the true underlying model is

    yi = β1·Xi1 + β2·Xi3 + εi,

and you observe.

Suppose that you fit the model

    yi = β1·Xi1 + εi

to estimate β1.
  • Derive the mean and variance of the least squares estimator for β1.
  • Is it an unbiased estimator?
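As a reminder of the machinery parts (b) and (c) rely on (a general sketch only, not a solution for the specific data, which are not reproduced in this copy), the no-intercept single-predictor least squares estimator has a closed form:

```latex
% Fitting y_i = \beta_1 X_{i1} + \varepsilon_i by least squares gives
\hat{\beta}_1 \;=\; \frac{\sum_{i=1}^n X_{i1}\, y_i}{\sum_{i=1}^n X_{i1}^2}.
% If the true model contains a second predictor, say
% y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i, then
E\!\left(\hat{\beta}_1\right) \;=\; \beta_1 + \beta_2\,\frac{\sum_i X_{i1} X_{i2}}{\sum_i X_{i1}^2},
\qquad
\operatorname{Var}\!\left(\hat{\beta}_1\right) \;=\; \frac{\sigma^2}{\sum_i X_{i1}^2},
% so \hat{\beta}_1 is unbiased only when \sum_i X_{i1} X_{i2} = 0
% (the predictors are orthogonal) or \beta_2 = 0.
```

The same computation applies in part (c) with Xi3 in place of Xi2; the conclusion depends on whether the omitted predictor is orthogonal to Xi1 in the observed data.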


Question 2 (Total 8 points)

Download the data Q2.csv from Canvas. This is a simulated dataset motivated by an example in the book Machine Learning with R by Brett Lantz.

The response variable in the dataset is

• charges = medical cost (in dollars) billed by the insurance company

The 6 covariates are

• gender = gender of the primary beneficiary, 'f' if female, 'm' if male.
• age = age of the primary beneficiary.
• bmi = body mass index.
• smoker = smoker or not, 'yes' for smoker, 'no' for non-smoker.
• children = number of dependents, treated as a continuous/numeric covariate in this problem.
• region = residential area in the US.

(a) Build a linear model by regressing charges on all 6 covariates. Answer the following questions.

  (i) Which effects are significant at α = 0.05, and what is the direction of the effects? Is there a relationship between age and charges?
  (ii) Find a 95% confidence interval for the linear coefficient of bmi.
  (iii) What are the R² and adjusted R²?

(b) Build a reduced model by regressing charges on age, bmi and smoker. Compare this model with the full model fitted in part (a) using an F test. According to the F test, does the model in part (a) fit the data significantly better than that in part (b)?
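The R calls needed for parts (a) and (b) can be sketched as follows. Q2.csv itself is not reproduced here, so a small simulated stand-in with the same column names is used; the lm/confint/anova call pattern is what carries over, not the numbers.

```r
# Hypothetical stand-in for Q2.csv (same column names, made-up effects)
set.seed(1)
n <- 100
Q2 <- data.frame(
  gender   = sample(c("f", "m"), n, replace = TRUE),
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 5),
  smoker   = sample(c("yes", "no"), n, replace = TRUE),
  children = sample(0:4, n, replace = TRUE),
  region   = sample(c("ne", "nw", "se", "sw"), n, replace = TRUE)
)
Q2$charges <- 250 * Q2$age + 300 * Q2$bmi +
  20000 * (Q2$smoker == "yes") + rnorm(n, sd = 3000)

full <- lm(charges ~ ., data = Q2)      # part (a): all 6 covariates
summary(full)                           # (a)(i)/(iii): p-values, R^2, adjusted R^2
confint(full, "bmi", level = 0.95)      # (a)(ii): 95% CI for the bmi coefficient

reduced <- lm(charges ~ age + bmi + smoker, data = Q2)  # part (b)
anova(reduced, full)                    # F test comparing the nested models
```

Note that lm treats the character columns (gender, smoker, region) as factors automatically, so the full model has 8 slope coefficients (region contributes 3 dummies).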

Question 3 (Total 12 points)

In class, we mentioned that there are many variable selection methods available. In this example, we study additional performance metrics and use simulation to verify their effectiveness. We will also study and try stepwise variable selection for multiple linear regression.

(a) We have learned the adjusted R² as a metric for model fit. In this question, we compare different models using the adjusted R².

Suppose the predictor is generated by

    set.seed(2020)
    n = 200
    x = rnorm(n)


Remark: To make our results comparable, please use set.seed(2020) when generating x.

  (i) Suppose the underlying model is generated by

        eps = rnorm(n)
        y = x + x^2 + x^3 + eps

      What is the underlying model? How many covariates are there in the underlying model? Please specify the covariates and the true linear coefficients.

  (ii) Fit 6 different models: yi = β0 + Σ_{j=1}^p βj·Xij + εi for p = 1, 2, 3, 4, 5, 6, where Xij = x_i^j. These models are polynomials of different orders. Calculate the adjusted R² for each of them, and draw a plot showing the adjusted R² (x-axis: p; y-axis: adjusted R²). Does the correct model have the largest adjusted R²?

      (Hint: You can first create a data matrix

        X = cbind(x, x^2, x^3, x^4, x^5, x^6)

      and then use a for loop to run the 6 regression models, in order to simplify the code. Also, try summary(model)$adj.r.squared to extract the adjusted R² for a fitted model.)
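The hinted loop might look like the following sketch (reusing the data-generating code given above; the plot call mirrors the requested axes):

```r
set.seed(2020)            # as required, so results are comparable
n <- 200
x <- rnorm(n)
eps <- rnorm(n)
y <- x + x^2 + x^3 + eps  # true model: cubic in x

X <- cbind(x, x^2, x^3, x^4, x^5, x^6)
adj_r2 <- numeric(6)
for (p in 1:6) {
  fit <- lm(y ~ X[, 1:p, drop = FALSE])   # polynomial of order p
  adj_r2[p] <- summary(fit)$adj.r.squared
}
plot(1:6, adj_r2, type = "b", xlab = "p", ylab = "adjusted R^2")
```

The drop = FALSE keeps X[, 1:p] a matrix even when p = 1, so the same lm call works for every p.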

  (iii) Instead of using the adjusted R², there are other performance criteria. Here, we consider the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Read the document at the link https://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html.

      Alternatively, you can also read pages 385-386 of the textbook on 'selecting predictor variables from a large set' and page 705 for the definition of AIC and BIC.

      For this question, write down the definitions of AIC and BIC in terms of the Residual Sum of Squares (RSS), n and p.

  (iv) In R, AIC and BIC can be computed using the functions AIC and BIC, respectively. Repeat the comparison in part (ii) with the adjusted R² replaced by AIC and BIC, and plot the results. Does the correct model have the smallest AIC and BIC?
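A minimal sketch of the AIC/BIC version, assuming the same simulated data as above:

```r
set.seed(2020)
n <- 200
x <- rnorm(n)
eps <- rnorm(n)
y <- x + x^2 + x^3 + eps
X <- cbind(x, x^2, x^3, x^4, x^5, x^6)

aic <- numeric(6)
bic <- numeric(6)
for (p in 1:6) {
  fit <- lm(y ~ X[, 1:p, drop = FALSE])
  aic[p] <- AIC(fit)   # built-in AIC for a fitted lm object
  bic[p] <- BIC(fit)   # built-in BIC for a fitted lm object
}
plot(1:6, aic, type = "b", xlab = "p", ylab = "AIC")
plot(1:6, bic, type = "b", xlab = "p", ylab = "BIC")
```

Unlike the adjusted R², smaller values of AIC and BIC indicate a better trade-off between fit and model size.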

  (v) Repeat the simulation in (i), (ii) and (iv) 100 times. You will need to keep the same x while generating a new random eps each time. In each repetition, use the adjusted R² (the largest one) and the AIC and BIC (the smallest one) to select the model, and record the model selected (i.e., record the selected p).

      Report the frequency with which the adjusted R², AIC and BIC correctly select the true model among the 100 simulations. For this problem, which metric selects the model best? For the other metrics, do they tend to select more covariates or fewer covariates than the correct one?


(b) For multiple linear regression, stepwise selection (including forward search, backward search, and search in both directions) is commonly used for model selection. Read Section 16.2 of the document https://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html and answer the following questions.

  (i) Use about two or three sentences to describe stepwise variable selection methods.

  (ii) Download the data Q3.csv and regress Y on X1 - X20. (Hint: try lm(Y ~ ., data = Q3).) Use the function step to select variables (use the default arguments, without changing arguments like k or direction). Report the selected variables.
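Since Q3.csv is not reproduced here, the following sketch illustrates the step call on a small simulated stand-in with a made-up signal in X1; the call pattern is what carries over to the real data.

```r
set.seed(1)
n <- 50
# Hypothetical stand-in for Q3.csv: Y depends on X1 only
Q3 <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
Q3$Y <- 2 * Q3$X1 + rnorm(n)

full <- lm(Y ~ ., data = Q3)   # regress Y on all other columns
sel <- step(full, trace = 0)   # default AIC-based stepwise selection
                               # (trace = 0 only silences the printed search path)
names(coef(sel))               # the selected variables
```

With the default arguments, step starts from the supplied model and uses AIC (k = 2) to decide which term to drop or add at each step.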


