Name:
“R-LASTNAME-FIRSTNAME.Rmd” and submit both the R Markdown file and the HTML rendering to me via email. And don’t forget to put your name above within the document. It is due by 11:59pm on Sunday December 15, 2019. Do your own work.
Everything to be done in the base R installation. Do not import any libraries.
A Learning Exercise
We will consider basic notions of statistical learning related to model selection and the bias-variance trade-off. Our general setting is that we observe a quantitative response YY and pp different predictors, X1,X2,…,XpX1,X2,…,Xp. We assume there is some true relationship between YY and X=(X1,X2,…,Xp)X=(X1,X2,…,Xp) described by a fixed but unknown function ff:
Y=f(X)+ϵ.
Y=f(X)+ϵ.
Here, ϵϵ is a random error term that is independent of XX and has mean zero. A familiar example might be linear regression, in which f(X)=a+∑pi=1bixif(X)=a+∑i=1pbixi.
In this context, a learning algorithm is a method for finding an estimate fˆf^ of ff. A typical application is prediction, in which a set of inputs XX is available but the output YY is not available. Because the error term has mean zero, our estimate fˆf^ of ff can be used to predict YY with YˆY^, where
Yˆ=fˆ(X).
Y^=f^(X).
The error of YˆY^ as a prediction for YY can be assessed using a quantity known as the mean squared error (MSE). Consider a given estimate fˆf^ and a fixed set of predictors XX, which together yields a prediction Yˆ=fˆ(X)Y^=f^(X). Then,
E(Y−Yˆ)2=E[f(X)+ϵ−fˆ(X)]2=[f(X)−fˆ(X)]2reducible+Var(ϵ)irreducible
E(Y−Y^)2=E[f(X)+ϵ−f^(X)]2=[f(X)−f^(X)]2⏟reducible+Var(ϵ)⏟irreducible
The first term is called reducible because we can potentially improve the accuracy of fˆf^ by using a more appropriate learning algorithm. After all, if f^=ff^=f, then the reducible error is zero. But even if we had the correct function ff, our predictions Yˆ=f(X)Y^=f(X) would still have error because of ϵϵ. The error involved with ϵϵ is called irreducible because we cannot reduce or eliminate it no matter how well we estimate ff.
The training data is the set of {(yi,xi)}{(yi,xi)} used to fit the model, i.e. come up with the estimate fˆf^. The MSE on the training data can be used to fit the model (some algorithms explicitly minimize MSE) or to assess its fit. But a low MSE on the set of training data is not necessarily what we are interested in. Rather, in a prediction context, we are interested in the accuracy of the predictions that we obtain when we apply our algorithm to test data, which is data assumed to be generated the same way as the training data but that was not used to train the model. We want to choose a method that gives the lowest test MSE, as opposed to the lowest training MSE.
If no test observations are available, then one might simply select the method that minimizes the training MSE. Unfortunately, there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE. Many statistical methods specifically estimate coefficients to minimize the training set MSE. For these methods, the training set MSE could be quite small (as a result of overfitting, i.e. fitting the errors too well), while the test MSE could be much bigger. This is what we will explore in this exercise.
Exercise 1
In our example the true function ff is nonlinear and is given by
f(x)=5−0.002x2+0.2x−7sin(πx/50).
f(x)=5−0.002x2+0.2x−7sin(πx/50).
The noise ϵϵ is modeled as a zero-mean normal random variable with standard deviation σσ.
Write an R function G that takes two arguments, an arbitrary vector x and a numeric keyword argument eps_sd that is by default set to 4, and returns a vector with elements given by f(xi)+ϵif(xi)+ϵi, where xixi is an element of x and ϵiϵi is normally distributed with mean zero and standard deviation equal to eps_sd. The ϵ=(ϵi)ϵ=(ϵi) should be independent.
# your answer here
Use your definition of G to graph ff for values of xx between 1 and 100, inclusive. (Hint: set eps_sd=0 to remove ϵϵ and isolate ff.) Please graph as a curve/line, not as a scatterplot. The graph you find is the true ff that we will be trying to estimate with f^f^ using three learning methods. In the real world, of course, you don’t know ff!
# your answer here
Exercise 2
Next, we will train three simple learning models on our data and plot their fits. First, we will fit a linear model, and then we will fit two smoothing splines with different smoothing parameters. You can read about smoothing splines here and here. Note that linear regression is a special case of a smoothing spline.
First we form the training set (ytr,xtr)(ytr,xtr) that we will use to estimate the models. For reproducibility, set the seed using set.seed(1234). Next, use the sample() function to randomly sample N=70N=70 integers between 1 and 100 (without replacement); put this into the vector x_tr. Then set y_tr to be G applied to x_tr (keep the default eps_sd=4). Finally, create a scatter plot of x_tr and y_tr. Note that it’s not immediately obvious that the data are generated from a nonlinear model, though you of course know that it is.
# your answer here
Fit a linear model to the training data using lm(), storing the result in lmfit. Create a scatterplot of the training data (again), and use abline(lmfit, ...) to add the estimated regression line of best fit to the plot. Make the line orange and its line width parameter equal to 2. Comment on the fit of the linear model. Are the errors approximately homoskedastic?
# your answer here
Next, we will estimate a smoothing spline with a high number of degrees of freedom df. The degrees of freedom df and the smoothing parameter λλ from the Wikipedia article are inversely related (the exact relationship is not important for our purposes). For example, a linear model corresponds to a smoothing spline with degrees of freedom df=2 and a smoothing parameter λ=∞λ=∞ (there is an infinite penalty to any curvature, forcing a line). Here, we will compare the line from a linear regression to a smoothing spline with df=20, ten times the degrees of freedom of the line. Use smooth.spline() to fit a smoothing spline with degrees of freedom df=20, storing the result in ss20. (Recall that one can type ?smooth.spline to get the help documentation.) Create a scatterplot of the training data (again), and use lines(ss20, ...) to add the estimated fitted spline to the plot. Make the fitted curve blue and its line width parameter equal to 2. Comment on the fit of the smoothing spline with df=20. Are the errors approximately homoskedastic?
# your answer here
Finally, we will fit a smoothing spline where the degrees of freedom (equivalently, the smoothing parameter λλ) is chosen using leave-one-out cross validation. Cross validation is an essential technique in statistical learning. In R, leave-one-out cross validation can be used in a smoothing spline by setting cv=TRUE. Estimate such a smoothing spline, storing the result in sscv. Create a scatterplot of the training data (again), and use lines(sscv, ...) to add the estimated fitted spline to the plot. Make the fitted curve red and its line width parameter equal to 2. Print the degrees of freedom associated with the spline by accessing one of the list elements in sscv. Comment on the fit of the smoothing spline with this lower number of degrees of freedom. Are the errors approximately homoskedastic?
# your answer here
On the same set of axes, plot the true ff from 1(B) (in black), the estimated regression line (in orange) from 2(B), and the two smoothing splines from (C) and (D) in blue and red, respectively. Set the line width parameter to 2 for all the lines/curves. Ensure that all lines/curves are visible by adjusting the axes limits if necessary. Based on this graph, discuss your views about which of the three learning methods we’ve tried produce the best estimate f^f^ of ff.
# your answer here
Exercise 3
Next we compute the training mean squared error of our 3 models. Recall that if yiyi denotes a yy-value from our training set and y^iy^i denotes the corresponding predicted value from a learning model, then the training MSE is given by
MSEtr=1N∑i=1N(yi−y^i)2.
MSEtr=1N∑i=1N(yi−y^i)2.
The predicted/fitted values for the linear model lmfit can be found in the vector lmfit$fitted.values. Compute the training MSE for the estimated linear model lmfit.
# your answer here
The predicted values for a smoothing spline s can be found in the vector predict(s$fit, x_tr)$y. Compute the training MSE for the estimated smoothing splines ss20 and sscv.
# your answer here
How do your MSE measures calculated in (A) and (B) compare? If you were selecting a model based solely on the training MSE, then which would you select? What is the irreducible error in this model, i.e. what is Var(ϵ)Var(ϵ)? Is the training MSE for some models below the irreducible error? If so, what do you make of this?
your answer here
Exercise 4
Now we take advantage of the fact that we know the true data-generating process, i.e. the true ff and the true Var(ϵ)Var(ϵ), to compute the true test MSE as we vary the degrees of freedom of the smoothing splines and compare it to the associated training MSE. Minimizing the test MSE over the degrees of freedom represents the optimal model within the class of smoothing splines.
Given that the smoothing splines are non-parametric in nature, we cannot derive the test MSE analytically despite knowing the data generating process. Instead, we will essentially perform a Monte Carlo analysis to achieve precise results. To do this we create a very large test sample.
Create x_te as a vector of size 10,000 of random sample of integers (with replacement) between 1 and 100. Then let y_te be G applied to x_te (keep esp_sd at its default value). The vectors x_te and y_te comprise the test sample.
# your answer here
Next, we will create two scatterplots next to each other. Save the default plot parameters in par() as opar and create a 1x2-grid using mfrow(). On the left panel, create a normal scatterplot of the test data with plot(x,y), and on the right panel use smoothScatter() to create a smooth scatterplot of the test data The smooth scatterplot colors areas based on the density of points, with darker colors corresponding to higher density. This helps us see that around each point we have a normal distribution with standard deviation of 2. After you create your plots, be sure to set the paramters par() back to opar.
# your answer here
Next we will compute the training MSE (on our training sample x_tr and y_tr) and test MSE (on the test sample x_te and y_te) for a sequence of smoothing splines with degrees of freedom df ranging from 2 to 25. We will use a for loop. First, use seq() to define a vector dof starting at 2 and ending at 25 in increments of 0.5. Instantiate mse_tr and mse_te as empty vectors to store our the training and test MSEs, respectively. Then create a for loop in which, for each degrees of freedom df iterate, you (1) estimate a smoothing spline with df degrees of freedom using the training data set, and (2) compute the training and test MSEs, storing them in mse_tr and mse_te. In the end, you will have populated the three vectors dof, mse_tr and mse_te.
# your answer here
On the same set of axes, plot the training and test MSEs from mse_tr and mse_te as a function of the degrees of freedom dof. Plot them as curves/lines (not scatterplots) with line width parameters equal to 2, the training MSE in grey, and test MSE in red. Use abline() to plot a dashed horizontal line corresponding to the irreducible error, the variance of ϵϵ. Use points() to plot the (x,y)(x,y)-coordinates for the degrees of freedom and training and test MSEs corresponding to the three learning models in Exercise 2: linear regression (df=2) in orange, a smoothing spline with 20 degrees of freedom in blue, and a smoothing spline estimated using cross validation in red (the degrees of freedom are also estimated, accessible in sscv$df; you might have to round to the nearest value in dof). Make the points filled-in circles by adjusting the parameters as appropriate. Ensure that all points and curves are visible by adjusting the axes limits if necessary.
# your answer here
Recall that the objective of the learning problem is to minimize the test MSE. Which method should you select based on this criterion? Which method would you select if your objective was to minimize the training MSE? Comment on the value of cross validation in this setting.
your answer here
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。