Applied Stats and Data Analysis
EN.553.413-613, Spring 2024
Apr 3, 2024
Exam 2
Question 1 (22 points). Consider the linear model
Y = Xβ + ε,
where Y is an n-by-1 vector of response variables, X is an n-by-p design matrix, β is a p-by-1 vector of coefficients, and ε is multivariate normal, ε ∼ N_n(0, σ^2 I_n). Denote by b the least-squares estimate of β.
(a) TRUE or FALSE. Y ∼ N_n(Xb, σ^2 I).
(b) TRUE or FALSE. Cook’s distance measures the influence of one observation on all fitted values.
(c) TRUE or FALSE. The least-squares estimate of β can be expressed as (X^T X)^{-1} X^T Y.
(d) TRUE or FALSE. Deleted residuals are the same as Externally Studentized Residuals.
(e) TRUE or FALSE. In multiple linear regression, predictors can be both continuous and categorical.
(f) TRUE or FALSE. If X has full rank, then the matrix X^T X has rank n.
(g) TRUE or FALSE. SSTO = Y^T (I − (1/n) 1 1^T) Y, where 1 is an n-by-1 vector of 1’s.
(h) TRUE or FALSE. Internally studentized residuals are better than externally studentized residuals in determining outliers with respect to Y .
(i) TRUE or FALSE. Leverage can detect points that are outlying with respect to both X and Y .
(j) TRUE or FALSE. Outliers with large ESR (externally studentized residuals) are always influential.
(k) TRUE or FALSE. As we add more predictors, the coefficient of multiple determination R^2 may decrease.
Question 2 (16 points). In this problem we deal with the same multiple linear regression model as in Question 1.
(a) Write the least-squares objective function Q(β) that we need to minimize to fit a multiple linear regression.
(b) Write down the formula for the least-squares estimate for the regression vector b in terms of matrices X, Y. You don’t need to derive it.
(c) Using the formula above, find E(b) and Var(b). Show your work.
(d) This part can be solved independently of the other parts. Compute the covariance matrix Cov(e, Ŷ), where e is the vector of residuals and Ŷ is the vector of fitted values. Simplify as much as you can. What are the dimensions of the matrix Cov(e, Ŷ)?
Question 3 (16 points). Below is a sketch of the column space Col(X) of an n-by-2 design matrix X. The vector Y is an n-by-1 vector of values Y_i, 1 is an n-by-1 vector of 1’s, and X1 is an n-by-1 vector of predictor values X_i1.
The dotted line (marked by a) depicts orthogonal projection on Col(X), the dashed line (marked by d) depicts orthogonal projection on the vector of 1’s.
Let β̂ be the least-squares estimate of β.
(a) Express vectors a, b, c, d in terms of Y, X, β̂, Ȳ. Here, Ȳ denotes the n-by-1 vector whose entries all equal the sample mean of the Y_i.
(b) Express vectors a, b, c, d in terms of Y, H, I, J, where H is the hat matrix, I is an n-by-n identity matrix, J is an n-by-n matrix of 1’s.
(c) If we know the coefficient of multiple determination R^2 is very close to 1, what can you say about the vectors in this plot? Be as specific as you can.
(d) If we know that there is multicollinearity, what can you say about the vectors in this plot? Be as specific as you can.
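The two projections named in Question 3 can be illustrated numerically. The following is a minimal Python sketch on made-up data (the values of x and y are hypothetical and are not taken from the exam). It projects Y onto Col(X) by a simple least-squares fit and onto the vector of 1's by taking the mean, then checks that the residual vector is orthogonal to the regression vector, so the sums of squares add up.

```python
# Hypothetical toy data (not from the exam): n = 5 observations, one predictor.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit of y on (1, x): the fitted vector is the projection of Y onto Col(X).
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Projection of Y onto the vector of 1's: every entry equals the sample mean.
y_mean_vec = [y_bar] * n

resid = [yi - fi for yi, fi in zip(y, y_hat)]          # Y - Y_hat
reg = [fi - mi for fi, mi in zip(y_hat, y_mean_vec)]   # Y_hat - Y_bar
total = [yi - mi for yi, mi in zip(y, y_mean_vec)]     # Y - Y_bar

sse, ssr, ssto = dot(resid, resid), dot(reg, reg), dot(total, total)
print("resid . reg =", dot(resid, reg))   # zero up to floating-point error
print("SSTO vs SSR + SSE:", ssto, ssr + sse)
print("R^2 =", ssr / ssto)
```

Because the three vectors form a right triangle, R^2 = SSR/SSTO is the squared cosine of the angle between Y − Ȳ and Ŷ − Ȳ.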
Question 4 (20 points). Consider the following regression model:
Y_i = β_0 + β_1 X_1i + β_2 X_2i + ε_i,
where the ε_i are i.i.d. N(0, σ^2). The first few observations are listed below:
We use lm in R to fit a linear regression model, obtaining the output below:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.9427     XXXXXX   YYYYY   0.0044
x1            0.5319     0.2075   2.563   0.0374
x2            0.3204     0.1906   1.681   0.1366
---
Residual standard error: 0.9001 on 7 degrees of freedom
Multiple R-squared: 0.5584, Adjusted R-squared: 0.4322
F-statistic: 4.425 on 2 and 7 DF, p-value: 0.05724
(a) State the null and alternative hypotheses for the model utility test for this model. What is your conclusion at the significance level α = 0.10? Briefly explain.
(b) Is β1 significant at the significance level α = 0.10? Is β2 significant at the significance level α = 0.10? Briefly explain.
(c) Rewrite the model utility test from part (a) in the form of the General Linear Test:
H0 : Cβ = γ, Ha : Cβ ≠ γ.
Identify C and γ and state their dimensions.
(d) Find the rejection region for the test in part (c) with significance level α = 0.05. Leave your answer in the form v^T A^{-1} v > F(x, y, z). Provide numerical values for the vector v, the matrix A (you don’t need to invert it), and the parameters x, y, z of the quantile F(x, y, z).
(e) The standard error for the intercept β_0 is hidden by XXXXXX. Recover it using the data provided above.
(f) The plot below sketches the rejection regions for the tests in parts (a) and (b). The inside of the purple ellipse corresponds to the 90% confidence region of an F-test. Shade the areas where (0, 0) might lie in this plot.
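The summary lines of the lm output in Question 4 are tied together by standard identities. As a minimal Python sketch using only the numbers printed in that output, the block below recomputes the overall F statistic and the adjusted R-squared from the multiple R-squared and the degrees of freedom, and recomputes each visible t value as estimate divided by standard error. The variable names are illustrative, and the results should agree with the printed output only up to rounding, since the inputs are already rounded.

```python
# Values copied from the lm output in Question 4.
r2 = 0.5584
df_model, df_resid = 2, 7            # "on 2 and 7 DF"
est = {"x1": 0.5319, "x2": 0.3204}
se = {"x1": 0.2075, "x2": 0.1906}

# Overall F statistic from R^2: F = (R^2 / df_model) / ((1 - R^2) / df_resid).
f_stat = (r2 / df_model) / ((1 - r2) / df_resid)

# Adjusted R^2: 1 - (1 - R^2) * df_total / df_resid, where df_total = df_model + df_resid
# for a model with an intercept.
adj_r2 = 1 - (1 - r2) * (df_model + df_resid) / df_resid

# Each t value is the estimate divided by its standard error.
t_vals = {name: est[name] / se[name] for name in est}

print("F statistic:", round(f_stat, 3))
print("Adjusted R^2:", round(adj_r2, 4))
print("t values:", {k: round(v, 3) for k, v in t_vals.items()})
```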
Question 5 (20 points). Consider the grocery retailer example, where X1 denotes the number of cases shipped, X2 is the indirect cost of the total labor hours as a percentage (values are between 0 and 100), and X3 is 1 if the week has a holiday, and 0 otherwise. The observations Y are the total labor hours, and we choose to fit the multiple linear regression:
Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + β_3 X_i3 + β_4 X_i1 X_i3 + β_5 X_i2 X_i3 + ε_i (1)
where the ε_i are i.i.d. normal with mean 0 and variance σ^2. The lm output is shown:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.271e+03 2.319e+02 18.415 <2e-16 ***
x1 7.928e-04 4.180e-04 1.897 0.0642 .
x2 -2.990e+01 2.669e+01 XXXXXX 0.2684
x3 1.642e+02 4.486e+02 0.366 0.7161
x1:x3 -2.470e-04 8.813e-04 -0.280 0.7805
x2:x3 7.078e+01 5.487e+01 1.290 0.2036
---
Residual standard error: 143.8 on 46 degrees of freedom
Multiple R-squared: 0.6992, Adjusted R-squared: 0.6665
F-statistic: 21.39 on 5 and 46 DF, p-value: 5.393e-11
The first 5 rows of the dataset are the following:
Y X1 X2 X3
4264 300 7.17 0
4945 265 8.61 1
4496 328 6.20 0
4317 317 4.61 0
4292 366 7.02 0
(a) How many observations are in the dataset?
(b) Write only the first three rows of the design matrix X for this model. What are the dimensions of the full design matrix X for this problem?
(c) Do you notice any issues with this model? If you do, how would you fix the issue?
(d) Write the estimated mean of the total labor hours during a week that has a holiday. Express it as a function of all, or some, of X1, X2, X3.
(e) We would like to compare the models
g1 : Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + β_3 X_i3 + β_4 X_i1 X_i3 + β_5 X_i2 X_i3 + ε_i
g2 : Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + β_3 X_i3 + β_4 X_i1 X_i3 + ε_i
using the General Linear Test at the level α = 0.05. What is the value F* of the F-statistic of this test? You should be able to find it using the information given. Specify the numerical values c, d, e for the rejection region F* > F(c, d, e) for this test.
(f) Based on this output, find the interval that contains, with probability 90%, the total labor hours for a week without a holiday, with an indirect cost of 5% and 100 cases shipped. Leave your answer as A ± B · t(c, d). Specify numerical values for A, B, c, d. Identify as many values as you can. Leave values that you can’t identify as variables.
Question 6 (16 points). The table below contains residuals e_i, internally studentized residuals r_i, externally studentized residuals t_i, and the leverage h_ii for the model with p = 2 predictors and n = 10 observations.
Table 1: Values of h_ii, e_i, r_i, and t_i.
(a) Which points seem to be outliers with respect to Y? Which residual (e_i, r_i, t_i) would you use to identify them? Briefly justify.
(b) Which points seem to be outliers with respect to X? Briefly justify.
(c) Which points appear to be most influential? Briefly justify.
(d) What should be the decision rule to identify observations outlying with respect to Y at the significance level α = 0.10? Write it in the form
|A| ≥ t(B, C)
Identify A, B, C. Be as specific as you can.
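The quantities named in Question 6 (leverage h_ii, residuals e_i, internally studentized residuals r_i, externally studentized residuals t_i) and Cook's distance can all be computed from a fitted model. The following is a minimal Python sketch on made-up data for a simple linear regression. The data are hypothetical and are not the values from Table 1, and the sketch assumes that p counts the regression parameters (intercept and slope), which may differ from the convention intended in the question.

```python
import math

# Hypothetical toy data (not the exam's Table 1); the last x value lies far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 7.0]
n, p = len(x), 2  # assumption: p counts parameters (intercept and slope)

# Simple least-squares fit.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residuals and leverages (diagonal of the hat matrix for simple regression).
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
sse = sum(ei ** 2 for ei in e)
mse = sse / (n - p)

# Internally studentized residuals: r_i = e_i / sqrt(MSE * (1 - h_ii)).
r = [ei / math.sqrt(mse * (1 - hi)) for ei, hi in zip(e, h)]

# Externally studentized residuals:
# t_i = e_i * sqrt((n - p - 1) / (SSE * (1 - h_ii) - e_i^2)).
t = [ei * math.sqrt((n - p - 1) / (sse * (1 - hi) - ei ** 2)) for ei, hi in zip(e, h)]

# Cook's distance: D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2).
cook = [ei ** 2 * hi / (p * mse * (1 - hi) ** 2) for ei, hi in zip(e, h)]

print("i  h_ii  e_i  r_i  t_i  D_i")
for i in range(n):
    print(i + 1, round(h[i], 3), round(e[i], 3), round(r[i], 3), round(t[i], 3), round(cook[i], 3))
```

In this toy data set the last observation has the largest leverage because its x value is far from the others, while its raw residual is of ordinary size. This is why the studentized versions, which rescale by 1 − h_ii, are usually examined alongside e_i.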