Statistics 512: Problem Set 4
Due Thursday, November 1, 2018 11:59 PM
Important Note: Every graph or plot you create should have your name printed as a subtitle.
Any graph without a name will incur a 20% point deduction. Also, please attach your
code at the end; any homework submitted without code will incur a 50% point deduction on the entire
assignment.
1. Based on the following small data set, construct the design matrix, $X$, its transpose $X'$, and
the matrices $X'X$, $(X'X)^{-1}$, $X'Y$, and $b = (X'X)^{-1}X'Y$.
X Y
0 1
1 4
2 7
3 9
4 10
5 8
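The matrix quantities named in Problem 1 can be sketched in R as follows. This is a minimal illustration, assuming the usual simple linear regression design matrix (a column of ones followed by the X values); the variable names are illustrative and not part of the assignment.

```r
# Data from the table in Problem 1
x <- c(0, 1, 2, 3, 4, 5)
y <- c(1, 4, 7, 9, 10, 8)

# Design matrix: a column of ones (intercept) next to the predictor
X <- cbind(Intercept = 1, X = x)

Xt      <- t(X)            # transpose X'
XtX     <- Xt %*% X        # X'X
XtX_inv <- solve(XtX)      # (X'X)^{-1}
XtY     <- Xt %*% y        # X'Y
b       <- XtX_inv %*% XtY # b = (X'X)^{-1} X'Y

# Cross-check: lm() should report the same coefficients
fit <- lm(y ~ x)
print(b)
print(coef(fit))
```

Comparing `b` with `coef(fit)` is a quick way to confirm that the hand-built matrices are right.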
For the following 5 problems, consider the data given in the file CH06PR18.DAT,
which describes a data set (n = 24) used to evaluate the relation between intermediate
and senior level annual salaries of bachelor’s and master’s level mathematicians
(Y, in thousands of dollars) and an index of work quality (X1), number of
years of experience (X2), and an index of publication success (X3).
2. Run the multiple linear regression with quality, experience, and publications as the explanatory
variables and salary as the response variable. Summarize the regression results by giving
the fitted regression equation, the value of $R^2$, and the results of the significance test for the
null hypothesis that the three regression coefficients for the explanatory variables are all zero
(give null and alternative hypotheses, test statistic with degrees of freedom, p-value, and brief
conclusion in words).
3. Give 85% confidence intervals for regression coefficients of quality, experience, and publications
based on the multiple regression. Describe the results of the hypothesis tests for the
individual regression coefficients (give null and alternative hypotheses, test statistic with degrees
of freedom, p-value, and a brief conclusion in words). What is the relationship between
these results and the confidence intervals?
4. Plot the residuals versus the predicted salary and each of the explanatory variables (i.e., 4
residual plots). Are there any unusual patterns?
5. Examine the assumption of normality for the residuals using a qqplot and histogram. State
your conclusions.
6. Predict the salary for a mathematician with quality index equal to 5.2, 15 years of experience,
and publication index equal to 6.5. Provide an 85% prediction interval with your prediction.
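The fitting, interval, and prediction steps in Problems 2, 3, and 6 might look roughly like the sketch below. Because CH06PR18.DAT is not reproduced here, the code uses simulated stand-in data; the column names `quality`, `experience`, `publications`, and `salary`, and every number used to generate the data, are assumptions made for illustration only.

```r
set.seed(512)
n <- 24

# Simulated stand-in for CH06PR18.DAT (made-up values, not the real data)
sim <- data.frame(quality      = runif(n, 3, 8),
                  experience   = runif(n, 5, 45),
                  publications = runif(n, 2, 8))
sim$salary <- 18 + 1.1 * sim$quality + 0.3 * sim$experience +
  1.3 * sim$publications + rnorm(n, sd = 1.5)

fit <- lm(salary ~ quality + experience + publications, data = sim)

# Coefficient table, R^2, and the overall F test
print(summary(fit))

# 85% confidence intervals for the regression coefficients
ci85 <- confint(fit, level = 0.85)
print(ci85)

# 85% prediction interval for one new observation
new_obs <- data.frame(quality = 5.2, experience = 15, publications = 6.5)
pi85 <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.85)
print(pi85)
```

With the real file, the data frame would come from something like `read.table()`, with column names set to match the file's layout.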
For the following problems use the computer science data that we have been discussing
in class. You can get a copy of the data set csdata.dat from the class website.
The variables are: id, a numerical identifier for each student; GPA, the grade point
average after three semesters; HSM; HSS; HSE; SATM; SATV, which were all explained in
class; and GENDER, coded as 1 for men and 2 for women.
7. In this exercise you will illustrate some of the ideas related to the extra sums of squares.
(a) Create a new variable called SAT which equals SATM + SATV and run the following two
regressions:
i. predict GPA using HSM, HSS, and HSE;
ii. predict GPA using SAT, HSM, HSS, and HSE.
Calculate the extra sum of squares for the comparison of these two analyses. Use it to
construct the F-statistic – in other words, the general linear test statistic – for testing
the null hypothesis that the coefficient of the SAT variable is zero in the model with all
four predictors. What are the degrees of freedom for this test statistic?
(b) Compare the test statistic and p-value from the general linear test in part (a) with the individual t-test
for the coefficient of the SAT variable in the full model. Explain the relationship.
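The extra sum of squares and the general linear test statistic in Problem 7 can be computed along the lines of the sketch below. It uses simulated stand-in data rather than csdata.dat, so every number is made up; only the variable names follow the problem statement.

```r
set.seed(1)
n <- 100

# Simulated stand-in for csdata.dat (made-up values)
d <- data.frame(HSM = rnorm(n, 8, 1.5), HSS = rnorm(n, 8, 1.5),
                HSE = rnorm(n, 8, 1.5), SAT = rnorm(n, 1100, 150))
d$GPA <- 0.5 + 0.2 * d$HSM + 0.05 * d$HSS + 0.05 * d$HSE +
  0.0005 * d$SAT + rnorm(n, sd = 0.7)

reduced <- lm(GPA ~ HSM + HSS + HSE, data = d)
full    <- lm(GPA ~ SAT + HSM + HSS + HSE, data = d)

# Error sums of squares and error degrees of freedom for both models
sse_r <- sum(resid(reduced)^2); df_r <- df.residual(reduced)
sse_f <- sum(resid(full)^2);    df_f <- df.residual(full)

# Extra SS and the general linear test statistic
extra_ss <- sse_r - sse_f
f_stat   <- (extra_ss / (df_r - df_f)) / (sse_f / df_f)
p_val    <- pf(f_stat, df_r - df_f, df_f, lower.tail = FALSE)

# Individual t statistic for SAT in the full model, for comparison
t_stat <- summary(full)$coefficients["SAT", "t value"]
print(c(F = f_stat, t = t_stat, p = p_val))

# anova() on the nested pair reports the same F test
print(anova(reduced, full))
```

The hand computation and `anova(reduced, full)` should agree, which is a useful check on the arithmetic.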
8. Run the regression to predict GPA using SATM, SATV, HSM, HSE, and HSS. Put the variables in
the order given above and calculate the Type I and Type II SS using R.
(a) Add the Type I sums of squares for the five predictor variables. Do the same for the
Type II sums of squares. Does either of these sum to the model sum of squares? Are there
any predictors for which the two sums of squares (Type I and Type II) are the same?
Explain why.
(b) Verify (by running additional regressions and doing some arithmetic with the results)
that the Type I sum of squares for the variable SATV is the difference in the model sum
of squares (or error sum of squares) for the following two analyses:
i. predict GPA using SATM, SATV;
ii. predict GPA using SATM.
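One possible way to obtain the two kinds of sums of squares in base R is sketched below. It assumes simulated stand-in data (made-up values) and uses `anova()` for the sequential (Type I) sums of squares and `drop1()` for each term adjusted for all the others. For a model without interactions the latter matches the usual Type II definition, though the course may expect a different function.

```r
set.seed(2)
n <- 120

# Simulated stand-in for csdata.dat (made-up values)
d <- data.frame(SATM = rnorm(n, 600, 80), SATV = rnorm(n, 500, 90),
                HSM = rnorm(n, 8, 1.5), HSE = rnorm(n, 8, 1.5),
                HSS = rnorm(n, 8, 1.5))
d$GPA <- 0.3 + 0.001 * d$SATM + 0.0005 * d$SATV + 0.15 * d$HSM +
  0.05 * d$HSE + 0.03 * d$HSS + rnorm(n, sd = 0.7)

fit <- lm(GPA ~ SATM + SATV + HSM + HSE + HSS, data = d)

# Type I (sequential): each term is added in the order of the formula
type1 <- anova(fit)
print(type1)

# Each term given all the others: refit with one term dropped at a time
type2 <- drop1(fit, test = "F")
print(type2)

# Sequential SS for SATV as a difference in error sums of squares
sse <- function(m) sum(resid(m)^2)
ss_satv <- sse(lm(GPA ~ SATM, data = d)) - sse(lm(GPA ~ SATM + SATV, data = d))
print(c(by_difference = ss_satv, from_anova = type1["SATV", "Sum Sq"]))
```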
9. Create an additional variable called HS that is the sum of the three high school scores (HSE +
HSS + HSM). Run the regression to predict GPA using a variety of variables, including HS and
SAT, as described below. Summarize the results by making a table giving the percentage of
variation explained ($R^2$) by each of the following models:
(a) SATM as the explanatory variable
(b) SATV as the explanatory variable
(c) HSM as the explanatory variable
(d) HSS as the explanatory variable
(e) HSE as the explanatory variable
(f) SATM and SATV as the explanatory variables
(g) SAT (=SATM+SATV) as the explanatory variable
(h) HSM, HSS, and HSE as the explanatory variables
(i) HS (=HSM+HSS+HSE) as the explanatory variable
(j) SATM, SATV, HSM, HSS, and HSE as the explanatory variables
(k) SAT and HS as the explanatory variables
(Please do not include the R output for all these models. Only the $R^2$ value is needed. Note
that you can copy and paste lm calls in R to save typing.)
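Rather than copying each call by hand, the $R^2$ table for Problem 9 could also be built with a short loop, as in this sketch. The data are simulated stand-ins (made-up values), and only a few of the eleven models are listed to keep the example short.

```r
set.seed(3)
n <- 120

# Simulated stand-in for csdata.dat (made-up values)
d <- data.frame(SATM = rnorm(n, 600, 80), SATV = rnorm(n, 500, 90),
                HSM = rnorm(n, 8, 1.5), HSS = rnorm(n, 8, 1.5),
                HSE = rnorm(n, 8, 1.5))
d$GPA <- 0.3 + 0.001 * d$SATM + 0.0005 * d$SATV + 0.15 * d$HSM +
  0.03 * d$HSS + 0.05 * d$HSE + rnorm(n, sd = 0.7)

# Derived variables described in the problem
d$SAT <- d$SATM + d$SATV
d$HS  <- d$HSE + d$HSS + d$HSM

# Right-hand sides of the models to compare
rhs <- c("SATM", "SATV", "SATM + SATV", "SAT",
         "HSM + HSS + HSE", "HS", "SAT + HS")

# Fit each model and keep only its R^2
r2 <- sapply(rhs, function(terms) {
  summary(lm(as.formula(paste("GPA ~", terms)), data = d))$r.squared
})

r2_table <- data.frame(model = rhs, R2_percent = round(100 * r2, 1),
                       row.names = NULL)
print(r2_table)
```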
10. A data set contains 50 observations. There are 4 explanatory variables: A, B, C, and D. Use
the following results:
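$$
\begin{pmatrix}
1.3972 & 1.8892 \times 10^{-3} & 3.6060 \times 10^{-3} & 1.3523 \times 10^{-3} & -2.9728 \times 10^{-2} \\
1.8892 \times 10^{-3} & 5.0363 \times 10^{-5} & 6.8773 \times 10^{-6} & 3.9875 \times 10^{-6} & 5.0387 \times 10^{-6} \\
3.6060 \times 10^{-3} & 6.8773 \times 10^{-6} & 4.9685 \times 10^{-5} & 8.6113 \times 10^{-6} & 5.4578 \times 10^{-5} \\
1.3523 \times 10^{-3} & 3.9857 \times 10^{-6} & 8.6113 \times 10^{-6} & 4.7933 \times 10^{-5} & 1.3931 \times 10^{-5} \\
-2.9728 \times 10^{-2} & 5.0387 \times 10^{-6} & 5.4578 \times 10^{-5} & 1.3931 \times 10^{-5} & 8.0975 \times 10^{-4}
\end{pmatrix}
$$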
469.7658
2.4148
3.3341
4.3285
MSE = 1963.48714
(a) Obtain an 85% confidence interval for $\beta_1$ (the coefficient for A).
(b) You wish to test $H_0: \beta_4 = 0$ vs. $H_a: \beta_4 \neq 0$. That is, you wish to determine if variable
D provides significant explanatory power for Y when variables A, B, and C are already in the model.
Obtain the test statistic for this hypothesis test and determine if you would accept or
reject the null hypothesis (α = 0.05). You should give either a critical value or a p-value
to support your conclusion.
(c) Obtain an 85% confidence interval for the mean (expected) response when A = 40, B = 20,
C = 50, and D = 30.
(d) Obtain an 85% prediction interval for a single response when A = 40, B = 20, C = 50,
and D = 30.
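For reference, the standard matrix formulas behind parts (a) through (d) are collected below. This summary assumes that the matrix given in the problem is $(X'X)^{-1}$ and that the vector holds the estimated coefficients $b$; the problem statement does not label them here, so treat that reading as an assumption. The index $k$ starts at 0 for the intercept, $p$ counts all regression parameters including the intercept, and $X_h = (1, A, B, C, D)'$ is the vector of predictor values at which the response is estimated.

```latex
% Standard error, confidence interval, and t statistic for one coefficient
s\{b_k\} = \sqrt{\mathrm{MSE}\,\bigl[(X'X)^{-1}\bigr]_{kk}}, \qquad
b_k \pm t(1-\alpha/2;\ n-p)\, s\{b_k\}, \qquad
t^* = \frac{b_k}{s\{b_k\}}

% Estimated mean response at X_h and its variance
\hat{Y}_h = X_h' b, \qquad
s^2\{\hat{Y}_h\} = \mathrm{MSE}\; X_h'(X'X)^{-1}X_h

% Variance used for predicting a single new response at X_h
s^2\{\mathrm{pred}\} = \mathrm{MSE}\,\bigl(1 + X_h'(X'X)^{-1}X_h\bigr)
```

Both intervals in (c) and (d) take the form $\hat{Y}_h \pm t(1-\alpha/2;\ n-p)\, s\{\cdot\}$ with the corresponding standard error, and an 85% interval corresponds to $\alpha = 0.15$.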
11. In R, create a new variable GENDERW that has values 1 for women and 0 for men (use arithmetic
on the original variable GENDER). Run a regression to predict GPA using the explanatory
variables HSM, HSS, HSE, SATM, SATV, and GENDERW. (Do not include any interaction terms.)
(a) Give the equation of the fitted regression line using all six explanatory variables.
(b) Give the fitted regression line for women (use part a).
(c) Give the fitted regression line for men (use part a).
DO NOT attempt to run the lm function on a subset of the data to answer this question.
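The recoding step in Problem 11, and the way a single fit yields a separate line for each group, can be sketched as follows. The data are simulated stand-ins (made-up values), and only one other predictor is included to keep the example short.

```r
set.seed(4)
n <- 80

# Simulated stand-in data: GENDER coded 1 for men and 2 for women
d <- data.frame(GENDER = sample(1:2, n, replace = TRUE),
                HSM = rnorm(n, 8, 1.5))
d$GPA <- 0.8 + 0.2 * d$HSM + 0.1 * (d$GENDER == 2) + rnorm(n, sd = 0.6)

# Arithmetic recode: 1 -> 0 (men), 2 -> 1 (women)
d$GENDERW <- d$GENDER - 1

fit <- lm(GPA ~ HSM + GENDERW, data = d)
b <- coef(fit)

# One fit, two lines: the slope is shared and only the intercept shifts
intercept_men   <- unname(b["(Intercept)"])
intercept_women <- unname(b["(Intercept)"] + b["GENDERW"])
print(c(men = intercept_men, women = intercept_women, slope_HSM = unname(b["HSM"])))
```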
12. Use the Cp criterion to select the best subset of variables for this problem. Use only the
original six explanatory variables, not HS or SAT, and use either GENDER or GENDERW, not
both. Summarize the results and explain your choice of the best model.
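Mallows' $C_p$ for every subset can be computed directly in base R, as in the sketch below. It assumes the common definition $C_p = SSE_p / MSE_{full} - (n - 2p)$, where $p$ counts the parameters in the candidate model including the intercept. The data and the predictor names A, B, and C are simulated placeholders, not the course data.

```r
set.seed(5)
n <- 100

# Simulated placeholder data; C is pure noise by construction
d <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n))
d$Y <- 1 + 2 * d$A + 0.5 * d$B + rnorm(n)

preds <- c("A", "B", "C")
full <- lm(Y ~ A + B + C, data = d)
mse_full <- sum(resid(full)^2) / df.residual(full)

# Every non-empty subset of the predictors
subsets <- unlist(lapply(seq_along(preds),
                         function(k) combn(preds, k, simplify = FALSE)),
                  recursive = FALSE)

# Cp for each candidate model
cp <- sapply(subsets, function(vars) {
  fit <- lm(as.formula(paste("Y ~", paste(vars, collapse = " + "))), data = d)
  p <- length(coef(fit))  # parameters, intercept included
  sum(resid(fit)^2) / mse_full - (n - 2 * p)
})

cp_table <- data.frame(model = sapply(subsets, paste, collapse = " + "),
                       p = lengths(subsets) + 1,
                       Cp = round(cp, 2))
print(cp_table[order(cp_table$Cp), ])
```

Models with small $C_p$ and $C_p$ close to $p$ are the usual candidates. The full model always has $C_p = p$ by construction, so it is not informative on its own.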
13. Check the assumptions of this “best” model using all the usual plots (you know what they
are by now). Explain in detail whether or not each assumption appears to be substantially
violated.