SS 3860B/9055B - Deviance
Assignment 2
STATS 3860B/9155B -
Winter 2023
This assignment is due Friday, March 10th, at 11:55 pm (EST).
You must write your R code and answers using Rmarkdown generating a
single pdf file.
Submissions must be made via Gradescope. You must carefully assign each
question part to its corresponding page (or pages) on your pdf file. Question parts
with no pages assigned to them will receive zero marks.
Each student must submit their own work. Scholastic offences are taken seriously,
and students are directed to read the appropriate policy, specifically, the definition of
what constitutes a Scholastic Offence, at the following Web site: http://www.uwo.ca/u
nivsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf
Question 1
The data set NMES1988 in the AER package contains a sample of individuals over 65 who
are covered by Medicare in order to assess the demand for health care through physician
office visits, outpatient visits, ER visits, hospital stays, etc. The data can be accessed by
installing and loading the AER package and then running data(NMES1988). More background
information and references about the NMES1988 data can be found in help pages for the AER
package.
a) Show through graphical means that there are more respondents with 0 visits than
might be expected under a Poisson model.
b) Fit a ZIP model for the number of physician office visits using chronic, health, and
insurance as predictors for the Poisson count, and chronic and insurance as the
predictors for the binary part of the model. Then, provide interpretations in context for
the following estimated model parameters:
coefficient of chronic in the Poisson part of the model
coefficient of poor health in the Poisson part of the model
the intercept in the logistic part of the model
1
coefficient of insurance in the logistic part of the model
c) Use the estimated coefficients from part b) to calculate the predicted probability of
“always zero” (never visiting the doctor) for an individual who has insurance and two
chronic conditions. Then check your answer using the predict() function to compute
the same probability.
d) Use the estimated coefficients from part b) to calculate the predicted probability of
visiting the doctor five times for an individual with poor health, four chronic conditions
and without insurance. Then check your answer using the predict() function to
compute the same probability.
Question 2
The dataset melanoma gives data on a sample of patients suffering from melanoma (skin
cancer) cross-classified by the type of cancer and the location on the body.
suppressMessages(library(faraway))
str(melanoma)
## 'data.frame': 12 obs. of 3 variables:
## $ count: num 22 16 19 11 2 54 33 17 10 115 ...
## $ tumor: Factor w/ 4 levels "freckle","indeterminate",..: 1 4 3 2 1 4 3 2 1 4 ...
## $ site : Factor w/ 3 levels "extremity","head",..: 2 2 2 2 3 3 3 3 1 1 ...
a) Display the data in a two-way table. Make a mosaic plot and comment on the evidence
of independence.
b) Check for independence between site and tumour type using a Chi-squared test.
c) Fit a Poisson GLM model and use it to check for independence.
d) Make a two-way plot of the deviance residuals from the last model. Comment on the
larger residuals.
Question 3
The UCBAdmissions dataset presents data on applicants to graduate school at Berkeley for
the six largest departments in 1973 classified by admission and sex.
suppressMessages(library(faraway))
str(UCBAdmissions)
## 'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
## - attr(*, "dimnames")=List of 3
## ..$ Admit : chr [1:2] "Admitted" "Rejected"
## ..$ Gender: chr [1:2] "Male" "Female"
## ..$ Dept : chr [1:6] "A" "B" "C" "D" ...
2
a) Show that this provides an example of the Simpson’s paradox.
b) Determine the most appropriate dependence model between the variables. Explain in
detail each step you take.
c) Fit a binomial regression with admissions status as the response and show the relationship
to your model in the previous question.
Question 4
The hsb data was collected as a subset of the “High School and Beyond” study conducted by
the National Education Longitudinal Studies program of the National Center for Education
Statistics. The variables are gender, race, socioeconomic status (SES), school type, chosen
high school program type, scores on reading, writing, math, science and social studies. The
response variable is the chosen high school program type (prog), which is multinomial with 3
levels.
library(faraway)
library(nnet)
data("hsb")
hsb <- hsb[,-1] ## removing first column corresponding to student ID
str(hsb)
## 'data.frame': 200 obs. of 10 variables:
## $ gender : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
## $ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...
## $ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
## $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...
## $ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...
## $ read : int 57 68 44 63 47 44 50 34 63 57 ...
## $ write : int 52 59 33 44 52 52 59 46 57 55 ...
## $ math : int 41 53 54 47 57 51 42 45 54 52 ...
## $ science: int 47 63 58 53 53 63 53 39 58 50 ...
## $ socst : int 57 61 31 56 61 61 61 36 51 51 ...
a) Fit a multinomial regression model for prog (with baseline level academic) and all nine
predictors.
b) Interpret the coefficients corresponding to the five subjects (scores on reading, writing,
math, science and social studies) in terms of odds.
c) Regarding to part b), identify which one of the five subjects gives unexpected results and
suggest an explanation for this behavior. Any reasonable explanation will be accepted.
Question 5
An earlier study examined the effect of workplace rules in Minnesota which require smokers
to smoke cigarettes outside. The number of cigarettes smoked by smokers in a 2-hour period
3
Table 1: A small subset of hypothetical data on Minnesota workplace rules on smoking.
subject x (location) y (cigarettes)
1 0 3
2 1 0
3 1 0
4 1 1
5 0 2
6 0 1
was recorded, along with whether the smoker was at home or at work. A (very) small subset
of the data appears in Table 1 and it is also available in the smoking.csv file.
Model 1: Assume that Y ~ Poisson(λ); there is no difference between home and work.
Model 2: Assume that Y ~ Poisson(λW ) when the smoker is at work, and Y ~
Poisson(λH) when the smoker is at home.
Model 3: Assume that Y ~ Poisson(λ) and log(λ) = β0 + β1x.
a) Write out the likelihood L(λ) and log-likelihood logL(λ) in Model 1. Use the data
values in Table 1, and simplify where possible.
b) Intuitively, what would be a reasonable estimate for λ based on this data? Why?
c) Use R to produce a plot of the likelihood function L(λ) . Find the maximum likelihood
estimator for λ in Model 1 using an optimization routine in R (for example, optim(),
but not the glm() function).
d) Write out the log-likelihood function logL(λW , λH) in Model 2. Use the data values in
Table 1, and simplify where possible.
e) Intuitively, what would be reasonable estimates for λW and λH based on this data?
Why?
f) Find the maximum likelihood estimators for λW and λH in Model 2 using an optimization
routine in R (for example, optim(), but not the glm() function).
g) Write out the log-likelihood function logL(β0, β1) in Model 3. Again, use the data values
in Table 1, and simplify where possible.
h) Find the maximum likelihood estimators for β0 and β1 in Model 3 using an optimization
routine in R (for example, optim(), but not the glm() function). Use R to produce a
3D plot of the log-likelihood function.
i) Confirm your estimates for Model 1 and Model 3 using glm(). Then show that the
MLEs for Model 3 agree with the MLEs for Model 2.
4
Question 6
This question refers to exercise 4 of Chapter 8 of the textbook; however, instead of working
with the Galapagos data, you will work with the dataset in Table 1 from the previous question.
The purpose of this question is to reproduce the details of the GLM fitting of this data via
the IRWLS algorithm.
a) Consider the Poisson GLM fit for Model 3 in part i) of the previous question. Report
the estimated values of the coefficients and deviance.
For parts b), c), d), e), f) and g) refer to Exercise 4 Chapter 8 of the textbook (page 172).
5
Camila de Souza SS 3860B/9055B - Deviance Winter 2023 1 / 7
Use of deviance to measure goodness of fit
We can measure how well our proposed model fits the data by
comparing it to the saturated (or full) model.
A saturated model is achievable by adding sufficiently many parameters,
in most cases as many parameters as number of data points.
This comparison (saturated versus proposed) can be done by
calculating the so-called deviance, that is, the difference between the
log likelihood of the saturated model (Sat) and the log likelihood of
our proposed model (M):
DM = 2(log LSat ? log LM)
Camila de Souza SS 3860B/9055B - Deviance Winter 2023 2 / 7
Use of deviance to measure goodness of fit
Gaussian linear regression: DM = RSS.
However, we cannot use RSS as a test statistic for goodness of fit
because of the variance (dispersion) parameter.
Instead, we use can R2 = 1? RSS/TSS
Binary logistic regression: DM = ?2 log LM , also cannot be used as a
test statistic for goodness fit.
For logistic regression we can use the Hosmer-Lemeshow (HL) test.
Camila de Souza SS 3860B/9055B - Deviance Winter 2023 3 / 7
Use of deviance to measure goodness of fit
In Poisson or Binomial regression we can use use DM as a test
statistic to assess the goodness of fit of the propose model.
H0 : no lack of fit versus H1: lack of fit
Under H0, DM ~ approx. χ2n?p, where n is the number of observations
and p the number of parameters in the model.
So, we reject “H0 : no lack of fit” if P(DM > Dobs) is sufficiently
small, say smaller than 0.05.
Note: For Poisson and Binomial regression we can also use the Pearson χ2
statistic instead of DM to conduct the goodness of fit test.
Camila de Souza SS 3860B/9055B - Deviance Winter 2023 4 / 7
Use of deviance to measure goodness of fit
Quasi-Binomial or Quasi-Poisson models:
log L is approximated by a function Q that allows for extra variation
Qsat = 0
We cannot use then the deviance based on Q to assess goodness of fit
Camila de Souza SS 3860B/9055B - Deviance Winter 2023 5 / 7
Use of deviance to compare two nested models
For logistic regression consider the likelihood ratio test statistic (LRT):
LRT = 2 log LLLS
= DS ? DL ~ approx. χ2l?s
H0: smaller model is correct
→ H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ, where Θ0 ? Θ
We reject the null hypothesis (that the simpler model is consistent with
the data), and therefore selet the larger model over the smaller one, if
the LRTobs exceeds the 95% quantile of the reference (null)
distribution (P(LRT > LRTobs) < 0.05).
If the LRTobs lies below this quantile then the null hypothesis is not
rejected, and the smaller model is selected in favour of the larger one.
Camila de Souza SS 3860B/9055B - Deviance Winter 2023 6 / 7
Use of deviance to compare two nested models
For Poisson and Binomial regression:
→ If there is no under or overdispersion: DS DL ~ approx. χ2ls
→ If there is under or overdispersion:
(DS DL)
(dfS dfL)
1
σ?2
~ approx. F(dfSdfL),dfL
where (dfS dfL) = l s is the difference in number of parameters between
the large and small models.
Note: for quasi-likelihood methods we must use F -tests to compare models
because there is also the dispersion parameter
Camila de Souza SS 3860B/9055B - Deviance Winter
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。