联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Algorithm 算法作业Algorithm 算法作业

日期:2023-02-14 09:09

SS 3860B/9055B - Deviance

Assignment 2

STATS 3860B/9155B -

Winter 2023

This assignment is due Friday, March 10th, at 11:55 pm (EST).

You must write your R code and answers using Rmarkdown generating a

single pdf file.

Submissions must be made via Gradescope. You must carefully assign each

question part to its corresponding page (or pages) on your pdf file. Question parts

with no pages assigned to them will receive zero marks.

Each student must submit their own work. Scholastic offences are taken seriously,

and students are directed to read the appropriate policy, specifically, the definition of

what constitutes a Scholastic Offence, at the following Web site: http://www.uwo.ca/u

nivsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf

Question 1

The data set NMES1988 in the AER package contains a sample of individuals over 65 who

are covered by Medicare in order to assess the demand for health care through physician

office visits, outpatient visits, ER visits, hospital stays, etc. The data can be accessed by

installing and loading the AER package and then running data(NMES1988). More background

information and references about the NMES1988 data can be found in help pages for the AER

package.

a) Show through graphical means that there are more respondents with 0 visits than

might be expected under a Poisson model.

b) Fit a ZIP model for the number of physician office visits using chronic, health, and

insurance as predictors for the Poisson count, and chronic and insurance as the

predictors for the binary part of the model. Then, provide interpretations in context for

the following estimated model parameters:

coefficient of chronic in the Poisson part of the model

coefficient of poor health in the Poisson part of the model

the intercept in the logistic part of the model

1

coefficient of insurance in the logistic part of the model

c) Use the estimated coefficients from part b) to calculate the predicted probability of

“always zero” (never visiting the doctor) for an individual who has insurance and two

chronic conditions. Then check your answer using the predict() function to compute

the same probability.

d) Use the estimated coefficients from part b) to calculate the predicted probability of

visiting the doctor five times for an individual with poor health, four chronic conditions

and without insurance. Then check your answer using the predict() function to

compute the same probability.

Question 2

The dataset melanoma gives data on a sample of patients suffering from melanoma (skin

cancer) cross-classified by the type of cancer and the location on the body.

suppressMessages(library(faraway))

str(melanoma)

## 'data.frame': 12 obs. of 3 variables:

## $ count: num 22 16 19 11 2 54 33 17 10 115 ...

## $ tumor: Factor w/ 4 levels "freckle","indeterminate",..: 1 4 3 2 1 4 3 2 1 4 ...

## $ site : Factor w/ 3 levels "extremity","head",..: 2 2 2 2 3 3 3 3 1 1 ...

a) Display the data in a two-way table. Make a mosaic plot and comment on the evidence

of independence.

b) Check for independence between site and tumour type using a Chi-squared test.

c) Fit a Poisson GLM model and use it to check for independence.

d) Make a two-way plot of the deviance residuals from the last model. Comment on the

larger residuals.

Question 3

The UCBAdmissions dataset presents data on applicants to graduate school at Berkeley for

the six largest departments in 1973 classified by admission and sex.

suppressMessages(library(faraway))

str(UCBAdmissions)

## 'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...

## - attr(*, "dimnames")=List of 3

## ..$ Admit : chr [1:2] "Admitted" "Rejected"

## ..$ Gender: chr [1:2] "Male" "Female"

## ..$ Dept : chr [1:6] "A" "B" "C" "D" ...

2

a) Show that this provides an example of the Simpson’s paradox.

b) Determine the most appropriate dependence model between the variables. Explain in

detail each step you take.

c) Fit a binomial regression with admissions status as the response and show the relationship

to your model in the previous question.

Question 4

The hsb data was collected as a subset of the “High School and Beyond” study conducted by

the National Education Longitudinal Studies program of the National Center for Education

Statistics. The variables are gender, race, socioeconomic status (SES), school type, chosen

high school program type, scores on reading, writing, math, science and social studies. The

response variable is the chosen high school program type (prog), which is multinomial with 3

levels.

library(faraway)

library(nnet)

data("hsb")

hsb <- hsb[,-1] ## removing first column corresponding to student ID

str(hsb)

## 'data.frame': 200 obs. of 10 variables:

## $ gender : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...

## $ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ...

## $ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...

## $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ...

## $ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ...

## $ read : int 57 68 44 63 47 44 50 34 63 57 ...

## $ write : int 52 59 33 44 52 52 59 46 57 55 ...

## $ math : int 41 53 54 47 57 51 42 45 54 52 ...

## $ science: int 47 63 58 53 53 63 53 39 58 50 ...

## $ socst : int 57 61 31 56 61 61 61 36 51 51 ...

a) Fit a multinomial regression model for prog (with baseline level academic) and all nine

predictors.

b) Interpret the coefficients corresponding to the five subjects (scores on reading, writing,

math, science and social studies) in terms of odds.

c) Regarding to part b), identify which one of the five subjects gives unexpected results and

suggest an explanation for this behavior. Any reasonable explanation will be accepted.

Question 5

An earlier study examined the effect of workplace rules in Minnesota which require smokers

to smoke cigarettes outside. The number of cigarettes smoked by smokers in a 2-hour period

3

Table 1: A small subset of hypothetical data on Minnesota workplace rules on smoking.

subject x (location) y (cigarettes)

1 0 3

2 1 0

3 1 0

4 1 1

5 0 2

6 0 1

was recorded, along with whether the smoker was at home or at work. A (very) small subset

of the data appears in Table 1 and it is also available in the smoking.csv file.

Model 1: Assume that Y ~ Poisson(λ); there is no difference between home and work.

Model 2: Assume that Y ~ Poisson(λW ) when the smoker is at work, and Y ~

Poisson(λH) when the smoker is at home.

Model 3: Assume that Y ~ Poisson(λ) and log(λ) = β0 + β1x.

a) Write out the likelihood L(λ) and log-likelihood logL(λ) in Model 1. Use the data

values in Table 1, and simplify where possible.

b) Intuitively, what would be a reasonable estimate for λ based on this data? Why?

c) Use R to produce a plot of the likelihood function L(λ) . Find the maximum likelihood

estimator for λ in Model 1 using an optimization routine in R (for example, optim(),

but not the glm() function).

d) Write out the log-likelihood function logL(λW , λH) in Model 2. Use the data values in

Table 1, and simplify where possible.

e) Intuitively, what would be reasonable estimates for λW and λH based on this data?

Why?

f) Find the maximum likelihood estimators for λW and λH in Model 2 using an optimization

routine in R (for example, optim(), but not the glm() function).

g) Write out the log-likelihood function logL(β0, β1) in Model 3. Again, use the data values

in Table 1, and simplify where possible.

h) Find the maximum likelihood estimators for β0 and β1 in Model 3 using an optimization

routine in R (for example, optim(), but not the glm() function). Use R to produce a

3D plot of the log-likelihood function.

i) Confirm your estimates for Model 1 and Model 3 using glm(). Then show that the

MLEs for Model 3 agree with the MLEs for Model 2.

4

Question 6

This question refers to exercise 4 of Chapter 8 of the textbook; however, instead of working

with the Galapagos data, you will work with the dataset in Table 1 from the previous question.

The purpose of this question is to reproduce the details of the GLM fitting of this data via

the IRWLS algorithm.

a) Consider the Poisson GLM fit for Model 3 in part i) of the previous question. Report

the estimated values of the coefficients and deviance.

For parts b), c), d), e), f) and g) refer to Exercise 4 Chapter 8 of the textbook (page 172).

5

Camila de Souza SS 3860B/9055B - Deviance Winter 2023 1 / 7

Use of deviance to measure goodness of fit

We can measure how well our proposed model fits the data by

comparing it to the saturated (or full) model.

A saturated model is achievable by adding sufficiently many parameters,

in most cases as many parameters as number of data points.

This comparison (saturated versus proposed) can be done by

calculating the so-called deviance, that is, the difference between the

log likelihood of the saturated model (Sat) and the log likelihood of

our proposed model (M):

DM = 2(log LSat ? log LM)

Camila de Souza SS 3860B/9055B - Deviance Winter 2023 2 / 7

Use of deviance to measure goodness of fit

Gaussian linear regression: DM = RSS.

However, we cannot use RSS as a test statistic for goodness of fit

because of the variance (dispersion) parameter.

Instead, we use can R2 = 1? RSS/TSS

Binary logistic regression: DM = ?2 log LM , also cannot be used as a

test statistic for goodness fit.

For logistic regression we can use the Hosmer-Lemeshow (HL) test.

Camila de Souza SS 3860B/9055B - Deviance Winter 2023 3 / 7

Use of deviance to measure goodness of fit

In Poisson or Binomial regression we can use use DM as a test

statistic to assess the goodness of fit of the propose model.

H0 : no lack of fit versus H1: lack of fit

Under H0, DM ~ approx. χ2n?p, where n is the number of observations

and p the number of parameters in the model.

So, we reject “H0 : no lack of fit” if P(DM > Dobs) is sufficiently

small, say smaller than 0.05.

Note: For Poisson and Binomial regression we can also use the Pearson χ2

statistic instead of DM to conduct the goodness of fit test.

Camila de Souza SS 3860B/9055B - Deviance Winter 2023 4 / 7

Use of deviance to measure goodness of fit

Quasi-Binomial or Quasi-Poisson models:

log L is approximated by a function Q that allows for extra variation

Qsat = 0

We cannot use then the deviance based on Q to assess goodness of fit

Camila de Souza SS 3860B/9055B - Deviance Winter 2023 5 / 7

Use of deviance to compare two nested models

For logistic regression consider the likelihood ratio test statistic (LRT):

LRT = 2 log LLLS

= DS ? DL ~ approx. χ2l?s

H0: smaller model is correct

→ H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ, where Θ0 ? Θ

We reject the null hypothesis (that the simpler model is consistent with

the data), and therefore selet the larger model over the smaller one, if

the LRTobs exceeds the 95% quantile of the reference (null)

distribution (P(LRT > LRTobs) < 0.05).

If the LRTobs lies below this quantile then the null hypothesis is not

rejected, and the smaller model is selected in favour of the larger one.

Camila de Souza SS 3860B/9055B - Deviance Winter 2023 6 / 7

Use of deviance to compare two nested models

For Poisson and Binomial regression:

→ If there is no under or overdispersion: DS DL ~ approx. χ2ls

→ If there is under or overdispersion:

(DS DL)

(dfS dfL)

1

σ?2

~ approx. F(dfSdfL),dfL

where (dfS dfL) = l s is the difference in number of parameters between

the large and small models.

Note: for quasi-likelihood methods we must use F -tests to compare models

because there is also the dispersion parameter

Camila de Souza SS 3860B/9055B - Deviance Winter


相关文章

版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp