Date: 2019-01-18 10:16

VERSION 1 STATS 330

THE UNIVERSITY OF AUCKLAND

SEMESTER ONE 2018

Campus: City

STATISTICS

Advanced Statistical Modelling

(Time allowed: THREE hours)

INSTRUCTIONS

SECTION A: Multiple Choice (24 marks)

Answer ALL 12 questions on the coloured teleform sheet provided.

To answer, fill in the appropriate box on the teleform sheet.

Use pencil only. To change an answer, erase the original answer completely and

fill in a new answer.

If you give more than one answer to any question, you will receive zero marks

for that question.

All questions carry the same mark value.

All questions have a single correct answer.

Incorrect answers are not penalised.

SECTION B (76 marks)

Answer all questions.

Total for both parts: 100 marks.

SECTION A

1. Suppose we fit a model and calculate a 95% confidence interval and a 95%

prediction interval for an observation. The confidence interval is (1.1, 2.1) and

the prediction interval is (-0.8, 1.8). Which of these statements is TRUE?

(1) If we calculate the confidence interval for a number of successive samples

from the same population, it will contain the true mean 90% of the time.

(2) If we calculate the confidence interval for a number of successive samples

from the same population, it will contain the future observation 95% of the

time.

(3) If we calculate the prediction interval for a number of successive samples

from the same population, it will contain the future observation 90% of the

time.

(4) If we calculate the prediction interval for a number of successive samples

from the same population, it will contain the true mean 95% of the time.

(5) These intervals are not consistent with statistical theory; they have been

calculated incorrectly.

2. Consider diagnostics for a linear model. If the constant variance assumption

fails, which of the following options is a potential solution?

(1) Transform the response and fit the model again, using a Box-Cox plot to

choose the transformation.

(2) Delete influential points and fit the model again.

(3) Use the backwards elimination method.

(4) Use weighted least squares, where the weights are the inverse of the coefficients.

(5) Delete the variables that are collinear and fit the model again.

The next two questions are based on the following scenario.

Suppose we have a response variable Y and an explanatory factor X, which has

four levels. We ran the following code in R:

> mymodel <- lm(Y ~ X)

> plot(mymodel, which = 1:6)

3. We want to test whether the effect of all levels of X is the same. What is the

correct code to do this in R?

(1) anova(submodel, mymodel)

(2) plot(mymodel)

(3) anova(mymodel)

(4) t.test(mymodel)

(5) summary(mymodel)

4. Which plots are produced by the code?

(1) Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook’s distance, Residuals

vs Leverage, Fitted values.

(2) Residuals, Normal Q-Q, Scale-Location, Cook’s distance, Leverage, Fitted

values.

(3) Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook’s distance, Residuals

vs Leverage, Cook’s distance vs Leverage.

(4) Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.

(5) Residuals vs Fitted, Normal Q-Q, Scale-Location vs Cook’s distance, Residuals

vs Fitted, Cook’s distance vs Leverage.

The next two questions are based on the following analysis.

Blackburn Rovers is a football club based in Lancashire, England. One of their

fans cross-classified all their league matches over the last five seasons by result

(Win, Draw, or Loss) and match location (Home or Away). A ‘Home’ match

is played at Ewood Park in Blackburn, while an ‘Away’ match is played at the

opposition’s football ground. The data are shown in the following contingency

table:

       Win  Draw  Loss
Home    53    35    27
Away    35    40    40

The following code was used to analyse these data:

> blackburn.df

result location count

1 Win Home 53

2 Draw Home 35

3 Loss Home 27

4 Win Away 35

5 Draw Away 40

6 Loss Away 40

> blackburn.fit <- glm(count ~ result*location, family = "poisson",

data = blackburn.df)

> anova(blackburn.fit, test = "Chisq")

Analysis of Deviance Table

Model: poisson, link: log

Response: count

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                                5       9.49
result           2     2.91         3       6.58    0.234
location         1     0.00         2       6.58    1.000
result:location  2     6.58         0       0.00    0.037 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(blackburn.fit)

Call:

glm(formula = count ~ result * location, family = "poisson",

data = blackburn.df)

Deviance Residuals:

[1] 0 0 0 0 0 0

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.69e+00 1.58e-01 23.33 <2e-16 ***

resultDraw -2.34e-16 2.24e-01 0.00 1.000

resultWin -1.34e-01 2.31e-01 -0.58 0.564

locationHome -3.93e-01 2.49e-01 -1.58 0.115

resultDraw:locationHome 2.60e-01 3.40e-01 0.76 0.445

resultWin:locationHome 8.08e-01 3.31e-01 2.44 0.015 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 9.4884e+00 on 5 degrees of freedom

Residual deviance: -1.4655e-14 on 0 degrees of freedom

AIC: 44.81

Number of Fisher Scoring iterations: 3

> coef(blackburn.fit)

(Intercept) resultDraw resultWin

3.6889e+00 -2.3424e-16 -1.3353e-01

locationHome resultDraw:locationHome resultWin:locationHome

-3.9304e-01 2.5951e-01 8.0799e-01

> exp(coef(blackburn.fit))

(Intercept) resultDraw resultWin

40.0000 1.0000 0.8750

locationHome resultDraw:locationHome resultWin:locationHome

0.6750 1.2963 2.2434
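
The exponentiated interaction coefficient above can be cross-checked against the contingency table by hand. The sketch below is an illustration, not part of the original analysis. It rebuilds blackburn.df from the printed counts and assumes the factor baselines are Loss and Away, which is inferred from the coefficient names in the output rather than stated in the paper.

```r
# Rebuild the data frame from the printed table.
# Assumption: baselines are "Loss" and "Away" (inferred from the coefficient names).
blackburn.df <- data.frame(
  result   = factor(rep(c("Win", "Draw", "Loss"), times = 2),
                    levels = c("Loss", "Draw", "Win")),
  location = factor(rep(c("Home", "Away"), each = 3),
                    levels = c("Away", "Home")),
  count    = c(53, 35, 27, 35, 40, 40)
)

blackburn.fit <- glm(count ~ result * location, family = "poisson",
                     data = blackburn.df)

# Win-versus-loss odds ratio, home relative to away, from the fitted model ...
or.fit <- unname(exp(coef(blackburn.fit)["resultWin:locationHome"]))

# ... and the same quantity computed directly from the four relevant cells.
or.table <- (53 / 27) / (35 / 40)

# Upper-tail chi-squared probability for the interaction deviance (6.58 on 2 df).
p.interaction <- pchisq(6.58, df = 2, lower.tail = FALSE)

print(c(or.fit = or.fit, or.table = or.table, p.interaction = p.interaction))
```

Because the model is saturated, the fitted odds ratio should agree with the cross-product ratio of the observed counts up to numerical tolerance.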

5. Which of the following statements is FALSE?

(1) We estimate that the log-odds of Blackburn Rovers winning rather than

losing if they are playing at home are approximately 0.81 higher than the

log-odds of them winning rather than losing if they are playing away.

(2) We estimate that the odds of Blackburn Rovers winning are approximately

0.88 times the odds of them losing.

(3) We estimate that the odds of Blackburn Rovers winning rather than losing

are approximately 124% higher if they are playing at home than they are

if they are playing away.

(4) We estimate that the odds of Blackburn Rovers winning rather than losing

if they are playing at home are approximately 2.24 times the odds of them

winning rather than losing if they are playing away.

(5) There is evidence of an association between the result and the location of

a Blackburn Rovers match.

6. Which of the following statements is TRUE?

(1) We should drop the result:location interaction because the p-value for

resultDraw:locationHome is large.

(2) The residual deviance of the model blackburn.fit is zero because we have

fitted the saturated model.

(3) The null deviance of the model blackburn.fit is smaller than the residual

deviance.

(4) If we dropped the result:location interaction, the resulting model would

probably have a smaller residual deviance than the model blackburn.fit.

(5) We probably should not trust this analysis, because some of the cells have

counts that are too small.

7. Which of the following statements about logistic regression with a logit link

function is FALSE?

(1) By default, R uses the logit link function when a logistic regression model

is fitted.

(2) We assume that the observations are independent.

(3) We assume that the variance of the response variable is constant across all

observations.

(4) We assume that the response variable comes from a binomial distribution.

(5) We assume that the log-odds of a trial being successful is a linear combination

of the explanatory variables.

8. A Poisson regression model with a log link function was fitted to a response

variable Y, using only a single numeric explanatory variable, X. Estimates of

the linear predictor’s intercept, β0, and slope, β1, were obtained. Which of the

following is an appropriate interpretation?

(1) When x = 0, we estimate that the odds of success are equal to exp(β0)/(1 + exp(β0)).

(2) For every one-unit increase in X, we estimate that the expected value of Y increases by β1.

(3) When x = 0, we estimate that the expected value of Y is β0.

(4) For every one-unit increase in X, we estimate that the odds of success are multiplied by exp(β1).

(5) For every one-unit increase in X, we estimate that the expected value of Y is multiplied by exp(β1).

9. Which of the following statements about the use of offsets in generalised linear

models is FALSE?

(1) While offsets are most common for Poisson regression models, we can use

them for logistic or standard linear regression models, too.

(2) We estimate a coefficient for an offset, which allows us to interpret the

relationship between the offset and the response variable.

(3) An offset adds a fixed value to the linear predictor for each observation.

(4) When we fit a Poisson regression model with a log link function in R, we

can use the argument offset = log(t) if we think the expected value of

the response is directly proportional to the variable t.

(5) An explanatory variable’s estimated coefficient is very close to 1. The fitted

values from our model are unlikely to change much if we use the variable

as an offset instead.

10. Which of the following statements about estimated coefficients of linear models

(LMs) and generalised linear models (GLMs) is FALSE?

(1) The estimated coefficients of a GLM maximise the deviance.

(2) The estimated coefficients of a GLM minimise the sum of the squared

deviance residuals.

(3) The estimated coefficients of a GLM maximise the likelihood.

(4) The estimated coefficients of a GLM maximise the log-likelihood.

(5) The estimated coefficients of a LM minimise the residual sum of squares.

11. Consider a logistic regression model fitted to ungrouped data. The observed

responses from the first two observations are given by y1 and y2. The first was

observed as a success (y1 = 1), and has a fitted probability under the model

of p1 = 0.8. The second was observed as a failure (y2 = 0), and has a fitted

probability under the model of p2 = 0.7. Which of the following statements is

TRUE?

(1) Under the fitted model, the expected value of the first observation is 0.2,

and the expected value of the second observation is 0.3.

(2) The Pearson residual of the second observation is larger in magnitude (i.e.,

further from zero) than the Pearson residual of the first observation.

(3) The deviance residual of the first observation is negative, and the deviance

residual of the second observation is positive.

(4) It is possible for an observation to have a positive deviance residual, but a

negative deviance residual.

(5) The further an observed value is from its expected value, the closer to zero

the deviance residual is.

12. Which of the following statements is FALSE?

(1) The null deviance is always equal to or smaller than the residual deviance.

(2) The residual deviance is equal to twice the difference between the log-likelihood

of the saturated model and the log-likelihood of the fitted model.

(3) The residual deviance of the saturated model is always equal to zero.

(4) The residual deviance is equal to the sum of the squared deviance residuals.

(5) The log-likelihood of the saturated model is always equal to or larger than

the log-likelihood of the fitted model.

SECTION B

13. [8 marks] Guess the analysis: choose the most appropriate model to fit for

each of the scenarios described below. Different scenarios may have the same

answer.

For each scenario, select one of these three possible answers:

(1) Linear regression model

(2) Logistic regression model

(3) Poisson regression model

(a) TVNZ has just released a new TV show. Their market analyst wishes to

build a model to predict whether or not specific individuals will enjoy the

show. They conduct a survey, collecting variables from participants such

as gender, age, income, and occupation. They also asked participants if

they enjoyed a pilot episode of the show.

[2 marks]

(b) A STATS 330 lecturer is interested to see if the number of questions posted

on Piazza is related to the closeness of an assignment deadline. Each day,

they count the number of Piazza questions that were posted, and record

the number of days until the next assignment deadline.

[2 marks]

(c) A detective wishes to determine whether or not there is an association between

a serial killer’s gender (female or male) and their preferred method

(poisoning, strangulation, and so on). They cross-classify a sample of convicted

serial killers using these two variables.

[2 marks]

(d) A University of Auckland employee wishes to determine the quickest way

to get to work. Each day, they randomly select a transportation method

(bus, train, or walk) and a departure time (8:00am, 8:15am, or 8:30am).

They record how long their journey took.

[2 marks]

14. [4 marks] Suppose we wish to predict the expenditure on cancer treatment

for a patient at the Auckland hospital. We have collected the following variables

from a sample of cancer patients:

Stage of cancer

Type of cancer

Cancer treatment expenditure

Age

Gender

Ethnicity

Marital status

(a) Is it possible to build a predictive model with this information? Explain

your answer.

[2 marks]

(b) We also want to build an explanatory model to investigate if dietary habits

cause cancer. Can we fit a model to do so using only the variables above?

Explain your answer.

[2 marks]

15. [17 marks] The data for this question were collected from a sample of 44

male and 51 female athletes at the Australian Institute of Sport. The data set

contains the following variables:

sex      The athlete’s sex, either female or male.

sport    The athlete’s sport, either basketball (BBall), rowing (Row), swimming (Swim), or tennis (Tennis).

BMI      The athlete’s body mass index, calculated by dividing their weight (in kg) by their height (in m) squared.

X.Bfat   The athlete’s body fat percentage.

Printed below are the first three observations, and summary statistics for each

of the variables:

> head(sport.df, 3)

     sex sport   BMI X.Bfat
1 female BBall 20.56  19.75
2 female BBall 20.67  21.30
3 female BBall 21.86  19.88

> summary(sport.df)

     sex        sport         BMI            X.Bfat
 female:51   BBall :25   Min.   :17.1   Min.   : 6.16
 male  :44   Row   :37   1st Qu.:21.3   1st Qu.: 8.92
             Swim  :22   Median :22.7   Median :12.20
             Tennis:11   Mean   :22.8   Mean   :13.91
                         3rd Qu.:24.0   3rd Qu.:18.62
                         Max.   :26.8   Max.   :28.83

We analysed these data using the following code:

> sport.fit <- lm(log(X.Bfat) ~ BMI + sport*sex, data = sport.df)

> reg <- allpossregs(sport.fit)[, -c(1, 2, 3, 4)]

> reg

     AIC    BIC    CV BMI Row Swim Tennis male Row:male Swim:male Tennis:male
1 161.12 166.23 0.470   0   0    0      0    1        0         0           0
2 130.05 137.71 0.384   1   0    0      0    1        0         0           0
3 112.36 122.58 0.332   1   0    1      0    1        0         0           0
4 107.89 120.66 0.318   1   0    1      0    1        0         1           0
5 106.14 121.46 0.316   1   0    1      1    1        0         1           0
6 102.21 120.09 0.308   1   0    1      1    1        0         1           1
7 102.84 123.28 0.310   1   1    1      1    1        0         1           1
8 104.00 126.98 0.313   1   1    1      1    1        1         1           1

> sport.fit.2 <- lm(log(X.Bfat) ~ sport, data = sport.df)

> anova(sport.fit.2)

Analysis of Variance Table

Response: log(X.Bfat)

Df Sum Sq Mean Sq F value Pr(>F)

sport 3 1.87 0.625 3.91 0.011 *

Residuals 91 14.54 0.160

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(sport.fit.2)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.5956 0.0799 32.47 <2e-16 ***

sportRow 0.0752 0.1035 0.73 0.469

sportSwim -0.2835 0.1168 -2.43 0.017 *

sportTennis -0.1043 0.1446 -0.72 0.472

---

Residual standard error: 0.4 on 91 degrees of freedom

Multiple R-squared: 0.114, Adjusted R-squared: 0.085

F-statistic: 3.91 on 3 and 91 DF, p-value: 0.0112
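
The F statistic, its p-value, and the multiple R-squared in this output all follow from the sums of squares in the analysis of variance table. As a rough check, assuming only the rounded figures printed above (so the results agree with the output to about two decimal places, not exactly):

```r
# Values copied from the printed anova(sport.fit.2) table.
ss.sport <- 1.87;  df.sport <- 3
ss.resid <- 14.54; df.resid <- 91

# F statistic: ratio of the two mean squares.
f.stat <- (ss.sport / df.sport) / (ss.resid / df.resid)

# Upper-tail probability of the F distribution with (3, 91) degrees of freedom.
p.value <- pf(f.stat, df.sport, df.resid, lower.tail = FALSE)

# Multiple R-squared: proportion of the total sum of squares explained by sport.
r.squared <- ss.sport / (ss.sport + ss.resid)

print(round(c(F = f.stat, p = p.value, R2 = r.squared), 3))
```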

(a) Write down the mathematical formula for the model sport.fit. This

should describe the relationship between the explanatory variables and the

response, and show its assumptions.

[3 marks]

(b) Based on the results from the allpossregs() function, which model will

return the lowest prediction error? Write down its mathematical formula

and the R code that could be used to fit this model.

[3 marks]

(c) Consider the model sport.fit.2 and the output from the anova() function,

shown above. What is the null hypothesis, or what are the null

hypotheses, associated with the p-value(s) in this output?

[3 marks]

(d) What do you conclude about the hypothesis (or hypotheses) from the output

of the anova() function?

[2 marks]

(e) Consider the model sport.fit.2 and the output from the summary() function,

shown above. What is the null hypothesis, or what are the null hypotheses,

associated with the p-value(s) in this output?

[3 marks]

(f) What do you conclude about the hypothesis (or hypotheses) from the output

of the summary() function?

[3 marks]

16. [18 marks] The data for this question are related to a sample of 1599

Portuguese red wines. Various physicochemical properties of the wines were measured.

Additionally, a panel of judges decided whether or not each wine was of

‘good quality’. The data set contains the following variables:

good.quality      This variable takes the value 1 if the wine is of ‘good quality’, and 0 otherwise.

fixed.acidity     The fixed concentration of tartaric acid (g per dm³).

volatile.acidity  The volatile concentration of tartaric acid (g per dm³).

residual.sugar    The concentration of residual sugars (g per dm³).

chlorides         The concentration of sodium chloride (g per dm³).

f.sulfur.dioxide  The concentration of free sulfur dioxide (mg per dm³).

density           The density of the wine (g per cm³).

sulphates         The concentration of potassium sulphate (g per dm³).

alcohol           The alcohol level of the wine (percentage alcohol by volume).

The following final model was fitted in R:

> wine.fit <- glm(good.quality ~ fixed.acidity + volatile.acidity +

residual.sugar + chlorides + t.sulfur.dioxide +

I(t.sulfur.dioxide^2) + density + sulphates +

I(sulphates^2) + alcohol + I(alcohol^2),

family = "binomial", data = wine.df)

> summary(wine.fit)

Call:

glm(formula = good.quality ~ fixed.acidity + volatile.acidity +

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.44e+02 1.01e+02 2.42 0.01554 *

fixed.acidity 2.68e-01 8.48e-02 3.17 0.00154 **

volatile.acidity -2.28e+00 6.66e-01 -3.43 0.00061 ***

residual.sugar 2.24e-01 7.87e-02 2.85 0.00436 **

chlorides -6.89e+00 3.57e+00 -1.93 0.05378 .

t.sulfur.dioxide -2.85e-02 7.65e-03 -3.73 0.00020 ***

I(t.sulfur.dioxide^2) 1.17e-04 5.07e-05 2.31 0.02104 *

density -2.93e+02 1.02e+02 -2.88 0.00399 **

sulphates 2.22e+01 5.06e+00 4.39 1.1e-05 ***

I(sulphates^2) -1.10e+01 3.25e+00 -3.39 0.00070 ***

alcohol 5.67e+00 1.49e+00 3.81 0.00014 ***

I(alcohol^2) -2.20e-01 6.51e-02 -3.38 0.00072 ***

---

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1269.92 on 1598 degrees of freedom

Residual deviance: 825.17 on 1587 degrees of freedom
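
The two deviances reported here can be compared in the same way as nested models in an analysis of deviance table. The following sketch uses only the figures printed above. For ungrouped binary data the residual deviance on its own is not a goodness-of-fit measure, but the drop from the null deviance is still approximately chi-squared under the hypothesis that none of the terms is needed.

```r
# Deviances and residual degrees of freedom copied from summary(wine.fit).
null.dev  <- 1269.92; null.df  <- 1598
resid.dev <- 825.17;  resid.df <- 1587

# Drop in deviance achieved by the explanatory terms, and how many there are.
dev.drop <- null.dev - resid.dev
df.drop  <- null.df - resid.df

# Upper-tail chi-squared probability for that drop.
p.value <- pchisq(dev.drop, df = df.drop, lower.tail = FALSE)

print(c(dev.drop = dev.drop, df.drop = df.drop, p.value = p.value))
```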

> ROC.curve(wine.fit)

Area under ROC curve = 0.8766

[Figure: ROC curve for wine.fit. Horizontal axis: False positive rate; vertical axis: True positive rate; both axes run from 0.0 to 1.0. AUC = 0.8766]

The following code was used to predict which wines in the sample were ‘good

quality’ wines. A probability cutoff of c = 0.5 was used.

> wine.pred <- predict(wine.fit, type = "response")

> wine.predcode <- ifelse(wine.pred < 0.5, "Not good", "Good")

The results of this code have been reformatted to give the following confusion

matrix:

                     Predicted
                     Good   Not good
Observed  Good         85        132
          Not good     41       1341
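
The usual in-sample summaries are simple ratios of the entries in this matrix. A minimal sketch, assuming ‘Good’ is treated as the positive class (the object name conf.mat is introduced here for illustration and does not appear in the original code):

```r
# Confusion matrix as printed: rows are observed classes, columns are predicted classes.
conf.mat <- matrix(c(85,  132,
                     41, 1341),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Observed  = c("Good", "Not good"),
                                   Predicted = c("Good", "Not good")))

# Sensitivity: proportion of observed "Good" wines that were predicted "Good".
sensitivity <- conf.mat["Good", "Good"] / sum(conf.mat["Good", ])

# Specificity: proportion of observed "Not good" wines that were predicted "Not good".
specificity <- conf.mat["Not good", "Not good"] / sum(conf.mat["Not good", ])

# Error rate: proportion of all wines that fall off the diagonal.
error.rate <- 1 - sum(diag(conf.mat)) / sum(conf.mat)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              error.rate = error.rate), 4))
```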

(a) Interpret the effects of the following variables on wine quality under the

model wine.fit:

(i) Concentration of residual sugars.

[2 marks]

(ii) Volatile concentration of tartaric acid.

[2 marks]

(b) Calculate the following:

(i) In-sample sensitivity.

[2 marks]

(ii) In-sample specificity.

[2 marks]

(iii) In-sample error rate.

[2 marks]

(c) Based on your calculations, comment on the model’s predictive power

using a probability cutoff of c = 0.5. How well does it predict wines that

are good quality? How well does it predict wines that are not good quality?

[3 marks]

(d) A chemist wishes to use this model to predict which wines are good quality.

However, they want the model to correctly predict 80% of the good-quality

wines in the sample. They adjust their probability cutoff c accordingly.

Using the ROC curve above, approximately what proportion of wines in

the sample that are not of good quality will they correctly predict using

this adjusted cutoff? Give your answer to one decimal place.

[2 marks]

(e) Specificity and sensitivity can also be calculated via cross-validation using

the R function cross.val(). Would you expect the sensitivity and specificity

calculated by cross.val() to be higher or lower than your in-sample

calculations above? Explain your answer.

[3 marks]

17. [29 marks] In 2015, New Zealand’s National Institute of Water and Atmospheric

Research (NIWA) sampled 486 sites on rivers around the country. At

each site, they recorded whether or not various freshwater species were present.

Each site was cross-classified based on these presence/absence data to form a

contingency table. The data set fishy.df has the following variables:

eel Presence (1) or absence (0) of the longfin eel.

koura Presence (1) or absence (0) of the koura, a type of crayfish.

bully Presence (1) or absence (0) of the upland bully.

trout Presence (1) or absence (0) of the brown trout.

count The number of sites with a particular combination of the

above variables.

Freshwater biologists were interested in how the species interact. Is it common

to find species at the same site? Or do some species avoid one another?

The data are shown below:

> fishy.df

eel koura bully trout count

1 0 0 0 0 233

2 0 0 0 1 41

3 0 0 1 0 35

4 0 0 1 1 12

5 0 1 0 0 12

6 0 1 0 1 2

7 0 1 1 0 2

8 0 1 1 1 1

9 1 0 0 0 52

10 1 0 0 1 34

11 1 0 1 0 9

12 1 0 1 1 13

13 1 1 0 0 16

14 1 1 0 1 8

15 1 1 1 0 4

16 1 1 1 1 12

For example, from the first row, there were 233 sites that had none of the species

present. From the fourth row, there were 12 sites at which the upland bully and

the brown trout were present, but the longfin eel and koura were not present.
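
For readers who want to reproduce the fits, the table can be rebuilt from the printed rows. This is a sketch rather than the original data-loading code, which the paper does not show. The name fishy.sat is introduced here for illustration so that it does not clash with the models fitted in the paper.

```r
# expand.grid() varies its first argument fastest, so trout is listed first
# to reproduce the row order of the printed table.
fishy.df <- expand.grid(trout = 0:1, bully = 0:1, koura = 0:1, eel = 0:1)
fishy.df <- fishy.df[, c("eel", "koura", "bully", "trout")]
fishy.df$count <- c(233, 41, 35, 12, 12, 2, 2, 1,
                    52, 34, 9, 13, 16, 8, 4, 12)

# The counts should add up to the 486 sampled sites.
n.sites <- sum(fishy.df$count)

# Saturated model: one parameter per cell, so no residual degrees of freedom.
fishy.sat <- glm(count ~ eel * koura * bully * trout,
                 family = "poisson", data = fishy.df)

print(n.sites)
print(df.residual(fishy.sat))
```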

> fishy.fit.1 <- glm(count ~ eel*koura*bully*trout,

family = "poisson", data = fishy.df)

> fishy.fit.2 <- glm(count ~ (eel + koura + bully + trout)^2,

family = "poisson", data = fishy.df)

> summary(fishy.fit.2)

Call:

glm(formula = count ~ (eel + koura + bully + trout)^2, family = "poisson",

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 5.4632 0.0642 85.04 < 2e-16 ***

eel -1.5204 0.1449 -10.49 < 2e-16 ***

koura -3.0946 0.2668 -11.60 < 2e-16 ***

bully -1.9837 0.1711 -11.59 < 2e-16 ***

trout -1.7869 0.1595 -11.21 < 2e-16 ***

eel:koura 1.8526 0.3251 5.70 1.2e-08 ***

eel:bully 0.2525 0.2749 0.92 0.35833

eel:trout 1.3429 0.2347 5.72 1.1e-08 ***

koura:bully 0.7185 0.3367 2.13 0.03286 *

koura:trout 0.0911 0.3280 0.28 0.78115

bully:trout 0.8875 0.2627 3.38 0.00073 ***

---

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 850.0360 on 15 degrees of freedom

Residual deviance: 3.0544 on 5 degrees of freedom

> fishy.fit.3 <- glm(count ~ eel + koura + bully + trout + eel:koura +

eel:bully + eel:trout + koura:bully + bully:trout,

family = "poisson", data = fishy.df)

> summary(fishy.fit.3)

Call:

glm(formula = count ~ eel + koura + bully + trout + eel:koura +

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 5.4628 0.0643 85.02 < 2e-16 ***

eel -1.5290 0.1421 -10.76 < 2e-16 ***

koura -3.0821 0.2626 -11.73 < 2e-16 ***

bully -1.9869 0.1710 -11.62 < 2e-16 ***

trout -1.7838 0.1591 -11.22 < 2e-16 ***

eel:koura 1.8773 0.3127 6.00 1.9e-09 ***

eel:bully 0.2463 0.2743 0.90 0.36920

eel:trout 1.3623 0.2241 6.08 1.2e-09 ***

koura:bully 0.7363 0.3305 2.23 0.02588 *

bully:trout 0.8959 0.2609 3.43 0.00059 ***

---

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 850.0360 on 15 degrees of freedom

Residual deviance: 3.1312 on 6 degrees of freedom

> fishy.fit.4 <- glm(count ~ eel + koura + bully + trout + eel:koura +

eel:trout + koura:bully + bully:trout,

family = "poisson", data = fishy.df)

> summary(fishy.fit.4)

Call:

glm(formula = count ~ eel + koura + bully + trout + eel:koura +

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 5.4569 0.0641 85.09 < 2e-16 ***

eel -1.4937 0.1355 -11.02 < 2e-16 ***

koura -3.1160 0.2630 -11.85 < 2e-16 ***

bully -1.9363 0.1596 -12.13 < 2e-16 ***

trout -1.8106 0.1583 -11.44 < 2e-16 ***

eel:koura 1.9013 0.3109 6.12 9.7e-10 ***

eel:trout 1.3934 0.2213 6.30 3.0e-10 ***

koura:bully 0.8300 0.3132 2.65 0.00804 **

bully:trout 0.9636 0.2494 3.86 0.00011 ***

---

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 850.0360 on 15 degrees of freedom

Residual deviance: 3.9273 on 7 degrees of freedom

> confint(fishy.fit.4)

Waiting for profiling to be done...

2.5 % 97.5 %

(Intercept) 5.32863 5.5801

eel -1.76581 -1.2340

koura -3.66910 -2.6328

bully -2.26073 -1.6339

trout -2.13076 -1.5094

eel:koura 1.30797 2.5336

eel:trout 0.96218 1.8307

koura:bully 0.20057 1.4340

bully:trout 0.47172 1.4517

> AIC(fishy.fit.1, fishy.fit.2, fishy.fit.3, fishy.fit.4)

df AIC

fishy.fit.1 16 101.936

fishy.fit.2 11 94.990

fishy.fit.3 10 93.067

fishy.fit.4 9 91.863
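
The AIC values are tied to the residual deviances reported earlier. AIC is minus twice the log-likelihood plus twice the number of parameters, and the residual deviance differs from minus twice the log-likelihood only by a constant that is the same for every model fitted to these data. The sketch below checks that relationship using the printed figures. The residual deviance of fishy.fit.1 is taken to be zero because it is the saturated model.

```r
# Residual deviances, parameter counts, and AIC values copied from the output.
dev <- c(fit.1 = 0,       fit.2 = 3.0544, fit.3 = 3.1312, fit.4 = 3.9273)
k   <- c(fit.1 = 16,      fit.2 = 11,     fit.3 = 10,     fit.4 = 9)
aic <- c(fit.1 = 101.936, fit.2 = 94.990, fit.3 = 93.067, fit.4 = 91.863)

# If AIC = deviance + 2k + constant, this difference is the same for all models.
shared.const <- aic - (dev + 2 * k)

print(round(shared.const, 3))

# Model with the smallest AIC.
print(names(which.min(aic)))
```

Up to rounding in the printed output, the four differences should coincide, which means comparing these models by AIC is equivalent to comparing deviance plus twice the number of parameters.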

(a) What are the assumptions of a Poisson regression model?

[2 marks]

(b) Write an equation to calculate the expected number of sites with a particular

combination of the presence/absence variables under the model

fishy.fit.4. Define any notation you use that is not obvious.

[2 marks]

(c) Consider the following code and output:

> anova(fishy.fit.2, fishy.fit.1, test = "Chisq")

Analysis of Deviance Table

Model 1: count ~ (eel + koura + bully + trout)^2

Model 2: count ~ eel * koura * bully * trout

  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         5       3.05
2         0       0.00  5     3.05     0.69

What is the null hypothesis being tested here? Refer to effects estimated

by the model fishy.fit.1 in your answer.

[3 marks]

(d) What can you conclude from the hypothesis test conducted in question

(c)?

[2 marks]

(e) The model fishy.fit.2 was simplified to fishy.fit.3, and then further

simplified to fishy.fit.4. Briefly state why you think that these were

sensible decisions.

[2 marks]

(f) Sketch the association graph for the model fishy.fit.4.

[3 marks]

(g) Describe the relationship between the following pairs of factors under the

model fishy.fit.4. For each pair, select one of these three possible answers:

(1) Independent

(2) Conditionally independent given other factors

(3) Dependent

If you select option (2), state which other factor(s) the independence is

conditional upon. Both pairs may have the same answer.

(i) bully and trout

[2 marks]

(ii) bully and eel

[2 marks]

(h) Assume the model fishy.fit.4 is the correct model. A freshwater biologist

is interested in the association between presence of the koura and

presence of the brown trout. They wish to simplify the contingency table

by collapsing over another factor.

(i) Is it appropriate to collapse over the factor eel? Briefly explain your

answer.

[2 marks]

(ii) Is it appropriate to collapse over the factor bully? Briefly explain

your answer.

[2 marks]

(i) The freshwater biologist believes that the longfin eel and the brown trout

avoid one another. In other words, holding all other variables constant,

sites with longfin eel are less likely to have brown trout present than sites

without longfin eel. Does the analysis above suggest that the biologist’s

belief is correct? Explain your answer.

[3 marks]

(j) Provide a 95% confidence interval for the odds ratio that quantifies the

association between the presence of koura and the presence of the upland

bully, holding all other variables constant.

[2 marks]

(k) Write a sentence interpreting your confidence interval from question (j).

[2 marks]

