MTHM017 Advanced Topics in Statistics

Assignment

Please make sure that the submitted work is your own. This is NOT a group assignment, therefore approaches and solutions shouldn't be discussed with other students. Plagiarism and collusion with other students are examples of academic misconduct and will be reported. More information on academic honesty can be found here.

The assignment has two main parts. Part A involves (i) fitting a Poisson regression model to assess the effect of using different priors, and (ii) fitting a fixed effects and a random effects model in order to compare posterior distributions under different models. Part B involves using different methods for classification of data into two groups.

A. Bayesian Inference [66 marks]

1. The first question of part A involves fitting a Poisson regression model using the Scotland dataset, which contains the observed and expected counts of lip cancer cases for the local authority administrative areas in Scotland between 1975 and 1986.

a. [3 marks] Calculate the Standard Mortality Ratios (SMR = observed/expected) for each administrative area and plot the distribution of the SMRs. Then plot a map of the SMRs by administrative area. To map the SMRs you may use the ScotlandMap function readily available on the ELE page.

The code below uses random numbers to demonstrate how the ScotlandMap function works. Note that in order for this to run you need to add the shapefiles for Scotland (.shx, .shp, .prj, .dbf) to your working directory. The ScotlandMap.R file and the shapefiles are available on the ELE page.

library(tidyverse)
library(RColorBrewer)
library(sf)
library(rgdal)
source("ScotlandMap.R")   # need to read in the ScotlandMap function

testdat <- runif(56)      # generate random numbers to use as observations
ScotlandMap(testdat, figtitle = "Scotland random numbers")
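
For the SMR calculation itself, a minimal sketch (not a model answer) is given below; it assumes the Scotland data have been read into a data frame called scotland with columns named Observed and Expected, which are hypothetical names used for illustration only.

# Illustrative sketch: 'scotland', 'Observed' and 'Expected' are assumed names.
scotland$SMR <- scotland$Observed / scotland$Expected    # SMR = observed/expected
hist(scotland$SMR, breaks = 20, main = "Distribution of SMRs", xlab = "SMR")
ScotlandMap(scotland$SMR, figtitle = "SMR by administrative area")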

We are interested in estimating the relative risk (RR) for each area. The relative risk is the parameter that quantifies whether an area i has a higher (RR_i > 1) or lower (RR_i < 1) occurrence of cases than what we would expect from the reference rates. In order to estimate the relative risk we are going to fit a Poisson model of the following form:

Obs_i ~ Pois(μ_i)
log(μ_i) = log(Exp_i) + β0 + θ_i
RR_i = exp(β0) exp(θ_i)

where θ_i ~ Normal(0, σ²). Here, the Exp(ected) numbers are an 'offset', which is treated as fixed data (i.e. no coefficient is estimated for this term).

b. [3 marks] Describe the role (interpretation) of β0 and the set of θ's in this model and how they contribute to the estimation of RR. Focus on the meaning of these parameters in the estimation of the relative risk.


c. [14 marks] Code up this Poisson(-Lognormal) model in JAGS to analyse the Scotland data. For the parameter β0 use a Normal prior with mean 0 and variance 1000, and for the precision parameter τ (= 1/σ²) of θ_i use a prior τ ~ Gamma(1, 0.05) (which is parametrised so as to have a mean of 20). Initialise 2 chains and run the model with these two chains. You will have to decide on the appropriate values of n.iter and burnin. Produce trace plots for the chains and summaries of all the model parameters. Investigate whether the chains for all the parameters have converged.
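
The sketch below shows one possible way (not a model answer) such a Poisson-lognormal model might be written as an R function for use with R2jags, in the same function style as the surgical model given later in the assignment; the names scot.mod, Obs, Exp and N are assumptions used for illustration.

# Illustrative sketch only; 'Obs', 'Exp' and 'N' are assumed names for the
# observed counts, expected counts and number of areas.
scot.mod <- function(){
  for(i in 1:N){
    Obs[i] ~ dpois(mu[i])
    log(mu[i]) <- log(Exp[i]) + beta0 + theta[i]
    theta[i] ~ dnorm(0, tau)
    RR[i] <- exp(beta0)*exp(theta[i])
  }
  beta0 ~ dnorm(0, 1.0E-3)   # Normal(0, variance 1000), written as precision 0.001
  tau ~ dgamma(1, 0.05)      # Gamma(1, 0.05): prior mean 1/0.05 = 20
}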

d. [4 marks] Extract the posterior means for the RR and map them. Compare the posterior mean relative risk to the SMR for each area. Note that we can think about the SMR as the maximum likelihood estimate of the relative risk.

e. [7 marks] Now calculate the posterior probabilities that the relative risk in each area exceeds 1. Extract these probabilities and map them. For which areas are you at least 95% certain that there is an excess risk of lip cancer (i.e. relative risk > 1)?
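
As an illustration of one way these exceedance probabilities might be obtained (a sketch only, assuming the posterior draws of RR are held in an iterations-by-areas matrix called RR.samples, e.g. taken from the sims.list of an R2jags fit):

# Illustrative sketch: 'RR.samples' is an assumed (iterations x areas) matrix of RR draws.
prob.exceed <- apply(RR.samples, 2, function(x) mean(x > 1))   # P(RR_i > 1) for each area
ScotlandMap(prob.exceed, figtitle = "Posterior P(RR > 1)")
which(prob.exceed >= 0.95)   # areas with at least 95% posterior probability of excess risk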

f. [6 marks] Repeat the analysis with different priors for β0 and τ. The exact choice is yours, but explain why you have chosen them and what they mean. Map the two sets of RRs and explain any differences you see in the map and the summaries of the posteriors for the parameters of the model.

2. In this question we will analyse the effect different models have on the posterior distribution. We will use the surgical.csv file, which contains data on 12 hospitals, where the column n gives the total number of heart surgery operations carried out on children under 1 year old by each centre between April 1991 and March 1995, and the column r gives the number of these operations where the patient died within 30 days of the operation.

A binomial model seems reasonable for these data, so we will focus on comparing different models for the hospital-specific mortality rates.

We will first fit a logistic random effects model of the following form:

r_i ~ Binom(n_i, θ_i),
logit(θ_i) ~ N(μ, σ²).

This model assumes that failure rates (θ_i) across hospitals are similar in some way. This similarity is represented by the assumption that the mortality rates for the hospitals are drawn from the same (parent) distribution.

a. [9 marks] Code this model in JAGS to analyse the surgical data. Use vague priors for the parameters of the shared normal distribution. In particular you can use the following model definition:

jags.mod <- function(){
  for(i in 1:12){
    r[i] ~ dbin(theta[i], n[i])
    logit(theta[i]) <- logit.theta[i]
    logit.theta[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 1.0E-3)
  tau <- 1/sigma^2      # precision expressed in terms of the standard deviation
  sigma ~ dunif(0, 100)
}

Run the model for 10,000 iterations, with 2 chains, discarding the first 5,000 as 'burn-in'. Produce trace plots for the chains and summaries for the fitted parameters. Comment on whether the chains for all the parameters have converged.
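
As a rough guide (not a prescribed solution), the model could be run from R with the R2jags package along the following lines; the object names surgical and surg.fit are assumptions:

# Illustrative sketch: assumes 'surgical' holds the data read from surgical.csv.
library(R2jags)
surg.data <- list(n = surgical$n, r = surgical$r)
surg.fit <- jags(data = surg.data,
                 parameters.to.save = c("theta", "mu", "sigma"),
                 model.file = jags.mod,
                 n.chains = 2, n.iter = 10000, n.burnin = 5000)
print(surg.fit)       # posterior summaries of the monitored parameters
traceplot(surg.fit)   # trace plots for assessing convergence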

b. [8 marks] We want to know whether the assumptions of the above random effects model are reasonable, or whether there is any evidence that the mortality rates for one or more of the hospitals are not drawn from the same parent distribution. We will assess this by comparing the observed and predicted number of deaths in each hospital.


Edit your JAGS model so that it (i) finds the posterior prediction of the number of deaths (r_i^pred) in each hospital, and (ii) computes the following posterior probability for each hospital:

p_i = P(r_i^pred > r_i) + (1/2) P(r_i^pred = r_i).

The above probability is called the 'mid p-value'. This allows us to summarise conflict between discrete quantities.

Examining the mid p-values, are there any hospitals that appear to have an unusual mortality rate? Confirm your conclusion by producing kernel density plots of the predicted number of deaths in each hospital, and visually comparing these to the observed number of deaths in each hospital.
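
One possible way to obtain these quantities (a sketch, not a model answer) is to draw a replicate count inside the JAGS model, e.g. r.pred[i] ~ dbin(theta[i], n[i]), monitor r.pred, and then compute the mid p-values in R from the posterior draws; here r.pred.samples is an assumed iterations-by-hospitals matrix of those draws:

# Illustrative sketch: 'r.pred.samples' is an assumed matrix of posterior predictive draws.
mid.p <- sapply(1:12, function(i)
  mean(r.pred.samples[, i] > surgical$r[i]) +
  0.5 * mean(r.pred.samples[, i] == surgical$r[i]))
round(mid.p, 3)   # one mid p-value per hospital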

c. [8 marks] As an alternative model we will consider the independent (fixed) effects model, which has the following hospital-specific mortality rates:

θ_i ~ Unif(0, 1), for i = 1, 2, ..., 12.

Fit this alternative model, and compare the posterior distributions of the hospital mortality rates under the random effects and the fixed effects models. Interpret the differences. (Hint: the comparison is easiest to do by producing box plots of the mortality rates θ under each model.)
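
For illustration, the independent effects model could be written in the same function style as before (a sketch; jags.mod.fixed is an assumed name):

# Illustrative sketch of the independent (fixed) effects model.
jags.mod.fixed <- function(){
  for(i in 1:12){
    r[i] ~ dbin(theta[i], n[i])
    theta[i] ~ dunif(0, 1)   # an independent uniform prior for each hospital's rate
  }
}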

d. [4 marks] Compute the mid p-values for the fixed effects model and compare these to the mid p-values of the random effects model. Does the fixed effects model better explain the unusual mortality rates?

B. Classification [34 marks]

The following figure shows the information in the dataset Classification.csv - it shows two different groups, plotted against two explanatory variables. This is simulated data - the aim is to find a suitable method for classifying the 1000 data points into two groups from a selection of possible approaches.

1. [5 marks] Summarise the two groups (separately) in terms of the variables X1 and X2, and for each group plot the distribution of both variables. Describe your findings.
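
As an indication of the kind of summary intended (a sketch only, assuming Classification.csv has columns X1, X2 and a group label called Group):

# Illustrative sketch: column names X1, X2 and Group are assumptions.
library(tidyverse)
dat <- read_csv("Classification.csv")
dat %>% group_by(Group) %>%
  summarise(across(c(X1, X2), list(mean = mean, sd = sd)))
ggplot(dat, aes(x = X1, fill = factor(Group))) + geom_density(alpha = 0.5)
ggplot(dat, aes(x = X2, fill = factor(Group))) + geom_density(alpha = 0.5)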


2. [4 marks] Considering the plot showing the observations, your density plots and the numerical summaries, which of the following classification methods do you think are suitable for classifying these data? For each method, explain the reasons behind your answer (i.e. why the given method is suitable, or why it might not be suitable).

a. Linear discriminant analysis.
b. Quadratic discriminant analysis.
c. K-nearest neighbour classification.
d. Support vector machines.
e. Random forests.

3. [1 mark] Select 80% of the data to act as a training set, with the remaining 20% for testing/evaluation.
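
One simple way to make such a split (a sketch, reusing the assumed data frame dat from above):

# Illustrative sketch: an 80/20 random split of the 1000 rows.
set.seed(1)   # for reproducibility
train.idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
train <- dat[train.idx, ]
test  <- dat[-train.idx, ]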

4. [22 marks] Choose three of the methods listed in Q2 that might be suitable to classify the data. Perform classification using these methods. (If you use more than three of the listed methods, only the first three will be considered for marking.) In each case, briefly describe how the given classification method classifies the data, present the results of an evaluation of the model fit (highlighting different aspects of the model performance) and describe your findings. Where appropriate, optimise the (hyper)parameters of the method.
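
As an example of the kind of workflow intended, the sketch below fits one possible method (a random forest, via the randomForest package) and evaluates it on the test set; it assumes the train and test objects from question 3 and the column names used above, and is not a prescribed solution:

# Illustrative sketch: fit one candidate classifier and evaluate it on the held-out data.
library(randomForest)
rf.fit <- randomForest(factor(Group) ~ X1 + X2, data = train)
rf.pred <- predict(rf.fit, newdata = test)
table(predicted = rf.pred, observed = test$Group)   # confusion matrix on the test set
mean(rf.pred == test$Group)                         # test-set accuracy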

5. [2 marks] Compare the results from your chosen three approaches and select what you think is the best method for classification in this case, explaining your reasoning.

Total for paper = 100 marks

The deadline for submission is Noon (12pm), 31st March. Note that late submissions will be penalised.

You should submit a pdf that contains your answers (and relevant output/plots!) to the questions via eBart. In Part A you should use the R programming language, but in Part B you can choose to use R or Python (or both).

