Assignment 2, Question 3 MAST90125: Bayesian
Statistical Learning
Due: Friday 20 September 2019
There are places in this assignment where R code will be required. Therefore set the random
seed so assignment is reproducible.
set.seed(123456) #Please change random seed to your student id number.
Question Three (18 marks)
A group of 453 Bangladeshi women in 5 districts were asked about contraceptive use. The response variable
use is an indicator for contraceptive use (coded N for no and Y for yes). Other covariates of interest are
categorical variables for geographical location district (5 levels), and urban (2 levels), and number of living
children livch (4 levels), and the continuous covariate for standardised age age.A random intercept for the
district was suggested. This suggested the following model should be fitted,
θ = Zu + Xβ,
where θ is a link function, Z is an indicator variable for district, u is a random intercept with prior
p(u) = N (0, σ2u
I), and X is a design matrix for fixed effects β, where β includes the coefficients for the
intercept, urban status, living children, and age.
Data can be downloaded from LMS as Contraceptionsubset.csv.
a) Fit a generalised linear mixed model assuming a logistic link using Stan. The R and stan code below
covers the following steps.
• Importing the data.
• Constructing design matrices.
• Provides code to go into the stan file.
• Running stan in R. This assumes your stan file is called *logitmm.stan*, and that you will run the
sampler for 2000 iterations and 4 chains.
Note that provided code assumes everything required is located in your working directory in R.
#Step one: Importing data, constructing design matrices and calculating matrix dimensions.
dataX= read.csv("Contraceptionsubset.csv",header=TRUE)
n<-dim(dataX)[1]
Z = table(1:n,dataX$district) #incidence matrix for district
Q = dim(Z)[2]
D1 = table(1:n,dataX$livch) #Dummy indicator for living children
D2 = table(1:n,dataX$urban) #Dummy indicator for urban status
#fixed effect design matrix
X = cbind(rep(1,n),dataX$age,D1[,-1],D2[,-1])
P = dim(X)[2]
1
y = rep(0,n)
y[dataX$use %in% 'Y'] = 1
An example stan file.
//
// This Stan program defines a logistic mixed model
//
// Learn more about model development with Stan at:
//
// http://mc-stan.org/users/interfaces/rstan.html
// https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started
//
data {
int<lower=0> n; //number of observations
int<lower=0> Q; //number of random effect levels
int<lower=0> P; //number of fixed effect levels
int y[n]; //response vector
matrix[n,Q] Z; //indicator matrix for random effect levels
matrix[n,P] X; //design matrix for fixed effects
}
// The parameters accepted by the model.
// accepts three sets of parameters 'beta', 'u' and 'sigma'.
parameters {
vector[P] beta; //vector of fixed effects of length P.
vector[Q] u; //vector of random effects of length Q.
real<lower=0> sigma; //random effect standard deviation
}
// The model to be estimated. We model the output
// 'y' to be bernoulli with logit link function,
// and assume a i.i.d. normal prior for u.
model {
u ~ normal(0,sigma); //prior for random effects.
y ~ bernoulli_logit(X*beta+ Z*u); //likelihood
}
library(rstan)
logistic.mm <-stan(file="logitmm.stan",data=c('Z','X','y','n','P','Q'),iter=2000,chains=4)
print(logistic.mm)
Note that in Stan, defaults for burn-in (warm-up) is one half of all iterations in stan, and no thinning. Note
the code is written using the stan file and csv is in your working directory. Use the print function to report
posterior means, standard deviations, 95 % central credible intervals and state from the output whether you
believe the chains have converged. Also report the reference categories for urban and livch.
b) An alternative to the logit link when analysing binary data is the probit. The probit link is defined as,
yi =1 if zi ≥ 0
In lecture 14, we showed how by letting zi be normal, probit regression can be fitted using a Gibbs sampler,
but to do so, it requires the ability to sample from a truncated normal defined on either (−∞, 0) (if yi = 0)
or (0, ∞) (if yi = 1). Check by comparing the empirical and the true density that a modified version of
the inverse cdf method can be used to produce draws from a truncated normal. Do this for the case where
x ∈ (0, ∞) and x ∈ (−∞, 0) with parameters µ = 0.5 and σ = 1.
Hints: If y is drawn from a truncated normal with lower bound a, upper bound b and parameters µ, σ2
which in R means the truncated normal density can be written as
dnorm(x,mean=mu,sd=sigma)/(pnorm(b,mean=mu,sd=sigma)-pnorm(a,mean=mu,sd=sigma))
The inverse cdf method involves drawing v from U(0, 1) so that x ∼ p(x) can be found solving x = F
where F is the cdf. If the only change compared to drawing from a normal distribution is truncation, think
about what happens to the bounds of the uniform distribution.
c) Implement a Gibbs sampler to fit the same mixed model as fitted in Stan in a), but now with a probit
link. As before, fit 4 chains, each running for 2000 iterations, with the first 1000 iterations discarded as
burn-in. Perform graphical convergence checks and Gelman-Rubin diagnostics. Report posterior means,
standard deviations and 95 % central credible intervals for σ, β, u by combining chains.
d) For the co-efficients β, u, calculate the mean of the ratio of the posterior means βi,logit/βi,probit, ui,logit/ui,probit
obtained when fitting the logistic mixed model and the probit mixed model. To do this, you will need to
apply the extract function to the stan model object. Once calculated, multiply the iterations obtained
assuming a probit link by this constant and compare to the iterations obtained assuming a logit link.
e) The logistic link can be written in the same way as the probit link, but instead of ei ∼ N (0, 1), the
error term is ei ∼ Logistic(0, 1). By evaluating the standard normal and logistic inverse cdfs and
superimposing the line y = mx where m is the posterior ratio, do you think the results in d) were
surprising.
3
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。