Date: 2022-07-30 09:42


MATH5885 Longitudinal Data Analysis

Term 2, 2022

Project

Due 23:59, Sunday, 31st July (end of Week 9) via Moodle.

The project should be submitted via the Assignment tool, which is accessible via a clearly
indicated link in the Assessments subfolder on Moodle. You are allowed to work in pairs if you
wish. In that case, only one of the group members should submit the PDF file on Moodle, with the
names of both students clearly indicated and signed on the first page of the document. The
submitting student should add a cover page containing a copy of their student ID card (or
passport page if an ID card is not available), and write in their own handwriting:
“I declare that this assignment is my own work, except where acknowledged and I have read and
understood the University rules regarding Academic Misconduct”, and sign it.

You must upload ONE PDF file containing all your working. All the R material should be at the
back of the project’s PDF file and be titled “Appendix”. Please include sufficient working,
computer code (adequately documented and commented) and output (adequately explained) so that I
can follow what you have done. As has been known since George Box, “all models are wrong but some
are useful”, so I do not expect any two submitted projects to be identical.

Please note that there are page limitations for the MAIN PART of the report: a maximum of 12
pages, typed in minimum 12 pt font, with single line spacing and minimum 62.5px margins, single
sided. The main part should include mathematical summaries of the models fitted, essential R code
and output only, any essential tabular and graphical output with a narrative about how you
arrived at key modelling decisions, and your summary of findings or conclusions. You should also
describe any model deficiencies and suggest possible remedies. Further details are given below.

There are no page limitations for the appendix part of the report, which should contain the
complete R code and any additional graphs and tables, properly labelled so that the main report
can cross-reference them and so that I can quickly locate the relevant R code and additional
tables and graphs should that be needed. This is NOT a de facto extension to your report. Your
Part 1 Report should stand on its own and be readable without reference to the Appendix.

If you are not skilled at producing typeset reports, then neatly handwritten reports are
acceptable provided the specifications on font size, margins, line spacing etc. described above
are reasonably conformed to.

1 Project Background and Data

The project uses the CD4 dataset from DHLZ, introduced in Week 2. Please download the attached
text file cd4data.txt to use the data for your analysis. Any of the explanatory variables
included in the data set may be considered for inclusion in your model, as well as functions of
time. The response variable is CD4+ cell count but you may also wish to consider transformations
of the response. Basic background information is available in the following documents:

1. DHLZ-CD4-BasicDataAnalysis.pdf, which contains some basic data analysis from Diggle et al.

2. ZegerDiggle-1994-Biometrics.pdf, a published journal article that uses this dataset and
explains the variables observed in the study; see in particular their Section 5 for details.

The dataset consists of longitudinally collected observations on 369 subjects, resulting in a
total of 2376 observations of CD4 cell counts, denoted CD4 in the dataset. Other variables
collected are:

1. Time: the time (in years) since seroconversion, where a negative time denotes time before
seroconversion.

2. Age: age at seroconversion (a baseline measurement), centred at 30 years of age, so that
negative ages denote years younger than 30.

3. Packs: the number of packets of cigarettes smoked per day at time of measurement.

4. Drugs: a binary variable taking the value 1 or 0 to denote whether or not the respondent
takes recreational drugs, measured at each time point.

5. Sex: the number of sexual partners reported at each time point. It appears to have been
centred somehow and truncated at ±5.

6. Cesd: an index of depression measured at each time point, with time trends removed. Higher
scores indicate greater depressive symptoms.

Zeger and Diggle (1994) suggest (Section 5):

“The first objective of this analysis is to characterize the population average time course of
CD4 decay while accounting for the following additional predictor variables: smoking (packs per
day); recreational drug use (yes or no); numbers of sexual partners; and depression symptoms as
measured by the CESD scale (larger values indicate increased depressive symptoms). The analysis
was conducted on square-root-transformed CD4 numbers whose distribution is more nearly Gaussian”

Later they state:

“The linear regression coefficients (standard errors in parentheses) for the covariates age at
seroconversion (years), packs of cigarettes, recreational drug use (0: no, 1: yes), number of
sexual partners, and depression score are: .037 (.18), .27 (.15), .37 (.31), .10 (.038), and
-.058 (.015), respectively. Age plays little role. Smoking, recreational drug use, and increased
numbers of sexual partners are associated with higher CD4 cell numbers. This may reflect immune
response stimulation or simply selection bias whereby healthier men choose to continue these
practices. Increased depressive symptoms are significantly associated with decreased CD4 levels.
Again, a causal direction cannot be inferred from this analysis.”

These estimated regression coefficients seem to be those obtained by least squares in a model in
which (page 694): “μ(t) was approximated by a knotted cubic spline with seven equally spaced
knots.”

Note that the model of Zeger and Diggle uses the square root of the CD4 cell counts as the
response variable, with the other available variables as covariates. However, as they rightly
point out, these other variables cannot be inferred to cause the level of CD4 cell counts.
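One way to picture a “knotted cubic spline with seven equally spaced knots” is as extra columns in the regression design matrix. The sketch below is written in Python purely for illustration (the project itself must be done in R), and it assumes a truncated power basis; Zeger and Diggle do not say which basis they used. The function names and the time range are hypothetical.

```python
def equally_spaced_knots(t_min, t_max, n_knots=7):
    """Return n_knots interior knots equally spaced strictly inside (t_min, t_max)."""
    step = (t_max - t_min) / (n_knots + 1)
    return [t_min + step * (k + 1) for k in range(n_knots)]


def cubic_spline_row(t, knots):
    """Truncated power basis at time t: 1, t, t^2, t^3, then (t - k)^3_+ per knot."""
    row = [1.0, t, t ** 2, t ** 3]
    row.extend(max(t - k, 0.0) ** 3 for k in knots)
    return row


# Hypothetical time range in years since seroconversion (assumed, not read from the data).
knots = equally_spaced_knots(-3.0, 5.0)
# One design-matrix row per measurement time; covariate columns would be appended to these.
design = [cubic_spline_row(t, knots) for t in (-2.5, 0.0, 1.2, 4.0)]
print(knots)
print(len(design[0]))  # 4 polynomial columns plus one column per knot
```

Regressing the square root of CD4 on these columns together with the covariates gives a least squares fit of the kind described above. A natural spline basis, as used in the course script, additionally constrains the curve to be linear beyond the boundary knots.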

Available on Moodle is a document CD4InitialAnalysis.pdf. There is also an accompanying R script
file called CD4InitialAnalysis.R. These provide some preliminary exploratory data analysis and an
attempt to reproduce various results reported in Zeger and Diggle. As is often the case with
scientific papers, there is insufficient detail available to allow exact reproduction of the
findings. In particular, the point estimates and standard errors reported by Zeger and Diggle
cannot be reproduced despite best efforts to do so.

As a starting point, you should work through the R script file CD4InitialAnalysis.R to ensure
you understand what each part of it does. Then you should undertake your own analysis for the
project as described in the next section.

2 Project Aims

The aim of the project is to determine a suitable model for the square root of CD4 cell counts
as the response variable with covariates time (suitably modelled), age, cigarettes, CESD score,
drug use and partners.
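As a point of reference, one general form such a model might take is sketched below. The notation is an assumption made for illustration, not a prescribed specification: i indexes subjects and j their measurements, μ(t) is the mean time trend, x_ij collects the covariates, U_i is a subject-level random intercept, W_i(t) is a serially correlated process and Z_ij is pure measurement noise.

```latex
\begin{aligned}
Y_{ij} &= \sqrt{\mathrm{CD4}_{ij}}
        = \mu(t_{ij}) + x_{ij}^{\top}\beta + U_i + W_i(t_{ij}) + Z_{ij}, \\
U_i &\sim N(0, \nu^2), \qquad Z_{ij} \sim N(0, \tau^2), \\
\operatorname{Cov}\{W_i(s), W_i(t)\} &= \sigma^2 \rho(|s - t|), \\
\operatorname{Cov}(Y_{ij}, Y_{ik}) &= \nu^2 + \sigma^2 \rho(|t_{ij} - t_{ik}|)
        + \tau^2 \, \mathbf{1}\{j = k\}.
\end{aligned}
```

Common choices for the correlation function are the exponential form ρ(u) = exp(-u/φ) and the Gaussian form ρ(u) = exp(-(u/φ)²). Setting σ² = 0 reduces the covariance to compound symmetry.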

You should proceed as follows:

1. Using and adapting the techniques introduced in the course and in the above R script, perform
exploratory data analysis for the dataset in order to explore the mean structure, including the
impacts of the various covariates on the mean response, and to explore the covariance structure
of the random part of the model.

For example, this will include plots of individual and average profiles across time (possibly
stratified by levels of the other covariates), investigation of covariance structure, and any
other analyses you feel are relevant. Choose two or three preliminary fixed effects structures
based on this analysis. In particular you might want to model the response to time as a
combination of linear or other functions over segments of time. The model based on natural
splines is provided as a starting point to flexibly model the temporal trend in mean response.
But it may be possible to simplify this: up to you!

2. Fit these preliminary models using linear regression, comment on the significance of the
regression coefficients and obtain the residuals from these models.

3. You should consider possible components in the models for the covariance structure, including
compound symmetry, unequal variances, random error, and exponential or Gaussian autocorrelation
decay. Use correlation and/or variogram analysis to propose possible models for the covariance
of the residuals and any random effects components you may wish to include in the regression
specification. Compare your alternative models using appropriate statistical model fit criteria
and hypothesis tests. Select the “best” covariance model based on your analysis.

4. Consider whether your preliminary fixed effects structure needs to be adjusted in light of
the chosen covariance model and refit the adjusted model. State your conclusions.

5. Obtain the estimated covariance and correlation matrices for a selected patient with 7 or 8
measurements spanning (roughly evenly) time 0. Discuss how the variances vary with time, and how
the correlations vary with time between measurements.

6. Select four patients with 7 or 8 measurements spanning time 0. Try to select a range of
patients responding “high”, “medium” and “low” initially and over time. Use BLUPs to estimate
the individual trajectories for these patients and plot them on the same graph, along with their
observed levels of CD4 cell counts.
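To make the covariance components in items 3 and 5 concrete, the sketch below builds the covariance and correlation matrices implied by a random intercept, a serially correlated process and pure noise, and computes the within-subject variogram cloud from residuals. It is written in Python for illustration only (the project requires R). The function names, visit times and variance components are hypothetical and are not estimates from the CD4 data.

```python
import math


def covariance_matrix(times, nu2, sigma2, phi, tau2, kernel="exponential"):
    """Covariance of one subject's responses at the given measurement times.

    nu2: random-intercept variance, sigma2: serial-process variance,
    phi: correlation range (years), tau2: measurement-error variance.
    """
    def rho(u):
        if kernel == "exponential":
            return math.exp(-u / phi)
        if kernel == "gaussian":
            return math.exp(-(u / phi) ** 2)
        raise ValueError("kernel must be 'exponential' or 'gaussian'")

    n = len(times)
    cov = [[nu2 + sigma2 * rho(abs(times[j] - times[k])) for k in range(n)]
           for j in range(n)]
    for j in range(n):
        cov[j][j] += tau2  # pure noise only affects the diagonal
    return cov


def correlation_matrix(cov):
    """Rescale a covariance matrix to a correlation matrix."""
    sd = [math.sqrt(cov[j][j]) for j in range(len(cov))]
    return [[cov[j][k] / (sd[j] * sd[k]) for k in range(len(cov))]
            for j in range(len(cov))]


def variogram_cloud(times, residuals):
    """Within-subject pairs (time lag, half squared residual difference)."""
    pairs = []
    for j in range(len(times)):
        for k in range(j + 1, len(times)):
            lag = abs(times[k] - times[j])
            pairs.append((lag, 0.5 * (residuals[k] - residuals[j]) ** 2))
    return pairs


# Hypothetical visit times (years) and variance components, for illustration only.
times = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
cov = covariance_matrix(times, nu2=10.0, sigma2=15.0, phi=1.0, tau2=5.0)
corr = correlation_matrix(cov)
print([round(c, 3) for c in corr[0]])
```

In this stationary sketch every variance equals nu2 + sigma2 + tau2, so only the correlations change with the time between measurements. Allowing the variances themselves to vary with time would need an extra component such as a random slope or an explicit unequal-variance term.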

3 Your report

Write up a detailed report on your analysis. You should include:

Section 1: Introduction. A very brief summary of the situation, the data and the objectives of
your analysis and report.

Section 2: Exploratory data analysis. Briefly describe the exploratory data analysis and
summarize its results, including relevant graphical output.

Section 3: Model formulation. This is the major section summarizing the steps taken and models
tried in arriving at your final model.

Describe and justify your model selection procedure, saying why you chose to fit the models you
did.

Explain why you prefer the model for fixed effects and error structure you ended up choosing.

Formulate a model for the random errors in terms of random effects, serial dependence and pure
noise.

Write down the final fitted model for the mean response including standard errors and discussion
of significance of covariates.

Discuss the effect of the explanatory variables on the response.

Discuss the main features of the covariance structure.

Discuss the properties of the residuals in the model and any impact these may have on inferences
you make about model fit and significance of model terms.

Section 4: Application to individual trajectories. Include the results of the analyses specified
in items 5 and 6 of the Project Aims.

Section 5: Discussion of modelling. Discuss the difficulties you encountered with the analysis,
and the limitations of your model (if any).
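For the BLUP-based trajectories required in Section 4, it can help to see the calculation in its simplest case. The sketch below, in Python for illustration only, assumes a model with just a random intercept and independent noise, for which the BLUP has a closed form: the subject's mean residual shrunk towards zero. It deliberately ignores serial correlation, and the function names and all numbers are hypothetical. With a richer covariance model the BLUPs would instead be obtained from the fitted mixed model in R.

```python
def blup_random_intercept(residuals, nu2, tau2):
    """BLUP of a subject's random intercept under the model r_ij = U_i + Z_ij.

    residuals: observed response minus fitted population mean for one subject.
    nu2: random-intercept variance, tau2: independent noise variance.
    """
    n = len(residuals)
    shrinkage = n * nu2 / (tau2 + n * nu2)  # between 0 and 1
    return shrinkage * sum(residuals) / n


def predicted_trajectory(population_mean, residuals, nu2, tau2):
    """Subject-specific fitted values: population mean plus predicted intercept."""
    u_hat = blup_random_intercept(residuals, nu2, tau2)
    return [m + u_hat for m in population_mean]


# Hypothetical values on the square-root CD4 scale, for illustration only.
population_mean = [30.0, 29.0, 27.0, 25.0]
observed = [34.0, 32.0, 31.0, 27.0]
residuals = [y - m for y, m in zip(observed, population_mean)]
print(predicted_trajectory(population_mean, residuals, nu2=9.0, tau2=3.0))
```

The shrinkage factor approaches 1 as the number of measurements grows or as the noise variance falls relative to the between-subject variance, so patients with more data are pulled less strongly towards the population mean curve.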

The report’s quality will be assessed as if the report is for a decision maker who only wants
the key details in the main report but may want to easily access further detail in the
Appendices.

