Chapter 7
Assignment
Due: To Be Determined by Class Vote by the End of the First Week of Teaching
The penalty for late hand-in of coursework is 10% of the total marks per working day. No
credit will be given after more than 5 working days. A working day is a 24-hour period
starting from the original hand-in deadline, but skipping days on which the School is closed.
7.1 Individual Assignment (40%)
We will study the so-called South-African Heart-Disease (SAHD) data. Necessary background
is in [3, Section 4.4.2].
Briefly, observations on n = 462 individuals are available. On each individual, the data
consists of: a binary indicator (value 0 or 1) of the presence of Coronary Heart Disease
(1=have disease, 0=does not have disease), viewed as a response; and the following raw
inputs: systolic blood pressure; tobacco use; LDL, also known as “bad cholesterol”; (binary)
indicator of family history of heart disease; (measure of) obesity; (measure of) alcohol use;
and age. In data file SAHD.mat, the response variable is chd, and the inputs are sbp,
tobacco, ldl, famhist, obesity, alcohol, age respectively.
We want to regress the response on inputs, where an input may be a raw input or a function
of it (e.g., a quadratic function). The main tool is a logit (logistic) regression of the form
Yi ~ Bernoulli(μi) and independent, where logit(μi) = X
pj=1Xj,iβj (7.1)
where logit(μ) = log(μ/(1?μ)), and (Yi
, X1,i, . . . , Xp,i) is the i-th observation of the response
and the inputs. Another possibility is probit regression of the form
Yi ~ Bernoulli(μi) and independent, where μi = Φ(X
p
j=1
Xj,iβj ), (7.2)
where Φ() is the Normal(0,1) cdf.
[3, Section 4.4.2] fit logit models of the form (7.1). In [3, Section 5.2.2, pages 146-148], a
more advanced model in equation (5.6) there captures the effect of a raw input Xj on the
response by a spline (piecewise-polynomial) function hj (Xj ); the (estimated) functions are
shown in Figure 5.4 for selected inputs. These findings suggest the following:
Variables sbp, tobacco, and obesity may appear in (7.1) in quadratic form (b1x +b2x2,
where x is the raw input) or in cubic form (b1x + b2x2 + b3x3).
To capture the effect of age, which does not seem to resemble a low-order polynomial,
consider as inputs the binary indicators of categorized age, similar to what is done in [5,
Example 10]. For example, the sample deciles of age, namely 18, 28, 34, 40, 45, 49, 54,
58, 61, and 64, define the intervals (0, 18], (18, 28], (28, 34], ..., (61, 64], which give 10
categories of age.
47
Standard tasks (model estimation; testing the significance a certain coefficient; computing
the AIC; etc.) may be done via sahd.m Part 1, which is similar to sco.m (lab 3, Section
6.2.2). (Part 2 supports the group assignment.) A very thorough exploration could involve
transformed raw inputs beyond those seen there.
Those wishing to have matlab on a non-University computer should start at software.soton.ac.uk.
Install at least matlab (the basic product); and the toolboxes on statistics and optimization.
7.1.1 Deliverables and Marking Scheme
Submit a typed report to the Faculty Office, with the usual cover about academic integrity.
The audience is a hypothetical manager or analyst familiar with our notes, the brief and the
analyzes in [3] cited above.
Discuss results, and analysis as necessary, on the following:
1. Develop a preferred model of the response. A better (more parsimonious) model is often
found by removing an input Xj if the hypothesis “βj = 0” cannot be rejected at some level
α (the corresponding t value satisfies |t| < Φ1
(1 α/2), where Φ1
() is the Normal(0,1)
inverse cdf); and then re-fitting the model. Use of a model-selection criterion, such as
Akaike Information (AIC), is recommended.
2. Use the preferred model to describe the effect of each of obesity and age on the likelihood
of heart disease. For background on interpretation, see Section 4.5.4 and the second-fromlast
paragraph in [3, Section 4.4.2].
3. Brief explanation of how results were obtained can appear in an appendix. If significant
extensions are made to any codes provided to you during the course (sahd.m or other),
then: (a) submit the code on blackboard (individual assignment), and state in the report
the submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b)
provide instructions on how the code can be used.
Avoid re-stating the brief or the codes provided. Length: up to 400 words, excluding tables,
figures, and their captions.
Marking Scheme:
40%. Plausibility of the proposed model. The plausibility will be judged against the
findings in [3] and against experiments the manager could attempt; thus your analysis
should be informed by the above. For example, while Table 4.3 of [3] suggests that
obesity and systolic blood pressure do not have a significant effect, the reverse is found in
[3, Figure 5.4]; the latter suggests that appropriate functions of these inputs should enter
the regression (e.g., as a quadratic function, or a categorization).
30%. Correctness of estimation of any model being proposed.
30%. Quality of presentation (clarity, coherence, ease of reading).
7.2 Group Assignment (60%)
We wish to develop a classification (prediction) method of the response (0/1 indicator of heart
disease) based on the inputs. The aim is to minimize the expected loss (per observation,
as in (5.7)) under the following cost structure: correct classification costs nothing; misclassification
of class 0 (true is 0, prediction is 1) costs 1 unit; and mis-classification of class
1 (true is 1, prediction is 0) costs 10 units.
Similar to the analysis seen in lab 4, we consider classifiers defined by a Bernoulli (GLM)
model of the response and a classification threshold t, as in (6.1); and the main aim is to
minimize the expected loss by choice of a classifier. The more extensive the set of models
and thresholds considered, the smaller the loss we can expect. Moreover, a “better” model
(with an appropriate threshold) is more likely to minimize the loss; thus, the model chosen
for the individual assignment seems (a priori) a strong contender.
48
Script sahd.m, Part 2, implements an analysis of the form above, similar to cvsco.m (lab 4).
Specifically, a set of classifiers is developed; the expected (out-of-sample) loss is estimated
via cross-validation (section 5.3, equation (5.10)); and the classifier minimizing the estimated
loss is selected.
A second (subsequent) task is to estimate the expected loss, L say, of the selected classifier.
The CV of the selected classifier (minimum CV across all classifiers considered) is an optimistic
(biased low) estimate of L [3, Section 7.2, page 222]. To obtain an unbiased estimate,
the following is proposed: (a) on the first two thirds of the data, apply cross-validation for
classifier selection; and (b) on the last third of the data, the selected classifier is applied; the
average loss is an unbiased estimate of L. For this task, minor modification of sahd.m (or a
new code) may be necessary.
7.2.1 Deliverables and Marking Scheme
Submit a typed report to the Faculty Office, with the usual cover about academic integrity.
The audience is a hypothetical manager or analyst familiar with our notes, the brief and the
analyzes in [3] cited above.
Discuss results, and analysis as necessary, on the following:
1. State a preferred classifier, with the aim of minimizing the expected loss. In estimating
the loss, leave-one-out cross-validation (Section 5.3) seems preferable (the modest size of
the sample makes this practical).
2. Provide an unbiased estimate of the expected loss L of the selected classifier.
3. Brief explanation of how results were obtained can appear in an appendix. If significant
extensions are made to any codes provided to you during the course (sahd.m or other),
then: (a) submit the code on blackboard (group assignment), and state in the report the
submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b)
provide instructions on how the code can be used.
Avoid re-stating the brief or the codes provided. Length: up to 500 words, excluding tables,
figures, and their captions.
Marking Scheme:
50%. Thoroughness of exploration of potential classifiers, and success in minimizing the
expected loss.
20%. Correctness of estimation of the expected loss L of the selected classifier.
30%. Quality of presentation: clarity, coherence, ease of reading.
Group Work
Work in a group of size up to 4. Each group makes a single submission of the deliverables.
Group members are responsible for functioning as a team. A common mark will be given
to the members of a group. In exceptional cases where there is strong evidence of lack of
contribution by a member, a correction may be made. A key requirement for making such a
correction is majority opinion (i.e., 2 or more against 1). To help resolve such cases, keeping
minutes of work meetings may be appropriate.
49
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。