Exercises for
Statistical Modeling: A Fresh Approach
Daniel Kaplan
Second Edition, Copyright (c) 2012, v2.0
Contents
1 Introduction
2 Data: Cases, Variables, Samples
3 Describing Variation
4 Group-wise Models
5 Confidence Intervals
6 The Language of Models
7 Model Formulas and Coefficients
8 Fitting Models to Data
9 Correlation and Partitioning of Variance
10 Total and Partial Relationships
11 Modeling Randomness
12 Confidence in Models
13 The Logic of Hypothesis Testing
14 Hypothesis Testing on Whole Models
15 Hypothesis Testing on Parts of Models
16 Models of Yes/No Variables
17 Causation
18 Experiment
Chapter 1
Introduction
Reading Questions
1. How can a model be useful even if it is not exactly
correct?
2. Give an example of a model used for classification.
3. Often we describe personalities as “patient,” “kind,”
“vengeful,” etc. How can these descriptions be used as
models for prediction?
4. Give three examples of models that you use in everyday
life. For each, say what is the purpose of the model and
in what ways the representation differs from the real
thing.
5. Make a sensible statement about how precisely these
quantities are typically measured:
The speed of a car.
Your weight.
The national unemployment rate.
A person’s intelligence.
6. Give an example of a controlled experiment. What
quantity or quantities have been varied and what has
been held constant?
7. Using one of your textbooks from another field, pick an
illustration or diagram. Briefly describe the illustration
and explain how this is a model, in what ways it is faithful
to the system being described and in what ways it
fails to reflect that system.
Prob 1.01 Many fields of natural and social science have
principles that are identified by name. Sometimes these are
called “laws,” sometimes “principles”, “theories,” etc. Some
examples:
Kepler’s Law
Newton’s Laws of Motion
Ohm’s Law
Grimm’s Law
Nernst equation
Raoult’s Law
Nash equilibrium
Boyle’s Law
Zipf’s Law
Law of diminishing marginal utility
Pareto principle
Snell’s Law
Hooke’s Law
Fitts’s Law
Laws of supply and demand
Ideal gas law
Newton’s law of cooling
Le Chatelier’s principle
Poiseuille’s law
These laws and principles can be thought of as models.
Each is a description of a relationship. For instance, Hooke’s
law relates the extension and stiffness of a spring to the force
exerted by the spring. The laws of supply and demand relate
the quantity of a good to its price and postulate that the
market price is established at the equilibrium of supply and
demand.
Pick a law or principle from an area of interest to you —
chemistry, linguistics, sociology, physics, ... whatever. Describe
the law, what quantities or qualities it relates to one
another, and the ways in which the law is a model, that is, a
representation that is suitable for some purposes or situations
and not others.
Enter your answer here:
An example is given below.
EXAMPLE: As described in the text, Hooke’s
Law, f = −kx, relates the force (f), the stiffness
(k) and the extension past resting length (x) for a
spring. It is a useful and accurate approximation
for small extensions. For large extensions, however,
springs are permanently distorted or break.
Springs involve friction, which is not included in
the law. Some springs, such as passive muscle,
are really composites and show a different pattern,
e.g., f = −k x³/|x| for moderate-sized extensions.
Prob 1.02 NOTE: Before starting, instruct R to use the
mosaic package:
> require(mosaic)
Each of the following statements has a syntax mistake.
Write the statements properly and give a sentence saying what
was wrong. (Cut and paste the correct statement from R,
along with any output that R gives and your sentence saying
what was wrong in the original.)
Here’s an example:
QUESTION: What’s wrong with this statement?
> a = fetchData(myfile.csv)
ANSWER: It should be
> a = fetchData("myfile.csv")
The file name is a character string and therefore
should be in quotes. Otherwise it’s treated as an
object name, and there is no object called
myfile.csv.
Now for the real thing. Say what’s wrong with each of
these statements for the purpose given:
(a) > seq(5;8) to give [1] 5 6 7 8
A Nothing is wrong.
B Use a comma instead of a semi-colon to separate
arguments: seq(5,8).
C It should be seq(5 to 8).
(b) > cos 1.5 to calculate the cosine of 1.5 radians
(c) > 3 + 5 = x to make x take the value 3+5
(d) > sqrt[4*98] to find the square root of 392
(e) > 10,000 + 4,000 adding two numbers to get 14,000
(f) > sqrt(c(4,16,25,36))=4 intended to give
[1] FALSE TRUE FALSE FALSE
(g) > fruit = c(apple, berry, cherry) to create a collection
of names
[1] "apple" "berry" "cherry"
(h) > x = 4(3+2) where x is intended to be assigned the
value 20
(i) > x/4 = 3+2 where x is intended to be assigned the value
20
Prob 1.04 The operator seq generates sequences. Use seq
to make the following sequences:
1. the integers from 1 to 10
2. the integers from 5 to 15
3. the integers from 1 to 10, skipping the even ones
4. 10 evenly spaced numbers between 0 and 1
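If seq is new to you, here is a minimal sketch of its main
calling patterns, deliberately using different numbers than the
exercise asks for:
> seq(2, 6)  # consecutive integers
[1] 2 3 4 5 6
> seq(2, 6, by=2)  # step size of 2
[1] 2 4 6
> seq(0, 2, length=5)  # 5 evenly spaced numbers
[1] 0.0 0.5 1.0 1.5 2.0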
Prob 1.05 According to legend, when the famous mathematician
Johann Carl Friedrich Gauss (1777-1855) was a
child, one of his teachers tried to distract him with busy work:
add up the numbers 1 to 100. Gauss did this easily and immediately
without a computer. But using the computer, which
of the following commands will do the job?
A sum(1,100)
B seq(1,100)
C sum of seq(1,100)
D sum(seq(1,100))
E seq(sum(1,100))
F sum(1,seq(100))
Prob 1.10 Try the following command:
> seq(1,5,by=.5,length=2)
Why do you think the computer responded the way it did?
Prob 1.11 1e6 and 10e5 are two different ways of writing
one million. Write 5 more different forms of one million using
scientific notation.
Prob 1.12 The following statement gives a result that some
people are surprised by
> 10e3 == 1000
[1] FALSE
Explain why the result was FALSE rather than TRUE.
A 10e3 is 100, not 1000
B 10e3 is 10000, not 1000
C 10e3 is not a valid number
D It should be true; there’s a bug in the software.
Chapter 2
Data: Cases, Variables, Samples
Reading Questions
1. What are the two major different kinds of variables?
2. How are variables and cases arranged in a data frame?
3. How is the relationship between these things: population,
sampling frame, sample, census?
4. What’s the difference between a longitudinal and a
cross-sectional sample?
5. What are some types of sampling that can lead to the
sample being unrepresentative of the population?
Prob 2.02 Using the tally operator and the comparison
operators (such as > or ==), answer the following questions
about the CO2 data. You can read in the CO2 data with the
statement
CO2 = fetchData("CO2")
You can see the data set itself by giving the command
CO2
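As a sketch of the counting idiom (the criterion conc >= 1000
is deliberately different from anything asked below), a
comparison inside tally counts how many cases meet a criterion,
and criteria can be combined with & (and) or | (or):
> tally( ~ conc >= 1000, data=CO2 )
> tally( ~ conc >= 1000 & Type == "Quebec", data=CO2 )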
In this exercise, you will use R commands to count how
many of the cases satisfy various criteria:
1. How many of the plants in CO2 are Mc1 for Plant?
7 12 14 21 28 34
2. How many of the plants in CO2 are either Mc1 or Mn1?
8 12 14 16 23 54 92
3. How many are Quebec for Type and nonchilled for
Treatment?
8 12 14 16 21 23 54 92
4. How many have a concentration (conc) of 300 or bigger?
12 24 36 48 60
5. How many have a concentration between 300 and 450
(inclusive)?
12 24 36 48 60
6. How many have a concentration between 300 and 450
(inclusive) and are nonchilled?
6 8 10 12 14 16
7. How many have an uptake that is less than 1/10 of the
concentration (in the units reported)?
17 33 34 51 68
Prob 2.04 Here is a small data frame about automobiles.
Make and model   Vehicle type   Trans. type   # of cyl.   City MPG   Hwy MPG
Kia Optima       compact        Man.          4           21         31
Kia Optima       compact        Auto.         6           20         28
Saab 9-7X AWD    SUV            Auto.         6           14         20
Saab 9-7X AWD    SUV            Auto.         8           12         16
Ford Focus       compact        Man.          4           24         35
Ford Focus       compact        Auto.         4           24         33
Ford F150 2WD    pickup         Auto.         8           13         17
(a) What are the cases in the data frame?
A Individual car companies
B Individual makes and models of cars
C Individual configurations of cars
D Different sizes of cars
(b) For each case, what variables are given? Are they categorical
or quantitative?
• Kia Optima: not.a.variable categorical quantitative
• City MPG: not.a.variable categorical quantitative
• Vehicle type: not.a.variable categorical quantitative
• SUV: not.a.variable categorical quantitative
• Trans. type: not.a.variable categorical quantitative
• # of cyl.: not.a.variable categorical quantitative
(c) Why are some cars listed twice? Is this a mistake in the
table?
A Yes, it’s a mistake.
B A car brand might be listed more than
once, but the cases have different attributes
on other variables.
C Some cars are more in demand than others.
Prob 2.09 Here is a data set from an experiment about how
reaction times change after drinking alcohol. The measurements
give how long it took for a person to catch a dropped
ruler. One measurement was made before drinking any alcohol.
Successive measurements were made after one standard
drink, two standard drinks, and so on. Measurements are in
seconds.
Before After 1 After 2 After 3
0.68 0.73 0.80 1.38
0.54 0.64 0.92 1.44
0.71 0.66 0.83 1.46
0.82 0.92 0.97 1.51
0.58 0.68 0.70 1.49
0.80 0.87 0.92 1.54
and so on ...
(a) What are the rows in the above data set?
A Individual measurements of reaction time.
B An individual person.
C The number of drinks.
(b) How many variables are there?
A One — the reaction times.
B Two — the reaction times with and without
alcohol.
C Four — the reaction times at four different
levels of alcohol.
The format used for these data has several limitations:
• It leaves no room for multiple measurements of an individual
at one level of alcohol, for example, two or three
baseline measurements, or two or three measurements
after one standard drink.
• It provides no flexibility for different levels of alcohol,
for example 1.5 standard drinks, or for taking into account
how long after the drink the measurement was made.
Another format, which would be better, is this:
SubjectID ReactionTime Drinks
S1 0.68 0
S1 0.73 1
S1 0.80 2
S1 1.38 3
S2 0.54 0
S2 0.64 1
S2 0.92 2
and so on ...
What are the cases in the reformatted data frame?
A Individual measurements of reaction time.
B An individual person.
C The number of drinks.
How many variables are there?
A The same as in the original version. It’s the
same data!
B Three — the subject, the reaction time, the
alcohol level.
C Four — the reaction times at four different
levels of alcohol.
The lack of flexibility in the original data format indicates
a more profound problem. The response to alcohol is not just
a matter of quantity, but of timing. Drinks spread out over
time have less effect than drinks consumed rapidly, and the
physiological response to a drink changes over time as the
alcohol is first absorbed into the blood and then gradually
removed by the liver. Nothing in this data set indicates how
long after the drinks the measurements were taken. The small
change in reaction time after a single drink might reflect that
there was little time for the alcohol to be absorbed before the
measurement was taken; the large change after three drinks
might actually be the response to the first drink finally kicking
in. Perhaps it would have been better to make a measurement
of the blood alcohol level at each reaction-time trial.
It’s important to think carefully about how to measure
your variables effectively, and what you should measure in
order to capture the effects you are interested in.
Prob 2.14 Sometimes categorical information is represented
numerically. In the early days of computing, it was
very common to represent everything with a number. For
instance the categorical variable for sex, with levels male or
female, might be stored as 0 or 1. Even categorical variables
like race or language, with many different levels, can be represented
as a number. The codebook provides the interpretation
of each number (hence the word “codebook”).
Here is a very small part of a dataset from the 1960s used
to study the influence of smoking and other factors on the
weights of babies at birth (gestation.csv).
gest. wt race ed wt.1 inc smoke number
284 120 8 5 100 1 0 0
282 113 0 5 135 4 0 0
279 128 0 2 115 2 1 1
244 138 7 2 178 98 0 0
245 132 7 1 140 2 0 0
351 140 0 5 120 99 3 2
282 144 0 2 124 2 1 1
279 141 0 1 128 2 1 1
281 110 8 5 99 2 1 2
273 114 7 2 154 1 0 0
285 115 7 2 130 1 0 0
255 92 4 7 125 1 1 5
261 115 3 2 125 4 1 5
261 144 0 2 170 7 0 0
At first glance, all of the data seems quantitative. But
read the codebook:
gest. - length of gestation in days
wt - birth weight in ounces (999 unknown)
race - mother’s race
0-5=white 6=mex 7=black 8=asian
9=mixed 99=unknown
ed - mother’s education
0= less than 8th grade,
1 = 8th -12th grade - did not graduate,
2= HS graduate--no other schooling ,
3= HS+trade,
4=HS+some college
5= College graduate,
6&7 Trade school HS unclear,
9=unknown
marital 1=married, 2= legally separated, 3= divorced,
4=widowed, 5=never married
inc - family yearly income in $2500 increments
0 = under 2500, 1=2500-4999, ...,
8= 12,500-14,999, 9=15000+,
98=unknown, 99=not asked
smoke - does mother smoke? 0=never, 1= smokes now,
2=until current pregnancy, 3=once did, not now,
9=unknown
number - number of cigarettes smoked per day
0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19,
5=20-29, 6=30-39, 7=40-60,
8=60+, 9=smoke but don’t know, 98=unknown, 99=not asked
Taking into account the codebook, what kind of data is
each variable? If the data have a natural order, but are not
genuinely quantitative, say “ordinal.” You can ignore the “unknown”
or “not asked” codes when giving your answer.
(a) Gestation categorical ordinal quantitative
(b) Race categorical ordinal quantitative
(c) Marital categorical ordinal quantitative
(d) Inc categorical ordinal quantitative
(e) Smoke categorical ordinal quantitative
(f) Number categorical ordinal quantitative
The disadvantage of storing categorical information as
numbers is that it’s easy to get confused and mistake one
level for another. Modern software makes it easy to use text
strings to label the different levels of categorical variables.
Still, you are likely to encounter data with categorical data
stored numerically, so be alert.
A good modern practice is to code missing data in a consistent
way that can be automatically recognized by software as
meaning missing. Often, NA is used for this purpose. Notice
that in the number variable, there is a clear order to the categories
until one gets to level 9, which means “smoke but don’t
know.” This is an unfortunate choice. It would be better to
store number as a quantitative variable telling the number of
cigarettes smoked per day. Another variable could be used to
indicate whether missing data was “smoke but don’t know,”
“unknown”, or “not asked.”
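To illustrate that last point, here is a hedged sketch of recoding
numeric codes as labeled categories with R’s factor function; the
frame name babies is hypothetical, and the labels are abbreviated
from the codebook above:
> # Hypothetical frame 'babies'; codes follow the smoke entry in the codebook.
> # Codes not listed in levels= (here, 9) automatically become NA (missing).
> babies$smoke = factor(babies$smoke, levels=c(0,1,2,3),
+     labels=c("never","smokes now","until current pregnancy","once did, not now"))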
Prob 2.22 Since the computer has to represent numbers
that are both very large and very small, scientific notation is
often used. The number 7.23e4 means 7.23 × 10⁴ = 72300.
The number 1.37e-2 means 1.37 × 10⁻² = 0.0137.
For each of the following numbers written in computer
scientific notation, say what is the corresponding number.
(a) 3e1 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3000 10000
(b) 1e3 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3000 10000
(c) 0.1e3 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3000 10000
(d) 0.3e-2 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3000 10000
(e) 10e3 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3000 10000
(f) 10e-3 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3000 10000
(g) 0.0003e3 0.003 0.01 0.03 0.1 0.3 1 3 10 30 100 300 1000 3000 10000
Chapter 3
Describing Variation
Reading Questions
1. What is the disadvantage of using a 100% coverage interval
to describe variation?
2. In describing a sample of a variable, what is the relationship
between the variance and the standard deviation?
3. What is a residual?
4. What’s the difference between “density” and “frequency”
in displaying a variable with a histogram?
5. What’s a normal distribution?
6. Here is the graph showing boxplots of height broken
down according to sex as well as for both males and
females together.
Which components of the boxplot for “All” match up
exactly with the boxplot for “M” or “F”? Explain why.
7. Variables typically have units. For example, in Galton’s
height data, the height variable has units of inches. Suppose
you are working with a variable in units of degrees
celsius. What would be the units of the standard deviation
of a variable? Of the variance? Why are they
different?
Prob 3.01 Here is a small table of percentiles of typical
daily calorie consumption of college students.
Percentile Calories
0 1400
5 1800
10 2000
25 2400
50 2600
75 2900
90 3100
95 3300
100 3700
(a) What is the 50%-coverage interval?
Lower Boundary 1800 1900 2000 2200 2400 2500 2600
Upper Boundary 2600 2750 2900 3000 3100 3200 3500
(b) What percentage of cases lie between 2900 and 3300?
10 20 25 30 40 50 60 70 80 90 95
(c) What is the percentile that marks the upper end of the
95%-coverage interval? 75 90 92.5 95 97.5 100
Estimate the corresponding calorie value from the table.
2900 3000 3100 3300 3500 3700
(d) Using the 1.5 IQR rule-of-thumb for identifying an outlier,
what would be the threshold for identifying a low
calorie consumption as an outlier?
1450 1500 1650 1750 1800 2000
Prob 3.02 Here are some useful operators for taking a quick
look at data frames:
names Lists the names of the components.
ncol Tells how many components there are.
nrow Tells how many lines of data there are.
head Prints the first several lines of the data frame.
Here are some examples of these commands applied to the
CO2 data frame:
CO2 = fetchData("CO2")
names(CO2)
[1] "Plant" "Type" "Treatment" "conc" "uptake"
ncol(CO2)
[1] 5
nrow(CO2)
[1] 84
head(CO2)
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
2 Qn1 Quebec nonchilled 175 30.4
3 Qn1 Quebec nonchilled 250 34.8
4 Qn1 Quebec nonchilled 350 37.2
5 Qn1 Quebec nonchilled 500 35.3
6 Qn1 Quebec nonchilled 675 39.2
• The data frame iris records measurements on flowers.
You can read it in with
iris = fetchData("iris")
creating an object named iris.
Use the above operators to answer the following questions.
1. Which of the following is the name of a column in
iris?
flower Color Species Length
2. How many rows are there in iris?
1 50 100 150 200
3. How many columns are there in iris?
2 3 4 5 6 7 8 10
4. What is the Sepal.Length in the third row?
1.2 3.6 4.2 4.7 5.9
• The data frame mtcars has data on cars from the 1970s.
You can read it in with
cars = fetchData("mtcars")
creating an object named cars.
Use the above operators to answer the following questions.
1. Which of the following is the name of a column in
cars?
carb color size weight wheels
2. How many rows are there in cars?
30 31 32 33 34 35
3. How many columns are there in cars?
7 8 9 10 11
4. What is the wt in the second row?
2.125 2.225 2.620 2.875 3.215
Prob 3.03 Here are Galton’s data on heights of adult children
and their parents.
> require(mosaic)
> galton = fetchData("Galton")
(a) Which one of these commands will give the 95th percentile
of the children’s heights in Galton’s data?
A qdata(95,height,data=galton)
B qdata(0.95,height,data=galton)
C qdata(95,galton,data=height)
D qdata(0.95,galton,data=height)
E qdata(95,father,data=height)
F qdata(0.95,father,data=galton)
(b) Which of these commands will give the 90-percent coverage
interval of the children’s heights in Galton’s data?
A qdata(c(0.05,0.95),height,data=galton)
B qdata(c(0.025,0.975),height,data=galton)
C qdata(0.90,height,data=galton)
D qdata(90,height,data=galton)
(c) Find the 50-percent coverage interval of the following variables
in Galton’s height data:
• Father’s heights
A 59 to 73 inches
B 68 to 71 inches
C 63 to 65.5 inches
D 68 to 74 inches
• Mother’s heights
A 59 to 73 inches
B 68 to 71 inches
C 63 to 65.5 inches
D 68 to 74 inches
(d) Find the 95-percent coverage interval of
• Father’s heights
A 65 to 73 inches
B 65 to 74 inches
C 68 to 73 inches
D 59 to 69 inches
• Mother’s heights
A 62.5 to 68.5 inches
B 65 to 69 inches
C 63 to 68.5 inches
D 59 to 69 inches
Prob 3.04 In Galton’s data, are the sons typically taller
than their fathers? Create a variable that is the difference
between the son’s height and the father’s height. (Arrange it
so that a positive number refers to a son who is taller than
his father.)
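Here is one hedged sketch of how the difference variable might
be built, assuming (as elsewhere in this chapter) that the Galton
frame has variables height, father, and sex, with sex coded M
and F:
> sons = subset(galton, sex=="M")  # keep only the male children
> sons$diff = sons$height - sons$father  # positive: son taller than father
Commands such as mean(), sd(), and qdata() can then be applied
to sons$diff.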
1. What’s the mean height difference (in inches)?
-2.47 -0.31 0.06 66.76 69.23
2. What’s the standard deviation (in inches)?
1.32 2.63 2.74 3.58 3.75
3. What is the 95-percent coverage interval (in inches)?
A -3.7 to 4.8
B -4.6 to 4.9
C -5.2 to 5.6
D -9.5 to 4.5
Prob 3.05 Use R to generate the sequence of 101 numbers:
0, 1, 2, 3, · · · , 100.
1. What’s the mean value?
25 50 75 100
2. What’s the median value?
25 50 75 100
3. What’s the standard deviation?
10.7 29.3 41.2 53.8
4. What’s the sum of squares?
5050 20251 103450 338350 585200
Now generate the sequence of perfect squares
0, 1, 4, 9, · · · , 10000 — written another way, 0², 1², 2², · · · , 100².
(Hint: Make a simple sequence 0 to 100 and square it.)
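If these summary commands are unfamiliar, here is the pattern
on a much shorter sequence (0 to 4), so the exercise’s own
numbers remain for you to compute:
> x = seq(0, 4)
> mean(x)  # 2
> median(x)  # 2
> sd(x)  # about 1.58
> sum(x^2)  # 0+1+4+9+16 = 30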
1. What’s the mean value?
50 2500 3350 4750 7860
2. What’s the median value?
50 2500 3350 4750 7860
3. What’s the standard deviation?
29.3 456.2 3028 4505 6108
4. What’s the sum of squares?
5050 20251 338350 585200 2050333330
Prob 3.06 Using Galton’s height data (galton.csv),
> require(mosaic)
> galton = fetchData("Galton")
make a box-and-whisker plot of the appropriate variable and
count the outliers to answer each of these questions.
1. Which of these statements will make a box-and-whisker
plot of height?
A bwplot(height,data=galton)
B bwplot(~height,data=galton)
C bwplot(galton,data=height)
D bwplot(~galton,data=height)
2. How many of the cases are outliers in height ?
0 1 2 3 5 10 15 20
3. Make a box-and-whisker plot of mother. The bounds of
the whiskers will be at 60 and 69 inches. It looks like
just a few cases are beyond the whiskers. This might be
misleading if there are several mothers with exactly the
same values.
Make a tally of the mothers, like this:
> tally( ~mother, data=galton )
58 58.5 59 60 60.2 60.5
7 9 26 36 1 1
61 61.5 62 62.5 62.7 63
25 1 73 22 7 103
63.5 63.7 64 64.2 64.5 64.7
42 8 112 5 26 7
65 65.5 66 66.2 66.5 66.7
133 36 69 5 47 6
67 68 68.5 69 70.5 Total
45 11 10 23 2 898
How many of the cases lie outside the whiskers in the
box-and-whisker plot of mother?
0 11 22 33 44 55 66
4. Apply the same process to father. According to the criteria
used by bwplot, how many of the cases are outliers
in father? 0 4 9 14 19 24 29
5. You can tally on multiple variables. For instance, to
tally on both mother and sex, do this:
> tally( ~mother + sex, data=galton )
Looking just at the cases where mother is an outlier, how
many of the children involved (variable sex) are female?
0 5 10 15 20 25 30 35
Prob 3.08 The figure shows the results from the medal
winners in the women’s 10m air-rifle competition in the 2008
Olympics. (Figure from the New York Times, Aug. 10, 2008)
The location of each of 10 shots is shown as translucent
light circles in each target. The objective is to hit the bright
target dot in the center. There is random scatter (variance)
as well as steady deviations (bias) from the target.
What is the direction of the apparent bias in Katerina
Emmons’s results? (Directions are indicated as compass
directions, E=east, NE=northeast, etc.)
NE NW SW SE
To measure the size of the bias, find the center of the shots
and measure how far that is from the target dot. Take the
distance between the concentric circles as one unit.
What is the size of the bias in Katerina Emmons’s results?
0 1 3 4 6 10
Prob 3.09 Here is a boxplot:
Reading from the graph, answer the following:
(a) What is the median?
0 1 2 3 6 Can’t estimate from this graph
(b) What is the 75th percentile?
0 1 2 3 6 Can’t estimate from this graph
(c) What is the IQR?
0 1 2 3 4 6 Can’t estimate from this graph
(d) What is the 40th percentile?
A between 0 and 1
B between 1 and 2
C between 2 and 3
D between 3 and 4
E between 4 and 6
F Can’t estimate from this graph.
Prob 3.10a The plot shows two different displays of density.
The displays might be from the same distribution or two
different distributions.
(a) What are the two displays?
A Density and cumulative
B Rug and cumulative
C Cumulative and box plot
D Density and rug plot
E Rug and box plot
(b) The two displays show the same distribution. True or
False
(c) Describe briefly any sign of mismatch or what features
convince you that the two displays are equivalent.
Prob 3.10b The plot shows two different displays of density.
The displays might be from the same distribution or two
different distributions.
(a) What are the two displays?
A Density and cumulative
B Rug and cumulative
C Cumulative and box plot
D Density and rug plot
E Rug and box plot
(b) The two displays show the same distribution. True or
False
(c) Describe briefly any sign of mismatch or what features
convince you that the two displays are equivalent.
Prob 3.11 By hand, calculate the mean, the range, the
variance, and the standard deviation of each of the following
sets of numbers:
(A) 1, 0, −1
(B) 1, 3
(C) 1, 2, 3.
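After working each set by hand, you can check your arithmetic
in R. Here is the pattern on a different set of numbers
(2, 4, 6), so the hand calculation stays yours:
> y = c(2, 4, 6)
> mean(y)  # 4
> diff(range(y))  # the range as a single number: 6 - 2 = 4
> var(y)  # sum of squared residuals over n-1: (4 + 0 + 4)/2 = 4
> sd(y)  # square root of the variance: 2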
1. Which of the 3 sets of numbers — A, B, or C — is the
most spread out according to the range?
A A
B B
C C
D No way to know
E All the same
2. Which of the 3 sets of numbers — A, B, or C — is the
most spread out according to the standard deviation?
A A
B B
C C
D No way to know
E All the same
Prob 3.12 A standard deviation contest. For (a)
and (b) below, you can choose numbers from the set
0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Repeats are allowed.
(a) Which list of 4 numbers has the largest standard deviation
such a list can possibly have?
A 0,3,6,9
B 0,0,0,9
C 0,0,9,9
D 0,9,9,9
(b) Which list of 4 numbers has the smallest standard deviation
such a list can possibly have?
A 0,3,6,9
B 0,1,2,3
C 5,5,6,6
D 9,9,9,9
Prob 3.13
(a) From what kinds of variables would side-by-side boxplots
be generated?
A categorical only
B quantitative only
C one categorical and one quantitative
D varies according to situation
(b) From what kinds of variables would a histogram be generated?
A categorical only
B quantitative only
C one categorical and one quantitative
D varies according to situation
Prob 3.14 The boxplots below are all made from exactly
the same data. One of them is made correctly, according to
the “1.5 IQR” convention for drawing the whiskers. The others
are drawn differently.
Plot 1 Plot 2 Plot 3 Plot 4
• Which of the plots is correct? 1 2 3 4
Prob 3.15 The plot purports to show the density of a distribution
of data. If this is true, the fraction of the data that
falls between any two values on the x axis should be the area
under the curve between those two values.
[Figure: a density curve drawn on graph paper; x axis ticks at 5, 8, 11, 14, 17, 20; Density axis ticks at 0, 0.02, 0.05, 0.08, 0.11, 0.14, 0.17, 0.2]
Answer the following questions. In doing so, keep in mind
that the area of each little box on the graph paper has been
arranged to be 0.01, so you can calculate the area by counting
boxes. You don’t need to be too fanatical about dealing with
boxes where only a portion is under the curve; just eyeball
things and estimate.
(a) The total area under a density curve should be 1. Assuming
that the density curve has height zero outside of
the area of the plot, is the area under the entire curve
consistent with this? yes no
(b) What fraction of the data falls in the range 12 ≤ x ≤ 14?
A 14%
B 22%
C 34%
D 56%
E Can’t tell from this graph.
(c) What fraction of the data falls in the range 14 ≤ x ≤ 16?
A 14%
B 22%
C 34%
D 56%
E Can’t tell from this graph.
(d) What fraction of the data has x ≥ 16?
A 1%
B 2%
C 5%
D 10%
E Can’t tell from this graph.
(e) What is the width of the 95% coverage interval? (Note:
The coverage interval itself has top and bottom ends.
This problem asks for the spacing between the two ends.)
A 2
B 4
C 8
D 12
E Can’t tell from this graph.
Prob 3.16 If two distributions have the same five-number
summary, must their density plots have the same shape? Explain.
Prob 3.17 As the name suggests, the Old Faithful geyser
in Yellowstone National Park has eruptions that come at
fairly predictable intervals, making it particularly attractive
to tourists.
(a) You are a busy tourist and have only 10 minutes to sit
around and watch the geyser. But you can choose when
to arrive. If the last eruption occurred at noon, what time
should you arrive at the geyser to maximize your chances
of seeing an eruption?
A 12:50
B 1:00
C 1:05
D 1:15
E 1:25
(b) Roughly, what is the probability that in the best 10-
minute interval, you will actually see the eruption:
A 5%
B 10%
C 20%
D 30%
E 50%
F 75%
(c) The interquartile range gives a simple measure of just how
faithful Old Faithful is. What is the interquartile range, according
to the boxplot above?
A 10 minutes
B 15 minutes
C 25 minutes
D 35 minutes
E 50 minutes
F 75 minutes
(d) Not only are you a busy tourist, you are a smart tourist.
Having read about Old Faithful, you understand that the
time between eruptions depends on how long the previous
eruption lasted. Here’s a box plot indicating the distribution
of inter-eruption times when the previous eruption
duration was less than three minutes. (That is, “TRUE”
means the previous eruption lasted less than three minutes.)
You can easily ask the ranger what was the duration of
the previous eruption.
What is the best 10-minute interval to return (after a
noon eruption) so that you will be most likely to see the
next eruption, given that the previous eruption was less
than three minutes in duration (the “TRUE” category).
A 1:00 to 1:10
B 1:05 to 1:15
C 1:10 to 1:20
D 1:15 to 1:25
E 1:20 to 1:30
F 1:25 to 1:35
(e) How likely are you to see an eruption if you return for the
most likely 10-minute interval?
A About 5%
B About 10%
C About 20%
D About 30%
E About 50%
F About 75%
Prob 3.18 For each of the following distributions, estimate
by eye the mean, standard deviation, and 95% coverage interval.
Also, calculate the variance.
Part 1.
• Mean. 10 15 20 25 30
• Std. Dev. 2 5 12 15 20
• 95% coverage interval.
– Lower end: 1 3 10 15 20
– Upper end: 20 25 30 35 40
• Variance. 2 7 10 20 25 70 140 300
Part 2.
• Mean. 0.004 150 180 250
• Std. Dev. 10 30 60 80 120
• 95% coverage interval.
– Lower end: 50 80 100 135 150 200 230
– Upper end: 50 80 100 180 200 230
• Variance. 30 80 500 900 1600 23000
Prob 3.19 Consider a large company where the average
wage of workers is $15 per hour, but there is a spread of
wages from minimum wage to $35 per hour.
After a contract negotiation, all workers receive a $2 per
hour raise. What happens to the standard deviation of hourly
wages?
A No change
B It goes up by $2 per hour
C It goes up by $4 per hour
D It goes up by 4 dollars-square per hour
E It goes up by $4 per hour-square
F Can’t tell from the information given.
The annual cost-of-living adjustment is 3%. After the
cost-of-living adjustment, what happens to the standard deviation
of hourly wages?
A No change
B It goes up by 3%
C It goes up by 9%
D Can’t tell from the information given.
Prob 3.20 Construct a data set of 10 hypothetical exam
scores (use integers between 0 and 100) so that the interquartile
range equals zero and the mean is greater than the
median.
Give your set of scores here:
Prob 3.23 Here are some familiar quantities. For each of
them, indicate what is a typical value, how far a typical case
is from this typical value, and what is an extreme but not
impossible case.
Example: Adult height. Typical value, 1.7 meters (68
inches). Typical case is about 7cm (3 inches) from the typical
value. An extreme height is 2.2 meters (87 inches).
• An adult’s weight.
• Income of a full-time employed person.
• Speed of cars on a highway in good conditions.
• Systolic blood pressure in adults. [You might need to
look this up on the Internet.]
• Blood cholesterol LDL levels. [Again, you might need
the Internet.]
• Fuel economy among different models of cars.
• Wind speed on a summer day.
• Hours of sleep per night for college students.
Prob 3.24 Data on the distribution of economic variables,
such as income, is often presented in quintiles: divisions of
the group into five equal-sized parts.
Here is a table from the US Census Bureau (Historical Income
Tables from March 21, 2002) giving the distribution of
income across US households in year 2000.
Quintile   Upper Boundary   Mean Value
Lowest     $17,955          $10,190
Second     $33,006          $25,334
Third      $52,272          $42,361
Fourth     $81,960          $65,729
Fifth      —                $141,260
Based on this table, calculate:
(a) The 20th percentile of family income.
10190 17955 33006 25334 52272 42361 81960 141260
(b) The 80th percentile of family income.
10190 17955 33006 25334 52272 42361 81960 141260
(c) The table doesn’t specify the median family income but
you can make a reasonable estimate of it. Pick the closest
one.
10000 18000 25500 42500 53000 65700
(d) Note that there is no upper boundary reported for the
fifth quintile, and no lower boundary reported for the first
quintile. Why?
(e) From this table, what evidence is there that family income
has a skew rather than “normal” distribution?
Prob 3.25 Use the Internet to find “normal” ranges for some
measurements used in clinical medicine. Pick one of the following
or choose one of particular interest to you: blood pressure
(systolic, diastolic, pulse), hematocrit, blood sodium and
potassium levels, HDL and LDL cholesterol, white blood cell
counts, clotting times, blood sugar levels, vital respiratory capacity,
urine production, and so on. In addition to the normal
range, find out what “normal” means, e.g., a 95% coverage interval
on the population or a range inconsistent with proper
physiological function. You may find out that there are differing
views of what “normal” means — try to indicate the
range of such views. You may also find out that “normal”
ranges can be different depending on age, sex, and other demographic
variables.
Prob 3.28 An advertisement for “America’s premier weight
loss destination” states that “a typical two week stay results
in a loss of 7-14 lbs.” (The New Yorker, 7 April 2008, p 38.)
The advertisement gives no details about the meaning of
“typical.” Give two or three plausible interpretations of the
quoted 7-14 pound figure in terms of “typical.” What interpretation
would be most useful to a person trying to predict
how much weight he or she might lose?
Prob 3.29 A seemingly straightforward statistic to describe
the health of a population is average age at death. In 1842,
the Report on the Sanitary Conditions of the Labouring Population
of Great Britain gave these averages: “gentlemen and
persons engaged in the professions, 45 years; tradesmen and
their families, 26 years; mechanics, servants and laborers, and
their families, 16 years.”
A student questioned the accuracy of the 1842 report with
this observation: “The mechanics, servants and laborer population
wouldn’t be able to renew itself with an average age
at death of 16 years. Mothers would be dying so early in life
that they couldn’t possibly raise their kids.”
Explain how an average age of death of 16 years could
be quite consistent with a “normal” family structure in which
parents raise their children through the child’s adolescence
in the teenage years. What other information about ages at
death would give a more complete picture of the situation?
Prob 3.30 The identification of a case as an outlier does not
always mean that the case is invalid or abnormal or the result
of a mistake. One situation where perfectly normal cases can
look like outliers is when there is a mechanism of proportionality
at work. Imagine, for instance, that there is a typical
rate of production of a substance, and the normal variability
is proportional in nature, say from 1/10 of that typical rate
to 10 times the rate. This leads to a situation where some
normal cases are 100 times as large as others.
To illustrate, look at the alder.csv data set, which contains
field data from a study of nitrogen fixation in alder
plants. The SNF variable records the amount of nitrogen fixed
in soil by bacteria that reside in root nodules of the plants.
Make a box plot and a histogram and describe the distribution.
Which of the following descriptions is most appropriate:
A The distribution is skewed to the left, with
outliers at very low values of SNF.
B The distribution is skewed to the right,
with outliers at very high values of SNF.
C The distribution is roughly symmetrical,
although there are a few outliers.
In working with a variable like this, it can help to convert
the variable in a way that respects the idea of a proportional
change. For instance, consider the three numbers 0.1, 1.0, and
10.0, which are evenly spaced in proportionate terms — each
number is 10 times bigger than the preceding number. But
as absolute differences, 0.1 and 1.0 are much closer to each
other than 1.0 and 10.0.
The logarithm function transforms numbers to a scale
where even proportions are equally spaced. For instance, taking
the logarithm of the numbers 0.1, 1.0, and 10.0 gives the
sequence −1, 0, 1 — exactly evenly spaced.
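You can check this directly in R:
> log10( c(0.1, 1.0, 10.0) )
[1] -1  0  1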
The logSNF variable gives the logarithm of SNF. Plot out
the distribution of logSNF. Which of the following descriptions
is most apt?
A The distribution is skewed to the left.
B The distribution is skewed to the right.
C The distribution is roughly symmetrical.
You can compute logarithms directly in R, using the functions
log, log2, or log10. Which of these functions was used
to compute the quantity logSNF from SNF? (Hint: Try them
out!)
log log2 log10
The base of the logarithm gives the size of the proportional
change that corresponds to a 1-unit increase on the
logarithmic scale. For example, log2 calculates the base-2
logarithm. On the base-2 logarithmic scale, a doubling in size
corresponds to a 1-unit increase. In contrast, on the base-10
scale, a ten-fold increase in size gives a 1-unit increase.
Logarithmic transformations are often used to deal with
variables that are positive and strongly skewed. In economics,
price, income and production variables are often this way. In
general, any variable where it is sensible to describe changes
in terms of proportion might be better displayed on a logarithmic
scale. For example, price inflation rates are usually
given as percent (e.g., “The inflation rate was 4% last year.”)
and so in dealing with prices over time, the logarithmic transformation
can be appropriate.
Prob 3.31 This exercise deals with data on weight loss
achieved by clients who stayed two weeks at a weight-loss
resort. The same data are shown using three different sorts
of graphical display: a pie chart, a histogram, and a
box-and-whisker plot. The point of the exercise is to help you
decide which display is the most effective at presenting
information to you.
In many fields, pie charts are used as “statistical graphics.”
Here’s a pie chart of the weight loss:
[Pie chart: Weight Loss in lbs, with sectors labeled (0,2], (2,4], (4,6], (6,8], (8,10], (10,12], and (12,16]]
Using the pie graph, answer the following:
(a) What’s the “typical” (median or mean) weight loss?
3.7 4.2 5.5 6.8 8.3 10.1 12.4
(b) What is the central 50% coverage interval?
2.3 to 6.8   4.2 to 10.7   4.4 to 8.7   6.1 to 9.3   5.2 to 12.1
(c) What is an upper extreme value? 10 13 16 18 20
Now to display the data as a histogram. So that you can’t
just re-use your answers from the pie chart, the weights have
been rescaled into kilograms.
[Histogram: Weight Loss in kg; kg on the horizontal axis from 0 to 8, Density on the vertical axis from 0.00 to 0.20]
Using the histogram, answer the following:
1. What’s the “typical” (median or mean) weight loss?
1.9 2.1 3.1 3.7 4.6 5.6
2. What is the central 50% coverage interval?
1.1 to 3.3   2.0 to 4.8   2.0 to 3.9   2.8 to 4.4   2.5 to 5.4
3. What is an upper extreme value? 6 8 10 12 14
Finally, here is a boxplot of the same data. It’s been
rescaled into a traditional unit of weight: stones.
[Boxplot: Weight Loss in Stones; vertical scale from 0.0 to 0.8 stones]
Using the boxplot, answer the following:
1. What’s the “typical” (median or mean) weight loss?
0.20 0.35 0.50 0.68 0.83 1.2
2. What is the central 50% coverage interval?
0.2 to 0.5   0.3 to 0.8   0.4 to 0.8   0.5 to 0.7   0.3 to 0.6
3. What is an upper extreme value? 0.7 0.9 1.0 1.1 1.3
Which style of graphic made it easiest to answer the questions?
pie.chart histogram box.plot
Prob 3.36 Elevators typically have a close-door button.
Some people claim that this button has no mechanical function;
it’s there just to give impatient people some sense of
control over the elevator.
Design and conduct an experiment to test whether the
button does cause the elevator door to close. Pick an elevator
with such a button and record some details about the elevator
itself: place installed, year installed, model number, etc.
Describe your experiment along with the measurements
you made and your conclusions. You may want to do the
experiment in small teams and use a stopwatch in order to
make accurate measurements. Presumably, you will want to
measure the time between when the button is pressed and
when the door closes, but you might want to measure other
quantities as well, for instance the time from when the door
first opened to when you press the button.
Store the data from your experiment in a spreadsheet in
Google Docs. Set the permissions for the spreadsheet so that
anyone with the link can read your data. Make sure to paste
the link into the textbox so that your data can be accessed.
Please don’t inconvenience other elevator users with the
experiment.
Prob 3.50 What’s a “normal” body temperature? Depending
on whether you use the Celsius or Fahrenheit scale, you
are probably used to the numbers 37°C or 98.6°F. These
numbers come from the work of Carl Wunderlich, published
in Das Verhalten der Eigenwärme in Krankheiten in 1868
based on more than a million measurements made under the
armpit. According to Wunderlich, “When the organism (man)
is in a normal condition, the general temperature of the body
maintains itself at the physiologic point: 37°C = 98.6°F.”
Since 1868, not only have the techniques for measuring
temperatures improved, but so has the understanding that
“normal” is not a single temperature but a range of temperatures.
A 1992 article in the Journal of the American Medical
Association (PA Mackowiak et al., “A Critical Appraisal of
98.6°F ...” JAMA v. 268(12) pp. 1578-1580) examined temperature
measurements made orally with an electronic thermometer.
The subjects were 148 healthy volunteers between
age 18 and 40.
The figure shows the distribution of temperatures, separately
for males and females. Note that the horizontal scale
is given in both °C and °F — this problem will use °F.
What’s the absolute range for females?
Minimum: 96.1 96.3 97.1 98.6 99.9 100.8
Maximum: 96.1 96.3 97.1 98.6 99.9 100.8
And for males?
Minimum: 96.1 96.3 97.1 98.6 99.9 100.8
Maximum: 96.1 96.3 97.1 98.6 99.9 100.8
Notice that there is an outlier for the females’ temperature,
as evidenced by a big gap in temperature between that
bar and the next closest bar. How big is the gap?
A About 0.01°F.
B About 0.1°F.
C Almost 1°F.
Give a 95% coverage interval for females. Hint: The interval
will exclude the most extreme 2.5% of cases on each
of the left and right sides of the distribution. You can find
the left endpoint of the 95% interval by scanning in from the
left, adding up the heights of the bars until they total 0.025.
Similarly, the right endpoint can be marked by scanning in
from the right until the bars total 0.025.
A About 96.2°F to about 99.0°F
B About 96.8°F to about 100.0°F
C About 97.6°F to about 99.2°F
And for males?
A About 96.2°F to about 99.2°F
B About 96.7°F to about 99.4°F
C About 97.5°F to about 99.6°F
Prob 3.53 There are many different numerical descriptions
of distributions: mean, median, standard deviation, variance,
IQR, coverage interval, ... And these are just the ones we
have touched on so far. We’ll also encounter “standard error,”
“margin of error,” “confidence interval.” There are so
many that it becomes a significant challenge to students to
keep them straight. Eventually, statistical workers learn the
subtleties of the different descriptions and when each is appropriate.
Then, like using near synonyms in English, it becomes
second nature.
As an example, consider the verb “spread.” Here are some
synonyms from the thesaurus, each of which is appropriate in
a particular context: broadcast, scatter, propagate, sprawl,
extend, stretch, cover, daub, ... If you were talking to a farmer
about sowing seeds, the words “broadcast” or “scatter” would
be appropriate, but it would be silly to say the seeds are being
“daubed” or “sprawled”. On the other hand, to an urbanite
concerned with congestion in traffic, the growth of the city
might well be summarized with “sprawl.” You have to know
the context and the intent to choose the correct term.
To help understand the different contexts and intents,
here are some important ways of categorizing what a particular
description captures:
(1) Location and scatter
– What is a typical value? (“center”)
– What are the top and bottom range of the values?
(“range”)
– How far are the values scattered? (“scatter”)
– What is high? or What is low? (“non-central”)
(2) Including the “extremes”
– All inclusive, and sensitive to outliers. (“not-robust”)
– All inclusive, but not sensitive to outliers. (“robust”)
– Leaves out the very extremes. (“plausible”)
– Focuses on the middle. (“mainstream”)
Note that descriptors of both the “plausible” and the
“mainstream” type are necessarily robust, since they
leave out the outliers.
(3) Individual versus whole sample.
– Description relevant to individual cases
– Description or summary of entire samples, combining
many cases.
You won’t have to deal with this until later, when it will
help explain terms that you haven’t yet encountered, like
“standard error”, “margin of error”, and “confidence interval.”
Example: The mean describes the center of a distribution. It
is calculated from all the data and is not robust against outliers.
For each of the following descriptors of a distribution,
choose the items that best characterize the descriptor.
1. Median
(a) center range scatter non-central
(b) robust not-robust plausible mainstream
2. Standard Deviation
(a) center range scatter non-central
(b) robust not-robust plausible mainstream
3. IQR
(a) center range scatter non-central
(b) robust not-robust plausible mainstream
4. Variance
(a) center range scatter non-central
(b) robust not-robust plausible mainstream
5. 95% coverage interval
(a) center range scatter non-central
(b) robust not-robust plausible mainstream
6. 50% coverage interval
(a) center range scatter non-central
(b) robust not-robust plausible mainstream
7. 50th percentile
(a) center range scatter non-central
8. 80th percentile
(a) center range scatter non-central
9. 99th percentile
(a) center range scatter non-central
10. 10th percentile
(a) center range scatter non-central
One of the reasons why there are so many descriptive terms
is that they have different roles in theory. For example, the
variance turns out to have simple theoretical properties that
make it useful when describing sums of variables. It’s much
simpler than, say, the IQR.
Prob 3.54 There are two kinds of questions that are often
asked relating to percentiles:
What is the value that falls at a given percentage? For
instance, in the ten-mile-race.csv running data, how
fast are the fastest 10% of runners? In R, you would
ask in this way:
> run = fetchData("ten-mile-race.csv")
> qdata(0.10, run$net)
10%
4409
The answer is in the units of the variable, in this case
seconds. So 10% of the runners have net times faster
than or equal to 4409 seconds.
What percentage falls at a given value? For instance,
what fraction of runners are faster than 4000 seconds?
> pdata(4000, run$net)
[1] 0.04029643
The answer includes those whose net time is exactly
equal to or less than 4000 seconds.
It’s important to pay attention to the p and q in the statement.
pdata and qdata ask related but different questions.
Use pdata and qdata to answer the following questions
about the running data.
1. Below (or equal to) what age are the youngest 35% of
runners?
Which statement will do the correct calculation?
A pdata(0.35,run$age)
B qdata(0.35,run$age)
C pdata(35,run$age)
D qdata(35,run$age)
What will the answer be?
28 29 30 31 32 33 34 35
2. What’s the net time that divides the slowest 20% of
runners from the rest of the runners?
Which statement will do the correct calculation?
A pdata(0.20,run$net)
B qdata(0.20,run$net)
C pdata(0.80,run$net)
D qdata(0.80,run$net)
What will the answer be?
4921 5318 5988 6346 7123 7431 seconds
3. What is the 95% coverage interval on age?
Which statement will do the correct calculation?
A pdata(c(0.025,0.975),run$age)
B qdata(c(0.025,0.975),run$age)
C pdata(c(0.050,0.950),run$age)
D qdata(c(0.050,0.950),run$age)
• What will the answer be?
A 22 to 60
B 20 to 65
C 25 to 59
D 20 to 60
4. What fraction of runners are 30 or younger?
• Which statement will do the correct calculation?
A pdata(30,run$age)
B qdata(30,run$age)
C pdata(30.1,run$age)
D qdata(30.1,run$age)
• What will the answer be?
In percent: 29.3 30.1 33.7 35.9 38.0 39.3
5. What fraction of runners are 65 or older? (Caution:
This isn’t yet in the form of a BELOW question.)
Which statement will do the correct calculation?
A pdata(65,run$age)
B pdata(64.99,run$age)
C pdata(65.01,run$age)
D 1-pdata(65,run$age)
E 1-pdata(64.99,run$age)
F 1-pdata(65.01,run$age)
What will the answer be?
In percent: 0.5 1.1 1.7 2.3 2.9
6. The time it takes for a runner to get to the start line
after the starting gun is fired is the difference between
the time and net.
run$to.start = run$time - run$net
How long is it before 75% of runners get to the
start line?
In seconds: 164 192 213 294 324 351
What fraction of runners get to the start line before
one minute? (Caution: the times are measured in
seconds.)
In percent: 10 15 19 22 25 31 34
7. What is the 95% coverage interval on the ages of female
runners?
A 19 to 61 years
B 22 to 61 years
C 19 to 56 years
D 22 to 56 years
8. What fraction of runners have a net time BELOW 4000
seconds? (That is, don’t include those who are at exactly
4000 seconds.)
In percent: 3.72 4.00 4.03 4.07 5.21
Chapter 4
Group-wise Models
Reading Questions
1. Which is larger: variance of residuals, variance of the
model values, or the variance of the actual values?
2. How can a difference in group means clearly shown by
your data nonetheless be misleading?
3. What does it mean to partition variation? What’s special
about the variance — the square of the standard
deviation — as a way to measure variation?
Prob 4.03 To exercise your ability to calculate groupwise
quantities, use the swimming records in swim100m.csv and
calculate the mean and minimum swimming time for each of the subsets below.
(Answers have been rounded to one decimal place.)
> require(mosaic)
> swim = fetchData("SwimRecords")
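As a hedged sketch of the subsetting idiom (assuming, as the
hints below suggest, that the frame has variables time, year,
and sex; the cutoff 1970 is arbitrary and not one of the
questions):
> recent = subset(swim, year > 1970)  # keep only cases meeting the criterion
> mean(recent$time)
> min(recent$time)
The subsets in (a)-(c) can be built the same way, with a
different criterion inside subset().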
(a) Record times for women:
Mean: 47.8 53.5 54.7 57.3 61.4 63.4 65.2 73.8 84.2
Minimum: 47.8 53.5 54.7 57.3 61.4 63.4 65.2 73.8 84.2
(b) All records before 1920. (Hint: the construction
year<1920 can be used as a variable.)
Mean: 47.8 53.8 54.7 57.3 61.4 63.4 65.2 73.8 84.2
Minimum: 47.8 53.8 54.7 57.3 61.4 63.4 65.2 73.8 84.2
(c) All records that are slower than 60 seconds. (Hint:
Think what “slower” means in terms of the swimming
times.)
Mean: 47.8 53.8 54.7 60.2 61.6 63.4 65.2 73.8 84.2
Minimum: 47.8 53.8 54.7 60.2 61.6 63.4 65.2 73.8 84.2
Prob 4.04 Here is a model of wages in 1985 constructed
using the CPS85 data.
> mod = mm( wage ~ sector, data=CPS85 )
wage is the “response variable,” while sector is the explanatory
variable.
For every case in the data, the model will give a “fitted
model value.” Different cases will have different fitted model
values if they have different values for the explanatory variable.
Here, the model assigns different fitted model values to
workers in different sectors of the economy.
You can see the groupwise means for the different sectors
by looking at the model. Just give the model name, like this:
> mod
(a) What is the mean wage for workers in the construction
sector (const)?
6.54 7.42 7.59 8.04 8.50 9.50 11.95 12.70
(b) What is the mean wage for workers in the management
sector (manag)?
6.54 7.42 7.59 8.04 8.50 9.50 11.95 12.70
(c) Which sector has the lowest mean wage?
clerical const manag manuf prof sales service
(d) Statistical models attempt to account for case-to-case
variability. One simple way to measure the success of
a model is to look at the variation in the fitted model values.
What is the standard deviation in the fitted model
values for mod?
0 0.95 1.10 1.53 2.03 2.20 2.43 3.43 4.13 4.65
(e) The residuals of the model tell how far each case is from
that case’s fitted model value. In interpreting models,
it’s often important to know the typical size of a residual.
The standard deviation is often used to quantify “size”.
What’s the standard deviation of the residuals of mod?
0 0.95 1.10 1.53 2.03 2.20 2.43 3.43 4.13 4.65
Prob 4.05 Here are two models of wages in 1985 in the
CPS85 data:
> mod1 = mm( wage ~ 1, data=CPS85 )
> mod2 = mm( wage ~ sector, data=CPS85 )
The model mod1 corresponds to the grand mean, as if all
cases were in the same group. The model mod2 breaks down
the mean wage into groups depending on what sector of the
economy the worker is in.
(a) Which model has the greater variation from case to case
in fitted model values?
mod1 mod2 same for both
(b) Which model has the greater variation from case to case
in residuals?
mod1 mod2 same for both
(c) Which of these statements is true for both model 1 and 2
(and all other groupwise mean models)?
• The mean residual is always zero. True or False
• The standard deviation of residuals plus the standard
deviation of fitted model values gives the standard
deviation of the variable being modeled (the
“response variable”). True or False
• The variance of residuals plus the variance of fitted
model values gives the variance of the variable being
modeled. True or False
Prob 4.06 Read in the Current Population Survey wage
data:
> w = fetchData("CPS85")
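A hedged sketch of the kinds of commands involved, in the
style of Prob 4.04; it assumes CPS85 has variables sex and
married with levels such as F/M and Married/Single, and uses
a combination (married men) that is not asked below:
> mm( wage ~ 1, data=w )  # grand mean
> mm( wage ~ sex, data=w )  # group-wise means by sex
> mean( w$wage[ w$sex=="M" & w$married=="Married" ] )  # one group, two criteria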
(a) What is the grand mean of wage?
7.68 7.88 8.26 8.31 9.02 9.40 10.88
(b) What is the group-wise mean of wage for females?
7.68 7.88 8.26 8.31 9.02 9.40 10.88
(c) What is the group-wise mean of wage for married people?
7.68 7.88 8.26 8.31 9.02 9.40 10.88
(d) What is the group-wise mean of wage for married females?
(Hint: There are two grouping variables involved.)
7.68 7.88 8.26 8.31 9.02 9.40 10.88
Prob 4.07 Read in the Galton height data
> g = fetchData("Galton")
(a) What is the standard deviation of the height?
(b) Calculate the grand mean and, from that, the residuals
of the actual heights from the grand mean.
> mod0 = mm(height~1, data=g)
> res = resid(mod0)
What is the standard deviation of the residuals from this
”grand mean” model? 2.51 2.58 2.92 3.58 3.82
(c) Calculate the group-wise mean for the different sexes and,
from that, the residuals of the actual heights from this
group-wise model.
> mod1 = mm( height ~ sex, data=g)
> res1 = resid(mod1)
What is the standard deviation of the residuals from this
group-wise model?
2.51 2.58 2.92 3.58 3.82
(d) Which model has the smaller standard deviation of residuals?
mod0 mod1 they are the same
Prob 4.08 Create a spreadsheet with the three variables
distance, team, and position, in the following way:
distance team position
5 Eagles center
12 Eagles forward
11 Eagles end
2 Doves center
18 Doves end
12 Penguins forward
15 Penguins end
19 Eagles back
5 Penguins center
12 Penguins back
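If you prefer, the same data can be built directly in R
instead of a spreadsheet; a sketch (the frame name games is
arbitrary):
> games = data.frame(
+   distance = c(5, 12, 11, 2, 18, 12, 15, 19, 5, 12),
+   team = c("Eagles","Eagles","Eagles","Doves","Doves",
+            "Penguins","Penguins","Eagles","Penguins","Penguins"),
+   position = c("center","forward","end","center","end",
+                "forward","end","back","center","back") )
The grand and group-wise means can then be computed as in the
earlier problems, e.g. with mm( distance ~ team, data=games ).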
(a) After entering the data, you can calculate the mean
distance in various ways.
What is the grand mean distance?
4 9.25 10 11 11.1 11.75 12 14.67 15.5
What is the group mean distance for the three
teams?
– Eagles 4 9.25 10 11 11.1 11.75 12 14.67 15.5
– Doves 4 9.25 10 11 11.1 11.75 12 14.67 15.5
– Penguins 4 9.25 10 11 11.1 11.75 12 14.67 15.5
What is the group mean distance for the following
positions?
– back 4 9.25 10 11 11.1 11.75 12 14.67 15.5
– center 4 9.25 10 11 11.1 11.75 12 14.67 15.5
– end 4 9.25 10 11 11.1 11.75 12 14.67 15.5
(b) Now, just for the sake of developing an understanding of
group means, you are going to change the distance data. Make
up values for distance so that the mean distance for Eagles is
14, for Penguins is 13, and for Doves is 15.
Cut and paste the output from R showing the means for
these groups and then the means taken group-wise according
to position.
(c) Now arrange things so that the means are as stated in (b)
but every case has a residual of either 1 or −1.
Prob 4.10 It can be helpful when testing and evaluating
statistical methods to use simulations. In this exercise, you
are going to use a simulation of salaries to explore groupwise
means. Keep in mind that the simulation is not reality; you
should NOT draw conclusions about real-world salaries from
the simulation. Instead, the simulation is useful just for showing
how statistical methods work in a setting where we know
the true answer.
To use the simulations, you’ll need both the mosaic package
and some additional software. Probably you already have
mosaic loaded, but it doesn’t hurt to make sure. So give both
these commands:
> require(mosaic)
> source("http://www.mosaic-web.org/StatisticalModeling/additional-software.r")
The simulation you will use in this exercise is called
salaries. It’s a simulation of salaries of college professors.
To carry out the simulation, give this command:
> run.sim( salaries, n=5 )
age sex children rank salary
1 46 M 3 Assoc 50989.76
2 67 F 1 Full 63916.00
3 41 F 0 Assoc 42662.20
4 53 M 0 Full 53213.34
5 47 M 0 Assoc 49762.19
The argument n tells how many cases to generate. By looking
at these five cases, you can see the structure of the data.
Chances are, the data you generate by running the simulation
will differ from the data printed here. That’s because
the simulation generates cases at random. Still, underlying
the simulation is a mathematical model that imposes certain
patterns and relationships on the variables. You can get an
idea of the structure of the model by looking at the salaries
simulation itself:
> salaries
Causal Network with 5 vars: age, sex, children, rank, salary
===============================================
age is exogenous
sex <== age
children is exogenous
rank <== age & sex & children
salary <== age & rank
This structure, and the equations that underlie it, might or
might not correspond to the real world; no claim about the
realism of the model is being made here. Instead, you’ll use
the model to explore some mathematical properties of group
means.
Generate a data set with n = 1000 cases using the simulation.
> s = run.sim( salaries, n=1000 )
1. What is the grand mean of the salary variable? (Choose
the closest.)
39000 42000 48000 51000 53000 59000 65000 72000
2. What is the grand mean of the age variable? (Choose the
closest.)
41 45 48 50 53 55 61
3. Calculate the groupwise means for salary broken down by
sex.
• For women?
39000 42000 48000 51000 53000 59000 65000 72000
• For men?
39000 42000 48000 51000 53000 59000 65000 72000
• What’s the pattern indicated by these groupwise means?
A Women and men earn almost exactly the
same, on average.
B Men earn less than women, on average.
C Women earn less than men, on average.
4. Make side-by-side boxplots of the distribution of salary,
broken down by sex. Use the graph to answer the following
questions. (Choose the closest answer.)
• What fraction of women earn more than the median
salary for men?
None 0.25 0.50 0.75 All
• What fraction of men earn less than the median salary
for women?
None 0.25 0.50 0.75 All
• Explain how it’s possible that the mean salary for men
can be higher than the mean salary for women, and
yet some men earn less than some women. (If this is
obvious to you, then state the obvious!)
5. There are other variables involved in the salary simulation.
In particular, consider the rank variable. At most colleges
and universities, professors start at the assistant level, then
some are promoted to associate and some further promoted
to “full” professors.
Find the mean salary broken down by rank.
• What’s the mean salary for assistant professors?
(Choose the closest.) 37000 41000 46000 52000 58000 63000
• What’s the mean salary for associate professors?
(Choose the closest.) 37000 41000 46000 52000 58000 63000
• What’s the mean salary for “full” professors? (Choose
the closest.) 37000 41000 46000 52000 58000 63000
6. Make the following side-by-side boxplot. (Make sure to
copy the command exactly.)
> bwplot( salary ~ cross(rank,sex), data=s )
Based on the graph, choose one of the following:
A Adjusted for rank, women and men earn
about the same.
B Adjusted for rank, men systematically earn
less than women.
C Adjusted for rank, women earn less than
men.
7. Look at the distribution of rank, broken down by sex.
(Hint: rank is a categorical variable, so it’s meaningless to
calculate the mean. But you can tally up the proportions.)
Explain how the different distributions of rank for the different
sexes can account for the pattern of salaries.
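One way to tabulate the proportions is with tally() from the
mosaic package; this is a sketch, and the format argument is one
way to ask for proportions rather than counts:
> tally( rank ~ sex, data=s, format="proportion" )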
Keep in mind that this is a simulation and says nothing
directly about the real-world distribution of salaries. In analyzing
real-world salaries, however, you might want to use
some of the same techniques.
Chapter 5
Confidence Intervals
Reading Questions
• What is a sampling distribution? What sort of variation
does it reflect?
• What are resampling and bootstrapping?
• What is the difference between a “confidence interval”
and a “coverage interval”?
Prob 5.01 The mean of the adult children in Galton’s data
is
> mean( height, data=Galton )
[1] 66.76069
Had Galton selected a different sample of kids, he would
likely have gotten a slightly different result. The confidence
interval indicates a likely range of possible results around the
actual result.
Use bootstrapping to calculate the 95% confidence interval
on the mean height of the adult children in Galton’s data.
The following statement will generate 500 bootstrapping trials.
> trials = do(500) * mean(height, data=resample(Galton) )
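To read off an interval from the trials, you can apply qdata()
to the column of results. As a sketch (the column is typically
named result for a simple statistic like this; check names(trials)
to see what it is actually called):
> qdata( c(0.025, 0.975), result, data=trials )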
(a) What’s the 95% confidence interval on the mean height?
A 66.5 to 67.0 inches.
B 66.1 to 67.3 inches.
C 61.3 to 72 inches.
D 65.3 to 66.9 inches.
(b) A 95% coverage interval on the individual children’s
height can be calculated like this:
> qdata(c(0.025,0.975), height, data=Galton)
2.5% 97.5%
60 73
Explain why the 95% coverage interval of individual children’s
heights is so different from the 95% confidence interval
on the mean height of all children.
(c) Calculate a 95% confidence interval on the median height.
A 66.5 to 67.0 inches.
B 66.1 to 67.3 inches.
C 61.3 to 72 inches.
D 65.3 to 66.9 inches.
Prob 5.02 Consider the mean height in Galton’s data,
grouped by sex:
> g = fetchData("Galton")
> mean( height ~ sex, data=g )
F M
64.11016 69.22882
In interpreting this result, it’s helpful to have a confidence
interval on the mean height of each group. Resampling and
bootstrapping can do the job.
Resampling simulates a situation where a new sample is
being drawn from the population, as if Galton had restarted
his work. Since you don’t have access to the original population,
you can’t actually pick another sample. But you can
resample from your sample, introducing much of the variation
that would occur if you had drawn a new sample from the
population.
Each resample gives a somewhat different result:
> mean( height ~ sex, data=resample(g) )
F M
64.06674 69.18453
> mean( height ~ sex, data=resample(g) )
F M
63.95768 69.17194
> mean( height ~ sex, data=resample(g) )
F M
64.03630 69.33382
By repeating this many times, you can estimate how much
variation there is in the resampling process:
> trials = do(1000) * mean(height~sex, data=resample(g))
To quantify the variation, it’s conventional to take a 95% coverage
interval. For example, here’s the coverage interval for
heights of females:
> qdata( c(0.025, 0.975), F, data=trials )
2.5% 97.5%
63.87739 64.34345
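By analogy, the interval for males comes from the M column of
the same trials:
> qdata( c(0.025, 0.975), M, data=trials )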
• What is the 95% coverage interval for heights of males
in the resampling trials? (Choose the closest answer.)
A 63.9 to 64.3 inches
B 63.9 to 69.0 inches
C 69.0 to 69.5 inches
• Do the confidence intervals for mean height overlap
between males and females? Yes No
• Make a box-and-whisker plot of height versus sex.
> bwplot( height ~ sex, data=g )
This displays the overall variation of individual heights.
– Is there any overlap between males and females in
the distribution of individual heights? Yes No
– How does the spread of individual heights compare
to the 95% confidence intervals on the means?
A The CIs on the means are about as wide as
the distribution of individual heights.
B The CIs on the means are much wider than
the distribution of individual heights.
C The CIs on the means are much narrower
than the distribution of individual heights.
Prob 5.03 Resampling and bootstrapping provide one
method to find confidence intervals. By making certain mathematical
assumptions, it’s possible to estimate a confidence interval
without randomization. This is the traditional method
and still widely used. It’s implemented by the confint function
when applied to a model such as created by mm or lm.
For example:
> g = fetchData("Galton")
> mod = mm(height ~ sex, data=g)
> confint(mod)
group 2.5 % 97.5 %
1 F 63.87352 64.34681
2 M 69.00046 69.45717
For the present, you can imagine that confint is going
through the work of resampling and repeating trials, although
that is not what is actually going on.
For each of the following groupwise models, find the 95%
confidence interval on the group means. Then answer whether
the confidence intervals overlap. When there are more than
two groups, consider whether any of the groups overlap with
any of the others.
Note: Although mean and mm are doing much the same
thing, mm retains additional information about the data
needed to calculate a confidence interval. So use mm in this
exercise rather than mean.
(a) In the kids’ feet data, the foot width broken down by sex.
> mm( width ~ sex, data=KidsFeet )
Overlap: None Barely Much
(b) In the CPS85 data, the hourly wage broken down by sex.
> mm( wage ~ sex, data=CPS85 )
Overlap: None Barely Much
(c) In the CPS85 data, the hourly wage broken down by
married.
> mm( wage ~ married, data=CPS85 )
Overlap: None Barely Much
(d) In the CPS85 data, the hourly wage broken down by
sector.
> mm( wage ~ sector, data=CPS85 )
Overlap: None Barely Much
Prob 5.09 A student writes the following on a homework
paper:
“The 95% confidence interval is (9.6, 11.4). I’m
very confident that this is correct, because my
sample mean of 10.5 lies within this interval.”
Comment on the student’s reasoning.
Source: Prof. Rob Carver, Stonehill College
Prob 5.12 Scientific papers very often contain graphics
with “error bars.” Unfortunately, there is little standardization
of what such error bars mean so it is important for the
reader to pay careful attention in interpreting the graphs.
The following four graphs — A through D — each show
a distribution of data along with error bars. The meaning
of the bars varies from graph to graph according to different
conventions used in different areas of the scientific literature.
In each graph, the height of the filled bar is the mean of the
data. Your job is to associate each error bar with its meaning.
You can do this by comparing the actual data (shown as dots)
with the error bar.
[Figure: four graphs, A through D, each showing wild-type and
mutant data as dots, with a filled bar at the mean and error bars.]
Range of the data
Graph A Graph B Graph C Graph D
Standard deviation of the data
Graph A Graph B Graph C Graph D
Standard error of the mean
Graph A Graph B Graph C Graph D
95% confidence interval on the mean
Graph A Graph B Graph C Graph D
This problem is based on G. Cumming, F. Fidler, and DL
Vaux (2007), “Error bars in experimental biology”, J. Cell
Biology 177(1):7-11
Prob 5.13 An advertisement for “America’s premier weight
loss destination” states that “a typical two week stay results
in a loss of 7-14 lbs.” (The New Yorker, 7 April 2008, p 38.)
The advertisement gives no details about the meaning of
“typical,” but here are some possibilities:
• The 95% coverage interval of the weight loss of the
individual clients.
• The 50% coverage interval of the weight loss of the
individual clients.
• The 95% confidence interval on the mean weight loss of
all the clients.
• The 50% confidence interval on the mean weight loss of
all the clients.
Explain what would be valid and what misleading about
advertising a confidence interval on the mean weight loss.
Why might it be reasonable to give a 50% coverage interval
of the weight loss of individual clients, but not appropriate
to give a 50% confidence interval on the mean weight loss?
Prob 5.17 Standard errors and confidence intervals apply
not just to model coefficients, but to any numerical description
of a variable. Consider, for instance, the median or IQR
or standard deviation, and so on.
A quick and effective way to find a standard error is a
method called bootstrapping, which involves repeatedly resampling
the variable and calculating the description on each
resample. This gives the sampling distribution of the description.
From the sampling distribution, the standard error —
which is just the standard deviation of the sampling distribution
— can be computed.
Here’s an example, based on the inter-quartile range of the
kids’ foot length measurements.
First, compute the desired sample statistic on the actual
data. As it happens, IQR is not part of the mosaic package,
so you need to use the with function:
> with( IQR(length), data=KidsFeet )
[1] 1.6
Next, modify the statement to incorporate resampling of
the data:
> with( IQR(length), data=resample(KidsFeet) )
[1] 1.3
Finally, run this statement many times to generate the
sampling distribution and find the standard error of this distribution:
> samps = do(500) * with( IQR(length), data=resample(KidsFeet) )
> sd(samps)
result
0.3379714
Use the bootstrapping method to find an estimate of the
standard error of each of these sample statistics on the kids’
foot length data:
1. The sample median. (Pick the closest answer.)
0.01 0.07 0.14 0.24 0.34 0.71 1.29 1.32 24.6
2. The sample standard deviation. (Pick the closest answer.)
0.01 0.07 0.14 0.24 0.34 0.71 1.29 1.32 24.6
3. The sample 75th percentile.
0.01 0.07 0.14 0.24 0.34 0.71 1.29 1.32 24.6
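As a sketch, the sample median follows exactly the pattern of
the IQR example above; the standard deviation works the same
way, and for the 75th percentile one possibility is
qdata(0.75, length, ...):
> samps = do(500) * with( median(length), data=resample(KidsFeet) )
> sd(samps)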
Bootstrapping works well in a broad set of circumstances,
but if you have a very small sample, say less than a dozen
cases, you should be skeptical of the result.
Prob 5.20 A perennial problem when writing scientific reports
is figuring out how many significant digits to report. It’s
naïve to copy all the digits from one’s calculator or computer
output; the data generally do not justify such precision.
Once you have a confidence interval, however, you do not
need to guess how many significant digits are appropriate.
The standard error provides good guidance. Here is a rule of
thumb: keep two significant digits of the margin of error and
round the point estimate to the same precision.
For example, suppose you have a confidence interval of
1.7862 ± 0.3624 with 95% confidence. Keeping the first two
significant digits of the margin of error gives 0.36. We’ll keep
the point estimate to the same number of digits, giving altogether
1.79 ± 0.36.
Another example: suppose the confidence interval is
6548.23 ± 1321. Keeping the first two significant digits of the
margin of error gives 1300; rounding the point estimate to the
same precision gives 6500 ± 1300.
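R can do this rounding for you. A minimal sketch, using signif()
for the margin of error and round() for the point estimate:
> signif( 0.3624, 2 )  # margin of error to two significant digits
[1] 0.36
> round( 1.7862, 2 )   # point estimate to the matching precision
[1] 1.79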
(a) Suppose the computer output is 0.03234232±0.01837232.
Using this rule of thumb, what should you report in as
the confidence interval?
A 0.3234 ± 0.01837
B 0.3234 ± 0.0183
C 0.0323 ± 0.0184
D 0.0323 ± 0.018
E 0.032 ± 0.018
F 0.032 ± 0.01
G 0.03 ± 0.01
(b) Now suppose the computer output were 99.63742573 ±
1.48924367.
What should you report as the confidence interval?
A 100 ± 1
B 99 ± 1.5
C 99.6 ± 1.5
D 99.64 ± 1.49
E 99.647 ± 1.489
Prob 5.23 Robert Hooke (1635-1703) was a contemporary
of Isaac Newton. He is famous for his law of elasticity (Hooke’s
Law) and is considered the father of microscopy. He was
the first to use the word “cell” to name the components of
plant tissue; the structures he observed during his observations
through a microscope reminded him of monks’ cells in
a monastery. He drew this picture of cork cells under the
microscope: [Figure: Hooke’s drawing of cork cells.]
Regarding these observations of cork, Hooke wrote:
I could exceedingly plainly perceive it to be all perforated
and porous, much like a Honey-comb, but
that the pores of it were not regular. . . . these
pores, or cells, . . . were indeed the first microscopical
pores I ever saw, and perhaps, that were
ever seen, for I had not met with any Writer or
Person, that had made any mention of them before
this ....
He went on to measure the cell size.
But, to return to our Observation, I told several
lines of these pores, and found that there were usually
about threescore of these small Cells placed
end-ways in the eighteenth part of an Inch in
length, whence i concluded that there must be neer
eleven hundred of them, or somewhat more then
a thousand in the length of an Inch, and therefore
in a square Inch above a Million, or 1166400. and
in a Cubick Inch, above twelve hundred Millions,
or 1259712000. a thing almost incredible, did not
our Microscope assure us of it by ocular demonstration
.... — from Robert Hooke, Micrographia,
1665
There are several aspects of Hooke’s statement that reflect
its origins at the start of modern science. Some are quaint,
such as the spelling and obsolete use of Capitalization and
the hyperbolic language (“a thing almost incredible,” which,
to be honest, is true enough, but not a style accepted today
in scientific writing). Hooke worked before the development
of the modern notion of precision. The seeming exactness of
the number 1,259,712,000 for the count of cork cells in a cubic
inch leaves a modern reader to wonder: did he really count
over a billion cells?
It’s easy enough to trace through Hooke’s calculation. The
observation at the base of the calculation is threescore cells
— that’s 60 cells — in 1/18 of an inch. This comes out to
60 × 18 = 1080 cells per linear inch. Modeling each cell as a
little cube allows this to be translated into the number of cells
covering a square inch: 1080² or 1,166,400. To estimate the
number of cells in a cubic inch of cork material, the calculation
is 1080³ or 1,259,712,000.
To find the precision of these estimates, you need to go
back to the precision of the basic observation: 60 cells in
1/18th of an inch. Hooke didn’t specify the precision of this,
but it seems reasonable to think it might be something like
60 ± 5 or so, at a confidence level of 95%.
1. When you change the units of a measurement (say, miles
into kilometers), both the point estimate and the margin
of error are multiplied by the conversion factor.
Translate Hooke’s count of the number of cells in 1/18
inch, 60 ± 5, into a confidence interval on the number of
cells per linear inch.
A 1080 ± 5
B 1080 ± 90
C 1080 ± 180
2. In calculating the number of cells to cover a square
inch, Hooke simply squared the number of cells per inch.
That’s a reasonable approximation.
To carry this calculation through to a confidence interval,
you can’t just square the point estimate and the margin
of error separately. Instead, a reasonable way to proceed
is to take the endpoints of the interval (e.g., 55 to
65 for the count of cells in 1/18 inch), and square those
endpoints. Then convert back to ± format.
What is a reasonable confidence interval for the number
of cells covering a square inch?
A 1,200,000 ± 500,000
B 1,170,000 ± 190,000
C 1,166,000 ± 19,000
D 1,166,400 ± 1,900
3. What is a reasonable confidence interval for the number
of cork cells that fit into a cubic inch?
A 1,300,000,000 ± 160,000,000
B 1,260,000,000 ± 16,000,000
C 1,260,000,000 ± 1,600,000
D 1,259,700,000 ± 160,000
E 1,259,710,000 ± 16,000
F 1,259,712,000 ± 1,600
It’s usually better to write such numbers in scientific
notation, so that the reader doesn’t have to count digits
to make sense of them. For example, 1,260,000,000 ± 16,000,000
might be more clearly written as (1260 ± 16) × 10⁶.
Prob 5.30 After a month’s hard work in the laboratory,
you have measured a growth hormone from each of 40 plants
and computed a confidence interval on the grand mean hormone
concentration of 36 ± 8 ng/ml. Your advisor asks you
to collect more samples until the margin of error is 4 ng/ml.
Assuming the typical 1/√n relationship between the number
of cases in the sample and the size of the margin of error, how
many plants, including the 40 you have already processed,
will you need to measure?
40 80 160 320 640
Prob 5.31 You are calculating the mean of a variable B and
you want to know the standard error, that is, the standard
deviation of the sampling distribution of the mean. Which of
the following statements will estimate the standard error by
bootstrapping?
A sd(do(500)*resample(mean(B)))
B resample(do(500)*mean(sd(B)))
C mean(do(500)*mean(resample(B)))
D sd(do(500)*mean(resample(B)))
E resample(sd(do(500)*mean(B)))
Prob 5.40 In this activity, you are going to look at the
sampling distribution and how it depends on the size of the
sample. This will be done by simulating a sample drawn from
a population with known properties. In particular you’ll be
looking at a variable that is more or less like the distribution
of human adult heights — normally distributed with a mean
of 68 inches and a standard deviation of 3 inches.
Here’s one random sample of size n = 10 from this simulated
population:
> rnorm(10, mean=68, sd=3)
[1] 62.842 71.095 62.357 68.896 67.494
[6] 67.233 69.865 71.664 69.241 70.581
These are the heights of a random sample of n = 10. The
sampling distribution refers to some numerical description of
such data, for example, the sample mean. Consider this sample
mean the output of a single trial.
> mean( rnorm(10, mean=68, sd=3) )
[1] 67.977
If you gave exactly this statement, it’s very likely that
your result was different. That’s because you have a different
random sample — rnorm generates random numbers. And if
you repeat the statement, you’ll likely get a different value
again, for instance:
> mean( rnorm(10, mean=68, sd=3) )
[1] 66.098
Note that both of the sample means above differ somewhat
from the population mean of 68.
The point of examining a sampling distribution is to be
able to see the reliability of a random sample. To do this, you
generate many trials — say, 1000 — and look at the distribution
of the trials.
For example, here’s how to look at the sampling distribution
for the mean of 10 random cases from the population:
> s = do(1000)*mean( rnorm(10, mean=68, sd=3) )
By examining the distribution of the values stored in s, you
can see what the sampling distribution looks like.
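For example, assuming do() has named the column of trials
result (check names(s)), these commands summarize the sampling
distribution:
> mean( result, data=s )
> sd( result, data=s )
> histogram( ~ result, data=s )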
Generate your own sample
1. What is the mean of this distribution?
2. What is the standard deviation of this distribution?
3. What is the shape of this distribution?
Now modify your simulation to look at the sampling distribution
for n = 1000.
4. What is the mean of this distribution?
5. What is the standard deviation of this distribution?
6. What is the shape of this distribution?
Which of these two sample sizes, n = 10 or n = 1000, gave
a sampling distribution that was more reliable? How might
you measure the reliability?
The idea of a sampling distribution applies not just to
means, but to any numerical description of a variable, to the
coefficients on models, etc.
Now modify your computer statements to examine the
sampling distribution of the standard deviation rather
than the mean. Use a sample size of n = 10. (Note: Read
the previous sentence again. The statistic you are asked to
calculate is the sample standard deviation, not the sample
mean.)
7. What is the mean of this distribution?
8. What is the standard deviation of this distribution?
9. What is the shape of this distribution?
Repeat the above calculation of the distribution of the
sample standard deviation with n = 1000.
10. What is the mean of this distribution?
11. What is the standard deviation of this distribution?
12. What is the shape of this distribution?
For this simulation of heights, the population standard deviation
was set to 3. You expect the result from a random
sample to be close to the population parameter. Which of the
two sample sizes, n = 10 or n = 1000 gives results that are
closer to the population value?
Chapter 6
The Language of Models
Reading Questions
• What is an “explanatory variable” and how does it differ
from a “response variable”?
• What is the difference between a “model term” and a
“variable”?
• Why are there sometimes multiple model terms in a
model?
• What is an interaction term and how does it differ from
a variable?
• In graphs of the model response value versus an explanatory
variable, quantitative explanatory variables
are associated with slopes and categorical explanatory
variables are associated with step-like differences. Explain
why.
• How can models that fail to represent the actual
causal connections (if any) between variables still be useful?
Give an example.
Prob 6.01 In McClesky vs Georgia, lawyers presented data
showing that for convicted murderers, a death sentence was
more likely if the victim was white than if the victim was
black. For each case, they tabulated the race of the victim
and the sentence (death or life in prison). Which of the following
best describes the variables in their models?
A Response is quantitative; explanatory variable
is quantitative.
B Response is quantitative; explanatory variable
is categorical.
C Response is categorical; explanatory variable
is quantitative.
D Response is categorical; explanatory variable
is categorical.
E There is no explanatory variable.
[Note: Based on an example from George Cobb.]
Prob 6.02 In studies of employment discrimination, several
attributes of employees are often relevant:
age, sex, race, years of experience, salary, whether
promoted, whether laid off
For each of the following questions, indicate which is the
response variable and which is the explanatory variable.
1. Are men paid more than women?
Response Variable:
age sex race years.experience salary promoted laid.off
Explanatory Variable:
age sex race years.experience salary promoted laid.off
2. On average, how much extra salary is a year of experience
worth?
Response Variable:
age sex race years.experience salary promoted laid.off
Explanatory Variable:
age sex race years.experience salary promoted laid.off
3. Are whites more likely than blacks to be promoted?
Response Variable:
age sex race years.experience salary promoted laid.off
Explanatory Variable:
age sex race years.experience salary promoted laid.off
4. Are older employees more likely to be laid off than
younger ones?
Response Variable:
age sex race years.experience salary promoted laid.off
Explanatory Variable:
age sex race years.experience salary promoted laid.off
[Note: Thanks to George Cobb.]
Prob 6.04 The drawings show some data involving three
variables:
• D — a quantitative variable
• A — a quantitative variable
• G — a categorical variable with two levels: S & K
On each of the plots, sketch a graph of the fitted model function
of the indicated structure.
Draw these models:
• D ~ A+G
• D ~ A*G
• D ~ A-1
• D ~ 1
• D ~ A
• D ~ poly(A,2)
Only a qualitative sketch is needed. It will be good enough
to draw out the graph on a piece of paper, roughly approximating
the patterns of S and K seen in the graph. Then draw
the model values right on your paper. (You can’t hand this
in with AcroScore.)
Example: D ~ G
Prob 6.05 Using your general knowledge about the world,
think about the relationship between these variables:
• speed of a bicyclist.
• steepness of the road, a quantitative variable measured
by the grade (rise over run). 0 means flat, + means
uphill, − means downhill.
• fitness of the rider, a categorical variable with three levels:
unfit, average, athletic.
On a piece of paper, sketch out a graph of speed versus
steepness for reasonable models of each of these forms:
1. Model 1: speed ~ 1 + steepness
2. Model 2: speed ~ 1 + fitness
3. Model 3: speed ~ 1 + steepness+fitness
4. Model 4: speed ~ 1 + steepness+fitness + steepness:fitness
Prob 6.10 The graphic (from the New York Times, April
17, 2008) shows the fitted values from a model of the survival
of babies born extremely prematurely.
Caption: “A new study finds that doctors could
better estimate an extremely premature baby’s
chance of survival by considering factors including
birth weight, length of gestation, sex and whether
the mother was given steroids to help develop the
baby’s lungs.”
Two different response variables are plotted: (1) the probability
of survival and (2) the probability of survival without
moderate to severe disabilities. Remarkably for a statistical
graphic, there are three explanatory variables:
1. Birth weight (measured in pounds (lb) in the graphic).
2. The sex of the baby.
3. Whether the mother took steroids intended to help the
fetus’s lungs develop.
Focus on the survival rates without disabilities — the
darker bars in the graphic.
(a) Estimate the effect of giving steroids, that is, how
much extra survival probability is associated with giving
steroids?
A No extra survival probability with steroids.
B About 1-5 percentage points
C About 10 to 15 percentage points
D About 50 percentage points
E About 75 percentage points
(b) For the babies where the mother was given steroids, how
does the survival probability depend on the birth weight
of the baby:
A No dependence.
B Increases by about 25 percentage points.
C Increases by about 50 percentage points.
D Increases by about 25 percentage points per
pound.
E Increases by about 50 percentage points per
pound.
(c) For the babies where the mother was given steroids, how
does the survival probability depend on the sex of the
baby?
A No dependence.
B Higher for girls by about 15 percentage
points.
C Higher for boys by about 20 percentage
points.
D Higher for girls by about 40 percentage
points.
E Higher for boys by about 40 percentage
points.
(d) How would you look for an interaction between birth
weight and baby’s sex in accounting for survival?
A Compare survival of males to females at a
given weight.
B Compare survival of males across different
weights.
C Compare survival of females across different
weights.
D Compare the difference in survival between
males and females across different weights.
Do you see signs of a substantial interaction between birth
weight and sex in accounting for survival? (Take substantial
to mean “greater than 10 percentage points.”)
Yes No
(e) How would you look for a substantial interaction between
steroid use and baby’s sex in accounting for survival?
A Compare survival of males to females when
the mother was given steroids.
B Compare survival of males between steroid
given and steroid not given.
C Compare survival of females between
steroid given and steroid not given.
D Compare the difference in survival between
males and females between steroid given
and steroid not given.
Do you see signs of a substantial interaction between
steroid use and sex in accounting for survival? (Take
substantial to mean “greater than 10 percentage points.”)
Yes No
Prob 6.11 The graphic in the Figure is part of a report describing
a standardized test for college graduates, the Collegiate
Learning Assessment (CLA). The test consists of several
essay questions which probe students’ critical thinking skills.
Although individual students take the test and receive a
score, the purpose of the test is not to evaluate the students
individually. Instead, the test is intended to evaluate the effect
that the institution has on its students as indicated by
the difference in test scores between 1st- and 4th-year students
(freshmen and seniors). The cases in the graph are
institutions, not individual students.
Council for Aid to Education, “Collegiate Learning Assessment:
Draft Institutional Report, 2005-6,” http://www.cae.org
There are three variables involved in the graphic:
cla The CLA test score (averaged over each institution)
shown on the vertical axis
sat The SAT test score of entering students (averaged over
each institution) shown on the horizontal axis
class Whether the CLA test was taken by freshmen or seniors.
(In the graph: blue for freshmen, red for seniors)
What model is being depicted by the straight lines in the
graph? Give your answer in the standard modeling notation
(e.g., A ~ B+C) using the variable names above. Make sure
to indicate what interaction term, if any, has been included
in the model and explain how you can tell whether the interaction
is or is not there.
Prob 6.12 Time Magazine reported the results of a poll of
people’s opinions about the U.S. economy in July 2008. The
results are summarized in the graph.
[Source: Time, July 28, 2008, p. 41]
The variables depicted in the graph are:
• Pessimism, as indicated by agreeing with the statement
that the U.S. was a better place to live in the 1990s and
will continue to decline.
• Ethnicity, with three levels: White, African American,
Hispanic.
• Income, with five levels.
• Age, with four levels.
Judging from the information in the graph, which of these
statements best describes the model pessimism ~ income?
A Pessimism declines as incomes get higher.
B Pessimism increases as incomes get higher.
C Pessimism is unrelated to income.
Again, judging from the information in the graph, which of
these statements best describes the model pessimism ~ age?
A Pessimism is highest in the 18-29 age group.
B Pessimism is highest in the 64 and older
group.
C Pessimism is lowest among whites.
D Pessimism is unrelated to age.
Poll results such as this are almost always reported using
just one explanatory variable at a time, as in this graphic.
However, it can be more informative to know the effect of one
variable while adjusting for other variables. For example, in
looking at the connection between pessimism and age, it would
be useful to be able to untangle the influence of income.
Prob 6.13 Here is a display constructed using the Current
Population Survey wage data:
Which of the following commands will make this? Each of
the possibilities is a working command, so try them out and
see which one makes the matching plot. Before you start,
make sure to read in the data with
> cps = fetchData("cps.csv")
A bwplot(wage~sex,groups=sector,data=cps)
B bwplot(wage~sex|sector,data=cps)
C bwplot(wage~cross(sex,sector),data=cps)
Prob 6.20 It’s possible to have interaction terms that involve
more than two variables. For instance, one model of
the swimming record data displayed in Chapter 4 was time
~ year*sex. This model design includes an interaction term
between year and sex. This interaction term produces fitted
model values that fall on lines with two different slopes, one
for men and one for women. Now consider a third possible
term to add to the model: a transformation term that is “yes”
when the year is larger than 1948 and “no” when the year
is 1948 or earlier. Call this variable “post-war,” since World
War II ended in 1945 and the Olympic games resumed in 1948.
This can be interpreted to represent the systematic changes
that occurred after the war.
Here is the model of the swimming record data that includes
an intercept term, main terms for year, sex, and postwar,
and interaction terms among all of those: a three-way
interaction.
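As a sketch of how such a model might be fit (assuming the
records are in a data frame swim with variables time, year,
and sex, as in swim100m.csv), the post-war variable can be
constructed as a logical and crossed with the others:
> swim$postwar = swim$year > 1948
> mod = lm( time ~ year * sex * postwar, data=swim )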
A two-way interaction term between sex and year allows
there to be differently sloping lines for men and women. A
three-way interaction term among sex, year, and post-war allows
even more flexibility; the difference between slopes for
men and women can be different before and after the war.
You can see this from the graph. Before 1948, men’s and
women’s slopes are very different. After the war the slopes
are almost the same.
Explain how this graph gives support for the following interpretation:
Before the war, women’s participation in competitive
sports was rapidly increasing. As more women became
involved in swimming, records were rapidly beaten. After the
war, both women and men had high levels of participation
and so new records were the result of better methods of training.
Those methods apply equally to men and women and so
records are improving at about the same rate for both sexes.
Prob 6.21 Consider the following situation. In order to
encourage schools to perform well, a school district hires an
external evaluator to give a rating to each school in the district.
The rating is a single number based on the quality of
the teachers, absenteeism among the teachers, the amount
and quality of the homework the teachers assign, and so on.
To reward those schools that do well, the district gives
a moderate salary bonus to each teacher and a fairly large
budget increase to the school itself.
The next year, the school district publishes data showing
that the students in schools that received the budget increases
had done much better on standardized test scores than the
schools that hadn’t gotten the increases. The school district
argues that this means that increasing budgets overall will
improve performance on standardized tests.
The teachers’ union supports the idea of higher budgets,
but objects to the rating system, saying that it is meaningless
and that teacher pay should not be linked to it. The
Taxpayers League argues that there is no real evidence that
higher spending produces better results. They interpret the
school district’s data as indicating only that the higher ranked
schools are better and, of course, better schools give better results.
Those schools were better before they won the ratings-based
budget increase.
This is a serious problem. Because of the way the school
district collected its data, being a high-rated school is confounded
with getting a higher budget.
A modeling technique for dealing with situations like this
is called threshold regression. Threshold regression models
student test scores at each school as a function of the school
rating, but includes another variable that indicates whether
the school got a budget increase. The budget increase variable
is almost the same thing as the school rating: because
of the way the school district awarded the increases, it is a
threshold transformation of the school rating.
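Here is a minimal sketch of the idea. The data frame schools,
the variables rating and score, and the cutoff of 75 are all
hypothetical stand-ins:
> schools$increase = schools$rating > 75   # threshold transformation of rating
> mod1 = lm( score ~ rating, data=schools )            # main effect only
> mod2 = lm( score ~ rating + increase, data=schools ) # with the threshold term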
The graph shows some data (plotted as circles) from a
simulation of this situation in which the budget increase had
a genuine impact of 50 points in the standardized test. The
solid line shows the model of test score as a function of school
rating, with only the main effect. This model corresponds
to the claim that the threshold has no effect. The solid dots
are the model values from another model, with rating as a
main effect and a threshold transformation of rating that corresponds
to which schools got the budget increase.
Explain how to interpret the models as indicating the effect
of the budget increase. In addition to your explanation,
make sure to give a numerical estimate of how big the effect
is, according to the model as indicated in the graph.
An important statistical question is whether the data provide
good support for the claim that the threshold makes a
difference. (Techniques for answering this question are discussed
later in the book). The answer depends both on the
size of the effect, and how much data is used for constructing
the model. For the simulation here, it turns out that
the threshold model has successfully detected the effect of the
budget increase.
Chapter 7
Model Formulas and Coefficients
Reading Questions
• What is the role of the response variable in a model
formula?
• What is the purpose of constructing indicator variables
from categorical variables?
• How can model coefficients be used to describe relationships?
Relationships between what?
• What is Simpson’s paradox?
• Give an example of how the meaning of a coefficient
of a particular term can depend on what other model
terms are included in the model.
Prob 7.01 There is a correspondence between the model
formula and the coefficients found when fitting a model.
For each of the following model formulas, tell what each
coefficient is:
(a) 3 − 7x + 4y + 17z
Intercept: -7 3 4 17
z coef: -7 3 4 17
y coef: -7 3 4 17
x coef: -7 3 4 17
(b) 1.22 + 0.12 age + 0.27 educ − 0.04 age:educ
Intercept: -0.04 0.12 0.27 1.22
educ coef: -0.04 0.12 0.27 1.22
age coef: -0.04 0.12 0.27 1.22
age:educ coef: -0.04 0.12 0.27 1.22
(c) 8 + 3 colorRed − 4 colorBlue
Intercept: -4 3 8
colorRed coef: -4 3 8
colorBlue coef: -4 3 8
Prob 7.02 For each of the following coefficient reports, tell
what the corresponding model formula is:
(a)
term coef
Intercept 10
x 3
y 5
A x + y
B 1 + x + y
C 10 + 3 + 5
D 10 + 3x + 5y
E 10x + 5y + 3
(b)
term coef
Intercept 4.15
age -0.13
educ 0.55
A age
B age + educ
C 4.15 − 0.13 + 0.55
D 4.15 age − 0.13 educ + 0.55
E 4.15 − 0.13 age + 0.55 educ
Prob 7.04 For some simple models, the coefficients can be
interpreted as grand means, group-wise means, or differences
between group-wise means. In each of the following, A, B, and
C are quantitative variables and color is a categorical variable
with levels “red,”“blue,” and “green.”
(a) The model A ~ color gave these coefficients:
term coefficient
Intercept 10
colorBlue 5
colorGreen 12
What is the mean of A for those cases that are Blue:
5 10 12 15 17 22 27 unknown
What is the mean of A for those cases that are Green:
5 10 12 15 17 22 27 unknown
What is the mean of A for those cases that are Red:
5 10 12 15 17 22 27 unknown
What is the grand mean of A for all cases:
5 10 12 15 17 22 27 unknown
(b) The model B ~ color - 1 gave these coefficients:
term coefficient
colorRed 100
colorBlue -40
colorGreen 35
What is the group mean of B for those cases that are
Blue:
-40 -5 0 35 60 65 100 135 unknown
What is the group mean of B for those cases that are
Red:
-40 -5 0 35 60 65 100 135 unknown
What is the group mean of B for those cases that are
Green:
-40 -5 0 35 60 65 100 135 unknown
What is the grand mean of B for all cases:
-40 -5 0 35 60 65 100 135 unknown
(c) The model C ~ 1 gave this coefficient:
term coefficient
Intercept 4.7
What is the group mean of C for those cases that are
Blue:
0.0 4.7 unknown
What is the grand mean of C for all cases:
0.0 4.7 unknown
Prob 7.05 Using the appropriate data set and correct modeling
statements, compute each of these quantities and give
the model statement you used (e.g., age ~ sex)
(a) From the CPS85 data, what is the mean age of single people?
(Pick the closest answer.)
28 31 32 35 39 years.
What was your model expression?
(b) From the CPS85 data, what is the difference between the
mean ages of married and single people? (Pick the closest
answer.)
A Single people are, on average, 5 years
younger.
B Single people are, on average, 5 years older.
C Single people are, on average, 7 years
younger.
D Single people are, on average, 7 years older.
What was your model expression?
(c) From the SwimRecords data, what is the mean swimming
time for women? (Pick the closest.)
55 60 65 70 75 80 seconds.
What is your model expression?
(d) From the utilities.csv data, what is the mean CCF
for November? (Pick the closest.) (Hint: use
as.factor(month) to convert the month number to a categorical
variable.)
-150 -93 42 150 192
What is your model expression?
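For part (d), the statement might look something like this
sketch, where ccf stands in for whatever the usage variable is
actually called in utilities.csv:
> utils = fetchData("utilities.csv")
> mm( ccf ~ as.factor(month), data=utils )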
Prob 7.10 Here is a graph of the kids’ feet data showing a
model of footwidth as a function of footlength and sex. Both
the length and width variables are measured in cm.
The model values are solid symbols, the measured data
are hollow symbols.
Judging from the graph, what is the model value for a boy
with a footlength of 22 cm?
A 8.0cm
B 8.5cm
C 9.0cm
D 9.5cm
E Can’t tell from this graph.
According to the model, after adjusting for the difference
in foot length, what is the typical difference between the
width of a boy’s foot and a girl’s foot?
A no difference
B 0.25cm
C 0.50cm
D 0.75cm
E 1.00cm
F Can’t tell from this graph.
Judging from the graph, what is a typical size of a residual
from the model?
A 0.10cm
B 0.50cm
C 1.00cm
D 1.50cm
E Can’t tell from this graph.
Prob 7.11 In the swim100m.csv data, the variables are
• time: World record time (in seconds)
• year: The year in which the record was set
• sex: Whether the record is for men or women.
Here are the coefficients from several different fitted models.
41
> lm( time ~ year, data=swim)
Coefficients:
(Intercept) year
567.2420 -0.2599
> lm( time ~ year+sex, data=swim)
Coefficients:
(Intercept) year sexM
555.7168 -0.2515 -9.7980
> lm( time ~ year*sex, data=swim)
Coefficients:
(Intercept) year sexM year:sexM
697.3012 -0.3240 -302.4638 0.1499
> lm( time ~ sex, data=swim)
Coefficients:
(Intercept) sexM
65.19 -10.54
For each of the following, pick the appropriate model from
the set above and use its coefficients to answer the question.
(a) How does the world record time typically change from one
year to the next for both men and women taken together?
-302.4 -10.54 -9.79 -0.2599 -0.2515 -0.324 -0.174
(b) How does the world record time change from one year to
the next for women only?
-302.4 -10.54 -9.79 -0.2599 -0.2515 -0.324 -0.174
(c) How does the world record time change from one year to
the next for men only?
-302.4 -10.54 -9.79 -0.2599 -0.2515 -0.324 -0.174
Prob 7.12 In the SAT data sat.csv, the variables have
these units:
• sat has units of “points.”
• expend has units of “dollars.”
• ratio has units of “students.”
• frac has units of “percentage points.”
Consider the model formula
sat = 994 + 12.29 expend - 2.85 frac
(a) What are the units of the coefficient 994?
A points
B dollars
C students
D percentage points
E points per dollar
F students per point
G points per student
H points per percentage points
(b) What are the units of the coefficient 12.29?
A points
B dollars
C students
D dollars per student
E points per dollar
F students per point
G points per student
(c) What are the units of the coefficient 2.85?
A points
B dollars
C percentage points
D points per dollar
E students per point
F points per student
G points per percentage points
Prob 7.13 The graph shows schematically a possible relationship
between used car price, mileage, and the car model
year.
[Figure: schematic graph of price versus mileage, with separate
lines for model years 2007 and 2005.]
Consider the model price ~ mileage*year.
In your answers, treat year as a simple categorical variable,
and use year 2005 as the reference group when thinking about
coefficients.
(a) What will be the sign of the coefficient on mileage?
A Negative
B Zero
C Positive
D No way to tell from the information given
(b) What will be the sign of the coefficient on model year?
A Negative
B Zero
C Positive
D No way to tell from the information given
(c) What will be the sign of the interaction coefficient?
A Negative
B Zero
C Positive
D There is no interaction coefficient.
E No way to tell from the information given
Prob 7.14 The graph shows schematically a hypothesized
relationship between how fast a person runs and the person’s
age and sex.
[Figure: schematic graph of speed versus age, with separate
curves for men and women.]
Consider the model speed ~ age*sex.
(a) What will be the sign of the coefficient on age?
A Negative
B Zero
C Positive
D No way to tell, even roughly, from the information
given
(b) What will be the sign of the coefficient on sex? (Assume
that the sex variable is an indicator for women.)
A Negative
B Zero
C Positive
(c) What will be the sign of the interaction coefficient?
(Again, assume that the sex variable is an indicator for
women.)
A Negative
B Zero
C Positive
D There is no interaction coefficient.
E No way to tell, even roughly, from the information
given
Prob 7.15 Consider this model of a child’s height as a function
of the father’s height, the mother’s height, and the sex
of the child.
height ~ father*sex + mother*sex
Use the Galton data galton.csv to fit the model and
examine the coefficients. Based on the coefficients, answer the
following:
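A sketch of the fit, assuming the data are fetched as in earlier
problems:
> g = fetchData("Galton")
> mod = lm( height ~ father*sex + mother*sex, data=g )
> coef(mod)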
(a) There are two boys, Bill and Charley. Bill’s father is 1
inch taller than Charley’s father. According to the model,
and assuming that their mothers are the same height, how
much taller should Bill be than Charley?
A They should be the same height.
B 0.01 inches
C 0.03 inches
D 0.31 inches
E 0.33 inches
F 0.40 inches
G 0.41 inches
(b) Now imagine that Bill and Charley’s fathers are the same
height, but that Charley’s mother is 1 inch taller than
Bill’s mother. According to the model, how much taller
should Charley be than Bill?
A They should be the same height.
B 0.01 inches
C 0.03 inches
D 0.31 inches
E 0.33 inches
F 0.40 inches
G 0.41 inches
(c) Now put the two parts together. Bill’s father is one inch
taller than Charley’s, but Charley’s mother is one inch
taller than Bill’s. How much taller is Bill than Charley?
A They should be the same height.
B 0.03 inches
C 0.08 inches
D 0.13 inches
E 0.25 inches
Prob 7.16 The file diamonds.csv contains several variables
relating to diamonds: their price, their weight (in carats),
their color (which falls into several classes — D, E, F, G, H,
I), and so on. The following several graphs show different
models fitted to the data: price is the response variable and
weight and color are the explanatory variables.
[Figure: four graphs, Graph 1 through Graph 4, each showing a
fitted model of price versus carat weight, broken down by color.]
Which model corresponds to which graph?
(a) lm( price~carat + color, data=diamonds)
Which graph? Graph 1 Graph 2 Graph 3 Graph 4
(b) lm( price~carat * color, data=diamonds)
Which graph? Graph 1 Graph 2 Graph 3 Graph 4
(c) lm( price~poly(carat,2) + color, data=diamonds)
Which graph? Graph 1 Graph 2 Graph 3 Graph 4
(d) lm( price~poly(carat,2) * color, data=diamonds)
Which graph? Graph 1 Graph 2 Graph 3 Graph 4
Prob 7.20 The graph shows data on three variables,
SCORE, AGE, and SPECIES. The SCORE and AGE are
quantitative. SPECIES is categorical with levels x and y.
| x
| y
| y x
| y x
SCORE | y x
| y y
| x
| x
|x x
|_______________________________________
AGE
Explain which of the following models is plausibly a candidate
to describe the data. (Don’t do any detailed calculations;
you can’t because the axes aren’t marked with a scale.)
Note SPECIESx means that the case has a level of x for variable
SPECIES. For each model explain in what ways it agrees
or disagrees with the graphed data.
(a) SCORE = 10 - 2.7 AGE + 1.3 SPECIESx
(b) SCORE = 10 + 5.0 AGE - 2 AGE^2 - 1.3 SPECIESx
(c) SCORE = 10 + 5.0 AGE + 2 AGE^2 - 1.3 SPECIESx
(d) SCORE = 10 + 2.7 AGE + 2 AGE^2 - 1.3 SPECIESx +
0.7 AGE * SPECIESx
Enter your answers for all four models here:
Prob 7.21 The graphs below show models values for different
models of the Old Faithful geyser, located in Yellowstone
National Park in the US. The geyser blows water and steam
high in the air in periodic eruptions. These eruptions are
fairly regularly spaced, but there is still variation in the time
that elapses from one eruption to the next.
The variables are
waiting The time from the previous eruption to the current
one
duration The duration of the previous eruption
biggerThan3 A categorical variable constructed from duration,
which depicts simply whether the duration was
greater or less than 3 minutes.
In each case, judge from the shape of the graph which
model is being presented.
• (A) waiting ~ duration
• (B) waiting ~ duration + biggerThan3
• (C) waiting ~ duration*biggerThan3
• (D) waiting ~ biggerThan3
• (E) waiting ~ poly(duration,2)
• (F) waiting ~ poly(duration,2)*biggerThan3
1. A B C D E F
2. A B C D E F
3. A B C D E F
4. A B C D E F
5. A B C D E F
Prob 7.22 Here is a report from the New York Times:
It has long been said that regular physical activity
and better sleep go hand in hand. But only
recently have scientists sought to find out precisely
to what extent. One extensive study published
this year looked for answers by having healthy
children wear actigraphs — devices that measure
movement — and then seeing whether more movement
and activity during the day meant improved
sleep at night.
The study found that sleep onset latency —
the time it takes to fall asleep once in bed —
ranged from as little as roughly 10 minutes for
some children to more than 40 minutes for others.
But physical activity during the day and sleep
onset at night were closely linked: every hour of
sedentary activity during the day resulted in an
additional three minutes in the time it took to
fall asleep at night. And the children who fell
asleep faster ultimately slept longer, getting an
extra hour of sleep for every 10-minute reduction
in the time it took them to drift off. (Anahad
O’Connor, Dec. 1, 2009 — the complete article
is at http://www.nytimes.com/2009/12/01/health/01really.html.)
There are two models described here with two different
response variables: sleep onset latency and duration of sleep.
(a) In the model with sleep onset latency as the response
variable, what is the explanatory variable?
A Time to fall asleep.
B Hours of sedentary activity.
C Duration of sleep.
(b) In the model with duration of sleep as the response variable,
what is the explanatory variable?
A Time to fall asleep.
B Hours of sedentary activity.
C Duration of sleep.
(c) Suppose you are comparing two groups of children. Group
A has 3 hours of sedentary activity each day, Group B has
8 hours of sedentary activity. Which of these statements
is best supported by the article?
A The children in Group A will take, on average,
3 minutes less time to fall asleep.
B The children in Group B will have, on average,
10 minutes less sleep each night.
C The children in Group A will take, on average,
15 minutes less time to fall asleep.
D The children in Group B will have, on average,
45 minutes less sleep each night.
(d) Again comparing the two groups of children, which of
these statements is supported by the article?
A The children in Group A will get, on average,
about an hour and a half of extra sleep
compared to the Group B children.
B The children in Group A will get, on average,
about 15 minutes more sleep than the
Group B children.
C The two groups will get about the same
amount of sleep.
Prob 7.23 Car prices vary. They vary according to the
model of car, the optional features in the car, the geographical
location, and the respective bargaining abilities of the
buyer and the seller.
In this project, you are going to investigate the influence
of at least three variables on the asking price for used cars:
• Model year
• Mileage
• Geographical location
These variables are relatively easy to measure and code.
There are web sites that allow us quickly to collect a lot of
cases. One site that seems easy to use is www.cars.com. Pick
a particular model of car that is of interest to you. Also, pick a
few scattered geographical locations. (At www.cars.com you
can specify a zip code, and restrict your search to cars within
a certain distance of that zip code.)
For each location, use the web site to find prices and the
other variables for 50-100 cars. Record these in a spreadsheet
with five variables: price, model year, mileage, location,
model name. (The model name will be the same for all your
data. Recording it in the spreadsheet will help in combining
data for different types of cars.) You may also choose to
record some other variables of interest to you.
Using your data, build models and make a series of claims
about the patterns seen in used-car prices. Some basic claims
that you should make are in this form:
• Looking just at price versus mileage, the price of car
model XXX falls by 12 cents per mile driven.
• Looking just at price versus age, the price of car model
XXX falls by 1000 dollars per year of age.
• Considering both age and mileage, the price of car model
XXX falls by ...
• Looking at price versus location, the price differs ...
You may also want to look at interaction terms, for example
whether the effect of mileage is modulated by age or
location.
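For instance, once your spreadsheet is read in (mycars is a
hypothetical name for your data frame), claims like those above
come from models such as:
> lm( price ~ mileage, data=mycars )
> lm( price ~ mileage + year, data=mycars )
> lm( price ~ mileage * location, data=mycars )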
Note whether there are outliers in your data and indicate
whether these are having a strong influence on the coefficients
you find.
[Figure: price and other information about used Mazda Miatas
in the Saint Paul, Minnesota area, from www.cars.com.]
Prob 7.30 Here is a news article summarizing a research
study by Bingham et al., “Drinking Behavior from High
School to Young Adulthood: Differences by College Education,”
Alcoholism: Clinical & Experimental Research; Dec.
2005; vol. 29; No. 12
After reading the article, answer these questions:
1. The article headline is about “drinking behavior.”
Specifically, how are they measuring drinking behavior?
2. What explanatory variables are being studied?
3. Are any interactions reported?
4. Imagine that the study was done using a single numerical
indicator of drinking behavior, a number that would
be low for people who drink little and don’t binge drink,
and would be high for heavy and binge drinkers. For a
model with this numerical index of drinking behavior as
the output, what structure of model is implied by the
article?
5. For the model you wrote down, indicate which coefficients
are positive and which negative.
Binge Drinking Is Age-Related Phenomenon
By Katrina Woznicki, MedPage Today Staff Writer December
14, 2005
ANN ARBOR, Mich., Dec. 14 - Animal House notwithstanding,
going to college isn’t an excess risk factor for binge drinking
any more than being 18 to 24 years old, according to researchers
here.
The risks of college drinking may get more publicity, but
the college students are just late starters, Raymond Bingham,
Ph.D., of the University of Michigan and colleagues reported
in the December issue of Alcoholism: Clinical & Experimental
Research.
Young adults in the work force or in technical schools are
more likely to have started binge drinking in high school and
kept it up, they said.
The investigators said the findings indicated that it’s incorrect
to assume, as some do, that young adults who don’t
attend college are at a lower risk for alcohol misuse than college
students.
“The ones who don’t go on to a college education don’t
change their at-risk alcohol consumption,” Dr. Bingham said.
“They don’t change their binge-drinking and rates of drunkenness.”
In their study comparing young adults who went to college
with those who did not, they found that men with only a high
school education were 91% more likely to have greater alcohol
consumption than college students in high school. Men
with only a postsecondary education (such as technical school)
were 49% more likely to binge drink compared with college
students.
There were similar results with females. Women with only
a high school education were 88% more likely to have greater
alcohol consumption than college students.
The quantity and frequency of alcohol consumption increased
significantly from the time of high school graduation
at the 12th grade to age 24 (p < 0.001), investigators reported
in the December issue of Alcoholism: Clinical & Experimental
Research.
College students drank, too, but their alcohol use peaked
later than their non-college peers. By age 24, there was little
difference between the two groups, the research team reported.
“In essence,” said Dr. Bingham, “men and women who
did not complete more than a high-school education had high
alcohol-related risk, as measured by drunkenness and heavy
episodic drinking while in the 12th grade, and remained at
the same level into young adulthood, while levels for the other
groups increased.”
The problem, Dr. Bingham said, is that while it’s easier
for clinicians to target college students, a homogenous population
conveniently located on concentrated college campuses,
providing interventions for at-risk young adults who don’t go
on to college is going to be trickier.
“The kids who don’t complete college are everywhere,” Dr.
Bingham said. “They’re in the work force, they’re in the military,
they’re in technical schools.”
Dr. Bingham and his team surveyed 1,987 young adults
who were part of the 1996 Alcohol Misuse Prevention Study.
All participants had attended six school districts in southeastern
Michigan. They were interviewed when they were in 12th
grade and then again at age 24. All were unmarried and had
no children at the end of the study. Fifty-one percent were
male and 84.3% were Caucasian.
The 1,987 participants were divided into one of three education
status groups: high school or less; post-secondary education
such as technical or trade school or community college,
but not a four-year degree college; and college completion.
The investigators looked at several factors, including
quantity and frequency of alcohol consumption, frequency
of drunkenness, frequency of binge-drinking, alcohol use at
young adulthood, cigarette smoking and marijuana use.
Overall, the men tend to drink more than the women regardless
of education status. The study also showed while
lesser-educated young adults may have started heavier drinking
earlier on, college students quickly caught up.
For example, the frequency of drunkenness increased between
12th grade and age 24 for all groups except for men
and women with only a high school education (p < 0.001).
“The general pattern of change was for lower-education
groups to have higher levels of drunkenness in the 12th grade,
and to remain at nearly the same level while college-completed
men and women showed the greatest increases in drunkenness,”
the authors wrote.
Lesser-educated young adults also started binge-drinking
earlier, but college students, again, caught up. High-school-educated
women were 27% more likely to binge drink than
college women, for example. High-school-educated men were
25% more likely to binge drink than men with post-secondary
education.
But binge-drinking frequency increased 21% more for
college-educated men than post-secondary educated men.
And college women were 48% more likely to have an increase
in binge-drinking frequency than high school-educated
women.
The study also found post-secondary educated men had
the highest frequency of drunken-driving. High school educated
men and women reported the highest frequencies of
smoking in the 12th grade and at age 24 and also showed
the greater increase in smoking prevalence over this period
whereas college-educated men and women had the lowest levels
of smoking.
Then at age 24, the investigators compared those who were
students to those who were working and found those who were
working were 1.5 times more likely to binge drink (p < 0.003),
1.3 times more likely to be in the high drunkenness group
(p < 0.018), and were 1.5 times more likely to have a greater
quantity and frequency of alcohol consumption (p < 0.005).
“The transition from being a student to working, and the
transition from residing with one’s family of origin to another
location could both partially explain differences in patterns,”
the authors wrote.
Dr. Bingham said the findings reveal that non-college
attending young adults “experience levels of risk that equal
those of their college-graduating age mates.”
Prob 7.31 For the simple model A ~ G, where G is a categorical
variable, the coefficients will be group means. More
precisely, there will be an intercept that is the mean of one of
the groups and the other coefficients will show how the mean
of the other groups each differ from the reference group.
Similarly, when there are two grouping variables, G and H,
the model A ~ G + H + G:H (which can be abbreviated A
~ G*H) will have coefficients that are the group-wise means
of the crossed groups. Perhaps “subgroup-wise means” is more
appropriate, since there will be a separate mean for each subgroup
of G divided along the lines of H. The interaction term
G:H allows the model to account for the influence of H separately
for each level of G.
However, the model A ~ G + H does not produce coef-
ficients that are group means. Because no interaction term
has been included, this model cannot reflect how the effect
of H differs depending on the level of G. Instead, the model
coefficients reflect the influence of H as if it were the same for
all levels of G.
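As a sketch, with a hypothetical data frame dat containing a
quantitative variable A and categorical variables G and H, the
three model shapes are:
> lm( A ~ G, data=dat )      # intercept is one group's mean; others are differences
> lm( A ~ G*H, data=dat )    # coefficients encode all the subgroup means
> lm( A ~ G + H, data=dat )  # no interaction term, so not subgroup means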
To illustrate these different models, consider some simple
data.
Suppose that you found in the literature an article about
the price of small pine trees (either Red Pine or White Pine)
of different heights in standard case/variable format, which
would look like this:
Case # Color Height Price
1 Red Short 11
2 Red Short 13
3 White Tall 37
4 White Tall 35
and so on ...
Commonly in published papers, the raw case-by-case data
isn’t reported. Rather some summary of the raw data is presented.
For example, there might be a summary table like
this:
SUMMARY TABLE: Mean Price
                      Color
Height          Red    White   Both Colors
Short           $12    $18     $15
Tall            $20    $34     $27
Both Heights    $16    $26     $21
The table gives the mean price of a sample of 10 trees in
each of the four overall categories (Tall and Red, Tall and
White, Short and Red, Short and White). So, the ten Tall
and Red pines averaged $20, the ten Short and White pines
averaged $18, and so on. The margins show averages over
larger groups. For instance, the 20 white pines averaged $26,
while the 20 short pines averaged $15.
The average price of all 40 trees in the sample was $21.
Based on the summary table, answer these questions:
1. In the model price ~ color, which involves the coeffi-
cients “intercept” and “colorWhite”, what will be the
values of the coefficients?
Intercept 12 15 16 18 20 21 26 27 34
colorWhite -10 -8 0 5 8 10
2. In the model price ~ height, which involves the coef-
ficients “intercept” and “heightTall”, what will be the
values of the coefficients?
Intercept 0 4 8 12 15 16 18 20 21 26 27 34
heightTall 0 4 8 12 15 16 18 20 21 26 27 34
3. The model price ~ height * color, with an interaction
between height and color, has four coefficients and therefore
can produce an exact match to the prices of the
four different kinds of trees. But they are in a different
format: not just one coefficient for each kind of
tree. What are the values of these coefficients from the
model? (Hint: Start with the kind of tree that corresponds
to the intercept term.)
Intercept 0 4 6 8 10 12 16
heightTall 0 4 6 8 10 12 16
colorWhite 0 4 6 8 10 12 16
heightTall:colorWhite 0 4 6 8 10 12 16
4. The model price ~ height + color gives these three coefficients:
Intercept : 10
heightTall : 12
colorWhite : 10
It would be hard to figure out these coefficients by hand
because they can’t be read off from the summary table
of Mean Price.
According to the model, what are the fitted model values
for these trees:
Short Red 10 12 15 16 20 22 32 34
Short White 10 12 15 16 20 22 32 34
Tall Red 10 12 15 16 20 22 32 34
Tall White 10 12 15 16 20 22 32 34
Notice that the fitted model values aren’t a perfect
match to the numbers in the table. That’s because a
model with three coefficients can’t exactly reproduce a
set of four numbers.
Chapter 8
Fitting Models to Data
Reading Questions
• What is a residual? Why does it make sense to make
them small when fitting a model?
• What is “least squares”?
• What does it mean to “partition variability” using a
model?
• How can a model term be redundant? Why are redundant
terms a problem?
Prob 8.01 Here are some (made-up) data from an experiment
growing trees. The height was measured for trees in
different locations that had been watered and fertilized in different
ways.
height water light compost nitrogen
5 2 shady none little
4 1 bright none lot
5 1.5 bright some little
6 3 shady rich lot
7 3 bright some little
6 2 shady rich lot
(a) In the model expression height ~ water, which is the explanatory
variable?
A height
B water
C light
D compost
E Can’t tell from this information.
(b) Ranger Alan proposes the specific model formula
height = 2 × water + 1.
Copy the table to a piece of paper and fill in the table
showing the model values and the residuals.
height water model values resids
5 2
4 1
5 1.5
6 3
7 3
6 2
(c) Ranger Bill proposes the specific model formula
height = water + 3.
Again, fill in the model values and residuals.
height water model values resids
5 2
4 1
5 1.5
6 3
7 3
6 2
(d) Based on your answers to the previous two parts, which
of the two models is better? Give a specific definition of
“better” and explain your answer quantitatively.
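These calculations can also be done directly in R (a sketch
using the data above):
> height = c(5, 4, 5, 6, 7, 6)
> water = c(2, 1, 1.5, 3, 3, 2)
> resids.alan = height - (2*water + 1)    # residuals for Ranger Alan's model
> resids.bill = height - (water + 3)      # residuals for Ranger Bill's model
> sum(resids.alan^2); sum(resids.bill^2)  # one way to compare in part (d)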
(e) Write down the set of indicator variables that arise from
the categorical variable compost.
(f) The fitted values are exactly the same for the two models
water ~ compost and water ~ compost-1. This suggests
that the 1 vector (1, 1, 1, 1, 1, 1) is redundant with the set
of indicator variables due to the variable compost. Explain
why this redundancy occurs. Is it because of something
special about the “compost” variable?
(g) Estimate, as best you can using only very simple calculations,
the coefficients on the model water ~ compost-1.
(Note: there is no intercept term in this model.)
(h) Ranger Charley observes that the following model is
perfect because all of the residuals are zero.
height ~ 1+water+light+compost+nitrogen
Charley believes that using this model will enable him to
make excellent predictions about the height of trees in the
future. Ranger Donald, on the other hand, calls Charley’s
regression “ridiculous rot” and claims that Charley’s explanatory
terms could fit perfectly any set of 6 numbers.
Donald says that the perfect fit of Charley’s model does
not give any evidence that the model is of any use whatsoever.
Who do you think is right, Donald or Charley?
Prob 8.02 Which of these statements will compute the sum
of square residuals of the model stored in the object mod?
A resid(mod)
B sum(resid(mod))
C sum(resid(mod))^2
D sum(resid(mod)^2)
E sum(resid(mod^2))
F None of the above.
Prob 8.04 Here is a simple model that relates foot width to
length in children, fit to the data in kidsfeet.csv:
> kids = fetchData("kidsfeet.csv")
> mod = lm( width ~ length, data=kids)
> coef(mod)
(Intercept) length
2.8623 0.2479
(a) Using the coefficients, calculate the predicted foot width
from this model for a child with foot length 27cm.
2.86 3.10 7.93 9.12 9.56 12.24 28.62
(b) The sum of squares of the residuals from the model provides
a simple indication of how far typical values are
from the model. In this sense, the standard deviation of
the residuals tells us how much uncertainty there is in the
prediction. (Later on, we’ll see that another term needs
to be added to this uncertainty.) What is the sum of
squares of the residuals?
4.73 5.81 5.94 6.10 6.21
(c) What is the sum of squares of the fitted values for the
kids in kidsfeet.csv?
42.5 286.3 3157.7 8492.0 15582.1
(d) What is the sum of squares of the foot widths for the kids
in kidsfeet.csv.
3163.5 3167.2 3285.1 3314.8 3341.7
(e) There is a simple relationship between the sum of squares
of the response variable, the residuals, and the fitted values.
You can confirm this directly. Which of the following
R statements is appropriate to do this:
A sum(kids$width)-(sum(resid(mod))+
sum(fitted(mod)))
B sum(kids$width^2)-(sum(resid(mod)
^2)+sum(fitted(mod)^2))
C sum(resid(mod))-sum(fitted(mod))
D sum(resid(mod)^2)-sum(fitted(mod)
^2)
Note: It might seem natural to use the == operator to
compare the equality of two values, for instance A==B.
However, arithmetic on the computer is subject to small
round-off errors, too small to be important when looking
at the quantities themselves but sufficient to cause the ==
operator to say the quantities are different. So, it’s usually
better to compare numbers by subtracting one from
the other and checking whether the result is very small.
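For instance, this comparison fails even though the arithmetic
looks exact:
> 0.1 + 0.2 == 0.3                  # FALSE, because of round-off error
> abs( (0.1 + 0.2) - 0.3 ) < 1e-10  # TRUE: compare by subtracting instead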
Prob 8.05 Consider the data collected by Francis Galton
in the 1880s, stored in a modern format in the galton.csv
file. In this file, height is the variable containing the child’s
height, while the father’s and mother’s heights are contained
in the variables father and mother. The family variable is
a numerical code identifying children in the same family; the
number of kids in this family is in nkids.
> galton = fetchData("Galton")
> lm( height ~ father, data=galton)
Coefficients:
(Intercept) father
39.1104 0.3994
(a) What is the model’s prediction for the height of a child
whose father is 72 inches tall? 67.1 67.4 67.9 68.2
(b) Construct a model using both the father’s and mother’s
heights, using just the main effect but not including their
interaction. What is the model’s prediction for the height
of a child whose father is 72 inches tall and mother is 65
inches tall? 67.4 68.1 68.9 69.2
(c) Construct a model using mother and father’s height, including
the main effects as well as the interaction. What
is the model’s prediction for the height of a child whose
father is 72 inches tall and mother is 65 inches tall?
67.4 68.1 68.9 69.2
Galton did not have our modern techniques for including
multiple variables into a model. So, he tried an expedient,
defining a single variable, “mid-parent,” that reflected
both the father’s and mother’s height. We can mimic this
approach by defining the variable in the same way Galton
did:
> galton = transform( galton, midparent=(father+1.08*mother)/2 )
Galton used the multiplier of 1.08 to adjust for the fact
that the mothers were, as a group, shorter than the fathers.
Fit a model to the Galton data using the mid-parent variable
and child’s sex, using both the main effects and the
interaction. This will lead to a separate coefficient on
mid-parent for male and female children.
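One way to set up this model (a sketch, with midparent
defined as above):
> mod = lm( height ~ midparent * sex, data=galton )
> coef(mod)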
(d) What is the predicted height for a girl whose father is 67
inches and mother 64 inches?
63.6 63.9 64.2 65.4 65.7
The following questions are about the size of the residuals
from models.
(e) Without knowing anything about a randomly selected
child except that he or she was in Galton’s data set, we
can say that the child’s height is a random variable with a
certain mean and standard deviation. What is this standard
deviation?
2.51 2.73 2.95 3.44 3.58 3.67 3.72
(f) Now consider that we are promised to be told the sex
of the child, but no other information. We are going to
make a prediction of the child’s height once we get this
information, and we are asked to say, ahead of time, how
good this prediction will be. A sensible way to do this is
to give the standard deviation of the residuals from the
best fitting model based on the child’s sex. What is this
standard deviation of residuals?
2.51 2.73 2.95 3.44 3.58 3.67 3.72
Prob 8.10 The “modern physics” course has a lab where
students measure the speed of sound. The apparatus consists
of an air-filled tube with a sound generator at one end and a
microphone that can be set at any specified position within
the tube. Using an oscilloscope, the transit time between the
sound generator and microphone can be measured precisely.
Knowing the position p and transit time t allows the speed of
sound v to be calculated, based on the simple model:
distance = velocity × time or p = vt.
Here are some data recorded by a student group calling
themselves “CDT”.
position transit time
(m) (millisec)
0.2 0.6839
0.4 1.252
0.6 1.852
0.8 2.458
1.0 3.097
1.2 3.619
1.4 4.181
Part 1.
Enter these data into a spreadsheet in the standard case-variable
format. Then fit an appropriate model. Note that
the relationship p = vt between position, velocity, and time
translates into a statistical model of the form p ~ t - 1 where
the velocity will be the coefficient on the t term.
What are the units of the model coefficient corresponding
to velocity, given the form of the data in the table above?
A meters per second
B miles per hour
C millimeters per second
D meters per millisecond
E millimeters per millisecond
F No units. It’s a pure number.
G No way to know from the information provided.
Compare the velocity you find from your model fit to the
accepted velocity of sound (at room temperature, at sea level,
in dry air): 343 m/s. There should be a reasonable match.
If not, check whether your data were entered properly and
whether you specified your model correctly.
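If you prefer, the data can also be typed directly into R
rather than going through a spreadsheet (a sketch; the variable
names position and time are our choice):
> snd = data.frame(
+   position = c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4),
+   time = c(0.6839, 1.252, 1.852, 2.458, 3.097, 3.619, 4.181) )
> mod = lm( position ~ time - 1, data=snd )
> coef(mod)   # the coefficient on time is the velocity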
Part 2.
The students who recorded the data wrote down the transit
time to 4 digits of precision, but recorded the position to
only 1 or 2 digits, although they might simply have left off
the trailing zeros that would indicate a higher precision.
Use the data to find out how precise the position measurement
is. To do this, make two assumptions that are very
reasonable in this case:
1. The velocity model is highly accurate, that is, sound
travels at a constant velocity through the tube.
2. The transit time measurements are correct. This assumption
reflects current technology. Time measurements
can be made very precisely, even with inexpensive
equipment.
Given these assumptions, you should be able to calculate the
position from the transit time and velocity. If the measured
position differs from this model value — as reflected by the
residuals — then the measured position is imprecise. So, a
reasonable way to infer the precision of the position is by the
typical size of residuals.
How big is a typical residual? One appropriate way to
measure this is with the standard deviation of the residuals.
• Give a numerical value for this.
0.001 0.006 0.010 0.017 0.084 0.128
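In R, with mod the model fitted in Part 1, this is simply:
> sd( resid(mod) )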
Part 3.
The students’ lab report doesn’t indicate how they know
for certain that the sound generator is at position zero. One
way to figure this out is to measure the generator’s position
from the data themselves. Denoting the actual position of the
sound generator as p0, the equation relating position and
transit time is
p - p0 = vt, or p = p0 + vt
This suggests fitting a model of the form p ~ 1 + t, where
the coefficient on 1 will be p0 and the coefficient on t will be
v.
Fit this model to the data.
• What is the estimated value of p0?
-0.032 -0.012 0.000 0.012 0.032
Notice that adding new terms to the model reduces the
standard deviation of the residuals.
• What is the new value of the standard deviation of the
residuals?
0.001 0.006 0.010 0.017 0.084 0.128
Compare the estimated speed of sound found from the
model p ~ 1 + t to the established value: 343 m/s. Notice that
the estimate is better than the one from the model p ~ t - 1
that didn’t take into account the position of the sound generator.
Prob 8.11 The graph shows some data on natural gas usage
(in ccf) versus temperature (in deg. F) along with a model of
the relationship.
(a) What are the units of the residuals from a model in which
natural gas usage is the response variable?
ccf degF ccf.per.degF none
(b) Using the graph, estimate the magnitude of a typical
residual, that is, approximately how far a typical case is
from the model relationship. (Ignore whether the residual
is positive or negative. Just consider how far the case is
from the model, whether it be above or below the model
curve.)
2ccf 20ccf 50ccf 100ccf
(c) There are two cases that are outliers with respect to the
model relationship between the variables. Approximately
how big are the residuals in these two cases?
2ccf 20ccf 50ccf 100ccf
Now ignore the model and focus just on those two outlier
cases and their relationship to the other data points.
(d) Are the two cases outliers with respect to natural gas usage?
True or False
(e) Are the two cases outliers with respect to temperature?
True or False
Prob 8.12 It can be helpful to look closely at the residuals
from a model. Here are some things you can easily do:
1. Look for outliers in the residuals. If they exist, it can be
worthwhile to look into the cases involved more deeply.
They might be anomalous or misleading in some way.
2. Plot the residuals versus the fitted model values. Ideally
there should be no evident relationship between the
two — the points should be a random scatter. When
there is a strong relationship, even though it might be
complicated, the model may be missing some important
term.
3. Plot the residuals versus the values of an important explanatory
variable. (If there are multiple explanatory
variables, there would be multiple plots to look at.)
Again, ideally there should be no evident relationship.
If there is, there is something to think about.
Using the world-record swim data, swim100m.csv, construct
the model time ~ year + sex + year:sex (a setup sketch
follows the list below). This model
captures some of the variability in the record times, but
doesn’t reflect something that’s obvious from a plot of the
data: that records improved quickly in the early years (especially
for women) but the improvement is much slower in
recent years. The point of this exercise is to show how the
residuals provide information about this.
• Find the cases in the residuals that are outliers. Explain
what it is about these cases that fits in with the
failure of the model to reflect the slowing improvement
in world records.
• Plot the residuals versus the fitted model values. What
pattern do you see that isn’t consistent with the idea
that the residuals are unrelated to the fitted values?
• Plot the residuals versus year. Describe the pattern you
see.
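A sketch of the setup, using the lattice-style plotting that the
mosaic package provides:
> swim = fetchData("swim100m.csv")
> mod = lm( time ~ year + sex + year:sex, data=swim )
> xyplot( resid(mod) ~ fitted(mod) )      # residuals versus fitted values
> xyplot( resid(mod) ~ year, data=swim )  # residuals versus year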
Now use the kids-feet data kidsfeet.csv and the model
width ~ length + sex + length:sex.
Look at the residuals in the three suggested ways. Are
there any outliers? Describe any patterns you see in relationship
to the fitted model values and the explanatory variable
length.
Prob 8.20 We’re going to use the ten-mile-race data to explore
the idea of redundancy: why redundancy is a problem
and what we can do about it.
Read in the data:
> run = fetchData("ten-mile-race.csv")
The data includes information about the runner’s age and
sex, as well as the time it took to run the race.
I’m interested in how computer and cell-phone use as a
child may have affected the runner’s ability. I don’t have any
information about computer use, but as a rough proxy, I’m
going to use the runner’s year of birth. The assumption is
that runners who were born in the 1950s, 60s, and 70s, didn’t
have much chance to use computers as children.
Add in a new variable: yob. We’ll approximate this as the
runner’s age subtracted from the year in which the race was
run: 2005. That might be off by a year for any given person,
but it will be pretty good.
> run$yob = 2005 - run$age
Each of the following models has two terms.
mod1 = lm( net ~ age + yob - 1, data=run)
mod2 = lm( net ~ 1 + age, data = run)
mod3 = lm( net ~ 1 + yob, data=run )
• Fit each of the models and interpret the coefficients
in terms of the relationship between age and year of
birth and running time. Then look at the R² and the
sum of square residuals in order to decide which is the
better model.
Using special software that you don’t have, I have fitted a
model — I’ll call it mod4 — with all three terms: the intercept,
age, and year of birth. The model coefficients are:
My Fantastic Model: mod4
Intercept age yob
-20050 20.831891052 12.642004612
My conclusion, based on the mod4 coefficients, is that people
slow down by 20.8 seconds for every year they age. Making
up for this, however, is the fact that people who were born
earlier in the last century tend to run slower by 12.6 seconds
for every year later they were born. Presumably this is because
those born earlier had less opportunity to use computers
and cell phones and therefore went out and did healthful,
energetic, physical play rather than typing.
• Using these coefficients, calculate the model values. The
statement will look like this:
mod4vals = -20050 + 20.831891052*run$age +
12.642004612*run$yob
Calculate the residuals from mod4 by subtracting the
model values from the response variable (net running
time). Compare the size of the residuals using a sum of
squares or a standard deviation or a variance to the size
of the residuals from models 1 through 3. Judging from
this, which is the better model?
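A minimal sketch of that comparison:
> mod4resids = run$net - mod4vals
> sum( mod4resids^2 )   # compare with, e.g., sum( resid(mod1)^2 )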
I needed special software to find the coefficients in mod4
because R won’t do it. See what happens when you try the
models with three terms, like this:
lm( net ~ 1 + age + yob, data=run )
lm( net ~ 1 + yob + age, data=run )
Can you get three coefficients from the R software?
I’m very pleased with mod4 and the special methods I
used to find the coefficients.
Unfortunately, my statistical arch-enemy, Prof. Nalpak
Ynnad, has proposed another model. He claims that computer
and cell-phone use is helpful. According to his twisted
theory, people actually run faster as they get older. Impossible!
But look at his model coefficients.
Ynnad’s Evil Model: mod5
Intercept age yob
60150 -19.16810895 -27.35799539
Ynnad’s ridiculous explanation is that the natural process
of aging (that you run faster as you age), is masked by the
beneficial effects of exposure to computers and cell phones as
a child. That is, today’s kids would be even slower (because
they are young) except for the fact that they use computers
and cell phones so much. Presumably, when they grow up,
they will be super fast, benefiting both from their advanced
age and from the head start they got as children from their
exposure to computers and cell phones.
• Looking at Ynnad’s model in terms of the R² or size
of residuals, how does it compare to my model? Which
one should you believe?
• Give an explanation of why both my model and Ynnad’s
model are bogus. See if you can also explain why
we shouldn’t take the coefficients in mod1 at
face value.
Chapter 9
Correlation and Partitioning of Variance
Reading Questions
• How does R² summarize the extent to which a model
has captured variability?
• What does it mean for one model to be nested in another?
• How does the correlation coefficient differ from R²?
Prob 9.01 The R² statistic is the ratio of the variance of
the fitted values to the variance of the response variable.
Using the kidsfeet.csv data:
1. Find the variance of the response variable in the model
width ~ sex + length + sex:length .
0.053 0.119 0.183 0.260 0.346
2. Find the variance of the fitted values from the model.
0.053 0.119 0.183 0.260 0.346
3. Compute R² as the ratio of these two variances.
0.20 0.29 0.46 0.53 0.75
4. Is this the same as the “Multiple R-squared” given in
the summary(mod) report? Yes No
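A sketch of the computation:
> kids = fetchData("kidsfeet.csv")
> mod = lm( width ~ sex + length + sex:length, data=kids )
> var( fitted(mod) ) / var( kids$width )  # R-squared as a ratio of variances
> summary(mod)   # look for "Multiple R-squared"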
Prob 9.02 The variance of a response variable A is 145 and
the variance of the residuals from the model A ~ 1+B is 45.
• What is the variance of the fitted model values?
45 100 145 190 Cannot tell
• What is the R² for this model?
0 45/145 100/145 100/190 145/190 Cannot tell
Prob 9.04 For each of the following pairs of models, mark
the statement that is most correct.
Part 1
Model 1 . A ~ B+C
Model 2 . A ~ B*C
A Model 1 is nested in Model 2.
B Model 2 is nested in Model 1.
C The two models are the same.
D None of the above is true.
Part 2
Model 1 . A ~ B
Model 2 . B ~ A
A Model 1 is nested in Model 2.
B Model 2 is nested in Model 1.
C The two models are the same.
D None of the above is true.
Part 3
Model 1 . A ~ B + C
Model 2 . B ~ A * C
A Model 1 is nested in Model 2.
B Model 2 is nested in Model 1.
C The two models are the same.
D None of the above is true.
Part 4
Model 1 . A ~ B + C + B:C
Model 2 . A ~ B * C
A Model 1 is nested in Model 2.
B Model 2 is nested in Model 1.
C The two models are the same.
D None of the above is true.
Prob 9.05 For each of the following pairs of models, mark
the statement that is most correct.
Part 1
Model 1 . A ~ B+C
Model 2 . A ~ B*C
A Model 1 can have a higher R² than Model 2.
B Model 2 can have a higher R² than Model 1.
C The R² values will be the same.
D None of the above are necessarily true.
Part 2
Model 1 . A ~ B + C
Model 2 . B ~ A * C
A Model 1 can have a higher R² than Model 2.
B Model 2 can have a higher R² than Model 1.
C The R² values will be the same.
D None of the above are necessarily true.
Part 3
Model 1 . A ~ B + C + B:C
Model 2 . A ~ B * C
A Model 1 can have a higher R² than Model 2.
B Model 2 can have a higher R² than Model 1.
C The R² values will be the same.
D None of the above are necessarily true.
[From a suggestion by a student]
Going further
In answering this question, you might want to try out a
few examples using real data: just pick two quantitative variables
to stand in for A and B.
What will be the relationship between R² for the following
two models?
Model 1 . A ~ B
Model 2 . B ~ A
A Model 1 can have a higher R² than Model 2.
B Model 2 can have a higher R² than Model 1.
C The R² values will be the same.
D None of the above are necessarily true.
Prob 9.10 Which of the following statements is true about
R²?
1. True or False R² will never go down when you add
an additional explanatory term to a model.
2. For a perfectly fitting model,
A R² is exactly zero.
B R² is exactly one.
C Neither of the above.
3. In terms of the variances of the fitted model values, the
residuals, and the response variable, R² is the:
A Variance of the residuals divided by the
variance of the fitted.
B Variance of the response divided by the
variance of the residuals.
C Variance of the fitted divided by the variance
of the residuals.
D Variance of the fitted divided by the variance
of the response.
E Variance of the response divided by the
variance of the fitted.
Prob 9.11 Consider models with a form like this
> lm( response ~ 1, data=whatever)
The R² of such a model will always be 0. Explain why.
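You can see this concretely with any data set (a sketch):
> mod = lm( width ~ 1, data=fetchData("kidsfeet.csv") )
> var( fitted(mod) )   # zero: every fitted value is the same number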
Prob 9.12 Consider the following models where a response
variable A is modeled by explanatory variables B, C, and D.
1 A ~ B
2 A ~ B + C + B:C
3 A ~ B + C
4 A ~ B * C
5 A ~ B + D
6 A ~ B*C*D
Answer the following:
(a) Model 1 is nested in model 2. True or False
(b) Model 5 is nested in model 3. True or False
(c) Model 1 is nested in model 3. True or False
(d) Model 5 is nested in model 1. True or False
(e) Model 2 is nested in model 3. True or False
(f) Model 3 is nested in model 4. True or False
(g) All the other models are nested in model 6. True or
False
Prob 9.13 Consider two models, Model 1 and Model 2, with
the same response variable.
1. Model 1 is nested in Model 2 if the variables/model
terms of Model 1 are a subset of those of Model 2.
2. True or False If Model 1 is nested in Model 2, then
Model 1 cannot have a higher R² than Model 2.
3. Which of the following are nested in A ~ B*C + D?
True or False A ~ B
True or False A ~ B + D
True or False B ~ C
True or False A ~ B+C+D
True or False A ~ B*D + C
True or False A ~ D
Prob 9.21 Here is a set of models:
Model A: wage ~ 1
Model B: wage ~ age + sex
Model C: wage ~ 1 + age*sex
Model D: wage ~ educ
Model E: wage ~ educ + age - 1
Model F: wage ~ educ:age
Model G: wage ~ educ*age*sex
You may want to try fitting each of the models to the
Current Population Survey data cps.csv to make sure you
understand how the * shorthand for interaction and main effects
expands to a complete set of terms. That way you can
see exactly which coefficients are calculated for any of the
models.
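For instance (a sketch):
> cps = fetchData("cps.csv")
> coef( lm( wage ~ educ*age*sex, data=cps ) )  # see how * expands into main effects and interactions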
Answer the following:
1. B is nested in A. True or False
2. D is nested in E. True or False
3. B is nested in C. True or False
4. All of the models A-F are nested in G. True or False
5. D is nested in F. True or False
6. At least one of the models A-G is nested in educ ~ age.
True or False
Prob 9.22 A data set on US Congressional Districts (provided
by Prof. Julie Dolan), congress.csv, contains information
on the population of each congressional district in 2004.
There are 436 districts listed (corresponding to the 435 voting
members of the House of Representatives from the 50 states
and an additional district for Washington, D.C., whose citizens
have only a non-voting “representative”).
The US Supreme Court (Reynolds v. Sims, 377 US 533,
1964) ruled that state legislature districts had to be roughly
equal in population: the one-person one-vote principle. Before
this ruling, some states had grossly unequally sized districts.
For example, one district in Connecticut for the state General
Assembly had 191 people, while another district in the same
state had 81,000. Los Angeles County had one representative
in the California State Senate for a population of six million,
while another county with only 14,000 residents also had one
representative.
Of course, exact equality of district sizes is impossible
in every district, since districts have geographically defined
boundaries and the population can fluctuate within each
boundary. The Supreme Court has written, “... mathematical
nicety is not a constitutional requisite...” and “so long as the
divergences from a strict population standard are based on
legitimate considerations incident to the effectuation of a rational
state policy, some deviations from the equal-population
principle are constitutionally permissible ....” (Reynolds v.
Sims)
The situation in the US House of Representatives is more
complicated, since congressional districts are required to be
entirely within a single state.
Let’s explore how close the districts for the US House
of Representatives come to meeting the one-person one-vote
principle.
One way to evaluate how far districts are from equality of
population size is to examine the standard deviation across
districts.
• What is the standard deviation of the district populations
across the whole US?
4823 9468 28790 342183 540649
Another way to look at the spread is to try to account for
the differences in populations by modeling them and looking
at how much of the difference remains unexplained.
Let’s start with a very simple model that treats all the
districts as the same: population ~ 1.
What is the meaning of the single coefficient from this
model?
A The mean district population across all
states.
B The mean district population across all districts.
C The median population across all districts.
D The median population across all states.
E None of the above.
Calculate the standard deviation of the residuals. How
does this compare to the standard deviation of the district
population itself?
A It’s much larger.
B It’s somewhat larger.
C It’s exactly the same.
D It’s much smaller.
Now model the district size by the state: population ~
1 + state.
What is the standard deviation of the residuals from this
model?
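A sketch of these calculations, assuming the district population
variable is named population and the state variable state
(check names(congress) for the actual names):
> congress = fetchData("congress.csv")
> sd( congress$population )             # spread across all districts
> mod = lm( population ~ 1 + state, data=congress )
> sd( resid(mod) )                      # spread left over within states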
A box plot of the residuals shows a peculiar pattern. What
is it?
A The residuals are all the same.
B Every residual is an outlier.
C The residuals are almost all very close to
zero, except for a few outliers.
The variable state accounts for almost all of the variability
from district to district. That is, districts within a state are
almost exactly the same size, but that size differs from state
to state. Why is there a state-to-state difference? The number
of districts within a state must be a whole number: 1, 2,
3, and so on. Ideally, the district populations are the state
population divided by the number of districts. The number
of districts is set to make the district population as even
as possible between states, but exact equality isn’t possible
since the state populations differ. Notice that the largest and
smallest districts (Montana and Wyoming, respectively) are
in states with only a single district. Adding a second district
to Montana would dramatically reduce the district size below
the national mean. And even though Wyoming has a very
low-population district, it’s impossible to take a district away
since Wyoming only has one.
Prob 9.23 Consider this rule of thumb:
In comparing two models based on the same
data, the model with the larger R² is better than
the model with the smaller R².
Explain what makes sense about this rule of thumb and
also what issues it might be neglecting.
Chapter 10
Total and Partial Relationships
Reading Questions
• What is a covariate? Why use a special word for it when
it is just a variable?
• What is the difference between a partial change and a
total change?
• In the experimental method, how are covariates dealt
with?
• What is the modeling approach to dealing with covariates?
• What is Simpson’s paradox?
Prob 10.01 Consider the data set on kids’ feet in
kidsfeet.csv.
First, you’re going to look at a total change. Build a model
of foot width as a function of foot length: width ~ length. Fit
this model to the kids’ feet data.
According to this model, how much does the typical width
change when the foot length is increased from 22 to 27 cm?
0.187 0.362 0.744 0.953 1.060 1.105 1.240 1.487
This is a total change, because it doesn’t hold any other
variable constant, e.g. sex. That might sound silly, since obviously
a kid’s sex doesn’t change as his or her foot grows.
But the model doesn’t know that. It happens that most of
the kids with foot lengths near 22 cm are girls, and most of
the kids with foot lengths near 27 cm are boys. So when you
compare feet with lengths of 22 and 27, you are effectively
changing the sex at the same time as you change the foot
length.
To look at a partial change, holding sex constant, you need
to include sex in the model. A simple way to do this is width
~ length + sex. Using this model fitted to the kids’ feet data,
how much does a typical foot width change if the foot length
is increased from 22 to 27 cm?
0.187 0.362 0.744 0.953 1.060 1.105 1.142 1.240 1.487
You can also build more detailed models, for example a
model that includes an interaction term: width ~ length * sex.
Using this model fitted to the kids’ feet data, how much will a
typical girl’s foot width change if the foot length is increased
from 22 to 27 cm?
0.187 0.362 0.744 0.953 1.060 1.105 1.142 1.240 1.487
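A sketch for the first model; the others follow the same
pattern:
> kids = fetchData("kidsfeet.csv")
> mod = lm( width ~ length, data=kids )
> 5 * coef(mod)["length"]   # width change when length goes from 22 to 27 cm
For the models that include sex, remember that the relevant
slope for a girl combines the length coefficient with any
length:sex interaction coefficient.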
Prob 10.02 In each of the following, a situation is described
and a question is asked that is to be answered by modeling.
Several variables are listed. Imagine an appropriate model
and identify each variable as either the response variable, an
explanatory variable, a covariate, or a variable to be ignored.
EXAMPLE: Some people have claimed that police
foot patrols are more effective at reducing the
crime rate than patrols done in automobiles. Data
from several different cities is available; each city
has its own fraction of patrols done by foot, its
own crime rate, etc. The mayor of your town
has asked for your advice on whether it would be
worthwhile to shift to more foot patrols in order
to reduce crime. She asks, “Is there evidence that
a larger fraction of foot patrols reduces the crime
rate?”
Variables:
1. Crime rate (e.g., robberies per 100,000 population)
2. Fraction of foot patrols
3. Number of policemen per 1000 population
4. Demographics (e.g., poverty rate)
The question focuses on how the
fraction of foot patrols might influence crime rate,
so crime rate is the response variable and
fraction of foot patrols is an explanatory variable.
But, the crime rate might also depend on
the overall level of policing (as indicated by
the number of policemen), or on the social conditions
that are associated with crime (e.g.,
demographics). Since the mayor has no power to
change the demographics of your town, and probably
little power to change the overall number
of policemen, in modeling the data from the
different cities, you would want to hold constant
the number of policemen and the demographics. You
can do this by treating number of policemen and
demographics as covariates and including them in
your model.
Alcohol and Road Safety
Fifteen years ago, your state legislature raised the legal
drinking age from 18 to 21 years. An important motivation
was to reduce the number of car accident deaths due to drunk
or impaired drivers. Now, some people are arguing that the
21-year age limit encourages binge drinking among 18 to 20
year olds and that such binge drinking actually increases car
accident deaths. But the evidence is that the number of car
accident deaths has gone down since the 21-year age restriction
was introduced. You are asked to examine the issue:
Does the reduction in the number of car-accident deaths per
year point to the effectiveness of the 21-year drinking age?
Variables:
1. Drinking age limit. Levels: 18 or 21.
response explanatory covariate ignore
2. Number of car-accident deaths per year.
response explanatory covariate ignore
3. Prevalence of seat-belt use.
response explanatory covariate ignore
4. Fraction of cars with air bags.
response explanatory covariate ignore
5. Number of car accidents (with or without death).
response explanatory covariate ignore
Rating Surgeons
Your state government wants to guide citizens in choosing
physicians. As part of this effort, they are going to rank all
the surgeons in your state. You have been asked to build the
rating system and you have a set of variables available for
your use. These variables have been measured for each of the
342,861 people who underwent surgery in your state last year:
one person being treated by one doctor. How should you construct
a rating system that will help citizens to choose the
most effective surgeon for their own treatment?
Variables:
• Outcome score. A high score means that the operation
did what it was supposed to. A low score reflects failure,
e.g., death. (Death is a very bad outcome, post-operative
infection a somewhat bad outcome.)
response explanatory covariate ignore
• Surgeon. One level for each of the operating surgeons.
response explanatory covariate ignore
• Experience of the surgeon.
response explanatory covariate ignore
• Difficulty of the case.
response explanatory covariate ignore
School testing
Last year, your school district hired a new superintendent
to “shake things up.” He did so, introducing several controversial
new policies. At the end of the year, test scores were
higher than last year. A representative of the teachers’ union
has asked you to examine the score data and answer this question:
Is there reason to think that the higher scores were the
result of the superintendent’s new policies?
Variables:
1. Superintendent (levels: New or Former superintendent)
response explanatory covariate ignore
2. Exam difficulty
response explanatory covariate ignore
3. Test scores
response explanatory covariate ignore
Gravity
In a bizarre twist of time, you find yourself as Galileo’s
research assistant in Pisa in 1605. Galileo is studying gravity:
Does gravity accelerate all materials in the same way, whether
they be made of metal, wood, stone, etc.? Galileo hired you
as his assistant because you have brought with you, from the
21st century, a stop-watch with which to measure time intervals,
a computer, and your skill in statistical modeling. All of
these seem miraculous to him.
He drops objects off the top of the Leaning Tower of Pisa
and you measure the following:
Variables:
1. The size of the object (measured by its diameter).
response explanatory covariate ignore
2. Time of fall of the object.
response explanatory covariate ignore
3. The material from which the object is made (brass, lead,
wood, stone).
response explanatory covariate ignore
[Thanks to James Heyman.]
Prob 10.04 Economists measure the inflation rate as a percent
change in price per year. Unemployment is measured as
the fraction (percentage) of those who want to work who are
seeking jobs.
According to economists, in the short run — say, from
one year to another — there is a relationship between in-
flation and unemployment: all other things being equal, as
unemployment goes up, inflation should go down. (The relationship
is called the “Phillips curve,” but you don’t need to
know that or anything technical about economics to do this
question.)
If this is true, in the model Inflation ~ Unemployment,
what should be the sign of the coefficient on Unemployment?
positive zero negative
But despite the short-term relationship, economists claim
that in the long run — over decades — unemployment and
inflation should be unrelated.
If this is true, in the model Inflation ~ Unemployment,
what should be the sign of the coefficient on Unemployment?
positive zero negative
The point of this exercise is to figure out how to arrange
a model so that you can study the short-term behavior of the
relationship, or so that you can study the long-term relationship.
For your reference, here is a graph showing a scatter plot of
inflation and unemployment rates over about 30 years in the
US. Each point shows the inflation and unemployment rates
during one quarter of a year. The plotting symbol indicates
which of three decade-long periods the point falls into.
[Scatter plot: Inflation (%) versus Unemployment (%), with
points marked by decade: Decade A, Decade B, Decade C.]
The relationship between inflation and unemployment
seems to be different from one decade to another — that’s
the short term.
Which decade seems to violate the economists’ Phillips
Curve short-term relationship?
A B C none all
Using the modeling language, express these different
possible relationships between the variables Inflation,
Unemployment, and Decade, where the variable Decade is a
categorical variable with the three different levels shown in
the legend for the graph.
1. Inflation depends on Unemployment in a way that
doesn’t change over time.
A Inflation ~ Decade
B Inflation ~ Unemployment
C Inflation ~ Unemployment+Decade
D Inflation ~ Unemployment*Decade
2. Inflation changes with the decade, but doesn’t depend
on Unemployment.
A Inflation ~ Decade
B Inflation ~ Unemployment
C Inflation ~ Unemployment+Decade
D Inflation ~ Unemployment*Decade
3. Inflation depends on Unemployment in the same way
every decade, but each decade introduces a new background
inflation rate independent of Unemployment.
A Inflation ~ Decade
B Inflation ~ Unemployment
C Inflation ~ Unemployment+Decade
D Inflation ~ Unemployment*Decade
4. Inflation depends on Unemployment in a way that differs
from decade to decade.
A Inflation ~ Decade
B Inflation ~ Unemployment
C Inflation ~ Unemployment+Decade
D Inflation ~ Unemployment*Decade
Whether a model examines the short-term or the long-term
behavior is analogous to whether a partial change
or a total change is being considered.
Suppose you wanted to study the long-term relationship
between inflation and unemployment. Which of these is
appropriate?
A Hold Decade constant. (Partial change)
B Let Decade vary as it will. (Total change)
Now suppose you want to study the short-term relationship.
Which of these is appropriate?
A Hold Decade constant. (Partial change)
B Let Decade vary as it will. (Total change)
Prob 10.05 Consider two models that you are to fit to a
single data set involving three variables: A, B, and C.
Model 1 A ~ B
Model 2 A ~ B + C
(a) When should you say that Simpson’s Paradox is occurring?
A When Model 2 has a lower R² than Model 1.
B When Model 1 has a lower R² than Model 2.
C When the coef. on B in Model 2 has the
opposite sign to the coef. on B in Model 1.
D When the coef. on C in Model 2 has the
opposite sign to the coef. on B in Model 1.
(b) True or False: If B is uncorrelated with A, then the coef-
ficient on B in the model A ~ B must be zero.
True or False
(c) True or False: If B is uncorrelated with A, then the coef-
ficient on B in a model A ~ B+C must be zero.
True or False
(d) True or False: Simpson’s Paradox can occur if B is uncorrelated
with C.
True or False
Based on a suggestion by student Atang Gilika.
Prob 10.10 Standard & Poor’s is a RATING AGENCY
that provides information about various financial instruments
such as stocks and bonds. The S&P 500 Stock Index, for instance,
provides a summary of the value of stocks.
Bonds issued by governments, corporations, and other entities
are rated using letters. As described on the Standard &
Poor’s website, the ratings levels are AAA, AA+, AA, AA-,
A+, A, A-, BBB+, BBB, BBB-, BB+, BB, BB-, B+, B,
B-, CCC+, CCC, CCC-, CC, C, and D. The AAA rating is
the best. (“The obligor’s capacity to meet its financial commitment
on the obligation is extremely strong.”) D is the
worst. (“The ‘D’ rating category is used when payments on
an obligation are not made on the date due ....”)
• The bond ratings are a categorical variable. True or
False
• The bond ratings are an ordinal variable. True or
False
Bonds are a kind of debt; they pay interest and the principal
is paid back at the end of a maturity period. The people
and institutions who invest in bonds are willing to accept
somewhat lower interest payments in exchange for greater security.
Thus, AAA-rated bonds tend to pay the lowest interest
rates and worse-rated bonds pay more. A report on
interest rates on bonds (www.fmsbonds.com, for 8/21/2008)
listed interest rates on municipal bonds:
Issue Maturity Rate
AAA Rated
National 10 Year 3.75
National 20 Year 4.60
National 30 Year 4.75
Florida 30 Year 4.70
AA Rated
National 10 Year 3.90
National 20 Year 4.70
National 30 Year 4.85
Florida 30 Year 4.80
A Rated
National 10 Year 4.20
National 20 Year 5.05
National 30 Year 5.20
Florida 30 Year 5.15
How many explanatory variables are given in this table to
account for the interest rate?
A Two: Issue and Maturity
B Three: Issue, Maturity, and S & P Rating
C Four: Issue, Maturity, S & P Rating, and
Interest Rate
Judging from the table, and holding all other explanatory
variables constant, what is the change in interest rate associated
with a change from AAA to AA rating?
0.05 0.15 0.25 0.30 0.40
Again, holding all other explanatory variables constant,
what is the change in interest rate for a 10-year compared to
a 20-year maturity bond? (Pick the closest answer.)
0.15 0.50 0.85 1.20 1.45
Sometimes it is unclear when a variable should be considered
quantitative and when it should be taken as categorical.
For example, the maturity variable looks on the surface to be
quantitative (10-year, 20-year, 30-year, etc.). What is it about
these data that suggests that it would be unrealistic to treat
maturity as a quantitative variable in a model of interest rate?
Prob 10.11 A study on drug D indicates that patients who
were given the drug were less likely to recover from their condition
C. Here is a table showing the overall results:
Drug # recovered # died Recovery Rate
Given 1600 2400 40%
Not given 2000 2000 50%
Strangely, when investigators looked at the situation separately
for males and females, they found that the drug improves
recovery for each group:
Females
Drug num recovered # died Recovery Rate
Given 900 2100 30%
Not given 200 800 20%
Males
Drug # recovered # died Recovery Rate
Given 700 300 70%
Not given 1800 1200 60%
Which is right? Does the drug improve recovery or hinder
recovery? What advice would you give to a physician
about whether or not to prescribe the drug to her patients?
Give enough of an explanation that the physician can judge
whether your advice is reasonable.
Based on an example from Judea Pearl (2000) Causality: Models,
Reasoning, and Inference, Cambridge Univ. Press, p. 175
Prob 10.12 Time Magazine reported the results of a poll of
people’s opinions about the U.S. economy in July 2008. The
results are summarized in the graph.
[Source: Time, July 28, 2008, p. 41]
In a typical news media report of a poll, the results are
summarized using one explanatory variable at a time. The
point of this exercise is to show that such univariate explanations
can be misleading.
The poll involves three explanatory variables: ethnicity,
income, and age. Regrettably, the reported results treat each
of these explanatory variables separately, even though there
are likely to be correlations among them. For instance, relatively
few people in the 18 to 29 age group have high incomes.
The original data set from which the Time graphic was
made contains the information needed to study the multiple
explanatory variables simultaneously, for example looking at
the connection between pessimism and age while adjusting for
income. This data set is not available, so you will need to
resort to a simulation which attempts to mimic the poll results.
Of course, the simulation doesn’t necessarily describe
people’s attitudes directly, but it does let you see how the
conclusions drawn from the poll might have been different if
the results for each explanatory variable had been presented
in a way that adjusts for the other explanatory variables.
To install the simulation software, run this statement:
> fetchData("simulate.r")
Once that software has been installed, the following statement
will run a simulation of a poll in which 10,000 people
are asked to rate their level of pessimism (on a scale from 0
to 10) and to indicate their age group and income level:
> poll = run.sim(economic.outlook.poll, 10000)
The output of the simulation will be a data frame that looks
something like this:
> head(poll)
age income pessimism
1 [18 to 29] [less than $20000] 10
2 [40 to 64] [$50,000 to $99,999] 5
3 [40 to 64] [less than $20000] 9
4 [40 to 64] [$50,000 to $99,999] 7
5 [65 and older] [$50,000 to $99,999] 7
6 [18 to 29] [less than $20000] 10
Your output will differ because the simulation reflects random
sampling.
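The models are then fit to the simulated data in the usual
way, for example (a sketch):
> mod = lm( pessimism ~ age - 1, data=poll )
> coef(mod)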
• Construct the model pessimism ~ age-1. Look at the
coefficients and choose the statement that best reflects
the results:
A Middle aged people have lower pessimism
than young or old people.
B Young people have the least pessimism.
C There is no relationship between age and
pessimism.
• Now construct the model pessimism ~ income-1. Look
at the coefficients and choose the statement that best
reflects the results:
A Higher income people are more pessimistic
than low-income people.
B Higher income people are less pessimistic
than low-income people.
C There is no relationship between income
and pessimism.
• Construct a model in which you can look at the relationship
between pessimism and age while adjusting for
income. That is, include income as a covariate in your
model. Enter your model formula here: .
Look at the coefficients from your model and choose the
statement that best reflects the results:
A Holding income constant, older people tend
to have higher levels of pessimism than
young people.
B Holding income constant, young people
tend to have higher levels of pessimism
than old people.
C Holding income constant, there is no relationship
between age and pessimism.
• You can also interpret that same model to see the relationship
between pessimism and income while adjusting
for age. Which of the following statements best reflects
the results? (Hint: make sure to pay attention to the
sign of the coefficients.)
A Holding age constant, higher income people
are more pessimistic than low-income
people.
B Holding age constant, higher income people
are less pessimistic than low-income people.
C Holding age constant, there is no relationship
between income and pessimism.
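A minimal sketch of the R commands involved, assuming the poll data frame produced by run.sim() above:
> mod1 = lm(pessimism ~ age - 1, data=poll)
> coef(mod1)   # one mean-pessimism coefficient per age group
> mod2 = lm(pessimism ~ income - 1, data=poll)
> coef(mod2)   # one coefficient per income level
For the adjusted analyses, include both explanatory variables in one model formula and fit it the same way.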
Prob 10.20 Whenever you seek to study a partial relationship,
there must be at least three variables involved: a response
variable, an explanatory variable that is of direct interest,
and one or more other explanatory variables that will
be held constant: the co-variates. Unfortunately, it’s hard
to graph out models involving three variables on paper: the
usual graph of a model just shows one variable as a function
of a second.
One way to display the relationship between a response
variable and two quantitative explanatory variables is to use
a contour plot. The two explanatory variables are plotted on
the axes and the fitted model values are shown by the contours.
The figure shows such a display of the fitted model of
used car prices as a function of mileage and age.
The dots are the mileage and age of the individual cars —
the model Price is indicated by the contours.
The total relationship between Price and mileage involves
how the price changes for typical cars of different mileage.
Pick a dot that is a typical car with about 10,000 miles. Using
the contours, find the model price of this car.
Which of the following is closest to the model price (in
dollars)?
18000 21000 25000 30000
Now pick another dot that is a typical car with about
70,000 miles. Using the contours, find the model price of this
car.
18000 21000 25000 30000
The total relationship between Price and mileage is reflected
by this ratio: change in model price divided by change
in mileage. What is that ratio (roughly)?
A (30000 - 21000)/(70000 - 10000) = 0.15 dollars/mile
B (70000 - 10000)/(25000 - 21000) = 15.0 dollars/mile
C (25000 - 18000)/(70000 - 10000) = 0.12 dollars/mile
In contrast, the partial relationship between Price and
mileage holding age constant is found in a different way, by
comparing two points with different mileage but exactly the
same age.
Mark a point on the graph where age is 3 years and mileage
is 10000. Keep in mind that this point doesn’t need to be an
actual car, that is, a data point in the graph. There
might be no actual car with an age of 3 years and mileage
10000. But using the contour model, find the model price at
this point:
22000 24000 26000 28000 30000
Now find another point, one where the age is exactly the
same (3 years) but the mileage is different. Again there might
not be an actual car there. Let’s pick mileage as 80000. Using
the contours, find the model price at this point:
22000 24000 26000 28000 30000
The partial relationship between price and mileage (holding
age constant) is again reflected by the ratio of
the change in model price divided by the change in mileage.
What is that ratio (roughly)?
A (80000 - 10000)/(25000 - 21000) = 17.50 dollars/mile
B (28000 - 22000)/(80000 - 10000) = 0.09 dollars/mile
C (26000 - 24000)/(80000 - 10000) = 0.03 dollars/mile
Both the total relationship and the partial relationship are
indicated by the slope of the model price function given by
the contours. The total relationship involves the slope between
two points that are typical cars, as indicated by the
dots. The partial relationship involves a slope along a different
direction. When holding age constant, that direction is
the one where mileage changes but age does not (vertical in
the graph).
There’s also a partial relationship between price and age
holding mileage constant. That partial relationship involves
the slope along the direction where age changes but mileage
is held constant. Estimate that slope by finding the model
price at a point where age is 2 years and another point where
age is 5 years. You can pick whatever mileage you like, but
it’s key that your two points be at exactly the same mileage.
Estimate the slope of the price function along a direction
where age changes but mileage is held constant (horizontally
on the graph).
A 100 dollars per year
B 500 dollars per year
C 1000 dollars per year
D 2000 dollars per year
The contour plot above shows a model in which both
mileage and age are explanatory variables. By choosing
the direction in which to measure the slope, one determines
whether the slope reflects a total relationship (a direction between
typical cars), or a partial relationship holding age constant
(a direction where age does not change, which might not
be typical for cars), or a partial relationship holding mileage
constant (a direction where mileage does not change, which
also might not be typical for cars).
In calculus, the partial derivative of price with respect to
mileage refers to an infinitesimal change in a direction where
age is held constant. Similarly, the partial derivative of price
with respect to age refers to an infinitesimal change in a direction
where mileage is held constant.
Of course, in order for the directional derivatives to make
sense, the price function needs to have both age and mileage
as explanatory variables. The following contour plot shows
a model in which only age has been used as an explanatory
variable: there is no dependence of the function on mileage.
Such a model is incapable of distinguishing between a partial
relationship and a total relationship. Both the partial and
the total relationship involve a ratio of the change in price and
change in age between two points. For the total relationship,
those two points would be typical cars of different ages. For
the partial relationship, those two points would be different
ages at exactly the same mileage. But, because the model
does not depend on mileage, the two ratios will be exactly the same.
Prob 10.21 Sometimes people use data in aggregated form
to draw conclusions about individuals. For example, in 1950
W.S. Robinson described the correlation between immigration
and illiteracy done in two different ways. In the first, the
unit of analysis is individual US states as shown in the figure
— the plot shows the fraction of people in each state who are
illiterate versus the fraction of people who are foreign born.
The correlation is negative, meaning that states with higher
foreign-born populations have less illiteracy.
Robinson’s second analysis involves the same data, but
takes the unit of analysis as an individual person. The table
gives the number of people who are illiterate and who are
foreign born in the states included in the scatter plot.
The data in the table leads to a different conclusion than
the analysis of states: the foreign born people are more likely
to be illiterate.
This conflict between the results of the analyses, analogous
to Simpson’s paradox, is called the ecological fallacy.
(The word “ecological” is rooted in the Greek word oikos for
house — think of the choice between studying individuals or
the groups of individuals in their houses.)
The ecological fallacy is not a paradox; it isn’t a question
of what is the correct unit of analysis. If you want to study the
characteristic of individuals, your unit of analysis should be
individuals. If you want to study groups, your unit of analysis
should be those groups. It’s a fallacy to study groups when
your interest is in individuals.
One way to think about the difference between Robinson’s
conclusions with groups (the states) and the very different
conclusions with individuals is to consider the factors that create the
groups. Give an explanation, in everyday terms, of why the immigrants
that Robinson studied might tend to be clustered in
states with low illiteracy rates, even if the immigrants themselves
had high rates of illiteracy.
Chapter 11
Modeling Randomness
Reading Questions
• What is a probability model?
• What are some of the different probability models and
what different situations do they describe?
• What is a “parameter” in a probability model? Give
some examples.
Prob 11.01 Two basic operations that you need to perform
on probability models are these:
percentile Given a value, what is the probability (according
to the model) of the outcome from one trial being that
value or less?
quantile Given a probability, what is the value whose percentile
is that probability?
To illustrate by example, suppose that you are dealing
with a probability model for an IQ test score that is a normal
distribution with these parameters: mean = 100 and standard
deviation = 15.
Percentile question: What is the percentile that corresponds
to a test score of 120? Answer: 0.91 or, in other
words, the 91st percentile.
> pnorm(120, mean=100,sd=15)
[1] 0.9087888
Quantile question: What score will 95% of scores be less
than or equal to? Answer: a score of 125.
> qnorm(0.95, mean=100, sd=15)
[1] 124.6728
Here are two very basic questions about percentile and
quantile calculations:
(a) True or False The output of a percentile question will
always be a probability, that is, a number between 0 and
1.
(b) True or False The output of a quantile question will
always be a value, that is, something in the same units as
the random variable.
Sometimes to answer more complicated questions, you need
first to answer one or more percentile or quantile questions.
Answer the following questions, using the normal probability
model with the parameters given above:
(a) What’s the test score that almost everybody, say, 99% of
people, will do better than?
Which kind of calculation is this? percentile quantile
What is the answer? 55 65 75 95 115 125 135
(b) To calculate a coverage interval on a probability model,
you need to calculate two quantities: one for the left end
of the interval and one for the right. Which type of calculation
are these probabilities from: percentile quantile
(c) Calculate a 50% coverage interval on the test scores, that
is the range from the 0.25 quantile to the 0.75 quantile:
Left end of interval: 80 85 90 95 100 105 110 115 120
Right end of interval: 80 85 90 95 100 105 110 115 120
(d) Calculate an 80% coverage interval, that is the range from
the 0.10 to the 0.90 quantile:
Left end of interval: 52 69 73 81 85 89
Right end of interval: 117 119 123 128 136
(e) To calculate the probability of an outcome falling in
a given range, you need to do two percentile calculations,
one for each end of the range. Then you subtract
the two different probabilities. What is the probability
of a test score falling between 100 and 120?
0.25 0.37 0.41 0.48 0.52 0.61 0.73
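For instance, part (e) can be set up as two percentile calculations followed by a subtraction; a sketch, using the IQ parameters above:
> pnorm(120, mean=100, sd=15) - pnorm(100, mean=100, sd=15)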
Prob 11.02 A coverage interval gives a range of values. The
“level” of the interval is the probability that a random trial
will fall inside that range. For example, in a 95% coverage
interval, 95% of the trials will fall within the range.
To construct a coverage interval, you need to translate the
level into two quantiles, one for the left side of the range and
one for the right side. For example, a 50% coverage interval
runs from the 0.25 quantile on the left to the 0.75 quantile on
the right; a 60% coverage interval runs from 0.20 on the left
to 0.80 on the right. The probabilities used in calculating the
quantiles are set so that
• the difference between them is the level of the interval.
For instance, 0.75 and 0.25 give a 50% interval.
• they are symmetric. That is, the left probability should
be exactly as far from 0 as the right probability is from
1.
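As a sketch of that arithmetic, the two probabilities can be computed directly from the level and then handed to a quantile operator; the IQ model from the previous problem is used here just for illustration:
> level = 0.50
> left = (1 - level)/2    # 0.25 for a 50% interval
> right = 1 - left        # 0.75, symmetric with the left
> qnorm(c(left, right), mean=100, sd=15)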
A classroom of students was asked to calculate the left
and right probabilities for various coverage intervals. Some of
their answers were wrong. Explain what is wrong, if anything,
for each of these answers.
(a) For a 70% interval, the 0.20 and 0.90 quantiles
A The difference between them isn’t 0.70
B They are not symmetrical.
C Nothing is wrong.
(b) For a 95% interval, the 0.05 and 0.95 quantiles.
A The difference between them isn’t 0.95
B They are not symmetrical.
C Nothing is wrong.
(c) For a 95% interval, the 0.025 and 0.975 quantiles.
A The difference between them isn’t 0.95
B They are not symmetrical.
C Nothing is wrong.
Prob 11.04 For each of the following probability models,
calculate a 95% coverage interval. This means that you should
specify a left value and a right value. The left value corresponds
to a probability of 0.025 and the right value to a
probability of 0.975.
(a) The number of cars driving along a highway in one hour,
when the mean number of cars is 2000 per hour. Hint:
Poisson model
Left side of interval: 1812 1904 1913 1928 1935
Right side of interval: 2064 2072 2077 2088 2151
(b) The number of heads out of 100 flips of a fair coin. Hint:
Binomial model.
Left side of interval: 36 38 40 42 44 46
Right side of interval: 54 56 58 60 62 64
(c) The angle of a random spinner, ranging from 0 to 360
degrees. Hint: Uniform model.
Left side of interval: 9 15 25 36 42 60
Right side of interval: 300 318 324 335 345 351
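As a check on your work, each hinted model has a quantile operator in R that accepts both probabilities at once; a sketch:
> qpois(c(0.025, 0.975), lambda=2000)         # (a) Poisson
> qbinom(c(0.025, 0.975), size=100, prob=0.5) # (b) Binomial
> qunif(c(0.025, 0.975), min=0, max=360)      # (c) Uniform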
Prob 11.05 For each of these families of probability distributions,
what are the parameters used to describe a specific
distribution?
(a) Uniform distribution
A Mean and Standard Deviation
B Max and Min
C Probability and Size
D Average Number per Interval
(b) Normal distribution
A Mean and Standard Deviation
B Max and Min
C Probability and Size
D Average Number per Interval
(c) Exponential distribution
A Mean and Standard Deviation
B Max and Min
C Probability and Size
D Average Number per Interval
(d) Poisson distribution
A Mean and Standard Deviation
B Max and Min
C Probability and Size
D Average Number per Interval
(e) Binomial distribution
A Mean and Standard Deviation
B Max and Min
C Probability and Size
D Average Number per Interval
Prob 11.10 College admissions offices collect information
about each year’s applicants, admitted students, and matriculated
students. At one college, the admissions office knows
from past years that 30% of admitted students will matriculate.
The admissions office explains to the administration each
year that the results of the admissions process vary from year
to year due to random sampling fluctuations. Each year’s
results can be interpreted as a draw from a random process
with a particular distribution.
Which family of probability distribution can best be used
to model each of the following situations?
(a) 1500 students are offered admission. The number of students
who will actually matriculate is:
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
F Lognormal
(b) The average SAT score of the admitted applicants:
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
F Lognormal
(c) The number of women in the matriculated class:
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
F Lognormal
Prob 11.11 In 1898, Ladislaus von Bortkiewicz published
The Law of Small Numbers, in which he demonstrated the
applicability of the Poisson probability law. One example
dealt with the number of Prussian cavalry soldiers who were
kicked to death by their horses. The Prussian army monitored
10 cavalry corps for 20 years and recorded the number
X of fatalities each year in each corps. There were a total of
10 × 20 = 200 one-year observations, as shown in the table:
Number of Deaths X    Number of Times X Deaths Were Observed
0                     109
1                     65
2                     22
3                     3
4                     1
(a) From the data in the table, what was the mean number
of deaths per year per cavalry corps?
A (109+65+22+3+1)/5
B (109+65+22+3+1)/200
C (0*109 + 1*65 + 2*22 + 3*3 + 4*1)/5
D (0*109 + 1*65 + 2*22 + 3*3 + 4*1)/200
E Can’t tell from the information given.
(b) Use this mean number of deaths per year and the Poisson
probability law to determine the theoretical proportion
of years that 0 deaths should occur in any given cavalry
corps.
0.3128 0.4286 0.4662 0.5210 0.5434
(c) Repeat the probability calculation for 1, 2, 3, and 4 deaths
per year per cavalry corps. Multiply the probabilities by
200 to find the expected number of cavalry corps with
each number of deaths. Which of these tables is closest
to the theoretical values:
A 112.67 64.29 16.22 5.11 1.63
B 108.67 66.29 20.22 4.11 0.63
C 102.67 70.29 22.22 6.11 0.63
D 106.67 68.29 17.22 6.11 1.63
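A sketch of the calculation described in parts (a)–(c):
> deaths = 0:4
> counts = c(109, 65, 22, 3, 1)
> lam = sum(deaths*counts)/sum(counts)  # mean deaths per corps-year
> 200 * dpois(deaths, lam)              # expected counts under the Poisson model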
Prob 11.12 Experience shows that the number of cars entering
a park on a summer’s day is approximately normally
distributed with mean 200 and variance 900. Find the probability
that the number of cars entering the park is less than
195.
(a) Which type of calculation is this?
percentile quantile
(b) Do the calculation with the given parameters. (Watch
out! Look carefully at the parameters and make sure they
are in a standard form.)
0.3125 0.4338 0.4885 0.5237 0.6814 0.7163
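A sketch of the trap in part (b): the problem states a variance, while pnorm() expects a standard deviation, so a square root is needed:
> pnorm(195, mean=200, sd=sqrt(900))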
Prob 11.20 The graph shows a cumulative probability.
(a) Use the graph to estimate by eye the 20th percentile of
the probability distribution. (Select the closest answer.)
0 2 4 6 8
(b) Using the graph, estimate by eye the probability of a randomly
selected x falling between 5 and 8. (Select the
closest answer.)
0.05 0.25 0.50 0.75 0.95
Prob 11.21 Ralph’s bowling scores in a single game are
normally distributed with a mean of 120 and a standard deviation
of 10.
(1) He plays 5 games. What is the mean and standard deviation
of his total score?
Mean: 10 120 240 360 600 710
Standard deviation: 20.14 21.69 22.36 24.31 24.71
(2) What is the mean and standard deviation of his average
score for the 5 games?
Mean: 10 60 90 120 150 180
Standard deviation: 4.47 4.93 5.10 5.62 6.18
Lucky Lolly bowls games with scores randomly distributed
with a mean of 100 and a standard deviation of 15.
(3) What is the z-score of 150 for Lolly?
1.00 2.00 2.33 3.00 3.33 7.66 120 150
(4) What is the z-score of 150 for Ralph?
1.00 2.00 2.33 3.00 3.33 7.66 120 150
(5) Is Lolly or Ralph more likely to score over 150?
A Lolly
B Ralph
C Equally likely
D Can’t tell from the information given.
(6) What is the z-score of 130 for Lolly?
1.00 2.00 2.33 3.00 3.33 7.66 120 150
(7) What is the z-score of 130 for Ralph?
1.00 2.00 2.33 3.00 3.33 7.66 120 150
(8) Is Lolly or Ralph more likely to score over 130?
A Lolly
B Ralph
C Equally likely
D Can’t tell from the information given.
Prob 11.22 Jim scores 700 on the mathematics part of the
SAT. Scores on the SAT follow the normal distribution with
mean 500 and standard deviation 100. Julie takes the ACT
test of mathematical ability, which has mean 18 and standard
deviation 6. She scores 24. If both tests measure the same
kind of ability, who has the higher score?
A Jim
B Julie
C They are the same.
D No way to tell.
Prob 11.23 For each of the following, decide whether the
random variable is binomial or not. Then choose the best
answer from the set offered.
(a) Number of aces in a draw of 10 cards from a shuffled deck
with replacement.
A It is binomial.
B It’s not because the sample size is not fixed.
C It’s not because the probability is not fixed
for every individual component.
D It’s not for both of the above reasons.
(b) Number of aces in a draw of 10 cards from a shuffled deck
without replacement.
A It is binomial.
B It’s not because the sample size is not fixed.
C It’s not because the probability is not fixed
for every individual component.
D It’s not for both of the above reasons.
(c) A broken typing machine has a probability of 0.05 of making
a mistake on each character. The number of erroneous
characters in each sentence of a report.
A It is binomial.
B It’s not because the sample size is not fixed.
C It’s not because the probability is not fixed
for every individual component.
D It’s not for both of the above reasons.
(d) Suppose screws produced by a certain company will be defective
with probability 0.01 independent of each other.
The company sells the screws in a package of 10. The
number of defective screws in a randomly selected pack.
A It is binomial.
B It’s not because the sample size is not fixed.
C It’s not because the probability is not fixed
for every individual component.
D It’s not for both of the above reasons.
(e) Observe the sex of the next 50 children born at a local
hospital. Let x= # of girls among them.
A It is binomial.
B It’s not because the sample size is not fixed.
C It’s not because the probability is not fixed
for every individual component.
D It’s not for both of the above reasons.
(f) A couple decides to continue to have children until their
first daughter. Let x = # of children the couple has.
A It is binomial.
B It’s not because the sample size is not fixed.
C It’s not because the probability is not fixed
for every individual component.
D It’s not for both of the above reasons.
(g) Jason buys the state lottery ticket every month using his
favorite combination based on his birthday and his wife’s.
x= # of times he wins a prize in one year.
A It is binomial.
B It’s not because the sample size is not fixed.
C It’s not because the probability is not fixed
for every individual component.
D It’s not for both of the above reasons.
Prob 11.30 Just before a referendum on a school budget, a
local newspaper plans to poll 400 random voters out of 50,000
registered voters in an attempt to predict whether the budget
will pass. Suppose that the budget actually has the support
of 52% of voters.
(a) What is the probability that the newspaper’s sample will
wrongly lead them to predict defeat, that is, less than
50% of the poll respondents will indicate support?
A qbinom(.5,size=400,prob=.52)
B pbinom(.52,size=400,prob=.50)
C rnorm(400,mean=0.52,sd=.50)
D pbinom(199,size=400,prob=0.52)
E qnorm(.52,mean=400,sd=.50)
(b) What is the probability that more than 250 of those 400
voters will support the budget?
A pbinom(250,size=400,prob=.50)
B pbinom(249,size=400,prob=0.52)
C 1-pbinom(250,size=400,prob=0.52)
D qbinom(.5,size=400,prob=.52)
E 1-qnorm(.52,mean=400,sd=.50)
Prob 11.31 Here is a graph of a probability density.
(a) Using the graph, estimate by eye the probability of a randomly
selected x falling between 2 and 4. (Give the closest
answer.)
0.05 0.25 0.50 0.75 0.95
(b) Using the graph of probability density above, estimate by
eye the probability of a randomly selected x being less
than 2. (Give the closest answer.)
0.05 0.25 0.50 0.75 0.95
Prob 11.32 A student is asked to calculate the probability
that x = 3.5 when x is chosen from a normal distribution
with the following parameters: mean=3, sd=5. To calculate
the answer, he uses this command:
> dnorm(3.5, mean=3, sd=5)
[1] 0.0794
This is not right. Why not?
A He should have used pnorm.
B The parameters are wrong.
C The answer is zero since the variable x is
continuous.
D He should have used qnorm.
Prob 11.33 A paint manufacturer has a daily production, x,
that is normally distributed with a mean of 100,000 gallons and
a standard deviation of 10,000 gallons. Management wants
to create an incentive for the production crew when the daily
production exceeds the 90th percentile of the distribution.
You are asked to calculate the level of production above which
management should pay the incentive bonus.
A qnorm(0.90,mean=100000,sd=10000)
B pnorm(0.90,mean=100000,sd=10000)
C qbinom(10000,size=100000,prob=0.9)
D dnorm(0.90,mean=100000,sd=10000)
Prob 11.34 Suppose that the height, in inches, of a randomly
selected 25-year-old man is a normal random variable
with standard deviation 2.5 inches. In the strange universe
in which statistics problems are written, we don’t know the
mean of this distribution but we do know that approximately
12.5% of 25-year-old men are taller than 6 feet 2 inches. Using
this information, calculate the following.
(a) What’s the average height of 25-year-old men? That is,
find the mean of a normal distribution with standard deviation
of 2.5 inches such that 12.5% of the distribution
lies at or above 74 inches.
68.54 70.13 71.12 73.82 74.11 75.23 75.88 76.14
(b) Using this distribution, how tall should a man be in order
to be in the tallest 5% of 25-year-old men?
68.54 70.13 71.12 73.82 74.11 75.23 75.88 76.14
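One way to set up part (a): since 12.5% lie above 74 inches, 74 inches is the 0.875 quantile, which pins down the mean. A sketch:
> 74 - qnorm(0.875) * 2.5   # mean such that 74 is the 0.875 quantile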
Prob 11.35 In commenting on the “achievement gap” between
different groups in the public school, the Saint Paul
Public School Board released the following information:
Saint Paul Public Schools (SPPS) serve more
than 42,000 students. Thirty percent are African
American, 30% Asian, and 13% Hispanic. The
stark reality is that reading scores for two-thirds
of our district’s African American students fall below
the national average, while reading scores for
90% of their white counterparts surpass it.
The point of this exercise is to translate this information into
the point-score increase needed to bring African American
students’ scores into alignment with those of the white students.
Imagine that the test scores for white students form a normal
distribution with a mean of 100 and a standard deviation
of 25. Suppose also that African American students have test
scores that form a normal distribution with a standard deviation
of 25. What would have to be the mean of the African
American students’ test scores in order to match the information
given by the School Board?
(a) What is the score threshold for passing the test if 90% of
white students pass? One of the following R commands
will calculate it. Which one?
A pnorm(0.1,mean=100,sd=25)
B pnorm(0.9,mean=100,sd=25)
C qnorm(25,mean=100,sd=0.9)
D qnorm(0.1,mean=100,sd=25)
E qnorm(0.9,mean=100,sd=25)
F qnorm(25,mean=100,sd=0.1)
G rnorm(25,mean=100,sd=0.9)
(b) Using that threshold, what would be the mean score for
African Americans such that two-thirds (66.7%) are below
the threshold? Hint: If you knew the answer, then it
would produce this result.
> pnorm(67.96,mean=YourAnswer,sd=25)
[1] 0.667
Start by proposing an answer; a guess will do. Look at
the resulting response and use that to guide refining your
proposal until you hit the target response: 0.667. When
you are at the target, your proposal will be close to one
of these:
47 57 63 71 81
(c) Suppose scores for the African American students were
to increase by 15 points on average. What would be the
failure rate (in percent)?
21 35 44 53 67 74
(d) A common way to report the difference between two
groups is the number of standard deviations that separate
the means. How big is this for African American
students in the Saint Paul Public Schools compared to
whites (under the assumptions made for this problem)?
0.32 0.64 1.18 1.72 2.66 5.02
It would be more informative if school districts gave the
actual distribution of scores rather than the passing rate.
Prob 11.36 A manufacturer of electrical fiber optic cables
produces spools that are 50,000 feet long. In the production
process, flaws are introduced randomly. A study of the flaws
indicates that, on average, there is 1 flaw per 10,000 feet.
(a) Which probability distribution describes the situation of
how many flaws there will be in a spool of cable.
A normal
B uniform
C binomial
D exponential
E poisson
(b) What’s the probability that a 50,000 foot-long cable has
3 or fewer flaws? (Enter your answer to 3 decimal places,
e.g., 0.456.)
# to within ±0.001
Prob 11.37 To help reduce speeding, local governments
sometimes put up speed signs at locations where speeding is
a problem. These signs measure the speed of each passing
car and display that speed to the driver. In some countries,
such as the UK, the devices are equipped with a camera which
records an image of each speeding car and a speeding ticket
is sent to the registered owner.
At one location, the data recorded from such a device indicates
that between 7 and 10 PM, 32% of cars are speeding
and that 4.3 cars per minute pass the intersection, on average.
Which probability distributions can best be used to model
each of the following situations for 7 to 10 PM?
(a) The number of speeding cars in any 1 hour period.
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
F Lognormal
(b) The time that elapses between cars:
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
F Lognormal
(c) Out of 100 successive cars passing the device, the number
that are speeding
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
F Lognormal
(d) The mean speed of 100 successive cars passing the device.
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
F Lognormal
Prob 11.40 As part of a test of the security inspection system
in an airport, a government supervisor adds 5 suitcases
with illegal materials to an otherwise ordinary shipping load, bringing
the total to 150 suitcases.
In order to determine whether the shipment should be accepted,
security officers randomly select 15 of the suitcases
and X-ray them. If one or more of the suitcases is found to
contain the materials, the entire shipment will be searched.
1. What probability model best applies here in describing
the probability that one or more of the five added
suitcases will be X-rayed?
A Normal
B Uniform
C Binomial
D Poisson
E Exponential
2. What is the probability that one or more of the five
added suitcases will be X-rayed?
A 1-pnorm(5,mean=150,sd=15)
B 1-pnorm(15,mean=150,sd=5)
C 1-punif(5,min=0,max=150)
D 1-punif(15,min=0,max=150)
E 1-pbinom(0,size=15,prob=5/150)
F 1-ppois(0,45/150)
G 1-ppois(0,5/150)
H Not enough information to tell.
Prob 11.41 Do the best job you can answering this question.
The information provided is not complete, but that’s
the way things often are.
Linda is 31 years old, single, outspoken, and very bright.
She majored in philosophy. As a student, she was deeply concerned
with issues of discrimination and social justice, and
also participated in antinuclear rallies.
(a) Rank the following possibilities from most likely to least
likely, for instance “A B C D E”:
(a) Linda is a teacher.
(b) Linda works in a bookstore and takes yoga classes.
(c) Linda is a bank teller.
(d) Linda sells insurance.
(e) Linda is a bank teller and is active in the feminist
movement.
(b) Is there any relationship between the probabilities of the
above items that you can be absolutely sure is true?
From “Judgments of and by representativeness,” in Judgment
under Uncertainty: Heuristics and Biases, edited by Daniel
Kahneman, Paul Slovic, and Amos Tversky. Cambridge:
Cambridge University Press, 1982.
Prob 11.42 According to the website http://www.
wikihealth.com/Pregnancy, approximately 3.6% of pregnant
women give birth on the predicted date (using the
method that calculates gestational duration starting at the
time of the last menstrual period). Assume that the probability
of giving birth is a normal distribution whose mean is
at the predicted date. The standard deviation quantifies the
spread of gestational durations.
Using just the 3.6% “fact,” estimate the standard
deviation of gestational durations, assuming that they
are distributed as a normal distribution centered on the predicted
date. Hint: Think of the area under the distribution
over the range that covers one 24-hour period.
Your answer should be in days. Select the closest value:
8 9 10 11 12
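A sketch of the hint: the probability of a birth landing in the one-day window at the center of the distribution is approximately the peak density times one day, that is, dnorm(0, mean=0, sd=s) * 1 = 0.036. Solving for s:
> 1 / (0.036 * sqrt(2*pi))   # standard deviation, in days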
Prob 11.43 You have decided to become a shoe-maker.
Contrary to popular belief, this is a highly competitive field
and you had to take the Shoe-Maker Apprenticeship Trial
(SAT) as part of your apprenticeship application.
(a) The Shoe-maker’s union has told you that among all the
people taking the SAT, the mean score is 700 and the
standard deviation is 35.
According to this information, what is the percentile corresponding
to your SAT score of 750?
A 1-pnorm(750,mean=700,sd=35)
B pnorm(750,mean=700,sd=35)
C qnorm(750,mean=700,sd=35)
D 1-qnorm(750,mean=700,sd=35)
E Not enough information to answer.
(b) Your friend just told you that she scored at the 95th percentile,
but she can’t remember her numerical score. Using
the information from the Shoe-maker’s union, what
was her score?
A pnorm(.95,mean=700,sd=35)
B pnorm(.95,mean=750,sd=35)
C qnorm(.95,mean=700,sd=35)
D qnorm(.95,mean=750,sd=35)
E Not enough information to answer.
Prob 11.44 Both the Poisson and binomial probability distributions
describe a count of events.
The binomial distribution describes a series of identical
discrete events with two possible outcomes to each event: yes
or no, true or false, success or failure, and so on. The number
of “success” or “true” or “yes” events in the series is given
by the binomial distribution, so long as the individual events
are independent of one another. An example is the number
of heads that occur when flipping a coin ten times in a row.
Each flip has a heads or tails outcome. The individual flips
are independent.
There are two parameters to the binomial distribution:
the number of events (“size”) and the probability of the outcome
that will be counted as a success. For the distribution
to be binomial, the number of events must be fixed ahead of
time and the probability of success must be the same for each
event. The outcome whose probability is represented by the
binomial distribution is the number of successful events.
Example: You flip 10 fair coins and count the number of
heads. In this case the size is 10 and the probability of success
is 1/2.
Counter-example: You flip coins and count the number of
flips until the 10th head. This is not a binomial distribution
because the size is not fixed.
The Poisson probability model is different. It describes a
situation where the rate at which events happen is fixed but
there is no fixed number of events.
Example: Cars come down the street in a random way but
at an average rate of 3 per minute. The Poisson distribution
describes the probability of seeing any given number of cars
in one minute. Unlike the binomial distribution, there is no
fixed number of events; potentially 50 cars could pass by in
one minute (although this is very, very unlikely).
Both the Poisson and binomial distributions are discrete.
You can’t have 5.5 heads in 10 flips of a coin. You can’t have
4.2 cars pass by in one minute. Because of this, the basic way
to use the tabulated probabilities is as a probability assigned
to each possible outcome.
Here is the table of outcomes for the number of heads in
6 flips of a fair coin:
> dbinom(0:6, size=6, prob=.5)
[1] 0.016 0.094 0.234 0.313 0.234 0.094 0.016
So, the probability of exactly zero heads is 0.016, the prob.
of 1 head is 0.094, and so on.
You may also be interested in the cumulative probability:
> pbinom(0:6, size=6, prob=.5)
[1] 0.016 0.109 0.344 0.656 0.891 0.984 1.000
So, the probability of 1 head or fewer is 0.109, just the sum
of the probability of exactly 0 heads and exactly one head.
Note that in both cases, there was no point in asking for
the probability of more than 6 heads; six is the most that
could possibly happen. If you do ask, the answer will be
“zero”: it can’t happen.
> dbinom(7, size=6, prob=.5)
[1] 0
Similarly, there is no point in asking for the probability of 3.5
heads, that can’t happen either.
> dbinom(3.5, size=6, prob=.5)
[1] 0
Warning message:
non-integer x = 3.500000
The software sensibly returns a probability of zero, but warns
that you are asking something silly.
The Poisson distribution is similar, but different in important
ways. If cars pass by a point randomly at an average
rate of 3 per minute, here is the probability of seeing 0, 1, 2,
... cars in any randomly selected minute.
> dpois(0:6, 3)
[1] 0.05 0.15 0.22 0.22 0.17 0.10 0.05
So, there is a 5% chance of seeing no cars in one minute.
But unlike the binomial situation, where the maximum
number of successful outcomes is fixed by the number of
events, it’s possible for a very large number of cars to pass
by.
> dpois(0:9, 3)
[1] 0.0498 0.1494 0.2240 0.2240 0.1680
[6] 0.1008 0.0504 0.0216 0.0081 0.0027
For instance, there is about a 0.3% chance that 9 cars will pass
by in one minute. That’s small, but it’s definitely non-zero.
The Poisson model, like the binomial, describes a situation
where the outcome is a whole number of events. It makes little
sense to ask for a fractional outcome. The probability of
a fractional outcome is always zero.
> dpois(3.5, 3)
[1] 0
Warning message:
non-integer x = 3.500000
Often one wants to consider a poisson event over a longer
or shorter interval than the one implicit in the specified rate.
For example, when you say that the average rate of cars passing
a spot is 3 per minute, the interval of one minute is implicit.
Suppose, however, that you want to know the number
of cars that might pass by a spot in one hour. To calculate
this, you need to find the rate in terms of the new interval.
Since one hour is 60 minutes, the rate of 3 per minute is equivalent
to 180 per hour. You can use this to find the probability.
For example, the probability that 150 or fewer cars will
pass by in one hour (when the average rate is 3 per minute)
is given by a cumulative probability:
> ppois(150, 180)
[1] 0.0122061
It can be hard to remember whether the above means “150
or fewer” or “fewer than 150.” When in doubt, you can always
make the situation explicit by using a non-integer argument
> ppois(150.1, 180) # includes 150
[1] 0.0122061
> ppois(149.9, 180) # excludes 150
[1] 0.00991012
This works only when asking for cumulative probabilities,
since 150.1 or less includes the integers 150, 149, and so on.
Were you to ask for the probability of getting exactly 150.1
cars in one hour, using the dpois operator, the answer would
be zero:
> dpois(150.1, 180)
[1] 0
Warning message:
non-integer x = 150.100000
For each of the following, figure out the computer statement
with which you can compute the probability.
1. If cars pass a point randomly at an average rate of 10
per minute, what is the probability of exactly 15 cars
passing in one minute?
0.000 0.0026 0.035 0.053 0.087 0.263 0.337 0.334 0.437 0.559 0.915 0.951
2. If cars pass a point randomly at an average rate of 10
per minute, what is the probability of 15 or fewer cars
passing in one minute?
0.000 0.0026 0.035 0.053 0.087 0.263 0.334 0.337 0.437 0.559 0.915 0.951
3. If cars pass a point randomly at an average rate of 10
per minute, what is the probability of 20 or fewer cars
passing in two minutes?
0.000 0.0026 0.035 0.053 0.087 0.263 0.334 0.337 0.437 0.559 0.915 0.951
4. If cars pass a point randomly at an average rate of 10
per minute, what is the probability of 1200 or fewer
cars passing in two hours and ten minutes (that is, 130
minutes)?
0.000 0.0026 0.035 0.053 0.087 0.263 0.334 0.337 0.437 0.559 0.915 0.951
5. A department at a small college has 5 faculty members.
If those faculty are effectively random samples from a
population of potential faculty that is 40% female, what
is the probability that 1 or fewer of the five department
members will be female?
0.000 0.0026 0.035 0.053 0.087 0.263 0.334 0.337 0.437 0.559 0.915 0.951
6. What is the probability that 4 or more will be female?
0.000 0.0026 0.035 0.053 0.087 0.263 0.334 0.337 0.437 0.559 0.915 0.951
Prob 11.45 Government data indicates that the average
hourly wage for manufacturing workers in the United States is
$14 (Statistical Abstract of the United States, 2002). Suppose
the distribution of the manufacturing wage rate nationwide
can be approximated by a normal distribution. A worker
did a nationwide job search and found that 15% of the jobs
paid more than $15.30 per hour. In order to find the standard
deviation of the hourly wage for manufacturing workers, what
process should we try?
A qnorm(0.15, mean=14, sd=15.3)
B Look for x such that pnorm(15.3,
mean=14,sd=x) gives 0.85
C Calculate a z-score using 1.3 as the standard
deviation
D Not enough information is being given.
Prob 11.46 In many social issues, policy recommendations
are based on cases from the extremes of a distribution. Consider,
for example, a news story (Minnesota Public Radio,
March 19, 2006) comparing the high-school graduation rates
of Native Americans (85%) and whites (97%). The disparity
becomes more glaring when one compares high-school dropout
rates, 3% for whites and 15% for Native Americans.
One way to compare these two drop-out rates is simply to
take the ratio: five times as many children in one group drop
out as in the other. Or, one could claim that the graduation
rate for one group is “only” a factor of 1.14 higher than that
for the other. While both of these descriptions are accurate,
neither of them has a unique claim to truth.
Another way to interpret the data is to imagine a student’s
high-school experience as a point on a quantitative continuum.
If the experience is below a threshold, the student does not
graduate. We can imagine the outcome as being the sum of
many contributions: quality of the school and teachers, support
from family, peer influences, personality of the student,
and so on.
Suppose that we model the high-school experience as a
normal distribution with the same standard deviation for
whites and Native Americans but with different means. For
the sake of specificity, let the mean for whites be 100 with a
standard deviation of 20.
1) What is the threshold for graduation? Find a number
such that 3% of whites are below this.
2) Using the threshold you found above, find the mean for
Native Americans such that 15% of children are below
the threshold.
This model is, of course, arbitrary. We don’t know that
there is anything that corresponds to a quantitative high-school
“experience,” and we certainly don’t know that even
if there were it would be distributed according to a normal
distribution. Nevertheless, this can be a helpful way to interpret
data about the “extremes” when making comparisons of
the means.
3) Suppose, contrary to fact, that the drop-out rate for
group A is 15% and that for group B is five times
as high: 75%. If group A has a high-school experience
with mean 100 and standard deviation 20, and group
B has a standard deviation of 20, what should the
mean of group B be to produce the higher drop-out rate?
Enter the work you used to answer these questions in the
box. You can cut and paste from the computer output, but
make sure to indicate clearly and concisely what your answers
are.
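One way to organize the first two calculations, a sketch under the stated normal model:
> thresh = qnorm(0.03, mean=100, sd=20)  # part 1: 3% of whites fall below this
> thresh - qnorm(0.15) * 20              # part 2: mean with 15% below the threshold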
Prob 11.47 Geology Professor Karl Wirth studies the age
of rocks as determined by ratios of isotopes. The figure shows
the results of an age assay of rocks collected at seven sites.
Because of the intrinsically random nature of radioactive decay,
the measured age is a random variable and has been
reported as a mean (in millions of years before the present)
and a standard deviation (in the same units).
From the geology of the sites, four of them have been classified
as “early stage” and three as “main stage.” The graph
clearly indicates that the early stage rocks tend to be younger
than the main stage rocks. But perhaps this is just the luck
of the draw.
Professor Wirth wants to calculate a new random variable:
the difference in mean ages between the early and main
stage rocks. Since the age difference is a random variable,
Prof. Wirth needs to know both the variable’s mean and its
standard deviation.
To get you started on the calculation, here’s the formula
for the difference in mean ages, Δage, of the rocks from the
two different stages:

Δage = (1/3)(M1 + M2 + M3) - (1/4)(E1 + E2 + E3 + E4)

where Mi is the age of a rock from the main stage and Ei is
the age of a rock from the early stage.
To remind you, here are the arithmetic rules for the means
and variances of random variables V and W when summed
and multiplied by fixed constants a and b:
• mean(aV) = a mean(V)
• var(aV) = a² var(V)
• mean(aV + bW) = a mean(V) + b mean(W)
• var(aV + bW) = a² var(V) + b² var(W)
Do calculations based on the above formulas to answer
these questions:
1. What is the mean of Δage? What are the units?
2. What is the variance of Δage? What are the units?
3. What is the standard deviation of Δage? What are the
units?
4. A skeptic claims that the two stages do not differ in age.
He points out, correctly, that since Δage is a random
variable, there is some possibility that its value is zero.
What is the z-score of the value 0 in the distribution of
Δage?
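A minimal sketch of the bookkeeping, using made-up means and standard deviations in place of the values read from the figure:
> main.mean = c(61, 60, 62); main.sd = c(2, 3, 2)        # hypothetical values
> early.mean = c(55, 54, 56, 53); early.sd = c(2, 2, 3, 2)
> d.mean = mean(main.mean) - mean(early.mean)
> d.var = sum((1/3)^2 * main.sd^2) + sum((1/4)^2 * early.sd^2)
> (0 - d.mean) / sqrt(d.var)   # z-score of a zero difference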
Prob 11.48 Below are two different graphs of cumulative
probability distributions. Using the appropriate graph, not
the computer, estimate the items listed below. Your estimates
are not expected to be perfect, but do mark on the
graph to show the reasoning behind your answer:
(a) The 75th percentile of a normal distribution with mean 2
and standard deviation 3.
(b) When flipping 20 fair coins, the probability of getting 7
or fewer heads.
(c) The probability of x being 1 standard deviation or more
below the mean of a normal distribution.
(d) The range that covers 90% of the most likely number of
heads when flipping 20 fair coins.
Chapter 12
Confidence in Models
Reading Questions
• What is the difference between a “standard error” and
a “confidence interval”?
• Why are confidence intervals generally constructed at a
95% level? What does the level mean?
• What is a sampling distribution? What is resampling?
What is it used for?
• How does a confidence interval describe what might be
called the reliability of a measurement?
• Does collinearity between explanatory vectors tend to
make confidence intervals smaller or larger?
Prob 12.01 Here’s a confidence interval: 12.3 ± 9.8 with
95% confidence.
(a) What is 12.3?
margin.of.error
point.estimate
standard.error
confidence.level
confidence.interval
(b) What is 9.8?
margin.of.error
point.estimate
standard.error
confidence.level
confidence.interval
(c) What is 95%?
margin.of.error
point.estimate
standard.error
confidence.level
confidence.interval
Prob 12.02 Look at this report from a model of the kids’
feet data,
> summary(lm(width~length+sex,data=kids))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6412 1.2506 2.912 0.00614
length 0.2210 0.0497 4.447 8.02e-05
sexG -0.2325 0.1293 -1.798 0.08055
(a) Based on the output of the report, which of these statements
is a correct confidence interval on the sexG
coefficient?
A -0.23 ± 0.13 with 95 percent confidence
B -0.23 ± 0.13 with 50 percent confidence
C -0.23 ± 0.13 with 68 percent confidence
D -0.23 ± 0.0805 with 95 percent confidence
E -0.23 ± 0.23 with 68 percent confidence
F None of the above
(b) Based on the output of the report, which of these statements
is a correct confidence interval on the length coefficient?
A 0.22 ± 0.050 with 95 percent confidence
B 0.22 ± 0.050 with 68 percent confidence
C 0.22 ± 0.100 with 50 percent confidence
D 0.22 ± 0.070 with 50 percent confidence
E None of the above
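As a rough check, assuming the kids data frame used in the report above, confint() gives exact 95% intervals, which should be close to estimate ± 2 standard errors:
> confint(lm(width ~ length + sex, data=kids), level=0.95)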
Prob 12.04 A confidence interval is often written as a point
estimate plus or minus a margin of error: P ± M with C percent
confidence. How does the size of the margin of error M
depend on the confidence level C?
A It doesn’t.
B It increases as C increases.
C It decreases as C increases.
Prob 12.05 A statistics professor sends her students out
to collect data by interviewing other students about various
quantities, e.g., their SAT scores, GPAs and other data for
which the college registrar retains official records. Each student
is assigned one quantity and one target population, for
example, the verbal SAT scores of all female students, or the
cumulative grade point average for sophomore males.
Each student interviews some students — how many depends
on the student’s initiative and energy. The student
then reports the results in the form of a confidence interval,
P ± M with 95% confidence.
After the students hand in their reports, the professor contacts
the registrar to find the population parameters for each
group that the students surveyed. Assuming that the interviewed
students provided accurate information, what fraction
of the students’ confidence intervals will contain the corresponding
population parameter?
A 95%
B 50%
C 5%
D Can’t tell, it depends on how much data
each student collected.
E Can’t tell, the students each looked at different
quantities, not all of them at the
same quantity.
Prob 12.10 Here are three different model statements for
the kids’ feet data.
width ~ 1
width ~ sex
width ~ sex - 1
Each of the above models for kids’ feet is relevant to
one of the problems below. Fit the model to the data in
kidsfeet.csv and interpret your results to give a 95%
confidence interval on these quantities written in the standard
form: point estimate ± margin of error.
1. The mean width of boys’ feet.
Point estimate: 8.76 9.19 9.37 9.98 10.13
Margin of error: 0.041 0.176 0.211 0.352 1.430 6.540
2. The mean width of all children’s feet.
Point estimate: 8.15 8.99 9.13 9.86 12.62
Margin of error: 0.16 0.18 0.22 0.35 1.74
3. The difference between the means of boys’ and girls’
foot widths. (The differences can be either positive or
negative, depending on whether it is boys minus girls
or girls minus boys. State your difference as a positive
number.)
Point estimate: 0.406 0.458 0.514 0.582 0.672
Margin of error: 0.16 0.18 0.22 0.30 1.74
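A minimal sketch, assuming the data loads with fetchData() as elsewhere in these exercises; the midpoint and half-width of each confint() interval give the point estimate and margin of error:
> kids = fetchData("kidsfeet.csv")
> confint(lm(width ~ sex - 1, data=kids))  # separate means for boys and girls
> confint(lm(width ~ 1, data=kids))        # mean width of all children
> confint(lm(width ~ sex, data=kids))      # difference between the means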
Prob 12.11 What’s wrong with the following statement
written by a student on an exam?
The larger the number of cases examined
and taken into account, the more likely your estimation
will be accurate. Having more cases decreases
your risk of having a bias and increases the
probability that your sample accurately represents
the real world.
Prob 12.12 In 1882, Charles Darwin wrote about earthworms
and their importance in producing soil.
Hensen, who has published so full and interesting
an account of the habits of worms, calculates,
from the number which he found in a measured
space, that there must exist 133,000 living worms
in a hectare of land, or 53,767 in an acre. — p.
161, “The Formation of Vegetable Mould, through
the Action of Worms with Observations on their
Habits”
While 133,000 seems sensibly rounded, 53,767 is not. This
problem investigates some of the things you can find out about
the precision of such numbers and how to report them using
modern notation, which wasn’t available to Darwin or his contemporaries.
Background: A hectare is a metric unit of area, 10,000
square meters. An acre is a traditional unit of measure, with
one acre equivalent to 0.4046863 hectares. That is, an acre is
a little less than half a hectare.
The implicit precision in Hensen’s figure is 133,000 ± 500,
since it is rounded to the thousands place. Correctly translate
the Hensen figure to be in worms per acre.
1. Literally translating 133,000 worms per hectare to
worms per acre gives what value?
53760 53767 53770 53823 53830
2. Literally translating ±500 worms per hectare to worms
per acre gives what value?
197 200 202 205 207
3. Which one of these reports gives a proper account for
the number of worms per acre?
A 53767 ± 202
B 53823 ± 200
C 53820 ± 200
D 53830 ± 200
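A quick check of the unit conversion (worms per hectare times hectares per acre):
> 133000 * 0.4046863   # literal translation of the count
> 500 * 0.4046863      # literal translation of the implied precision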
Of course, it’s just an assumption that Hensen’s precision
is ±500. Imagine that the way Hensen made his estimate
was to dig up 10 different patches of ground, each of size one
square meter. In each patch, Hensen counted the worms found
then added these together to get the total number of worms
in 10 square meters. Since Hensen reported 133,000 worms
per hectare, he would have found a total of 133 worms in the
ten square meters he actually dug up.
Of course, if Hensen had picked a different set of 10 patches
of soil in the same field, he would likely not have found exactly
133 worms. There is some sampling variability to the
number of worms found.
Using an appropriate probability model for the number of
worms to be found in 10 square meters of soil, estimate the
standard deviation of the number found, assuming that on
average the number is 133 per 10 square meters.
1. What is an appropriate probability model?
gaussian uniform exponential poisson binomial
2. Using the appropriate probability model, what standard
deviation corresponds to a mean of 133 per 10 square
meters? (Hint: You can use a random number generator
to make a large sample of draws and then find the standard
deviation of this sample; see the sketch after this problem.)
2.1 7.9 11.5 15.9 58.2 102
3. Using your standard deviation, and recalling that the
number of worms in one hectare will be 1000 times that
found in 10 square meters, give an appropriate 95%
confidence interval to use today in reporting Hensen’s result.
A 133,000 ± 23000
B 133,000 ± 2100
C 133,000 ± 16000
D 133,000 ± 20000
E 130,000 ± 120000
4. Now imagine, almost certainly contrary to fact, that
Hensen had actually dug up an entire hectare and found
133,201 worms, and rounded this to 133,000 just for the
sake of not seeming silly. Of course, this would have
been a heroic effort just to gain precision on the count.
It would also be futile, since the number in a “hectare
of soil” presumably differs widely depending on the soil
conditions. But if Hensen had calculated a 95% confidence
interval using an appropriate probability model
on the count of 133,201 worms, rather than just rounding
to what seems reasonable, what would have been his
margin of error?
730 2100 16000 58000 190000
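A sketch for the hint in part 2: simulate many Poisson draws with mean 133 and take their standard deviation, which will be close to sqrt(133). The hectare count is 1000 times the 10-square-meter count, so its standard deviation is 1000 times larger.
> sd(rpois(100000, lambda=133))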
Chapter 13
The Logic of Hypothesis Testing
Reading Questions
• What is a “null hypothesis”? Why does it play a special
role in hypothesis testing?
• Why is it good to have a hypothesis test with a low
“significance level”?
• Why is it good to have a hypothesis test with a high
“power”?
• What is a p-value?
• Why are there two kinds of potential errors in hypothesis
testing, Type I and Type II?
Prob 13.01 Which of the following are legitimate possible
outcomes of a hypothesis test? Mark as “true” any legitimate
ones.
(a) True or False accept the alternative hypothesis
(b) True or False accept the null hypothesis
(c) True or False reject the alternative hypothesis
(d) True or False fail to reject the alternative hypothesis
(e) True or False reject the null hypothesis
(f) True or False fail to reject the null hypothesis
(g) True or False indeterminate result
Pick one of the illegitimate outcomes and explain why it
is illegitimate.
Prob 13.02 Consider the two conditional sampling distributions
shown in the figure.
[Figure: “Conditional Sampling Distributions” — probability density
of the test statistic under the null and alternative hypotheses.]
Imagine that you set rejection criteria so that the power
was 99%. What would be the significance of the test? (Choose
the best answer.)
A About 0.05.
B About 0.30
C About 0.95
D Can’t tell (even approximately) from the
information given.
We sometimes speak of the probability of Type I or Type
II errors. We want to know what sort of probability these are.
To simplify notation, we’ll define the following outcomes:
N The Null Hypothesis is correct.
A The Alternative Hypothesis is correct.
Fail Fail to reject the Null Hypothesis
Reject Reject the Null Hypothesis.
What is the probability of a Type I error?
A A joint probability: p(Reject and N)
B A joint probability: p(Fail and A)
C A conditional probability: p(Reject given
N)
D A conditional probability: p(Reject given
A)
E A marginal probability: p(Reject)
F A marginal probability: p(Fail)
What is the probability of a Type II error?
A A joint probability: p(Reject and A)
B A joint probability: p(Fail and N)
C A conditional probability: p(Fail given N)
D A conditional probability: p(Fail given A)
E A marginal probability: p(N)
F A marginal probability: p(A)
Prob 13.14 Consider these two sampling distributions:
[Figure: “Conditional Sampling Distributions” — probability density
of the test statistic under the null and alternative hypotheses.]
Suppose that you insisted that every hypothesis test have
a significance of 0.05. For the conditional sampling distributions
shown above, what would be the power of the most
powerful possible one-tailed test?
A About 5%
B About 50%
C About 95%
D Can’t tell from the information given.
What would be the power of the most powerful possible
two-tailed test?
A About 10% bigger than the one-tailed test.
B About 10% smaller than the one-tailed test.
C About 50% bigger than the one-tailed test.
D About 50% smaller than the one-tailed test.
E Can’t tell from the information given.
Prob 13.15 The Ivory-billed woodpecker has been thought
to be extinct; the last known observation of one was in 1944.
A new sighting in the Pearl River forest in Arkansas in 2004
became national news and excited efforts to confirm the sighting.
These have not been successful and there is skepticism
whether the reported 2004 sighting was correct. This is not
an easy matter since the 2004 sighting was fleeting and the
Ivory-billed woodpecker is very similar to a relatively common
bird, the Pileated woodpecker.
[Figure: Pileated (left) and Ivory-billed Woodpeckers.
Drawings by Ernest S Thompson & Rex Brasher.]
This problem is motivated by the controversy over the
Ivory-billed sighting, but the problem is a gross simplification.
Ivory-billed woodpeckers are 19 to 21 inches long, while
Pileated woodpeckers are 16 to 19 inches. Ivory-bills are generally
glossy blue-black, whereas Pileated woodpeckers are
generally dull black, slaty or brownish-black. Ivory-bills have
white wing tops while Pileated woodpeckers have white underwings.
These differences would make it easy to distinguish
between the two types of birds, but sightings are often at a
distance in difficult lighting conditions. It can be difficult,
particularly in the woods, to know whether a distant bird is
flying toward the observer or away.
Imagine a study where researchers display bird models in
realistic observing conditions and record what the observers
saw. The result of this study can usefully be stated as conditional
probabilities:
                             Alternative    Null
Observed            Code    Ivory-billed   Pileated
Short & Dull          A         0.01         0.60
Long & Dull           B         0.10         0.13
Short & Glossy        C         0.04         0.20
Long & Glossy         D         0.60         0.05
Short & White Back    E         0.05         0.01
Long & White Back     F         0.20         0.01
For simplicity, each of the six possible observations has
been given a code: A, B, C, and so on.
The Bayesian Approach The table above gives conditional
probabilities of this form: given that the bird is an
Ivory-billed woodpecker, the probability of observation D is
0.60.
In the Bayesian approach, you use the information in the
table to compute a different form of conditional probability, in
this form: given that you observed D, what is the probability
that the bird is an Ivory-billed woodpecker?
In order to switch the form of the conditional probability
from “given that the bird is ...” to “given that the observation
is ...”, you need some additional information, called the prior
probability. The prior probability reflects your view of how
likely a random bird is to be an Ivory-billed or Pileated before
you make any observation. Then, working through the
Bayesian calculations, you will find the posterior probability,
that is, the probability after you make your observation.
Suppose that, based on your prior knowledge of the history
and biology of the Ivory-bill, your prior probability that
the sighting was really an Ivory-bill is 0.01, and the probability
that the sighting was a Pileated is 0.99. With this
information, you can calculate the joint probability of any of
the 12 outcomes in the table.
Then, by considering each row of the table, you can calculate
the posterior probability of Ivory-bill vs Pileated for
each of the possible observations.
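A minimal sketch of this bookkeeping in R (the vector names are our own, not part of any package; the numbers come from the table and prior above):

> prior.ivory = 0.01; prior.pileated = 0.99
> p.obs.ivory = c(A=0.01, B=0.10, C=0.04, D=0.60, E=0.05, F=0.20)
> p.obs.pileated = c(A=0.60, B=0.13, C=0.20, D=0.05, E=0.01, F=0.01)
> joint.ivory = prior.ivory * p.obs.ivory          # joint: p(obs and Ivory-bill)
> joint.pileated = prior.pileated * p.obs.pileated # joint: p(obs and Pileated)
> joint.ivory / (joint.ivory + joint.pileated)     # posterior p(Ivory-bill given obs), by Bayes' rule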
(a) What is the joint probability of a sighting being both D
and Ivory-billed? (Pick the closest one.)
impossible 0.006 0.01 0.05 0.60
(b) What is the joint probability of a sighting being both D
and Pileated? (Pick the closest one.)
impossible 0.006 0.01 0.05 0.60
(c) What is the conditional probability of the sighting being
an Ivory-billed GIVEN that it was D? (Pick the closest
one.)
0.01 0.05 0.10 0.60
(d) Which of the possible observations would provide the
largest posterior probability that the observation was of
an Ivory-bill?
A B C D E F
(e) What is that largest posterior probability? (Pick the closest
one.)
0.02 0.08 0.17 0.73
The Hypothesis-Testing Approach The hypothesis-testing
approach is fundamentally different from the Bayesian
approach. In hypothesis testing, you pick a null hypothesis
that you are interested in disproving. Since the Pileated
woodpecker is relatively common, it seems sensible to say that
the null hypothesis is that the observation is a Pileated woodpecker,
and treat the Ivory-billed as the alternative hypothesis.
Once you have the null hypothesis, you choose rejection
criteria based on observations that are unlikely given the null
hypothesis. You can read these off directly from the conditional
probability table given above.
1. The two least likely observations are E and F. If observing
E or F is the criterion for rejecting the null hypothesis,
what is the significance of the test?
0.01 0.02 0.05 0.25 0.60 0.65
What would be the power of this test?
0.01 0.05 0.25 0.60 0.65
2. Now suppose the rejection criteria are broadened to include
D, E, and F.
What would be the significance of such a test?
0.01 0.02 0.05 0.07 0.10 0.20 0.25 0.60 0.85 other
What would be the power of such a test?
0.01 0.02 0.05 0.07 0.10 0.20 0.25 0.60 0.85 other
Comparing the Bayesian and hypothesis-testing approaches,
explain why you might reject the null hypothesis
even if the observation was very likely to be a Pileated woodpecker.
Chapter 14
Hypothesis Testing on Whole Models
Reading Questions
• A permutation test involves randomizing in order to simulate eliminating the relationship, if any, between the explanatory variables and the response variable. What's the point of this?
• The F statistic compares two quantities. What are they?
• What is a "degree of freedom?"
Prob 14.01 Your friend is interested in using natural teas
to replace commercial drugs. He has been testing out his
own tea blend, which he calls “Special Cooling Tea” to reduce
temperatures of people with the flu. Fortunately for him, he
had a bad flu with fever that lasted for 3 days. During this
time, he alternated taking Special Cooling Tea and taking a
conventional fever-reducing drug, Paracetamol. On each occasion,
he measured the change in his body temperature over
one hour. His data were
Treatment   Change in Temp. (F)
Tea         -1.2
Drug        -2.0
Tea         -0.7
Drug        -1.5
Tea         +0.0
Drug        -0.7
Your friend is excited by these results. When he fit the
model temperature change ~ Treatment he got the following
report:
Df Sum Sq Mean Sq F value Pr(>F)
treat 1 0.13 0.13 0.11 0.7555
Residuals 4 4.85 1.21
Based on this report, your friend claims, "This shows that
my Special Cooling Tea is just as effective as Paracetamol."
Which of the following features of the Anova report
supports your friend's conclusion that the two treatments
are equally effective:
A There are 4 degrees of freedom in the residual.
B The p-value is very small.
C The p-value is not small.
D There is no p-value on the residuals.
What’s an appropriate Null Hypothesis is this setting:
A The Tea and Paracetamol have the same
effect.
B The Tea is more effective than Paracetamol.
C The Tea is less effective than Paracetamol.
Of course, your friend’s conclusion that his experiment
shows that Tea and Paracetamol are the same is actually unjustified.
Failing to reject the null hypothesis is not the same
as accepting the null hypothesis. You explain this to your
friend.
He’s disappointed, of course. But then he gets a bit annoyed.
“How can you ever show that two things are the same
if your never get to accept the null hypothesis?”
“Well,” you respond, “the first thing is to decide what ‘the
same’ means. In this case, you’re interested in the fever reduction.
How big a difference in the temperature reduction
would be medically insignificant?”
Your friend thinks for a while. “I suppose no one would
care about 0.2 degrees.”
“Then you have to make your measurement precise to
about 0.2 degrees. If such a precise measurement shows no
difference, then it’s reasonable to claim that your Tea and
the drug are equivalent.” Using your friend’s data, you fit the
model and look at the regression report.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.9333 0.6360 -1.47 0.2161
treatTea 0.3000 0.8994 0.33 0.7555
The margin of error on the difference in temperature reduction
is ±1.8 degrees, based on the 6 cases your friend
recorded.
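As a reminder, the standard error (and hence the margin of error) typically shrinks in proportion to 1/sqrt(n). A quick sketch of that scaling (our own illustrative arithmetic, not part of your friend's data):

> n = c(6, 24, 96)    # candidate sample sizes
> 1.8 * sqrt(6/n)     # margin of error scales as 1/sqrt(n): 1.80 0.90 0.45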
? To reduce the margin of error to ±0.2 degrees, about
how many cases altogether should your friend plan to
collect? (Choose the closest one. For instance, 12
would be a doubling from the 6 actually collected.)
12 75 150 500 2000
Prob 14.02 You can test the null hypothesis that the population
mean of a variable x is zero by constructing a model
x~1 and interpreting the coefficient on the intercept term 1.
Often, the null hypothesis that the population mean is
zero is irrelevant. For example, in the kids' feet data, the
mean foot width is 8.99 cm. There is no physical way for the
population mean to be zero.
But, suppose you want to test the null hypothesis that the
population mean (for the population of school-aged children
from whom the sample was drawn) is 8.8 cm. To do this, create
a new variable that would be zero if the population mean
were 8.8 cm. This is simple: make the new variable by subtracting
8.8 cm from the original variable. The new variable
can then be tested to see if its mean is different from zero.
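A minimal sketch of this approach (using I() to do the subtraction inside the model formula; width is the foot-width variable in kidsfeet.csv):

> feet = fetchData("kidsfeet.csv")
> summary( lm( I(width - 8.8) ~ 1, data=feet ) )  # p-value on the intercept tests mean = 8.8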
Using the kids-feet data:
1. Find the p-value on the null-hypothesis that the population
mean of the foot width is 8.8 cm.
0.000 0.024 0.026 0.058 0.094 0.162 0.257
2. Find the p-value on the null-hypothesis that the population
mean of the foot width is 8.9 cm.
0.00 0.23 0.27 0.31 0.48 0.95
Prob 14.04 You are performing an agricultural experiment
in which you will study whether placing small bits of iron in
the soil will increase the yield of barley. Randomly scattered
across an enclosure owned by the state's agricultural field office,
you have marked off ten plots, each of size 10 meters
by 10 meters. You randomly assign five of the plots to be
controls, and in the other five you place the iron.
A colleague hears of your experiment and asks to join it.
She would like to test zinc. Still another colleague wants to
test copper and another has two different liquid fertilizers to
test. The head of the field office asks you to share the data
from your control plots, so that the other researchers won’t
have to grow additional control plots. This will reduce the
overall cost of the set of experiments. He points out that the
additional experiments impose absolutely no burden on you
since they are being grown in plots that you are not using.
All you need to do is provide the yield measurements for the
control plots to your colleagues so that they can use them in
the evaluation of their own data.
You wonder whether there is a hidden cost. Will you need
to adjust the p-value threshold? For instance, a Bonferroni
correction would reduce the threshold for significance
from 0.05 to 0.01, since there are five experiments altogether
(iron, zinc, copper, and two fertilizers, each compared to a
control). What’s the cost of making such an adjustment?
A It reduces the significance level of your experiment.
B It reduces the power of your experiment.
C It invalidates the comparison to the control.
D None of the above.
What’s the argument for making an adjustment to the p-value
threshold?
A Doing multiple experiments increases the
probability of a Type I error.
B Doing multiple experiments increases the
probability of a Type II error.
C Doing multiple experiments introduces interaction
terms.
D The other experiments increase the power
of your experiment.
What’s the argument for not making an adjustment to the
p-value threshold?
A The additional experiments (zinc, copper,
fertilizers) have nothing to do with my
iron experiment, so I can reasonably ignore
them.
B You don’t think the other experiments will
be successful.
C A p-value of 0.05 is always appropriate.
Suppose that the other experimenters decided that they had
sufficient budget to make their own control plots, so that their
experiments could proceed independently of yours (although
all the plots would be in the same field). Would this change
the arguments for and against adjusting the p-value?
A No. There are still multiple tests being
done, even if the data sets don’t overlap.
B Yes. Now the tests aren’t sharing any data.
The idea that a researcher’s evaluation of a p-value should
depend on whether or not other experiments are being conducted
has been termed “inconsistent” by Saville. [?] He
writes, “An experiment is no more a natural unit than a
project consisting of several experiments or a research program
consisting of several projects.” Why should you do an
adjustment merely because a set of experiments happened to
be performed by the same researcher or in the same laboratory?
If you’re going to adjust, why shouldn’t you adjust for
all the tests conducted in the same institution or in the same
country?
This quandary illustrates that data do not speak for themselves.
The evaluation of evidence needs to be made in the
context in which the data were collected. In my view, the
iron researcher would be justified in not adjusting the p-value
threshold because the other experiments are irrelevant to his.
However, the field office head, who is supervising multiple
experiments, should be making adjustments. This is paradoxical.
If the p-value from the iron experiment were 0.04,
the individual researcher would be justified in declaring the
effect of iron significant, but the office head would not be.
Same data, different results.
This paradox has an impact on how you should interpret
the reports that you read. If you are reading hundreds of reports,
you need to keep a skeptical attitude about individual
reports with a p-value of 0.01. A small p-value is only one
piece of evidence, not a proof.
Prob 14.05 One of the difficulties in interpreting p-values
comes from what is sometimes called publication bias: only
those studies with sufficiently low p-values are published; we
never see studies on the same subject that didn’t have low
p-values.
We should interpret a p-value of, say, 0.03 differently in
these two situations:
1. The study that generated the p-value was the only one
that has been done on that subject. In this case, the
p-value can reasonably be taken at face value.
2. The study that generated the p-value was one of 20 different
studies but was the only one of the 20 studies
that generated a p-value below 0.05. In this case, the
p-value has little genuine meaning since, even if the null
hypothesis were true we wouldn’t be surprised to see a
p-value like 0.03 from the “best” of 20 studies.
It’s sometimes the case that “para-normal” phenomena —
mind-reading, for instance — are subjected to studies and
that the studies sometimes give low p-values. It’s hard to
interpret the p-value directly because of the publication bias
effect. We can, of course, ask that the study be repeated,
but if there are many repeats that we never hear about (because
they happened to generate large p-values), it can still
be difficult to interpret the p-value.
There is one way out, however. In addition to asking that
the study be repeated, we can ask that the sample size be
increased. The reason is that the larger sample size should
generate much smaller p-values if the studied effect is genuine.
However, if the effect is spurious, we’ll still expect to
see published p-values around 0.03.
To explore this, consider the following result of a fictitious
study of the possible link between "positive thinking"
(posThinking) and the t-cell count (tcells) in the immune system
of 50 different people. The study has produced a “suggestive”
p-value of about 0.08:
Model: tcells ~ posThinking
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6412 1.2506 2.912 0.00614
posThinking 0.2325 0.1293 1.798 0.07847
Here is a statement that can generate the p-value:
> 2*(1-pt(1.798,df=48))
[1] 0.07846795
Why is the output of pt subtracted from 1?
A This is always the way pt is used.
B Because the t-value is positive.
C Because we want a one-tailed test.
D Because the t-value is less than 2.
The df in the statement stands for the degrees of freedom
of the residual, just as in the F test. There were 50 cases and
two model vectors (the intercept and the posThinking indicator
vector), so 48 degrees of freedom in the residual.
Now imagine that the same study had been repeated with
four times as much data: 200 cases. Assuming that everything
else remained the same, how big would the standard
error on posThinking be given the usual relationship between
the size of the standard error and the number of cases n:
• Standard error with n = 200 cases:
0.52 0.26 0.13 0.065 0.032
• With n = 200 cases, how many degrees of freedom will there be in the residual?
48 50 98 100 198 200
• Using the standard error with n = 200 cases, and assuming
that the coefficient remains exactly as reported
in the table, recalculate the p-value on posThinking.
Select the one of these that is closest to the p-value:
0.4 0.04 0.004 0.0004 0.00004 0.000004
Prob 14.10 The t- and F-distributions are closely related.
That is, you can calculate the p-value on a t-statistic by using
the F distribution, if you do it right.
Generate a large random sample from a t-distribution with
10 degrees of freedom. Then generate another random sample
from the F-distribution with 1 and 10 degrees of freedom.
Graphically compare the distributions of the F and t values.
What do you see? True or False They are the same.
True or False The F distribution is much more skewed
to the right.
True or False The t distribution is symmetrical
around zero.
Now compare the distributions of the F values and the
square of the t values. What do you see?
True or False They are the same.
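One possible sketch of the comparison (the sample size and the lattice plotting calls, loaded with mosaic, are our choices):

> tvals = rt(10000, df=10)          # random draws from a t-distribution, 10 d.f.
> fvals = rf(10000, df1=1, df2=10)  # random draws from an F-distribution, 1 and 10 d.f.
> densityplot( ~ tvals )
> densityplot( ~ fvals )
> densityplot( ~ tvals^2 )          # compare this to the F draws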
Prob 14.11 You know that you can compute the sample
mean m and variance s² of a variable with simple calculations.
Such calculations are implemented with simple computer
commands, e.g.
feet = fetchData("kidsfeet.csv")
mean( length, data=feet )
[1] 24.72
var( length, data=feet )
[1] 1.736
This is a fine way to calculate these quantities. But, just
to show the link of these quantities to modeling, I ask you to
do the following with the kids-feet data (kidsfeet.csv):
1. Fit a model where one of the coefficients will be the
mean of the length. Enter the model report below.
2. Perform Anova on the same model. One of the numbers
in the Anova report will be the variance. Which
one is it?
3. Based on the above, explain why the calculation of the
sample variance involves dividing by N − 1 rather than N.
Prob 14.12 Zebra mussels are a small, fast reproducing
species of freshwater mussel native to the lakes of southeast
Russia. They have accidentally been introduced in other areas,
competing with native species and creating problems for
people as they cover the undersides of docks and boats, clog
water intakes and other underwater structures. Zebra mussels
even attach themselves to other mussels, sometimes starving
those mussels.
[Photo: Zebra mussels growing on a larger, native mussel.]
Ecologists Shirley Baker and Daniel Hornbach examined
whether zebra mussels gain an advantage by attaching to
other mussels rather than to rocks.[?] The ecologists collected
samples of small rocks and Amblema plicata mussels, each of
which had a collection of zebra mussels attached. The samples
were transported to a laboratory where the group of mussels
from each individual rock or Amblema were removed and
placed in an aquarium equipped to measure oxygen uptake
and ammonia excretion. After these physiological measurements
were made, the biochemical composition of the mussel
tissue was determined: the percentage of protein, lipid, carbohydrate,
and ash.
Baker and Hornbach found that zebra mussels attached
to Amblema had greater physiological activity than those attached
to rocks as measured by oxygen uptake and ammonia
excretion. But this appears to be a sign of extra effort for
the Amblema-attached zebra mussels, since they had lower
carbohydrate and lipid levels. In other words, attaching to
Amblema appears to be disadvantageous to the zebra mussels
compared to attaching to a rock.
In this problem, you will use the data collected by Baker
and Hornbach to reprise their statistical analysis. The data
are in the file zebra-mussels.csv.
GroupID dry.mass count attachment ammonia O2
1 1 0.55 20 Rock 0.075 0.82
2 2 0.45 19 Rock 0.066 0.70
3 3 0.37 20 Rock 0.074 0.62
... and so on ...
28 28 0.89 28 Amblema 0.248 2.68
29 29 0.70 31 Amblema 0.258 2.26
30 30 0.68 64 Amblema 0.235 2.05
There are 30 cases altogether. The unit of analysis was
not an individual zebra mussel but the group of zebra mussels
that were attached to individual rocks or Amblema.
The variables are:
dry.mass The mass of the dried tissue of the group, in
grams.
count How many mussels in the group.
attachment The substrate to which the mussels in the
group were attached: a rock or an Amblema mussel.
O2 Oxygen uptake in mg per hour for the group.
ammonia – Nitrogen excretion measured as ammonia in
mg per hour for the group.
lipid, protein, carbo, ash Biochemical makeup as percent
of the dry weight for each of these.
Kcal Total calorific value of the tissue in kilo-calories per
gram.
A first question is whether the amount of mussel tissue is
greater for the rock-attached or the Amblema-attached zebra
mussels. Here is the report of a model:
> z = fetchData("zebra-mussels.csv")
> summary(lm(dry.mass ~ attachment, data=z))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6846 0.0554 12.36 1.2e-12
attachmentRock -0.1983 0.0770 -2.58 0.016
Based on this report, which of these claims do data support?
A Smaller mass for rock-attached mussels, as
a group.
B Larger mass for rock-attached mussels, as
a group.
C No statistically significant difference in
mass between the rock- and Amblema-attached
mussels.
The dry.mass variable gives the mass of the entire group
of mussels. It might be that the mussels were systematically
different in size on the two different substrates. One way to
look at this is to compute the dry mass per individual mussel.
This can be calculated by dividing dry.mass by count:
> z = transform( z, ind.mass=dry.mass/count )
Fit a model of ind.mass versus attachment. Which of the
claims does this model support?
A Smaller mass for rock-attached mussels, as
a group.
B Larger mass for rock-attached mussels, as
a group.
C No statistically significant difference in
mass between the rock- and Amblema-attached
mussels.
In summarizing their results about oxygen uptake and nitrogen
excretion, Baker and Hornbach wrote:
Attachment substrate had significant effects on
ammonia excretion and oxygen uptake rates.
Dreissenids [that is, Zebra mussels] attached to
Amblema had higher ammonia excretion rates
(p < 0.0001) and higher oxygen uptake rates (p <
0.0001) compared to dreissenids attached to rocks.
To check their statistics, fit these models for oxygen uptake
per individual and ammonia excretion per individual:
> mod1 = lm( O2/count ~ attachment, data=z )
> mod2 = lm( ammonia/count ~ attachment, data=z )
The coefficients on the attachment variable indicate that:
• ammonia excretion is lower / higher for rock-attached zebra mussels than for Amblema-attached mussels.
• oxygen uptake is lower / higher for rock-attached than for Amblema-attached.
• True or False These results are consistent with Baker and Hornbach's claims.
Look at the p-values on the attachment variable. What
are they?
Ammonia: 0.01 0.04 0.10 0.25 0.40
Oxygen: 0.01 0.04 0.10 0.25 0.40
Perhaps you are surprised to see that the p-values from
your models are quite different from those reported by Baker
and Hornbach. That’s because Baker and Hornbach included
ind.mass as a covariate in their models, in order to take into
account how the metabolic rate might depend on the mass of
the mussel.
Following Baker and Hornbach’s procedure, fit these models:
> mod3 = lm( O2/count ~ ind.mass + attachment, data=z )
> mod4 = lm( ammonia/count ~ ind.mass + attachment, data=z )
You’ll get different p-values for the attachment variable in
these models depending on whether the analysis is done using
Anova (which is often called "Analysis of Covariance" (ANCOVA)
when there is a covariate in the model), or is done
using a regression report. In ANCOVA, the p-value depends
on whether ind.mass is put before or after attachment.
What is the regression report p-value on attachment?
Oxygen 0.0000016 0.000016 0.0016 0.016 0.16
Ammonia 0.0000005 0.0005 0.0001 0.001 0.01 0.1
The reason that including ind.mass as a covariate has improved
the p-value is that ind.mass has soaked up the variance
in the response variable. Show an Anova report for one of
the two models mod3 or mod4 and explain what in the report
demonstrates that ind.mass is soaking up variance.
In contrast to the dramatic effect of ind.mass on the p-values,
including it as a covariate does not seem to affect the
coefficient on attachment much. What is it about ind.mass that
explains this?
A ind.mass is almost entirely uncorrelated
with the response variable.
B ind.mass is almost entirely uncorrelated
with attachment.
C ind.mass is strongly correlated with the response
variable.
D ind.mass is strongly correlated with
attachment.
Prob 14.21 The table shows a brief data set that we will
use to trace through a simple one-way ANOVA calculation by
hand.
Age Group
18 Student
22 Student
45 Parent
51 Parent
35 Professor
57 Professor
Our model will be Age ~ 1 + Group
The ANOVA report breaks down the model into a series
of nested models. In this case, the series is simple:
1. Age ~ 1
2. Age ~ 1 + Group
For each of these nested models, you need to calculate
the “degrees of freedom” and the sum of squares of the fitted
model values. The degrees of freedom is the number of model
vectors involved. The sum of squares is found by fitting the
model and adding up the squares of the fitted values.
We will do this by hand. First, fit Age ~ 1 and enter
the fitted values into the table. “Fitting” is easy here, since
the coefficient on Age ~ 1 is just the grand mean of the Age
variable.
For Age ~ 1
Age Group Fitted
18 Student
22 Student
45 Parent
51 Parent
35 Professor
57 Professor
Once you have written down the fitted values for each case,
compute the sum of squares. Since there is one model vector,
there is just one degree of freedom for this model.
Next, fit the model Age ~ 1 + Group and enter the fitted
values into the table. This model is also easy to fit, since the
fitted values are just the groupwise means.
For Age ~ 1 + Group
Age Group Fitted
18 Student
22 Student
45 Parent
51 Parent
35 Professor
57 Professor
Compute the sum of squares for this model. Since there
are three groups, there are three model vectors and hence
three degrees of freedom.
Finally, compute the sum of squares of the Age variable
itself. You might like to think of this as the sum of squares of
the fitted values to a “perfect” model, a model that captures
all of the variability in Age. We know that such a model can
always be constructed so long as we allow N model vectors,
even if they are junk, so the degrees of freedom ascribed to this
“perfect” model is N. (We put “perfect” in quotation marks
to emphasize that this model is perfect only in the sense that
the residuals are zero. That can happen anytime we have as
many model vectors as cases, that is, when m = N.)
Enter the degrees of freedom and the sums of squares of
the fitted values into the table below:
Model D.F. Sum of Squares of Fitted
Age ~ 1
Age ~ 1 + Group
“Perfect” model
We are almost at the point where we can construct the
ANOVA table. At the heart of the ANOVA table is the
degree-of-freedom and sum-of-squares information from the
above table. But rather than entering the sum of squares directly,
we subtract from each of the sum of squares the total
of all the sums of squares of the models that appear above it.
That is, we give each successive nested model credit only for
the amount that it increased the sum of squares from what it
had been in the previous model. A similar accounting is done
for the degrees of freedom. In the following table, we'll mark
the headers with Δ to emphasize that you should enter the change in
sum of squares and the change in degrees of freedom from the
previous model.
Fill in the first two columns of the table. Then go back
and fill in the “mean square” column, which is just the sum
of squares divided by the degrees of freedom. Similarly, fill
in the F column, which is the mean square for each model
divided by the mean square for the perfect model.
Model    Δ D.F.    Δ S.S.    M.S.    F ratio    p-value
Age ~ 1
Age ~ 1 + Group
“Perfect” model
You can use software to compute the p-value. The relevant
parameters of the F distribution are the degrees of freedom of
the model and the degrees of freedom of the “perfect” model.
Of course, in a real ANOVA table, rows are labeled differently.
Each row gives a name to the terms that were added
in making the model. So, in the above table, the labels will
be “Intercept,”“Group,” and “Residuals.”
Compare your table to one constructed with software. Enter
the data into a spreadsheet, save it to disk in an appropriate
format (e.g., CSV), then read it into the statistical
software and create the ANOVA table.
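For instance, a minimal sketch that skips the spreadsheet and enters the data directly (the data frame name is ours):

> ages = data.frame(
+   Age = c(18, 22, 45, 51, 35, 57),
+   Group = c("Student", "Student", "Parent", "Parent", "Professor", "Professor") )
> anova( lm( Age ~ Group, data=ages ) )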
Prob 14.22 The special program rand() generates random
vectors for use in modeling. It’s special because it works only
within the lm command. For example, suppose we want to
create three random model vectors along with an intercept
term to model the kids’ foot width data:
> lm( width ~ rand(3), data=kids)
Coefficients:
(Intercept) rand(3)1 rand(3)2 rand(3)3
9.00838 0.01648 -0.05185 0.01627
> lm( width ~ rand(3), data=kids)
(Intercept) rand(3)1 rand(3)2 rand(3)3
8.99367 -0.09795 -0.06916 0.05676
The coefficients are different in the two models because the
“explanatory” vectors are random.
We’re going to study the R2 due to such collections of
random vectors. This can be calculated with the r.squared
program:
> r.squared(lm( width ~ rand(3), data=kids))
[1] 0.1770687
> r.squared(lm( width ~ rand(3), data=kids))
[1] 0.03972449
Note that the R² values vary from trial to trial, because
the vectors are random.
According to the principle of equipartition, on average,
each random vector should contribute 1/(n − 1) to the R² and
the effect of multiple vectors is additive. So, for example, with
n = 39 cases, the three random vectors in the above example
should result in an R² that is near 3/(39 − 1) = 0.08.
Repeat the above calculation of R² many times for p = 3
random vectors and find the mean of the resulting R². You
can do the repetitions 100 times with a statement like this:
> samps=do(100)*r.squared(lm(width~rand(3),data=kids))
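The mean of the trials can then be found with something like the following (do() stores the repeated values in a column named result, as in the output shown in Prob 14.24):

> mean(samps$result)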
Now do the same thing for p = 1, 3, 10, 20, 37 and 38. Are
the results consistent with the theory that, on average, R²
should be p/(n − 1)?
Enter all your results in the table:
p            1    3    10    20    37    38
p/(n − 1)
Mean R²
Note that for p = 38 all of the trials produced the same
R² value. Explain why.
Prob 14.23 Suppose that you have worked hard to carry
out an experiment about baldness that involves several different
treatments: rubbing various herbs and spices on the scalp,
playing hairy music, drinking kiwi juice, laying fresh moss on
the head, eating peaches. Then, you measure the density of
hair follicles after 2 months.
You plot out your data:
[Figure: Follicle Count (per cm²) plotted for the Sage, Kiwi, Music, Moss, and Peaches treatment groups.]
It looks like the music treatment produced the lowest number
of follicles, while moss produced the highest. (Kiwi is a
close second, but the range is so large that you aren’t sure
about it.)
You decide to extract the Music and Moss groups and
build a model of follicle density versus treatment. (This sort
of test, comparing the means of two groups, is called a t-test.)
dd = subset(d, group=='Music' | group=="Moss")
summary(lm(val ~ group, data=dd))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.6316 3.5882 26.09 0.0000
groupMoss 11.1991 5.0745 2.21 0.0405
Aha! p = 0.04. You have evidence that such treatments
can have an effect.
Or do you? Admittedly, a p-value of 0.04 will not be terribly
compelling to someone who doubts that music and moss
affect follicle density in different ways. But it is below the
conventional threshold of 0.05. Consider some of the things
that could be wrong:
? What’s the null hypothesis in the hypothesis test presented
above?
A That Moss and Music are no different than
a Control group.
B That Moss and Music are the same as each
other.
C That Moss and Music have no effect.
Assuming that the null hypothesis is correct, what’s the
probability of seeing a p-value as small or smaller than
the one actually observed?
0 0.01 0.04 0.05 0.10 0.50 0.94 0.99
You chose the two groups to compare based on the
graph. There were many other possible pairs of groups:
Sage vs Peaches, Kiwi vs Moss, etc. Altogether, there
are 5 × 4/2 = 10 possible pairs of groups. (In general,
when there are k different groups, there are k(k − 1)/2 possible
pairs of groups. So if k = 10 there are 45 different
possible pairs.)
The Bonferroni correction calls for the p-value to be adjusted
based on the number of comparisons done. There
are two, equivalent ways to do it. One way is to divide
the conventional threshold by the number of comparisons
and use the new threshold. So for this test, with
10 pairs of groups, the threshold for rejecting the null
would be 0.05/10 = 0.005. The other way is just to
multiply the p-value by the number of comparisons and
then compare this to the standard threshold.
Using the "multiply by" approach to the Bonferroni correction,
what is the p-value on the Music-vs-Moss
comparison?
0.005 0.01 0.04 0.05 0.20 0.40 0.80
Prob 14.24 This exercise concerns assessing the appropriateness
of the Bonferroni correction in a setting where you are
choosing one pair out of several possible groups to compare
to each other. The exercise is based on a simulation. You
will need to install additional software to run the simulation,
which you can do with this command:
fetchData("simulate.r")
Recall that the Bonferroni correction relates to a situation
where you are conducting k hypothesis tests and choosing the
p-value that is most favorable. For example, you might have
tested several different treatments: B, C, D, ... against a
control, A. An example is shown in the figure.
[Figure: the variable val plotted for groups A through H; the vertical scale runs from about 70 to 130.]
Group E seems to stand out as different from the control,
so you decide to focus your attention on comparing the control
A to group E.
To generate the data in the graph, use this command:
d = run.sim(bogus.groups, ngroups=8, seed=230)
When you do this, your results should match these exactly
(subject only to round-off):
head(d)
group val
1 A 96.47747
2 A 101.51901
3 A 104.93023
4 A 105.42620
5 A 106.63995
6 B 100.61215
tail(d)
group val
35 G 96.05322
36 H 86.91363
37 H 92.46628
38 H 109.58906
39 H 126.52284
40 H 136.98016
As you might guess from the name of the simulation,
bogus.groups, the data are being sampled from a population
in which the assigned group is unrelated to the quantitative
variable val. That is, the null hypothesis is true.
The point of the seed=230 statement is to make sure that
exactly the same “random” numbers are generated as in the
example.
Using the data, extract the subset of data for groups A
and E, and carry out a hypothesis test on group using the
model val ~ group. (Hint: in the subset command, use the
selector group=="A" | group=="E" to select cases
that are from either group A or group E. Then apply lm to
this subset of data.)
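A sketch of those steps:

> dd = subset( d, group=="A" | group=="E" )
> summary( lm( val ~ group, data=dd ) )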
What is the p-value that compares groups A to E?
0.011 0.032 0.057 0.128 0.193 0.253
As a convenience, you are provided with a special-purpose
function, compare.many.groups, which does each of the hypothesis
tests comparing the control group to another group.
You can run it this way:
compare.many.groups( d, control="A")
group1 group2 diff pvalue
1 A B -3.456261 0.37730300
2 A C -2.203797 0.78987416
3 A D -1.047166 0.79541252
4 A E -21.580974 0.01101977
5 A F -7.841517 0.03687201
6 A G 1.361413 0.74935216
7 A H 7.495821 0.46485192
For the simulated data, your results should match exactly
(subject to round-off). Confirm that the p-value from this
report matches the one you got by hand above.
• As it happens, the p-value for group F is also below the 0.05 threshold. What is that p-value?
0.021 0.031 0.037 0.042 0.049
The Bonferroni correction takes the “raw” p-value and corrects
it by multiplying by the number of comparisons. Since
there are 7 comparisons altogether (groups B, C, ..., H against
group A), multiply the p-value by 7.
• What is the Bonferroni-corrected p-value for the A vs E comparison?
0.011 0.037 0.077 0.11 0.37 0.77
Now you are going to do a simulation to test whether the
Bonferroni correction is appropriate. The idea is to generate
data from a simulation that implements the null hypothesis,
calculate the p-value for the “best looking” comparison between
the control group A and one of the other groups, and
see how often the p-value is less than 0.05. It will turn out
that the p-value on the best looking comparison will be below
0.05 much too often. (If the p-value is to be taken at
face value, when the null hypothesis is true then p < 0.05
should happen only 5% of the time.) Then you’ll apply the
Bonferroni correction and see if this fixes things.
Because this is a somewhat elaborate computation, the
following leads you through the development of the R statement
that will generate the results needed. Each step is a
slight elaboration on the previous one.
1. Generate one realization from the bogus.groups simulation
and find the A vs other group p-values. In this
example, you will use seed=111 in the first few steps
so that you can compare your output directly to what’s
printed here. This can be done in two statements, like
this:
d = run.sim( bogus.groups, ngroups=8,seed=111 )
compare.many.groups( d, control="A")
group1 group2 diff pvalue
1 A B -2.7882420 0.7311174
2 A C 16.0119200 0.1113634
3 A D 0.1178478 0.9893143
4 A E 11.8370894 0.1204709
5 A F 9.7352535 0.4986335
6 A G -10.9812636 0.4322559
7 A H 6.9922567 0.4669595
2. It’s helpful to combine the two statements above into a
single statement. That will make it easier to automate
later on. To do this, just put the statement that created
d in place of d, as in the following. (The statement is
so long that it’s spread over 3 lines. You can cut it out
of this page and into an R session, if you wish.)
compare.many.groups(
run.sim( bogus.groups, ngroups=8,seed=111 ),
control="A")
group1 group2 diff pvalue
1 A B -2.7882420 0.7311174
2 A C 16.0119200 0.1113634
3 A D 0.1178478 0.9893143
4 A E 11.8370894 0.1204709
5 A F 9.7352535 0.4986335
6 A G -10.9812636 0.4322559
7 A H 6.9922567 0.4669595
It happens that none of the p-values from this simulation
were < 0.05.
3. Extend the statement above to extract the smallest p-value.
This involves prefacing the statement with min(
and following it with $pvalue)
min(compare.many.groups(
run.sim( bogus.groups, ngroups=8,seed=111 ),
control="A")$pvalue)
[1] 0.1113634
4. Now, run this statement a dozen times, using do(12)*.
Assign the results to an object s.
s = do(12)*min(compare.many.groups(
run.sim( bogus.groups, ngroups=8,seed=111 ),
control="A")$pvalue)
s
result
1 0.1113634
2 0.1113634
3 0.1113634
4 0.1113634
5 0.1113634
6 0.1113634
7 0.1113634
8 0.1113634
9 0.1113634
10 0.1113634
11 0.1113634
12 0.1113634
Don’t be too surprised by the result. You got exactly
the same answer on each trial because the same random
“seed” was used every time.
5. Delete the random seed, so that new random numbers
will be generated on each trial.
s = do(12)*min(compare.many.groups(
run.sim( bogus.groups, ngroups=8 ),
control="A")$pvalue)
s
result
1 0.20460170
2 0.05247199
3 0.01378853
4 0.03755064
5 0.03587157
6 0.26104192
7 0.13371236
8 0.08099884
9 0.03397622
10 0.16773459
11 0.43178937
12 0.16676447
6. The point of running a dozen trials was just to produce
a manageable amount of output for you to read. Now
generate a thousand trials, storing the results in s.
It will take about a minute for the computer to complete
the calculation, more on older computers.
s = do(1000)*min(compare.many.groups(
run.sim( bogus.groups, ngroups=8 ),
control="A")$pvalue)
Then answer the following questions:
(a) About what fraction of the “raw” (that is, uncorrected)
p-values are below 0.05? (Pick the closest
answer.)
0.01 0.02 0.05 0.10 0.20 0.30 0.40 0.50
(b) Now calculate the Bonferroni correction. Remember
that there were 7 comparisons done in each
trial, of which the one resulting in the smallest p-value
was selected. What fraction of trials led to
p < .05? (Pick the closest answer.)
0.01 0.02 0.05 0.10 0.20 0.30 0.40 0.50
Prob 14.25 You and your statistics-class friends are having
a disagreement. Does the result of a hypothesis test depend
on the units in which quantities are measured? For example,
in the kids’ feet data, kidsfeet.csv the length and width are
measured in cm. Would it matter if they had been measured
in inches or meters?
Do a numerical experiment to figure out whether the units
make a difference. Explain your strategy concisely and give
the corresponding commands and the results here.
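One possible strategy (the converted variable names and the 2.54 cm-per-inch factor are our own; whether the reports agree is for you to check):

> kids = fetchData("kidsfeet.csv")
> kids = transform( kids, length.in = length/2.54, width.in = width/2.54 )
> anova( lm( width ~ length + sex, data=kids ) )        # original units (cm)
> anova( lm( width.in ~ length.in + sex, data=kids ) )  # converted units (inches)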
Chapter 15
Hypothesis Testing on Parts of Models
Reading Questions
• What is a "covariate" and how does it differ from any other kind of variable?
• Why is there a separate F statistic for each explanatory term in a model?
• How can covariates make an explanatory term look better (e.g., more significant) in an F test?
• How can covariates make an explanatory term look worse (e.g., less significant) in an F test?
• Why does the sum of squares of the various model terms change when there is collinearity among the model terms?
Prob 15.01 Often we are interested in whether two groups
are different. For example, we might ask if girls have a different
mean footlength than do boys. We can answer this
question by constructing a suitable model.
> kids = fetchData("kidsfeet.csv")
> summary( lm( length ~ sex, data=kids ) )
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.1050 0.2847 88.180 <2e-16
sexG -0.7839 0.4079 -1.922 0.0623
Interpret this report, keeping in mind that the foot length
is reported in centimeters. (The reported value <2e-16 means
p < 2 × 10⁻¹⁶.)
1. What is the point estimate of the difference between the
lengths of boys' and girls' feet?
A Girls’ feet are, on average, 25 centimeters
long.
B Girls’ feet are 0.4079 cm shorter than boys’.
C Girls’ feet are 0.7839 cm shorter than boys’.
D Girls’ feet are 1.922 cm shorter than boys’.
2. The confidence interval can be written as a point estimate
plus-or-minus a margin of error: P ± M. What is
the 95% margin of error, M, on the difference between
boys' and girls' foot lengths?
-0.78 0.28 0.41 0.60 0.80
3. What is the Null Hypothesis being tested by the reported
p-value 0.0623?
A Boys’ feet are, on average, longer than girls’
feet.
B Girls’ feet are, on average, shorter than
boys’ feet.
C All boys’ feet are longer than all girls’ feet.
D No girl’s foot is shorter than all boys’ feet.
E There is no difference, on average, between
boys’ footlengths and girls’ footlengths.
4. What is the Null Hypothesis being tested by the p-value
on the intercept?
A Boys’ and girls’ feet are, on average, the
same length
B The length of kids’ feet is, on average, zero.
C The length of boys’ feet is, on average, zero.
D The length of girls’ feet is, on average, zero.
E Girls’ and boys’ feet don’t intercept.
Here is the report from a related, but slightly different
model:
> summary( lm( length~sex-1, data=kids ))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
sexB 25.1050 0.2847 88.18 <2e-16
sexG 24.3211 0.2921 83.26 <2e-16
Note that the p-values for both coefficients are practically
zero, p < 2 × 10⁻¹⁶.
What is the Null Hypothesis tested by the p-value on
sexG?
A Girls’ feet have a different length, on average,
than boys’.
B Girls’ feet are no different in length, on average,
than boys’.
C Girls’ footlengths are, on average, zero.
D Girls’ footlengths are, on average, greater
than zero.
Prob 15.02 Here is an ANOVA table (with the “intercept”
term included) from a fictional study of scores assigned to various
flavors, textures, densities, and chunkiness of ice cream.
Some of the values in the table have been left out. Figure out
from the rest of the table what they should be.
Df Sum-Sq Mean-Sq F-value p-value
(intercept) 1 _A_ 200 _B_ _C_
flavor 8 640 80 _D_ 0.134
density _E_ 100 100 2 0.160
fat-content 1 300 _F_ 6 0.015
chunky 1 200 200 4 0.048
Residuals 100 5000 50
(a) The value of A:
1 2 100 200 400 600
(b) The value of B:
1 2 3 4 5 6 7 8 10 20 200
(c) The value of C: (Hint: There’s enough information in the
table to find this.)
0.00 0.015 0.030 0.048 0.096 0.134 0.160 0.320 0.480
(d) The value of D:
0.0 0.8 1.6 3.2 4.8
(e) The value of E:
0 1 2 3 4 5 6
(f) The value of F:
100 200 300 400 500
(g) How many cases are involved altogether?
50 100 111 112 200 5000
(h) How many different flavors were tested?
1 3 5 8 9 10 12 100
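As a reminder of the mechanics, each p-value in such a table comes from the F distribution with the term's degrees of freedom and the residual degrees of freedom. For example, to check the p-value in the chunky row (F = 4, with 1 and 100 degrees of freedom):

> 1 - pf( 4, df1=1, df2=100 )   # about 0.048, matching the chunky row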
Prob 15.04 Consider the following analysis of the kids’
feet data looking for a relationship between foot width and
whether the child is left or right handed. The variable
domhand gives the handedness, either L or R. We’ll construct
the model in two different ways. There are 39 cases altogether.
> anova( lm(width ~ domhand, data=kids))
Response: width
Df Sum Sq Mean Sq F value Pr(>F)
(Intercept) 1 3153.60 3153.60 12228.0064 <2e-16 ***
domhand 1 0.33 0.33 1.2617 0.2686
Residuals 37 9.54 0.26
> anova( lm(width ~ domhand - 1, data=kids))
Response: width
Df Sum Sq Mean Sq F value Pr(>F)
domhand 2 3153.93 1576.96 6114.6 < 2.2e-16 ***
Residuals 37 9.54 0.26
• Explain why, in the first case, the p-value is not significant, but in the second case it is.
• Why does domhand have 1 degree of freedom in the first ANOVA report, but 2 degrees of freedom in the second?
Prob 15.05 A statistics student wrote:
I’m interested in the publishing business, particularly
magazines, and thought I would try a statistical analysis of
some of the features of magazines. I looked at several different
magazines, and recorded several variables, some of which I
could measure from a single copy and some of which I deduced
from my knowledge of the publishing business.
magazine a number to identify the magazine
pages the number of pages in a typical issue
color the number of pages with a color picture
age the age group of the target audience
sex the sex of the intended audience
sentenceLength the average number of words in a sentence
in the articles in the magazine.
Most people find it hard to believe, but most mass-market
magazines are very deliberately written and composed graphically
to be attractive to the target audience. The distinctive
"styles" of magazines are no accident.
I was interested to see if there is a relation between the average
sentence length and any of the other variables. I made
one linear model and had a look at the ANOVA table, as
shown below.
Analysis of Variance Table
Response: sentenceLength
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 0.222 0.222 0.0717 0.80626
age 3 71.067 23.689 7.6407 ????
color 1 0.299 0.299 0.0964 0.77647
Residuals 3 9.301 3.100
Answer each question based on the information given above.
1. What model structure did I use to generate this table?
A sentenceLength ~ age + sex + color
B sentenceLength ~ sex * age + color
C sentenceLength ~ sex + age + color
D color ~ sentenceLength + sex + age
2. How many cases are there altogether?
A 8
B 9
C 10
D No way to tell from the information given.
3. Is the variable age categorical or quantitative?
A categorical
B quantitative
C Could be either.
D Can’t know for sure from the data given.
4. The p-value for age is missing. What should this value
be?
A 0.93553 from pf(7.6407, 3, 3)
B 0.06446 from 1-pf(7.6407, 3, 3)
C 0.98633 from pf(23.689, 3, 3)
D 0.01367 from 1-pf(23.689, 3, 3)
E 0.99902 from pnorm(23.689, 0, 7.6507)
F 0.00098 from 1-pnorm(23.689, 0,
7.6507)
5. Based on the ANOVA table, what null hypothesis can be
rejected in this study at the significance level p < 0.10?
A An average sentence has zero words.
B There is no relationship between the number
of color pages and the sex of the intended
audience.
C The number of color pages is not related to
the sentence length.
D There is no relation between the average
number of words per sentence in an article
and the age group that the magazine is
intended for, after taking sex into account.
E None of the above, because there is a different
null hypothesis corresponding to each
model term in the ANOVA report.
6. I really wanted to look to see if there is an interaction
between sex and age. What would be the point of including
this term in the model?
A To see if the different sexes have a different
distribution of age groups.
B To see if there is a difference in average sentence
length between magazines for females
and males.
C To see if magazines for different age groups
are targeted to different sexes.
D To see if the difference in average sentence
length between magazines for females and
males changes from one age group to another.
7. I tried to include an interaction between sex and age
into the model. This didn’t work out. Just using the
information in the printed Anova report, judge what
might have gone wrong.
A The term was included as the last term in
the ANOVA report and didn’t have a significant
sum of squares.
B I discovered that sex and age were redundant.
C The p-values disappeared from the report.
D None of the above.
Prob 15.10 P-values concern the “statistical significance”
of evidence for a relationship. This can be a different thing
from the real-world importance of the observed relationship.
It’s possible for a weak connection to be strongly statistically
significant (if there is a lot of data for it) and for a strong relationship
to lack statistical significance (if there is not much
data).
Consider the data on the times it took runners to complete
the Cherry Blossom ten-mile race in March 2005:
> run = fetchData("ten-mile-race.csv")
> names(run)
[1] "state" "time" "net" "age" "sex"
Consider the net variable, which gives the time it took the
runners to get from the start line to the finish line.
Answer each of the following questions, giving both a
quantitative argument and also an everyday English explanation.
Assessing statistical significance is a technical matter,
but to interpret the substance of a relationship, you will have
to put it in a real-world context.
1. What is the relationship between net running time and
the runner’s age? Is the relationship significant? Is it
substantial?
2. What is the relationship between net running time and
the runner’s sex? Is the relationship significant? Is it
substantial?
3. Is there an interaction between sex and age? Is the relationship
significant? Is it substantial?
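A sketch of models you might fit for these three questions (the model choices are ours, using the run data frame loaded above):

> summary( lm( net ~ age, data=run ) )        # question 1
> summary( lm( net ~ sex, data=run ) )        # question 2
> summary( lm( net ~ age * sex, data=run ) )  # question 3, with the interaction term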
Prob 15.11 You are conducting an experiment of a treatment
for balding. You measure the hair follicle density before
treatment and again after treatment. The data table has the
following variables (with a few examples shown):
Subject.ID follicle.density when sex
A59 7.3 before M
A59 7.9 after M
A60 61.2 before F
A60 61.4 after F
and so on, 100 entries altogether
1. Here is an ANOVA table for a model of these data:
> anova(lm(follicle.density~when,data=hair))
Df Sum Sq Mean Sq F value p
when 1 33.7 33.7 0.157 0.693
Residuals 98 21077.5 215.1
Does this table suggest that the treatment makes a difference?
Why or why not?
2. Here’s another ANOVA table
> anova(lm(follicle.density~when+Subject.ID,
+ data=hair))
Df Sum Sq Mean Sq F value p
when 1 33.7 33.7 14.9 0.0002
Subject.ID 49 20858.6 425.7 185.0 zero
Residuals 97 218.9 2.3
Why is the F-value on when different in this model than
in the previous one?
3. What overall conclusion do you draw about the effectiveness
of the treatment? Is the effect of the treatment
statistically significant? Is it significant in practice?
Prob 15.12 During a conversation about college admissions,
a group of high-school students starts to wonder how
reliable the SAT score is, that is, how much an individual
student’s score could be expected vary just do to random factors
such as the day on which the test was taken, the student’s
mood, the specific questions on the test, etc. This variation
within an individual is quite different from the variation from
person to person.
The high-school students decide to study the issue. A
simple way to do this is to have one student take the SAT
test several times and examine the variability in the student’s
scores. But, it would be better to do this for many different
students. To this end, the students propose to pool together
their scores from the times they took the SAT, producing a
data frame that looks like this:
Student Score Sex Order
PersonA 2110 F 1
PersonB 1950 M 1
PersonC 2080 F 1
PersonA 2090 F 2
PersonA 2150 F 3
... and so on
The order variable indicates how many times the student
has taken the test. 1 means that it is the student’s first time,
2 the second time, and so on.
One student suggests that they simply take the standard
deviation of the score variable to measure the variability in
the SAT score. What’s wrong with this for the purpose the
students have in mind?
A There’s nothing wrong with it.
B Standard deviations don’t measure random
variability.
C It would confound variability between students
with variability within students.
Another student suggests looking at the sum of square
residuals from the model score ~ student. What’s wrong with
this:
A There’s nothing wrong with it.
B It’s the coefficients on student that are important.
C Better to look at the mean square residual.
The students’ statistics teacher points out that the model
score ~ student will exactly capture the score of any student
who takes the SAT only once; the residuals for those students
will be exactly zero. Explain why this isn’t a problem, given
the purpose for which the model is being constructed.
Still another student suggests the model score ~
student + order in order to adjust for the possibility that
scores change with experience, and not just at random. The
group likes this idea and starts to elaborate on it. They make
two main suggestions:
• Elaboration 1: score ~ student + order + sex
• Elaboration 2: score ~ student + order + student:order
Why not include sex as an additional covariate, as in Elaboration
1, to take into account the possibility that males and
females might have systematically different scores?
A It’s a good idea.
B Bad idea since probably a person’s sex has
nothing to do with his or her score.
C Useless, since sex is redundant with student.
Regarding Elaboration 2, which of the following statements
is correct?
1. True or False It allows the model to capture how the
change of score with experience itself might be different
from one person to another.
2. True or False It assumes that all the students in the
data frame are taking the test multiple times.
3. True or False With the interaction term in place,
the model would capture the exact scores of all students
who took the SAT just once or twice, so the mean
square residual would reflect only those students who
took the SAT three times or more.
Prob 15.21 In conducting a hypothesis test, we need to
specify two things:
• A Null Hypothesis
• A Test Statistic
The numerical output of a hypothesis test is a p-value.
In modeling, a sensible Null Hypothesis is that one or more
explanatory variables are unrelated to the response variable.
We can simulate a situation in which this Null applies by
shuffling the variables. For example, here are two trials of a
simulation of the Null in a model of the kidsfeet data:
> kids = fetchData("kidsfeet.csv")
> lm( width ~ length + shuffle(sex), data=kids)
Coefficients:
(Intercept) length shuffle(sex)G
3.14406 0.23828 -0.07585
> lm( width ~ length + shuffle(sex), data=kids)
Coefficients:
(Intercept) length shuffle(sex)G
2.74975 0.25106 0.08668
The test statistic summarizes the situation. There are several
possibilities, but here we will use R² from the model, since
this gives an indication of the quality of the model.
> r.squared(lm( width ~ length + shuffle(sex), data=kids))
[1] 0.4572837
> r.squared(lm( width ~ length + shuffle(sex), data=kids))
[1] 0.4175377
> r.squared(lm( width ~ length + shuffle(sex), data=kids))
[1] 0.4148968
By computing many such trials, we construct the sampling
distribution under the Null — that is, the sampling distribution
of the test statistic in the world in which the Null holds
true. We can automate this process using do:
> samps = do(1000) *
+ r.squared(lm( width ~ length + shuffle(sex), data=kids))
Finally, to compute the p-value, we need to compute the
test statistic on the model fitted to the actual data, not on
the simulation.
> r.squared( lm( width ~ length + sex, data=kids))
[1] 0.4595428
The p-value is the probability of seeing a value of the test
statistic from the Null Hypothesis simulation that is more extreme
than our actual value. The meaning of "more extreme"
depends on what the test statistic is. In this example, since a
better fitting model will always have a larger R², we check the
probability of getting a larger R² from our simulation
than from the actual data.
> table( samps >= 0.4595428)
FALSE TRUE
912 88
Our p-value is about 9%.
Here are various computer modeling statements that implement
possible Null Hypotheses. Connect each computer
statement to the corresponding Null.
1. lm(width ~ length + shuffle(sex),data=kids)
2. lm(width ~ shuffle(length) + shuffle(sex), data=kids)
3. lm(width ~ shuffle(length), data=kids)
4. lm(width ~ shuffle(sex), data=kids)
5. lm(width ~ length + sex, data=shuffle(kids))
• Foot width is unrelated to foot length or to sex.
1 2 3 4 5
• Foot width is unrelated to sex, but it is related to foot length.
1 2 3 4 5
• Foot width is unrelated to sex, and we won't consider any possible relationship to foot length.
1 2 3 4 5
• Foot width is unrelated to foot length, and we won't consider any possible relationship to sex.
1 2 3 4 5
• This isn't a hypothesis test; the randomization won't change anything from the original data.
1 2 3 4 5
Prob 15.22 I’m interested in studying the length of gestation
as a function of the ages of the mother and the father.
In the gestation data set (gestation.csv), the variable age
records the mother’s age in years, and dage gives the father’s
age in years. The variable gestation is the length of the gestation
in days. I hypothesize that the older the mother and
father, the shorter the gestational period. So, I fit a model to
those 599 cases where all the relevant data were recorded:
> summary(lm( gestation ~ age+dage, data=b))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 282.62201 3.25821 86.741 <2e-16
age -0.21313 0.19947 -1.068 0.286
dage 0.06085 0.17372 0.350 0.726
1. Describe in everyday language the relationship between
age and gestation indicated by this model.
2. I note that the two p-values are nothing great. But I
wonder whether, if I treated mother’s and father’s age together
— lumping them into a single term with
two degrees of freedom — I might get something significant.
Using the ANOVA reports given below, explain
how you might come up with a single p-value summarizing
the joint contribution of mother’s and father’s age.
Insofar as you can, try to calculate the p-value itself.
> anova( lm(gestation ~ age+dage, data=b))
Df Sum Sq Mean Sq F value Pr(>F)
age 1 486 486 1.9091 0.1676
dage 1 31 31 0.1227 0.7262
Residuals 596 151758 255
> anova( lm( gestation ~ dage+age, data=b))
Df Sum Sq Mean Sq F value Pr(>F)
dage 1 227 227 0.8903 0.3458
age 1 291 291 1.1416 0.2858
Residuals 596 151758 255
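One route to a single joint p-value (a sketch of ours, not part of
the original problem, assuming the data frame is named b as in the
regression above) is to compare the constant model against the model
with both ages, or to assemble the F statistic by hand from either
ANOVA table, since the two terms’ sums of squares total about 517 on
2 degrees of freedom regardless of the order of the terms:
> anova(lm(gestation ~ 1, data=b),
+       lm(gestation ~ age + dage, data=b))  # joint F test on 2 df
> Fstat = ((486 + 31)/2) / 255  # joint mean square over residual mean square
> 1 - pf(Fstat, 2, 596)         # p-value, roughly 0.36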
Chapter 16
Models of Yes/No Variables
Reading Questions
• What’s the difference between a “link value” and a
“probability value” in a logistic regression model? How
are they related to one another?
• How does the logistic function serve to keep the fitted
probability values always within the range 0 to 1?
• What is maximum likelihood and how is it used as a
criterion to fit a model?
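One of the reading questions asks how the logistic function keeps
fitted probabilities between 0 and 1. A one-line plot (our
illustration, not part of any problem) makes the answer visible:
the curve flattens toward 0 on the left and toward 1 on the right,
never crossing either bound.
> curve(exp(x)/(1 + exp(x)), from=-5, to=5,
+       xlab="link value", ylab="probability value")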
Prob 16.01 The graphs show the link values and the corresponding
probability values for a logistic model where x is
the explanatory variable.
[Figure: two panels over the range −2 ≤ x ≤ 2 — “Link Values,”
showing the link value Y as a function of x, and “Probability
Values,” showing the corresponding probability P, from 0 to 1,
as a function of x.]
Use the graphs to look up answers to the following. Choose
the closest possibility to what you see in the graphs.
At what value of x is the link value 0?
-2 -1 0 1 2
What probability corresponds to a link of 0?
0.0 0.1 0.5 0.9 1.0
At what value of x is the link value 1?
-1.50 -0.75 0.00 1.25 1.75
What probability corresponds to a link of 1?
0.0 0.25 0.50 0.75 1.00
What probability corresponds to a link of −1?
0.0 0.25 0.50 0.75 1.00
What probability corresponds to a link of ∞? (This
isn’t on the graph.)
0.0 0.25 0.50 0.75 1.00
What probability corresponds to a link of −∞? (This
isn’t on the graph.)
0.0 0.25 0.50 0.75 1.00
Prob 16.02 The NASA space shuttle Challenger had a
catastrophic accident during launch on January 28, 1986.
Photographic evidence from the launch showed that the accident
resulted from a plume of hot flame from the side of
one of the booster rockets which cut into the main fuel tank.
US President Reagan appointed a commission to investigate
the accident. The commission concluded that the jet of flame
was due to the failure of an O-ring gasket between segments
of the booster rocket.
[A NASA photograph showing the plume of flame from the side
of the booster rocket during the Challenger launch.]
An important issue for the commission was whether the
accident was avoidable. Attention focused on the fact that
the ground temperature at the time of launch was 31°F, much
lower than for any previous launch. Commission member and
Nobel laureate physicist Richard Feynman famously demonstrated,
using a glass of ice water and a C-clamp, that the
O-rings were very inflexible when cold. But did the data
available to NASA before the launch indicate a high risk
of an O-ring failure?
Here is the information available at the time of Challenger’s
launch from the previous shuttle launches:
Flight Temp Damage Flight Temp Damage
STS-1 66 no STS-2 70 yes
STS-3 69 no STS-4 80 NA
STS-5 68 no STS-6 67 no
STS-7 72 no STS-8 73 no
STS-9 70 no STS 41-B 57 yes
STS 41-C 63 yes STS 41-D 70 yes
STS 41-G 78 no STS 51-A 67 no
STS 51-B 75 no STS 51-C 53 yes
STS 51-D 67 no STS 51-F 81 no
STS 51-G 70 no STS 51-I 76 no
STS 51-J 79 no STS 61-A 75 yes
STS 61-B 76 no STS 61-C 58 yes
Using these data, you can fit a logistic model to estimate
the probability of failure at any temperature.
> mod = glm(Damage ~ Temp, family="binomial")
> summary(mod)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.0429 7.3786 2.039 0.0415
Temp -0.2322 0.1082 -2.145 0.0320
Use the coefficients to find the link value for these launch
temperatures:
70F (a typical launch temperature)
-2.4 -1.2 1.6 2.7 4.3 7.8 9.4
53F (the previous low temperature)
-2.4 -1.2 1.6 2.7 4.3 7.8 9.4
31F (the Challenger temperature)
-2.4 -1.2 1.6 2.7 4.3 7.8 9.4
Convert the link value to a probability value for the launch
temperatures:
70F
0.08 0.23 0.83 0.94 0.985 0.9996 0.9999
53F
0.08 0.23 0.83 0.94 0.985 0.9996 0.9999
31F
0.08 0.23 0.83 0.94 0.985 0.9996 0.9999
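Both steps can be scripted; here is a sketch of ours using the
coefficient values reported in the summary above:
> b0 = 15.0429; b1 = -0.2322    # coefficients from the summary above
> link = b0 + b1*c(70, 53, 31)  # link values at the three temperatures
> exp(link)/(1 + exp(link))     # the corresponding probability values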
A more complete analysis of the situation would take
into account the fact that there are multiple O-rings in each
booster, while the Damage variable describes whether any
O-ring failed. In addition, there were two O-rings on each
booster segment, both of which would have to fail to create a
leakage problem. Thus, the probabilities estimated from this
model and these data do not accurately reflect the probability
of a catastrophic accident.
Prob 16.04 George believes in astrology and wants to check
whether a person’s sign influences whether they are left- or
right-handed. With great effort, he collects data on 100 people,
recording their dominant hand and their astrological sign.
He builds a logistic model hand ~ sign. The deviance from
the model hand ~ 1 is 102.8 on 99 degrees of freedom. Including
the sign term in the model reduces the deviance to
63.8 on 88 degrees of freedom.
The sign term only reduced the degrees of freedom by 11
(that is, from 99 to 88) even though there are 12 astrological
signs. Why?
A There must have been one sign not represented
among the 100 people in George’s
sample.
B sign is redundant with the intercept and so
one level is lost.
C hand uses up one degree of freedom.
According to theory, if sign were unrelated to hand, the
11 degrees of freedom ought to reduce the deviance by how
much, on average?
A 11/99 × 102.8
B 1/11 × 102.8
C to zero
D None of the above.
Prob 16.05 This problem traces through some of the steps
in fitting a model of a yes/no process. For specificity, pretend
that the data are from observations of a random sample of
teenaged drivers. The response variable is whether or not the
driver was in an accident during one year (birthday to birthday).
The explanatory variables are sex and age of the driver.
The model being fit is accident ~ 1 + age + sex.
Here is a very small, fictitious set of data.
Case Age Sex Accident?
1 17 F Yes
2 17 M No
3 18 M Yes
4 19 F No
Even if it weren’t fictitious, it would be too small for any
practical purpose. But it will serve to illustrate the principles
of fitting.
In fitting the model, the computer compares the likelihoods
of various candidate values for the coefficients, choosing
those coefficients that maximize the likelihood of the model.
Consider these two different candidate coefficients:
Candidate A Coefficients
Intercept age sexF
35 -2 -1
Candidate B Coefficients
Intercept age sexF
35 -2 0
The link value is found by multiplying the coefficients by
the values of the explanatory variables in the usual way.
• Using the candidate A coefficients, what is the link value
for case 1?
A 35 − 2 × 17 − 0 = 1
B 35 − 2 × 17 − 1 = 0
C 35 − 2 × 18 − 1 = −2
D 35 − 2 × 19 − 1 = −4
• Using the candidate B coefficients, what is the link value
for case 3?
A 35 − 2 × 18 − 0 = −1
B 35 − 2 × 18 − 1 = −2
C 35 − 2 × 18 + 1 = 0
D 35 − 2 × 18 − 2 = −3
The link value is converted to a probability value by using
the logistic transform.
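(The logistic transform of a link value L is e^L/(1 + e^L). Base R
supplies this function directly as plogis(), so, as a check of ours
rather than part of the problem, a hand conversion can be verified
with one call:)
> plogis(-4)   # identical to exp(-4)/(1 + exp(-4))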
The link value under the candidate A coefficients
for case 4 is 35 − 2 × 19 − 1 = −4. What
is the corresponding probability value? (Hint:
plug the link value into the logistic transform!)
0.004 0.018 0.027 0.047 0.172 0.261
The link value under the candidate B coefficients
for case 4 is 35 − 2 × 19 − 0 = −3.
What is the corresponding probability value?
0.004 0.018 0.027 0.047 0.172 0.261
The probability value is converted to a likelihood by calculating
the probability of the observed outcome according to
the probability value. When the outcome is “Yes,” the likelihood
is just the same as the probability value. But when
the outcome is “No,” the likelihood is 1 minus the probability
value.
The link value for case 3 using the candidate A coefficients
is −1 and the corresponding probability value
is 0.269. What is the likelihood of the observed
value of case 3 under the candidate A coefficients?
0.000 0.269 0.500 0.731 1.000
The link value for case 2 using the candidate A coefficients
is 1 and the corresponding probability value
is 0.731. What is the likelihood of the observed
value of case 2 under the candidate A coefficients?
0.000 0.269 0.500 0.731 1.000
To compute the likelihood of the entire set of observations
under the candidate coefficients, multiply together the likelihoods
for all the cases. Do this calculation separately for
the candidate A coefficients and the candidate B coefficients.
Show your work, and say which of the two candidates gives the
bigger likelihood.
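To check the hand calculation, here is a minimal sketch (the
variable and function names are ours, not the text’s) that computes
the whole-data likelihood for any candidate coefficients:
> age = c(17, 17, 18, 19)
> sexF = c(1, 0, 0, 1)                 # indicator: 1 for F, 0 for M
> outcome = c("Yes", "No", "Yes", "No")
> lik = function(b0, bAge, bSexF) {
+   link = b0 + bAge*age + bSexF*sexF  # link value for each case
+   p = exp(link)/(1 + exp(link))      # probability of "Yes"
+   prod(ifelse(outcome == "Yes", p, 1 - p))  # product of case likelihoods
+ }
> lik(35, -2, -1)   # candidate A
> lik(35, -2, 0)    # candidate B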
In an actual fitting calculation, the computer goes through
large numbers of candidate coefficients in a systematic way to
find the candidate with the largest possible likelihood: the
maximum likelihood candidate. Explain why it makes sense
to choose the candidate with the maximum rather than the
minimum likelihood.
Prob 16.10 The National Osteoporosis Risk Assessment
(NORA)[?] studied about 200,000 postmenopausal women
aged 50 years or older in the United States. When entering
the study, 14,412 of these women had osteoporosis as defined
by a bone-mineral density “T score.” In studying the risk factors
for the development of osteoporosis, the researchers fit a
logistic regression model.
The coefficients in a logistic regression model can be directly
interpreted as the logarithm of an odds ratio — the “log
odds ratio.” In presenting results from logistic regression, it’s
common to exponentiate the coefficients, that is, to compute
e^coef, to produce a simple odds ratio.
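In R this takes one line on a fitted model. A sketch, assuming a
logistic-regression object mod like the one fitted in Prob 16.02:
> exp(coef(mod))     # coefficients as odds ratios
> exp(confint(mod))  # 95% confidence intervals on the odds-ratio scale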
The table below shows the coefficients in odds ratio form
from the NORA model. There were many explanatory variables
in the model: Age group, years since menopause, health
status, etc. All of these were arranged to be categorical variables,
so there is one coefficient for each level of each variable.
As always, one level of each variable serves as a reference
level. For instance, in the table below, the age group 50-54
is the reference level. In the table below, the odds ratio for
the reference level is always given as 1.00. The other odds
ratios are always with respect to this reference. So, women in
the 55-59 age group have odds of having osteoporosis that are
1.79 times bigger than women in the 50-54 age group. In contrast,
women who are 6-10 years since menopause have odds
of having osteoporosis that are 0.79 as big as women who are
≤ 5 years since menopause.
An odds ratio of 1 means that the group has the same
probability value as the reference group. Odds ratios bigger
than 1 mean the group is more likely to have osteoporosis
than the reference group; odds ratios smaller than 1 mean
the group is less likely to have the condition.
The 95% confidence interval on the odds ratio indicates
the precision of the estimate from the available data. When
the confidence interval for a coefficient includes 1.00, the null
hypothesis that the population odds ratio is 1 cannot be rejected
at a 0.05 significance level. For example, the odds ratio
for a self-rated health status level of “very good” is 1.04 compared
to those in “excellent” health. But the confidence interval,
0.97 to 1.13, includes 1.00, indicating that the evidence
is weak that women in very good health have a different risk
of developing osteoporosis compared to women in excellent
health.
For some variables, e.g., “college education or higher,” no
reference level is given. This is simply because the variable
has just two levels. The other level serves as the reference.
Age group (years) Odds Ratio (95% CI)
50-54 1.00 (Referent)
55-59 1.79 (1.56-2.06)
60-64 3.84 (3.37-4.37)
65-69 5.94 (5.24-6.74)
70-74 9.54 (8.42-10.81)
75-79 14.34 (12.64-16.26)
≥80 22.56 (19.82-25.67)
Years since menopause Odds Ratio (95% CI)
≤ 5 1.00 (Referent)
6-10 0.79 (0.70-0.89)
11-15 0.83 (0.76-0.91)
16-20 0.96 (0.89-1.03)
21-25 1.01 (0.95-1.08)
26-30 1.02 (0.95-1.09)
31-35 1.10 (1.03-1.19)
36-40 1.14 (1.05-1.24)
≥41 1.24 (1.14-1.35)
College educ or higher Odds Ratio (95% CI)
0.91 (0.87-0.94)
Self-rated health status Odds Ratio (95% CI)
Excellent 1.00 (Referent)
Very good 1.04 (0.97-1.13)
Good 1.23 (1.14-1.33)
Fair/poor 1.62 (1.50-1.76)
Fracture history Odds Ratio (95% CI)
Hip 1.96 (1.75-2.20)
Wrist 1.90 (1.77-2.03)
Spine 1.34 (1.17-1.54)
Rib 1.43 (1.32-1.56)
Maternal
history of osteoporosis Odds Ratio (95% CI)
1.08 (1.01-1.17)
Maternal
history of fracture Odds Ratio (95% CI)
1.16 (1.11-1.22)
Race/ethnicity Odds Ratio (95% CI)
White 1.00 (Referent)
African American 0.55 (0.48-0.62)
Native American 0.97 (0.82-1.14)
Hispanic 1.31 (1.19-1.44)
Asian 1.56 (1.32-1.85)
Body mass index, kg/m2 Odds Ratio (95% CI)
≤ 23 1.00 (Referent)
23.01-25.99 0.46 (0.44-0.48)
26.00-29.99 0.27 (0.26-0.28)
≥ 30 0.16 (0.15-0.17)
Current medication use Odds Ratio (95% CI)
Cortisone 1.63 (1.47-1.81)
Diuretics 0.81 (0.76-0.85)
Estrogen use
Odds Ratio (95% CI)
Former 0.77 (0.73-0.80)
Current 0.27 (0.25-0.28)
Cigarette smoking
Odds Ratio (95% CI)
Former 1.14 (1.10-1.19)
Current 1.58 (1.48-1.68)
Regular Exercise Odds Ratio (95% CI)
Regular 0.86 (0.82-0.89)
Alcohol use, drinks/wk Odds Ratio (95% CI)
None 1.00 (Referent)
1-6 0.85 (0.80-0.90)
7-13 0.76 (0.69-0.83)
≥ 14 0.62 (0.54-0.71)
Technology Odds Ratio (95% CI)
Heel x-ray 1.00 (Referent)
Forearm x-ray 2.86 (2.75-2.99)
Finger x-ray 4.86 (4.56-5.18)
Heel ultrasound 0.79 (0.70-0.90)
Since all the variables were included simultaneously in the
model, the various coefficients can be interpreted as indicating
partial change: the odds ratio comparing the given level
to the reference level for each variable, adjusting for all the
other variables as if they had been held constant.
• For which ethnicity are women least likely to have osteoporosis?
White African.American Native.American Hispanic Asian
Is regular exercise (compared to no regular exercise) associated
with a greater or lesser risk of having osteoporosis?
greater lesser same
Is current cigarette smoking (compared to never having
smoked) associated with a greater or lesser risk of
having osteoporosis? greater lesser same
The body mass index (BMI) is a measure of overweight.
For adults, a BMI greater than 25 is considered
overweight (although this is controversial) and a
BMI greater than 30 is considered “obese.” Are women
with BMI ≥ 30 (compared to those with BMI < 23) at
greater, lesser, or the same risk of having osteoporosis?
greater lesser same
There are different technologies for detecting osteoporosis.
Since the model adjusts for all the other risk factors,
it seems fair to interpret the risk ratios for the different
technologies as indicating how sensitive each technology
is in detecting osteoporosis.
Which technology is the most sensitive?
heel.x-ray forearm.x-ray finger.x-ray heel.ultrasound
Which technology is the least sensitive?
heel.x-ray forearm.x-ray finger.x-ray heel.ultrasound
In combining the odds ratios of multiple variables, you
can multiply the individual odds ratios. For instance,
the odds for a woman in very good health with a body
mass index of 24 are 1.04 × 0.46 as large as those for a woman in
excellent health with a BMI of < 23 (the reference levels
for the variables involved). (If log odds ratios were
used, rather than the odds ratios themselves, the values
would be added, not multiplied.)
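A quick check of the multiply-versus-add rule, using the numbers
from the example above:
> 1.04 * 0.46                  # combining odds ratios by multiplication
> exp(log(1.04) + log(0.46))   # the same result from adding log odds ratios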
What is the odds ratio of a woman having osteoporosis
who is in fair/poor health, drinks 7-13 drinks per week,
and is Asian?
A 0.76 × 0.27 × 1.00
B 1.62 × 0.27 × 1.56
C 1.62 × 0.76 × 1.56
D 0.76 × 1.62 × 1.00
The two variables “age” and “years since menopause” are
likely to be somewhat collinear. Explain why. What
effect might this collinearity have on the width of the
confidence intervals for the coefficients associated
with those variables? If you had to recommend removing
one of the variables from the list of potential risk
factors, which one would it be?
Notice that the table gives no intercept coefficient. The
intercept corresponds to the probability of having osteoporosis
when belonging to the reference level of each of
the explanatory variables. Without knowing this, you
cannot use the coefficients to calculate the absolute risk
of osteoporosis in the different conditions. Instead, the
odds ratios in the table tell about relative risk. Gigerenzer
[?, ?] points out that physicians and patients often
have difficulty interpreting relative risks and encourages
information to be presented in absolute terms.
To illustrate, in the group from whom the NORA subjects
were drawn, the absolute risk of osteoporosis was
72 in 1000 patients. This corresponds to odds of osteoporosis
of 72/(1000 − 72) = 0.0776. Now consider a
woman taking cortisone. According to the table, this
increases her odds of osteoporosis by a factor of 1.63, to
0.0776 × 1.63 = 0.126. Translating this back into an absolute
risk means converting from odds into probability.
The probability will be 0.126/(1 + 0.126) = 0.112, or, in
other words, an absolute risk of 112 in 1000 patients.
Now suppose the woman was taking cortisone to treat
arthritis. Knowing the absolute risk (an increase of 40
women per 1000) puts the woman and her physician in
a better position to compare the positive effects of cortisone
for arthritis to the negative effects in terms of
osteoporosis.
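The back-and-forth among absolute risk, odds, and probability is
easy to script. A minimal sketch of the cortisone example above:
> p0 = 72/1000          # absolute risk in the reference group
> odds0 = p0/(1 - p0)   # as odds: about 0.0776
> odds1 = 1.63 * odds0  # apply the cortisone odds ratio
> odds1/(1 + odds1)     # back to a probability: about 0.112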
[This problem is based on an item used in a test of the
statistical expertise of medical residents reported in [?].]
Prob 16.12 The concept of residuals does not cleanly apply
to yes/no models because the model value is a probability (of
a yes outcome), whereas the actual observation is the outcome
itself. It would be silly to try to compute a difference between
“yes” and a probability like 0.8. After all, what could it mean
to calculate (yes − 0.8)²?
In fitting ordinary linear models, the criterion used to select
the best coefficients for any given model design is “least
squares,” minimizing the sum of square residuals. The corresponding
criterion in fitting yes/no models (and many other
types of models) is “maximum likelihood.”
The word “likelihood” has a very specific and technical
meaning in statistics; it’s not just a synonym for “chance” or
“probability.” A likelihood is the probability of the outcome
according to a specific model.
To illustrate, here is an example of some yes-no observations
and the model values of two different models.
Model A Model B Observed
Case p(Yes) p(No) p(Yes) p(No) Outcome
1 0.7 0.3 0.4 0.6 Yes
2 0.6 0.4 0.8 0.2 No
3 0.1 0.9 0.3 0.7 No
4 0.5 0.5 0.9 0.1 Yes
Likelihood always refers to a given model, so there are two
likelihoods here: one for Model A and another for Model B.
The likelihood for each case under Model A is the probability
of the observed outcome according to the model. For example,
the likelihood under Model A for case 1 is 0.7, because
that is the model value of the observed outcome “Yes” for that
case. The likelihood of case 2 under Model A is 0.4 — that is
the probability of “No” for case 2 under model A.
What is the likelihood under Model A for case 3?
0.1 0.3 0.5 0.7 0.9
What is the likelihood under Model B for case 3?
0.1 0.3 0.5 0.7 0.9
What is the likelihood under Model A for case 4?
0.1 0.3 0.5 0.7 0.9
What is the likelihood under Model B for case 4?
0.1 0.3 0.5 0.7 0.9
The likelihood for the whole set of observations combines
the likelihoods of the individual cases: multiply them all together.
This is justified if the cases are independent of one
another, as is usually assumed and sensible if the cases are
the result of random sampling or random assignment to an
experimental treatment.
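Once the per-case likelihoods have been read off, the multiplication
is one line of R. A sketch of ours for Model A; the vectors below
transcribe the table, and Model B works the same way with its own
p(Yes) column:
> pYes = c(0.7, 0.6, 0.1, 0.5)                # Model A's p(Yes), cases 1-4
> obs  = c("Yes", "No", "No", "Yes")          # observed outcomes
> prod(ifelse(obs == "Yes", pYes, 1 - pYes))  # whole-data likelihood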
• What is the likelihood under Model A for the whole set
of cases?
A 0.3 × 0.4 × 0.9 × 0.5
B 0.7 × 0.6 × 0.9 × 0.5
C 0.3 × 0.4 × 0.1 × 0.5
D 0.7 × 0.4 × 0.9 × 0.5
E 0.7 × 0.4 × 0.1 × 0.5
• What is the likelihood under Model B for the whole set
of cases?
A 0.4 × 0.8 × 0.3 × 0.9
B 0.4 × 0.2 × 0.3 × 0.9
C 0.6 × 0.2 × 0.3 × 0.9
D 0.4 × 0.2 × 0.7 × 0.9
E 0.4 × 0.2 × 0.3 × 0.1
Chapter 17
Causation
Reading Questions
1. Why does an observed correlation between two variables
not provide compelling evidence that one causes
the other?
2. What is a hypothetical causal network? What does the
word “hypothetical” signify about these networks?
3. What is the difference between a correlating pathway
and a non-correlating pathway?
4. How is the appropriate choice of covariates in models
to study causation influenced by the structure of the
modeler’s hypothetical causal network?
5. When might two modelers legitimately disagree about
which covariates to include in a model?
Chapter 18
Experiment
Reading Questions
1. Is the point of a controlled experiment to create variability
or to avoid variability?
2. What is an experimental variable? In what ways is it
the same and in what ways different from an explanatory
variable?
3. What is the purpose of randomization in conducting an
experiment? What gets randomized?
4. Why might blocking be preferred to randomization?
5. What is the point of matched sampling and instrumental
variables?