Date: 2019-11-17 09:19

431 Quiz 2: Fall 2019

Thomas E. Love

due 2019-11-18 at Noon, version 2019-11-12

Instructions

All of the links for Quiz 2 Materials are at

The Materials

To complete the Quiz you’ll need three things, all of which are linked at the URL above.

1. The 2019-431-quiz02-questions.PDF file. This contains all of the instructions, questions and potential responses. Be sure that you see all 30 questions, and all 27 pages.

2. Five data files, named quiz_data_states.csv, quiz_hosp.csv, quiz_ra.csv, quiz_sim_nejm.csv and quiz_statin.csv, which may be useful to you.

3. The Quiz 2 Answer Sheet, which is a Google Form.

Use the PDF file to read the quiz and craft your responses (occasionally making use of the provided data sets), and then place those responses into the Answer Sheet Google Form. When using the Answer Sheet, please select or type in your best response (or responses, as indicated) for each question. All of your responses must be in the Answer Sheet by the deadline.

Key Things To Remember

The deadline for completing the Answer Sheet is Noon on Monday 2019-11-18, and this is a firm deadline, without the grace period we allow for in turning in Homework.

The questions are not arranged in any particular order, and your score is based on the number of correct responses, so you should answer all questions. There are 30 questions, and each is worth either 3 or 4 points. The maximum possible score on the quiz is 100 points. Questions 01, 02, 05, 06, 08, 14, 17, 22, 27 and 30 are worth 4 points each. They are marked to indicate this.

If you wish to work on some of the quiz on the Answer Sheet and then return later, you can do this by [1] completing the final question which asks you to type in your full name, and then [2] submitting the Answer Sheet. You will then receive a link which allows you to return to the Answer Sheet without losing your progress.

Occasionally, I ask you to provide a single line of code. In all cases, a single line of code can include at most one pipe for these purposes, although you may or may not need the pipe in any particular setting. Moreover, you need not include the library command at any time for any of your code. Assume in all questions that all relevant packages have been loaded in R. Any reference to a logarithm refers to a natural logarithm. If you need to set a seed, use set.seed(2019) throughout this Quiz.

You are welcome to consult the materials provided on the course website, but you are not allowed to discuss the questions on this quiz with anyone other than Professor Love and the teaching assistants at 431-help at case dot edu. Please submit any questions you have about the Quiz to 431-help through email. Thank you, and good luck.

1 Question 01 (4 points)

Consider the starwars tibble that is part of the dplyr package in the tidyverse. Filter the data file to focus on individuals who are of the Human species, who also have complete data on both their height and mass. Then use a t-based approach to estimate an appropriate 90% confidence interval for the difference: the mean body-mass index of Human males minus the mean body-mass index of Human females. Don’t assume that the population variances of males and females are the same. The data provide height in centimeters and mass in kilograms. You’ll need to calculate the body-mass index (BMI) values. The appropriate formula to obtain BMI in our usual units of $\text{kg}/\text{m}^2$ is:

$$
\text{BMI} = \frac{10{,}000 \times \text{mass in kg}}{(\text{height in cm})^2}
$$

Specify your point estimate, and then the lower and upper bound, each rounded to a single decimal place,

and be sure to specify the units of measurement.
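As a quick check of the unit conversion in the BMI formula, the following R sketch computes BMI from a mass in kilograms and a height in centimeters. The function name bmi_from_cm_kg and the example values (70 kg, 170 cm) are illustrative assumptions and are not taken from the starwars data.

```r
# Dividing height in cm by 100 gives meters, so kg / (cm / 100)^2
# is algebraically the same as 10,000 * kg / cm^2.
bmi_from_cm_kg <- function(mass_kg, height_cm) {
  10000 * mass_kg / height_cm^2
}

# Hypothetical person: 70 kg and 170 cm tall (not a starwars character).
example_bmi <- bmi_from_cm_kg(mass_kg = 70, height_cm = 170)
print(round(example_bmi, 1))
```

The same function could be applied inside a mutate() call to add a BMI column before comparing groups.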

2 Question 02 (4 points)

On 2019-09-25, Maggie Koerth-Baker at FiveThirtyEight published “We’ve Been Fighting the Vaping Crisis Since 1937.” In that article, she quotes a 2019-09-06 article at the New England Journal of Medicine by Jennifer E. Layden et al. entitled “Pulmonary Illness Related to E-Cigarette Use in Illinois and Wisconsin — Preliminary Report.” Quoting that report:

E-cigarettes are battery-operated devices that heat a liquid and deliver an aerosolized product to the user. . . . In July 2019, the Wisconsin Department of Health Services and the Illinois Department of Public Health received reports of pulmonary disease associated with the use of e-cigarettes (also called vaping) and launched a coordinated public health investigation. . . . We defined case patients as persons who reported use of e-cigarette devices and related products in the 90 days before symptom onset and had pulmonary infiltrates on imaging and whose illnesses were not attributed to other causes.

The entire report is available at https://www.nejm.org/doi/full/10.1056/NEJMoa1911614. In the study, 53 case patients were identified, but some patients gave no response to the question of whether or not “they had used THC (tetrahydrocannabinol) products in e-cigarette devices in the past 90 days.” Of the 41 who responded, 33 reported THC use. Assume those 41 subjects are a random sample of all case patients that will appear in Wisconsin and Illinois in 2019.

Use a SAIFS procedure to estimate an appropriate 90% confidence interval for the PERCENTAGE of case patients in Illinois and Wisconsin in 2019 that used THC in the 90 days prior to symptom onset. Note that I’ve emphasized the word PERCENTAGE here, so as to stop you from instead presenting a proportion. Specify your point estimate of this PERCENTAGE, and then the lower and upper bound for your confidence interval, in each case rounded to a single decimal place.

3 Question 03

Alex, Beth, Cara and Dave independently select random samples from the same population. The sample sizes are 200 for Alex, 400 for Beth, 125 for Cara, and 300 for Dave. Each researcher constructs a 95% confidence interval from their data using the same statistical method. The half-widths (margins of error) for those confidence intervals are 1.45, 1.74, 1.96 and 2.43. Match each interval’s margin of error with its researcher.

Rows:

a. Alex, who took a sample of n = 200 people.

b. Beth, who took a sample of n = 400 people.

c. Cara, who took a sample of n = 125 people.

d. Dave, who took a sample of n = 300 people.

Columns:

1. 1.45

2. 1.74

3. 1.96

4. 2.43
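The relationship that Question 03 relies on can be illustrated with a short R sketch: for a fixed method and confidence level, the half-width of an interval shrinks as the sample size grows. The sketch assumes a t-based interval for a mean. The function name half_width, the standard deviation of 10, and the sample sizes are hypothetical and are not the quiz values.

```r
# Half-width of a t-based confidence interval for a mean:
# qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
half_width <- function(n, s, conf_level = 0.95) {
  alpha <- 1 - conf_level
  qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

# Hypothetical sample sizes and standard deviation (not the quiz values).
n_values <- c(50, 100, 200, 800)
widths <- half_width(n = n_values, s = 10)
print(data.frame(n = n_values, half_width = round(widths, 2)))
```

Because the half-width is proportional to 1/sqrt(n), apart from a small change in the t quantile, quadrupling the sample size roughly halves the margin of error.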

4 Question 04

Suppose you have a tibble with two variables. One is a factor called Exposure with levels High, Low and Medium, arranged in that order, and the other is a quantitative outcome. You want to rearrange the order of the Exposure variable so that you can then use it to identify for ggplot2 a way to split histograms of outcomes up into a series of smaller plots, each containing the histogram for subjects with a particular level of exposure (Low, then Medium, then High).

Which of the pairs of tidyverse functions identified below has Dr. Love used to accomplish such a plot?

a. fct_reorder and facet_wrap

b. fct_relevel and facet_wrap

c. fct_collapse and facet_wrap

d. fct_reorder and group_by

e. fct_collapse and group_by

5 Question 05 (4 points)

In a double-blind trial, 350 patients with active rheumatoid arthritis were randomly assigned to receive one of two therapy types: a cheaper one, or a pricier one, and went on to participate in the trial.

The primary outcome was the change in DAS28 at 48 weeks as compared to study entry. The DAS28 is a composite index of the number of swollen and tender joints, the erythrocyte sedimentation rate, and a visual-analogue scale of patient-reported disease activity. A decrease in the DAS28 of 1.2 or more (so a change of -1.2 or below) was considered to be a clinically meaningful improvement. Data are in the quiz_ra.csv file.

A student completed four analyses, shown below. Which of the following 90% confidence intervals for the change in DAS28 at 48 weeks most appropriately compares the pricier therapy to the cheaper one?

d. Analysis D

e. Analysis E

f. Analysis F

g. Analysis G

ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()

mosaic::favstats(das28_chg ~ therapy, data = ra)

  therapy   min     Q1 median     Q3  max      mean       sd   n missing
1 Cheaper -6.12 -2.955  -2.22 -1.415 0.56 -2.250857 1.208183 175       0
2 Pricier -5.56 -2.630  -2.06 -1.250 1.53 -2.027486 1.260694 175       0

ggplot(data = ra, aes(x = therapy, y = das28_chg, fill = therapy)) +
    geom_violin(alpha = 0.3) + geom_boxplot(width = 0.3, notch = TRUE) +
    theme_bw() + guides(fill = FALSE) + scale_fill_viridis_d()

5.1 Analysis D

ra %$% t.test(das28_chg ~ therapy, var.equal = TRUE) %>%
    tidy(conf.int = TRUE, conf.level = 0.90) %>%
    mutate(estimate = estimate1 - estimate2) %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
     <dbl>    <dbl>     <dbl> <chr>
1   -0.223   -0.483    0.0362 " Two Sample t-test"

5.2 Analysis E

ra %$% t.test(das28_chg ~ therapy, paired = TRUE) %>%
    tidy(conf.int = TRUE, conf.level = 0.90) %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
     <dbl>    <dbl>     <dbl> <chr>
1   -0.223   -0.250    -0.197 Paired t-test

5.3 Analysis F

ra %$% wilcox.test(das28_chg ~ therapy, paired = TRUE,
                   conf.int = TRUE, conf.level = 0.90) %>%
    tidy() %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
     <dbl>    <dbl>     <dbl> <chr>
1   -0.230   -0.245    -0.215 Wilcoxon signed rank test with continuity co~

5.4 Analysis G

ra %$% wilcox.test(das28_chg ~ therapy, conf.int = TRUE, conf.level = 0.90) %>%
    tidy() %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
     <dbl>    <dbl>     <dbl> <chr>
1   -0.240   -0.450   -0.0300 Wilcoxon rank sum test with continuity corre~

6 Question 06 (4 points)

Referring again to the study initially described in Question 05, which of the following analyses provides an appropriate 90% confidence interval for the difference (cheaper - pricier) in the proportion of participants who had a clinically meaningful improvement (DAS28 change of -1.2 or below) at 48 weeks?

j. Analysis J

k. Analysis K

l. Analysis L

m. Analysis M

n. None of the above.

6.1 Analysis J

ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg < -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "FALSE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
    FALSE      31      41
     TRUE     144     134

twobytwo(31, 41, 144, 134, "improved", "didn't improve",
         "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve

               cheaper pricier    P(cheaper) 95% conf. interval
improved            31      41        0.4306    0.3217   0.5466
didn't improve     144     134        0.5180    0.4593   0.5762

                                   95% conf. interval
             Relative Risk: 0.8312    0.6227   1.1096
         Sample Odds Ratio: 0.7036    0.4173   1.1864
Conditional MLE Odds Ratio: 0.7043    0.4019   1.2246
    Probability difference: -0.0874   -0.2100   0.0416

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

6.2 Analysis K

ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg <= -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "TRUE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
     TRUE     144     134
    FALSE      31      41

twobytwo(144, 134, 31, 41, "improved", "didn't improve",
         "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve

               cheaper pricier    P(cheaper) 95% conf. interval
improved           144     134        0.5180    0.4593   0.5762
didn't improve      31      41        0.4306    0.3217   0.5466

                                   95% conf. interval
             Relative Risk: 1.2031    0.9013   1.6059
         Sample Odds Ratio: 1.4213    0.8429   2.3965
Conditional MLE Odds Ratio: 1.4198    0.8166   2.4880
    Probability difference: 0.0874   -0.0416   0.2100

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

6.3 Analysis L

ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg < -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "FALSE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
    FALSE      31      41
     TRUE     144     134

twobytwo(31, 41, 144, 134, conf.level = 0.90,
         "improved", "didn't improve", "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve

               cheaper pricier    P(cheaper) 90% conf. interval
improved            31      41        0.4306    0.3383   0.5279
didn't improve     144     134        0.5180    0.4687   0.5669

                                   90% conf. interval
             Relative Risk: 0.8312    0.6523   1.0592
         Sample Odds Ratio: 0.7036    0.4538   1.0908
Conditional MLE Odds Ratio: 0.7043    0.4379   1.1271
    Probability difference: -0.0874   -0.1914   0.0212

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

6.4 Analysis M

ra <- read.csv("data/quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg <= -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "TRUE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
     TRUE     144     134
    FALSE      31      41

twobytwo(144, 134, 31, 41, conf.level = 0.90,
         "improved", "didn't improve", "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didn't improve

               cheaper pricier    P(cheaper) 90% conf. interval
improved           144     134        0.5180    0.4687   0.5669
didn't improve      31      41        0.4306    0.3383   0.5279

                                   90% conf. interval
             Relative Risk: 1.2031    0.9441   1.5331
         Sample Odds Ratio: 1.4213    0.9168   2.2034
Conditional MLE Odds Ratio: 1.4198    0.8872   2.2838
    Probability difference: 0.0874   -0.0212   0.1914

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

7 Question 07

In response to unexpectedly low enrollment, the protocol was amended part-way through the trial described in Question 05 to change the primary outcome from a binary outcome to a continuous outcome in order to increase the power of the study.

Originally, the proposed primary outcome was the difference in the proportion of participants who had a DAS28 of 3.2 or less at week 48. The original power analysis established a sample size target of 225 completed enrollments in each therapy group, based on a two-sided 10% significance level, and a desire for 90% power. In that initial power analysis, the proportion of participants with a DAS28 of 3.2 or less at week 48 was assumed to be 0.27 under the less effective of the two therapies.

What value was used in the power calculation for the proportion of participants with DAS28 of 3.2 or less at

week 48 for the more effective therapy? State your answer rounded to two decimal places.
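As a rough illustration of the kind of two-proportion power calculation described above, base R provides power.prop.test() in the stats package. The sketch below solves for the per-group sample size. All numeric inputs are hypothetical and are not the values used in the trial, and nothing in the quiz confirms that this function was used for the original calculation.

```r
# power.prop.test() solves for whichever one of n, p1, p2, power or
# sig.level is passed as NULL (n, p1, p2 and power default to NULL).
# All values below are hypothetical and are not the quiz's values.
result <- power.prop.test(
  p1 = 0.50,          # assumed proportion in group 1
  p2 = 0.75,          # assumed proportion in group 2
  sig.level = 0.05,   # two-sided significance level
  power = 0.90        # desired power
)
print(result)  # n is the required sample size in EACH group
```

Supplying n, p1, sig.level and power while omitting p2 asks the same function to solve for the second proportion instead.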

8 Question 08 (4 points)

In the trial described in Question 05, 21 of the 222 subjects originally assigned to receive the cheaper therapy and 35 of the 219 subjects originally assigned to receive the pricier therapy experienced a serious adverse event (which included infections, gastrointestinal, renal, urinary, cardiac or vascular disorders, as well as surgical or medical procedures).

Suppose you wanted to determine whether or not there was a statistically detectable difference in the rates of serious adverse events in the two therapy groups at the 5% significance level. Specify a single line of R code that would do this, appropriately.

9 Question 09

The Pottery data are part of the carData package in R. Included are data describing the chemical composition of ancient pottery found at four sites in Great Britain. This data set will also be used in Question 10. In this question, we will focus on the Na (Sodium) levels, and our goal is to compare the mean Na levels across the four sites.

anova(lm(Na ~ Site, data = carData::Pottery))

Analysis of Variance Table

Response: Na
          Df  Sum Sq  Mean Sq F value    Pr(>F)
Site       3 0.25825 0.086082  9.5026 0.0003209 ***
Residuals 22 0.19929 0.009059
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which of the following conclusions is most appropriate, based on the output above?

a. The F test allows us to conclude that the population mean Na level in at least one of the four sites is detectably different than the others, at a 1% significance level.

b. The F test allows us to conclude that the population mean Na level in each of the four sites is detectably different than each of the others, at a 1% significance level.

c. The F test allows us to conclude that the population mean Na level is the same in all four sites, at a 1% significance level.

d. The F test allows us to conclude that the population mean Na level may not be the same in all sites, but is not detectably different at the 1% level.

e. None of these conclusions are appropriate.

10 Question 10

Consider these two sets of plots, generated to describe variables from the Pottery data set within the carData package.

[Plot 1 for Question 10 and Plot 2 for Question 10 are not reproduced here.]

And now, here are summary statistics from the mosaic::inspect function describing the variables contained in the Pottery data set.

mosaic::inspect(carData::Pottery)

categorical variables:
  name  class levels  n missing
1 Site factor      4 26       0
                                   distribution
1 Llanedyrn (53.8%), AshleyRails (19.2%) ...

quantitative variables:
  name   class   min    Q1 median      Q3   max       mean        sd  n
1   Al numeric 10.10 11.95 13.800 17.4500 20.80 14.4923077 2.9926474 26
2   Fe numeric  0.92  1.70  5.465  6.5900  7.09  4.4676923 2.4097507 26
3   Mg numeric  0.53  0.67  3.825  4.5025  7.23  3.1415385 2.1797260 26
4   Ca numeric  0.01  0.06  0.155  0.2150  0.31  0.1465385 0.1012301 26
5   Na numeric  0.03  0.05  0.150  0.2150  0.54  0.1584615 0.1352832 26
  missing

Based on this output, and whatever other work you need to do, which of the statements below is true, about Variable 1 (as shown in Plot 1) and Variable 2 (shown in Plot 2)?

a. Var1 is . . .

b. Var2 is . . .

Choices are:

11 Question 11

Suppose you have a data frame named mydata containing a variable called sbp, which shows the participant’s systolic blood pressure in millimeters of mercury. Which of the following lines of code will create a new variable badbp within the mydata data frame which takes the value TRUE when a subject has a systolic blood pressure that is at least 120 mm Hg, and FALSE when a subject’s systolic blood pressure is less than 120 mm Hg?

a. mydata %>% badbp <- sbp >= 120

b. mydata$badbp <- ifelse(mydata$sbp >= 120, "YES", "NO")

c. badbp <- mydata %>% filter(sbp >= 120)

d. mydata %>% mutate(badbp = sbp >= 120)

e. None of these will do the job.

12 Question 12

According to Jeff Leek in The Elements of Data Analytic Style, which of the following is NOT a good reason to create graphs for data exploration?

a. To understand properties of the data.

b. To inspect qualitative features of the data more effectively than a huge table of raw data would allow.

c. To discover new patterns or associations.

d. To consider whether transformations may be of use.

e. To look for statistical significance without first exploring the data.

13 Question 13

If the characteristics of a sample approximate the characteristics of its population in every respect, then which of the statements below is true? (CHECK ALL THAT APPLY.)

a. The sample is random

b. The sample is accidental

c. The sample is stratified

d. The sample is systematic

e. The sample is representative

f. None of the above

Setup for Questions 14-15

For Questions 14 and 15, consider the data I have provided in the quiz_hosp.csv file. The data describe 700 simulated patients at a metropolitan hospital. Available are:

• subject.id = Subject Identification Number (not a meaningful code)

• sex = the patient’s sex (FEMALE or MALE)

• statin = does the patient have a prescription for a statin medication (YES or NO)

• insurance = the patient’s insurance type (MEDICARE, COMMERCIAL, MEDICAID, UNINSURED)

• hsgrads = the percentage of adults in the patient’s home neighborhood who have at least a high school diploma (this measure of educational attainment is used as an indicator of the socio-economic place in which the patient lives)

14 Question 14 (4 points)

Using the quiz_hosp data, what is the 95% confidence interval for the odds ratio which compares the odds of receiving a statin if you are MALE divided by the odds of receiving a statin if you are FEMALE? Show the point and interval estimates, rounded to two decimal places. Do NOT use a Bayesian augmentation here.

15 Question 15

Perform an appropriate analysis to determine whether insurance type is associated with the education (hsgrads) variable, ignoring all other information in the quiz_hosp data. Which of the following conclusions is most appropriate based on your analyses, using a 5% significance level?

a. The ANOVA F test shows no detectable effect of insurance on hsgrads, so it doesn’t make sense to compare pairs of insurance types.

b. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison reveals that Medicare shows detectably higher education levels than Uninsured.

c. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison reveals that Medicaid’s education level is detectably lower than either Medicare or Commercial.

d. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison reveals that Uninsured’s education level is detectably lower than Commercial or Medicare.

e. None of these conclusions is appropriate.

16 Question 16

Once a confidence interval is calculated, several design changes may be used by a researcher to make a confidence interval wider or narrower. For each of the changes listed below, indicate the impact on the width of the confidence interval.

Rows are

a. Increase the level of confidence.

b. Increase the sample size.

c. Increase the standard error of the estimate.

d. Use a bootstrap approach to estimate the CI.

Columns are

1. CI will become wider

2. CI will become narrower

3. CI width will not change

4. It is impossible to tell

17 Question 17 (4 points)

The data in the quiz_statin.csv file provided to you describe the results of a study of 180 patients who have a history of high cholesterol. Patients in the study were randomly assigned to the use of a new statin medication, or to retain their current regimen. The columns in the data set show a patient identification code, whether or not the patient was assigned to the new statin (Yes or No) and their LDL cholesterol value (in mg/dl) at the end of the study. You have been asked to produce a 95% confidence interval comparing the mean LDL levels across the two statin groups (including both a point estimate and appropriate confidence interval rounded to two decimal places), and then describe your result in context in a single English sentence.

Which of the following approaches and conclusions are reasonable in this setting? (CHECK ALL THAT APPLY)

a. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.65, 9.24) mg/dl, based on an indicator variable regression model, which replicates a two-sample t test assuming equal variances.

b. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.65, 9.24) mg/dl, based on an indicator variable regression model, which replicates a two-sample t test assuming equal variances.

c. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.56, 9.33) mg/dl, based on a Welch two-sample t test not assuming equal variances.

d. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.56, 9.33) mg/dl, based on a Welch two-sample t test not assuming equal variances.

e. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.94, 9.21) mg/dl, based on a bootstrap comparison of the population means and using the seed 2019.

f. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.94, 9.21) mg/dl, based on a bootstrap comparison of the population means and using the seed 2019.

g. None of the above are appropriate, since we should be using a paired samples analysis with these data.

18 Question 18

A hospital system has about 1 million records in its electronic health record database for subjects who meet our study’s qualifying requirements for inclusion and exclusion. We believe that about 20% of the subjects who qualify by these criteria will need a particular blood test.

Rows are:

a. Which will provide a confidence interval with smaller width for the proportion needing the blood test, using a Wald approach?

b. Which will provide a better confidence interval estimate for the sample proportion of eligible subjects who need the blood test?

Columns are:

1. A random sample of 85 subjects who meet the qualifying requirements.

2. A non-random sample of 850,000 of the subjects who met the qualifying requirements in the past year.

19 Question 19

A series of 88 models were built by a team of researchers interested in systems biology. 36 of the models showed promising results in an attempt to validate them out of sample. Define the hit rate as the percentage of models built that show these promising results. Which of the following intervals appropriately describes the uncertainty we have around a hit rate estimate in this setting, using a Wald confidence interval approach with a Bayesian augmentation and permitting a 10% rate of Type I error?

a. (31.8%, 50.3%)

b. (32.2%, 50.2%)

c. 0.411 plus or minus 9 percentage points

d. (32.4%, 50.3%)

e. None of these intervals.

20 Question 20

The lab component of a core course in biology is taught at the Watchmaker’s Technical Institute by a set of five teaching assistants, whose names, conveniently, are Amy, Beth, Carmen, Donna and Elena. On the second quiz of the semester (each section takes the same set of quizzes) an administrator at WTI wants to compare the mean scores across lab sections. She produces the following output in R.

Analysis of Variance Table

Response: exam2
           Df  Sum Sq Mean Sq F value  Pr(>F)
ta          4   971.5 242.868  2.7716 0.02898
Residuals 165 14458.4  87.627

Emboldened by this result, the administrator decides to compare mean exam2 scores for each possible pair of TAs, using a Bonferroni correction. Suppose she’s not heard of pairwise.t.test() and therefore plans to make each comparison separately with two-sample t tests. If she wants to maintain an overall α level of 0.10 for the resulting suite of pairwise comparisons using the Bonferroni correction, then what significance level should she use for each of the individual two-sample t tests?

a. She should use a significance level of 0.10 on each test.

b. She should use 0.05 on each test.

c. She should use 0.025 on each test.

d. She should use 0.01 on each test.

e. She should use 0.001 on each test.

f. None of these answers are correct.
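The arithmetic behind a Bonferroni correction for all pairwise comparisons can be sketched in a few lines of R. The function name bonferroni_alpha and the example inputs (4 groups, family-wise level 0.05) are hypothetical and deliberately differ from the setting in Question 20.

```r
# Bonferroni: divide the family-wise alpha by the number of comparisons.
bonferroni_alpha <- function(n_groups, family_alpha) {
  n_pairs <- choose(n_groups, 2)   # number of distinct pairs of groups
  family_alpha / n_pairs
}

# Hypothetical setting: 4 groups and a family-wise alpha of 0.05.
per_test_alpha <- bonferroni_alpha(n_groups = 4, family_alpha = 0.05)
print(per_test_alpha)
```

With 4 groups there are choose(4, 2) = 6 pairs, so each individual test in this hypothetical case would use 0.05 / 6.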

21 Question 21

If the administrator at the Watchmaker’s Technical Institute that we mentioned in Question 20 instead used a Tukey HSD approach to make her comparisons, she might have obtained the following output.

Tukey multiple comparisons of exam2 means, 90% family-wise confidence level

              diff    lwr   upr ||                diff    lwr   upr
             -----  ----- ----- ||               ----- ------ -----
Beth-Amy      1.21  -4.43  6.83 || Donna-Beth    -6.53 -12.16 -0.90
Carmen-Amy   -1.41  -7.04  4.22 || Elena-Beth    -0.24  -5.87  5.40
Donna-Amy    -5.32 -10.96  0.31 || Donna-Carmen  -3.91  -9.54  1.72
Elena-Amy     0.97  -4.66  6.60 || Elena-Carmen   2.38  -3.25  8.01
Carmen-Beth  -2.62  -8.25  3.01 || Elena-Donna    6.29   0.66 11.93

Note that when we refer in the responses below to Beth’s scores, we mean the scores of students who were in Beth’s lab section. Which conclusion of those presented below would be most appropriate?

a. Amy’s scores are significantly higher than Carmen’s or Elena’s.

b. Beth’s scores were significantly higher than Amy’s.

c. Donna’s scores are significantly lower than Beth’s or Elena’s.

d. Elena’s scores are significantly lower than Donna’s.

e. None of these answers are correct.

22 Question 22 (4 points)

The quiz_data_states.csv file contains information on several variables related to the 50 United States plus the District of Columbia. The available data include 102 rows of information on six columns, and those columns are:

• code: the two-letter abbreviation for the “state” (DC = Washington DC, etc.)

• state: the “state” name

• year: 2019 or 2010, the year for which the remaining variables were obtained

• population: number of people living in the “state”

• poverty_people: number of people in the “state” living below the poverty line

• poverty_rate: % of people living in the “state” who are below the poverty line

Our eventual goal is to use the quiz_data_states data to produce an appropriate 90% confidence interval for the change from 2010 to 2019 in poverty rate, based on an analysis of the data at the level of the 51 “states”.

Which of the following statements is most true?

a. This should be done using a paired samples analysis, and the quiz_data_states data require us to calculate the paired differences, but are otherwise ready to plot now.

b. This should be done using a paired samples analysis, and the quiz_data_states data require us to pivot the data to make them wider, and then calculate the paired differences and plot them.

c. This should be done using a paired samples analysis, and the quiz_data_states data require us to pivot the data to make them longer, and then calculate the paired differences and plot them.

d. This should be done using an independent samples analysis, and the quiz_data_states data are ready to be plotted appropriately now.

e. This should be done using an independent samples analysis, and the quiz_data_states data require us to pivot the data to make them wider, and then plot the distributions of the two samples.

f. This should be done using an independent samples analysis, and the quiz_data_states data require us to pivot the data to make them longer, and then plot the distributions of the two samples.

23 Question 23

Which of the following is the most appropriate way to complete the development of the confidence interval proposed in Question 22?

a. Tukey HSD comparisons following an Analysis of Variance

b. Applying tidy() to an Indicator Variable Regression

c. Applying tidy() to an Intercept-only Regression

d. A Wilcoxon-Mann-Whitney Rank Sum Confidence Interval

e. A bootstrap on the poverty_people values across the states

24 Question 24

Use the data you have been provided in the quiz_data_states.csv file to provide a point estimate of the change from 2010 to 2019 in the poverty rate in the United States as a whole. Provide your response as a proportion with four decimal places. Note carefully what I am asking for (and not asking for) here.

25 Question 25

In The Signal and The Noise, Nate Silver writes repeatedly about a Bayesian way of thinking about uncertainty, for instance in Chapters 8 and 13. Which of the following statistical methods is NOT consistent with a Bayesian approach to thinking about variation and uncertainty? (CHECK ALL THAT APPLY)

a. Updating our forecasts as new information appears.

b. Establishing a researchable hypothesis prior to data collection.

c. Significance testing of a null hypothesis, using, say, Fisher’s exact test.

d. Combining information from multiple sources to build a model.

e. Gambling using a strategy derived from a probability model.

26 Question 26

According to Jeff Leek in The Elements of Data Analytic Style, which of the following is NOT a good idea in creating graphs you will share with other people to describe your work? (CHECK ALL THAT APPLY)

a. If you have multiple plots to compare, use the same scale on the vertical axis.

b. Axis labels should be large, easy to read, in plain language.

c. Add a third dimension, perhaps with animation.

d. Include units in figure labels and legends.

e. Use color and size to help communicate information, for instance to point out confounding.

27 Question 27 (4 points)

Suppose that 200 of 260 applications to a graduate school from students at private undergraduate institutions are accepted, while 140 of 210 applications from students at public undergraduate institutions are accepted. Estimate a two-sided 95% confidence interval for the relative risk of acceptance for a “private undergrad” applicant as compared to a “public undergrad” applicant. Round your response to two decimal places. Provide both the point estimate and confidence interval.

28 Question 28

Each of the 470 students described in Question 27 applied to exactly one program at the school: either Program A, B or C. Breaking down the applications, we find that

• Program A received 120 applications, and accepted 75.

• Program A accepted 35 of its 60 applicants who came from private schools.

• Program B received 125 applications in total.

• Program B accepted exactly half of its 20 applicants from private schools.

• Program C accepted 40 of its 45 applicants from public schools.

• Program C rejected 25 applicants from private schools.

Which of the following statements is true? (CHECK ALL THAT APPLY.)

a. Students from private schools have lower odds of being accepted into Program A than do students from public schools.

b. Students from private schools have lower odds of being accepted into Program B than do students from public schools.

c. Students from private schools have lower odds of being accepted into Program C than do students from public schools.

d. None of these statements are true.

e. There is insufficient information to decide which of statements a-c are true.

29 Question 29

Suppose we wanted to do a new study of people guessing my age, at the 5% significance level. We believe we can enroll 220 people in total, 100 of whom are female, who can then be asked to guess my age.

Find the power for a two-sided t test to compare age guesses, if we assume a minimum clinically meaningful difference between male and female observers in terms of age guesses is 2.5 years, and assuming that the standard deviation of age guesses is 5 years in both males and females. Which of the following options best describes the power we obtain in this case?

a. Less than 80%.

b. Between 80% and 84.9%.

c. Between 85% and 89.9%.

d. Between 90% and 94.9%.

e. 95% or higher.

f. It is impossible to estimate the power in this setting.

30 Question 30 (4 points)

A June 2018 special article in the New England Journal of Medicine by Michelle M. Mello, Van Lieou and Steven N. Goodman entitled “Clinical Trial Participants’ Views of the Risks and Benefits of Data Sharing” is the focus of this question. You can see the whole article (although it won’t help you with the Quiz) at https://www.nejm.org/doi/full/10.1056/NEJMsa1713258.

In that article, the investigators describe a survey of 771 “current and recent participants from a diverse sample of clinical trials at three academic medical centers in the United States” and report an overall response rate of 79%.

The 771 respondents were asked “How likely would you be to allow your anonymous, individual clinical trial data to be shared with…” two types of people, specifically:

1. scientists in universities and other not-for-profit organizations

2. scientists in companies developing medical products, such as prescription drugs

The data in quiz_sim_nejm.csv show simulated responses (collapsed to address the question of whether the response in each case was “Very Likely” [indicated by Yes] or something else [indicated by No]) from 771 subjects to these items. I’ve used those data to create four analyses, labeled W, X, Y and Z.

Your job is to find appropriate point and 90% confidence interval estimates for the difference between the proportion of respondents who would be “Very Likely” to share their data with scientists in universities and the proportion who would be “Very Likely” to share their data with scientists in companies. Which of the following responses is correct?

a. Point Estimate is -0.1582, with 90% CI (-0.1981, -0.1177) from Analysis W

b. Point Estimate is 0.0892, with 90% CI (0.0251, 0.1525) from Analysis X

c. Point Estimate is 0.1582, with 90% CI (0.1177, 0.1981) from Analysis Y

d. Point Estimate is 0.1582, with 90% CI (0.1197, 0.1968) from Analysis Z

e. None of these estimates are correct.

Data Management for Question 30, done before Analyses W, X, Y or Z

sim_nejm <- read.csv(here("data", "quiz_sim_nejm.csv")) %>%
    tbl_df() %>% clean_names() %>%
    mutate(university = fct_relevel(university, "Yes"),
           company = fct_relevel(company, "Yes"))

head(sim_nejm)

# A tibble: 6 x 3
  subject university company
  <fct>   <fct>      <fct>
1 S-001   Yes        Yes
2 S-002   Yes        No
3 S-003   Yes        Yes
4 S-004   Yes        Yes
5 S-005   No         Yes
6 S-006   Yes        No

sim_long <- pivot_longer(sim_nejm, -subject,
                         names_to = "type",
                         values_to = "response") %>%
    mutate(response = fct_relevel(response, "Yes"))

head(sim_long)

# A tibble: 6 x 3
  subject type       response
  <fct>   <chr>      <fct>
1 S-001   university Yes
2 S-001   company    Yes
3 S-002   university Yes
4 S-002   company    No
5 S-003   university Yes
6 S-003   company    Yes

Analyses W, X, Y and Z use the material above

You’ll find them on the last four pages of this PDF.

30.1 Analysis W

sim_long %$% table(type, response)

            response
type         Yes  No
  company    412 359
  university 534 237

Epi::twoby2(sim_long %$% table(type, response),
            conf.level = 0.90)

2 by 2 table analysis:
------------------------------------------------------
Outcome   : Yes
Comparing : company vs. university

           Yes  No    P(Yes) 90% conf. interval
company    412 359    0.5344    0.5047   0.5638
university 534 237    0.6926    0.6646   0.7192

                                    90% conf. interval
             Relative Risk:  0.7715    0.7209   0.8258
         Sample Odds Ratio:  0.5093    0.4276   0.6067
Conditional MLE Odds Ratio:  0.5096    0.4253   0.6101
    Probability difference: -0.1582   -0.1981  -0.1177

             Exact P-value: 0.0000
        Asymptotic P-value: 0.0000
------------------------------------------------------

30.2 Analysis X

sim_nejm %$% table(university, company)

          company
university Yes  No
       Yes 300 234
       No  112 125

Epi::twoby2(sim_nejm %$% table(university, company),
            conf.level = 0.90)

2 by 2 table analysis:
------------------------------------------------------
Outcome   : Yes
Comparing : Yes vs. No

    Yes  No    P(Yes) 90% conf. interval
Yes 300 234    0.5618    0.5262   0.5967
No  112 125    0.4726    0.4197   0.5260

                                   90% conf. interval
             Relative Risk: 1.1888    1.0447   1.3528
         Sample Odds Ratio: 1.4309    1.1059   1.8514
Conditional MLE Odds Ratio: 1.4302    1.0926   1.8732
    Probability difference: 0.0892    0.0251   0.1525

             Exact P-value: 0.0234
        Asymptotic P-value: 0.0222
------------------------------------------------------

30.3 Analysis Y

y <- sim_long %>%
    mutate(type = fct_relevel(type, "university"))

y %$% table(type, response)

            response
type         Yes  No
  university 534 237
  company    412 359

Epi::twoby2(y %$% table(type, response),
            conf.level = 0.90)

2 by 2 table analysis:
------------------------------------------------------
Outcome   : Yes
Comparing : university vs. company

           Yes  No    P(Yes) 90% conf. interval
university 534 237    0.6926    0.6646   0.7192
company    412 359    0.5344    0.5047   0.5638

                                   90% conf. interval
             Relative Risk: 1.2961    1.2110   1.3872
         Sample Odds Ratio: 1.9633    1.6483   2.3385
Conditional MLE Odds Ratio: 1.9625    1.6391   2.3514
    Probability difference: 0.1582    0.1177   0.1981

             Exact P-value: 0.0000
        Asymptotic P-value: 0.0000
------------------------------------------------------
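
As an illustration only, the Relative Risk line in the Analysis Y output can be approximated by hand in base R. The sketch below assumes the usual large-sample interval built on the log scale for two independent groups; that this is how Epi::twoby2 computes the line is an assumption, not something the quiz states, and the variable names are made up for the example.

```r
# Counts copied from the Analysis Y table (rows: university, company).
yes_univ <- 534; n_univ <- 771
yes_comp <- 412; n_comp <- 771

# Point estimate: ratio of the two "Yes" proportions.
rr <- (yes_univ / n_univ) / (yes_comp / n_comp)

# Large-sample standard error of log(RR); this formula treats the
# two rows as independent samples (an assumption of the method).
se_log_rr <- sqrt(1 / yes_univ - 1 / n_univ + 1 / yes_comp - 1 / n_comp)

# 90% interval: build it on the log scale, then exponentiate.
z <- qnorm(1 - 0.10 / 2)
rr_ci <- exp(log(rr) + c(-1, 1) * z * se_log_rr)

round(c(estimate = rr, lower = rr_ci[1], upper = rr_ci[2]), 4)
```

Under these assumptions the three numbers should agree, to four decimal places, with the Relative Risk row printed above. The other rows of the output use different interval methods and are not reproduced by this sketch.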

30.4 Analysis Z

sim_nejm %>% tabyl(university, company) %>%
    adorn_totals(where = c("row", "col")) %>%
    adorn_title()

            company
 university     Yes  No Total
        Yes     300 234   534
         No     112 125   237
      Total     412 359   771

PropCIs::diffpropci.Wald.mp(b = 112, c = 234, n = 771,
                            conf.level = 0.90)

data:

90 percent confidence interval:
 0.1196754 0.1967967
sample estimates:
[1] 0.1582361
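
For readers who want to see the arithmetic behind the Analysis Z call, here is a minimal base R sketch of a Wald interval for a difference of paired proportions, driven only by the two discordant counts and the sample size. The formula is the standard textbook one; treating it as what PropCIs::diffpropci.Wald.mp does internally is an assumption, and the variable names are hypothetical.

```r
# Discordant cells from the table above, plus the number of subjects.
n_no_yes <- 112   # university = No,  company = Yes (argument b in the call)
n_yes_no <- 234   # university = Yes, company = No  (argument c in the call)
n <- 771

# Difference in marginal proportions:
# P(university = Yes) - P(company = Yes) = (534 - 412) / 771.
est <- (n_yes_no - n_no_yes) / n

# Wald standard error for a paired difference; the concordant
# cells drop out, so only the discordant counts appear.
se <- sqrt((n_yes_no + n_no_yes) - (n_yes_no - n_no_yes)^2 / n) / n

# Two-sided 90% interval.
z <- qnorm(1 - 0.10 / 2)
ci <- est + c(-1, 1) * z * se

round(c(estimate = est, lower = ci[1], upper = ci[2]), 4)
```

If the assumption holds, the estimate and limits should match the values printed by the call above to about four decimal places.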

This is the end of the Quiz.
