Homework 4: Diffusion of Tetracycline

We continue examining the diffusion of tetracycline among doctors in Illinois in the early 1950s, building on

our work in lab 6. You will need the data sets ckm_nodes.csv and ckm_network.dat from the labs.

1. Clean the data to eliminate doctors for whom we have no adoption-date information, as in the labs.

Only use this cleaned data in the rest of the assignment.

2. Create a new data frame which records, for every doctor, for every month, whether that doctor began

prescribing tetracycline that month, whether they had adopted tetracycline before that month, the

number of their contacts who began prescribing strictly before that month, and the number of their

contacts who began prescribing in that month or earlier. Explain why the dataframe should have 6

columns, and 2125 rows. Try not to use any loops.
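As a rough illustration of the loop-free construction, here is a minimal Python/NumPy sketch on a made-up four-doctor, three-month example. The arrays `adopt` and `contacts` and their values are hypothetical stand-ins, not the contents of ckm_nodes.csv or ckm_network.dat, and the coding of "never adopted" as infinity is an assumption made for this sketch only.

```python
import numpy as np

# Hypothetical toy inputs, NOT the real data: 4 doctors, 3 months.
# adopt[i] is the month doctor i began prescribing (np.inf = never adopted).
adopt = np.array([1, 2, np.inf, 2])
# contacts[i, j] == 1 when doctor j is a contact of doctor i.
contacts = np.array([[0, 1, 1, 0],
                     [1, 0, 0, 1],
                     [1, 0, 0, 0],
                     [0, 1, 0, 0]])
months = np.arange(1, 4)

# Broadcasting compares every doctor with every month at once.
began = adopt[:, None] == months[None, :]    # began prescribing that month
before = adopt[:, None] < months[None, :]    # had adopted before that month

# A matrix product counts, per doctor and month, the contacts who had adopted.
n_before = contacts @ before.astype(int)             # strictly before
n_upto = contacts @ (before | began).astype(int)     # that month or earlier

# Flatten to one row per (doctor, month) pair.
doctor, month = np.meshgrid(np.arange(len(adopt)), months, indexing="ij")
table = np.column_stack([doctor.ravel(), month.ravel(), began.ravel(),
                         before.ravel(), n_before.ravel(), n_upto.ravel()])
print(table.shape)
```

With 4 doctors and 3 months the toy table has 4 × 3 = 12 rows, one per doctor-month pair; the same counting argument applies to the cleaned data.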

3. Let

$p_k = \Pr(\text{A doctor starts prescribing tetracycline this month} \mid \text{Number of doctor’s contacts prescribing before this month} = k)$

and

$q_k = \Pr(\text{A doctor starts prescribing tetracycline this month} \mid \text{Number of doctor’s contacts prescribing this month} = k)$

We suppose that $p_k$ and $q_k$ are the same for all months.

a. Explain why there should be no more than 21 values of $k$ for which we can estimate $p_k$ and $q_k$ directly from the data.

b. Create a vector of estimated $p_k$ probabilities, using the data frame from (2). Plot the probabilities against the number of prior-adoptee contacts $k$.

c. Create a vector of estimated $q_k$ probabilities, using the data frame from (2). Plot the probabilities against the number of prior-or-contemporary-adoptee contacts $k$.
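One possible way to turn doctor-month records into estimated probabilities is to divide, for each k, the number of adoptions by the number of opportunities to adopt. The sketch below uses made-up records; the names `k`, `began`, and `prior` are hypothetical, and restricting the count to doctor-months where the doctor had not yet adopted is an assumption of this sketch rather than something the assignment prescribes.

```python
import numpy as np

# Hypothetical doctor-month records (made up for illustration):
# k     = number of contacts who adopted before that month
# began = 1 if the doctor began prescribing in that month
# prior = 1 if the doctor had already adopted before that month
k = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
began = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1])
prior = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0])

# Assumption: only doctor-months still "at risk" of adopting are counted.
at_risk = prior == 0
size = k.max() + 1
n_k = np.bincount(k[at_risk], minlength=size)              # opportunities per k
adopt_k = np.bincount(k[at_risk], weights=began[at_risk],
                      minlength=size)                      # adoptions per k

# Leave the estimate as NaN for any k with no observations.
p_hat = np.full(size, np.nan)
np.divide(adopt_k, n_k, out=p_hat, where=n_k > 0)
print(n_k, p_hat)
```

Keeping `n_k` alongside the estimates is convenient, since the number of observations behind each estimate indicates how much that estimate can be trusted.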

4. Because it only conditions on information from the previous month, $p_k$ is a little easier to interpret than $q_k$. It is the probability per month that a doctor adopts tetracycline, if they have exactly $k$ contacts who had already adopted tetracycline.

a. Suppose $p_k = a + bk$. This would mean that each friend who adopts the new drug increases the probability of adoption by an equal amount. Estimate this model by least squares, using the values you constructed in (3b). Report the parameter estimates.

b. Suppose $p_k = e^{a+bk}/(1 + e^{a+bk})$. Explain, in words, what this model would imply about the impact of adding one more adoptee friend on a given doctor’s probability of adoption. (You can suppose that $b > 0$, if that makes it easier.) Estimate the model by least squares, using the values you constructed in (3b).

c. Plot the values from (3b) along with the estimated curves from (4a) and (4b). (You should have one plot, with $k$ on the horizontal axis, and probabilities on the vertical axis.) Which model do you prefer, and why?
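To make "estimate by least squares" concrete for both models in problem 4, here is a minimal NumPy sketch on made-up estimates. The values in `p_hat` are hypothetical, not computed from the data. The logistic fit uses a damped Gauss-Newton iteration, which is one of several reasonable ways to do nonlinear least squares and is chosen here only to keep the example dependency-free.

```python
import numpy as np

# Hypothetical estimates, made up for illustration (not from the real data).
k = np.arange(6)
p_hat = np.array([0.05, 0.08, 0.10, 0.16, 0.21, 0.30])

# Linear model p_k = a + b k: ordinary least squares.
b_lin, a_lin = np.polyfit(k, p_hat, 1)

def logistic(theta, k):
    a, b = theta
    return 1.0 / (1.0 + np.exp(-(a + b * k)))

def sse(theta):
    # Sum of squared errors between the estimates and the logistic curve.
    return np.sum((p_hat - logistic(theta, k)) ** 2)

# Starting values: a straight-line fit to the log-odds of p_hat.
b0, a0 = np.polyfit(k, np.log(p_hat / (1 - p_hat)), 1)
theta = np.array([a0, b0])

for _ in range(50):
    fit = logistic(theta, k)
    slope = fit * (1 - fit)                    # d(fit) / d(a + b k)
    J = np.column_stack([slope, slope * k])    # Jacobian w.r.t. (a, b)
    step = np.linalg.lstsq(J, p_hat - fit, rcond=None)[0]
    t = 1.0
    while t > 1e-8 and sse(theta + t * step) > sse(theta):
        t /= 2                                 # shrink until SSE does not rise
    if t <= 1e-8:
        break                                  # no improving step was found
    theta = theta + t * step

a_log, b_log = theta
print(a_lin, b_lin, a_log, b_log)
```

Plotting `p_hat` as points against `k`, together with `a_lin + b_lin * k` and `logistic(theta, k)` as curves, would give a single figure of the kind (4c) describes.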

For quibblers, pedants, and idle hands itching for work to do: The $p_k$ values from problem 3 aren’t all equally precise, because they come from different numbers of observations. Also, if each doctor with $k$ adoptee contacts is independently deciding whether or not to adopt with probability $p_k$, then the variance in the number of adoptees will depend on $p_k$. Say that the actual proportion who decide to adopt is $\hat{p}_k$. A little probability (exercise!) shows that in this situation, $E[\hat{p}_k] = p_k$, but that $\mathrm{Var}[\hat{p}_k] = p_k(1 - p_k)/n_k$, where $n_k$ is the number of doctors in that situation. (We estimate probabilities more precisely when they’re really extreme [close to 0 or 1], and/or we have lots of observations.) We can estimate that variance as $\hat{V}_k = \hat{p}_k(1 - \hat{p}_k)/n_k$. Find the $\hat{V}_k$, and then re-do the estimation in (4a) and (4b) where the squared error for $p_k$ is divided by $\hat{V}_k$. How much do the parameter estimates change? How much do the plotted curves in (4c) change?
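Dividing each squared error by its estimated variance is weighted least squares. The sketch below shows the idea for the linear model only, on made-up values; `p_hat` and `n_k` are hypothetical and not taken from the data. It relies on the documented behaviour of `np.polyfit`, whose `w` argument multiplies each residual before squaring.

```python
import numpy as np

# Hypothetical values, made up for illustration (not from the real data).
k = np.arange(6)
p_hat = np.array([0.05, 0.08, 0.10, 0.16, 0.21, 0.30])
n_k = np.array([900, 500, 300, 150, 60, 20])   # observations behind each p_hat

# Estimated variance of each proportion.
V_hat = p_hat * (1 - p_hat) / n_k

# np.polyfit multiplies each residual by w before squaring, so passing
# w = 1 / sqrt(V_hat) divides each squared error by V_hat.
b_w, a_w = np.polyfit(k, p_hat, 1, w=1 / np.sqrt(V_hat))

# Unweighted fit, for comparison.
b_u, a_u = np.polyfit(k, p_hat, 1)
print(a_u, b_u, a_w, b_w)
```

Two caveats apply to this sketch: an estimate of exactly 0 or 1 gives an estimated variance of 0 and hence an infinite weight, so such values need separate handling; and the logistic model can be weighted the same way by scaling each residual by the same factor inside the nonlinear fit.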
