Assignment 3 - Part 2
Version 2.1 with some typos corrected. Use this version of the assignment.
STA238
Winter 2020
Suppose Joe owns a pizza shop. We know it’s the favourite shop of at least one STA238 student, so this
question is of serious practical importance. Joe is concerned with the number of customers he serves during
the lunch hour. To study this, one day his trusted assistant records the time between successive customers.
Let Xi be the number of minutes until the i
th customer enters Joe’s shop, measured from when the previous
customer entered.
We model X1, . . . , Xn as independent and identically distributed random variables from the Exponential(λ0)
distribution, which is sometimes an appropriate model for waiting times. (Note that we are using here
an alternative parametrization of the Exponential distribution, which might be different than what you’ve
seen before; throughout this assignment work with the Exponential density given below.) The parameter λ
represents the average number of minutes Joe has to wait until the next customer enters his shop, measured
from the time the previous customer entered. λ = 2 would be 2 minutes per customer, λ = 0.1 would be 10
customers per minute, and so on. Joe wants to estimate the true value of λ, λ0. The exponential density is
fλ(xi) = λ−1exp(−xi/λ).
1. Show that the Maximum Likelihood Estimator (MLE) for λ in this model is λb = X¯.
2. The waiting times data are posted on the assignment page on Quercus in the file assignment3-waiting.csv.
Consider the following code. Describe in words, in full detail, the bootstrap algorithm that is implemented
in the code. Indicate the distribution that is being estimated by the bootstrap
distribution.
set.seed(7886)
waiting <- readr::read_csv(
# Include the path to where you saved the data on your computer.
file = "assignment3-waiting.csv",
col_names = TRUE,
col_types = "n"
)
glimpse(waiting)
## Observations: 30
## Variables: 1
## $ waitingtime <dbl> 0.281, 0.330, 0.708, 0.463, 3.372, 0.008, 0.105, 1...
n <- nrow(waiting)
B <- 1000 # Number of bootstrap samples to do
bootmle <- numeric(B)
for (b in 1:B) {
samp <- sample(waiting$waitingtime,n,replace = TRUE)
bootmle[b] <- mean(samp) - mean(waiting$waitingtime)
}
tibble(x = bootmle) %>%
ggplot(aes(x = x)) +
theme_minimal() +
geom_histogram(aes(y = ..density..),bins = 50,colour = "black",fill = "transparent")
3. Joe comes in, angry. His assistant recorded the times between successive customers, when really, what
he wanted was the number of customers that arrived each minute! Joe is furious at his hapless assistant.
Luckily his assistant is a statistics major, and knows that if X1, . . . , Xn are independent Exponential(λ)
waiting times in minutes, then the number of customers per minute Y1, . . . , Ym is an i.i.d. sample from a
Poissonλ−1
distribution. So the parameter representing the physical quantity of interest now is µ = λ−1.
µ is the average number of customers per minute. So µ = 1/2 would be one customer every two minutes and
µ = 2 would be one customer every 30 seconds, and so on.
Follow the steps below which will allow you to find an estimate of µ. First, you will need to convert the
observed waiting times into a dataset containing the count of the number of customers arriving each minute.
To do this, follow the following steps:
Step 1: Create a new dataframe with two new variables: totaltime, computed as the total time (continuous,
fractional number of minutes) that has passed when each successive customer enters the pizza
shop; and numberofminutes, which is the number of integer minutes that have passed (plus one) when
each customer enters the shop. You can use totaltime = cumsum(waitingtime), numberofminutes =
ceiling(totaltime), and remember you can create new variables in a dataframe with the mutate function
in the dplyr package.
Your new dataframe should look like this (the numbers should match!):
## # A tibble: 30 x 3
## waitingtime totaltime numberofminutes
## # ... with 20 more rows
Step 2: Now, add up the number of customers that arrived in each integer minute. You can
group_by(numberofminutes) and then count the number of customers using summarize(numberofcustomers
= n()). Your new dataframe should be called customers and should look like this (the numbers should
match!):
customers1
## # A tibble: 18 x 2
## numberofminutes numberofcustomers
Step 3: Finally, you have to add zeroes for minutes in which customers didn’t arrive. Here is the code to this.
Run it and make sure you have created the data frame below.
zeroes <- tibble(numberofminutes = 1:max(customers1$numberofminutes))
customers <- zeroes %>%
left_join(customers1,by = "numberofminutes") %>%
replace_na(list(numberofcustomers = 0))
customers
## # A tibble: 26 x 2
## numberofminutes numberofcustomers
## <dbl> <dbl>
## # ... with 16 more rows
The column numberofcustomers now represents independent realizations of random variables Y1, . . . , Ym
from the Poissonλ−10
distribution, where λ0 is the same λ0 from question 1.
4. (a) The density of a Poissonλ−1
random variable is
fλ(yj ) = λ−yj e−1/λyj !
Use this density to find the MLE for λ and the MLE for µ = λ−1. Hint: you might want to find the MLE for
µ first, and then apply the principle of invariance to find the MLE for λ.
(b) Find the MLE for µ = λ
−1 using the original sample of waiting times and the principle of invariance of
the MLE. Do you expect your answer to be exactly the same as your answer to part (a)? Why or why
not?
5. Use your answer from question 4 to create a similar plot to that given in question 2. That is, construct
a bootstrap estimate of the sampling distribution of λ−1 − λ−10
and plot a histogram of this distribution.
You might wish to rely heavily on the code given in question 2.
6. (a) For the random sample X1, . . . , Xn, it can be shown thatλˆ − λ0λˆ∼ N0,1n
Compute a 95% confidence interval for λ using this asymptotic distribution, using the observed values of the
random sample X1, . . . , Xn.
(b) For the random sample Y1, . . . , Ym, it can be shown that
µˆ − µ0√µˆ∼ N0,1m
Compute a 95% confidence interval for µ using this asymptotic distribution, using the observed values of the
random sample Y1, . . . , Ym.
(c) Compare your confidence intervals in part (a) part (b). Interpret both of them: that is, “we are 95%
confident that ________ is between ___ and ___.” Remember that µ0 = λ−1
0.
7. (a) Compute the likelihood ratio Λ based on the sample Y1, . . . , Ym and evaluate the claim that µ = 2,
that is, two customers enter the shop per minute. Is this claim supported by the data?
(b) Compute the likelihood ratio Λ based on the sample X1, . . . , Xn and evaluate the claim that λ = 1/2,
that is, one customer enters the shop every 30 seconds. Is this claim supported by the data?
8. Use a bootstrap procedure to estimate a p-value corresponding to the claim that µ = 2 based on
Y1, . . . , Ym. Is your value different than what you got in question 7 (a)? Say why you would expect
them to be the same OR explain why they are different. (You may wish to use the code from the
lectures for week 11 for this question.)
9. (a) Consider the observed values of the random sample of customers per minute, y1, . . . , ym. Suppose
we put a Gamma(a, b) prior distribution on µ = λ−1 with density
Derive the posterior distribution for µ|y1, . . . , ym. State the name of the distribution and give
expressions for its parameters.
4
(b) Compute a 95% posterior credible interval for µ. You may wish to use the pgamma function in R to
calculate cumulative probability from the Gamma distribution. Interpret your interval and compare it
to the confidence interval for µ that you obtained in question 6 (b). Comment on whether they are
different. Compare their interpretations, and state whether you would expect them to be the same and
why/why not. Use values of a = 2 and b = 1 for the prior parameters (this gives a prior with mean 2).
5
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。