Estimate the mean and variance of the Height variable using this sample. Are the

sample mean and the sample variance equal to the population mean and population

variance? Why/why not?

12. (5 points) Using the sample drawn before, we want to give a 95% confidence interval¹
for the population mean of the height using the formula

$$\left[\,\overline{\text{height}} - t_{1-\alpha/2,\,n-1}\sqrt{\operatorname{var}(\text{height})/n},\;\; \overline{\text{height}} + t_{1-\alpha/2,\,n-1}\sqrt{\operatorname{var}(\text{height})/n}\,\right],$$

where $\overline{\text{height}}$ is the sample mean, $\operatorname{var}(\text{height})$ is the sample variance, $n$ is the
effective sample size and $t_{1-\alpha/2,\,n-1}$ is the quantile of order $1-\alpha/2$ of the Student (or t)
distribution with $n-1$ degrees of freedom. Here, $\alpha = 0.05$. Is it possible to construct
such a 95% confidence interval here (check the assumptions of its construction)? If so,
compute this confidence interval and interpret it. Usually we do not know the population
mean, but in our case its value is known. Verify whether the population mean lies inside
the constructed confidence interval.
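The interval formula above can be checked numerically. The following is a minimal sketch in R, assuming a hypothetical simulated vector `height` (the real sample comes from the earlier steps of the exercise). It computes the interval by hand and compares it with the one reported by `t.test`.

```r
# Hypothetical sample standing in for the heights drawn earlier
set.seed(1)
height <- rnorm(30, mean = 170, sd = 10)

n <- length(height)
alpha <- 0.05

# Half-width: t quantile times the estimated standard error of the mean
half_width <- qt(1 - alpha / 2, df = n - 1) * sqrt(var(height) / n)
ci <- c(mean(height) - half_width, mean(height) + half_width)

# The one-sample t.test reports the same interval
ci_ttest <- as.numeric(t.test(height, conf.level = 1 - alpha)$conf.int)

print(ci)
print(ci_ttest)
```

The interval relies on the sample mean being approximately normal, so a QQ-plot of `height` is a reasonable check before interpreting it.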

Exercise 2 (25 points)

1. (5 points) The data frame Cars93 (from the MASS package) holds extensive information
on 93 cars on sale in the USA in 1993 (see the help file for the data and variable
descriptions). The MASS package is already installed; you should load it in your R
session. One of the variables, stored as a factor, is Type. Create a new data frame
called myCars93, in which the row names are the distinct values of Type. This data
frame contains 3 columns: the mean of Min.Price, the mean of Max.Price, and a third
column holding one- or two-character abbreviations of each of the car types. Print
this new data frame. It should look like this:

        mean.min.price mean.max.price abbreviation
Compact          15.69           20.7            c
Large            22.94           25.7            l
Midsize          24.11           30.3            m
Small             8.43           11.9            s
Sporty           16.86           22.0           sp
Van              16.20           22.0            v
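A data frame of this shape could be built as sketched below. This is one possible approach, assuming the MASS package is installed; the abbreviations are typed in by hand to follow the listing above and rely on the factor levels of Type being in alphabetical order.

```r
library(MASS)  # provides the Cars93 data frame

# Group means of the two price variables, one value per level of Type
mean_min <- tapply(Cars93$Min.Price, Cars93$Type, mean)
mean_max <- tapply(Cars93$Max.Price, Cars93$Type, mean)

myCars93 <- data.frame(
  mean.min.price = as.vector(mean_min),
  mean.max.price = as.vector(mean_max),
  # Hand-written abbreviations, in the order of levels(Cars93$Type)
  abbreviation   = c("c", "l", "m", "s", "sp", "v"),
  row.names      = names(mean_min)
)

print(myCars93, digits = 3)
```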

2. (5 points) The normal QQ-plot is a fancy way of checking whether a distribution looks
normal. A more primitive way is to check the rule of thumb that approximately 68% of
the data lie within 1 standard deviation of the mean, 95% within 2 standard deviations
and 99.7% within 3 standard deviations. Generate 100 random values from the Student
distribution with 25 degrees of freedom. Are your data consistent with the normal
distribution? Check this using a QQ-plot and the rule given above.
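The rule of thumb can be checked by counting how many standardized values fall within 1, 2 and 3 standard deviations. The sketch below is one way to do this; the seed and the variable names are arbitrary choices.

```r
set.seed(42)
x <- rt(100, df = 25)

# Distance of each value from the sample mean, in standard deviations
z <- abs(x - mean(x)) / sd(x)

# Observed proportions within 1, 2 and 3 standard deviations
props <- sapply(1:3, function(k) mean(z <= k))
names(props) <- paste0("within_", 1:3, "_sd")
print(props)

# Graphical check: points close to the line suggest normality
qqnorm(x)
qqline(x)
```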

¹ The meaning of the term "95% confidence level" is that, if confidence intervals are constructed across
many samples drawn from the same population under the same conditions, the proportion of such intervals
that contain the true value of the parameter (the population mean) will match the confidence level of 95%.

3. (5 points) Depending on the type of data, there are advantages to using the mean or
the median.
(a) Generate 200 random numbers from the Student distribution with two degrees of
freedom and compute their mean and median. Repeat this 100 times, generating 200
new random numbers each time, so that you obtain 100 means and 100 medians.
Plot the boxplot of all means and the boxplot of all medians in the same layout.
(b) Repeat the same computation (100 repetitions of 200 random numbers) for the
N(0, 1) distribution.
Based on these two cases, explain the advantages of using the mean or the median.
In each case, also provide a measure of the dispersion of the data.
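The two simulations can share one helper. The following is a minimal sketch; the helper name `sim` and the use of the interquartile range as the dispersion measure are assumptions, and other measures (for example the standard deviation) would work as well.

```r
set.seed(7)

# Draw `reps` samples of size `size` with the generator `rgen`,
# then return the mean and the median of each sample
sim <- function(rgen, reps = 100, size = 200) {
  samples <- replicate(reps, rgen(size))  # matrix: size rows, reps columns
  data.frame(mean = colMeans(samples), median = apply(samples, 2, median))
}

res_t    <- sim(function(k) rt(k, df = 2))  # heavy-tailed case
res_norm <- sim(rnorm)                      # N(0, 1) case

op <- par(mfrow = c(1, 2))
boxplot(res_t, main = "t, 2 df")
boxplot(res_norm, main = "N(0, 1)")
par(op)

# Dispersion of the 100 means and the 100 medians in each case
dispersion <- rbind(t2 = sapply(res_t, IQR), normal = sapply(res_norm, IQR))
print(dispersion)
```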

4. (10 points) The data set 'experiment.csv' contains the results of a test assessing the
level of proficiency in a foreign language. The participants were informed that they
were going to watch 6 scenes showing the events of a dramatic evening, and they were
asked to tell the story immediately after each scene. Afterwards the responses of the
participants were coded with respect to the presence ('yes'/'no') of some adverbs
(variable 'adverb'). The variable 'group' gives the participants' level of proficiency
in the foreign language (4 groups: 'GL1', 'GL2-B', 'GL2-C', 'FL1'). The participant id
is given by the variable 'speaker'. The data set contains 6 records (one for each scene)
for each participant, so the observations are not independent here. We are interested
in whether, on average, the groups with different levels of proficiency differ in their
use of the adverbs. Propose a statistical test, check its assumptions and perform it at
the 5% level. Comment on your result.
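One possible way to handle the repeated measures is to reduce the data to a single value per participant before comparing the groups. The sketch below illustrates that idea only: the data are simulated (hypothetical, 10 speakers per group), and the choice of a Kruskal-Wallis test on per-speaker proportions is an assumption, not the required answer.

```r
set.seed(3)

# Hypothetical stand-in for experiment.csv: 40 speakers, 6 scenes each
groups   <- c("GL1", "GL2-B", "GL2-C", "FL1")
speakers <- data.frame(speaker = 1:40, group = rep(groups, each = 10))
experiment <- speakers[rep(1:40, each = 6), ]
experiment$adverb <- sample(c("yes", "no"), nrow(experiment), replace = TRUE)

# Collapse the 6 dependent records into one proportion per speaker
experiment$used <- as.numeric(experiment$adverb == "yes")
per_speaker <- aggregate(used ~ speaker + group, data = experiment, FUN = mean)
per_speaker$group <- factor(per_speaker$group)

# Rank-based comparison of the four groups
kw <- kruskal.test(used ~ group, data = per_speaker)
print(kw)
```

After aggregation each participant contributes one observation, so the independence assumption of the test is more plausible than on the raw records.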

Exercise 3 (15 points)

1. (5 points) Write an R function to compute the sum of the maxima of pairs of independent
N(0, 1) random variables X and Y. To do this, generate n pairs (X, Y), compute the
maximum of each pair, and sum these maxima. The integer n should be the argument of
your function. Call this function for n = 100 and n = 1000.
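A vectorized version of this function might look as follows. The function name `sum_of_maxima` is a hypothetical choice; `pmax` takes the element-wise maximum of the two vectors.

```r
# Sum of max(X_i, Y_i) over n independent pairs of N(0, 1) variables
sum_of_maxima <- function(n) {
  x <- rnorm(n)
  y <- rnorm(n)
  sum(pmax(x, y))
}

set.seed(1)
s100  <- sum_of_maxima(100)
s1000 <- sum_of_maxima(1000)
print(c(n100 = s100, n1000 = s1000))
```

Since the expected maximum of two independent N(0, 1) variables is 1/sqrt(pi), roughly 0.564, the result should grow approximately like 0.564 n.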

2. (5 points) Suppose we have a matrix of 1s and 0s and want to create a vector as
follows: for each row of the matrix, the corresponding element of the vector will be
either 1 or 0, depending on whether the majority of the first d elements in that row
is 1 or 0. Write a function to create this vector. The function takes the matrix and d
as arguments. Call this function for the following matrix x and d = 3:
For this input, your function should return the vector 1, 1, 0.
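The majority rule can be written without an explicit loop by summing the first d columns of each row. In the sketch below the function name `majority_first_d` and the matrix `x_demo` are hypothetical; `x_demo` is a made-up illustration and is not the matrix of the exercise.

```r
# 1 if more than half of the first d entries of a row are 1, else 0
# (for even d, a tie yields 0)
majority_first_d <- function(x, d) {
  as.integer(rowSums(x[, 1:d, drop = FALSE]) > d / 2)
}

# Made-up 4 x 5 example matrix
x_demo <- matrix(c(1, 1, 0, 0, 0,
                   0, 0, 1, 1, 1,
                   1, 0, 1, 0, 0,
                   0, 1, 0, 1, 1),
                 nrow = 4, byrow = TRUE)

result <- majority_first_d(x_demo, 3)
print(result)  # 1 0 1 0
```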

3. (5 points) Write a function to visualize the approximation of the binomial distribution
by the normal distribution (plot the probability mass function of the binomial
distribution and add the density of the normal distribution). The approximation
generally improves as n increases (n should be at least 20) and is better when p is not
close to 0 or 1. The function takes as parameters the sample size n and the success
probability p of the binomial distribution. The normal distribution will have mean
$np$ and standard deviation $\sqrt{np(1-p)}$. Use this function for different sample
sizes n and different values of p to see the effect on the approximation. Take:
- n = 10, 30, 50, 100 and p = 0.25 in the same layout;
- n = 10, 30, 50, 100 and p = 0.75 in the same layout;
- n = 10, 30, 50, 100 and p = 0.5 in the same layout.
What do you observe?
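A plotting function of this kind could be sketched as below. The function name `plot_binom_normal` and the styling choices are assumptions; only the first of the three layouts is drawn, and the other two follow by changing `p`.

```r
# Binomial pmf as vertical bars with the approximating normal density on top
plot_binom_normal <- function(n, p) {
  k <- 0:n
  plot(k, dbinom(k, size = n, prob = p), type = "h",
       xlab = "k", ylab = "probability",
       main = paste0("n = ", n, ", p = ", p))
  curve(dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
        add = TRUE, col = "red")
  invisible(NULL)
}

# Four sample sizes in one 2 x 2 layout for p = 0.25
op <- par(mfrow = c(2, 2))
for (n in c(10, 30, 50, 100)) plot_binom_normal(n, 0.25)
par(op)
```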

Exercise 4 (30 points)

Read the file study.dat in R. The file contains a header. The columns are not fixed
width but are delimited by commas (sep=",") (open the file first to check this). The
data set contains measures obtained in an epidemiological study. Some of the variables
that were measured are:
- BMI: Body Mass Index (kg/m²)
- SMOKING: 0=no, 1=yes
- TCHOL: Total cholesterol (mg/dl)
- FEMALE: sex, 0=man, 1=woman
- CVD: Cardiovascular death, 0=no, 1=yes, missing if the cause of death wasn't cardiovascular.

a. (5 points) Check the normality of the BMI variable using three different graphical
tools. Comment on your results.
b. (5 points) Check the normality of the BMI variable using an appropriate statistical
test (the significance level is 5%). Explain your result.
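Parts a and b can be sketched together. The example below uses a simulated `bmi` vector (hypothetical, standing in for the BMI column of study.dat); the three graphical tools and the Shapiro-Wilk test are one reasonable selection, not the only valid one.

```r
# Hypothetical stand-in for the BMI column
set.seed(11)
bmi <- rnorm(200, mean = 26, sd = 4)

# Three graphical checks in one layout
op <- par(mfrow = c(1, 3))
hist(bmi, freq = FALSE, main = "Histogram")
curve(dnorm(x, mean = mean(bmi), sd = sd(bmi)), add = TRUE, col = "red")
qqnorm(bmi)
qqline(bmi)
boxplot(bmi, main = "Boxplot")
par(op)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw <- shapiro.test(bmi)
reject_normality <- sw$p.value < 0.05
print(sw)
```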

c. Compare the total cholesterol (mg/dl) of smokers and non-smokers among the patients
of this study who died of a cardiovascular disease:
1. (5 points) Draw boxplots (in the same layout) of the total cholesterol for smokers
and non-smokers to give a first insight, and comment on your result.
2. (5 points) Test for the equality of the TCHOL means in the two groups (smokers,
non-smokers) using a two-sided t-test at the 5% significance level, assuming
normality of TCHOL in both groups (the test assumes that the two groups/samples
come from two normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$,
respectively, under the null hypothesis). A parameter of the R function concerns
the equality of $\sigma_1^2$ and $\sigma_2^2$. Use the F-test to test the equality of
the variances (provided that the samples come from normal populations). Then, apply
the R function for a two-sample t-test. Comment on your results.
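The two-step procedure (F-test, then t-test) might be sketched as follows. The two cholesterol vectors are simulated and hypothetical; with the real data they would be the TCHOL values of smokers and non-smokers with CVD equal to 1.

```r
# Hypothetical TCHOL values for the two groups
set.seed(21)
tchol_smokers    <- rnorm(40, mean = 235, sd = 40)
tchol_nonsmokers <- rnorm(45, mean = 225, sd = 38)

# Step 1: F-test for the equality of the two variances
f_test    <- var.test(tchol_smokers, tchol_nonsmokers)
equal_var <- f_test$p.value >= 0.05

# Step 2: two-sided two-sample t-test; var.equal follows the F-test outcome
t_res <- t.test(tchol_smokers, tchol_nonsmokers,
                alternative = "two.sided", var.equal = equal_var)
print(f_test)
print(t_res)
```

With `var.equal = FALSE`, `t.test` uses the Welch approximation for the degrees of freedom.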

3. (5 points) All these tests assume normality of the data in the two groups/samples.
The two-sample Wilcoxon or Mann-Whitney test (which is a nonparametric test) only
assumes a common continuous distribution under the null hypothesis and tests whether
the median difference is 0 versus the median difference is not 0. Perform this test
and comment on your result. Does this result agree with the result of the t-test at
the 5% level?
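In R, the Mann-Whitney test is available through `wilcox.test` with two samples. The sketch below again uses simulated, hypothetical cholesterol values and compares the decision with that of a t-test at the 5% level.

```r
# Hypothetical TCHOL values for the two groups
set.seed(21)
tchol_smokers    <- rnorm(40, mean = 235, sd = 40)
tchol_nonsmokers <- rnorm(45, mean = 225, sd = 38)

# Two-sample Wilcoxon (Mann-Whitney) test, two-sided by default
w_res <- wilcox.test(tchol_smokers, tchol_nonsmokers)
t_res <- t.test(tchol_smokers, tchol_nonsmokers)

# Do the two tests lead to the same decision at the 5% level?
same_decision <- (w_res$p.value < 0.05) == (t_res$p.value < 0.05)
print(c(wilcoxon = w_res$p.value, t_test = t_res$p.value))
```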

4. (5 points) It is known that the t-test is sensitive to outliers (for this reason,
the Wilcoxon test is sometimes preferred). The variable TCHOL includes some outliers
in the smokers group (see the associated boxplot). Delete the largest outlier from
the data and perform the t-test again. What do you observe? Comment on your result.

Exercise 5 (40 points)

Consider the data given in the file 'restaurant.csv'. Some of the variables that were
measured are:
- y - Price (the price of a dinner in US$);
- x1 - Food (customer rating of the food, out of 30);
- x2 - Decor (customer rating of the decor, out of 30);
- x3 - Service (customer rating of the service, out of 30);
- x4 - East (dummy variable, 1/0 if the restaurant is east/west of Fifth Avenue,
New York).
We seek a linear regression model that predicts y.

1. (5 points) Start by graphically inspecting the data. Comment.

2. (5 points) Fit the regression model with all the x variables as predictors. State the
null and the alternative hypotheses of the overall F-test. Perform the overall F-test
at the 5% level. Comment on your result.
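The overall F-test is part of the `summary` of a fitted `lm` object. The sketch below uses simulated data with the same variable names (hypothetical values, not restaurant.csv) and extracts the F statistic and its p-value explicitly.

```r
# Hypothetical data with the variable names of the exercise
set.seed(5)
n <- 80
dat <- data.frame(x1 = rnorm(n, 20, 2), x2 = rnorm(n, 18, 3),
                  x3 = rnorm(n, 19, 2), x4 = rbinom(n, 1, 0.5))
dat$y <- -20 + 1.5 * dat$x1 + 1.9 * dat$x2 + 2 * dat$x4 + rnorm(n, sd = 5)

fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

# H0: all four slopes are 0; H1: at least one slope differs from 0
fs <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
p_value <- unname(pf(fs["value"], fs["numdf"], fs["dendf"],
                     lower.tail = FALSE))
reject_h0 <- p_value < 0.05
print(fs)
print(p_value)
```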

3. (5 points) Check whether the predictor variables are statistically significant at the
5% level. Comment on your result.

4. (10 points) Consider the model including only the predictors which are statistically
significant at the 5% level. Check the validity of the regression model using diagnostic
plots. If necessary, improve its goodness of fit in line with the assumptions of a
multiple linear regression model. Comment on your results.

5. (5 points) The quantities $h_{ii}$ represent the diagonal elements of the hat matrix. We
can show that the mean of $h_{ii}$, $i = 1, \ldots, n$, is $(p + 1)/n$, where p is the number of
predictors in your model and n is the number of observations. Large values of $h_{ii}$
may indicate that observation i has an important influence on the model (a data
point is influential if it strongly influences any part of a regression analysis, such as
the predicted responses, the estimated coefficients, or the hypothesis test results).
Consider the following rule of thumb: if $h_{ii} > 3(p + 1)/n$, the observation i should
be considered noteworthy. List the observations for which this rule is fulfilled and
check their influence on the model. Comment on your results.
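The leverage rule can be applied with `hatvalues`. The following sketch uses simulated, hypothetical data and a two-predictor model; with the real data, `fit` would be the model retained in the previous question.

```r
# Hypothetical data and model
set.seed(8)
n <- 60
dat <- data.frame(x1 = rnorm(n, 20, 2), x2 = rnorm(n, 18, 3))
dat$y <- -20 + 1.5 * dat$x1 + 1.9 * dat$x2 + rnorm(n, sd = 5)
fit <- lm(y ~ x1 + x2, data = dat)

h <- hatvalues(fit)           # diagonal elements h_ii of the hat matrix
p <- length(coef(fit)) - 1    # number of predictors (intercept excluded)
threshold <- 3 * (p + 1) / n
flagged <- which(h > threshold)
print(flagged)

# Compare the coefficients with and without the flagged observations
if (length(flagged) > 0) {
  fit_reduced <- lm(y ~ x1 + x2, data = dat[-flagged, ])
  print(cbind(all = coef(fit), reduced = coef(fit_reduced)))
}
```

The mean of `h` equals (p + 1)/n exactly, which gives a quick sanity check on the computation.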

6. (10 points) It is possible to construct models that include different subsets of
predictors. Install the package 'leaps' and use the function with the same name to
perform an exhaustive search for the best subset of the variables x for predicting y.
Use the adjusted R² as the criterion to compare different models. Provide the 'best'
model given by the leaps() function (we call this the initial 'best' model). We want to
check the stability of the initial 'best' model using the following method:
- draw, with replacement, a sample of observations of size equal to the number of
rows of the original data set; a new data set is obtained;
- run the leaps() function on the new data set and obtain a new 'best' model as
before;
- repeat the previous steps 1000 times and compute the proportion of times that
the initial 'best' model was provided by the leaps() function. Comment on your
result.
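The resampling scheme described above can be sketched as follows. To keep the example self-contained in base R, the exhaustive search is done here by a brute-force loop over all predictor subsets instead of the leaps() function that the exercise asks for; the data are simulated (hypothetical), and only 200 resamples are drawn to keep the run short, whereas the exercise specifies 1000.

```r
# Hypothetical data with the variable names of the exercise
set.seed(5)
n <- 80
dat <- data.frame(x1 = rnorm(n, 20, 2), x2 = rnorm(n, 18, 3),
                  x3 = rnorm(n, 19, 2), x4 = rbinom(n, 1, 0.5))
dat$y <- -20 + 1.5 * dat$x1 + 1.9 * dat$x2 + 2 * dat$x4 + rnorm(n, sd = 5)
predictors <- c("x1", "x2", "x3", "x4")

# Brute-force stand-in for leaps(): subset with the highest adjusted R^2
best_subset <- function(d) {
  best <- NULL
  best_adj <- -Inf
  for (k in seq_along(predictors)) {
    for (vars in combn(predictors, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = "y"), data = d)
      adj <- summary(fit)$adj.r.squared
      if (adj > best_adj) {
        best_adj <- adj
        best <- vars
      }
    }
  }
  paste(best, collapse = "+")
}

initial_best <- best_subset(dat)

# Resample rows with replacement and redo the search each time
B <- 200  # the exercise asks for 1000
boot_best <- replicate(
  B, best_subset(dat[sample(nrow(dat), replace = TRUE), ]))

stability <- mean(boot_best == initial_best)
print(initial_best)
print(stability)
```

A proportion well below 1 would suggest that the selected subset depends noticeably on the particular sample.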
