Problem Set 1

Econometrics Review

EC 421: Introduction to Econometrics

Due before noon (11:59am) on July 1st, 2020 (on canvas)

To make grading slightly easier, please include all of your R code at the end of your Word doc with your written answers.

OBJECTIVE: This problem set has three purposes: (1) reinforce the econometrics topics we reviewed in class; (2) build your R toolset; (3) build your intuition for newer topics like heteroskedasticity and consistency.

Problem 0: Inference

For this question I used data from a survey conducted by the Department of Education in 1980. A description of the data can be found here. We want to test the effect of an extra year of education on wages. For this question, we have $n = 4739$ observations and $k = 2$ parameters. This means that we have $n - k = 4737$ degrees of freedom. For a significance level of $\alpha = 5\%$, this gives us a t-critical value of $t_{0.025,\,4737} = 1.96$.

You can use this information throughout the rest of the problem.

We are interested in the following regression:

$\text{wage}_i = \beta_0 + \beta_1 \text{educ}_i + \varepsilon_i$

where $\text{wage}_i$ is the individual's hourly wage and $\text{educ}_i$ is the individual's years of education. This regression yields the following parameter estimates:

where standard errors for each parameter are given below in parentheses.

0a. Conduct the appropriate statistical test to determine whether or not education has a statistically significant impact on wages. Write out all steps and be clear with your conclusion.

0b. Now write out the formula for the standard error of $\hat{\beta}_1$. Is the standard error increasing or decreasing in the sample size $n$? Next, write out the formula for the test statistic you calculated in 0a. Is the test statistic increasing or decreasing in the sample size? Lastly, use your answers to this question to determine whether the probability that you reject the null hypothesis is increasing or decreasing in the sample size. Hint: You don't have to write out this probability explicitly. Just explain the intuition behind what the test statistic is telling you and how this helps you answer the question.

0c. Use the information provided, combined with the regression output, to construct a 95% confidence interval for the parameter $\beta_1$. Write out the steps you took to get to the lower and upper bounds. Provide a careful interpretation of what this confidence interval tells you.

0d. Now suppose we think we omitted an important variable: gender. State the two conditions this variable must meet (in the context of this example) for it to cause omitted variables bias. Would increasing the sample size (working with "big data") alleviate the issues caused by omitting gender from this regression?

0e. Luckily, our data contains information on whether individuals in our data are male or female. We now include two indicators in our regression, one for male and one for female, and drop the intercept. We have the following coefficient estimates and standard errors.

You don't need to calculate the next test (I have not given you enough information to do so), but write out how you would use this model to statistically test the null hypothesis that wages for males and females are not different from each other. Write out each step.
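The mechanics of a two-sided t-test and a 95% confidence interval can be sketched in R as follows. The slope estimate and standard error below are made-up placeholders, not the estimates from this problem; only the sample size, parameter count, and 5% significance level come from the problem text.

```r
# Hypothetical numbers for illustration only (NOT the estimates from this problem)
b1_hat <- 0.50   # assumed slope estimate
se_b1  <- 0.10   # assumed standard error of the slope

# Values stated in the problem
n  <- 4739
k  <- 2
df <- n - k      # degrees of freedom

# Two-sided test of H0: beta1 = 0 at the 5% level
t_stat  <- (b1_hat - 0) / se_b1
t_crit  <- qt(1 - 0.05 / 2, df = df)                      # roughly 1.96
reject  <- abs(t_stat) > t_crit
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical value * standard error
ci <- c(lower = b1_hat - t_crit * se_b1,
        upper = b1_hat + t_crit * se_b1)

print(c(t_stat = t_stat, t_crit = t_crit, p_value = p_value))
print(ci)
```

Swapping in the actual estimate and standard error from the regression output gives the numbers needed for 0a and 0c; the interpretation still has to be written out in words.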

Problem 1: Bias and variance

1a. Throughout this course, we will use the OLS estimator $\hat{\beta}$ to estimate $\beta$. Explain what it means for $\hat{\beta}$ to be biased for $\beta$.

Figure 1

Note: This figure shows the distributions of three estimators (A, B, and C) that each estimate the unknown parameter . E[A]= , E[B]= , E[C]=

1b. Which of the estimators in Figure 1 (above) are unbiased? Hint: There may be more than one.

1c. Which of the estimators in Figure 1 (above) has the minimum variance?

1d. Which of the estimators in Figure 1 (above) is the best (minimum variance) unbiased estimator?

1e. Suppose we want to estimate the effect of advertising on sales. Explain what bias would mean in this context.

1f. What does the term "standard error" mean?

1g. What does it mean for an estimator to be more efficient than another estimator? Of the unbiased estimators, which one is efficient?
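One way to build intuition for bias and standard errors is to simulate an estimator's sampling distribution. The sketch below is a Monte Carlo exercise on simulated data; the true slope, the sample size, and the deliberately shifted "estimator B" are all assumptions chosen for illustration and have nothing to do with Figure 1.

```r
set.seed(421)

beta_true <- 2      # assumed true slope
n_sims    <- 2000   # number of simulated samples
n         <- 50     # observations per sample

est_a <- numeric(n_sims)
est_b <- numeric(n_sims)

for (s in seq_len(n_sims)) {
  x <- rnorm(n)
  y <- 1 + beta_true * x + rnorm(n)
  slope <- unname(coef(lm(y ~ x))[2])
  est_a[s] <- slope         # OLS slope
  est_b[s] <- slope + 0.5   # hypothetical estimator that is shifted by 0.5
}

# Bias: average estimate across samples minus the true value
bias_a <- mean(est_a) - beta_true
bias_b <- mean(est_b) - beta_true

# The standard deviation of the sampling distribution is what a
# standard error tries to estimate from a single sample
sd_a <- sd(est_a)

print(c(bias_a = bias_a, bias_b = bias_b, sd_a = sd_a))
```

In this setup the two estimators have the same spread, but only the first is centered on the true slope.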

Problem 2: Getting Started with R

Problems 2 - 6 will use data from the 2018 American Community Survey, which I downloaded from IPUMS. You can find this data on Canvas.

2a. Load packages. You will probably want to load the tidyverse and here packages, and maybe some others as well.

2b. Load the data. The data can be found on canvas. To accomplish this, use the read_csv() command.

2c. Check your dataset. How many observations and variables do you have? Hint: Try dim(), ncol(), and nrow().
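The load-and-check workflow can be sketched as below. To keep the example self-contained, it writes a tiny made-up CSV to a temporary file and reads it back with base R's read.csv(); in the problem set you would instead call read_csv() (from the tidyverse) on the file downloaded from Canvas. The file contents and the object name ps_df are hypothetical.

```r
# Write a tiny, made-up CSV so the example runs on its own
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(hh_size = c(2, 4, 1), hh_income = c(3.0, 7.5, 1.2)),
  tmp,
  row.names = FALSE
)

# Base R reader used here to avoid package dependencies;
# with the tidyverse loaded, read_csv(tmp) plays the same role
ps_df <- read.csv(tmp)

# Check the dimensions of the dataset
dim(ps_df)    # rows, then columns
nrow(ps_df)   # number of observations
ncol(ps_df)   # number of variables
```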

Problem 3: Getting to know your data

3a. Plot a histogram of household income (hh_income) using ggplot2.

Remember: the hh_income variable is measured in tens of thousands of dollars (meaning a value of 3 means the household's income is $30,000).

This link provides a few good examples of how to create a histogram using ggplot2.

3b. What are the mean and median levels of household income? Based upon this answer and the previous

histogram, is household income (fairly) evenly distributed or is it skewed? Explain.
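A minimal sketch of the histogram and the mean/median comparison, using simulated right-skewed data rather than the ACS file. The data frame sim_df and its log-normal distribution are assumptions for illustration; the plotting step only runs if ggplot2 is installed.

```r
set.seed(421)

# Simulated household income in $10,000s (log-normal, so right-skewed by design)
sim_df <- data.frame(hh_income = rlnorm(1000, meanlog = 1.5, sdlog = 0.8))

# A mean well above the median is a sign of right skew
mean_income   <- mean(sim_df$hh_income)
median_income <- median(sim_df$hh_income)
print(c(mean = mean_income, median = median_income))

# Histogram with ggplot2 (skipped if the package is not available)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sim_df, aes(x = hh_income)) +
    geom_histogram(bins = 30) +
    labs(x = "Household income ($10,000s)", y = "Number of households")
  # print(p) displays the plot
}
```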

3c. Run a regression summarizing the relationship between household income and household size. Interpret the results of the regression: e.g., tell me what the coefficients mean and comment on their statistical significance.

3d. Explain why you chose the specification that you did in the previous question.

Was it linear, log-linear, log-log?

What was the outcome variable?

What was the explanatory variable?

Why did you make these choices?

Problem 4: Regression Refresher

4a. Regress average commute time (time_commuting) on household income (hh_income). Interpret the coefficient and comment on its statistical significance.

4b. Regress the log of average commute time on household income. Interpret the coefficient and comment on its statistical significance.

4c. Regress the log of average commute time on the log of household income. Interpret the coefficient and comment on its statistical significance.

4d. If you had to pick one of the above specifications to show your boss at work, which one would you pick? Why? (There is no right answer to this question; I just want you to start thinking about model specification.)
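The three specifications differ only in which variables are logged, and that changes how the slope is read. The sketch below fits all three on simulated data; the variables x and y and the assumed log-log relationship with an elasticity of 0.3 are illustrative, not taken from the ACS data.

```r
set.seed(421)
n <- 2000

# Simulated data: the true relationship is log-log with elasticity 0.3
x <- rlnorm(n, meanlog = 1, sdlog = 0.5)
y <- exp(1 + 0.3 * log(x) + rnorm(n, sd = 0.2))

fit_lin    <- lm(y ~ x)             # linear:     levels on levels
fit_loglin <- lm(log(y) ~ x)        # log-linear: log outcome on levels
fit_loglog <- lm(log(y) ~ log(x))   # log-log:    log outcome on log regressor

b_lin    <- unname(coef(fit_lin)["x"])
b_loglin <- unname(coef(fit_loglin)["x"])
b_loglog <- unname(coef(fit_loglog)["log(x)"])

# Linear:     a one-unit increase in x goes with a b_lin-unit change in y
# Log-linear: a one-unit increase in x goes with roughly a (100 * b_loglin)% change in y
# Log-log:    a 1% increase in x goes with roughly a b_loglog% change in y
print(c(linear = b_lin, log_linear = b_loglin, log_log = b_loglog))
```

summary(fit_loglog) reports the standard errors, t statistics, and p-values needed to comment on statistical significance.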

Problem 5: Multiple Linear Regression

We will now add some covariates to our regression model.

5a. Regress average commute time on household income and the share of individuals in the household who are of non-white ethnicities (hh_share_nonwhite). Interpret the coefficients and comment on their statistical significance.

Also compare your results to 4a. Has anything changed?

5b. Regress average commute time on the indicator variable for whether a household moved in the last year (i_moved). Interpret the coefficients and comment on their statistical significance.

5c. Add the share of the household that represents a non-white ethnicity (hh_share_nonwhite) to the regression from 5b. Note: Your outcome variable is still average household commute time, but you should now have two explanatory variables. Interpret the coefficients and comment on their statistical significance.

5d. Did adding this second explanatory variable change the coefficient of the first variable at all? What does that tell you? Explain your answer.

5e. One variable that we potentially omitted from our regression is an indicator for whether the individual lives in an urban or rural area. Does this variable (which we don't have) meet the criteria for an omitted variable? Specifically, state both conditions it needs to meet for us to have classic omitted variables bias. Sign the bias on hh_income that results from omitting urban/rural status.
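How a coefficient moves when a correlated variable is left out can be checked on simulated data. In the sketch below, x1, x2, and every coefficient are assumptions chosen for illustration; x2 plays the role of the omitted variable and nothing here uses the ACS data.

```r
set.seed(421)
n <- 5000

# x2 affects y and is correlated with x1
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

fit_long  <- lm(y ~ x1 + x2)   # includes x2
fit_short <- lm(y ~ x1)        # omits x2

b1_long  <- unname(coef(fit_long)["x1"])
b1_short <- unname(coef(fit_short)["x1"])
b2_long  <- unname(coef(fit_long)["x2"])

# Omitted-variable bias formula: beta2 * Cov(x1, x2) / Var(x1)
implied_bias <- b2_long * cov(x1, x2) / var(x1)

print(c(b1_long = b1_long, b1_short = b1_short,
        difference = b1_short - b1_long, implied_bias = implied_bias))
```

The sign of the bias is the sign of (effect of the omitted variable on the outcome) times (its correlation with the included regressor), which is the logic needed to sign a bias without running anything.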

Problem 6: Heteroskedasticity

6a. Suppose we are interested in the relationship between a household's housing costs and its time spent commuting. Plot a scatter plot using ggplot2 with housing cost (cost_housing) on the y axis and commute time (time_commuting) on the x axis. Make sure to label your axes.

This link provides an example if you need help.
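A labeled scatter plot in ggplot2 might look like the sketch below. The data are simulated with an error spread that grows with commute time (an assumption made purely so the plot has a visible fan shape); the column names mirror the problem's variables, but the values are made up.

```r
set.seed(421)

# Simulated data: error spread grows with commute time
plot_df <- data.frame(time_commuting = runif(500, 0, 90))
plot_df$cost_housing <- 800 + 5 * plot_df$time_commuting +
  rnorm(500, sd = 10 + 3 * plot_df$time_commuting)

# Scatter plot with labeled axes (skipped if ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_scatter <- ggplot(plot_df, aes(x = time_commuting, y = cost_housing)) +
    geom_point(alpha = 0.4) +
    labs(x = "Commute time", y = "Housing cost")
  # print(p_scatter) displays the plot
}
```

A vertical spread of points that widens or narrows along the x axis is the visual cue for heteroskedasticity.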

6b. Based on your plot in 6a, if we regress cost_housing on time_commuting, do you think we would have an issue with heteroskedasticity? Explain your answer.

6c. What issues can heteroskedasticity cause? (Hint: there are at least two main issues.)

6d. Time for a regression. Regress cost_housing on time_commuting and hh_income. Report your results: interpret the coefficients and comment on their statistical significance.

Be careful with your language here.

Remember: the hh_income variable is measured in tens of thousands of dollars (meaning a value of 3 means the household's income is $30,000).

6e. Use the residuals from your regression in 6d to conduct a Breusch-Pagan test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Justify your answer. Note: I will post an additional video that will help you write the code for this question. There is also sample code in the slides.
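One common way to run a Breusch-Pagan test by hand is the LM version: regress the squared residuals on the regressors and compare n times the auxiliary R-squared to a chi-squared distribution. The sketch below does this on simulated data with a single regressor; the data-generating process is an assumption, and this is not necessarily the exact code from the slides or the video.

```r
set.seed(421)
n <- 1000

# Simulated data where the error standard deviation grows with x
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

# Step 1: main regression and squared residuals
fit <- lm(y ~ x)
e2  <- resid(fit)^2

# Step 2: auxiliary regression of squared residuals on the regressors
aux <- lm(e2 ~ x)
r2  <- summary(aux)$r.squared

# Step 3: LM statistic = n * R^2, chi-squared under H0 (homoskedasticity)
# with degrees of freedom equal to the number of regressors in the auxiliary model
lm_stat <- n * r2
k_aux   <- 1
p_val   <- pchisq(lm_stat, df = k_aux, lower.tail = FALSE)

print(c(lm_stat = lm_stat, p_value = p_val))
```

With two regressors, as in 6d, both go into the auxiliary regression and the degrees of freedom become 2.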

6f. Now conduct a Goldfeld-Quandt test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Here are some hints:

We are still interested in the same regression (regressing the cost of housing on commute time and household income).

Sort the dataset on time_commuting. This can be done with the arrange() function.

Create two groups for the GQ test by using the first 8,000 and the last 8,000 observations (after sorting on commute time). The head() and tail() functions will help here.

When you construct the GQ stat, put the larger SSE value in the numerator.
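The hints above can be sketched on simulated data as follows. The group size of 300, the single regressor, and the data-generating process are assumptions for illustration; base R's order() stands in for arrange() so the example needs no packages.

```r
set.seed(421)
n <- 1000

# Simulated data where the error standard deviation grows with x
df_sim   <- data.frame(x = runif(n, 1, 10))
df_sim$y <- 2 + 3 * df_sim$x + rnorm(n, sd = df_sim$x)

# Sort on the variable suspected of driving the heteroskedasticity
# (dplyr's arrange(df_sim, x) gives the same ordering)
sorted <- df_sim[order(df_sim$x), ]

# Take the first and last n_group observations
n_group <- 300
g1 <- head(sorted, n_group)
g2 <- tail(sorted, n_group)

# Run the regression separately in each group and collect the SSEs
sse1 <- sum(resid(lm(y ~ x, data = g1))^2)
sse2 <- sum(resid(lm(y ~ x, data = g2))^2)

# GQ statistic: larger SSE in the numerator; F distribution under H0
gq_stat <- max(sse1, sse2) / min(sse1, sse2)
df_each <- n_group - 2   # observations minus estimated parameters in each group
p_val   <- pf(gq_stat, df1 = df_each, df2 = df_each, lower.tail = FALSE)

print(c(gq_stat = gq_stat, p_value = p_val))
```

Because both groups have the same size and the same number of parameters, the numerator and denominator degrees of freedom are equal here.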

6g. Use the lm_robust() command from the estimatr package to calculate heteroskedasticity-robust standard errors. How do these standard errors compare to the plain OLS standard errors you previously found?

Hint: lm_robust(y ~ x, data = some_df, se_type = "HC2") will calculate heteroskedasticity-robust standard errors.

6h. Why did your coefficients remain the same in 6g, even though your standard errors changed?
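To see what a robust variance estimator changes and what it leaves alone, the sketch below computes HC2 standard errors by hand in base R using the textbook sandwich formula. It is run on simulated data and does not call estimatr, so treat it as an illustration of the idea rather than a reproduction of lm_robust()'s output.

```r
set.seed(421)
n <- 1000

# Simulated data where the error standard deviation grows with x
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)   # n x 2 design matrix (intercept, x)
e   <- resid(fit)
h   <- hatvalues(fit)      # leverage of each observation

# Sandwich estimator: (X'X)^-1 [X' diag(e^2 / (1 - h)) X] (X'X)^-1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * (e / sqrt(1 - h)))
vcov_hc2 <- bread %*% meat %*% bread

se_hc2 <- sqrt(diag(vcov_hc2))     # HC2 standard errors
se_ols <- sqrt(diag(vcov(fit)))    # conventional OLS standard errors

# The coefficients come from the same lm() fit either way;
# only the variance estimate differs
print(coef(fit))
print(rbind(OLS = se_ols, HC2 = se_hc2))
```

The point estimates are computed once, from the same least-squares fit, and the robust correction only replaces the variance formula applied afterwards.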

Problem 7: Unbiasedness and consistency

Throughout this course, we will use the OLS estimator $\hat{\beta}$ to estimate $\beta$. We will continue to discuss situations in which the estimator (or other estimators) is (1) unbiased or (2) consistent.

7a. What is the formal (mathematical) definition of bias?

7b. Why do we care if the OLS estimator (or any estimator) is biased?

7c. What does it mean for an estimator to be consistent?

7d. True/False: Unbiasedness is a property for finite-sized samples, while consistency refers to an estimator as sample sizes approach infinity.

7e. Which of the following two estimators would you choose? Explain your reasoning.

Estimator A is unbiased and inconsistent.

Estimator B is biased and consistent.
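The difference between unbiasedness and consistency shows up clearly in a small simulation. The two estimators of a population mean below are hypothetical constructions chosen to be (A) unbiased but inconsistent and (B) biased but consistent; the true mean and the sample sizes are likewise assumptions.

```r
set.seed(421)

mu           <- 5                  # assumed true mean
n_sims       <- 2000               # simulated samples per sample size
sample_sizes <- c(10, 100, 1000)

results <- sapply(sample_sizes, function(n) {
  draws <- matrix(rnorm(n_sims * n, mean = mu), nrow = n_sims)

  est_a <- draws[, 1]               # A: first observation only (unbiased, inconsistent)
  est_b <- rowMeans(draws) + 1 / n  # B: sample mean plus 1/n (biased, consistent)

  c(bias_a = mean(est_a) - mu, sd_a = sd(est_a),
    bias_b = mean(est_b) - mu, sd_b = sd(est_b))
})
colnames(results) <- paste0("n=", sample_sizes)

print(round(results, 3))
```

As n grows, estimator A's spread stays put while estimator B's bias and spread both shrink toward zero.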

Description of variables and names

Variable            Description
fips                County FIPS code
hh_size             Household size (number of people)
hh_income           Household total income in $10,000s
cost_housing        Household's total reported cost of housing
n_vehicles          Household's number of vehicles
hh_share_nonwhite   Share of household members identifying as non-white ethnicities
i_renter            Binary indicator for whether any household members are renters
i_moved             Binary indicator for whether a household member moved in the prior year
i_foodstamp         Binary indicator for whether any household member participates in food stamps
i_smartphone        Binary indicator for whether a household member owns a smartphone
i_internet          Binary indicator for whether the household has access to the internet
time_commuting      Average time spent commuting per day by household member

In general, I've tried to stick with a naming convention. Variables that begin with i_ denote binary indicator variables (taking on the value of 0 or 1). Variables that begin with n_ are numeric variables.
