ECO220Y5Y: Quantitative Methods in Economics
Final Assignment
Replacement for Exam Assessment on Regression
1 Interactive Regression Exercise
1.1 Motivation
Econometrics is best understood by doing rather than by reading about what someone else has
done. There are difficult choices and many pitfalls in arriving at the ‘correct’ model. Sometimes
the existing theory underlying the relationships in your model seems a bit off and you could build a
much ‘better’ model by including a different set of variables or transforming them (this gets at the
internal validity of the model). Sometimes choosing the model with the best fit means making
decisions that are ideal only for your sample and would not apply well to data outside your sample
time period or group of individuals (this gets at the external validity of the model). As a researcher,
you need to trade off internal validity against external validity. As a result, econometrics
can sometimes feel more like an art than a science. However, you will be asked to follow a scientific
approach when making these model decisions and when justifying them. This
assignment requires that you make independent choices on specification, analyse the consequences
of these choices and adjust your choices to narrow in on a final model. You will be asked to justify
the model you have selected and then provide some feedback on its economic implications.
1.2 Overview & Data
The dependent variable for this interactive assignment is the Provincial Achievement Test (PAT)
score earned by students in an Alberta high school. The data set contains 70 observations, each
recording a PAT score and a number of possible causal factors, randomly drawn from a
pool of approximately 750 students over approximately one decade. The literature on PAT scores
indicates that scores are determined not only by ability and training but also by various socio-economic
factors. Please see the attached article by James Fallows, ‘The Tests and the Brightest: How Fair Are
the College Boards?’, for a summary of views in the literature on how SAT performance in the USA
might be impacted by various socio-economic factors (PAT scores and SAT scores should be similarly
determined). Measures of ability and training included here are the cumulative high school grade
point average (GPA) and participation in advanced placement math and English courses (APMATH
and APENG). Advanced placement courses may help students perform better on the PAT. This
data set also includes a number of dummy variables measuring qualitative socio-economic factors
such as a student’s gender (MALE), ethnicity (WHITE), and native language (ENG). The data set
also includes a dummy variable indicating whether or not a student has attended a PAT preparation
class (PREP). The data set includes a variable indicating the year (YEAR) in which the student’s PAT score
and other information were recorded. Finally, there are several variables created as the product of two
other variables.
Here is a detailed description of all variables in this assignment:
• PAT_i = the Provincial Achievement Test score of the ith student on a scale from 0 to 100
• GPA_i = the grade point average of the ith student on a scale from 0 to 5
• APMATH_i = a dummy variable equal to 1 if the ith student has taken AP Math, 0 otherwise
• APENG_i = a dummy variable equal to 1 if the ith student has taken AP English, 0 otherwise
• AP_i = a dummy variable equal to 1 if the ith student has taken AP Math, AP English, or both, 0 otherwise
• MALE_i = a dummy variable equal to 1 if the ith student is male, 0 if female
• WHITE_i = a dummy variable equal to 1 if the ith student is Caucasian, 0 otherwise
• ENG_i = a dummy variable equal to 1 if the ith student’s first language is English, 0 otherwise
• PREP_i = a dummy variable equal to 1 if the ith student has attended a PAT preparation course, 0 otherwise
• YEAR_i = the year in which the ith student took the Provincial Achievement Test, recorded from 2007 to 2018
• GPAMALE_i = (GPA_i)(MALE_i)
• GPAWHITE_i = (GPA_i)(WHITE_i)
• GPAENG_i = (GPA_i)(ENG_i)
• WHITEMALE_i = (WHITE_i)(MALE_i)
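The derived variables in this list follow mechanically from the raw ones. As a minimal sketch, assuming a hypothetical student record (the values below are invented for illustration and are not taken from the data set, and the function name is likewise an assumption), the combined AP dummy and the four interaction terms could be built like this in Python:

```python
# Hypothetical student record; values are invented for illustration only.
student = {"GPA": 3.6, "APMATH": 1, "APENG": 0, "MALE": 1, "WHITE": 0, "ENG": 1}


def add_derived_variables(row):
    """Return a copy of row with the AP dummy and the interaction terms added."""
    out = dict(row)
    # AP equals 1 if the student took AP Math, AP English, or both.
    out["AP"] = max(row["APMATH"], row["APENG"])
    # Each interaction term is the product of its two component variables.
    out["GPAMALE"] = row["GPA"] * row["MALE"]
    out["GPAWHITE"] = row["GPA"] * row["WHITE"]
    out["GPAENG"] = row["GPA"] * row["ENG"]
    out["WHITEMALE"] = row["WHITE"] * row["MALE"]
    return out


derived = add_derived_variables(student)
print(derived)
```

Because MALE, WHITE and ENG are dummies, each GPA interaction equals the student's GPA when the dummy is 1 and 0 otherwise.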
1.3 Summary Statistics
The means, standard deviations, and correlation coefficients for the variables in this assignment
are included below.
Means and Standard Deviations:
Correlation Coefficients:
2 Section A: Building a Model of PAT Scores
2.1 Choosing the best specification
In this section you will choose the specification you’d like to estimate from the list below, find the
regression number of that specification and then look at the regression results for your chosen specification
in the appendix at the end. You can base your initial decision on the literature provided
regarding potential discrimination in standardised testing design and also the summary statistics and
correlation coefficients for the variables. You should then decide whether you are satisfied with your model
selection based on the results. If you are not satisfied, you can use the information from the regression
you ran to decide how to adjust the specification. You can then repeat the process until you decide
on a final selection of the ‘best’ specification. Once you have decided on your preferred specification, you
will answer the questions found below the regression model options.
Regression Models:
1. Model 1: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + i
2. Model 2: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + i
3. Model 3: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4MALEi + i
4. Model 4: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4P REPi + i
5. Model 5: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4W HIT Ei + i
6. Model 6: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + β5MALEi + i
7. Model 7: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + β5P REPi + i
8. Model 8: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + β5W HIT Ei + i
9. Model 9: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4MALEi + β5P REPi + i
10. Model 10: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4MALEi + β5W HIT Ei + i
11. Model 11: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4P REPi + β5W HIT Ei + i
12. Model 12: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + β5MALEi + β6P REPi + i
13. Model 13: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + β5MALEi + β6W HIT Ei + i
14. Model 14: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + β5P REPi + β6W HIT Ei + i
15. Model 15: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4MALEi + β5P REPi + β6W HIT Ei + i
16. Model 16: P ATi = β0 + β1GP Ai + β2APMAT Hi + β3AP ENGi + β4ENGi + β5MALEi + β6P REPi
+ β7W HIT Ei + i
17. Model 17: P ATi = β0 + β1GP Ai + β2APi + β3P REPi + i
18. Model 18: P ATi = β0 + β1GP Ai + β2APi + β3P REPi + β4W HIT Ei + i
19. Model 19: P ATi = β0 + β1GP Ai + β2APi + β3ENGi + β4P REPi + β5W HIT Ei + i
20. Model 20: P ATi = β0 + β1GP Ai + β2APi + i
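Models 1 to 16 share the base regressors GPA, APMATH and APENG and add every possible subset of the four dummies ENG, MALE, PREP and WHITE, while Models 17 to 20 replace the two AP dummies with the combined AP dummy. As a sketch of that structure (the function name is hypothetical, and the code is only a lookup aid rather than part of the assignment), the following Python snippet enumerates the first sixteen specifications in the same order as the list:

```python
from itertools import combinations

BASE = ["GPA", "APMATH", "APENG"]
OPTIONAL = ["ENG", "MALE", "PREP", "WHITE"]


def enumerate_models():
    """Map each model number (1 to 16) to its list of regressors."""
    models = {}
    number = 1
    # Subsets are generated by size, then in the order the dummies are listed.
    for size in range(len(OPTIONAL) + 1):
        for extra in combinations(OPTIONAL, size):
            models[number] = BASE + list(extra)
            number += 1
    return models


models = enumerate_models()
for number, regressors in models.items():
    print(f"Model {number}: PAT ~ " + " + ".join(regressors))
```

Reading the list this way may make it easier to move between neighbouring specifications, since any two of these sixteen models differ only in which of the four dummies they include.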
Section A Questions:
1. Write out the estimated model for your preferred specification including coefficients and standard
errors.
2. Evaluate your estimation results with respect to their economic meaning, overall model fit, and the signs
and significance of the individual coefficients.
3. What specification problems (omitted variables, irrelevant variables, multicollinearity) might your
regression have? Why?
4. Do you have any suggestions for improving the model that were not available among the models
provided?
3 Section B: Correcting a Model of PAT Scores
3.1 Understanding and correcting issues
In this section you will assess the model you selected in the last section for heteroskedasticity and
serial correlation and determine the desired approach to interpret and correct for these issues. Based
on your chosen model from Section A, with its residual plot given in the Section A appendix, as well
as the scatter plots in the Section B appendix, answer the following questions. Provide a few
sentences to justify your answers.
Section B Questions:
1. Do you believe there might be a problem of heteroskedasticity in your chosen model? Do you believe
it is pure or impure?
2. Do you believe there might be a problem of serial correlation in your chosen model? Do you believe it
is pure or impure?
3. Based on the answers you gave to the two questions above, what would you suggest you do to improve
the estimated model and why?
4 Section C: Interpreting a Model of PAT Scores
4.1 Deciding what you can learn from the model
In this section you will assume that a professional econometrician ran two models (Models A and B) and
determined that the best specification is Model B based on underlying theory. It is not your job in this
case to question the model but rather to interpret the results. Based on the regression results for
Model B, answer all of the following questions, providing your rough work in calculations
and at least a few sentences to support your argument. Note that LNPAT is the natural log of PAT
scores. Both models are given in the appendix under Section C.
Section C Questions:
1. Calculate the 98% two-sided confidence interval for the coefficient on MALE. Interpret this coefficient
and explain what the confidence interval you calculated implies for your interpretation.
2. Test whether the absolute value of the coefficient on GPAWHITE is greater than the absolute value
of the coefficient on GPAENG. Explain the meaning of this test result in terms of PAT scores.
3. Draw and indicate the slope and intercept of the estimated models (lines of best fit) relating GPA to
the natural log of PAT scores for white males vs. non-white females conditional on them having taken
Advanced Placement classes and speaking English as their first language. Interpret the two estimated
lines in words.
4. Solve for the impact on PAT scores of a student having a GPA of 2 rather than a GPA of 0, given
they did not take AP courses, are non-white, male and do not speak English as a first language. Show
all your work in this calculation.
5. Based on inference using the results from Model B, but also taking into account both models, do you
believe there is potential evidence of discrimination/bias in the way PATs are designed or administered?
5 Section D: Working on a Model of PAT Scores in Stata
5.1 Show you can generate your own results using code
In this section you will indicate the code you would plan to use in Stata to achieve some basic tasks.
This will draw on the sort of knowledge contained in the labs, lectures, the data project and the Stata help
session, whose code you can refer back to. For each question below you
should provide some basic Stata code that could be run and would achieve the results requested.
There are often multiple correct ways to approach the coding, some more efficient than others, but
the only consideration will be whether the desired outcome is actually achieved. Note that you do not need to
actually run the code on a data set; just indicate what you believe to be a correct approach. You
can assume you already have the variables indicated in this assignment loaded and ready in your
Stata program.
Section D Questions:
1. Transform the GPA variable into a new variable measuring the natural log of GPA called LNGPA.
2. Run a regression of LNPAT on LNGPA.
3. Scatter LNPAT against LNGPA and display the line of best fit (linear regression line) for the model
you just estimated.
4. Calculate the residuals and create a new variable for them called RES.
5. Calculate the fitted values and create a new variable for them called YHAT.
6. Scatter the residuals (RES) against the fitted values (YHAT) to check for any issues.
7. Run a new regression of LNPAT on LNGPA, AP, MALE, and ENG.
8. At the 1% significance level, test whether the true coefficient on MALE could be equal to the true coefficient on ENG.
9. Test for specification error in the regression you ran.
10. Test for heteroskedasticity in the regression you ran.
6 Appendix:
6.1 Section A Estimated Models
Regression Model 1:
6.3 Section C Additional Model Estimations
Regression Model A:
Regression Model B: