Date: 2019-09-13 11:12

Assignment 1

MAST90083 Computational Statistics and Data Mining

Due time: 5PM, Monday September 16th

You must submit your report via LMS

1 Data Analysis

Gross domestic product is a standard measure of the size of an economy; it’s the total value

of all goods and services bought and sold in a country over the course of a year. It’s not a

perfect measure of prosperity, but it is a very common one, and many important questions

in economics turn on what leads GDP to grow faster or slower. One common idea is that

poorer economies, those with lower initial GDPs, should grow faster than richer ones.

The reasoning behind this catching up is that poor economies can copy technologies and

procedures from richer ones, but already-developed countries can only grow as technology

advances. A second, separate idea is that countries can boost their growth rate by undervaluing

their currency, making the goods and services they export cheaper. Our dataset

“uval.csv” contains the following variables:

• Country, in a three-letter code.

• Year (in five-year increments).

• Per-capita GDP, in dollars per person per year.

• Average percentage growth rate in GDP over the next five years.

• An index of currency under-valuation. The index is 0 if the currency is neither over- nor

under-valued, positive if under-valued, negative if it is over-valued.

Note that not all countries have data for all years. However, there are no missing values in

the data table.
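
As a minimal sketch of the kind of fit the questions below ask for, the following R code regresses growth on the under-valuation index and log GDP and extracts the coefficients with their standard errors. It runs on simulated data, not on uval.csv: the column names (country, year, gdp, underval, growth) and every simulated value are assumptions, since the actual file headers are not given here, so the printed numbers say nothing about the real economies.

```r
# Synthetic stand-in for uval.csv. Column names are hypothetical.
set.seed(1)
n <- 200
uval_sim <- data.frame(
  country  = factor(sample(c("AAA", "BBB", "CCC", "DDD"), n, replace = TRUE)),
  year     = sample(seq(1950, 2000, by = 5), n, replace = TRUE),
  gdp      = exp(rnorm(n, mean = 8, sd = 1)),  # per-capita GDP, dollars
  underval = rnorm(n, mean = 0, sd = 0.4)      # 0 = neither over- nor under-valued
)
# Simulated response; the coefficients 0.002 and 0.01 are arbitrary choices.
uval_sim$growth <- 0.02 + 0.002 * log(uval_sim$gdp) +
  0.01 * uval_sim$underval + rnorm(n, sd = 0.03)

# Linear regression of growth on under-valuation and log GDP.
fit <- lm(growth ~ underval + log(gdp), data = uval_sim)

# Estimates and standard errors are two columns of the summary table.
coef_table <- summary(fit)$coefficients[, c("Estimate", "Std. Error")]
print(coef_table)
```

With the real data, the same two columns of summary(fit)$coefficients would be the quantities to report, provided the formula is adapted to the actual column names.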

1. Linearly regress the growth rate on the under-valuation index and the log of GDP.

Report the coefficients and their standard errors. Do the coefficients support the

idea of catching up? Do they support the idea that under-valuing a currency boosts

economic growth?

2. Repeat the linear regression, but add the country and the year as covariates. Use

factor(year), not year, in the regression formula.

(a) Report the coefficients for log GDP and undervaluation, and their standard errors.

(b) Explain why it is more appropriate to use factor(year) in the formula than just

year.

(c) Plot the coefficients on year versus time.

(d) Does this expanded model support the idea of catching up? Of undervaluation

boosting growth?

3. Does adding in year and country as covariates improve the predictive ability of a linear

model which includes log GDP and under-valuation?

(a) What are the R^2 and the adjusted R^2 of the two models?

(b) Use leave-one-out cross-validation to find the mean squared errors of the two

models. Which one actually predicts better, and by how much?

(c) Explain why using 5-fold cross-validation would be hard here.

4. Kernel regression. Use kernel regression, as implemented in the np package, to nonparametrically

regress growth on log GDP, under-valuation, country, and year (treating

year as a categorical variable). Hint: read chapter four of Shalizi carefully. In particular,

try setting tol to about 10^-3 and ftol to about 10^-4 in the npreg command,

and allow several minutes for it to run.

(a) Give the coefficients of the kernel regression, or explain why you cannot.

(b) Plot the predicted values of the kernel regression, for each country and year,

against the predicted values of the linear model.

(c) Plot the residuals of the kernel regression against its predicted values. Should

these points be scattered around a flat line, if the model is right? Are they?

(d) The npreg function reports a cross-validated estimate of the mean squared error

for the model it fits. What is that? Does the kernel regression predict better or

worse than the linear model with the same variables?
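
For least-squares fits, leave-one-out cross-validation does not require refitting the model n times: the held-out residual for observation i equals e_i / (1 - h_ii), where e_i is the ordinary residual and h_ii the leverage. The sketch below applies this identity to a small and an expanded linear model and checks it against brute-force refitting. The balanced panel, its column names, and all simulated values are assumptions made for illustration. The identity needs h_ii < 1, which fails for any observation that is the only one at its factor level.

```r
set.seed(2)
# Hypothetical balanced panel: 8 countries x 11 five-year periods.
panel <- expand.grid(country = factor(paste0("C", 1:8)),
                     year = seq(1950, 2000, by = 5))
n <- nrow(panel)
panel$gdp <- exp(rnorm(n, 8, 1))
panel$underval <- rnorm(n, 0, 0.4)
panel$growth <- 0.02 + 0.01 * panel$underval + rnorm(n, sd = 0.03)

small <- lm(growth ~ underval + log(gdp), data = panel)
big   <- lm(growth ~ underval + log(gdp) + country + factor(year),
            data = panel)

# Leave-one-out MSE from a single fit, via e_i / (1 - h_ii).
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Brute-force check for the small model: refit without row i,
# then predict row i.
brute <- mean(sapply(seq_len(n), function(i) {
  fit_i <- lm(growth ~ underval + log(gdp), data = panel[-i, ])
  (panel$growth[i] - predict(fit_i, newdata = panel[i, ]))^2
}))

print(c(small = loocv_mse(small), big = loocv_mse(big), small_brute = brute))

# Year effects relative to the baseline year (1950 here), which could be
# plotted against seq(1955, 2000, by = 5).
year_coefs <- coef(big)[grep("^factor\\(year\\)", names(coef(big)))]
print(year_coefs)
```

Because factor(year) gives each period its own coefficient, year_coefs holds one estimate per non-baseline year rather than a single slope.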

2 Kernel regression and varying smoothness

Starter code for this problem is in starter.R. That code will generate a data set to be used

for this problem, and will also provide a true mean function µ(x). The resulting data frame

has an x column (your predictor) and a y column (your response).

1. Plot y versus x. Overlay the true mean function µ(x) using the curve function in R.

What do you notice for x < 4π and x > 4π?

2. Using the np library in R, fit a kernel regression on each of the following datasets:

(a) Only those data points with x < 4π.

(b) Only those data points with x > 4π.

(c) All the data points.

For each of these regressions, what is the optimal bandwidth? How does the optimal

bandwidth for the overall data set compare to the optimal bandwidth for each of the

halves?

3. For each of the three selected bandwidths, make a plot showing:

• The true mean µ(x).

• The data points.

• The kernel regression predictions, with the bandwidth specified to be the selected

bandwidth.

• The 95% confidence band for the regression curve µ using resampling of residuals.

• The 95% confidence band for the regression curve µ using resampling of cases.

The result should be three plots, each tuned to one of the selected bandwidths. Give

these plots clear titles to distinguish them.

4. How do these three plots differ? In particular, how well do the regressions trained on

the left and right halves do on each half of the data set? How well does the bandwidth

fit on the overall data set do on each half? (Be specific about the types of problems

that occur.) What lesson might this teach about functions of varying smoothness and

kernel regression, if any?
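
The bandwidth comparison in this problem can be sketched without the np package, using a hand-written Nadaraya-Watson estimator with a Gaussian kernel and a grid search over the leave-one-out cross-validation score. This is a stand-in technique, not the npreg procedure, and the mean function mu below is a hypothetical example of varying smoothness, not the one defined in starter.R. The grid limits, sample size, and noise level are likewise assumptions.

```r
set.seed(3)
# Hypothetical mean function: smooth left of 4*pi, rapidly varying right of it.
mu <- function(x) ifelse(x < 4 * pi, sin(x / 2), sin(4 * x))
n <- 300
x <- sort(runif(n, 0, 8 * pi))
y <- mu(x) + rnorm(n, sd = 0.3)

# Nadaraya-Watson estimate at points x0, Gaussian kernel, bandwidth h.
nw_fit <- function(x0, x, y, h) {
  w <- outer(x0, x, function(a, b) dnorm((a - b) / h))
  as.vector(w %*% y) / rowSums(w)
}

# Leave-one-out CV score: each point's own weight is set to zero
# before it is predicted from the remaining points.
loocv <- function(h, x, y) {
  w <- outer(x, x, function(a, b) dnorm((a - b) / h))
  diag(w) <- 0
  mean((y - as.vector(w %*% y) / rowSums(w))^2)
}

# Grid search over log-spaced candidate bandwidths.
h_grid <- exp(seq(log(0.05), log(3), length.out = 40))
best_h <- function(x, y) {
  scores <- sapply(h_grid, function(h) loocv(h, x, y))
  h_grid[which.min(scores)]
}

left <- x < 4 * pi
h_left  <- best_h(x[left], y[left])
h_right <- best_h(x[!left], y[!left])
h_all   <- best_h(x, y)
print(c(left = h_left, right = h_right, all = h_all))
```

A single global bandwidth has to compromise between the two halves, which is the tension the plots in this problem are meant to expose. Predictions for a chosen bandwidth come from nw_fit(x0, x, y, h) on a grid of x0 values.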

3 Theoretical questions

1. Exercise 1.2 in Shalizi

2. Exercise 1.4 in Shalizi

3. Exercise 7.4 in ESL
