Date: 2019-06-30 10:03

Take Home Exam

Using R for Economics and Statistics

Due: Monday, July 8, 11:59 PM

Your final work should be in .Rmd form.

Please note the due date.

Exercise 1 (Computation Time)

[12 points] For this exercise we will create data via simulation, then assess how well certain methods perform.

Use the code below to create a train and test dataset.



library(mlbench)

library(caret)

sim_trn = mlbench.spirals(n = 2500, cycles = 1.5, sd = 0.125)

sim_trn = data.frame(sim_trn$x, class = as.factor(sim_trn$classes))

sim_tst = mlbench.spirals(n = 10000, cycles = 1.5, sd = 0.125)

sim_tst = data.frame(sim_tst$x, class = as.factor(sim_tst$classes))

The training data is plotted below, with colors indicating the class variable, which is the response.

Before proceeding further, set a seed equal to your UIN.

uin = 123456789

set.seed(uin)


We’ll use the following to define 5-fold cross-validation for use with train() from caret.


cv_5 = trainControl(method = "cv", number = 5)

We now tune two models with train(). First, a logistic regression using glm. (This actually isn’t “tuned” as there are no parameters to be tuned, but we use train() to perform cross-validation.) Second, we tune a single decision tree using rpart.

We store the results in sim_glm_cv and sim_tree_cv respectively, but we also wrap both function calls with system.time() in order to record how long the tuning process takes for each method.

glm_cv_time = system.time({
  sim_glm_cv = train(class ~ ., data = sim_trn,
                     trControl = cv_5, method = "glm")
})

tree_cv_time = system.time({
  sim_tree_cv = train(class ~ ., data = sim_trn,
                      trControl = cv_5, method = "rpart")
})

We see that both methods are tuned via cross-validation in a similar amount of time.


## elapsed

## 0.98


## elapsed

## 1.25
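As a minimal sketch of how elapsed times like those above can be read off: system.time() returns a named vector of timings, and the "elapsed" entry is the wall-clock time in seconds. The workload and the names toy_time, toy_result and toy_elapsed below are placeholders for illustration, not part of the exercise.

```r
# Toy workload standing in for a call to train(); placeholder only.
toy_time = system.time({
  toy_result = sum(sqrt(1:1e6))
})

# system.time() reports user, system, and elapsed times; keep the wall-clock one.
toy_elapsed = toy_time["elapsed"]
toy_elapsed
```

The same indexing applies to glm_cv_time and tree_cv_time when filling in the summary table.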



Repeat the above analysis using a random forest, twice. The first time use 5-fold cross-validation. (This is

how we had been using random forests before we understood random forests.) The second time, tune the

model using OOB samples. We only have two predictors here, so, for both, use the following tuning grid.

rf_grid = expand.grid(mtry = c(1, 2))
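A minimal sketch of the two resampling specifications, assuming caret is installed. trainControl(method = "oob") asks caret to use out-of-bag estimates, which caret supports for random forests and a few other bagging-based methods. The object name oob is an illustrative choice, not something the exercise prescribes.

```r
library(caret)

# 5-fold cross-validation, as defined earlier in this exercise.
cv_5 = trainControl(method = "cv", number = 5)

# Out-of-bag resampling; the object name `oob` is an assumption.
oob = trainControl(method = "oob")

# Tuning grid from the exercise: two predictors, so mtry is 1 or 2.
rf_grid = expand.grid(mtry = c(1, 2))
```

Either control object can be passed as trControl to train(), together with tuneGrid = rf_grid and method = "rf".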

Create a table summarizing the results of these four models (Logistic with CV, Tree with CV, RF with OOB, RF with CV). Report:

Chosen value of tuning parameter (If applicable)

Elapsed tuning time

Resampled (CV or OOB) Accuracy

Test Accuracy

Exercise 2 (Predicting Baseball Salaries)

[12 points] For this question we will predict the Salary of Hitters. (Hitters is also the name of the

dataset.) We first remove the missing data:


library(ISLR)

Hitters = na.omit(Hitters)

After changing uin to your UIN, use the following code to test-train split the data.

uin = 123456789

set.seed(uin)


hit_idx = createDataPartition(Hitters$Salary, p = 0.6, list = FALSE)

hit_trn = Hitters[hit_idx,]

hit_tst = Hitters[-hit_idx,]

Do the following:

Tune a boosted tree model using the following tuning grid and 5-fold cross-validation.

gbm_grid = expand.grid(interaction.depth = c(1, 2),

n.trees = c(500, 1000, 1500),

shrinkage = c(0.001, 0.01, 0.1),

n.minobsinnode = 10)

Tune a random forest using OOB resampling and all possible values of mtry.

Create a table summarizing the results of three models:

Tuned boosted tree model

Tuned random forest model

Bagged tree model

For each, report:

Resampled RMSE


Exercise 3 (Transforming the Response)

[5 points] Continue with the data from Exercise 2. People always suggest log transforming the response,

Salary, before fitting a random forest. Is this necessary? Re-tune a random forest as you did in Exercise 2,

except with a log transformed response. Report test RMSE for both the untransformed and transformed

model on the original scale of the response variable.
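A hedged sketch of what "on the original scale" means here: a model fit to log(Salary) predicts on the log scale, so predictions are exponentiated before the RMSE is computed against the untransformed response. The data below are simulated, and a linear model stands in for the random forest. All object names are placeholders.

```r
# Simulated positive, right-skewed response; placeholder data only.
set.seed(1)
toy = data.frame(x = runif(200))
toy$y = exp(1 + 2 * toy$x + rnorm(200, sd = 0.3))
toy_trn = toy[1:120, ]
toy_tst = toy[121:200, ]

# Stand-in model fit to log(y); the exercise uses a random forest instead.
fit_log = lm(log(y) ~ x, data = toy_trn)

# Predictions are on the log scale, so exponentiate before computing RMSE.
pred_orig_scale = exp(predict(fit_log, newdata = toy_tst))
rmse_log_model = sqrt(mean((toy_tst$y - pred_orig_scale) ^ 2))
rmse_log_model
```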



[Figure: histogram with vertical axis "Percent of Total" and horizontal axis from 0 to 2500]

Exercise 4 (Concept Checks)

[1 point each] Answer the following questions based on your results from the three exercises.


(a) Compare the time taken to tune each model. Is the difference between the OOB and CV result for the

random forest similar to what you would have expected?

(b) Compare the tuned value of mtry for each of the random forests tuned. Do they choose the same model?

(c) Compare the test accuracy of each of the four procedures considered. Briefly explain these results.


(d) Report the tuned value of mtry for the random forest.

(e) Create a plot that shows the tuning results for the tuning of the boosted tree model.

(f) Create a plot of the variable importance for the tuned random forest.

(g) Create a plot of the variable importance for the tuned boosted tree model.

(h) According to the random forest, what are the three most important predictors?

(i) According to the boosted model, what are the three most important predictors?



(j) Based on these results, do you think the transformation was necessary?

Exercise 5 (Neural Network)

[11 points]

Neural networks have always been one of the most fascinating machine learning models in my opinion, not

only because of the fancy backpropagation algorithm, but also because of their complexity (think of deep

learning with many hidden layers) and structure inspired by the brain. In this exercise you are required to fit

a simple neural network using the neuralnet package and fit a linear model as a comparison.

We are going to use the Boston dataset in the MASS package. The Boston dataset is a collection of data about

housing values in the suburbs of Boston. Our goal is to predict the median value of owner-occupied homes

(medv) using all the other continuous variables available.



library(MASS)

data <- Boston

First we need to check that no datapoint is missing, otherwise we need to fix the dataset.

apply(data,2,function(x) sum(is.na(x)))

## crim zn indus chas nox rm age dis rad

## 0 0 0 0 0 0 0 0 0

## tax ptratio black lstat medv

## 0 0 0 0 0

There is no missing data, good. We first randomly split the data into a training set (75% of the total sample) and a test set:

index <- sample(1:nrow(data),round(0.75*nrow(data)))

train <- data[index,]

test <- data[-index,]

Then do the following:

Fit a linear regression model, use it to predict the test set, and compute the RMSE (Root Mean Squared Error). Note that it is better to use the glm() function instead of lm(); this will become useful later when cross-validating the linear model.
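For reference, a minimal sketch of the RMSE computation. The helper name rmse and the numbers in the example are made up for illustration.

```r
# RMSE: square root of the mean squared difference between observed and predicted.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Tiny worked example: errors are 1, -1, 2, so the mean squared error is 2.
rmse(actual = c(3, 5, 8), predicted = c(2, 6, 6))  # sqrt(2), about 1.414
```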

Before fitting a neural network, some preparation needs to be done. Neural networks are not that easy to train and tune. As a first step, we are going to address data preprocessing. It is good practice to normalize your data before training a neural network. We chose to use the min-max method and scale the data to the interval [0, 1]. Scaling to the interval [0, 1] or [-1, 1] usually tends to give better results. We therefore scale and split the data before moving on:

maxs <- apply(data, 2, max)

mins <- apply(data, 2, min)

scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))

train_ <- scaled[index,]

test_ <- scaled[-index,]
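To make the min-max step concrete, here is a small sketch on a made-up data frame. It shows that each column lands in [0, 1] and that the transform can be inverted, which matters later because a network trained on scaled data predicts medv on the scaled range. The object names toy, toy_scaled and a_back are placeholders.

```r
# Made-up two-column data frame; placeholder values only.
toy <- data.frame(a = c(2, 4, 6, 10), b = c(100, 150, 125, 200))

maxs <- apply(toy, 2, max)
mins <- apply(toy, 2, min)

# (x - min) / (max - min) maps every column onto [0, 1].
toy_scaled <- as.data.frame(scale(toy, center = mins, scale = maxs - mins))
range(toy_scaled$a)

# Inverting the transform recovers the original units for one column.
a_back <- toy_scaled$a * (maxs["a"] - mins["a"]) + mins["a"]
all.equal(unname(a_back), toy$a)
```

The same inversion, using the maximum and minimum of medv, would put scaled predictions back into the original units before computing RMSE.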


There is no fixed rule as to how many layers and neurons to use, although there are several more or less accepted rules of thumb. Usually, if at all necessary, one hidden layer is enough for a vast number of applications. As far as the number of neurons is concerned, it should be between the input layer size and the output layer size, usually 2/3 of the input size. Since this is a toy example, we are going to use 2 hidden layers with this configuration: 13 : 5 : 3 : 1. The input layer has 13 inputs, the two hidden layers have 5 and 3 neurons, and the output layer has, of course, a single output since we are doing regression.

Fit the neural network using the neuralnet package, then plot the results.

Predict medv using the neural network and compare the RMSE with the result from the linear regression.
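One hedged hint for the fitting step: some versions of neuralnet reportedly do not accept the medv ~ . shorthand, so a safe option is to write the formula out from the column names. The sketch below only builds the formula and the hidden layer specification; the names n, predictors, f and hidden_layers are illustrative.

```r
# Column names as listed in the missing-value check above.
n <- c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad",
       "tax", "ptratio", "black", "lstat", "medv")

# Build "medv ~ crim + zn + ..." explicitly from the predictor names.
predictors <- n[n != "medv"]
f <- as.formula(paste("medv ~", paste(predictors, collapse = " + ")))
f

# Hidden layer sizes for the 13 : 5 : 3 : 1 configuration described above.
hidden_layers <- c(5, 3)
```

These could then be passed as neuralnet(f, data = train_, hidden = hidden_layers, linear.output = TRUE), where linear.output = TRUE requests a regression output rather than a classification one.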



