
Big Data Methods

PC session 4

Part I: Regression trees

For this part of the exercise, use the data set “Carseats” from the package “ISLR”, a simulated data set containing sales of child car seats at 400 different stores. It is a data frame with 400 observations on the following 11 variables:

Sales: Unit sales (in thousands) at each location
CompPrice: Price charged by competitor at each location
Income: Community income level (in thousands of dollars)
Advertising: Local advertising budget for company at each location (in thousands of dollars)
Population: Population size in region (in thousands)
Price: Price company charges for car seats at each site
ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
Age: Average age of the local population
Education: Education level at each location
Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
US: A factor with levels No and Yes to indicate whether the store is in the US or not

Decision trees for prediction

1. Provide a histogram and summary statistics for the outcome variable “Sales”.

2. Define a training sample that contains 75% of the total sample. How many observations are in the training sample?
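A minimal sketch in R for items 1 and 2 (the seed and the exact sampling mechanics are my choices; only the 75% share is given):

```r
library(ISLR)   # provides the Carseats data
data(Carseats)

# 1. Histogram and summary statistics of the outcome
hist(Carseats$Sales, main = "Sales", xlab = "Unit sales (in thousands)")
summary(Carseats$Sales)

# 2. 75% training sample: 0.75 * 400 = 300 observations
set.seed(1)
train <- sample(nrow(Carseats), 0.75 * nrow(Carseats))
length(train)   # 300
```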

3. Run a regression tree in the training data to predict sales using the default values of the “tree” command. Create a plot of the tree indicating the number and percentage of observations in each leaf. Interpret the results.
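A sketch with the tree package; note that plot()/text() label leaves with the predicted mean of Sales, while per-leaf observation counts can be read from the printed tree object or summary():

```r
library(tree)

# 3. Regression tree with default settings on the training data
tree.fit <- tree(Sales ~ ., data = Carseats, subset = train)
summary(tree.fit)   # variables used, number of terminal nodes, deviance

plot(tree.fit)
text(tree.fit, pretty = 0)   # leaves show the predicted mean Sales
```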

4. Evaluate the performance of the tree in the test data based on the MSE.
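One way to compute the test MSE, reusing the split from above:

```r
# 4. Test MSE of the unpruned tree
Carseats.test <- Carseats[-train, ]
yhat <- predict(tree.fit, newdata = Carseats.test)
mean((yhat - Carseats.test$Sales)^2)
```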

5. Use k-fold cross-validation (k=10) to determine the optimal number of splits and minimize the MSE by setting FUN=prune.tree. Report the number of terminal nodes with the lowest cross-validation criterion.
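A sketch using cv.tree (the seed is arbitrary):

```r
# 5. 10-fold cross-validation over subtree sizes
set.seed(1)
cv.fit <- cv.tree(tree.fit, FUN = prune.tree, K = 10)
plot(cv.fit$size, cv.fit$dev, type = "b",
     xlab = "Terminal nodes", ylab = "CV deviance")
cv.fit$size[which.min(cv.fit$dev)]   # size with the lowest CV criterion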


6. Prune the tree in the training data to the optimal number of terminal nodes, plot the trained tree, and evaluate the performance of the tree in the test data based on the MSE.
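Pruning to the cross-validated size, for example:

```r
# 6. Prune to the CV-optimal size and re-evaluate on the test data
best <- cv.fit$size[which.min(cv.fit$dev)]
pruned <- prune.tree(tree.fit, best = best)
plot(pruned); text(pruned, pretty = 0)
mean((predict(pruned, newdata = Carseats.test) - Carseats.test$Sales)^2)
```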

7. Generate a dummy for Sales > 8, defined as a factor (i.e. qualitative) variable, to be used as the outcome variable in a classification tree. Merge the outcome with the rest of the data.
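For instance (the names High and Carseats2 and the Yes/No coding are my choices):

```r
# 7. Binary outcome: "Yes" if Sales > 8, merged back into the data
High <- factor(ifelse(Carseats$Sales > 8, "Yes", "No"))
Carseats2 <- data.frame(Carseats, High)
```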

8. Run a classification tree in the training data using the newly created factor variable as outcome and all remaining variables except Sales as predictors. Use k-fold cross-validation (k=10) to determine the optimal number of splits and minimize the classification error by setting FUN=prune.misclass.
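A sketch; with FUN = prune.misclass, cv.tree's $dev component counts cross-validated misclassifications:

```r
# 8. Classification tree, excluding Sales from the predictors
class.fit <- tree(High ~ . - Sales, data = Carseats2, subset = train)
set.seed(1)
cv.class <- cv.tree(class.fit, FUN = prune.misclass, K = 10)
cv.class$size[which.min(cv.class$dev)]   # optimal number of terminal nodes
```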

9. Prune the tree in the training data to the optimal number of terminal nodes. Plot the trained tree and evaluate the performance of the tree in the test data based on the classification error rate.
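For example:

```r
# 9. Prune and compute the test classification error rate
best.c <- cv.class$size[which.min(cv.class$dev)]
pruned.c <- prune.misclass(class.fit, best = best.c)
plot(pruned.c); text(pruned.c, pretty = 0)
yhat.c <- predict(pruned.c, newdata = Carseats2[-train, ], type = "class")
mean(yhat.c != Carseats2$High[-train])
```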

Part II: Random forests for prediction

10. Apply bagging with regression trees using 500 trees in the training data. Set mtry to the total number of predictors for bagging. Evaluate the performance of bagging in the test data based on the MSE.
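A sketch with the randomForest package:

```r
library(randomForest)

# 10. Bagging = random forest with mtry equal to the number of predictors
set.seed(1)
p <- ncol(Carseats) - 1   # 10 predictors (everything except Sales)
bag.fit <- randomForest(Sales ~ ., data = Carseats, subset = train,
                        mtry = p, ntree = 500)
mean((predict(bag.fit, newdata = Carseats.test) - Carseats.test$Sales)^2)
```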

11. Run a random forest with regression trees using 500 trees in the training data. Do not specify mtry in this exercise. Report the most important predictors. Evaluate the performance of the random forest in the test data based on the MSE. Use cross-validation to find the optimal number of predictors per split.
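A sketch; as a stand-in for explicit cross-validation, the loop below compares out-of-bag errors across mtry values, which randomForest reports in its $mse component:

```r
# 11. Random forest with the default mtry (p/3 for regression)
set.seed(1)
rf.fit <- randomForest(Sales ~ ., data = Carseats, subset = train,
                       ntree = 500, importance = TRUE)
importance(rf.fit)   # %IncMSE / IncNodePurity rank the predictors
varImpPlot(rf.fit)
mean((predict(rf.fit, newdata = Carseats.test) - Carseats.test$Sales)^2)

# Search over the number of predictors per split, judged by OOB error
oob.mse <- sapply(1:p, function(m)
  randomForest(Sales ~ ., data = Carseats, subset = train,
               mtry = m, ntree = 500)$mse[500])
which.min(oob.mse)
```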

12. Re-run the analysis with 5 predictors and compute the MSE.
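For example:

```r
# 12. Random forest with mtry = 5
rf5 <- randomForest(Sales ~ ., data = Carseats, subset = train,
                    mtry = 5, ntree = 500)
mean((predict(rf5, newdata = Carseats.test) - Carseats.test$Sales)^2)
```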

13. Use k-fold cross-validation to assess model accuracy in the test data based on random forests with regression trees. The number of folds (k) should be 5. Report the average of the MSEs across folds to assess the overall performance.
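One possible manual implementation (fold assignment and seed are my choices):

```r
# 13. 5-fold cross-validation of the random forest
set.seed(1)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(Carseats)))
cv.mse <- sapply(1:k, function(i) {
  fit <- randomForest(Sales ~ ., data = Carseats[fold != i, ], ntree = 500)
  mean((predict(fit, Carseats[fold == i, ]) - Carseats$Sales[fold == i])^2)
})
mean(cv.mse)   # average MSE across the 5 folds
```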

14. Run a random forest with classification trees (considering the binary outcome) using 500 trees in the training data. Evaluate the performance of the random forest in the test data based on the classification error rate.
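For example, reusing the factor outcome High from item 7:

```r
# 14. Classification random forest for the binary outcome
set.seed(1)
rf.class <- randomForest(High ~ . - Sales, data = Carseats2,
                         subset = train, ntree = 500)
yhat <- predict(rf.class, newdata = Carseats2[-train, ])
mean(yhat != Carseats2$High[-train])   # test classification error rate
```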


Part III: Random forests for causal analysis

15. Use the dataset “HMDA” (in the AER package). Find a description of the dataset below:

Outcome variable (y): Denial of mortgage (deny)
Regressor whose causal effect is of interest (d): Payments to income ratio (pirat)
Control variables (x):
  Housing expense to income ratio (hirat)
  Loan to value ratio (lvrat)
  Credit history: consumer payments (chist)
  Credit history: mortgage payment (mhist)
  Public bad credit record? (phist)
  1989 Massachusetts unemployment rate in applicant's industry (unemp)
  Is the individual self-employed? (selfemp)
  Was the individual denied mortgage insurance? (insurance)
  Is the unit a condominium? (condomin)
  Is the individual African-American? (afam)
  Is the individual single? (single)
  Does the individual have a high-school diploma? (hschool)

Edit the data for use with the causal_forest command:

(i) Generate numerical zero/one values (rather than factors) for binary variables.

(ii) Assign variable names that are easy to work with (y = outcome, d = payments to income ratio, x1 = first covariate, etc.).
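A sketch of the data preparation; treating chist and mhist (factors with numeric levels) as numeric scores is my choice, and other encodings are defensible:

```r
library(AER)   # provides HMDA
library(grf)   # provides causal_forest
data("HMDA")

# (i)/(ii): numeric 0/1 dummies and short names
y <- as.numeric(HMDA$deny == "yes")
d <- HMDA$pirat
x <- data.frame(
  x1  = HMDA$hirat,
  x2  = HMDA$lvrat,
  x3  = as.numeric(as.character(HMDA$chist)),
  x4  = as.numeric(as.character(HMDA$mhist)),
  x5  = as.numeric(HMDA$phist == "yes"),
  x6  = HMDA$unemp,
  x7  = as.numeric(HMDA$selfemp == "yes"),
  x8  = as.numeric(HMDA$insurance == "yes"),
  x9  = as.numeric(HMDA$condomin == "yes"),
  x10 = as.numeric(HMDA$afam == "yes"),
  x11 = as.numeric(HMDA$single == "yes"),
  x12 = as.numeric(HMDA$hschool == "yes")
)
```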

16. Train a model for estimating the causal effect of the payments to income ratio (= non-binary treatment) using the command causal_forest.
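A sketch with grf (the 75/25 split mirrors Part I and is my assumption; causal_forest accepts a continuous treatment W):

```r
# 16. Train a causal forest on the training observations
set.seed(1)
tr <- sample(length(y), 0.75 * length(y))
cf <- causal_forest(X = as.matrix(x[tr, ]), Y = y[tr], W = d[tr])
```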

17. Predict the conditional average causal effects of the payments to income ratio for each observation in the test data.

18. Visualize the distribution of the effects with a histogram.

19. Compute the t-statistics and p-values of the conditional effects for each prediction.
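A sketch covering items 17 to 19; the normal approximation for the p-values is my choice:

```r
# 17. Conditional effects for each test observation, with variance estimates
pred <- predict(cf, newdata = as.matrix(x[-tr, ]), estimate.variance = TRUE)
cate <- pred$predictions

# 18. Distribution of the effects
hist(cate, main = "Conditional effects of pirat", xlab = "Effect")

# 19. t-statistics and two-sided p-values per prediction
se <- sqrt(pred$variance.estimates)
tstat <- cate / se
pval <- 2 * pnorm(-abs(tstat))
```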

20. Plot the conditional effects for different values of the payments to income ratio.

21. Provide the correlation of the conditional effects and the payments to income ratio.
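For example:

```r
# 20. Effects against the treatment variable
plot(d[-tr], cate, xlab = "Payments to income ratio",
     ylab = "Conditional effect")

# 21. Correlation between the effects and the treatment
cor(cate, d[-tr])
```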

22. Provide the average marginal effect and calculate the p-value.

23. Provide the average marginal effect among high school graduates and calculate the p-value.

24. Predict the conditional effect at average values of the control variables and calculate the p-value.
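A sketch for items 22 to 24, using grf's average_treatment_effect (which for a continuous treatment estimates the average partial effect); the normal-approximation p-values are my choice:

```r
# 22. Average marginal effect and its p-value
ame <- average_treatment_effect(cf)
2 * pnorm(-abs(ame["estimate"] / ame["std.err"]))

# 23. Average marginal effect among high-school graduates
ame.hs <- average_treatment_effect(cf, subset = x$x12[tr] == 1)
2 * pnorm(-abs(ame.hs["estimate"] / ame.hs["std.err"]))

# 24. Conditional effect at the average of the controls
xbar <- matrix(colMeans(x), nrow = 1)
pred.bar <- predict(cf, newdata = xbar, estimate.variance = TRUE)
2 * pnorm(-abs(pred.bar$predictions / sqrt(pred.bar$variance.estimates)))
```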

