Date: 2025-03-06 10:45

MS6711 Data Mining

Homework 2

Instructions

This homework contains both coding and non-coding questions. Questions 1 and 2 are about concepts; Questions 3 to 6 are about coding. Please submit two files:

1. One Word or PDF document containing your answers and plots for ALL questions, without coding details.

2. One Jupyter notebook containing your code.

Problem 1 [20 points]

We perform best subset, forward stepwise and backward stepwise selection on the same dataset with p

predictors. For each approach, we obtain p + 1 models containing 0, 1, 2, · · · , p predictors. Explain your

answer.

1. Which of the three models with k predictors has the smallest training RSS?

2. Which of the three models with k predictors has the smallest test RSS? (best subset, forward, backward, or cannot determine?)

3. True or False: The predictors in the k-variable model identified by forward stepwise are a subset of

the predictors in the (k + 1)-variable model identified by forward stepwise selection.

4. True or False: The predictors in the k-variable model identified by best subset are a subset of the

predictors in the (k + 1)-variable model identified by best subset selection.

5. True or False: The lasso, relative to OLS, is less flexible and hence will give improved prediction

accuracy when its increase in bias is less than its decrease in variance.

Problem 2 [20 points]

Suppose we estimate the lasso by minimizing

$$\|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

for a particular value of λ. For parts 1 to 4, indicate which of (a) to (e) is correct and explain your answer.

1. As we increase λ from 0, the training RSS will

(a) Increase initially, and then eventually start decreasing in an inverted U shape.

(b) Decrease initially, and then eventually start increasing in a U shape.

(c) Steadily increase.

(d) Steadily decrease.

(e) Remain constant.

2. Repeat 1. for test RSS.

3. Repeat 1. for variance.

4. Repeat 1. for (squared) bias.
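
As a minimal sketch of the criterion being minimized, the snippet below evaluates the lasso objective for a given coefficient vector. The function name `lasso_objective` and the toy arrays are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """Return ||y - X beta||_2^2 + lam * ||beta||_1 (hypothetical helper)."""
    residual = y - X @ beta
    rss = float(residual @ residual)            # squared L2 norm of the residual
    l1_penalty = float(np.sum(np.abs(beta)))    # L1 norm of the coefficients
    return rss + lam * l1_penalty

# Toy data (assumed for illustration): beta reproduces y exactly, so RSS is 0
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

print(lasso_objective(X, y, beta, lam=0.0))  # 0.0
print(lasso_objective(X, y, beta, lam=0.5))  # 1.5
```

With λ = 0 the criterion is just the training RSS; any λ > 0 adds a cost proportional to the size of the coefficients.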

Problem 3 [20 points]

These data record the level of atmospheric ozone concentration from eight daily meteorological measurements made in the Los Angeles basin in 1976. We have the 330 complete cases. We want to find climate/weather factors that impact ozone readings. Ozone is a hazardous byproduct of burning fossil fuels and can harm lung function. The variables in the data set for this problem are:

Variable name   Definition
ozone           Long Maximum Ozone
vh              Vandenberg 500 mb Height
wind            Wind speed (mph)
humidity        Humidity (%)
temp            Sandburg AFB Temperature
ibh             Inversion Base Height
dpg             Daggett Pressure Gradient
ibt             Inversion Base Temperature
vis             Visibility (miles)
doy             Day of the Year

[Note: I recommend using R for this question, since Python does not have a standard package for forward/backward selection. See the code example on Canvas. Alternatively, you may use the sample Python code I provided.]

1. Report the results of a linear regression using all variables. Note that ozone is the response variable to predict. Which variables are significant?

2. Report the selected variables using the following model selection approaches.

(a) All subset selection.

(b) Forward stepwise selection.

(c) Backward stepwise selection.

3. Compare the outcome of these methods with the significant variables found in the full linear regression in question 1.

4. Other transformations of the covariates might also be important. What happens if you do all subset selection using both the original variables and their squares? That is, for every variable X, include both X and X^2 in the linear regression model for all subset selection.
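
For readers working in Python, forward stepwise selection can be sketched with NumPy alone. This is a rough illustration on synthetic data, not the sample code mentioned in the note: the helper names `rss` and `forward_stepwise`, the random data, and the use of training RSS as the step criterion are all assumptions made here.

```python
import numpy as np

def rss(X, y, cols):
    """Training RSS of an OLS fit (with intercept) on the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Return the selected column indices for each model size 0, 1, ..., p."""
    p = X.shape[1]
    selected, path = [], [[]]
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        # Add the single variable that lowers the training RSS the most
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected = selected + [best]
        path.append(list(selected))
    return path

# Synthetic data (assumed): column 2 carries the strongest signal, then column 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 2] + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

path = forward_stepwise(X, y)
for k, cols in enumerate(path):
    print(k, cols, round(rss(X, y, cols), 2))
```

The sketch only produces one candidate model per size; picking among sizes still needs a criterion such as Cp, BIC, adjusted R^2, or cross-validation. Squared terms could be appended beforehand with `np.hstack([X, X**2])`.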

Problem 4 [20 points]

In this exercise, we will predict the number of applications received using the other variables in the College

data set.

Private       Public/private school indicator
Apps          Number of applications received
Accept        Number of applicants accepted
Enroll        Number of new students enrolled
Top10perc     New students from top 10% of high school class
Top25perc     New students from top 25% of high school class
F.Undergrad   Number of full-time undergraduates
P.Undergrad   Number of part-time undergraduates
Outstate      Out-of-state tuition
Room.Board    Room and board costs
Books         Estimated book costs
Personal      Estimated personal spending
PhD           Percent of faculty with Ph.D.
Terminal      Percent of faculty with terminal degree
S.F.Ratio     Student/faculty ratio
perc.alumni   Percent of alumni who donate
Expend        Instructional expenditure per student
Grad.Rate     Graduation rate

1. Split the data set into a training set and a test set.

2. Fit a linear regression model using OLS on the training set, and report the test error obtained.

3. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test

error obtained.

4. Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error

obtained, along with the number of non-zero coefficient estimates.

5. Fit a PCR model on the training set, with the number of components M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

6. Fit a PLS model on the training set, with the number of components chosen by cross-validation. Report the test error obtained, along with the number of components selected by cross-validation.
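
The split / cross-validate / test workflow in the steps above might look roughly like the following for ridge regression. It is a NumPy-only sketch on synthetic data: the helper names `ridge_fit` and `cv_mse`, the λ grid, the 5 folds, and the 80/40 split are assumptions for illustration. On the College data the predictors would normally be standardized first.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

def cv_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE over k folds for a given lam."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # all indices not in this fold
        b0, b = ridge_fit(X[train], y[train], lam)
        pred = b0 + X[fold] @ b
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data (assumed), split into training and test sets
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=120)
X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

# Choose lam by cross-validation on the training set only
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda lam: cv_mse(X_train, y_train, lam))

# Refit on the full training set and report the test error once
b0, b = ridge_fit(X_train, y_train, best_lam)
test_mse = float(np.mean((y_test - (b0 + X_test @ b)) ** 2))
print(best_lam, test_mse)
```

The same pattern (tune on training folds, evaluate once on the held-out test set) applies when the tuning parameter is the number of PCR or PLS components instead of λ.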

Problem 5 [20 points]

We will now try to predict per capita crime rate in the Boston data set.

crim      per capita crime rate by town.
zn        proportion of residential land zoned for lots over 25,000 sq.ft.
indus     proportion of non-retail business acres per town.
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox       nitrogen oxides concentration (parts per 10 million).
rm        average number of rooms per dwelling.
age       proportion of owner-occupied units built prior to 1940.
dis       weighted mean of distances to five Boston employment centres.
rad       index of accessibility to radial highways.
tax       full-value property-tax rate per $10,000.
ptratio   pupil-teacher ratio by town.
black     1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town.
lstat     lower status of the population (percent).
medv      median value of owner-occupied homes in $1000s.

1. Try out some of the regression methods explored in this chapter, such as best subset selection, the

lasso, ridge regression, PCR and partial least squares. Present and discuss results for the approaches

that you consider.

2. Propose a model (or set of models) that seems to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

3. Does your chosen model involve all of the features in the data set? Why or why not?

Problem 6 [20 points]

In a bike sharing system the process of obtaining membership, rental, and bike return is automated

via a network of kiosk locations throughout a city. In this problem, you will try to combine historical

usage patterns with weather data to forecast bike rental demand in the Capital Bikeshare program in

Washington, D.C.

You are provided hourly rental data collected from the Capital Bikeshare system spanning two years.

The file Bike train.csv, as the training set, contains data for the first 19 days of each month, while

Bike test.csv, as the test set, contains data from the 20th to the end of the month. The dataset includes

the following information:

daylabel                 day number ranging from 1 to 731
year, month, day, hour   hourly date
season                   1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday                  whether the day is considered a holiday
workingday               whether the day is neither a weekend nor a holiday
weather                  1 = clear, few clouds, partly cloudy
                         2 = mist + cloudy, mist + broken clouds, mist + few clouds, mist
                         3 = light snow, light rain + thunderstorm + scattered clouds, light rain
                         4 = heavy rain + ice pellets + thunderstorm + mist, snow + fog
temp                     temperature in Celsius
atemp                    'feels like' temperature in Celsius
humidity                 relative humidity
wind speed               wind speed
count                    number of total rentals, outcome variable to predict

Predictions will be evaluated using the root mean squared error (RMSE), calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where $y_i$ is the true count, $\hat{y}_i$ is the prediction, and $n$ is the number of entries to be evaluated.
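
The RMSE definition translates directly into a few lines of Python. The function name `rmse` and the toy numbers below are illustrative assumptions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true counts and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Every prediction is off by 2, so the RMSE is exactly 2
print(rmse([10, 20, 30, 40], [12, 18, 32, 38]))  # 2.0
```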

Build a model on the training dataset to predict the bikeshare counts for the hours recorded in the test dataset. Report your prediction RMSE on the test set.

Some tips

• This is a relatively open question; you may use any model you learned in this class.

• It will be helpful to examine the data graphically to spot any seasonal pattern or temporal trend.

• There is one day in the training data with an anomalous atemp record and another day with abnormal humidity. Find those rows and think about what you want to do with them. Is there anything unusual in the test data?

• It might be helpful to transform the count to log(count + 1). If you do, do not forget to transform your predicted values back to counts.

• Think about how to include each predictor in the model: as continuous or as categorical?

• Is there any transformation of the predictors or interactions between them that you think might be

helpful?
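
Two of the tips above can be sketched in a few lines, assuming the natural logarithm is intended: the log(count + 1) transform with its inverse, and a one-hot encoding that treats hour as categorical. The arrays are made-up values, and `pred_log` is only a stand-in for the output of whatever model is fitted on the log scale.

```python
import numpy as np

# Made-up counts standing in for the `count` column
counts = np.array([0, 1, 16, 250])

# Forward transform: log(count + 1)
log_counts = np.log1p(counts)

# Placeholder for model predictions on the log scale (assumption: a perfect fit)
pred_log = log_counts

# Back-transform to the count scale: exp(pred) - 1, clipped at zero
pred_counts = np.clip(np.expm1(pred_log), 0, None)
print(pred_counts)

# Treating hour as categorical: one indicator column per hour of the day
hour = np.array([0, 8, 17, 23])
hour_onehot = np.eye(24)[hour]
print(hour_onehot.shape)  # (4, 24)
```

Note that the reported RMSE is defined on the count scale, so the back-transform has to happen before the error is computed.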

Try to summarize your exploration of the data and your modeling process. You may fit a few models and choose one of them. You will receive points based on your write-up and test RMSE. This is not a

competition among the class to achieve the minimal RMSE, but your result should be in a reasonable

range.
