DRAFT

Statistical Learning Assignment - Semester 1, 2021

INSTRUCTIONS:

1. The assignment must be typed (not handwritten). You may use either Microsoft Word (or similar)
or R markdown in RStudio for the assignment. Note that the final project will require the use of
R markdown. Your submission should be no longer than 10 A4 pages [single sided], with a font size
no smaller than 11 point.

2. The assignment due date is listed on the Wattle (Turn-it-in) site. Upload the assignment through
Wattle using Turn-it-in. You should submit your assignment in two different parts. If you are
using R markdown:

(a) A PDF file [or HTML file] of your assignment (this should include important R code to highlight
what you have done).

(b) A ‘.Rmd’ file [an R markdown file].

If you are using Microsoft Word (or similar):

(a) A Word file of your assignment (this should include important R code to highlight what you
have done).

(b) A ‘.R’ file of your R code.

3. In answering the questions, write your answers clearly and succinctly. Use appropriate graphs and
tables when you think they help to describe your point or thinking process. Do not just “print” a
set of results. Every result should be discussed and have a reason for being presented. No points
will be awarded unless you clearly discuss what you are doing.

4. No late assignments will be accepted.

5. You should not discuss the assignment (questions, solutions, code, etc.) with your
classmates or other individuals. You can discuss these with me or your tutor (Dr.
Ha Nguyen) during our consultation times. You must independently write your own
solutions. This includes all computer code, English, and mathematics. University
policies on academic integrity will be strictly enforced. See
http://www.anu.edu.au/students/program-administration/assessments-exams/academic-honesty-plagiarism
for more details.

6. Have fun with the exploration!

1. (100 points) We will explore some of the techniques we are considering by examining data on housing
prices. We will use the data from the prediction competition available on Kaggle:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques. For this question you will
need to create an account on Kaggle. Please let me know if you don’t want to use Kaggle because of
privacy concerns.

(a) Create an account on Kaggle. What is your Kaggle username? Download the training and test data.

(b) Consider a multiple regression model to examine the relationship between housing sale prices (Y)
in Ames, Iowa, USA from 2006 to 2010 and their covariate information (x). While 79 covariates
are available, for this assignment we will only use a few of them. Only consider the following
covariates: LotArea, OverallCond, GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. As the real
test data does not contain the response Y (SalePrice), split the training data in half. The first
half will be the new training data and the second half will be your personal test data. For this
assignment set α = 0.05.
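The split described above can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame is simulated so that the code runs on its own (in practice it would come from the downloaded Kaggle training file), and the object names `my_train`, `my_test` and `alpha` are hypothetical, not required by the assignment.

```r
# Simulated stand-in for the Kaggle training data (assumption: the real data
# would be read from the downloaded training file instead).
set.seed(2021)
n <- 200
train <- data.frame(
  LotArea      = round(runif(n, 2000, 20000)),
  OverallCond  = sample(1:9, n, replace = TRUE),
  GrLivArea    = round(runif(n, 600, 3000)),
  FullBath     = sample(1:3, n, replace = TRUE),
  TotRmsAbvGrd = sample(3:10, n, replace = TRUE),
  PoolArea     = ifelse(runif(n) < 0.05, round(runif(n, 100, 700)), 0)
)
train$SalePrice <- round(50000 + 80 * train$GrLivArea + rnorm(n, sd = 20000))

# Keep only the six covariates used in this assignment plus the response.
keep <- c("LotArea", "OverallCond", "GrLivArea", "FullBath",
          "TotRmsAbvGrd", "PoolArea", "SalePrice")
train <- train[, keep]

# First half of the rows: personal training data.
# Second half of the rows: personal test data.
half     <- floor(nrow(train) / 2)
my_train <- train[seq_len(half), ]
my_test  <- train[(half + 1):nrow(train), ]

alpha <- 0.05  # significance level used throughout
```

The split is by row order, as the text asks for the first and second half, rather than a random split.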

i. (20 points) Using all of the training data together (personal training and test data), conduct
an exploratory data analysis. In doing your analysis, make sure to identify any unusual points
and discuss why they are unusual. For this assignment do not remove any unusual points; only
comment on them (if they exist). In addition to visualisations of the raw data, consider the
natural log transformation of the response. You may also consider any transformations of the
covariates. For the rest of the assignment, if you believe the transformations are appropriate
(provide justification - this can simply be a discussion), use those transformations.

ii. (6 points) Using just your personal training data and the covariate GrLivArea, based on
traditional regression approaches (possibly: t-tests, F-tests, etc.), determine if there exists
a non-linear (quadratic, cubic, etc.) relationship between the covariate and the response. How
flexible should the model be? Make sure to fully outline any tests and conclusions.
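One possible way to set up such tests in R is to compare nested polynomial fits with partial F-tests. The sketch below runs on simulated data (an assumption, so that it is self-contained); `my_train` is a hypothetical name for the personal training data, and stopping at degree 3 is only an illustrative choice.

```r
# Simulated stand-in for the personal training data (assumption).
set.seed(2021)
n <- 100
my_train <- data.frame(GrLivArea = runif(n, 600, 3000))
my_train$SalePrice <- 40000 + 90 * my_train$GrLivArea + rnorm(n, sd = 25000)

# Nested models of increasing polynomial degree.
fit1 <- lm(SalePrice ~ GrLivArea, data = my_train)
fit2 <- lm(SalePrice ~ poly(GrLivArea, 2), data = my_train)
fit3 <- lm(SalePrice ~ poly(GrLivArea, 3), data = my_train)

# Sequential partial F-tests: each row compares a model with the one above it.
tests <- anova(fit1, fit2, fit3)
print(tests)

# p-values for adding the quadratic term, then the cubic term.
p_values <- tests[["Pr(>F)"]][-1]
```

Each p-value would then be compared with α, with the hypotheses stated explicitly in the write-up.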

iii. (6 points) Using your personal training and personal testing data, along with the notion of
squared error loss, determine if there exists a non-linear (quadratic, cubic, etc.) relationship
between the covariate and the response. How flexible should the model be?
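A squared-error-loss comparison of this kind can be sketched as follows: fit each candidate on the training half and average the squared prediction errors on the test half. The data are simulated (an assumption), and the names `my_train`, `my_test` and the range of degrees 1 to 5 are illustrative only.

```r
# Simulated stand-in for the personal training and test data (assumption).
set.seed(2021)
n <- 200
dat <- data.frame(GrLivArea = runif(n, 600, 3000))
dat$SalePrice <- 40000 + 90 * dat$GrLivArea + rnorm(n, sd = 25000)
my_train <- dat[1:100, ]
my_test  <- dat[101:200, ]

# Test mean-squared error for polynomial degrees 1 to 5.
degrees <- 1:5
test_mse <- sapply(degrees, function(d) {
  fit  <- lm(SalePrice ~ poly(GrLivArea, d), data = my_train)
  pred <- predict(fit, newdata = my_test)
  mean((my_test$SalePrice - pred)^2)
})
names(test_mse) <- degrees
print(test_mse)

# Degree with the smallest test MSE.
best_degree <- degrees[which.min(test_mse)]
```

A plot of `test_mse` against `degrees` is one way to present the comparison.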

iv. (6 points) Consider all the covariates which we are using in this assignment: LotArea, OverallCond,
GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. Using just your personal training
data and traditional regression approaches, determine if any of the variables are statistically
significant. Are you able to reduce the model (i.e. not use all the covariates)? Here you do
not need to consider any non-linearities or interactions. Make sure to fully outline any tests
and conclusions.
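As a rough sketch of the mechanics, the individual t-tests can be read from the coefficient table of the full model, and a candidate reduced model can be compared with the full model using a partial F-test. The data are simulated (an assumption), and the particular reduced model is a hypothetical choice for illustration, not a suggested answer.

```r
# Simulated stand-in for the personal training data (assumption).
set.seed(2021)
n <- 100
my_train <- data.frame(
  LotArea      = runif(n, 2000, 20000),
  OverallCond  = sample(1:9, n, replace = TRUE),
  GrLivArea    = runif(n, 600, 3000),
  FullBath     = sample(1:3, n, replace = TRUE),
  TotRmsAbvGrd = sample(3:10, n, replace = TRUE),
  PoolArea     = ifelse(seq_len(n) %% 3 == 0, runif(n, 100, 700), 0)
)
my_train$SalePrice <- 20000 + 2 * my_train$LotArea +
  85 * my_train$GrLivArea + rnorm(n, sd = 25000)
alpha <- 0.05

# Full main-effects model and its t-tests.
full <- lm(SalePrice ~ LotArea + OverallCond + GrLivArea + FullBath +
             TotRmsAbvGrd + PoolArea, data = my_train)
coefs <- summary(full)$coefficients
print(coefs)
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < alpha]

# Hypothetical reduced model, compared with the full model by a partial F-test.
reduced <- lm(SalePrice ~ LotArea + GrLivArea + FullBath, data = my_train)
ftest <- anova(reduced, full)
print(ftest)
```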

v. (6 points) Based on the ordering of the covariates in your final model in the previous question,
using your personal training and personal testing data, along with the notion of squared error
loss, determine which covariates should be included in the model.

vi. (6 points) Consider all the covariates which we are using in this assignment: LotArea, OverallCond,
GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. Using just your personal training
data and traditional regression approaches, determine if PoolArea has a statistically significant
interaction with any of the other covariates. You may have up to five interactions in
your model. Make sure to fully outline any tests and conclusions.
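One way to set this up, sketched on simulated data (an assumption), is to add the five PoolArea interactions to the main-effects model and test them jointly and individually. The object names are hypothetical. PoolArea is simulated here with many non-zero values, which may not resemble the real data.

```r
# Simulated stand-in for the personal training data (assumption).
set.seed(2021)
n <- 100
my_train <- data.frame(
  LotArea      = runif(n, 2000, 20000),
  OverallCond  = sample(1:9, n, replace = TRUE),
  GrLivArea    = runif(n, 600, 3000),
  FullBath     = sample(1:3, n, replace = TRUE),
  TotRmsAbvGrd = sample(3:10, n, replace = TRUE),
  PoolArea     = ifelse(seq_len(n) %% 3 == 0, runif(n, 100, 700), 0)
)
my_train$SalePrice <- 20000 + 2 * my_train$LotArea +
  85 * my_train$GrLivArea + rnorm(n, sd = 25000)

# Main-effects model.
main <- lm(SalePrice ~ LotArea + OverallCond + GrLivArea + FullBath +
             TotRmsAbvGrd + PoolArea, data = my_train)

# Add the interaction of PoolArea with each of the other five covariates.
inter <- update(main, . ~ . + PoolArea:LotArea + PoolArea:OverallCond +
                  PoolArea:GrLivArea + PoolArea:FullBath +
                  PoolArea:TotRmsAbvGrd)

# Joint partial F-test of all five interactions.
joint <- anova(main, inter)
print(joint)

# Individual t-tests for the interaction terms only.
coefs <- summary(inter)$coefficients
inter_rows <- coefs[grepl(":", rownames(coefs)), , drop = FALSE]
print(inter_rows)
```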

vii. (6 points) Based on the ordering of the covariates in your final model in the previous question,
using your personal training and personal testing data, along with the notion of squared error
loss, determine which interactions should be included in the model.

viii. (6 points) Consider all the covariates which we are using in this assignment: LotArea,
OverallCond, GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. You may now consider any
modelling that you wish using your personal training data. You may also consider any type of
model selection approach (i.e. traditional or based on squared-error loss for the testing data).
Make sure to fully outline any tests and conclusions. Calculate the mean-squared error on
your personal testing data.

ix. (6 points) Using your final model from Question 1(b)viii and the Kaggle test data, submit a
prediction file to Kaggle. See Kaggle for details on what the file should look like. What was
your score and rank?

Note: as discussed on the site (https://www.kaggle.com/c/titanic/details/evaluation),
“[t]he Kaggle leader-board has a public and private component. 50% of your predictions
for the test set have been randomly assigned to the public leader-board (the same 50%
for all users). Your score on this public portion is what will appear on the leader-board.
At the end of the contest, we will reveal your score on the private 50% of the data, which
will determine the final winner. This method prevents users from ‘over-fitting’ to the
leader-board.”
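Writing the prediction file for Question 1(b)ix might look like the sketch below. It assumes the file has an `Id` column and a `SalePrice` column; check the sample submission on Kaggle for the authoritative layout. The data, the single-covariate log-scale model, and the names `kaggle_test` and `out_file` are stand-ins so that the code runs on its own.

```r
# Simulated stand-ins for the training data and the Kaggle test data (assumption).
set.seed(2021)
n <- 100
my_train <- data.frame(GrLivArea = runif(n, 600, 3000))
my_train$SalePrice <- exp(11 + 0.0004 * my_train$GrLivArea + rnorm(n, sd = 0.2))
kaggle_test <- data.frame(Id = 1:10, GrLivArea = runif(10, 600, 3000))

# Stand-in for the final model; here the response is modelled on the log scale.
fit <- lm(log(SalePrice) ~ GrLivArea, data = my_train)

# Predict on the test data and transform back to the original price scale.
pred <- exp(predict(fit, newdata = kaggle_test))

# One row per test house: identifier and predicted sale price.
submission <- data.frame(Id = kaggle_test$Id, SalePrice = pred)

# Written to a temporary directory here; any file path would do.
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)
```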

x. (6 points) Examining the leader-board, you can see that one individual has a perfect score
(when I last looked). Is this surprising? What explanation might there be for this?

xi. (6 points) This Kaggle competition is using Root Mean Squared Logarithmic Error instead
of Mean Squared Error. Provide a discussion about the difference between the two criteria.
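For reference, the two criteria are commonly written as below, where $y_i$ is the observed sale price, $\hat{y}_i$ is the prediction and $n$ is the number of houses being scored. The $+1$ inside the logarithms is the usual convention for RMSLE and is an assumption here; the competition's evaluation page gives the exact form used for scoring.

```latex
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}
  \left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}.
\]
```

Written this way, MSE is built from differences on the original price scale, while RMSLE is built from differences on the log scale.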

xii. (20 points) Provide a full discussion of your final model from Question 1(b)viii. This may
include, but is not limited to, discussions of the coefficients, visualisations of the fitted model,
and model checking.
