Final Project
STA302H1F: LEC5101/STA1001HF: LEC0201
Due on 25th June, 2020 11:59 PM Sharp in Quercus
All relevant work must be shown for credit.
Final Project: The final project is due on June 25, 2020 by 11:59PM EST and consists of
a data analysis on a novel dataset. The deadline will be strictly applied. At no circumstances
students can submit late. Please make sure that you start the submission process early so that
your project is graded.
Students will be required to demonstrate their understanding of the methods based on course
materials by developing a reasonable regression model using the techniques taught in class. The
students will be responsible for choosing the correct methods to apply and providing appropriate
justifications defending their choices.
The final project will be done individually, and must be typed and submitted by the stated
deadline. The project needs to fulfill the following criteria:
• Font: 12-point font in a style similar to Times New Roman
• Spacing: single-spaced
• The word limit for the final project is 1500. This excludes the title page, table/figure captions
and appendix.
• Maximum 5 tables/figure will be allowed in the project report. The tables and figures should
be relevent, should convey the purpose of the project. All tables and figures should have
captions. you may use any combination of tables and figures
• Up to 3 additional tables/figures but they should only be included if they are relevant to the
analysis and are referred to in the main text.
• You must submit the report in a standard file format (e.g., .doc, .docx or a pdf).
• Please submit your R codes file. This can be a .r or a .rmd file. No other file format for the
codes will be accepted.
In order to pass the course, you must submit the final project.
For this problem you need to load the NHANES dataset using the following command
## If the package is not already installed then use ##
install.packages(’NHANES’) ; install.packages(’tidyverse’)
library(tidyverse)
library(NHANES)
small.nhanes <- na.omit(NHANES[NHANES$SurveyYr=="2011_12"
& NHANES$Age > 17,c(1,3,4,8:11,13,17,20,21,25,46,50,51,52,61)])
small.nhanes <- as.data.frame(small.nhanes %>%
group_by(ID) %>% filter(row_number()==1) )
nrow(small.nhanes)
## Checking whether there are any ID that was repeated. If not ##
1
## then length(unique(small.nhanes$ID)) and nrow(small.nhanes) are same ##
length(unique(small.nhanes$ID))
This is data collected by the US National Center for Health Statistics (NCHS). To check the variable
description please type ?NHANES in R. The preceeding codes create a small subset of the original
NHANES dataset. The original dataset has 76 variables. The small.nhanes dataset has 17 variables.
We have only selected data from people with age > 17 years.
With this dataset answer the following questions, Randomly select 400 observations from the
data. For this selection use your student ID as the seed (you can follow the next chunk of codes for
this). This is the traning set. The rest of the data will be used as a test set. The test set should
not be used for model fitting and validating at any point during the analysis of the project.
## Create training and test set ##
set.seed(1002656486)
train <- small.nhanes[sample(seq_len(nrow(small.nhanes)), size = 400),]
nrow(train)
length(which(small.nhanes$ID %in% train$ID))
test <- small.nhanes[!small.nhanes$ID %in% train$ID,]
nrow(test)
The combined systolic blood pressure reading (BPSysAve) is our outcome of interest. Every
other variable other than the ID can be considered as predictors. We are mainly interested on the
effect of smoking (SmokeNow) on the combined systolic blood pressure reading. However, we are
also interested in the prediction of the combined systolic blood pressure reading and identifying
which variables are the best for the prediction. Based on the data analysis techniques you learned
from this course perform a complete analysis on the dataset. Your analysis should include (but is
not limited to):
• Model Diagnostics
• Checking for the variance inflation factor (VIF)
• Variable selection
• Shrinkage methods
• Model Validation
• Checking the prediction error on the test set after applying various model selection techniques
• After selecting the best model interpret and explain the parameter estimates
• Conclude on the effect of predictors on the combined systolic blood pressure readin
However, you have to justify the aforementioned methods and have to use them accurately.
The final project will be submitted as a project report, which consists of:
• Introduction section: where you introduce the purpose and relevance of the project. You
can also include some literature review on the NHANES dataset if applicable.
2
• Methods section: Please describe and explain the methods, tools and techniques used to
arrive at your final model here. Need to show some exploratory data analysis.
• Results section: here you present a description of your study sample, important results
that led you to make crucial decision in building your model, and the final model and any
other important results
• Discussion section: here you interpret your final model and describe why it answers the
research question and why it is important, as well as discuss any limitations that still exist
based on your results.
ALL THE BEST!
3
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。