Date: 2019-07-17 10:31

Lab 2: Multiple Linear Regression

STAT 462: Summer 2019


Note: You will work on this lab on 07/11 and 07/16. On July 11th, you should complete the first activity and take a 10-minute quiz based on those questions. The quiz will be graded for correctness. On July 16th, you will complete the second activity and submit a completed document on Canvas by 3 pm. Please include your code, results, and answers to the other substantive questions asked here. This submission will be graded for completeness and effort. In addition to the various R-esources provided on Canvas, this document also includes some prompts that point to useful commands.

In general, for questions pertaining to writing code, you can always ask Google. Typing ‘how to read a .csv file into R’ will get you the answer quickest. Rule of thumb: If you have spent more than five minutes looking for the answer, JUST ASK for help!


Getting started:

-Square brackets [] are used to subset objects in R, that is, to access the information stored inside them. These two blog posts (https://www.r-bloggers.com/r-accessors-explained/ and https://rpubs.com/tomhopper/brackets) provide some information and examples. You should copy and paste the commands below into your console and observe the output.


---code starts here---

a <- c(10,20,30)

a

a[1]


b <- c("Statistics", "Chemistry", "Sociology")

b

b[2]


c <- matrix(c(10,20,30,40,50,60), nrow = 2, ncol = 3)

c

c[1,1]

c[2,1]

c[,3]

c[2,]


colnames(c) <- c("Column1", "Column2","Column3")

c

c[,"Column2"]

---code ends here---
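
The same bracket syntax also works on data frames, which is the kind of object used later in this lab. The sketch below is only an illustration: it uses the built-in mtcars data rather than the lab dataset, and the object names df, small, and no_wt are made up for the example.

```r
# Illustration only: mtcars ships with base R and stands in for the lab data.
df <- mtcars

# Rows 1 to 3, all columns
df[1:3, ]

# All rows, a few columns selected by name
small <- df[, c("mpg", "hp", "wt")]
head(small)

# Drop one column by name, using a logical test on the column names
no_wt <- small[, colnames(small) != "wt"]
colnames(no_wt)
```

Selecting columns by name is usually safer than selecting by position, because the code keeps working if the column order changes.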


-An R package is “the fundamental unit of shareable code”, as the creator refers to it. A package can contain code, data, or documentation that someone has already worked on and wishes to share with others who may need to conduct the same type of analysis. The lm() function we have been using so far is also hosted in a package, called stats. When you open R/RStudio, some packages are pre-installed and some are also loaded into the working environment automatically. However, sometimes we need to install (install.packages("[package_name]")) and load (library("[package_name]")) a package before using it. For this lab, you will be using a dataset named ‘dvisits’ hosted in the package ‘faraway’. Load this dataset into your working environment so we can analyze it.
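
One possible way to carry out the install-and-load step is sketched below. It assumes an interactive session with internet access and a CRAN mirror already configured (the RStudio default). The requireNamespace() check is an addition of this sketch that avoids re-installing a package you already have.

```r
# Install the package only if it is not already available, then load it.
# "faraway" and "dvisits" are the names given in this lab.
if (!requireNamespace("faraway", quietly = TRUE)) {
  install.packages("faraway")
}
library(faraway)

# Make the dataset available in the working environment and look at the first rows.
data(dvisits)
head(dvisits)
```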


-Once a package is loaded into the working environment, typing ?command_name in the console will show you the documentation for the command or dataset you are working with. You can try typing ?lm and read about the command we use to fit a regression model. Typing ?dvisits will allow you to read the description of the dataset of interest and the variables contained in it.


-(Optional): For those of you who may be interested in learning how to use R markdown to create reports that include embedded R code and output, this (https://rmarkdown.rstudio.com/articles_intro.html) is an excellent resource that can get you started. There are many other great resources on the internet.


Activity A – Thursday, July 11th:


1.How many observations are in the data? How many variables?


2.Use the str() command to find out which variable types R is identifying within the dataset. Further, compare the output of str() to the description of the variables and indicate how many quantitative variables (discrete and continuous) are in the data.
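
As a prompt for questions 1 and 2, the commands below show one way to inspect the size and variable types of a data frame. The sketch uses the built-in iris data purely as a stand-in, and the name is_num is made up for the example.

```r
# Illustration on the built-in iris data (not the lab dataset).
dim(iris)     # number of rows (observations) and columns (variables)
nrow(iris)
ncol(iris)

# str() lists every column together with the type R has assigned to it
str(iris)

# sapply() applies a function to each column; here it flags the numeric ones
is_num <- sapply(iris, is.numeric)
is_num
sum(is_num)   # how many columns R treats as numeric
```

Note that R's storage type is not always the same as the statistical type: a variable coded 0/1 is stored as a number even though it is categorical, which is why the question asks you to compare with the variable descriptions.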

3.Remember that so far, we have only worked with quantitative variables to conduct regression analysis. We want to create a separate data frame in R which will only contain the quantitative columns. Subset the doctor visits data using the square bracket method we looked at above, and create this dataset. Name this dataset subset1.


4.Create a scatterplot matrix and a correlation matrix for this subset and comment on what you observe. In particular, we are interested in looking at the hscore variable as a response variable.
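
A scatterplot matrix and a correlation matrix can be produced with pairs() and cor(). The sketch below uses four columns of the built-in mtcars data as a stand-in, with mpg playing the role of the response. The names num and cor_mat are made up for the example.

```r
# Illustration only: a small numeric subset of mtcars.
num <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Scatterplot matrix: one panel for every pair of variables
pairs(num, main = "Scatterplot matrix (mtcars illustration)")

# Correlation matrix, rounded for readability
cor_mat <- cor(num)
round(cor_mat, 2)

# Correlations of every variable with the chosen response (here mpg)
round(cor_mat["mpg", ], 2)
```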


5.Now, fit a multiple linear regression model with this response variable and all other quantitative variables as explanatory variables.


6.What do you notice in the summary output of this regression? The header line of the coefficients table specifies that “(1 not defined because of singularities)”. What does this mean? Can you think of a pair or trio of variables which may be linearly dependent on each other?
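
To see what this message looks like in a controlled setting, the sketch below simulates data in which one explanatory variable is an exact sum of two others. All variable names and numbers are made up for the illustration and are unrelated to the lab dataset.

```r
# Simulated example of perfectly collinear explanatory variables.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2                      # exact linear combination of x1 and x2
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)
coef(fit)      # the coefficient for x3 is NA
summary(fit)   # coefficients header reports 1 coefficient not defined
alias(fit)     # shows the linear dependency that R detected
```

Because x3 carries no information beyond x1 and x2, R cannot estimate a separate coefficient for it and reports NA instead.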


7.Remove the “nonpresc” column from the data, name this new dataset subset2, and re-fit a regression model.


8.Interpret the coefficient for age and the multiple R-squared for this model. Notice that even though all but one of the coefficients in this regression are significant at the 10% level of significance, indicating that they have a significant relationship with the response variable, the R-squared is relatively low. What does this indicate to us? If we are interested in finding a good model that explains the variation in hscore, would you say that we have found it?


9.Create a plot of residuals (Y) versus fitted values (X). Further, access the documentation using ?plot and add a title, a label for the x-axis, and a label for the y-axis to this plot. What do you observe in this plot? How does it relate to the observation we made in Q8?
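
A minimal sketch of such a plot is given below, using a small model fitted to the built-in mtcars data as a stand-in for the lab model. The arguments main, xlab, and ylab are the ones described in ?plot. The dashed reference line is an optional extra.

```r
# Illustration only: a small regression on mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals on the vertical axis, fitted values on the horizontal axis
plot(fitted(fit), resid(fit),
     main = "Residuals versus fitted values",
     xlab = "Fitted values",
     ylab = "Residuals")

# Horizontal reference line at zero
abline(h = 0, lty = 2)
```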

10.Look at the histogram and normal QQ-plot of residuals. Comment on what these plots tell us about the normality assumption.
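
The two plots in question 10 can be drawn with hist(), qqnorm(), and qqline(). The sketch below again uses a stand-in model on the built-in mtcars data. The names fit and res are made up for the example.

```r
# Illustration only: residuals from a small regression on mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Histogram of the residuals
hist(res, main = "Histogram of residuals", xlab = "Residuals")

# Normal Q-Q plot, with a reference line through the quartiles
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)
```

If the residuals are roughly normal, the points in the Q-Q plot should fall close to the reference line.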


Activity B – Tuesday, July 16th:


We are focusing on the model you built in question 7 of Activity A. More specifically, we have a multiple linear regression with hscore as the response variable, and the following explanatory variables: age, income, illness, actdays, doctorco, nondocco, hospadmi, hospdays, medicine, prescrib.


1.Perform the Anderson-Darling Test for normality. Remember that the command for this test requires the “nortest” package in R. What does the p-value tell us? Does this conclusion match your observation from the normal QQ-plot?


2.Perform the Shapiro-Wilk test for normality. Why does R give us an error in executing this code?
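
The sketch below shows how the two normality tests are called and one way to capture an error message so that you can read it. It runs on simulated vectors (x_small and x_large are made-up names), and the Anderson-Darling part only runs if the nortest package is installed.

```r
# Simulated data, unrelated to the lab dataset.
set.seed(2)
x_small <- rnorm(200)
x_large <- rnorm(6000)

# Shapiro-Wilk test on the smaller sample: reports W and a p-value
shapiro.test(x_small)

# tryCatch() returns the error text instead of stopping the script
msg <- tryCatch(shapiro.test(x_large),
                error = function(e) conditionMessage(e))
msg

# Anderson-Darling test from the nortest package, if it is available
if (requireNamespace("nortest", quietly = TRUE)) {
  nortest::ad.test(x_large)
}
```

Reading the error text, together with ?shapiro.test, is usually the quickest way to find out what a function requires of its input.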


3.Perform the Breusch-Pagan Test for homoscedasticity. Remember that the command for this test requires the “lmtest” package in R. What does the p-value tell us? Does this conclusion match your observation from the plot of residuals versus fitted values?
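
To make the idea behind the test concrete, the sketch below computes the studentized (Koenker) version of the Breusch-Pagan statistic by hand on a stand-in model fitted to mtcars, and then calls lmtest::bptest() if that package is installed. The names e2, aux, bp_stat, bp_df, and bp_p are made up for the example.

```r
# Illustration only: a small regression on mtcars.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)

# Auxiliary regression: squared residuals on the same explanatory variables
e2  <- resid(fit)^2
aux <- lm(e2 ~ wt + hp, data = mtcars)

# Studentized statistic: n times the R-squared of the auxiliary regression
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1     # number of explanatory variables
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(statistic = bp_stat, df = bp_df, p.value = bp_p)

# The packaged test, if lmtest is available
if (requireNamespace("lmtest", quietly = TRUE)) {
  lmtest::bptest(fit)
}
```

A small p-value suggests that the spread of the residuals changes with the explanatory variables, that is, evidence against constant variance.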


4.Based on your answers to questions 1 and 3, which recommendation(s) would you suggest?


5.Calculate the variance inflation factors in R and comment on whether there may be multicollinearity in this model. Which variable(s) should we consider removing to improve the model fit?
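
The variance inflation factor for an explanatory variable is 1 / (1 - R^2), where R^2 comes from regressing that variable on all the other explanatory variables. The sketch below computes this by hand for three mtcars columns as a stand-in, and then calls car::vif() if the car package is installed. Using car is an assumption of this sketch, since the lab does not name a package for this step.

```r
# Illustration only: three explanatory variables from mtcars.
X <- mtcars[, c("wt", "hp", "disp")]

# For each variable, regress it on the others and convert R-squared to a VIF
vif_manual <- sapply(colnames(X), function(v) {
  others <- setdiff(colnames(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)

# The packaged version, if car is available
if (requireNamespace("car", quietly = TRUE)) {
  car::vif(lm(mpg ~ wt + hp + disp, data = mtcars))
}
```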


6.Now, we want to fit a new model that addresses the lack of normality, the heteroscedasticity, and the possible multicollinearity. Can we consider a log transform of the response variable? If no, why not?
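
Before choosing a transformation, it helps to check the range of the variable you plan to transform, for example with summary() or min(). The sketch below uses a made-up vector y, not the lab data, to show how log() and sqrt() behave at zero.

```r
# Made-up values for illustration.
y <- c(0, 1, 4, 9)

log(y)        # the first value is -Inf
sqrt(y)       # defined for every non-negative value

# A quick check to run before taking logs
any(y <= 0)
```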


7.Now, we wish to fit a regression model with the square root of hscore as the response variable and the eight explanatory variables with VIFs < 3. Let us take a few steps to create the required dataset:


a.Start by creating a vector sqrt.hscore.

b.Now, use the cbind command to bind the 9 columns of interest: sqrt.hscore and the other 8 columns of interest which can be referred to as data$[variable_of_interest], into an object named data.new.

c.Turn data.new into a data frame using the as.data.frame command.

d.Finally, change the names of the columns in data.new. Reference code for this can be found in the sample code provided at the beginning of the document.
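
Steps a to d can be sketched as follows. The example uses the built-in mtcars data as a stand-in, with mpg in the role of the response and only two other columns, so sqrt.mpg and the column choices are specific to this illustration.

```r
# Illustration only: mtcars stands in for the lab data.
data <- mtcars

# a. Create the transformed response as a vector
sqrt.mpg <- sqrt(data$mpg)

# b. Bind the columns of interest side by side (the result is a matrix)
data.new <- cbind(sqrt.mpg, data$wt, data$hp)
class(data.new)

# c. Turn the matrix into a data frame
data.new <- as.data.frame(data.new)

# d. Give the columns meaningful names
colnames(data.new) <- c("sqrt.mpg", "wt", "hp")
str(data.new)
```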


8.Now, fit the new model and compare it to the previous model based on:


a.Coefficients of determination

b.Normality QQ-plots

c.Results for normality tests

d.Residuals versus fitted values plots

e.Results of tests for homoscedasticity
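
One way to keep such a comparison organized is to collect the numbers for both models in a small table. The sketch below does this for two stand-in models on mtcars. The names m1, m2, and comparison are made up, and only two of the five criteria are shown.

```r
# Illustration only: an untransformed and a square-root response model.
m1 <- lm(mpg ~ wt + hp, data = mtcars)
m2 <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

# Side-by-side summary of R-squared and a normality test p-value
comparison <- data.frame(
  model     = c("original", "sqrt response"),
  r.squared = c(summary(m1)$r.squared, summary(m2)$r.squared),
  shapiro.p = c(shapiro.test(resid(m1))$p.value,
                shapiro.test(resid(m2))$p.value)
)
comparison

# Residuals versus fitted values for both models, side by side
old_par <- par(mfrow = c(1, 2))
plot(fitted(m1), resid(m1), main = "Original",
     xlab = "Fitted values", ylab = "Residuals")
plot(fitted(m2), resid(m2), main = "Square-root response",
     xlab = "Fitted values", ylab = "Residuals")
par(old_par)
```

Keep in mind that the two R-squared values describe variation on different scales of the response, so they are a rough guide rather than a strict ranking.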


9.Finally, comment on which of the two models you think is ‘better’. Remember that there is no right or wrong answer. Hypothetically, there is a third model out there which can improve upon the two we have studied. However, the comparisons above should give us an indication as to which model may be more reliable.

