Date: 2019-07-31 10:07

Lab 4: Predictors

STAT 462: Summer 2019


Note: You will work on this activity on 07/25 and 07/27. On July 25th, you should complete at least the first activity and take a 10-minute quiz based on those questions. The quiz will be graded for correctness. On July 27th, you will complete the last activity and submit a completed document on Canvas by 3 pm. Please include your code, results, and answers to the other substantive questions asked here. This submission will be graded for completeness and effort. In addition to the various R-esources provided on Canvas, this document also includes a list of commands that will be useful in completing this lab.

In general, for questions pertaining to writing code, you can always ask Google. Typing ‘how to read a .csv file into R’ will get you the answer quickest. Rule of thumb: If you have spent more than five minutes looking for the answer, JUST ASK for help!


This lab is more open-ended than the previous lab activities. I have provided some general prompts that may be helpful, but you should treat this lab as preparation for your project.


ggplot2 is a cutting-edge R package that allows you to create some fantastic-looking visualizations! https://ggplot2.tidyverse.org has ALL YOU NEED TO KNOW about the package. Building plots with ggplot2 can be an involved process, so I do not expect all of you to use it in your project report. However, you should definitely consider it! In this lab, we will start with small steps.

- Install the package using the install.packages("ggplot2") command. Use straight quotes; R does not accept curly quotes around the package name.

- Be sure to load it into your session using the library(ggplot2) command.

- You have probably all heard about the houseprices dataset. Below is the code I used to generate the plots you have seen on the class slides. I want you to spend some time in today’s lab understanding this code. We will be creating similar plots for another dataset.


--- <code starts here>---


library(ggplot2)

data <- read.csv("houseprices.csv", header = TRUE)

## The color palette below is compatible with color blindness
cbbPalette <- c("#E69F00", "#56B4E9", "#D55E00", "#009E73", "#CC79A7", "#000000", "#F0E442", "#0072B2")

## A scatterplot with points colored according to whether there is a fireplace in the house
plot1 <- ggplot(data, aes(x = Living.Area, y = Price, colour = factor(Fireplace))) +
  geom_point(shape = 19, size = 3) +
  scale_colour_manual(values = cbbPalette) +
  ggtitle("Living Area and Price, by Fireplace status") +
  labs(colour = "Fireplace") +
  theme(plot.title = element_text(size = 20), legend.title = element_text(size = 16), legend.text = element_text(size = 16))

plot1

## The same scatterplot as above, with lines, as specified by our regression model
mod <- lm(Price ~ Living.Area + Fireplace, data = data)

## Both lines share the Living.Area slope; the second line's intercept is
## shifted by the Fireplace coefficient
plot2 <- ggplot(data, aes(x = Living.Area, y = Price, colour = factor(Fireplace))) +
  geom_point(shape = 19, size = 3) +
  scale_colour_manual(values = cbbPalette) +
  geom_abline(intercept = mod$coefficients[1], slope = mod$coefficients[2], colour = "#E69F00", lwd = 2) +
  geom_abline(intercept = (mod$coefficients[1] + mod$coefficients[3]), slope = mod$coefficients[2], colour = "#56B4E9", lwd = 2) +
  ggtitle("Living Area and Price, by Fireplace status") +
  labs(colour = "Fireplace") +
  theme(plot.title = element_text(size = 20), legend.title = element_text(size = 16), legend.text = element_text(size = 16))

plot2


--- <code ends here>---
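
If you want to convince yourself of how the two geom_abline() lines relate to the fitted model, the sketch below may help. It does not use houseprices.csv; it simulates a stand-in data frame (all numbers are invented, and Fireplace is assumed to be coded 0/1), fits the same additive model, and checks that the two lines share a slope and differ only by the Fireplace coefficient in the intercept.

```r
## Simulated stand-in for houseprices.csv; all numbers here are invented.
set.seed(462)
n <- 200
sim <- data.frame(
  Living.Area = runif(n, 800, 3500),
  Fireplace   = rbinom(n, 1, 0.5)   # assumed 0/1 coding
)
sim$Price <- 20000 + 90 * sim$Living.Area + 15000 * sim$Fireplace +
  rnorm(n, sd = 20000)

## Additive (parallel-lines) model, as in the lab code
mod_sim <- lm(Price ~ Living.Area + Fireplace, data = sim)
b <- coef(mod_sim)

## Intercept and slope of each fitted line
line_no_fp <- c(intercept = unname(b["(Intercept)"]),
                slope     = unname(b["Living.Area"]))
line_fp    <- c(intercept = unname(b["(Intercept)"] + b["Fireplace"]),
                slope     = unname(b["Living.Area"]))

## Compare the hand-built lines with predict() at Living.Area = 2000
new_houses <- data.frame(Living.Area = 2000, Fireplace = c(0, 1))
pred <- unname(predict(mod_sim, newdata = new_houses))
by_hand <- c(line_no_fp["intercept"] + line_no_fp["slope"] * 2000,
             line_fp["intercept"]    + line_fp["slope"] * 2000)

print(rbind(line_no_fp, line_fp))
print(cbind(pred, by_hand = unname(by_hand)))
```

The two columns printed at the end should agree, which is why drawing the lines with geom_abline() is equivalent to plotting the model's predictions for each Fireplace group.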


We will now revisit datasets that you have analyzed before and explore them further to understand whether any of the categorical variables and/or interactions would be useful additions to the model.



Activity A – Thursday, July 25th:


Let us start with the bikesharing dataset. Let us refresh our memory about the variables contained in this dataset.


-instant: record index

-dteday : date

-season : season (1:spring, 2:summer, 3:fall, 4:winter)

-yr : year (0: 2011, 1:2012)

-mnth : month ( 1 to 12)

-holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

-weekday : day of the week

-workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0

-weathersit :

  o 1: Clear, Few clouds, Partly cloudy

  o 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

  o 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

  o 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

-temp : Normalized temperature in Celsius. The values are divided by 41 (max)

-atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

-hum: Normalized humidity. The values are divided by 100 (max)

-windspeed: Normalized wind speed. The values are divided by 67 (max)

-casual: count of casual users

-registered: count of registered users

-cnt: count of total rental bikes including both casual and registered
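
Several of these columns are numeric codes for categories, so it can help to store them as labelled factors before plotting or modeling. The sketch below is only an illustration on a tiny hand-made data frame (bikes_demo and its values are made up, not taken from the real file). It labels season using the coding listed above and, taking the description of temp at face value, multiplies by 41 to get back to degrees Celsius.

```r
## Tiny hand-made data frame using the coding scheme listed above;
## the values are illustrative, not taken from the real file.
bikes_demo <- data.frame(
  season     = c(1, 2, 3, 4, 1),
  weathersit = c(1, 2, 1, 3, 2),
  temp       = c(0.20, 0.55, 0.75, 0.35, 0.25)
)

## Store season as a labelled factor so R treats it as categorical
bikes_demo$season_f <- factor(bikes_demo$season,
                              levels = 1:4,
                              labels = c("spring", "summer", "fall", "winter"))

## Undo the normalization of temp, assuming values were divided by 41
bikes_demo$temp_c <- bikes_demo$temp * 41

print(levels(bikes_demo$season_f))
str(bikes_demo)
```

If season is left as a plain number, lm() will fit a single slope for it instead of one coefficient per season, which is rarely what you want.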


1) Let us begin our exploration with the season column. Start by creating a table() of the season and month variables to determine which months fall in which season.


2) Suppose that we want to add the season variable to the equation. We want to understand whether this variable can add something to our understanding of the relationship between temperature and rental numbers. Without further analysis, what would your first thought be? Would it be a good idea to add this variable to the model? If yes, why so? If no, why not?


3) Now, create a plot that has temperature on the X-axis and count of rental bikes on the Y-axis, with points color-coded by the season the observation is in. We will need to use ggplot2. What do you see in the plot? Does this mirror your thoughts in the previous question?


4) Add this variable to an additive linear regression model. We are not looking at interactions yet!


a. Which season was considered to be the reference level by R?

b. Interpret the coefficient for the Fall season.

c. Is bike rental count significantly different for Spring and Winter?

d. Take a look at the goodness-of-fit measures: the R-squared, the adjusted R-squared, and the results of the F-test.

e. Run the diagnostics to check whether the assumptions of linearity, normality, and homoscedasticity hold.

f. We stipulate, without further investigation, that the assumption of independence cannot be expected to hold. Why so?

g. Now, we want to conduct the general linear F-test to understand whether the season variable is significantly related to the rental count. Remember, we use the aov() command, which can be found on your class slides. What do you find?

h. Finally, we wish to change the reference level to winter. Once again, the command is available on your slides. Perform the regression and comment on the signs of the coefficients. Do they seem intuitive to you?


5) Introduce an interaction term between season and temperature. Look at the summary of the corresponding ANOVA model. Do you think the interaction term is significantly related to the rental count?
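
If you are unsure how the pieces of Activity A fit together in R, here is a rough sketch of the workflow on simulated data. Nothing in it comes from the real bikesharing file: bikes_sim, the month-to-season mapping, and all effect sizes are invented for illustration. Note also that it compares nested lm() fits with anova() for the general linear F-test, which is a substitute for the aov() call on the class slides (not reproduced here), and uses relevel() to change the reference level.

```r
## Simulated stand-in for the bikesharing data; every number is invented.
set.seed(25)
n <- 400
mnth <- sample(1:12, n, replace = TRUE)
## Hypothetical month-to-season mapping, for illustration only
season_code <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)[mnth]
bikes_sim <- data.frame(
  mnth   = mnth,
  season = factor(season_code, levels = 1:4,
                  labels = c("spring", "summer", "fall", "winter")),
  temp   = runif(n, 0.1, 0.9)
)
season_effect <- c(spring = 0, summer = 900, fall = 1200, winter = 600)
bikes_sim$cnt <- 1500 + 4000 * bikes_sim$temp +
  unname(season_effect[as.character(bikes_sim$season)]) +
  rnorm(n, sd = 500)

## Question 1: cross-tabulate season and month
print(table(bikes_sim$season, bikes_sim$mnth))

## Question 4: additive model; the first factor level is the reference
fit_add <- lm(cnt ~ temp + season, data = bikes_sim)
print(summary(fit_add))
print(levels(bikes_sim$season)[1])

## Question 4g: general linear F-test for season via nested models
fit_temp <- lm(cnt ~ temp, data = bikes_sim)
f_test <- anova(fit_temp, fit_add)
print(f_test)

## Question 4h: make winter the reference level and refit
bikes_sim$season_w <- relevel(bikes_sim$season, ref = "winter")
fit_winter <- lm(cnt ~ temp + season_w, data = bikes_sim)
print(coef(fit_winter))

## Question 5: add the temp-by-season interaction and test it
fit_int <- lm(cnt ~ temp * season, data = bikes_sim)
int_test <- anova(fit_add, fit_int)
print(int_test)
```

Changing the reference level changes which comparisons the coefficients report, but not the fitted values, so fit_add and fit_winter describe the same model.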



Activity B – Saturday, July 27th:


Let us now revisit the birth weight data. The wooldridge package contains it under the name bwght2.


1) Use the ?bwght2 command to look at the list of variables in this dataset. List the categorical variables contained in this data. Remember, not all datasets code categorical variables the same way. Some code them in a single column, and some encode them directly as indicator variables in the table.


2) Let us now expand the scope of our inquiry to reflect what actual data analysis may look like. We will be working with birth weight as the response variable, and with the mother’s and father’s age and education, the month prenatal care began, the total number of prenatal visits, cigarettes, and drinks as the quantitative variables. Choose one of the three categorical variables to add to the model. Use Activity A Question 4 as a guide for your exploration. Answer all the relevant questions.


3) Finally, throw in the other two categorical variables as well. Treat this as a complete data analysis activity, which means:


a. Look at a scatterplot and correlation matrix as exploratory data analysis.

b. Run a regression with all the variables and comment on which variables you think are significant in explaining the variation in the response variable.

c. Comment on the R-squared and adjusted R-squared.

d. Check for violation of regression assumptions.
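
As a starting point for steps a through d, the sketch below runs the same sequence on simulated data. It does not load the wooldridge package; bw_sim is a made-up stand-in, its column names are placeholders that only loosely follow the variables mentioned above, and all effect sizes are invented. Swap in the real data frame and variable names when you do the analysis.

```r
## Simulated stand-in for bwght2; column names are placeholders and
## every value is invented.
set.seed(27)
n <- 300
bw_sim <- data.frame(
  mage  = round(rnorm(n, 29, 5)),   # mother's age
  meduc = round(rnorm(n, 14, 2)),   # mother's education
  npvis = rpois(n, 11),             # number of prenatal visits
  cigs  = rpois(n, 1),              # cigarettes
  male  = rbinom(n, 1, 0.5)         # indicator-coded categorical variable
)
bw_sim$bwght <- 3000 + 8 * bw_sim$mage + 15 * bw_sim$npvis -
  12 * bw_sim$cigs + 90 * bw_sim$male + rnorm(n, sd = 550)

quant_vars <- c("bwght", "mage", "meduc", "npvis", "cigs")

## a. Correlation matrix of the quantitative variables
##    (for the scatterplot matrix, try pairs(bw_sim[, quant_vars]))
cor_mat <- cor(bw_sim[, quant_vars])
print(round(cor_mat, 2))

## b. Regression with all variables; the indicator enters as a factor
fit_all <- lm(bwght ~ mage + meduc + npvis + cigs + factor(male),
              data = bw_sim)
fit_sum <- summary(fit_all)
print(fit_sum$coefficients)

## c. R-squared and adjusted R-squared
print(c(r2 = fit_sum$r.squared, adj_r2 = fit_sum$adj.r.squared))

## d. Residual diagnostics (uncomment to draw the plots):
##    residuals vs fitted and normal Q-Q
# plot(fit_all, which = 1:2)
```

A 0/1 indicator gives the same fit whether or not you wrap it in factor(); the wrapper matters for categorical variables stored in a single column with more than two codes.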

