For this activity you will create an R program and upload that program to wolfware. Be sure that your R file
adheres to the R file submission guidelines (available on wolfware).
To Do – Write an R script that will successfully do what follows. Note that you should have comments
throughout your program explaining what you are doing and also you are asked to place some answers in a
comment.
In this project, you are going to work with census data sets. There are two files: EDU01.xls and PST01.xls that
you will (eventually read in). There is also a file (Mastdata.xls) that describes the variables in the data sets
(along with variables in many other data sets).
With these census data files there are often multiple columns corresponding to one variable. The columns that
we want to work with are the columns that end with a D.
The instructions for this project are deliberately meant to be less prescriptive so that you must think about the
best way to answer the questions. We’ll start with reading in and manipulating the EDU01.xls data set and then
automate somethings to read in the PST01.xls data set.
Use tidyverse functions where possible. Chaining is optional.
Place this code at the top of your file and run the code:
#Load packages needed
library(tidyverse)
library(readxl)
#Create functions for use later on. Running this code makes these functions available to us.
#Given a census column name such as LLL######D, extract the last two numbers (which represent a year usually)
grabYear <- function(string){
m <- regexpr(pattern = "(\\d+)", string)
match <-regmatches(string, m)
year <- gsub(pattern = "\\d\\d\\d\\d", replacement = "", x = match)
return(as.numeric(year))
}
is.County <- function(countyString){
if(length(grep(pattern = ", \\w\\w", countyString))>0){
return(1)
} else {
return(0)
}
}
#vectorize it instead and use as a logical for indexing
is.CountyVec <- Vectorize(FUN = is.County, USE.NAMES = FALSE)
Our goal with the EDU01.xls data set will be to read in the Public School Enrollment variables from 1986-1987
to 2008-2009. Check out in the Mastdata.xls file which variables this corresponds to in EDU01.xls (note the
multiple sheets!).
Read the public school enrollment variables in and do the following:
• Keep only the first two columns (Area_name and STCOU) and those with a name that ends with D
• Remove rows corresponding to Area_name equal to “District of Columbia” (there are two versions of
this row, we want to keep the capital letter one only!)
• Put the data into “long form” rather than wide form
• Put all of the data into one data frame
• Create a “numYear” variable that is a numeric variable corresponding to the year. Recall from above
that the last two digits prior to D represent the year. The grabYear() function you read in above can be
used to extract that
• Use a for loop and the is.County() function to create an indexing vector that is 1 if a row corresponds to
a county level observation and a 0 if the row corresponds to any other observation (State, US, or DC
basically)
• Use the Vectorized version, is.CountyVec(), to create the same indexing vector
• Create two data sets – one data set that contains only county level data and one data set that contains
only non-county data
• Create a function that takes in a data frame that contains the non-county data, adds a column to the data
set called “Division” corresponding to the state’s classification of division here, and returns the data
frame. If row corresponds to a non-state (i.e. UNITED STATES), return ERROR for the division. This
will not be a ton of fun but can be made easier with the use of %in%.
• Your two final data sets should look like the following:
With your edSummaryData set (or whatever you called it), do the following:
• Create summary statistics for the overall mean for each division. Do this twice, once using tapply() and
once using the tidyverse method
• Use microbenchmark to compare the speeds (see Vectorized function video code for examples of this)
Lastly, create a function – the start of it is given here:
formatData <- function(path, sheet = 1, gatherCols = TRUE, colValue = "value", division = TRUE){
require(tidyverse)
}
• This function should read in a particular sheet of a census type data set (we’ll use it on the PST01.xls
data set shortly)
• The function should select the columns as above and filter out the “District of Columbia” rows
• The data set should be filtered to only include non-county data
• Optionally, the user should be able to return a “long form” data set with the value column equal to
whatever string the user specifies in the colValue argument
• If the long form data is being returned the user should be able to choose whether or not to append a
division column as well (if they don’t choose gatherCols = TRUE then it shouldn’t matter what the
division value or colValue things are set to)
• Use the function to read in the 2nd sheet of the PST01.xls data set. Read it in with gatherCols = TRUE
and give a value for colValue. Then also read it in with gatherCols = FALSE.
Below is the output you should get if you run the function with the above two settings:
formatData("PST01.xls", sheet = 2, colValue = "Population", division = TRUE, gather = TRUE)
formatData("PST01.xls", sheet = 2, gather = FALSE)
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。