联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Algorithm 算法作业Algorithm 算法作业

日期:2018-04-28 04:32

For this activity you will create an R program and upload that program to wolfware. Be sure that your R file

adheres to the R file submission guidelines (available on wolfware).

To Do – Write an R script that will successfully do what follows. Note that you should have comments

throughout your program explaining what you are doing and also you are asked to place some answers in a

comment.

In this project, you are going to work with census data sets. There are two files: EDU01.xls and PST01.xls that

you will (eventually read in). There is also a file (Mastdata.xls) that describes the variables in the data sets

(along with variables in many other data sets).

With these census data files there are often multiple columns corresponding to one variable. The columns that

we want to work with are the columns that end with a D.

The instructions for this project are deliberately meant to be less prescriptive so that you must think about the

best way to answer the questions. We’ll start with reading in and manipulating the EDU01.xls data set and then

automate somethings to read in the PST01.xls data set.

Use tidyverse functions where possible. Chaining is optional.

Place this code at the top of your file and run the code:

#Load packages needed

library(tidyverse)

library(readxl)

#Create functions for use later on. Running this code makes these functions available to us.

#Given a census column name such as LLL######D, extract the last two numbers (which represent a year usually)

grabYear <- function(string){

m <- regexpr(pattern = "(\\d+)", string)

match <-regmatches(string, m)

year <- gsub(pattern = "\\d\\d\\d\\d", replacement = "", x = match)

return(as.numeric(year))

}

is.County <- function(countyString){

if(length(grep(pattern = ", \\w\\w", countyString))>0){

return(1)

} else {

return(0)

}

}

#vectorize it instead and use as a logical for indexing

is.CountyVec <- Vectorize(FUN = is.County, USE.NAMES = FALSE)

Our goal with the EDU01.xls data set will be to read in the Public School Enrollment variables from 1986-1987

to 2008-2009. Check out in the Mastdata.xls file which variables this corresponds to in EDU01.xls (note the

multiple sheets!).

Read the public school enrollment variables in and do the following:

• Keep only the first two columns (Area_name and STCOU) and those with a name that ends with D

• Remove rows corresponding to Area_name equal to “District of Columbia” (there are two versions of

this row, we want to keep the capital letter one only!)

• Put the data into “long form” rather than wide form

• Put all of the data into one data frame

• Create a “numYear” variable that is a numeric variable corresponding to the year. Recall from above

that the last two digits prior to D represent the year. The grabYear() function you read in above can be

used to extract that

• Use a for loop and the is.County() function to create an indexing vector that is 1 if a row corresponds to

a county level observation and a 0 if the row corresponds to any other observation (State, US, or DC

basically)

• Use the Vectorized version, is.CountyVec(), to create the same indexing vector

• Create two data sets – one data set that contains only county level data and one data set that contains

only non-county data

• Create a function that takes in a data frame that contains the non-county data, adds a column to the data

set called “Division” corresponding to the state’s classification of division here, and returns the data

frame. If row corresponds to a non-state (i.e. UNITED STATES), return ERROR for the division. This

will not be a ton of fun but can be made easier with the use of %in%.

• Your two final data sets should look like the following:

With your edSummaryData set (or whatever you called it), do the following:

• Create summary statistics for the overall mean for each division. Do this twice, once using tapply() and

once using the tidyverse method

• Use microbenchmark to compare the speeds (see Vectorized function video code for examples of this)

Lastly, create a function – the start of it is given here:

formatData <- function(path, sheet = 1, gatherCols = TRUE, colValue = "value", division = TRUE){

require(tidyverse)

}

• This function should read in a particular sheet of a census type data set (we’ll use it on the PST01.xls

data set shortly)

• The function should select the columns as above and filter out the “District of Columbia” rows

• The data set should be filtered to only include non-county data

• Optionally, the user should be able to return a “long form” data set with the value column equal to

whatever string the user specifies in the colValue argument

• If the long form data is being returned the user should be able to choose whether or not to append a

division column as well (if they don’t choose gatherCols = TRUE then it shouldn’t matter what the

division value or colValue things are set to)

• Use the function to read in the 2nd sheet of the PST01.xls data set. Read it in with gatherCols = TRUE

and give a value for colValue. Then also read it in with gatherCols = FALSE.

Below is the output you should get if you run the function with the above two settings:

formatData("PST01.xls", sheet = 2, colValue = "Population", division = TRUE, gather = TRUE)

formatData("PST01.xls", sheet = 2, gather = FALSE)


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp