Programming in R - Week 2 Assignment
IPAL - The University of Chicago
Due: Sunday, July 12, 2020 at 11:59pm on Canvas
Structure
This assignment will focus on gathering, analyzing, and plotting real data. The answers here are much more
open ended than those in Problem Set 1, and there may not be an obvious “right” way to do things. Try
your best and record any assumptions or major choices you make as comments. Like before, this problem
set will be broken into three sections, each worth 16 points.
The goal of this assignment is to explore the relationship between temperature and homicides in Chicago.
We will use temperature data from the National Oceanic and Atmospheric Administration (NOAA) and
crime data from the City of Chicago Data Portal. The temperature data is pre-collected, but you will need
to retrieve the crime data yourself via an API.
Start by creating a new project/folder for this assignment. Create a new R script to save your code. For
each chunk of code you create, please preface it with a comment describing what your code is doing.
For example, your answers might look like this:
# Loading in saved homicide data
homicides <- read_csv("homicides.csv")
# Finding the number of crimes for each police district
homicides %>%
  group_by(district) %>%
  summarize(count = n())
Section 1: Visualizing Weather Data
The provided weather data (ohare_temps.csv) comes from the NOAA weather station at O’Hare Airport.
It includes a timestamp and corresponding temperature (in Fahrenheit) for each hour since January 1st, 2001.
Using functions from the tidyverse and lubridate packages, start by reading the provided CSV and
converting the timestamp column to a datetime format. Next, extract the year, month, day, week, and hour
from your datetime-formatted column and make them into separate columns, called year, month, day, week,
and hour.
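The steps above can be sketched as follows. This is only an illustration under stated assumptions: the column names `timestamp` and `temp` and the "YYYY-MM-DD HH:MM:SS" text format are guesses, since the layout of ohare_temps.csv is not shown here, and a small toy tibble stands in for the real file.

```r
library(tidyverse)
library(lubridate)

# Toy stand-in for read_csv("ohare_temps.csv").
# ASSUMPTION: the real file may use different column names and formats.
temps <- tibble(
  timestamp = c("2001-01-01 00:00:00", "2001-01-01 15:00:00", "2002-07-04 03:00:00"),
  temp      = c(20.5, 31.0, 68.2)
)

temps <- temps %>%
  mutate(
    timestamp = ymd_hms(timestamp),  # parse text into a datetime
    year  = year(timestamp),
    month = month(timestamp),
    day   = day(timestamp),
    week  = week(timestamp),
    hour  = hour(timestamp)
  )

temps
```

If the real timestamps use a different layout, another lubridate parser (for example `mdy_hm()`) may be the better fit.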
Calculate the average temperature for each year and save your results to a new dataframe. Your results
should look similar to the ones below:
head(temps_avg, n = 4)
## # A tibble: 4 x 2
## year mean_temp
## <dbl> <dbl>
## 1 2001 51.3
## 2 2002 51.2
## 3 2003 49.3
## 4 2004 50.4
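One possible way to produce a summary of this shape is a grouped mean. The sketch below uses toy values, and the input column name `temp` is an assumption; only `year`, `mean_temp`, and `temps_avg` come from the output shown above.

```r
library(tidyverse)

# Toy hourly data; `temp` is an assumed column name.
temps <- tibble(
  year = c(2001, 2001, 2002, 2002),
  temp = c(50, 52, 40, 44)
)

# One row per year with the mean of all readings in that year
temps_avg <- temps %>%
  group_by(year) %>%
  summarize(mean_temp = mean(temp, na.rm = TRUE))

head(temps_avg, n = 4)
```

`na.rm = TRUE` is a choice worth recording in a comment, since it silently drops missing readings.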
Answer the following questions using code and comments:
1. What are the two coldest years in your summarized dataset? Exclude 2020 because it has no
summer or fall data.
2. On average, across all months and years, what hour of the day is the hottest?
3. What day has the largest swing in temperature from 3 AM to 3 PM? (Hint: There are many ways to
calculate this. I suggest filter() and spread() or the lag() function.)
4. What week in the dataset had the largest year-to-year absolute change in average temperature? In
other words, comparing the average temperature of weeks across all years, what week had the greatest
change from the same week in the previous year?
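The two techniques named in the hint can be illustrated on toy data. This is a sketch, not a full answer: the `date` and `temp` columns are assumptions, and `pivot_wider()` is used as the current tidyr counterpart of `spread()`, which would also work.

```r
library(tidyverse)

# Reshaping approach: put the 3 AM and 3 PM readings side by side.
# ASSUMPTION: a `date` column has already been derived from the timestamp.
temps <- tibble(
  date = as.Date(c("2001-01-01", "2001-01-01", "2001-01-02", "2001-01-02")),
  hour = c(3, 15, 3, 15),
  temp = c(10, 30, 20, 25)
)

swings <- temps %>%
  filter(hour %in% c(3, 15)) %>%
  pivot_wider(names_from = hour, values_from = temp, names_prefix = "h") %>%
  mutate(swing = abs(h15 - h3)) %>%
  arrange(desc(swing))

# lag() approach: compare each week with the same week one row earlier.
# ASSUMPTION: every year is present for every week, so the previous row
# really is the previous year.
weekly <- tibble(
  year      = c(2001, 2002, 2003, 2001, 2002, 2003),
  week      = c(1, 1, 1, 2, 2, 2),
  mean_temp = c(20, 35, 30, 25, 24, 26)
)

changes <- weekly %>%
  arrange(week, year) %>%
  group_by(week) %>%
  mutate(change = abs(mean_temp - lag(mean_temp))) %>%
  ungroup() %>%
  arrange(desc(change))
```

The first year of each week has no earlier value, so its `change` is `NA` and sorts to the bottom.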
Finally, using your summarized dataset, do your best to replicate the following plot. Note: 2020 is excluded.
[Figure: "Average Temperature by Year in Chicago (2001−2019)"; y-axis: Temperature °F; annotation: "Chicago's polar vortex year"; caption: "Source: NOAA Weather Station, O'Hare Airport"]
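A rough starting point for a plot with these labels might look like the sketch below. The geometry (line plus points) is an assumption because the original image is not available here. The data are just the four rows printed earlier, and the annotation is placed on the coldest of those rows purely for illustration; it is not a claim about which year the original figure labels.

```r
library(tidyverse)

# The four yearly averages printed in the example output above
temps_avg <- tibble(
  year      = 2001:2004,
  mean_temp = c(51.3, 51.2, 49.3, 50.4)
)

# Illustration only: label the coldest row in this small sample
coldest <- temps_avg %>% slice_min(mean_temp, n = 1)

p <- ggplot(temps_avg, aes(x = year, y = mean_temp)) +
  geom_line() +
  geom_point() +
  annotate(
    "text",
    x = coldest$year, y = coldest$mean_temp,
    label = "Chicago's polar vortex year", vjust = 1.5
  ) +
  labs(
    title   = "Average Temperature by Year in Chicago (2001-2019)",
    x       = NULL,
    y       = "Temperature °F",
    caption = "Source: NOAA Weather Station, O'Hare Airport"
  )

p
```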
Section 2: Downloading and Summarizing Homicide Data
The City of Chicago keeps a fairly comprehensive database of crimes, which can be found here:
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2. Within this database there are
records of all the homicides committed in Chicago since 2001.
We want to extract only these records. However, our previous method of downloading a CSV and using
filter() to keep only the records we want is unlikely to work because the crimes dataset CSV is multiple
GB in size. Instead, we can use the Data Portal’s API to grab only the records we’re interested in. This can
be accomplished in two ways:
1. Use the RSocrata library and connect to the crimes API. There is example documentation on the city
website.
2. Use the raw API and the read_json() function from the jsonlite library to directly query the API
and read JSON data into R.
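As a hedged sketch of the second option, the query can be written as a URL with filter parameters. Several details below are assumptions to check against the Data Portal's API documentation: the `/resource/ijzp-q8t2.json` endpoint form, the `primary_type` field name, and the `$limit` parameter (Socrata endpoints usually return only a limited page of rows unless a limit is given). `build_homicide_url()` is a hypothetical helper, not part of any library.

```r
library(jsonlite)

# Hypothetical helper that assembles the query URL.
# ASSUMPTIONS: endpoint path, `primary_type` field, and `$limit` parameter.
build_homicide_url <- function(limit = 50000) {
  paste0(
    "https://data.cityofchicago.org/resource/ijzp-q8t2.json",
    "?primary_type=HOMICIDE",
    "&$limit=", format(limit, scientific = FALSE)
  )
}

url <- build_homicide_url()
print(url)

# Not run here because it needs network access:
# homicides <- read_json(url, simplifyVector = TRUE)
#
# The RSocrata route would pass a similar URL to read.socrata():
# homicides <- RSocrata::read.socrata(url)
```

Whichever route you take, check the number of rows returned before moving on, since a truncated download is easy to miss.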
Your dataset of all homicides should contain 10,000 rows. Once you’ve successfully read the data into R,
use lubridate functions similar to those you used on the temperature data to extract the year, month, day,
week, and hour columns.
Next, answer the following questions using code and comments:
1. What year had the highest number of homicides in Chicago?
2. What hour, on average, has the most homicides?
3. What community areas had the lowest number of homicides? Use a join and community area data
from the Data Portal to determine the names of each community area.
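The join in question 3 might be sketched as below on toy data. All column names (`community_area`, `area_num`, `community`) and the area names are placeholders; the real fields depend on the datasets you download. Starting from the lookup table keeps areas that have no homicides at all, which a count of the homicide rows alone would drop.

```r
library(tidyverse)

# Toy homicide records; ASSUMPTION: the area is stored as text
homicides <- tibble(community_area = c("25", "25", "43"))

# Toy lookup table of community areas (placeholder names)
areas <- tibble(
  area_num  = c("25", "43", "77"),
  community = c("AREA A", "AREA B", "AREA C")
)

area_counts <- areas %>%
  left_join(
    count(homicides, community_area, name = "n_homicides"),
    by = c("area_num" = "community_area")
  ) %>%
  mutate(n_homicides = replace_na(n_homicides, 0L)) %>%  # no match means zero
  arrange(n_homicides)

area_counts
```

The key columns must have the same type in both tables; converting one side with `as.character()` or `as.integer()` before the join is a common fix.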
Finally, replicate the plot below to the best of your ability:
[Figure: "Homicides Over Time in Chicago"; caption: "Source: City of Chicago Data Portal"]
Section 3: Combining Both Datasets
Finally, we want to combine aggregated data from both datasets into a single plot. First, find the mean
temperature and mean number of homicides by week across all years. Then, merge your results and replicate
the plot below to the best of your ability.
[Figure: "Homicides vs Temperature in Chicago (2001 − 2019)"; axis labels: Week, Average # of Homicides, Average Temp; legend "Type": Homicides, Temp]
What is potentially wrong with this plot? Is there a way we could improve it? What might explain the
phenomenon that it shows? Answer in a comment.
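The aggregation and merge for this section could be sketched as follows on toy data. Column names are assumptions carried over from the earlier sections, and the plot is deliberately simplified: both series share one y-axis here, which may not match the layout of the original figure.

```r
library(tidyverse)

# Toy inputs; ASSUMPTION: both tables already have `year` and `week` columns
temps <- tibble(
  year = c(2001, 2001, 2002, 2002),
  week = c(1, 2, 1, 2),
  temp = c(20, 30, 24, 34)
)
homicides <- tibble(
  year = c(2001, 2001, 2001, 2002, 2002, 2002, 2002),
  week = c(1, 2, 2, 1, 2, 2, 2)
)

# Mean temperature per week across all years
weekly_temp <- temps %>%
  group_by(week) %>%
  summarize(avg_temp = mean(temp))

# Count homicides per year-week, then average those counts per week.
# Caveat: a year-week with no homicides has no row, so it is not counted as 0.
weekly_hom <- homicides %>%
  count(year, week, name = "n") %>%
  group_by(week) %>%
  summarize(avg_homicides = mean(n))

# Merge, then reshape to long form so `type` can drive the legend
combined <- inner_join(weekly_temp, weekly_hom, by = "week") %>%
  pivot_longer(c(avg_temp, avg_homicides), names_to = "type", values_to = "value")

p <- ggplot(combined, aes(x = week, y = value, color = type)) +
  geom_line() +
  labs(title = "Homicides vs Temperature in Chicago", x = "Week", color = "Type")

p
```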