APS106 – Lab #9
In this (final!) lab of APS106, you will practice using the pandas package and analyze Toronto bikeshare data from 2017-2018.
Lab Objectives
Load data from csv files into pandas DataFrames
Add columns to pandas DataFrames
Concatenate pandas DataFrames with similar columns
Extract information from datasets using pandas DataFrame. methods
Lab Deliverables
The following files must be submitted to Gradescope prior to the assignment deadline:
lab9.py
For this assignment, you should complete the following functions:
create_bikeshare_df (BONUS MARKS ONLY – See below)
rank_station_usage
average_trip_duration_by_membership
A Note on the Gradescope autograder. Five testcases are provided on Gradescope to help you prepare your solution. Passing all these testcases does not guarantee your code is correct.
You will need to develop your own testcases to verify your solution works. Your programs will be graded using ten secret testcases. These testcases will be released after the assignment
deadline.
IMPORTANT:
• Do not change the filename or function names
• Do not use input() inside your program
Problem
This week we will imagine that you have been contracted by a bikeshare company in Toronto to
analyze data regarding their bike and station usage in 2017 and 2018. Specifically, they would like to know the following:
1. Within 2017 and 2018, which stations were used most?
2. Within 2017 and 2018, what was the average trip duration for members and non-members? The bikeshare company has recorded the following information for each trip in 2017 and 2018:
1. trip_id – a number identifying the trip
2. trip_start_time – the date and time of the start of the trip
3. trip_stop_time – the date and time of the end of the trip
4. trip_duration_seconds – the length of the trip in seconds
5. from_station_id – the ID number of station where the bike was taken from at the start of the trip
6. from_station_name – the name of the station where the bike was taken from at the start of the trip
7. to_station_id – the ID number of the station where the bike was returned at the end of the trip
8. to_station_name – the name of the station where the bike was returned at the end of the trip 9. user_type – Whether the trip was taken by a member or a non-member
The bikeshare company has provided you with 8 csv files containing this trip data. Each file
contains trip information for one quarter (three month period) within 2017 and 2018. The files are
named using the following structure: “Bikeshare Ridership (year Qquarter number).csv” where year and quarter number are replaced with the year and quarter number when the data was collected. You can download the files from Quercus. We recommend that you open the files and review their structure before moving to the next part of this lab.
(OPTIONAL BONUS PART) Part 1 - Create a single DataFrame for all the data
Update April 9, 2024 – Using the pandas concat function was removed from the content of
APS106 W2024 after the lab was designed. You are NOT responsible for understanding how to use this function for the final exam.
Because this part of the lab was designed to have you use the concat function to create a single DataFrame with all the data from 2017 and 2018, we are making this part of the lab optional. If
you complete this function and pass the tests on Gradescope, you may receive up to 4 bonus
points on this lab that can be applied to increase your grade on any other labs where you did not
receive full marks. The bonus marks will not be applied to any other portion of the course (i.e., term tests, final exam, reflections) and your final total lab grade cannot exceed 100%.
If you choose not to complete this part of the lab, you may download the file
all_bikeshare_data.csv from quercus to complete and test the remainder of the lab.
In this part of the lab you will complete the create_bikeshare_df function. This function accepts a list of filename strings as input and returns a pandas DataFrame. containing the following columns:
Column name |
Type |
Description |
trip_id |
int |
A number identifying the trip |
trip_start_time |
str |
The date and time of the start of the trip |
trip_stop_time |
str |
The date and time of the end of the trip |
trip_duration_seconds |
int |
The length of the trip in seconds |
from_station_id |
float |
The ID number of station where the bike was taken from at the start of the trip |
from_station_name |
str |
The name of the station where the bike was taken from at the start of the trip |
to_station_id |
float |
The ID number of the station where the bike was returned at the end of the trip |
to_station_name |
str |
The name of the station where the bike was returned at the end of the trip |
user_type |
str |
Either “Member” or “Casual” (i.e., non-member) |
quarter |
int |
The quarter during which the trip was taken |
year |
int |
The year during which the trip was taken |
Columns 1-9 come directly from reading the csv files. Your function will need to add columns 10 (quarter) and 11 (year) to the DataFrame. The quarter value in each row should be an int (either 1, 2, 3, or 4) and the year should be an int (either 2017 or 2018). Your function should return a single DataFrame with all these columns. Your function should read each file within the list of files input to the function and concatenate the data from each file into a single DataFrame.
Example usage
Part 2 - Rank stations by usage
In this part of the lab, you will complete the rank_station_usage function. This function will analyze bikeshare data within a specific year and quarter to determine the stations that were used most frequently within that year and quarter. The function takes the following inputs:
1. bikeshare_df- A pandas DataFrame. containing bikeshare data formatted like the DataFrame. returned by the create_bikeshare_df function
2. year- an int specifying the year we want to analyze
3. quarter- an int specifying the quarter we want to analyze
The function should return a DataFrame. that is indexed by station names and has the following columns:
Column name |
Type |
Description |
departure_count |
int |
The total number of trips that started at that station during the specified year and quarter |
return_count |
Int |
The total number of trips that ended at that station during the specified year and quarter |
combined_departure_return_count |
int |
The sum of the departure and return count columns |
The returned DataFrame’s rows should be sorted in descending order according to the
combined_departure_return_count values (i.e., the station with the highest total usage should be in the first row of the returned DataFrame).
Example usage
Part 3 - Average trip length by membership status
In this part, you will complete the average_trip_duration_by_membership function. This function will compute the average trip length for both members and non-members (labeled as
“Casual” under the user_type column) during each year and quarter in the dataset.
The function will accept a single DataFrame. input, bikeshare_df,which will be formatted like the DataFrame. returned by the function in part 1.
The function should return a DataFrame with the following columns:
Column name |
Type |
Description |
year |
int |
The year for which the average trip length was calculated |
quarter |
int |
The quarter for which the the average trip length was calculated |
member_length |
float |
The average trip length in seconds for members during the year and quarter |
non_member_length |
float |
The average trip length in seconds for non-members during the year and quarter |
Example usage
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。