POLITECNICO di TORINO
Dipartimento di Elettronica e Telecomunicazioni
ICT4TS LABORATORY DESCRIPTION
Table of contents
Lab goals
Accessing the data
Collections
Lab description
Step 1 – Preliminary data analysis
Step 2 – Analysis of the data
Step 3 – Prediction using ARIMA models
Lab goals
The laboratory part of ICT4TS focuses on the analysis of free floating car sharing (FFCS) data collected from
open systems on the Internet. Data has been collected from the websites of FFCS systems like car2go or Enjoy in
Italy, and made available through a MongoDB database. The goal of the laboratory is twofold:
• to allow students to get used to ICT technologies typically used in the backend of smart society
applications: database and remote server access, and writing analytics using simple scripts
• to allow students to work on real data, trying to extract useful information specific to the
application domain (transport systems in this case), and get used to data pre-processing and filtering.
Accessing the data
Data is stored in a MongoDB server running at bigdatadb.polito.it:27017. It offers read-only
access to network clients for the following database:
• Database name: carsharing
• User: ictts
• Password: Ictts16!
• Requires SSL with self-signed certificates
You can use command line interfaces (e.g., the mongo shell) or GUIs (e.g., the Robomongo application) if
properly configured. For the mongo shell, you can use
mongo --host bigdatadb.polito.it --port 27017 --ssl \
--sslAllowInvalidCertificates -u ictts -p \
--authenticationDatabase carsharing
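The same connection can probably be opened from a script. The following is a minimal Python sketch, assuming the third-party pymongo driver is installed; the helper names build_uri and connect are illustrative and not part of the lab material. The URI options tls, tlsAllowInvalidCertificates and authSource are meant to mirror the --ssl, --sslAllowInvalidCertificates and --authenticationDatabase flags of the shell command above.

```python
from urllib.parse import quote_plus


def build_uri(user, password, host="bigdatadb.polito.it", port=27017,
              auth_db="carsharing"):
    """Build a MongoDB connection URI equivalent to the mongo shell command.

    User and password are percent-encoded, since characters such as '!' are
    not allowed verbatim in the credentials part of a URI.
    """
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"
        f"?tls=true&tlsAllowInvalidCertificates=true&authSource={auth_db}"
    )


def connect(user, password, db_name="carsharing"):
    """Return a handle to the database (hypothetical helper, needs pymongo)."""
    from pymongo import MongoClient  # imported lazily: third-party dependency

    client = MongoClient(build_uri(user, password))
    return client[db_name]


if __name__ == "__main__":
    # Only prints the URI; call connect(...) when the server is reachable.
    print(build_uri("ictts", "Ictts16!"))
```

Calling connect("ictts", "Ictts16!") should return a database object whose collections are accessed by name, e.g. db["PermanentBookings"].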
Collections
The system exposes 4 collections for Car2Go, which are updated in real time. Those are
• "ActiveBookings": Contains cars that are currently booked and not available
• "ActiveParkings": Contains cars that are currently parked and available
• "PermanentBookings": Contains all booking periods recorded so far
• "PermanentParkings": Contains all parking periods recorded so far
The same collections are available for Enjoy as well. Names are self-explanatory:
• "enjoy_ActiveBookings": Contains cars that are currently booked and not available
• "enjoy_ActiveParkings": Contains cars that are currently parked and available
• "enjoy_PermanentBookings": Contains all booking periods recorded so far
• "enjoy_PermanentParkings": Contains all parking periods recorded so far
For Torino and Milano, the system augments the booking information with additional information obtained
from the Google Maps service: walking, traveling, and public transportation alternatives. Not all of
them are available, due to the limited number of queries Google allows.
Lab description
Students work in groups of 3 colleagues. Each group is assigned three cities to analyse, as found in the
Google Drive document. Each group has to work on the project assignment, and submit a report of max 5
pages which describes the findings. Code, scripts, etc., must be added in an appendix.
Step 1 – Preliminary data analysis
To get used to both MongoDB and the data at your disposal, first investigate the collections and get used to the
documents and fields stored in each.
• How many documents are present in each collection?
• Why is the number of documents in PermanentParkings and PermanentBookings similar?
• For which cities is the system collecting data?
• When did the collection start? When did it end?
• Which timezone do the timestamps refer to?
Considering each city of your group, check
• How many cars are available in each city?
• How many bookings were recorded in December 2017 in each city?
• How many bookings also have the alternative transportation modes recorded in each city?
For each question, write the MongoDB query, and the answer you get. Add a brief comment, if useful, to
justify the result that you get.
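As an illustration of how such queries might be written from Python, the sketch below builds the filter for "bookings of one city in one month". It rests on assumptions that must be checked against a real document: the field names city and init_time are hypothetical, and timestamps are taken to be Unix seconds in UTC (which is exactly what the timezone question above asks you to verify).

```python
from datetime import datetime, timezone


def month_filter(city, year, month, city_field="city", time_field="init_time"):
    """Build a MongoDB filter for one city and one calendar month.

    ASSUMPTION: `city_field` and `time_field` are hypothetical field names,
    and timestamps are Unix seconds in UTC. Inspect a document first.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return {
        city_field: city,
        # Half-open interval [start, end): the first second of the next
        # month is excluded.
        time_field: {"$gte": int(start.timestamp()),
                     "$lt": int(end.timestamp())},
    }


december = month_filter("Torino", 2017, 12)

# With a pymongo database handle `db` (not created here), the filter could
# be used as follows:
#   db["PermanentBookings"].estimated_document_count()   # size of collection
#   db["PermanentBookings"].distinct("city")             # cities collected
#   db["PermanentBookings"].count_documents(december)    # bookings in month
```

If the timestamps turn out to be in local time rather than UTC, the month boundaries shift by the UTC offset and the counts near midnight on the first and last day change accordingly.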
Step 2 – Analysis of the data
Consider each city of your group, and the period of time of October 2017.
Consider the time series (city, timestamp, duration, locations). Process it to further analyse it by producing
the following plots and results:
1. Derive the Cumulative Distribution Function (CDF) of booking/parking durations, and plot them. Which
considerations can you derive from the results?
a. Which of the two CDFs extends to longer durations? Are there some outliers?
b. Does the CDF change for each city? Why?
c. Does the CDF change over time (e.g., aggregate per each week of data, or per each day of
the week)? Why?
2. Consider the system utilization over time: aggregate rentals per hour of the day, and then plot the
number of booked/parked cars (or percentage of booked/parked cars) per hour versus time of day.
3. Derive a filtering criterion to remove possible outliers (booking periods that are too short/too long), so as
to obtain rentals from bookings, filtering out system issues or problems with the data collection.
4. Filtering data as above, consider the system utilization over time again. Are you able to filter
outliers?
5. Filtering the data as above, compute the average, median, standard deviation, and percentiles of
the booking/parking duration over time (e.g., per each day of the collection).
a. Does it change over time?
b. Is it possible to spot any periodicity (e.g., weekends vs week days, holidays versus working
periods)?
6. Consider one city of your collection and check the position of the cars when parked, and compute
the density of cars during different hours of the day.
a. Plot the parking positions of cars at different times using Google Maps. You can use
Google Fusion Tables to get the plot in minutes -- check
https://support.google.com/fusiontables/answer/2527132?hl=en.
b. Divide the area using a simple square grid of 500 m x 500 m cells, compute the density of cars
in each cell, and plot the results using a heatmap (i.e., assigning a different colour to each
square to represent the density of cars).
c. Compute then the O-D matrix, i.e., the number of rentals starting in area i and ending in
area j. Try to visualize the results in a meaningful way.
7. [Optional] For the city of Torino or Milano, try to correlate the probability of a rental with the
availability of other transport means.
a. Extract those valid rentals for which there is also the data for alternative transport systems.
b. Consider one alternative transport system, e.g., public transports. Take the duration, and
divide it into time bins, e.g., [0,5)min, [5,10)min, [10,15)min, … Compute then the number
of rentals for each bin, i.e., the probability of seeing a rental given the duration of public
transport would be in a given interval. Plot the obtained histogram, and try to comment
the results.
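Some of the steps above (empirical CDF, duration filtering, grid cells and O-D matrix) can be sketched in a few lines of plain Python. This is only one possible approach under stated assumptions: the function names are illustrative, the duration thresholds are placeholders to be chosen from your own CDFs, and the grid uses a flat-earth (equirectangular) approximation with about 111,320 m per degree of latitude, which is reasonable at city scale only.

```python
import math
from collections import Counter


def ecdf(values):
    """Return the sorted values and the empirical CDF at each of them."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def filter_durations(durations_s, min_s, max_s):
    """Keep durations within [min_s, max_s] seconds.

    The thresholds are NOT given by the assignment: pick them from the CDFs.
    """
    return [d for d in durations_s if min_s <= d <= max_s]


def grid_cell(lon, lat, lon0, lat0, side_m=500.0):
    """Map a coordinate to (column, row) of a square grid anchored at
    (lon0, lat0), using an equirectangular approximation."""
    m_per_deg_lat = 111_320.0  # approximate metres per degree of latitude
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat0))
    col = math.floor((lon - lon0) * m_per_deg_lon / side_m)
    row = math.floor((lat - lat0) * m_per_deg_lat / side_m)
    return col, row


def od_matrix(trips, lon0, lat0, side_m=500.0):
    """Count rentals per (origin cell, destination cell) pair.

    `trips` is an iterable of ((lon, lat), (lon, lat)) origin/destination
    pairs; the result is a sparse matrix stored as a Counter.
    """
    counts = Counter()
    for (olon, olat), (dlon, dlat) in trips:
        origin = grid_cell(olon, olat, lon0, lat0, side_m)
        dest = grid_cell(dlon, dlat, lon0, lat0, side_m)
        counts[(origin, dest)] += 1
    return counts
```

The pairs returned by ecdf can be passed directly to a step plot, and counting grid_cell results for parked cars at a given hour gives the per-cell density needed for the heatmap.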
Step 3 – Prediction using ARIMA models
Consider the time series of rentals that you obtained in the previous steps, for each city. The goal of this lab
is to experiment with predictions using ARIMA models, and in particular to check how the error changes
with respect to parameter tuning. For this, you have to consider the various parameters in the ARIMA
modelling, including both the model parameters (p,d,q) and the training process, i.e., the training window
size N (how many past samples are used for training) and the training policy, i.e., expanding versus sliding
windows. A possible outline of the work could be
1. For each city, consider the October 2017 time series of the number of rentals recorded at each
hour.
2. Check that there are no missing samples (recall: ARIMA models assume a regular time series, with
no missing data). In case of missing samples, define a policy for filling in the missing data. For instance,
use the last value, the average value, replace with zero, replace with the average value for the given
time bin, etc.
3. Check if the time series is stationary or not. Decide accordingly whether to use differencing or not
(d=0 or not).
4. Compute the ACF and PACF to observe how they decrease. This is instrumental to guess possible
good values of the p and q parameters.
5. Decide the number of past samples to use for training N, and how many for testing. Given you have
about 30 days (each with 24 hours) you can consider for instance training during the first week of
data, and test the prediction in the second week of data.
6. Given N and (p,d,q), train a model and compute the error. Consider the MPE or MSE, so that you can
compare results for different cities (absolute errors would obviously not be directly comparable).
7. Now check the impact of parameters:
a. Keep N fixed, and do a grid search varying (p,d,q) and observe how the error decreases.
Choose the best parameters.
b. Given the best parameters, now change N and the learning strategy (expanding versus
sliding window). Always keep the testing on the same amount of data. For instance, if you
use 1 week for testing, you can use from 1 day to 3 weeks for training.
c. Compare results for the different cities. How does the relative error change w.r.t. the absolute
number of rentals?
8. [Optional] Try to see how the time horizon h of the prediction impacts the performance. Instead of
predicting the number of rentals at t+1, use the model to predict the future rentals at t+h, h in
[1:24].
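The difference between the sliding and expanding training policies can be made concrete with a short Python sketch. It is not a prescribed implementation: walk_forward, mape and arima_forecaster are illustrative names, the mean absolute percentage error is used here as one possible relative error metric (swap in the metric you choose to report), and the ARIMA wrapper assumes the third-party statsmodels package is installed. Any function mapping a training history to a one-step-ahead prediction can be plugged in.

```python
def walk_forward(series, n_train, n_test, forecaster, policy="sliding"):
    """One-step-ahead predictions for the n_test samples after n_train.

    sliding:   always train on the most recent n_train samples.
    expanding: train on everything from the start up to the current time.
    """
    if policy not in ("sliding", "expanding"):
        raise ValueError("policy must be 'sliding' or 'expanding'")
    if n_train + n_test > len(series):
        raise ValueError("not enough samples for training plus testing")
    predictions = []
    for t in range(n_train, n_train + n_test):
        start = t - n_train if policy == "sliding" else 0
        predictions.append(forecaster(series[start:t]))
    return predictions


def mape(actual, predicted):
    """Mean absolute percentage error, skipping samples whose actual is 0."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("all actual values are zero")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def naive_forecaster(history):
    """Baseline: predict the last observed value."""
    return history[-1]


def arima_forecaster(order):
    """Return a forecaster that refits ARIMA(p,d,q) on each window.

    Requires statsmodels; imported lazily so the rest runs without it.
    """
    from statsmodels.tsa.arima.model import ARIMA

    def forecast(history):
        fitted = ARIMA(list(history), order=order).fit()
        return float(fitted.forecast(steps=1)[0])

    return forecast
```

For example, with an hourly series for October, walk_forward(series, 7 * 24, 7 * 24, arima_forecaster((2, 0, 1)), "sliding") would train on one week and test on the next; the order (2, 0, 1) is only a placeholder. Comparing the result against the naive baseline helps judge whether the model actually adds value.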