POLITECNICO di TORINO

Dipartimento di Elettronica e Telecomunicazioni

ICT4TS LABORATORY DESCRIPTION

Table of contents

Lab goals
Accessing the data
Collections
Lab description
Step 1 – Preliminary data analysis
Step 2 – Analysis of the data
Step 3 – Prediction using ARIMA models

Lab goals

The laboratory part of ICT4TS focuses on the analysis of free-floating car sharing (FFCS) data collected from open systems on the Internet. Data has been collected from the websites of FFCS systems like car2go or Enjoy in Italy, and made available through a MongoDB database. The goal of the laboratory is twofold:

• to allow students to get used to the ICT technologies typically used in the backend of smart society applications: database and remote server access, and writing analytics using simple scripts;
• to allow students to work on real data, trying to extract useful information specific to the application domain (the transport system, in this case), and to get used to data pre-processing and filtering.

Accessing the data

Data is stored in a MongoDB server running at bigdatadb.polito.it:27017. It offers read-only access to clients connected over the network to the following database:

• Database name: carsharing
• User: ictts
• Password: Ictts16!
• Requires SSL with self-signed certificates

You can use command-line interfaces (e.g., the mongo shell) or GUIs (e.g., the Robomongo application), if properly configured. For the mongo shell, you can use:

mongo --host bigdatadb.polito.it --port 27017 --ssl \
--sslAllowInvalidCertificates -u ictts -p \
--authenticationDatabase carsharing
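The same connection can also be opened from a script. The following is a minimal Python sketch, assuming the pymongo driver is installed (the handout itself only mentions the mongo shell and Robomongo). The helper names client_options and connect are illustrative, and the keyword arguments are meant to mirror the shell flags above.

```python
# Connection settings copied from the text above; pymongo is an assumed client library.
HOST = "bigdatadb.polito.it"
PORT = 27017


def client_options(user="ictts", password="Ictts16!", auth_db="carsharing"):
    """Return MongoClient keyword arguments mirroring the mongo shell flags."""
    return {
        "host": HOST,                         # --host
        "port": PORT,                         # --port
        "username": user,                     # -u
        "password": password,                 # -p (the shell prompts for it instead)
        "authSource": auth_db,                # --authenticationDatabase
        "tls": True,                          # --ssl
        "tlsAllowInvalidCertificates": True,  # --sslAllowInvalidCertificates
    }


def connect():
    """Open a client and return a handle to the carsharing database."""
    # Imported lazily so the options can be inspected without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(**client_options())
    return client["carsharing"]
```

Calling db = connect() and then db.list_collection_names() should list the collections described in the next section, provided the server is reachable from your network.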

Collections

The system exposes 4 collections for Car2Go, which are updated in real time. These are:

• "ActiveBookings": Contains cars that are currently booked and not available
• "ActiveParkings": Contains cars that are currently parked and available
• "PermanentBookings": Contains all booking periods recorded so far
• "PermanentParkings": Contains all parking periods recorded so far

The same collections are available for Enjoy as well. Names are self-explanatory:

• "enjoy_ActiveBookings": Contains cars that are currently booked and not available
• "enjoy_ActiveParkings": Contains cars that are currently parked and available
• "enjoy_PermanentBookings": Contains all booking periods recorded so far
• "enjoy_PermanentParkings": Contains all parking periods recorded so far

For Torino and Milano, the system augments the booking information with additional information obtained from the Google Maps service: alternative options for walking, traveling, and public transportation. Not all of them are available, due to the limited number of queries Google allows.

Lab description

Students work in groups of three. Each group is assigned three cities to analyse, as listed in the Google Drive document. Each group has to work on the project assignment and submit a report of at most 5 pages describing its findings. Code, scripts, etc., must be added in an appendix.

Step 1 – Preliminary data analysis

To get used to both MongoDB and the available data, first investigate the collections and familiarise yourself with the documents and fields stored in each.

• How many documents are present in each collection?
• Why is the number of documents in PermanentParkings and PermanentBookings similar?
• For which cities is the system collecting data?
• When did the collection start? When did it end?
• What is the timezone of the timestamps?

For each city of your group, check:

• How many cars are available in each city?
• How many bookings were recorded in December 2017 in each city?
• How many bookings also have the alternative transportation modes recorded in each city?

For each question, write the MongoDB query and the answer you get. Add a brief comment, if useful, to justify the result that you get.
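As a starting point for the questions above, here is a hedged Python sketch of how such queries could be built with pymongo. The field names city and init_time, and the idea that init_time holds Unix timestamps in seconds, are assumptions that are not stated in this document: inspect a sample document first and adapt them. The month boundaries are computed in UTC, which is itself something to verify when answering the timezone question.

```python
from datetime import datetime, timezone


def month_filter(city, year, month, time_field="init_time"):
    """Build a filter for one city's documents starting within one calendar month (UTC).

    'city' and 'init_time' are assumed field names; check them against the data.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # First day of the following month; December rolls over to January.
    next_year = year + 1 if month == 12 else year
    next_month = 1 if month == 12 else month + 1
    end = datetime(next_year, next_month, 1, tzinfo=timezone.utc)
    return {
        "city": city,
        time_field: {"$gte": int(start.timestamp()), "$lt": int(end.timestamp())},
    }


def preliminary_counts(db, city):
    """Run a few Step 1 style queries on an open pymongo database handle."""
    bookings = db["PermanentBookings"]
    return {
        # Fast, metadata-based count of all documents in the collection.
        "total_bookings": bookings.estimated_document_count(),
        # Distinct values of the (assumed) city field.
        "cities": sorted(bookings.distinct("city")),
        # Bookings of one city that started in December 2017.
        "december_2017": bookings.count_documents(month_filter(city, 2017, 12)),
    }
```

For example, month_filter("Torino", 2017, 12) yields a dictionary that can be passed directly to find() or count_documents(). If the timestamps turn out to be in local time rather than UTC, the boundaries have to be shifted accordingly.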

Step 2 – Analysis of the data

Consider each city of your group and the period of October 2017.

Consider the time series (city, timestamp, duration, locations). Process it for further analysis by producing the following plots and results:

1. Derive the Cumulative Distribution Function (CDF) of the booking and parking durations, and plot them. What conclusions can you draw from the results?
   a. Which of the two CDFs extends to longer durations? Are there outliers?
   b. Does the CDF change from city to city? Why?
   c. Does the CDF change over time (e.g., aggregating per week of data, or per day of the week)? Why?

2. Consider the system utilization over time: aggregate rentals per hour of the day, and then plot the number (or percentage) of booked/parked cars per hour versus the time of day.
3. Derive a filtering criterion to remove possible outliers (booking periods that are too short or too long), so as to obtain rentals from bookings, filtering out system issues or problems with the data collection.
4. After filtering the data as above, consider the system utilization over time again. Are you able to filter out the outliers?
5. After filtering the data as above, compute the average, median, standard deviation, and percentiles of the booking/parking duration over time (e.g., for each day of the collection).
   a. Does it change over time?
   b. Is it possible to spot any periodicity (e.g., weekends versus weekdays, holidays versus working periods)?

6. Consider one city of your collection, check the position of the cars when parked, and compute the density of cars during different hours of the day.
   a. Plot the parking positions of cars at different times using Google Maps. You can use Google Fusion Tables to get the plot in minutes; see https://support.google.com/fusiontables/answer/2527132?hl=en.
   b. Divide the area using a simple square grid of 500 m × 500 m cells, compute the density of cars in each cell, and plot the results as a heatmap (i.e., assigning a different colour to each square to represent the density of cars).
   c. Then compute the O-D matrix, i.e., the number of rentals starting in area i and ending in area j. Try to visualize the results in a meaningful way.

7. [Optional] For the city of Torino or Milano, try to correlate the probability of a rental with the availability of other means of transport.
   a. Extract the valid rentals for which data on alternative transport systems is also available.
   b. Consider one alternative transport system, e.g., public transport. Take its duration and divide it into time bins, e.g., [0,5) min, [5,10) min, [10,15) min, … Then compute the number of rentals in each bin, i.e., the probability of seeing a rental given that the duration of public transport falls in a given interval. Plot the resulting histogram and comment on the results.
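Some of the building blocks of this step (the empirical CDF, a duration filter, and the grid used for the density and O-D matrix) can be sketched in Python as follows. This is one possible outline, not a reference solution. The record keys init_time, final_time, o_lat, o_lon, d_lat and d_lon are hypothetical names for values you extract from the documents, the filter thresholds are placeholders that you must justify from your own CDFs, and the grid uses a flat-earth approximation that should be adequate at city scale.

```python
import math
from collections import Counter

import numpy as np


def empirical_cdf(durations):
    """Return the sorted values x and, for each, the fraction of samples <= x."""
    x = np.sort(np.asarray(durations, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


def filter_rentals(bookings, min_s=120, max_s=3 * 3600):
    """Keep bookings whose duration in seconds lies in [min_s, max_s].

    The thresholds are placeholders, not recommended values.
    """
    return [b for b in bookings if min_s <= b["final_time"] - b["init_time"] <= max_s]


def grid_cell(lat, lon, lat0, lon0, cell_m=500.0):
    """Map a position to (row, col) on a square grid whose origin is (lat0, lon0)."""
    m_per_deg_lat = 111_320.0  # approximate metres per degree of latitude
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat0))
    row = int((lat - lat0) * m_per_deg_lat // cell_m)
    col = int((lon - lon0) * m_per_deg_lon // cell_m)
    return row, col


def od_matrix(rentals, lat0, lon0, cell_m=500.0):
    """Count rentals for each (origin cell, destination cell) pair."""
    counts = Counter()
    for r in rentals:
        origin = grid_cell(r["o_lat"], r["o_lon"], lat0, lon0, cell_m)
        dest = grid_cell(r["d_lat"], r["d_lon"], lat0, lon0, cell_m)
        counts[(origin, dest)] += 1
    return counts
```

The pair returned by empirical_cdf can be passed to a plotting call such as matplotlib's plt.step(x, y), and a logarithmic x axis usually makes the outliers easier to see. Storing the O-D counts sparsely in a Counter avoids allocating a dense matrix over all cell pairs, most of which are expected to be empty.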

Step 3 – Prediction using ARIMA models

Consider the time series of rentals that you obtained in the previous steps, for each city. The goal of this lab is to experiment with predictions using ARIMA models, and in particular to check how the error changes with respect to parameter tuning. For this, you have to consider the various parameters in ARIMA modelling, including both the model parameters (p,d,q) and the training process, i.e., the training window size N (how many past samples are used for training) and the training policy, i.e., expanding versus sliding windows. A possible outline of the work could be:

1. For each city, consider the October 2017 time series of the number of rentals recorded in each hour.
2. Check that there are no missing samples (recall that ARIMA models assume a regular time series, with no missing data). If samples are missing, define a policy for filling in the missing data. For instance, use the last value, the average value, replace with zero, replace with the average value for the given time bin, etc.
3. Check whether the time series is stationary. Decide accordingly whether to use differencing or not (d=0 or not).
4. Compute the ACF and PACF to observe how they decrease. This is instrumental in guessing good candidate values for the p and q parameters.
5. Decide the number of past samples N to use for training, and how many to use for testing. Given that you have about 30 days (each with 24 hours), you can for instance train on the first week of data and test the prediction on the second week of data.
6. Given N and (p,d,q), train a model and compute the error. Consider the MPE or MSE, so that you can compare results for different cities (absolute errors would obviously not be directly comparable).
7. Now check the impact of the parameters:
   a. Keep N fixed, do a grid search varying (p,d,q), and observe how the error changes. Choose the best parameters.

   b. Given the best parameters, now vary N and the learning strategy (expanding versus sliding window). Always keep the testing on the same amount of data. For instance, if you use 1 week for testing, you can use from 1 day to 3 weeks for training.
   c. Compare results for the different cities. How does the relative error change w.r.t. the absolute number of rentals?
8. [Optional] Try to see how the time horizon h of the prediction impacts the performance. Instead of predicting the number of rentals at t+1, use the model to predict the future rentals at t+h, with h in [1:24].
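The outline above can be sketched in Python as follows, assuming numpy and the statsmodels library (the handout does not prescribe a toolkit). All function names are illustrative. The hourly series builder implements only the "replace with zero" policy for missing samples. For the error, the sketch uses the mean absolute percentage error (MAPE), which is a substitution for the MPE named above, chosen here so that positive and negative errors do not cancel out.

```python
import numpy as np


def hourly_counts(start_times, t0, n_hours):
    """Count rentals starting in each hour after t0 (Unix seconds); empty hours stay 0."""
    counts = np.zeros(n_hours, dtype=int)
    for t in start_times:
        k = int((t - t0) // 3600)
        if 0 <= k < n_hours:
            counts[k] += 1
    return counts


def train_slice(t, n, policy="sliding"):
    """Return (start, end) so that series[start:end] is the training data for sample t."""
    if policy == "expanding":
        return 0, t              # all the past, growing with t
    if policy == "sliding":
        return max(0, t - n), t  # only the last n samples
    raise ValueError(f"unknown policy: {policy}")


def mape(actual, predicted):
    """Mean absolute percentage error, skipping samples whose actual value is 0."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0
    return float(np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100)


def walk_forward(series, order, n_train, n_test, policy="sliding"):
    """Refit an ARIMA(p,d,q) model before each one-step-ahead forecast and return the MAPE."""
    # Imported lazily so the helpers above work without statsmodels installed.
    from statsmodels.tsa.arima.model import ARIMA

    series = np.asarray(series, dtype=float)
    preds = []
    for t in range(n_train, n_train + n_test):
        start, end = train_slice(t, n_train, policy)
        fitted = ARIMA(series[start:end], order=order).fit()
        preds.append(float(fitted.forecast(steps=1)[0]))
    return mape(series[n_train:n_train + n_test], preds)
```

A call such as walk_forward(series, (2, 0, 2), n_train=168, n_test=168) trains on one week of hourly samples and tests on the following week. The order (2, 0, 2) is an arbitrary example, not a suggested value. A grid search simply repeats this call over the candidate (p,d,q) triples. Refitting the model at every step is slow but keeps the sliding and expanding policies directly comparable.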
