Transport Modelling in The Age of Big Data: Cuauhtémoc Anda Pieter Fourie Alexander Erath
Cuauhtémoc Anda
Pieter Fourie
Alexander Erath
Work Report June 2016
Transport Modelling in the Age of Big Data June 2016
New Big Data sources such as mobile phone call data records, smart card data and geo-coded
social media records make it possible to observe and understand mobility behaviour at an
unprecedented level of detail. Despite the availability of such new Big Data sources, transport
demand models used in planning practice are still based almost exclusively on conventional data
such as travel diary surveys and population censuses. This literature review brings together
recent advances in harnessing Big Data sources to understand travel behaviour and to inform
travel demand models that can compute what-if scenarios. From trip identification to activity
inference, we review and analyse the existing data-mining methods that enable these
opportunistically collected mobility traces to inform transport demand models. We find that
future research should tap into the potential of probabilistic models as commonly used in data
science. Such data mining approaches are designed to handle the uncertainty of sparse and noisy
data, as is the case for mobility traces derived from mobile phone data. In addition, data fusion
approaches should be applied to integrate disparate but related datasets, blending Big Data with
more granular information from travel diaries. At the same time, we acknowledge that
sophisticated modelling knowledge has developed in the domain of transport planning, and we
therefore strongly advise that domain expert knowledge should remain the foundation when
applying data-driven approaches in transport planning. These new challenges call for
multidisciplinary collaboration between transport modellers and data scientists.
1 Introduction
Before the ubiquitous sensing of human mobility flows became possible through mobile phones,
public transport smart card transactions or Global Positioning System (GPS)-enabled devices, it
was difficult and expensive to generate the large-scale or even population-wide samples needed
to develop travel demand models. The aim of such models is not only to replicate actual mobility
flows with sufficient accuracy but also to evaluate, through what-if scenarios, the impact
of different infrastructure development decisions.
Despite the availability of such new Big Data sources, transport demand models used in planning
practice are still based almost exclusively on conventional data such as travel diary surveys
and population censuses. While the applied statistical models have become more sophisticated as
computational power grew exponentially over the last decades, the most important change being
the evolution from trip-based to activity-based models, the basic modelling paradigm has remained
the same: travel diary surveys that cover only a small sample of the actual population are
used to synthesise the transport flows of a representative population.
New Big Data sources such as mobile phone call data records, smart card data and geo-coded
social media records allow us to observe and understand mobility behaviour at an unprecedented
level of detail. But simply observing is not particularly helpful for planning purposes. To allow
for prediction in what-if scenarios, we need to understand and contextualise the information
contained in such Big Data sources to inform models of travel behaviour and adapt them to be
useful in travel demand modelling frameworks.
This literature review brings together recent advances1 in harnessing Big Data
sources to understand travel behaviour and inform travel demand models that can compute
what-if scenarios. To this end, we first provide a primer on the latest advances in transport
demand modelling, including the latest agent-based approaches.
Our focus is then on research that makes use of the relevant Big Data sources and ties directly
into the methodological toolkit of travel demand models, hence excluding methods that primarily
derive real-time analytics from such Big Data sources. The aim of the paper is to provide the
reader with an overview, from a methodological angle, of how Big Data already improves the
understanding of mobility flows and has been applied in transport demand models. From this
collection, we identify the advantages and disadvantages of the various methodologies and their
applicability to predictive transport models. The conclusions drawn from this literature review
include a description of the new modelling applications that the new data sources enable, as well
as a specification of the research gaps that need to be overcome to realise them.
1
Mainly from 2010 to the first quarter of 2016
are regularly cited as key shortcomings when it comes to the evaluation of travel demand
management policies such as mobility pricing.
The fundamental principle of activity-based models is the understanding that travel is ultimately
derived from the need to participate in activities. The aim of activity-based models is to
predict, for each individual, the number, sequence and type of activities conducted over a
certain time period, subject to a set of spatial, temporal and resource constraints. However, while
activity-based models can generate a spatially and temporally disaggregate description of
travel demand, for route choice and traffic simulation this travel demand is often aggregated again
into so-called origin-destination (OD) matrices that describe how many trips are conducted
between each OD pair. This restriction was originally due to the lack of simulation models
suitable for simulating traffic over a relevant spatial extent (i.e. entire cities or regions) and
across an entire day, but it still applies today due to the computational requirements of
agent-based transport simulation.
In addition to the data requirements of the four-step model, activity-based models require one
additional type of input: a ’synthetic population’ at the level of individual households and
persons that is representative of the actual population of the area of interest. This ’synthetic
population’ includes a set of socio-demographic attributes which are then used in the travel
demand modelling process. Moreover, for every agent in the synthetic population, a fully
descriptive daily activity plan, including the locations of daily activities such as work or
education, needs to be derived.
Agent-based transport models
Agent-based transport models for strategic transport planning
usually derive travel demand from activity-based modelling approaches but employ a microscopic
and completely time-dynamic traffic simulation of each agent’s individual demand, based on
system constraints given by the transport network and its attributes (Balmer, Axhausen and
Nagel, 2006).
While the original development of TRANSIMS (Smith et al., 1995) as the first large-scale
agent-based transport simulator clearly focused on replacing aggregated transport assignment
methods, later implementations of TRANSIMS and more recent agent-based models such as
MATSim (Horni et al., 2016), SimMobility (Adnan et al., 2016) and SimAGENT (Goulias et
al., 2012) also integrate, to different degrees, mode, time, destination and activity scheduling
processes into a single consistent modelling framework. Such an integrated modelling framework
allows travel demand to be kept in disaggregated form throughout the whole modelling process.
Besides the enhanced behavioural consistency, this also allows the modelling and analysis of
modern travel demand management tools such as time- and demand-dependent pricing and new
forms of mobility such as shared and autonomous vehicles.
Multi-agent-based modelling is built upon a large number of autonomous agents which make
their own decisions and interact with one another and with the environment. Each agent is
assigned an initial daily activity plan: a precise description of the activities’ locations,
durations, start and end times, and the trips connecting consecutive activities, including mode
and route.
Among the several agent-based transport models currently under continued development,
MATSim takes a special role and can be considered the most widely applied model at present.
It can integrate a wide range of decision dimensions in a co-evolutionary learning loop but,
thanks to its modular framework, it can also be used for traffic simulation only and integrated
with other activity-based travel demand models.
In MATSim, a day is simulated multiple times, and after each iteration a fraction of the agents is
allowed to modify their plans (i.e. a mutation/crossover phase). For instance, they can change
their departure time, the travel mode of a sub-tour or the location of a given type of activity,
among others. At the end of each simulated day, the utility of the day is measured for each agent
using a scoring function that rewards agents for performing activities while penalising them for
travelling, transferring between transport modes, waiting at transit stops, arriving late for
activities, etc. In this way, agents seek to improve their utility over iterations until the system
reaches an equilibrium where the generalised utility can no longer be improved (i.e. a steady
state is reached) (Balmer et al., 2009).
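The scoring step described above can be illustrated with a minimal sketch. It is not MATSim's actual scoring function: the linear form, the weights and the names `ExecutedPlan`, `score` and `best_plan` are all assumptions made for illustration. It only shows the idea of rewarding activity time, penalising travel-related components, and keeping the better-scoring plan.

```python
from dataclasses import dataclass

# Hypothetical marginal utilities, chosen only for illustration.
BETA_PERFORM = 6.0     # reward per hour spent performing activities
BETA_TRAVEL = -6.0     # penalty per hour spent travelling
BETA_WAIT = -2.0       # penalty per hour waiting at transit stops
BETA_TRANSFER = -1.0   # penalty per transfer between modes
BETA_LATE = -18.0      # penalty per hour of late arrival at activities


@dataclass
class ExecutedPlan:
    """Aggregate outcome of one agent's plan after a simulated day."""
    name: str
    activity_hours: float
    travel_hours: float
    waiting_hours: float
    transfers: int
    late_hours: float


def score(plan: ExecutedPlan) -> float:
    """Linear utility: activities are rewarded, everything else is penalised."""
    return (BETA_PERFORM * plan.activity_hours
            + BETA_TRAVEL * plan.travel_hours
            + BETA_WAIT * plan.waiting_hours
            + BETA_TRANSFER * plan.transfers
            + BETA_LATE * plan.late_hours)


def best_plan(plans):
    """Select the plan with the highest score among an agent's alternatives."""
    return max(plans, key=score)


# Two alternative plans of the same agent (invented numbers).
early = ExecutedPlan("depart 07:30", 8.0, 1.0, 0.25, 1, 0.0)
late = ExecutedPlan("depart 08:15", 8.0, 0.75, 0.0, 1, 0.5)
chosen = best_plan([early, late])
```

In an iterative setting, a fraction of agents would receive a mutated plan in each iteration and the scores of executed plans would guide plan selection until scores stop improving.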
Due to their wide coverage in urban settings, the main focus of this review is first on
data generated through Smart Card Automatic Fare Collection (SC-AFC) systems and mobile
phone networks. Both can be classified as large-scale opportunistic human mobility
sensors, which are able to provide insights into urban dynamics and human activities at an
unprecedented scale and level of detail. A further advantage is that no additional infrastructure is
needed to extract mobility information, since they were designed to collect for public transport
In addition, we also cover research working with GPS data. Its high resolution and accuracy
have allowed the creation of preprocessing and data mining techniques as well as inference models
that go beyond SC-AFC and mobile phone data. Moreover, the recent drop in cost, which has
enabled the widespread use of GPS sensors in mobile phones and other devices, points to an
obvious potential for widespread application as urban mobility sensors in the future. Other
supplementary datasets reviewed are Points of Interest (POI) information, census and survey
data, and land use information. As noted by Calabrese et al. (2015), the purposes of the
supplementary datasets are threefold: 1) to validate findings extracted from the analysis of
large-scale human mobility sensors; 2) to define scaling factors to extend results to the overall
population; and 3) to augment information about urban space, so as to extract higher-level
patterns.
Among ubiquitous computing devices, mobile phones have the highest penetration
rate. While conventional mobile phones usually exchange information with
cell towers only sporadically, the widespread use of smartphones (the penetration rate of
smartphones in Singapore reaches 88% of the population2) provides appealing new opportunities
to inform travel demand models. Smartphones not only exchange data much more frequently with
the mobile network provider, allowing more continuous tracking, they also carry a series of
additional sensors, such as GPS, that can be used to better understand mobility patterns. For
instance, Airsage3 is a company that processes mobile phone data for transport planning
applications.
Mobile phone networks
Whether GSM, CDMA or LTE, mobile phone networks require
regular and frequent handshakes (i.e. pings) between mobile phone devices and cellular
communication antennas. In order to provide service to users, mobile phone networks
constantly and frequently determine the location of mobile phone devices, even if they
are simply on standby. The user’s location is approximated by the location of the cell
antenna closest to the handset. This results in a precision equal to the size of the cell antenna
coverage, which can be in the range of a few hundred metres in urban areas.
In order to understand the events that generate the user location updates we need first to know
how the mobile phone network is constituted. The service coverage area of a given mobile phone
2
The Connected Consumer Survey 2014/2015. https://www.consumerbarometer.com/
3
www.airsage.com
network is divided into smaller areas of hexagonal shape, referred to as cells. Each cell is the
area in which one can communicate with a certain base station antenna (also referred to as a cell
tower). A Location Area (LA) is a geographic area covered by base station antennas belonging
to the same group. For users to be reachable wherever they are in the network coverage area,
their location has to be updated; the location update procedures can be divided into
network-triggered and event-triggered.
Most studies in the literature that analyse individual mobility
patterns from mobile phone data make use of Call Detail Records (CDR), a subset of the
mobile network data that is used for billing purposes but also includes event-triggered updates.
However, there is also a series of studies that make use of the full spectrum of mobile network
data, including both the event-triggered and network-triggered updates.
Moreover, the precision of cell antenna location data can be improved to between a dozen and a
hundred metres when information is triangulated with signals from other cell antennas
(International Transport Forum, 2015). By using the Timing Advance, a value that corresponds to
the length of time a signal takes to reach the cell tower from a mobile phone, information can be
triangulated from different cell antennas to obtain a more accurate estimate of the user’s
location. Other localisation techniques are based on the signal strength received by the mobile
phone, using known irradiation diagrams and propagation models.
Smart Card Automatic Fare Collection (SC-AFC) systems are built on radio-frequency
identification (RFID) technology. RFID uses radio waves to exchange data
between a reader and an electronic tag for the purpose of identification and transaction operations.
Specifically, SC-AFC systems use NFC (Near Field Communication) technology, which ensures
secure short-range RFID transactions. Since 1990, the use of smart cards has become significant
in many sectors, since they are perceived as a secure method of user validation and fare payment
(Trépanier et al., 2007).
In public transport, users commonly have to tap their smart card on the
reader device when entering a bus or metro station and tap out at the alighting bus stop or
metro station. Besides revenue collection, large quantities of detailed individual information are
collected, such as boarding times, boarding stations, alighting times, alighting stations and
vehicle identification. This represents a huge potential for better understanding travel behaviour
and improving current transport systems.
2.2.3 GPS
Purpose: Model Transferability
GPS information allows the collection of a detailed spatio-temporal trace describing the mobility
of an individual. Smartphones are usually equipped with GPS sensors, and some
conventional mobile phones nowadays include them too. In order to calculate a position, GPS uses
information from at least four satellites in a method called trilateration. In open areas an
accuracy of up to 5 metres can be achieved, but it degrades in areas where GPS signals are
impaired by tall buildings or trees, and inside buildings.
Assisted-GPS (A-GPS) increases location accuracy in urban areas by combining GPS location
signals with cellular location data, providing a precision of under 10 metres. Similarly, other
forms of hybridised GPS location systems can include the use of Wi-Fi network signals through
the tracking of media access control addresses (MAC addresses) within a network of Wi-Fi
routers.
2.2.4 POIs
Purpose: Urban Space Augmentation
Points of Interest (POIs) are a list of businesses and important places to visit in a city, including their
name, classification and location. There are many possible sources, such as Yellow Pages, Google
Places and Yahoo PlaceFinder, which might provide different information. For instance, Google
Places additionally provides opening times, reviews and hourly estimates of the crowdedness of
a place. Furthermore, opportunistic POI datasets can be derived from crowdsourcing platforms and
social networks (e.g. Foursquare, Flickr, Twitter, etc.). The importance of these datasets lies in their
potential to serve as complementary datasets that augment urban space information and thus improve
the inference of activities and places.
Census and surveys provide datasets related to very different areas: demography, health, education,
government and security, communication and transport, etc. Such datasets can be used to: 1)
validate home and working areas, 2) validate city patterns such as hotspots, commuting and traffic
flows, and 3) validate land use. The main advantage is the very fine spatial resolution, which is
often the census block. The main disadvantage is that they are usually updated only every 5 to
10 years.
Another use of census information is that it provides the means to scale up
information derived from large-scale human mobility sensors.
While the use of census and other conventional survey data has a long tradition in travel demand
modelling and is well documented, a review of the related work does not fit the scope of this
review. However, since census and survey data are still relevant for enriching mobile phone data
through data fusion, this review brings together pioneering work in those areas.
Land use datasets offer access to various information that makes it possible to characterise an
area based on its planned and actual land use. Such land use data usually specify, for each plot,
the designated usage purpose and, in the case of built-up environments, also the usage intensity.
However, as different authorities use different land use classifications, the developed models are
usually customised to local conditions, which restricts their direct transferability between
regions. Within the scope of this review, land use data are of interest due to their potential to
impute the purposes of activities identified from SC-AFC and mobile phone data.
Several efforts have been made to improve on the original idea by Barry et al. (2002). Zhao et al.
(2007) expanded the idea to rail-to-bus sequences. Trépanier, Tranchant and Chapleau (2007)
incorporated the possibility of looking at the next day, and even observing weekly travel patterns, to
complete missing information for the bus system of Gatineau in Quebec. Munizaga and Palma
(2012) proposed a multimodal public transport methodology using time constraints instead of
distance constraints. For these studies, the reported success rate varies from 66% to 80% of the
individual trips reconstructed.
Furthermore, a different approach to reconstructing individual trips from smart card data,
based on a semi-supervised learning algorithm, has been presented by Yuan, Wang, Zhang, Xie
and Sun (2013). They proposed an integrated learning method in which they align the monetary,
geospatial and temporal spaces to extrapolate a series of critical domain-specific constraints.
They incorporate those constraints in a semi-supervised conditional random field algorithm to
infer the exact boarding and alighting stops even when trip records have unknown
boarding and alighting information. Given only 10% of trips with known alighting/boarding stops,
they inferred more than 78% of the alighting and boarding stops of trips with missing information.
The relevance of the work lies not only in the reconstruction of origin-only labelled trips, but
also in providing a systematic way to recover individual mobility histories from urban-scale
smart card transactions. This can be helpful as a pre-processing stage for later analysis or for
incorporation into transport demand models.
Relying on smart card data alone limits the ability to identify the spatio-temporal
dimension of an individual performing an activity, since public transport is not necessarily used
for all trips throughout a day. Chakirov and Erath (2012) describe this limitation with the
concept of public transport trip consistency, in which consistency means that a person who
arrived at the activity location by public transport also has to leave it by
public transport after ending the activity. Although smart card data do not record any means of
transport other than public transport, the most obvious cases of inconsistency can be identified
by analysing the distances between the alighting location of the last journey and the boarding
location of the following journey. This makes it possible to identify whether other means of
transport such as taxi, car or walking
For instance, on a typical workday in Singapore, Chakirov and Erath (2012) found that, for
persons with more than one journey recorded in the smart card data, 90% of the journeys start
less than 1 km away from the previous alighting location. This indicates, firstly, that the majority
of public transport users do not switch to other modes of transport between public transport
journeys and therefore have consistent journey chains and, secondly, that the area of
possible activity locations can be delimited with some degree of uncertainty.
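The consistency check can be sketched as follows. This is a minimal illustration, not the procedure of Chakirov and Erath (2012): the tuple layout, the function names and the use of great-circle distance are assumptions, and the 1 km value simply mirrors the figure quoted above.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def consistent_links(journeys, max_gap_km=1.0):
    """Flag each pair of consecutive journeys of one card as consistent or not.

    journeys: chronologically ordered tuples
              (board_lat, board_lon, alight_lat, alight_lon).
    A link is consistent when the next boarding lies within max_gap_km
    of the previous alighting location.
    """
    flags = []
    for prev, nxt in zip(journeys, journeys[1:]):
        gap = haversine_km(prev[2], prev[3], nxt[0], nxt[1])
        flags.append(gap < max_gap_km)
    return flags


# Invented coordinates: the second journey starts next to the first alighting
# stop, the third one starts far away (another mode was probably used).
card_journeys = [
    (1.300, 103.800, 1.350, 103.850),
    (1.351, 103.851, 1.300, 103.800),
    (1.400, 103.900, 1.350, 103.850),
]
flags = consistent_links(card_journeys)
```

Links flagged as inconsistent hint at an unobserved trip by another mode, while consistent links bound the area in which the intermediate activity took place.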
Once individual trips have been reconstructed to the point of knowing boarding locations and
final destinations, a possible application is the calculation of a public transport OD matrix. It is
important to take into account trips which could not be reconstructed. A typical solution for
this situation is to build expansion factors. The work by Munizaga and Palma (2012) shows
how to build expansion factors for smart card data trips associated with an origin but not with a
destination, and for trips associated with no origin or transaction. For the former, it is
assumed that the distribution of trips is the same as that of other trips with the same origin, whilst
for the latter, the distribution of trips is based only on their temporal disaggregation.
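The first of these two cases can be sketched in a few lines. This is an illustrative reading of the idea, not the implementation of Munizaga and Palma (2012): the function name `expand_od`, the dictionary layout and the numbers are assumptions. Trips with a known origin but no destination are spread over destinations in proportion to the fully observed trips from the same origin, which is equivalent to one expansion factor per origin.

```python
from collections import defaultdict


def expand_od(known_od, origin_only):
    """Scale observed OD counts to absorb trips that lack a destination.

    known_od:    {(origin, destination): number of fully reconstructed trips}
    origin_only: {origin: number of trips with a known origin only}
    Assumes every origin in known_od has a positive number of observed trips.
    """
    origin_totals = defaultdict(float)
    for (origin, _dest), trips in known_od.items():
        origin_totals[origin] += trips

    expanded = {}
    for (origin, dest), trips in known_od.items():
        # Expansion factor per origin: (observed + unassigned) / observed.
        factor = 1.0 + origin_only.get(origin, 0) / origin_totals[origin]
        expanded[(origin, dest)] = trips * factor
    return expanded


known = {("A", "B"): 30, ("A", "C"): 10, ("B", "A"): 20}
unassigned = {"A": 8}  # 8 trips boarded at A with unknown alighting stop
expanded = expand_od(known, unassigned)
```

The second case (trips with no origin) would need a further factor per time period, which is left out of this sketch.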
Such is the case of the work by Chakirov and Erath (2012). They proposed a multinomial logit
model with activity duration, activity start time and land use as the utility variables and
a discrete choice set consisting of work activity, home activity and other activity as the target
labels. Utilities for the model were constructed mainly using piecewise linear functions. For
activity duration and start time, the utility function was calibrated using information
from the local travel diary survey, whilst the land use information stemmed from the urban
planning authority’s Master Plan.
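A minimal sketch of such a model is given below. The breakpoints, utility values and the `land_use_bonus` argument are invented for illustration and are not the calibrated parameters of Chakirov and Erath (2012); only the structure (piecewise linear utilities fed into a multinomial logit over home, work and other) follows the description above.

```python
import math
from bisect import bisect_right


def piecewise_linear(x, xs, ys):
    """Linear interpolation through breakpoints (xs, ys), flat outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


# Hypothetical utility breakpoints: activity duration in hours.
DURATION_UTILITY = {
    "home": ([0, 6, 12, 24], [-2.0, 0.0, 2.0, 2.0]),
    "work": ([0, 4, 9, 14], [-3.0, 0.0, 2.5, 0.0]),
    "other": ([0, 1, 4, 10], [1.0, 1.5, 0.0, -2.0]),
}
# Hypothetical utility breakpoints: activity start time as hour of day.
START_UTILITY = {
    "home": ([0, 12, 18, 24], [0.0, -1.0, 1.0, 1.5]),
    "work": ([0, 6, 9, 14, 24], [-2.0, 0.0, 2.0, -1.0, -2.0]),
    "other": ([0, 8, 12, 20, 24], [-1.0, 0.0, 0.5, 0.5, -1.0]),
}


def activity_probabilities(duration_h, start_h, land_use_bonus=None):
    """Multinomial logit probabilities for the labels home, work and other."""
    land_use_bonus = land_use_bonus or {}
    utilities = {
        label: piecewise_linear(duration_h, *DURATION_UTILITY[label])
        + piecewise_linear(start_h, *START_UTILITY[label])
        + land_use_bonus.get(label, 0.0)
        for label in DURATION_UTILITY
    }
    u_max = max(utilities.values())  # subtract the maximum for numerical stability
    exp_u = {label: math.exp(u - u_max) for label, u in utilities.items()}
    total = sum(exp_u.values())
    return {label: e / total for label, e in exp_u.items()}


# A nine-hour activity starting at 08:30 in a commercial zone.
p_long = activity_probabilities(9.0, 8.5, {"work": 0.5})
# A half-hour activity starting at 12:30.
p_short = activity_probabilities(0.5, 12.5)
```

The most probable label can then be attached to each detected activity, or the full probability vector can be kept to carry the uncertainty forward.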
After having obtained the home and work locations, a validation process is required to determine
the model’s accuracy. One possible validation process is the one used by Yuan et al. (2013). A
2D Kernel Density Estimation (KDE) was applied to identify hot spots that enabled a comparison
against data derived from travel diary surveys.
The work of Bouman (2012) represents a first attempt to implement an agent-based
microsimulation of public transport for the cities of Amsterdam and Rotterdam. Based only on smart
card data, the main challenge of the work is the generation of the agents’ activity plans. For this
task, the focus is on extracting commuters’ home-work-home patterns by looking at several days
of the same user. Work and home stations are identified as the two most visited stations during
weekdays, of which home is identified as the station most visited during the weekend.
For smart card IDs whose data could not be fitted to such a pattern, but for which at least an
activity chain could be reconstructed for one particular day, travel demand was specified by
introducing dummy activities at the intermediate stations of the tour. Finally, for highly irregular
travel patterns, a new agent was generated for each of the remaining trips.
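The home and work station rule can be sketched as follows. The function name, the input layout and the tie-breaking behaviour are assumptions for illustration; the station names and dates are invented.

```python
from collections import Counter
from datetime import date


def infer_home_work(visits):
    """Infer (home, work) stations from one card's station visits.

    visits: iterable of (date, station) pairs covering several days.
    The two most visited weekday stations are the home/work candidates;
    the candidate visited more often at weekends is labelled home.
    Returns None when fewer than two weekday stations are observed.
    Ties are resolved in favour of the first candidate (an assumption).
    """
    visits = list(visits)
    weekday = Counter(s for d, s in visits if d.weekday() < 5)
    weekend = Counter(s for d, s in visits if d.weekday() >= 5)
    candidates = [station for station, _count in weekday.most_common(2)]
    if len(candidates) < 2:
        return None
    first, second = candidates
    if weekend[first] >= weekend[second]:
        return first, second
    return second, first


# One invented week: Monday 6 June to Sunday 12 June 2016.
visits = []
for day in range(6, 11):  # Monday to Friday
    visits += [(date(2016, 6, day), "Centraal"), (date(2016, 6, day), "Zuid")]
visits += [
    (date(2016, 6, 8), "Airport"),
    (date(2016, 6, 11), "Zuid"),
    (date(2016, 6, 11), "Beach"),
    (date(2016, 6, 12), "Zuid"),
]
home, work = infer_home_work(visits)
```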
The synthetic-population generation process suffers from different limitations that mainly stem
from the various assumptions imposed in the modelling process. Opportunities for future research
can be identified in a more accurate and validated representation of the actual travel demand
including the inference of trip purposes and socio-demographic characteristics of the traveller.
To this end, we consider the long observation periods of smart card data an opportunity to
apply modern data mining techniques to infer additional information. Along this line, Bouman
et al. (2013), for example, explored how the concept of eigenbehaviours (Eagle and Pentland,
2009) can be applied to derive spatio-temporal patterns.
Another challenge when using smart card data for simulation is to model potential interactions of
public transport vehicles with other transport modes (i.e. cars). A recent work by Fourie, Erath,
Ordóñez Medina, Chakirov and Axhausen (2016) develops a simplified agent-based transport
simulation for Singapore’s public transport. In contrast to Bouman (2012), the interaction with
private vehicles was accounted for by introducing a stochastic model of the speed of buses
between public transport stops and of bus dwell time behaviour at stops. This not only reduces
the simulation time substantially, but also makes it possible to predict the operational stability
of alternative public transport schedules, so that system-wide network redesigns can be
simulated. For this purpose, they adapted three inputs to the MATSim environment.
Firstly, a reconstruction of bus trajectories from smart card data was developed. Given all boarding
and alighting transactions of bus users, the position in space and time of the corresponding buses
was estimated. They imputed the time it takes for a bus to travel between bus stop locations by
grouping its transactions at each stop into sets that represent bus dwell operations. Then, from
the reconstructed bus trajectories, they determined the number of services and the start time of
each service for every bus line in Singapore. For the particular case of train services, they took
the start times from the Google Transit Feed Specification (GTFS) since, especially during peak
hours, train trajectories cannot be reliably inferred from public transport smart card data (Sun
et al., 2012).
Secondly, they needed to generate activity plans for each agent in the simulation. To that end,
they established a 25 min threshold to identify the final alighting location of each multistage
trip, so as not to split journeys at transfer points. Smart card records only document boarding
times, not when a person actually arrived at the bus stop. Given that the average headway for
most bus services is 10 minutes or shorter, they assumed a uniform arrival time distribution
and randomly drew the actual arrival time at the bus stop, with the only parameter being the
corresponding headway between consecutive services of the specified line.
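Both steps can be illustrated with a short sketch. The function names and the tuple layout are hypothetical, and treating a gap of exactly 25 minutes as a transfer is an assumption; the code is not taken from Fourie et al. (2016).

```python
import random


def merge_stages(stages, threshold_min=25.0):
    """Group chronologically ordered stages of one card into journeys.

    stages: tuples (board_time_min, alight_time_min, board_stop, alight_stop),
            with times in minutes since midnight.
    A stage continues the current journey when it is boarded at most
    threshold_min minutes after the previous alighting (i.e. a transfer).
    """
    journeys = []
    for stage in stages:
        if journeys and stage[0] - journeys[-1][-1][1] <= threshold_min:
            journeys[-1].append(stage)
        else:
            journeys.append([stage])
    return journeys


def sample_stop_arrival(board_time_min, headway_min, rng=random):
    """Draw the arrival time at the stop uniformly within one headway before boarding."""
    return board_time_min - rng.uniform(0.0, headway_min)


stages = [
    (480.0, 500.0, "A", "B"),    # 08:00 to 08:20
    (510.0, 530.0, "B", "C"),    # boarded 10 min after alighting: a transfer
    (1020.0, 1050.0, "C", "A"),  # boarded hours later: a new journey
]
journeys = merge_stages(stages)
final_alighting_stops = [journey[-1][3] for journey in journeys]
arrival = sample_stop_arrival(480.0, 10.0, random.Random(0))
```

The final alighting stop of each journey marks the activity location, and the sampled arrival time gives the end of the preceding activity plus access time.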
Lastly, as only public transport vehicles were simulated, a simplified network offered the
opportunity to lower the computational demands of the simulation substantially. Instead of the
MATSim queue model, a stochastic travel time model was introduced to model travel times
between two subsequent stops. The model was fitted based on a multinomial regression model
assuming that stop-to-stop travel times follow a normal distribution (Fourie, 2014). As shown by
Sarlas and Axhausen (2015), the parameters that determine the speed of vehicles on a network
link are related not only to the level of demand on the link (taken from smart card data), but
also to the topographical information contained in the network description. To account for dwell
time variability in the simulation framework, they included the model presented by Sun,
Tirachini, Axhausen, Erath and Lee (2013).
As a case study to showcase the abilities of the model, they simulated the impact of splitting
one of the longest bus lines in Singapore. The results suggested that incidences of bus bunching
can be significantly reduced during the morning peak hour, and that headway reliability is also
improved considerably.
The work of Fourie et al. (2016) shows one possible integration of big data algorithms
within an agent-based transport modelling framework. They enhanced the simulation results of
a complex model by introducing what is known in the computer science literature as a machine
learning surrogate. They substitute a component of the whole model with the behaviour
obtained from the statistics of the smart card data. The result is not only a more accurate
representation of the real world, but also an improvement in the overall computation time.
However, several limitations still have to be addressed, such as implementing passing behaviour
in the queue simulation, the reconstruction of train trajectories, and a better representation of
walking, waiting and transfer activities to better represent route and mode choice preferences.
In addition, public transport smart card data obviously do not contain any information on
motorised travel demand and active mobility. To tackle the aforementioned problems, additional
datasets can be included.
Moreover, for the different studies found, the quality of the data also depends on the type of
dataset provided by the mobile network operator. While some studies work with the full
spectrum of mobile phone management signals (e.g. location area (LA) handles, device updates),
others work with the Call Detail Records (CDR) subset, or with already preprocessed data (e.g.
triangulated location estimates). Thus, the general challenge in using mobile phone data is how
to robustly extract people’s trip sequences from sparse and noisy measurements and enrich the
extracted trips with semantic meaning (i.e. trip purpose) (Widhalm et al., 2015).
In the following, we present a series of studies that aimed to reconstruct individual trips from
mobile phone data in order to extract mobility patterns. Thereby, we focus on approaches that
allow the generation of OD-matrices and applications with potential to be adopted for agent-
based simulations. For this purpose, we have divided the relevant literature based on their
methodological approach and their scope. The first group of studies resembles the
traditional trip-based approach, a second group focuses on extracting stay locations from noisy
mobile phone traces, and the third group attempts to infer activities performed at the extracted
locations. At the end of the chapter, we present the studies that have used
mobile phone data specifically for the generation of agent-based simulations.
This idea was then formalised by Wang, Hunter, Bayen, Schechtner and González (2012) with
the definition of the transient OD matrix. The concept recognises that even if segments of the
trip are unobserved in the CDR (e.g. real origin and destination locations) due to mobile phone
inactivity, still a large portion of the actual ODs are retained and can be used to analyse road usage
patterns. The transient OD matrices are constructed by simply counting trips for each pair of
consecutive calls made within the same hour from two different towers, and then the OD trips
are assigned to the road network by a shortest path algorithm.
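The counting step behind the transient OD matrix can be sketched as follows. This is a minimal illustration, not the implementation of Wang et al. (2012): the record layout `(user_id, timestamp, tower_id)`, the function name `transient_od` and the reading of "within the same hour" as "at most one hour apart" are assumptions made here.

```python
from collections import Counter
from datetime import datetime, timedelta


def transient_od(records, max_gap=timedelta(hours=1)):
    """Count tower-to-tower trips from consecutive records of the same user.

    records: iterable of (user_id, timestamp, tower_id) tuples (assumed layout).
    A trip is counted when two consecutive records of a user come from
    different towers and are at most `max_gap` apart (assumed reading of
    "within the same hour").
    """
    by_user = {}
    for user, ts, tower in records:
        by_user.setdefault(user, []).append((ts, tower))

    od = Counter()
    for events in by_user.values():
        events.sort()  # chronological order per user
        for (t0, origin), (t1, destination) in zip(events, events[1:]):
            if origin != destination and t1 - t0 <= max_gap:
                od[(origin, destination)] += 1
    return od
```

The resulting counts would still have to be assigned to the road network, for example with a shortest path algorithm as described above.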
Similarly, Iqbal, Choudhury, Wang and González (2014) demonstrate the development of OD
matrices using CDRs from Dhaka, Bangladesh, and traffic counts from a video vehicle detection
system. Firstly, tower-to-tower transient OD matrices are generated and then associated with
corresponding nodes of the traffic network converting them to node-to-node transient OD
matrices. Then, the transient OD matrices are scaled up to match the traffic counts. To determine
the scaling factors, an optimisation-based approach is used which minimises the differences
between observed and simulated traffic counts at the points where the traffic counts are available.
Lastly, for the estimation of the final OD matrix they introduce correction factors to account for
the mobile phone market penetration rates and mobile phone usage.
As discussed by Jiang et al. (2015) one of the problems of the trip-based approach is that it can
introduce biases when CDR data are low in spatial resolution. In addition, the former methods
are not able to handle noisy measurements from raw mobile phone traces. In order to avoid
these issues, the approach of the next group of studies presented is based on parsing the
trajectories observed into stay-locations.
In addition to a lower spatial resolution in comparison with GPS traces, mobile phone data
suffers from a phenomenon called supersonic jumps or signal jumps (i.e. outliers). These are
events that suddenly occur kilometres away within a short period of time. Although such jumps
are usually system-inherent data noise, some jumps might be triggered by external mechanisms
that aim to protect the privacy of the users (Horn et al., 2014). To use mobile phone data for
accurate traffic modelling, these shortcomings must be considered in order to derive realistic
trajectories.
Temporal-based clustering Schlaich, Otterstätter and Friedrich (2010) work with Location
Area (LA) updates from a region of southwest Germany. The algorithm proposed is built on
the principle that if a user remains a considerably longer time in a location area than the time
required for directly traversing the area, the user potentially starts or ends a trip in the respective
location area. For this purpose they suggested a 60 min rule: if the time period between
the first login and the last logout of an LA visited multiple times is 60 min or more, the LA is considered
to be a stay location. In addition, as a strategy to deal with signal jumps, they calculated a
jumpiness factor and deleted user entries that exceeded a given threshold.
Certainly, the approach suffers from several limitations due to the resolution of the trips extracted
being at the broad LA level and not the cell-tower area level. For instance, in their preprocessing
step, the decision to delete consecutive data points from users with the same location areas
prevents them from estimating arrival times and activity durations. Also, since their method requires
a minimum of three LAs, they deleted users that show records of fewer than three different LAs.
Hence, a more robust methodology would be desirable, one that can handle noisy signals and outlier
points without the need to delete any entries and that can estimate trips at the cell-tower
resolution.
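The 60 min rule described above can be sketched in a few lines. The input layout (one tuple per LA visit with login and logout times in minutes) and the function name are assumptions for illustration; the jumpiness factor is not reproduced because its definition is not given here.

```python
def stay_location_areas(visits, min_stay_minutes=60):
    """Return the location areas (LAs) that qualify as stay locations.

    visits: list of (la_id, login_minute, logout_minute) for one user
    (assumed layout). An LA qualifies when the span between its first
    login and its last logout is at least `min_stay_minutes`.
    """
    first_login, last_logout = {}, {}
    for la, login, logout in visits:
        first_login[la] = min(login, first_login.get(la, login))
        last_logout[la] = max(logout, last_logout.get(la, logout))
    return {
        la for la in first_login
        if last_logout[la] - first_login[la] >= min_stay_minutes
    }
```

Because the rule only looks at the first login and the last logout, it cannot distinguish one long stay from several short visits to the same LA, which is one reason why finer-grained methods are discussed next.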
Distance-based clustering Calabrese, Lorenzo, Liu and Ratti (2011) proposed a method to
identify trips at the cell-tower level based on Call Detail Records (CDRs) generated from phone
calls, messages and internet usage. They also included a strategy to handle noisy traces
employing different clustering techniques. For the preprocessing step they first characterised
the individual calling activity and verified that it was frequent enough to allow monitoring the
user’s movement over time with a fine enough resolution. Then, they applied a low-pass filter
with a 10-minute resampling rate and a clustering technique to identify minor oscillations
around a common location. As for the extraction of stay-points, they performed a distance-based
clustering to fuse points within a 1 km area. The centroid of the cluster was defined to be a
virtual location and in a final step, individual trips were reconstructed by connecting the paths
from the identified virtual locations. However, the methodology does not robustly filter out
pass-by events.
At this point, we can recapitulate and reformulate the main ideas towards extracting places from
raw mobile phone data according to the data mining pipeline suggested by Jiang et al. (2015).
First, the need to eliminate outlier noise and signal jumps between towers. Secondly, the need to
cluster points that are spatially close and temporally adjacent into a single location. And thirdly,
the need to agglomerate points that are spatially close but not necessarily consecutive in
time, since we are interested in the unique stay locations that a user frequents.
In addition, an estimate of activity start times and durations is also needed. The following
studies represent the latest endeavour (from 2013 to 2015) to mine location points from mobile
phone data.
Preprocessing techniques For the first goal, Horn, Klampfl, Cik and Reiter (2014) perform an
evaluation on three different types of filters to detect outliers on mobile phone traces: a
Recursive Naive Filter, a Recursive Look-Ahead Filter, and a Kalman Filter. On the one hand,
the first two basically act as low-pass filters (Calabrese et al., 2011). They smooth out large
positioning errors by introducing an upper bound constraint on the travel speed. Hence, the
speed is calculated for each consecutive pair of points (Recursive Naive Filter), or each
triad of points (Recursive Look-Ahead Filter) and compared to a certain threshold. On the other
hand, the Kalman Filter is a probabilistic approach that reconstructs the trajectory. The results
demonstrated that the Recursive Look-Ahead Filter performed better as it eliminated the outlier
points, and in addition maintained the accuracy of the trajectories. Although the Kalman Filter
also eliminated the outlier points, the trajectories lost accuracy. However, the satisfactory results
of Ficek and Kencl (2012) in extending the spatial resolution of the Reality Mining Dataset (Eagle
and Pentland) using a Gaussian Mixture Model suggest that, given the low resolution of CDRs,
more complex probabilistic filters are needed in order to outperform the naïve approaches.
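A speed-threshold filter of the kind described above can be sketched as follows. It is a simplified illustration in the spirit of the naive pairwise filter, not the exact filter evaluated by Horn et al. (2014); the trace layout `(t_seconds, lat, lon)`, the 200 km/h bound and the function names are assumptions.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def speed_filter(trace, max_speed_kmh=200.0):
    """Drop points that imply an implausible speed from the last kept point.

    trace: list of (t_seconds, lat, lon) sorted by time (assumed layout).
    """
    kept = []
    for t, lat, lon in trace:
        if kept:
            t0, lat0, lon0 = kept[-1]
            hours = (t - t0) / 3600.0
            if hours <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_km((lat0, lon0), (lat, lon)) / hours > max_speed_kmh:
                continue  # treated as a signal jump
        kept.append((t, lat, lon))
    return kept
```

A known weakness of this pairwise variant is that it trusts the first point of the trace; if that point is itself an outlier, the following valid points are discarded, which is the kind of failure a look-ahead over three points is meant to reduce.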
Time-distance clustering For the second and third goals (i.e. location extraction), Jiang et al.
(2013, 2015), Toole et al. (2015), and Alexander et al. (2015) used time and distance clustering
techniques to filter out pass-by points. Firstly, they grouped points that are spatially close by
measuring the distance between two consecutive points and comparing it to a distance
threshold (e.g. roaming distance of 300 m). Then, the clusters obtained are considered to be
potential stays if the first and the last observation in the cluster are separated by
a time greater than a time threshold (e.g. 10 min). Then, the geographic location of the potential
stay is set to be at the centroid of all points within the cluster. Due to noise in locations, multiple
potential stays that are actually the same place may be estimated at a slightly different geographic
coordinate on different observation days. To account for this, a final agglomerative clustering
algorithm is used to consolidate candidate stays to a single semantic location regardless of the
temporal sequence of the records.
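The two clustering steps described above can be sketched as follows. This is a hedged illustration rather than the cited authors' code: coordinates are assumed to be projected to metres, the function names are hypothetical, and the final consolidation uses a simple greedy assignment as a stand-in for a full agglomerative clustering algorithm.

```python
import math


def _close_cluster(cluster, min_stay_s):
    """Return (x, y, t_first, t_last) if the cluster lasts long enough, else None."""
    if not cluster or cluster[-1][0] - cluster[0][0] <= min_stay_s:
        return None
    n = len(cluster)
    return (sum(p[1] for p in cluster) / n, sum(p[2] for p in cluster) / n,
            cluster[0][0], cluster[-1][0])


def detect_stays(trace, roam_m=300.0, min_stay_s=600.0):
    """Group consecutive nearby points and keep groups that last long enough.

    trace: list of (t_seconds, x_m, y_m) sorted by time (assumed layout).
    """
    stays, cluster = [], []
    for point in trace:
        if cluster and math.hypot(point[1] - cluster[-1][1],
                                  point[2] - cluster[-1][2]) > roam_m:
            stay = _close_cluster(cluster, min_stay_s)
            if stay is not None:
                stays.append(stay)
            cluster = []
        cluster.append(point)
    stay = _close_cluster(cluster, min_stay_s)
    if stay is not None:
        stays.append(stay)
    return stays


def merge_stays(stays, merge_m=300.0):
    """Label each stay with a location index, ignoring temporal order.

    Greedy stand-in for agglomerative clustering: a stay joins the first
    existing location whose running centroid lies within `merge_m`.
    """
    locations, labels = [], []  # locations hold [sum_x, sum_y, count]
    for x, y, _, _ in stays:
        for i, (sx, sy, n) in enumerate(locations):
            if math.hypot(sx / n - x, sy / n - y) <= merge_m:
                locations[i] = [sx + x, sy + y, n + 1]
                labels.append(i)
                break
        else:
            locations.append([x, y, 1])
            labels.append(len(locations) - 1)
    return labels
```

Points recorded while passing through an area form clusters that are too short-lived to pass the time threshold, which is how pass-by observations are filtered out.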
Trip Validation It is important to verify that, for the algorithms proposed, users with more phone
activity do not show systematic differences in travel behaviour, for instance that there is no
correlation between the number of places detected and mobile phone usage. Jiang et
al. (2015) segmented users into 5 groups according to their total number of phone
usage observations. Then they examined for each group the daily travel patterns, including daily
number of trips, and daily number of unique destinations. Finally, they compared the frequency
distributions of both the daily number of trips and the daily number of unique locations and concluded
that they follow similar patterns.
Activity start times and durations After having identified the stay locations, Widhalm et al.
(2015) continue the study by estimating the arrival time as the average between the earliest record
in the arrival activity (i.e. the upper bound of the arrival time), and a lower bound estimate,
calculated as the sum of the latest record at the previous location plus the travel time between
the previous and present location. The travel time was determined as the distance between the
consecutive cluster centroids divided by an assumed travel speed. The same process was
performed for the expected time of departure, and the activity duration was calculated as the
difference between the two estimates.
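The midpoint estimate described above can be written out as a small sketch. The function names, the default travel speed of 10 m/s and the clamping of the bounds when they cross are assumptions made for illustration and are not taken from Widhalm et al. (2015).

```python
def estimate_arrival(last_seen_prev_s, first_seen_here_s, distance_m, speed_mps=10.0):
    """Midpoint between the earliest possible and latest possible arrival.

    Lower bound: last record at the previous location plus travel time.
    Upper bound: first record at the present location.
    """
    lower = last_seen_prev_s + distance_m / speed_mps
    upper = first_seen_here_s
    lower = min(lower, upper)  # assumption: clamp if the travel time overshoots
    return 0.5 * (lower + upper)


def estimate_departure(last_seen_here_s, first_seen_next_s, distance_m, speed_mps=10.0):
    """Mirror image of estimate_arrival for the departure time."""
    lower = last_seen_here_s
    upper = first_seen_next_s - distance_m / speed_mps
    upper = max(lower, upper)  # assumption: clamp if the travel time overshoots
    return 0.5 * (lower + upper)


def activity_duration(arrival_s, departure_s):
    """Duration as the difference between the two estimates."""
    return departure_s - arrival_s
```

For example, a user last seen at the previous location at second 1000, first seen at the present location at second 2000, with 5 km between the centroids, gets a lower bound of 1500 and thus an estimated arrival at second 1750.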
An alternative way to infer arrival/departure times of activities is that of Alexander et al.
(2015). They proposed to use probability density functions of activity durations derived from
the National Household Travel Survey. They constructed six hourly distributions for weekdays
and weekends and the following trip purposes: home-based work (HBW), home-based other
(HBO), and non-home based (NHB). Then, they randomly generated the departing time within
the time window of observations, using the distribution that corresponded to the day (weekday,
weekend) and the trip purpose (HBW, HBO, NHB).
Several authors (Jiang et al., 2015; Alexander et al., 2015; Toole et al., 2015) have proposed
calculating expansion factors by identifying the tower a user is connected to while at home.
Since the ratio of mobile phone users to the population is not uniform within the research region,
each user is assigned a home census area, and expansion factors are computed for each area as
the ratio of the total population living in the area to the number of users whose home tower
was identified inside the area. General OD matrices are built, and if a specific mode of
transport is to be considered, the vehicle OD matrix is approximated by weighting the total
number of user trips by the vehicle usage rate in the home census zones. However, the limitation
of the methodology is the generalisation of travel patterns by census areas, which might introduce
biases in the vehicle trip ODs.
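The home-based expansion described above can be sketched as follows. The data layouts (dictionaries keyed by user and by census area, trips as `(user, origin, destination)` tuples) and the function names are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def expansion_factors(home_area_by_user, population_by_area):
    """Population of each census area divided by the number of users living there."""
    users_per_area = Counter(home_area_by_user.values())
    return {
        area: population_by_area[area] / n_users
        for area, n_users in users_per_area.items()
        if area in population_by_area
    }


def expanded_od(trips, home_area_by_user, factors, vehicle_rate_by_area=None):
    """Weight each observed trip by the expansion factor of the user's home area.

    If `vehicle_rate_by_area` is given, trips are additionally weighted by the
    vehicle usage rate of the home area to approximate a vehicle OD matrix.
    """
    od = defaultdict(float)
    for user, origin, destination in trips:
        area = home_area_by_user[user]
        weight = factors[area]
        if vehicle_rate_by_area is not None:
            weight *= vehicle_rate_by_area[area]
        od[(origin, destination)] += weight
    return dict(od)
```

The sketch makes the limitation noted above visible: every user of a census area receives the same weight and the same vehicle usage rate, regardless of individual characteristics.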
Alternatively, Zhang, Qin, Dong and Ran (2010) build an expansion factor based on a probabilistic
approach. They first looked at the problem of how to avoid multiplying trips when
more than one mobile phone probe is in a vehicle. They calculate a conversion factor from
mobile phone probe flows into equivalent vehicle flows, using the following assumptions:
1) Mobile phones in close proximity (i.e. the same car) generate signal transition events at
exactly the same time. 2) There is a very small probability that parallel travelling cars
cross at least two LA boundaries at the same two timestamps. 3) Within the saturation headway
(typically 2 seconds) there is only one vehicle crossing an LA boundary in each lane. Secondly,
they computed a conditional probability for mobile phone ownership including mobile phone
market penetration data, market shares for a given mobile phone carrier, and age and income
distributions from census data. Finally, for the purpose of projecting to the population vehicle
OD matrix, they calculated the scaling factor through a Horvitz-Thompson estimator, in which
the conversion factor from mobile phone to vehicle flows was included, as well as the posterior
probability of mobile phone ownership.
In contrast with the home-based approach, the model takes into account the conversion from
individual mobile phone flows to vehicle flows (i.e. shared rides) not as an a posteriori step, but
as an integral part of the model. In addition, characterising mobile phone ownership in the population
by age and income, instead of by zone, reduces socio-economic biases in the
projection process.
Activity inference by time-frequency rules One of the most direct ways to infer contextual
information, such as a location’s function or trip purpose, is by time-frequency rules. Several
authors (Toole et al., 2015; Alexander et al., 2015; Jiang et al., 2015) have improved the general
idea shown in Wang et al. (2012) and Iqbal et al. (2014) in using both visit frequencies and
temporal data to identify work, home, and other locations. A user’s home location is defined as the
stay point most frequently observed between 8 pm and 7 am during weekdays and weekends,
whereas the work location is defined as the stay point, other than home, that the user visits the most
between 7 am and 8 pm on weekdays. Since some individuals do not work, the work location is
left blank if the candidate location is not visited more than once per week or if the location is less
than 500 m away from the home location (to avoid work identification through signal noise
rather than a distinct location). Another variation (Alexander et al., 2015) is that the work location is
identified as the stay to which the user travels the maximum total distance from the home location,
to identify evening and night shift jobs. All remaining stay points that are neither home nor work are
designated as other.
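The time-frequency rules above translate almost directly into code. The following is a minimal sketch under stated assumptions: stays are given as `(location_id, datetime)` pairs, coordinates are projected to metres, visits to the work candidate are counted as distinct weekdays, and the function name `label_home_work` is hypothetical. The distance-based variant of Alexander et al. (2015) is not included.

```python
import math
from collections import Counter
from datetime import datetime


def label_home_work(stays, coords, n_weeks, min_work_dist_m=500.0):
    """Return (home, work) location ids; work is None if no candidate qualifies.

    stays:  list of (location_id, datetime), one entry per observed stay.
    coords: dict location_id -> (x_m, y_m) in a projected coordinate system.
    """
    # Home: most frequent location between 8 pm and 7 am, on any day.
    night = Counter(loc for loc, ts in stays if ts.hour >= 20 or ts.hour < 7)
    home = night.most_common(1)[0][0] if night else None

    # Work candidate: most visited non-home location, 7 am to 8 pm on weekdays.
    days_seen = {}
    for loc, ts in stays:
        if loc != home and ts.weekday() < 5 and 7 <= ts.hour < 20:
            days_seen.setdefault(loc, set()).add(ts.date())
    if not days_seen:
        return home, None

    candidate = max(days_seen, key=lambda loc: len(days_seen[loc]))
    visits_per_week = len(days_seen[candidate]) / n_weeks
    if home is None:
        far_enough = True
    else:
        (x0, y0), (x1, y1) = coords[home], coords[candidate]
        far_enough = math.hypot(x1 - x0, y1 - y0) >= min_work_dist_m
    work = candidate if visits_per_week > 1 and far_enough else None
    return home, work
```

All stay points that are returned neither as home nor as work would then be labelled as other.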
Activity inference by probabilistic modelling Time-frequency rules to infer activity context
are a straightforward method, but may not be useful for certain groups in the population. More
robust inference methods can be applied using probabilistic models, in which other data sources
can be naturally included (semantic-enriched geographic data, POIs, etc.). Thus, we can reference
the spatial information and characteristics of destinations to build probabilistic models to infer
activity types at different destinations in space and time.
Jiang et al. (2013) formulate the target probability as the probability that an individual performs
a certain activity at a certain time, conditional on her/his destination information
and her/his extracted unlabelled activity chain (i.e. motif; Schneider et al., 2013). Once the
problem has been formulated as a target probability, the next step is to choose a probabilistic
inference model suitable to handle the dependencies between the random variables and to calibrate the
parameters of the model. For the latter problem, two approaches exist based on the availability
of training data: supervised learning and unsupervised learning.
Activity Clustering When there is no available labelled training data, we can still reveal the
spatial-temporal structure of activities by performing unsupervised clustering. Widhalm et al.
(2015) and Yang et al. (2015) clustered the stays into meaningful categories based on stay start
time, duration, sequence of stay locations, and a vector of land use shares. To model the
dependencies between the explanatory variables and construct the joint distribution they used a
Relational Markov Network (based on the work of Liao et al., 2005), which is an extension of
undirected graphical models known as Markov Random Fields. To compute the joint posterior
distribution they used Rejection Sampling, which is a technique for drawing samples from the
posterior distribution. Finally, to train the network in an unsupervised way, they used the Expectation
Maximisation (EM) algorithm and applied this methodology to case studies in Boston and
Vienna.
The resulting clusters reflect trip chains and activity scheduling patterns that agree well with
data obtained from traditional surveys. Moreover, the comparison between both cities showed
similarity in their clusters. Still, several future improvements are suggested. Firstly, to study the
relationship between the automatically discovered activity clusters and conventional activity
types used in traditional surveys more closely. Secondly, to examine and include the interaction
between land use and travel behaviour (e.g. density, regional accessibility, roadway connectivity,
etc.) in the modelling, for instance how the inclusion of point-of-interest (POI) databases can
further improve the accuracy of the results. And thirdly, the inclusion of the results in a
simulation model (i.e. agent-based model) to compare the resulting traffic flows to actual traffic
measurements as a validation step.
The work of Zilske and Nagel (2014) represents a pilot study of such an approach. They seek to
replace travel diaries with sets of CDRs as input data for agent-oriented traffic simulation. For
this purpose, they generated synthetic CDR data from a MATSim simulation using a plug-in that
introduces cell coverage and a mobile phone usage model for the agents. CDR information is
generated on arrival at or departure from activity locations. After CDRs are generated, they
identified every observed person with a MATSim agent and converted every call information
into an activity. The main limitation of the study is the simplification of the mobile phone data
generation process. The full spectrum of mobile phone data is not represented (e.g. cell handovers,
automatic location updates, mobile internet usage), and CDR information was not generated for
instances outside the activity locations (no mobile phone activity on-the-go). This resulted in an
underrepresentation of traffic in the MATSim simulation.
However, in their next paper on this topic, Zilske and Nagel (2015) propose to mitigate the
underrepresentation and reduce the spatio-temporal uncertainty by fusing the CDR dataset with
traffic counts. They computed an expansion factor of the population to compensate for the
underestimated demand and matched it to traffic counts. From each trace they created several
agents and used the expanded population as a buffer (with the introduction of a stay-at-home plan
probability) to steer the demand towards matching the known link volume counts. Additionally,
so that the modelled travel demand matches the traffic counts, they introduced a
parameter to alter the probability by which a particular activity plan is chosen. Intuitively, the offset
was a calculation based on how much a specific choice of a plan contributes to the whole traffic
system fitting to the traffic counts.
Although public transport smart card and mobile phone data have wide coverage across a city’s
population, the purpose of the identified trips is not recorded and needs to be inferred for the data to be useful
for activity-based modelling approaches. Temporal features such as activity start time and
duration may be enough to determine primary locations (i.e. work, home) for the majority of the
population. However, for secondary locations in which individuals perform a range of activities,
from dining and shopping to different types of leisure activities, the spatial feature
space needs to be enriched with information on the establishments surrounding a stop area detected
from the mobility traces. Such information can be derived from Points of Interest datasets.
In terms of supplementing mobile phone data with POI information, Noulas and Mascolo (2013)
compute the most popular activity within an area delimited by cellular antennas, using CDRs and
Foursquare check-ins and places data. They built a set of features to exploit semantic annotations
of Foursquare data and tested different supervised learning classifiers for the following classes: Arts
and Entertainment, College and Education, Food, Work, Nightlife, Parks and Outdoors, Shops,
Travel spots. Results show better accuracy for Nightlife and for Arts and Entertainment, followed by
Shops and Parks and Outdoors, whilst the classifiers did not show good results for College and
Education. These results encourage the use of POIs as a complement to help infer secondary
activities, for which temporal features are not enough.
Similar work was done by Phithakkitnukoon, Horanont, Di Lorenzo, Shibasaki and Ratti
(2010), who used POIs extracted from the Yahoo Maps API. They cluster four different
types of activities (eating, shopping, entertainment, and recreational) using the k-means algorithm,
and estimate the most probable activity in each cluster using Bayes’ theorem. Using the most
probable activity in a region and work place identification through temporal features (Calabrese
et al., 2011), they infer a daily activity pattern for each of the users. As a result, they found a
strong correlation in daily activity patterns within groups of people who share a common work
area’s profile.
It is important to note that both studies focus on identifying the most probable/popular activity in
a delimited area, and not the most probable activity given a daily individual tour behaviour. For
the latter, it is important to include individual and temporal variables in the activity recognition
process such as the previously visited places and the starting time and duration of the activities,
so that the region-based approach can be changed to an individual-based activity inference
process. While approaches to do so with mobile phones and fare collection smart cards are
limited and have been presented in the previous chapter, in the next section we will present the
extensive efforts made to annotate individual trip purposes using GPS traces.
5.2 GPS
GPS data consist of a spatio-temporal high-resolution track (i.e. usually one reading per second,
depending on configuration and application purpose). The importance of GPS technology for
mobility applications is growing steadily: it is widely used for location-based smartphone
applications, to monitor public transport operations (GPS-enabled buses and taxis), and in the
automobile industry as a core technology for various applications such as route guidance; it is also
applied for toll collection and travel diary surveys.
However, in this chapter, we focus on applications that use GPS data to derive information with
a semantic meaning (i.e. trip purpose). We group the approaches by method, following the taxonomy
of Huang, Li and Yue (2010).
5.2.1 Distance-based
The basic idea is to assign the closest POIs to the raw trajectory’s clusters. Commonly, a
minimisation of the Euclidean distance is used between the location of nearby POIs and the
identified GPS stay point (Bohte and Maat, 2009). Xie, Deng and Zhou (2009) present a more
elaborate procedure in which they construct a Voronoi diagram using POIs as Voronoi sites
and then select the POIs closest to the polyline geometry of the trajectory.
Although distance-based methods are easy to implement, they are only suitable when a
highly accurate GPS trace exists and are therefore not ideally suited for application to mobile
phone and smart card data.
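The basic distance-based assignment can be sketched as a nearest-neighbour search. The POI layout `(name, category, x_m, y_m)`, the projected coordinates and the optional cut-off distance are assumptions for illustration; the Voronoi-based procedure of Xie et al. (2009) is not reproduced here.

```python
import math


def nearest_poi(stay_xy, pois, max_dist_m=None):
    """Return the POI closest to a stay point, or None if none is close enough.

    stay_xy: (x_m, y_m) of the stay point in a projected coordinate system.
    pois:    list of (name, category, x_m, y_m) tuples (assumed layout).
    max_dist_m: optional cut-off beyond which no POI is assigned.
    """
    if not pois:
        return None

    def distance(poi):
        return math.hypot(poi[2] - stay_xy[0], poi[3] - stay_xy[1])

    best = min(pois, key=distance)
    if max_dist_m is not None and distance(best) > max_dist_m:
        return None
    return best
```

With a positioning error of several hundred metres, as is typical for cell-tower data, many POIs fall within the error radius and the nearest one is essentially arbitrary, which illustrates why this family of methods depends on accurate traces.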
5.2.2 Attractiveness-based
In contrast to the distance-based methods, attractiveness-based approaches are designed to also
include POI-related information to assign a particular POI given a GPS-point cluster. Huang et al.
(2010) measured the spatio-temporal POI attractiveness based on static factors (e.g. size of
POI, popularity, and category) and a dynamic function for the variation of attractiveness over the
hours of the day. From this information they constructed an attractiveness prism, and selected
the POI for which the mobility trajectory intersects the prism. Furletti, Cintia, Renso and
Spinsanti (2013) present a similar approach, but apply a gravitational model to identify the most
probable activity from a list of ranked POIs.
One of the advantages of the attractiveness-based approach is that it is also suitable for other
types of mobility traces to a certain extent. The CDR + POI examples covered in the previous
section (Noulas and Mascolo, 2013; Phithakkitnukoon et al., 2010) can be fitted in this category,
for the reason that they seek the most attractive/popular activity in an area.
Learning The structure of a graphical model can be learned from data automatically or pre-
defined by human knowledge. Graphical models usually contain hidden variables to be inferred
(i.e. activity / trip purpose). The learning process of graphical models estimates the
probabilistic dependencies between different variables given the observed data. Expectation
Maximisation (EM) algorithms are commonly used for this purpose.
Inference The inference process predicts the state of the hidden variables, given the values of
observed variables and learned parameters. The inference algorithms can include deterministic
approaches, such as variational methods, and stochastic algorithms like Gibbs Sampling.
In the following, we present representative examples for both generative and discriminative
classification graphical models.
Different extensions of HMMs have been developed to compute activity likelihoods and
probabilistic estimates of the purpose behind the stop. For instance, Liao et al. (2007b) proposed
a hierarchical HMM trained in an unsupervised manner using Expectation Maximisation (EM)
to learn the parameters of the model, and a Rao-Blackwellised particle filter for the inference task.
However, the model still needed to be expanded in order to include information about time of day
and day of the week. Such a modification was included in Duong et al. (2005). They introduce
a Switching Hidden Semi-Markov Model (S-HSMM) to exploit both the inherent hierarchical
organization of the activities and their typical durations.
More recently, Yan, Chakraborty, Parent, Spaccapietra and Aberer (2011) developed a framework
that enriches trajectories with any kind of semantic data provided by POI datasets based on a
Hidden Markov Model. To define the initial probabilities they used the percentage of POI
samples belonging to each category. Then they use information on transitions between regions
(annotated with land use information) to construct the state transition. Finally, to infer the hidden
states they maximised the likelihood of the HMM. One of their main contributions is the effective
use of land use information together with POIs to infer the activity purpose. Additionally, Baratchi,
Meratnia, Havinga, Skidmore and Toxopeus (2014) propose an extension of the hierarchical
HSMM that captures spatio-temporal associations in the locational history in both stay-points
and trips connecting the stops.
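To make the HMM-based annotation more concrete, the following sketch decodes the most likely sequence of hidden activity categories from a sequence of observed region types using the Viterbi algorithm. Viterbi decoding is used here as a standard way of finding the most likely hidden state sequence; it is an assumption that this matches the inference step of Yan et al. (2011), and all state names and probabilities in the usage note are invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence of a discrete HMM (log-space Viterbi).

    start_p[s]: initial probability of state s (e.g. share of POIs per category).
    trans_p[r][s]: probability of moving from state r to state s.
    emit_p[s][o]: probability of observing o while in state s.
    All probabilities used must be strictly positive, since logs are taken.
    """
    scores = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    backpointers = []
    for obs in observations[1:]:
        current, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = prev
        scores.append(current)
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]
```

As a usage example, with hidden states "work" and "shop" and observed region types "office" and "retail", a trace observed in two office regions followed by a retail region would typically be decoded as two work stops followed by a shopping stop, provided the transition probabilities do not strongly penalise the change of activity.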
Conditional Random Fields (CRFs) are an example of discriminative models suitable for
classification tasks with complex and overlapping attributes or observations. Liao et al. (2007a)
show a holistic approach using hierarchical CRFs to extract places and activities from GPS
traces. The main objective of their work is to segment a user’s day into everyday activities such
as working, visiting, or travel and to recognise and label significant places such as workplace,
friend’s house, or bus stop. To determine activities, the model relies on temporal features, such
as duration or time of day, and geographic information such as locations of restaurants, stores,
and bus stops. They use maximum pseudo-likelihood estimation to learn the parameters of the
model, and belief propagation for the inference task.
Relational Markov Networks (RMN) are extensions of CRFs that provide a relational language
for describing clique structures and enforcing parameter sharing at the template level. Liao et
al. (2005) trained an RMN for labelling the following activities: at home, at work, shopping,
dining out, visiting, and other. They incorporate global features (e.g. number of home
locations), temporal information (e.g. duration, time of day), and spatial information (POIs) in
the clique templates. For the inference and learning task they develop a technique based on the
Markov-Chain Monte-Carlo algorithm in a supervised manner. They showed that it is possible
to learn the parameters of a complex model using less data by using priors extracted from other
people’s data.
6 Conclusion
6.1 Data driven Agent Based Modelling for Transport Planning
Traditional data sources for transport forecasting, i.e. household travel surveys, are of undeniable
value. They not only cover detailed data on individual and household mobility patterns but also
include relevant information on travel modes and purposes. Yet, on their own they are not able to
exploit the full benefits of the agent-based transport modelling paradigm. Two main limitations
can be identified. Firstly, they represent only a small sample of the population (normally around
1%). Secondly, they are usually only updated every five to ten years.4
Opportunistic human mobility sensors tackle these drawbacks and offer a promising path to
continue developing agent-based models for transport planning. The tradeoff of using such
opportunistic, widely collected information is its raw nature. An additional analytic effort has to
be made to identify trips and trip purposes so that they can be integrated in the agent-based
simulations. Thus, the key challenge is the development of robust algorithms that can extract
daily individual schedules from sparse mobility traces. Specifically, as mentioned in Jiang et al.
(2013), developing effective techniques to link the association rules of semantic land use and
POI information of the diverse areas that individuals visit is an open challenge for estimating the
activity types that individuals engage in.
4
However, a few authorities have started with continuous surveys also using smart phones to lower the response
burden and increase data quality, especially with regards to capturing activities that only last over a short
duration.
CDRs it is likely that such a training sample would not be easy to obtain. One of the possible
options is to use information from travel surveys and design a feature space in which CDRs
and travel survey information can work together in a transfer-learning paradigm. Finally, another
important issue to note, specifically about inference models designed from GPS traces, is that
they are usually trained and validated on small samples of the population (e.g. 4 persons in Liao
et al. (2007a)). This certainly raises questions about their performance when scaled to city
size, where a wider set of behavioural and mobility patterns from individuals might invalidate
the results achieved by the models.
As mentioned in Calabrese et al. (2015), there are some challenges in comparing different
datasets, even if they are related, the main ones being the different collection periods and
different spatial units. For instance, census data is usually available at the block level, while
mobile phone data relates to individual cell towers. However, we see the main advantage of
using different human mobility sensors and supplementary datasets such as travel diary data in
the way they complement one another, which underlines the importance of data fusion approaches.
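The spatial-unit mismatch noted above can be illustrated with a minimal sketch. It assumes that the share of each census block falling inside each cell tower's coverage area has already been computed, for instance from a GIS overlay. The function name, zone identifiers and numbers are hypothetical, and simple area weighting is only one of several possible reallocation schemes; it is not taken from the reviewed studies.

```python
def blocks_to_towers(block_population, overlap_share):
    """Reallocate census block populations to cell tower areas.

    block_population: {block_id: population} (hypothetical census counts)
    overlap_share: {block_id: {tower_id: share}} where the shares of one
        block sum to 1 (assumed to be precomputed from a spatial overlay)
    """
    tower_population = {}
    for block, shares in overlap_share.items():
        for tower, share in shares.items():
            # Each tower receives the block population in proportion
            # to the overlapping area share.
            allocated = block_population[block] * share
            tower_population[tower] = tower_population.get(tower, 0.0) + allocated
    return tower_population


# Illustrative numbers only.
example_blocks = {"b1": 1000, "b2": 500}
example_overlap = {"b1": {"t1": 0.6, "t2": 0.4}, "b2": {"t2": 1.0}}
example_towers = blocks_to_towers(example_blocks, example_overlap)
print(example_towers)
```

Once both datasets refer to the same spatial units, tower-level phone counts and census-based populations can be compared directly, although the uniform-density assumption behind area weighting should be checked for each study area.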
From the literature review on the use of smart card and mobile phone data in transport modelling,
we have mainly seen census or household travel surveys used as a means to validate the proposed
algorithms or to build expansion factors for them (Alexander et al., 2015), or approaches
that simply concatenate the features in a classification algorithm (Noulas and Mascolo, 2013;
Chakirov and Erath, 2012). In the Big Data era, however, the aspiration will be to unlock the
power of knowledge from multiple disparate, but potentially connected datasets (Zheng, 2015).
Therefore, we expect that the most promising approaches will stem from data
science domains such as machine learning and data mining.
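As a rough illustration of how expansion factors can be built from census data, the sketch below weights the trips of each detected phone user by the ratio of census population to detected users in the user's inferred home zone. This is a simplified sketch under stated assumptions, not the procedure of any of the cited studies; all function names, zone labels and numbers are hypothetical.

```python
def expansion_factors(census_population, detected_residents):
    """Ratio of census population to detected phone users per home zone."""
    return {
        zone: census_population[zone] / count
        for zone, count in detected_residents.items()
        if count > 0  # zones without detected users cannot be expanded
    }


def expand_trips(user_trips, home_zone, factors):
    """Aggregate individual trips into a weighted origin-destination table."""
    od = {}
    for user, trips in user_trips.items():
        weight = factors[home_zone[user]]
        for origin, destination in trips:
            key = (origin, destination)
            od[key] = od.get(key, 0.0) + weight
    return od


# Illustrative numbers only.
factors = expansion_factors({"A": 1000, "B": 600}, {"A": 100, "B": 200})
od_table = expand_trips(
    {"u1": [("A", "B")], "u2": [("B", "A"), ("A", "B")]},
    {"u1": "A", "u2": "B"},
    factors,
)
print(factors, od_table)
```

The sketch treats every detected user in a zone as equally representative, which is a strong assumption; in practice the factors may also need to account for market share and phone usage patterns.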
Cross-Domain Data Fusion Zheng (2015) presents a good survey that analyses, classifies and
exemplifies methodologies for Cross-Domain Data Fusion. From a transport planning
perspective, the promising methods for activity and transport inference are the probabilistic
semantic-meaning data fusion methods, namely, the Probabilistic Graphical Models reviewed in
the previous chapter. As mentioned before, these models are able to capture the dependencies
and correlations between the features in order to produce better estimates. They also constitute
an essential tool to reason coherently from limited and noisy observations (Koller and Friedman,
2009).
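To illustrate how such a model can reason from limited and noisy observations, the following minimal sketch decodes the most likely activity sequence with a hidden Markov model, one of the simplest Probabilistic Graphical Models, using the Viterbi algorithm. The two activities, the land-use observation categories and all probabilities are illustrative assumptions, not values estimated in any of the reviewed studies.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence (log-space Viterbi)."""
    # Log-probability of the best path ending in each state at the first step.
    best = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    back = []  # back-pointers, one dict per later time step
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda p: best[p] + math.log(trans_p[p][s]))
            new_best[s] = (
                best[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = prev
        back.append(pointers)
        best = new_best
    # Trace the best path backwards from the most likely final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical model: activities are hidden, observed land use is noisy.
STATES = ["home", "work"]
START = {"home": 0.6, "work": 0.4}
TRANS = {"home": {"home": 0.8, "work": 0.2}, "work": {"home": 0.2, "work": 0.8}}
EMIT = {
    "home": {"residential": 0.9, "commercial": 0.1},
    "work": {"residential": 0.2, "commercial": 0.8},
}

noisy = ["residential", "residential", "commercial", "residential", "residential"]
print(viterbi(noisy, STATES, START, TRANS, EMIT))
```

With these assumed parameters, the single isolated "commercial" observation is treated as noise and the whole sequence is decoded as "home", because the transition probabilities make a brief switch of activity less likely than a noisy observation.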
New Models of Data Ownership Location obfuscation might not be a viable mechanism for
sharing individual mobility traces. Other suggestions, aside from the algorithmic perspective,
relate to changing the current data ownership paradigm. A popular idea is giving people
ownership of their data (Pentland, 2009). With the creation of data vaults (Mun et al., 2010)
or by means of a data trust (Lawrence, 2016), each individual would have the right to dispose
of or distribute their personal information. However, a short-term implementation does not seem
plausible since there are conflicts of interest with big data-driven companies. Still, a transparent
use of personal information is certainly a relevant way to facilitate access for researchers to
both develop better anonymisation methodologies and showcase the societal benefits of using
anonymised mobility data.
In the second part of the paper, we present a literature review on the methodologies needed to
extract mobility behaviour from such Big Data sources. From trip identification to activity
inference, and their application to transport demand models, we review the efforts in a
step-by-step manner for both public transport smart card and mobile phone data. We also cover the
relevance of other datasets such as POIs to infer trip purposes and document how GPS-based
data collection and Probabilistic Graphical Models can be included for model refinement.
Finally, we discuss the findings of the literature review and identify a set of future challenges,
in particular with regard to data privacy implications.
1. Further exploration of probabilistic approaches that can handle the uncertainty of Big Data
mobility traces in the modelling process, and showcasing the relevance of such approaches, for
instance through validation with independent datasets such as loop-detector data.
2. Integration of different available datasets in a data-fusion scheme. For instance, smart card
data and CDRs for mode inference, or POI datasets and semantic information in social networks
to inform the activity inference process of CDR traces.
3. Exploration of transfer learning approaches to compensate for the lack of training samples when
using opportunistically collected data sources.
4. Search for new ways to validate the results of every step of the data mining pipeline.
5. Further exploration of preprocessing techniques for sparse and noisy mobility traces, for
instance demonstrating the effectiveness of non-linear Kalman filters or more complex probabilistic
filters (e.g. Gaussian processes).
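To make the filtering idea in point 5 concrete, the following minimal sketch applies a linear, one-dimensional Kalman filter with a random-walk motion model to a noisy coordinate series. It is deliberately simpler than the non-linear filters named above, and the function name, variances and measurements are illustrative assumptions rather than values from any reviewed study.

```python
def kalman_1d(measurements, process_var, meas_var, init_mean, init_var):
    """Filter a noisy scalar series with a random-walk Kalman filter.

    process_var: assumed variance of the true position change per step
    meas_var: assumed variance of the measurement noise
    """
    mean, var = init_mean, init_var
    estimates = []
    for z in measurements:
        # Predict: under a random walk the mean stays, uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = var / (var + meas_var)
        mean += gain * (z - mean)
        var *= 1.0 - gain
        estimates.append(mean)
    return estimates


# Illustrative noisy positions (e.g. one coordinate in metres).
noisy_positions = [0.0, 2.0, 0.0, 2.0, 1.0]
print(kalman_1d(noisy_positions, process_var=0.1, meas_var=1.0,
                init_mean=1.0, init_var=1.0))
```

A larger `meas_var` relative to `process_var` makes the filter trust the measurements less and smooth more strongly; a real application to mobility traces would need a two-dimensional state and noise parameters calibrated to the positioning technology.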
In any case, we also acknowledge that sophisticated modelling knowledge has been developed in the
domain of transport planning, and we therefore strongly advise that domain expert knowledge
should still form the foundation when applying data-driven approaches in transport planning. These
new challenges call for a multidisciplinary collaboration between transport modellers and data
scientists.
8 Acknowledgements
We would like to thank Seungjae Lee for organising the Seoul Metropolitan Forum 2016 at the
University of Seoul where an earlier version of this paper was presented. This research has been
conducted at the Singapore-ETH Centre for Global Environmental Sustainability (SEC),
co-funded by the Singapore National Research Foundation (NRF) and ETH Zurich.
9 References
Adnan, M., F. C. Pereira, C. M. Lima Azevedo, K. Basak, M. Lovric, S. Raveau, Y. Zhu,
J. Ferreira, C. Zegras and M. E. Ben-Akiva (2016) SimMobility: A Multi-scale Integrated
Agent-Based Simulation Platform, paper presented at the Transportation Research Board 95th
Annual Meeting.
Balmer, M., K. Axhausen and K. Nagel (2006) Agent-based demand-modeling framework for
large-scale microsimulations, Transportation Research Record: Journal of the Transportation
Research Board, (1985) 125–134.
Barry, J., R. Newhouser, A. Rahbee and S. Sayeda (2002) Origin and Destination Estimation in
New York City with Automated Fare System Data, Transportation Research Record: Journal
of the Transportation Research Board, 1817, 183–187, January 2002, ISSN 0361-1981.
Bohte, W. and K. Maat (2009) Deriving and validating trip purposes and travel modes for multi-
day GPS-based travel surveys: a large-scale application in the Netherlands, Transportation
Research Part C: Emerging Technologies, 17 (3) 285–297.
Bouman, P. (2012) Recognizing demand patterns from smart card data for agent-based micro-
simulation of public transport, Ph.D. Thesis, Department of Decision and Information Sciences,
Erasmus University Rotterdam, The Netherlands.
Bouman, P., E. Van der Hurk, L. Kroon, T. Li and P. Vervest (2013) Detecting activity patterns
from smart card Data, paper presented at the BNAIC 2013: Proceedings of the 25th Benelux
Conference on Artificial Intelligence, Delft, The Netherlands, November 7-8, 2013.
Caceres, N., J. P. Wideberg and F. G. Benitez (2007) Deriving origin destination data from a
mobile phone network, IET Intelligent Transport Systems, 1 (1) 15–26, March 2007, ISSN
1751-956X.
Calabrese, F., L. Ferrari and V. D. Blondel (2015) Urban sensing using mobile phone network
data: a survey of research, ACM Computing Surveys (CSUR), 47 (2) 25.
Calabrese, F., G. D. Lorenzo, L. Liu and C. Ratti (2011) Estimating Origin-Destination Flows
Using Mobile Phone Location Data, IEEE Pervasive Computing, 10 (4) 36–44, April 2011,
ISSN 1536-1268.
Castiglione, J., M. Bradley and J. Gliebe (2015) Activity-based travel demand models: a primer,
Transportation Research Board, Washington, DC, ISBN 978-0-309-27399-2.
Chakirov, A. and A. Erath (2012) Activity Identification and Primary Location Modelling based
on Smart Card Payment Data for Public Transport, Toronto, June 2012.
de Dios Ortuzar, J. and L. G. Willumsen (2011) Modelling transport, John Wiley & Sons.
de Montjoye, Y.-A., C. A. Hidalgo, M. Verleysen and V. D. Blondel (2013) Unique in the crowd:
The privacy bounds of human mobility, Scientific reports, 3.
Devillaine, F., M. Munizaga and M. Trépanier (2012) Detection of Activities of Public Transport
Users by Analyzing Smart Card Data, Transportation Research Record: Journal of the
Transportation Research Board, 2276, 48–55, December 2012, ISSN 0361-1981.
Duong, T. V., H. H. Bui, D. Q. Phung and S. Venkatesh (2005) Activity recognition and
abnormality detection with the switching hidden semi-Markov model, paper presented at the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
CVPR 2005, vol. 1, 838–845 vol. 1, June 2005.
Eagle, N. and A. (Sandy) Pentland (2006) Reality Mining: Sensing Complex Social Systems,
Personal Ubiquitous Comput., 10 (4) 255–268, March 2006, ISSN 1617-4909.
Ficek, M. and L. Kencl (2012) Inter-call mobility model: A spatio-temporal refinement of call
data records using a Gaussian mixture model, paper presented at the INFOCOM, 2012
Proceedings IEEE, 469–477.
Fourie, P. J. (2014) Reconstructing bus vehicle trajectories from transit smart-card data, Working
paper, 986.
Fourie, P. J., A. Erath, S. A. Ordóñez Medina, A. Chakirov and K. W. Axhausen (2016) Using
smartcard data for agent-based transport simulation: the case of Singapore, in J.-D.
Schmoecker and F. Kurauchi (eds.) Public Transport Planning with Smart Card Data, Taylor
& Francis.
Furletti, B., P. Cintia, C. Renso and L. Spinsanti (2013) Inferring human activities from gps
tracks, paper presented at the Proceedings of the 2nd ACM SIGKDD International Workshop
on Urban Computing, 5.
Horn, C., S. Klampfl, M. Cik and T. Reiter (2014) Detecting Outliers in Cell Phone Data,
Transportation Research Record: Journal of the Transportation Research Board, 2405, 49–56,
July 2014, ISSN 0361-1981.
Horni, A., K. Nagel and K. Axhausen (2016) The Multi-Agent Transport Simulation MATSim,
Ubiquity Press, June 2016, ISBN 978-1-909188-75-4.
Hu, X., M. Yuan, J. Yao, Y. Deng, L. Chen, Q. Yang, H. Guan and J. Zeng (2015) Differential
privacy in telco big data platform, Proceedings of the VLDB Endowment, 8 (12) 1692–1703.
Huang, L., Q. Li and Y. Yue (2010) Activity identification from GPS trajectories using spatial
temporal POIs’ attractiveness, paper presented at the Proceedings of the 2nd ACM SIGSPATIAL
International Workshop on Location Based Social Networks, 27–30.
International Transport Forum (2015) Big Data and Transport - Understanding and assessing
options, Technical Report, OECD, Paris.
Iqbal, M. S., C. F. Choudhury, P. Wang and M. C. González (2014) Development of
origin–destination matrices using mobile phone call data, Transportation Research Part C:
Emerging Technologies, 40, 63–74, March 2014, ISSN 0968-090X.
Jiang, S., J. Ferreira Jr and M. C. González (2015) Activity-Based Human Mobility Patterns
Inferred from Mobile Phone Data: A Case Study of Singapore, paper presented at the Int.
Workshop on Urban Computing.
Jiang, S., G. A. Fiore, Y. Yang, J. Ferreira Jr, E. Frazzoli and M. C. González (2013) A review
of urban computing for mobile phone traces: current methods, challenges and opportunities,
paper presented at the Proceedings of the 2nd ACM SIGKDD international workshop on
Urban Computing, 2.
Koller, D. and N. Friedman (2009) Probabilistic graphical models: principles and techniques,
MIT press.
Lawrence, N. (2016) Data trusts could allay our privacy fears, June 2016,
http://www.theguardian.com/media-network/2016/jun/03/data-
trusts-privacy-fears-feudalism-democracy.
Liao, L., D. Fox and H. Kautz (2005) Location-based Activity Recognition Using Relational
Markov Networks, paper presented at the Proceedings of the 19th International Joint
Conference on Artificial Intelligence, IJCAI’05, 773–778, San Francisco, CA, USA.
Liao, L., D. Fox and H. Kautz (2007a) Extracting Places and Activities from GPS Traces Using
Hierarchical Conditional Random Fields, The International Journal of Robotics Research,
26 (1) 119–134, January 2007, ISSN 0278-3649, 1741-3176.
Liao, L., D. J. Patterson, D. Fox and H. Kautz (2007b) Learning and inferring transportation
routines, Artificial Intelligence, 171 (5–6) 311–331, April 2007, ISSN 0004-3702.
Mun, M., S. Hao, N. Mishra, K. Shilton, J. Burke, D. Estrin, M. Hansen and R. Govindan (2010)
Personal data vaults: a locus of control for personal data streams, paper presented at the
Proceedings of the 6th International Conference, 17.
Noulas, A. and C. Mascolo (2013) Exploiting foursquare and cellular data to infer user activity
in urban environments, paper presented at the Mobile Data Management (MDM), 2013 IEEE
14th International Conference on, vol. 1, 167–176.
Pentland, A. (2009) Reality Mining of Mobile Communications: Toward A New Deal On Data,
in Social Computing and Behavioral Modeling, 1–1, Springer US, ISBN 978-1-4419-0055-5
978-1-4419-0056-2. DOI: 10.1007/978-1-4419-0056-2_1.
Pozdnoukhov, A. (2015) Activity Based Travel Demand Modelling with Cellular Data, October
2015, http://www.ucconnect.berkeley.edu/workforce-development/
ucconnect-transportation-planning-workshops/role-big-data-
transportation.
Primault, V., S. B. Mokhtar, C. Lauradoux and L. Brunie (2014) Differentially private location
privacy in practice, arXiv preprint arXiv:1410.7744.
Sarlas, G. and K. W. Axhausen (2015) Localized Speed Prediction with the use of Spatial
Simultaneous Autoregressive Models, paper presented at the Transportation Research Board
94th Annual Meeting.
Schlaich, J., T. Otterstätter and M. Friedrich (2010) Generating trajectories from mobile phone
data, paper presented at the Proceedings of the 89th annual meeting compendium of papers,
transportation research board of the national academies.
Seaborn, C., J. Attanucci and N. Wilson (2009) Analyzing Multimodal Public Transport Journeys
in London with Smart Card Fare Payment Data, Transportation Research Record: Journal of
the Transportation Research Board, 2121, 55–62, December 2009, ISSN 0361-1981.
Sun, L., D.-H. Lee, A. Erath and X. Huang (2012) Using Smart Card Data to Extract Passenger’s
Spatio-temporal Density and Train’s Trajectory of MRT System, paper presented at the
Proceedings of the ACM SIGKDD International Workshop on Urban Computing, UrbComp
’12, 142–148, New York, NY, USA, ISBN 978-1-4503-1542-5.
Sun, L., A. Tirachini, K. W. Axhausen, A. Erath and D.-H. Lee (2013) Models of Bus
Boarding/Alighting Dynamics and Dwell Time Variability.
Sutton, C. and A. McCallum (2006) An introduction to conditional random fields for relational
learning, Introduction to statistical relational learning, 93–128.
Toole, J. L., S. Colak, B. Sturt, L. P. Alexander, A. Evsukoff and M. C. González (2015) The path
most traveled: Travel demand estimation using big data resources, Transportation Research
Part C: Emerging Technologies, 58, 162–177.
Trépanier, M., N. Tranchant and R. Chapleau (2007) Individual Trip Destination Estimation in a
Transit Smart Card Automated Fare Collection System, Journal of Intelligent Transportation
Systems, 11 (1) 1–14, April 2007, ISSN 1547-2450.
Wang, H., F. Calabrese, G. D. Lorenzo and C. Ratti (2010) Transportation mode inference from
anonymized and aggregated mobile phone call detail records, paper presented at the 2010
13th International IEEE Conference on Intelligent Transportation Systems (ITSC), 318–323,
September 2010.
Widhalm, P., Y. Yang, M. Ulm, S. Athavale and M. C. González (2015) Discovering urban
activity patterns in cell phone data, Transportation, 42 (4) 597–623.
Xie, K., K. Deng and X. Zhou (2009) From trajectories to activities: a spatio-temporal join
approach, paper presented at the Proceedings of the 2009 International Workshop on Location
Based Social Networks, 25–32.
Yan, Z., D. Chakraborty, C. Parent, S. Spaccapietra and K. Aberer (2011) SeMiTri: a framework
for semantic annotation of heterogeneous trajectories, paper presented at the Proceedings of
the 14th international conference on extending database technology, 259–270.
Yang, Y., P. Widhalm, S. Athavale and M. C. González (2015) Mobility Sequence Extraction
and Labeling Using Sparse Cell Phone Data.
Yuan, N. J., Y. Wang, F. Zhang, X. Xie and G. Sun (2013) Reconstructing Individual Mobility
from Smart Card Transactions: A Space Alignment Approach, paper presented at the 2013
IEEE 13th International Conference on Data Mining (ICDM), 877–886, December 2013.
Zhang, Y., X. Qin, S. Dong and B. Ran (2010) Daily OD matrix estimation using cellular probe
data, paper presented at the 89th Annual Meeting Transportation Research Board.
Zhao, J., A. Rahbee and N. H. M. Wilson (2007) Estimating a Rail Passenger Trip Origin-
Destination Matrix Using Automatic Data Collection Systems, Computer-Aided Civil and
Infrastructure Engineering, 22 (5) 376–387, July 2007, ISSN 1467-8667.
Zheng, Y. (2015) Methodologies for cross-domain data fusion: an overview, Big Data, IEEE
Transactions on, 1 (1) 16–34.
Zilske, M. and K. Nagel (2014) Studying the Accuracy of Demand Generation from Mobile
Phone Trajectories with Synthetic Data, Procedia Computer Science, 32, 802–807, ISSN
1877-0509.
Zilske, M. and K. Nagel (2015) A Simulation-based Approach for Constructing All-day Travel
Chains from Mobile Phone Data, Procedia Computer Science, 52, 468–475, ISSN 1877-0509.