TD-2026: TRANSITDATA 2026
PROGRAM FOR THURSDAY, JUNE 25TH

09:00-10:10 Session 22A: Multimodal integration and New Tools for Assessing the Influence of Land-Use on Transit
Location: Left Room
09:00
Modeling the Toronto PATH with GTFS-Pathways for Wayfinding and Pedestrian Flow Analysis

ABSTRACT. (In-person presentation)

The PATH is a mostly underground pedestrian walkway network in downtown Toronto that spans more than 30 kilometres. It connects over 75 buildings, including office towers, shopping malls, and major tourist attractions, to the city’s public transit system. The PATH provides access to six stations on Toronto’s subway Line 1, six streetcar routes, and Canada’s busiest intermodal transportation hub – Union Station, which serves the Greater Toronto Area’s regional rail lines, the airport rail link known as the UP Express, and national and international passenger trains. The PATH is thus a major component of Toronto’s transportation network, facilitating the movement of commuters, residents, and tourists in a safe, climate-controlled environment. Despite its scale and importance, the PATH remains underrepresented in digital transportation models, and it is often excluded from multimodal trip planning tools and accessibility analyses. The primary reason for this is the lack of a digital, parseable, and interoperable model of the PATH.

In this research, a model of the Toronto PATH was developed using the GTFS-Pathways specification, an extension of the General Transit Feed Specification (GTFS). The purpose of GTFS-Pathways is to facilitate navigation in complex indoor spaces by detailing pedestrian walkways and vertical circulation elements like stairs, escalators, and elevators. GTFS-Pathways enables seamless integration with other transit data in GTFS format such as timetables, routes, stops, and service hours. The integrated GTFS data enables accurate trip planning for transit users looking to reach their destination along the PATH. Other potential use cases for this model are emergency response and crowd management. These are particularly important during major events such as concerts and the FIFA World Cup in 2026, when understanding the PATH’s role and limitations becomes critical.

The methodology of this work consists of several interconnected tasks. The first is collecting up-to-date data for the PATH. Multiple data collection approaches were explored. At places where building plan drawings were available, an automated takeoff method was proposed. In other areas, we tested field data collection methods including LiDAR, Bluetooth beacons, and conventional tools such as a measuring wheel and a handheld compass. These methods were evaluated and discussed to inform challenges regarding data collection for similar research projects.

After the data collection phase, the research team evaluated existing maps and databases of the PATH to act as an intermediate data format before conversion to the GTFS-Pathways format. These included the City of Toronto’s static wayfinding map, various proprietary indoor visualization tools, and the OpenStreetMap (OSM) database. The first two were dismissed because they are not suitably formatted for analytical studies. OSM was chosen as the intermediate data format because it is an open, collaborative, and flexible mapping database. Parts of the PATH have already been mapped in OSM by previous contributors. However, the accuracy and completeness of that data have not been verified, and there is no unified tagging convention across the entire PATH network. Furthermore, the current data requires updates to reflect changes in the PATH over the past years. Hence, this work builds on previous attempts to map the PATH rather than duplicating them. Ultimately, the OSM data is not limited to use in this project; it is readily available to other researchers and professionals.

After updating the OSM database with detailed PATH mapping, a tool was developed to convert the PATH network in OSM to the GTFS-Pathways format. The converter was designed to ensure that the output meets data requirements outlined in the GTFS documentation. In addition to highlighting its trip planning capability, we demonstrated how the GTFS-Pathways model could support accessibility studies and multi-modal connectivity. Additionally, the converter is flexible and thus may be applied to other locations with similar networks such as Montreal’s RÉSO and Calgary’s Plus 15. This enables comparative studies of similar systems across cities.
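The conversion step described above can be illustrated with a minimal sketch. It is not the authors' converter: the input record layout (`id`, `tags`, `from_node`, `to_node`, `length_m`), the tag-to-mode rules, and the `osm_`/`node_` ID prefixes are assumptions made for illustration. The `pathway_mode` codes and the `pathways.txt` column names follow the GTFS reference.

```python
import csv
import io

# pathway_mode codes as defined for pathways.txt in the GTFS reference.
WALKWAY, STAIRS, MOVING_SIDEWALK, ESCALATOR, ELEVATOR = 1, 2, 3, 4, 5

FIELDS = ["pathway_id", "from_stop_id", "to_stop_id",
          "pathway_mode", "is_bidirectional", "length"]


def pathway_mode(tags):
    """Map OSM tags to a GTFS pathway_mode (assumed tagging rules)."""
    highway = tags.get("highway")
    conveying = tags.get("conveying", "no") != "no"
    if highway == "elevator":
        return ELEVATOR
    if highway == "steps":
        return ESCALATOR if conveying else STAIRS
    if highway in ("footway", "corridor", "pedestrian"):
        return MOVING_SIDEWALK if conveying else WALKWAY
    return None  # not a pedestrian link, so it is skipped


def to_pathway_rows(ways):
    """Convert simplified OSM way records (hypothetical layout) to rows."""
    rows = []
    for way in ways:
        mode = pathway_mode(way["tags"])
        if mode is None:
            continue
        # Assumption: escalators are one-way, everything else is two-way.
        rows.append({
            "pathway_id": f"osm_{way['id']}",
            "from_stop_id": f"node_{way['from_node']}",
            "to_stop_id": f"node_{way['to_node']}",
            "pathway_mode": mode,
            "is_bidirectional": 0 if mode == ESCALATOR else 1,
            "length": way["length_m"],
        })
    return rows


def write_pathways(rows):
    """Serialize rows as the text of a pathways.txt file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample input: a corridor, an escalator, and a road to be ignored.
ways = [
    {"id": 1, "tags": {"highway": "corridor", "indoor": "yes"},
     "from_node": 10, "to_node": 11, "length_m": 42.5},
    {"id": 2, "tags": {"highway": "steps", "conveying": "forward"},
     "from_node": 11, "to_node": 12, "length_m": 9.0},
    {"id": 3, "tags": {"highway": "residential"},
     "from_node": 12, "to_node": 13, "length_m": 120.0},
]
rows = to_pathway_rows(ways)
print(write_pathways(rows))
```

A real converter would also need to emit the referenced nodes in `stops.txt` and handle levels, which this sketch leaves out.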

Finally, CAD floorplans for transit hubs and major buildings on the PATH were automatically generated from the GTFS-Pathways data, which showcased the walkways and vertical circulation elements within the PATH network. We demonstrated how these CAD floorplans could be used for dynamic pedestrian modeling and spatial analyses, which would enable studies on evacuation planning, wayfinding, and crowd management within the PATH.

GTFS has grown to become the de facto global standard for transit data. As a relatively new extension to GTFS, the Pathways component is yet to be adopted and deployed by many transit agencies. Additionally, the existing implementations remain relatively small, mainly limited to subway stations. To date, no transit agency – or data producer – in Canada has published data in the GTFS-Pathways format. Hence, this project serves as a pilot that explores the challenges of developing such a model in terms of data collection, processing, and validation for a complex pedestrian network like the Toronto PATH. Additionally, testing the capability of GTFS-Pathways for such a large-scale network uncovered gaps and limitations of the Pathways data specification in its current state.

In this study, we produced a digital model for the Toronto PATH network and integrated the model within Toronto’s transit data environment thanks to the GTFS data specification. This work contributes to the development of integrated transportation models and opens the door to more detailed analyses in Toronto, and in other cities. The project lays the framework for future research into the pedestrian component of transit research.

09:06
Joint Analysis of Neighbourhood and Transit Preferences in Immersive Virtual Reality

ABSTRACT. Canada’s population reached 41 million in 2024, a 25% increase over two decades, intensifying the pressure on housing, urban infrastructure, and public transit. Access to reliable and affordable transportation is central to social and economic participation, yet traditional survey methods rarely capture how individuals perceive transit and neighbourhood environments when making complex residential and mobility decisions. This study introduces an innovative virtual reality (VR) experiment that generates rich behavioral data to explore how people experience, evaluate, and choose between different neighbourhood and transportation options.

Using the Unity game engine and a head-mounted VR headset, participants are immersed in three interactive VR environments representing Downtown, Midtown, and Suburban neighbourhoods in the Greater Toronto Area (GTA). These environments model neighbourhoods using a combination of abstracted real-world features and stylized representations of housing forms, street layouts, and transit options. Housing preferences are represented through visible dwelling types and contextual cues such as density and building size. Transportation preferences are embedded through the presence of walking paths, subway entrances, varying street designs, and bus stops. Neighbourhood characteristics, such as weather conditions, safety indicators (e.g., time of day, population density), and cultural familiarity elements, can dynamically modify the neighbourhood scenes in VR to simulate how they influence preferences.

This research follows a three-stage design. It begins with a pre-VR survey that collects detailed information on participants’ demographics, socioeconomic situation, current housing and travel patterns, and their initial neighbourhood and transportation preferences. Participants then complete a VR experiment: a sequence of 12 scenes organized into 4 triplets. Each scene is generated through a factorial combination of neighbourhood type (Suburb, Midtown, Downtown), environmental condition (Default, Unsafe/Rainy, Snowy, or Ethnic), and transportation accessibility level (Low or High). Although 24 scenes are possible (3 × 4 × 2), each participant completes 12 randomly selected scenes. These scenes are presented in randomized triplets so that each participant experiences a unique sequence while seeing every neighbourhood type once per triplet. Within each scene, participants freely explore the environment using teleportation-based movement and evaluate visible housing forms, street layouts, and transportation options. Transit, driving, cycling, and rideshare options appear as floating markers in the scene that display realistic travel times and costs drawn from commuting data. Participants must therefore consider accessibility before using a UI control to select the mode they would personally choose for a work commute. After each triplet of scenes, participants select which neighbourhood they would prefer to live in, and after all scenarios are completed, they identify their overall preferred neighbourhood type. Throughout this process, the system continuously records positional and gaze data to capture where participants move, what environmental elements they examine, and how long they fixate on key features such as transit infrastructure, safety cues, or housing types. These behavioral signals provide insight into how individuals interpret environmental information and how those interpretations shape neighbourhood desirability and mode choice, allowing comparisons between pre-existing beliefs and the preferences formed during immersive, in-context evaluation.
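The scene design can be sketched in a few lines. The abstract gives the factor levels, the 24-scene factorial, and the constraint that every triplet contains each neighbourhood type once. The specific sampling rule below (drawing four distinct condition/accessibility pairs per neighbourhood) is an assumption, not the study's documented procedure.

```python
import itertools
import random

NEIGHBOURHOODS = ["Suburb", "Midtown", "Downtown"]
CONDITIONS = ["Default", "Unsafe/Rainy", "Snowy", "Ethnic"]
ACCESS_LEVELS = ["Low", "High"]


def full_factorial():
    """All 3 x 4 x 2 = 24 possible scenes."""
    return list(itertools.product(NEIGHBOURHOODS, CONDITIONS, ACCESS_LEVELS))


def sample_triplets(rng, n_triplets=4):
    """Draw n_triplets triplets, each with one scene per neighbourhood type.

    Assumed rule: for each neighbourhood, sample n_triplets distinct
    (condition, accessibility) pairs, so no scene repeats for a participant.
    """
    pairs = list(itertools.product(CONDITIONS, ACCESS_LEVELS))
    per_nbhd = {n: rng.sample(pairs, n_triplets) for n in NEIGHBOURHOODS}
    triplets = []
    for i in range(n_triplets):
        triplet = [(n, *per_nbhd[n][i]) for n in NEIGHBOURHOODS]
        rng.shuffle(triplet)  # randomize presentation order within a triplet
        triplets.append(triplet)
    return triplets


triplets = sample_triplets(random.Random(1))
for number, triplet in enumerate(triplets, start=1):
    print(number, triplet)
```

Seeding one `random.Random` per participant would make each participant's sequence reproducible.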

To date, data collection has yielded 50 participants, with an anticipated sample of approximately 200 participants by the time of the conference. Recruitment occurs through social media campaigns (LinkedIn, Instagram, and TikTok), in-person outreach, and physical posters placed in high-traffic areas across Downtown, Midtown, and Suburban GTA communities to capture a broad demographic range of residents. These recruitment efforts also allow the study to compare VR responses across spatially distinct neighbourhood types.

Preliminary results show clear shifts in participants’ preferences after experiencing the VR scenarios. While pre-VR surveys indicated similar levels of interest across all neighbourhood types, post-VR responses revealed significantly stronger preference for Midtown and Downtown environments, areas characterized by higher public transit accessibility, a greater mix of services, and stronger commercial activity. These environments elicited both longer dwell times and higher gaze concentration on transit infrastructure found in these scenes.

Mode choice responses also shifted after VR exposure. Participants reported increased openness to cycling and ridesharing modes. Public transit remained one of the most preferred modes, particularly in the Downtown and Midtown environments where subway and bus infrastructure were prominent and easily accessible. In suburban scenes, transit preference declined slightly, driven by longer perceived commuting distances and lower accessibility to public transit, while driving received higher consideration. These results illustrate how environmental cues impact willingness to use different commuting modes.

Beyond measuring preference shifts, the study demonstrates the potential of VR as a behavioral data collection tool for public transit research. The immersive format allows participants to react naturally to environmental conditions that are difficult to replicate in traditional text or web-based surveys. The resulting dataset offers new insights into how complex perceptions influence neighbourhood, housing, and transportation choices. By linking immersive behavioral data with demographic and contextual factors, VR enables planners to better anticipate how people might respond to new transit services, neighbourhood redesign, or policy interventions. The study provides a replicable framework for integrating human-centred, perception-based evidence into data-driven transit, housing, and equity planning.

09:12
Using LiDAR to Support Service Planning Decisions at WMATA

ABSTRACT. (In-person presentation)

Transit-oriented Development (TOD) is the term for dense, walkable construction placed near high-frequency transit connections. People who occupy housing or frequent businesses near these hubs are more likely to use transit as their primary mode of travel, boosting ridership of the transit agency and reducing congestion on the region’s road network. While there are numerous ways to identify the benefits of TOD, the Washington Metropolitan Area Transit Authority (WMATA/"Metro") was interested in better understanding the types of service planning decisions we could make that would spur this type of development and exploring new tools that could better inform the future of our network. After several months of research and feedback from peer agencies across the United States, Metro developed a way of quantifying this development using free, open-source LiDAR (Light Detection and Ranging) data. LiDAR uses reflected laser pulses to create three-dimensional models of the Earth’s surface and of items upon it, such as buildings and trees, which makes it possible to track new construction and compare levels of development over time. Combined with GIS tools, LiDAR data allows novel means of measuring the levels of development around transit hubs from both economic and structural points of view. To showcase this, our research first focused on quantifying the development impacts of the 2004 construction of NoMa Metro Station. This project is one of the most prominent and well-documented examples of TOD in the nation. Using LiDAR, we were able to quantify and map the growth of this once-industrial neighborhood into a vibrant city center by measuring building heights and exteriors at each LiDAR scan date and color-coding the results, allowing us not only to see skyline development but also to analyze the broader land use via plot shapes.
By correlating this with other factors, such as American Community Survey demographic data and internal ridership and scheduling data, Metro can quantify this growth to show how transit helps drive economic development of an area. This approach also allowed Metro to highlight the correlation between transit growth and development in neighborhoods where that relationship is less well known. For example, the 14th Street corridor has grown rapidly over the last 15 years despite limited Metrorail access. Mapping TOD patterns with LiDAR highlights other dense, fast-growing hubs, like NoMa, that may call for improved transit service. Looking at past and present trends together helps Metro pinpoint where expansions such as new stations or high-frequency bus lines could have had the greatest impact and what kind of investments can spur development in the future. We believe this research could benefit other transit agencies and be easily replicated due to its reliance on open-source data. Other agencies seeking to measure their TOD connections and inform planning decisions could use this methodology in numerous ways, further enhancing data-driven decision-making. Additionally, there is high potential for automation by gathering ridership data and downloading LiDAR data at set intervals. In this presentation at TransitData, I look to further explain this process and go in-depth on the use cases at Metro, while also showcasing some of the pathways towards automating this analysis.
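One common way to compare development between two LiDAR scans is to difference rasterized height grids. The sketch below is not Metro's workflow. It assumes the point clouds have already been rasterized to a digital surface model (DSM) and a bare-earth terrain model (DTM) on the same grid, and the 3 m gain threshold and the sample values are hypothetical.

```python
def normalized_heights(dsm, dtm):
    """Height above ground per cell: surface elevation minus terrain elevation."""
    return [[s - t for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(dsm, dtm)]


def height_change(heights_early, heights_late):
    """Cell-by-cell change in above-ground height between two scan dates."""
    return [[late - early for early, late in zip(e_row, l_row)]
            for e_row, l_row in zip(heights_early, heights_late)]


def new_construction_cells(change, min_gain_m=3.0):
    """Grid cells whose height grew by at least min_gain_m (assumed threshold)."""
    return [(r, c)
            for r, row in enumerate(change)
            for c, gain in enumerate(row)
            if gain >= min_gain_m]


# Made-up 2 x 2 grids, elevations in metres.
dtm = [[10.0, 10.0], [10.0, 10.0]]
dsm_early = [[10.0, 14.0], [10.0, 10.0]]   # one low building
dsm_late = [[40.0, 14.0], [10.0, 11.0]]    # a tower appears in cell (0, 0)

change = height_change(normalized_heights(dsm_early, dtm),
                       normalized_heights(dsm_late, dtm))
print(new_construction_cells(change))
```

Summing the flagged cells inside a station's walkshed would give a simple per-station growth measure that can be joined to ridership data.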

09:18
Determinants of Participants’ Trip-Booking Frequency in a MaaS Trial: A Double Machine Learning Analysis

ABSTRACT. The concept of Mobility as a Service (MaaS) has been promoted as a promising way to reduce the complexity of travel planning and execution across multimodal transport services. Understanding users’ actual travel behaviours within MaaS systems is critical for designing, implementing, and maintaining effective operational strategies. However, empirical evidence on large-scale MaaS adoption and trip-booking behaviour in real-world environments remains limited. Most existing studies focus on users’ bundle subscription behaviours, offering insights into enhancing the attractiveness of MaaS bundles, but the actual trip-booking behaviour enabled by these bundles remains under-explored. In particular, the determinants of users’ public transport (PT) trip booking frequency (i.e., number of trip bookings) on the MaaS platform and how different subscription patterns translate into different trip booking behaviours are still unclear.

This study addresses these gaps in understanding trip booking using data from a MaaS trial at the University of Queensland (UQ) in Brisbane, Australia. The data comprise MaaS subscriptions and PT trip bookings from 1,229 student and staff participants during the first semester of 2023. The trial offered weekly (PT7), monthly (PT30), and quarterly (PT90) public transport bundles with unlimited access to buses, trains, ferries, and trams, as well as multimodal bundles combining PT with shared e-scooters and e-bikes, with additional usage charged on a pay-as-you-go basis. Participants completed a sign-up survey capturing demographics, user type, car ownership, ability to ride e-bikes or e-scooters, and motivations for joining the trial, including reducing transport costs, simplifying travel booking, supporting research, recommendations from peers, and curiosity. Trip booking records were used to calculate average trip distance, share of trips with transfers, and proportion of weekend or night trips. Residential built-environment characteristics, such as land-use mix, public transport stop density, intersection density, bikeway and footpath density, population density, distance to the CBD, and the Index of Relative Socio-economic Advantage and Disadvantage (IRSAD), were also included. Participants’ typical MaaS subscription patterns were classified into seven categories (PT7 Dominant, PT30 Attrition, PT30 Committed, PT7 & PT30 Infrequent, PT90 Dominant, Multimodal Committed, and PT90 to PT30 Committed), providing a rich dataset for analysing determinants of trip-booking frequency.

Participants’ demographics, motivations for joining the MaaS trial, built-environment characteristics, and historical trip booking behaviour may jointly influence both subscription patterns and trip-booking frequency, rendering subscription patterns endogenous in explaining the number of trip bookings. Treating them as exogenous can bias estimates, while excluding them overlooks an important behavioural pathway. To address this challenge, the study adopts a double machine learning (DML) framework capable of handling high-dimensional and nonlinear confounding effects. In this study, the outcome variable is the total number of trip bookings of each participant over the semester. The treatment variable is the typical MaaS subscription pattern, and confounding variables include demographics, motivations for joining the MaaS trial, built-environment attributes, and historical trip booking characteristics. Highly collinear predictors were excluded by checking correlation coefficients (> 0.8). The DML procedure proceeds in three stages. First, two XGBoost models are estimated based on the confounding variables: an XGBoost regressor for predicting the outcome and a multiclass XGBoost classifier for predicting the treatment. Second, we residualise the outcome and treatment by subtracting their model-predicted values. Third, these residualised components are fed into a causal forest with 5-fold cross-fitting to obtain orthogonalised estimates of the average and heterogeneous treatment effects of subscription patterns on the number of trip bookings.
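The residualise-then-estimate logic can be shown with a deliberately simplified sketch on synthetic data. It is not the study's model. It swaps the XGBoost nuisance models for one-variable least squares, uses a single continuous treatment instead of a multiclass subscription pattern, and replaces the causal forest with a residual-on-residual regression that returns one average effect. All data and names are made up.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def dml_effect(x, d, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of the effect of d on y."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    res_d, res_y = [0.0] * n, [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Nuisance models are fitted on the other folds only (cross-fitting).
        predict_y = fit_line([x[i] for i in train], [y[i] for i in train])
        predict_d = fit_line([x[i] for i in train], [d[i] for i in train])
        for i in held_out:
            res_y[i] = y[i] - predict_y(x[i])
            res_d[i] = d[i] - predict_d(x[i])
    # Regress outcome residuals on treatment residuals (no intercept).
    return (sum(rd * ry for rd, ry in zip(res_d, res_y))
            / sum(rd * rd for rd in res_d))


# Synthetic data: confounder x drives both treatment d and outcome y.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(2000)]
d = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [2.0 * di + 1.0 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]

naive_fit = fit_line(d, y)
naive_slope = naive_fit(1.0) - naive_fit(0.0)  # ignores the confounder
theta_hat = dml_effect(x, d, y)                # true effect is 2.0
print(round(naive_slope, 2), round(theta_hat, 2))
```

In this toy setup the naive slope absorbs part of the confounder's influence, while the residual-based estimate lands close to the true value. This is the kind of gap the abstract's comparison of raw and adjusted differences describes.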

Shapley Additive Explanations (SHAP) dependence plots reveal clear threshold effects for several confounding variables. For example, when the share of weekend or holiday bookings exceeds 0.2, SHAP values become positive, indicating that predicted trip bookings tend to increase as the share of trips during these periods rises. For the share of trips involving transfers, SHAP values transition from negative or near zero to mildly positive between 0.15 and 0.2, after which the effect flattens. Average trip distance is associated with higher SHAP values below approximately 8–10 km, but becomes mostly negative beyond ~20 km. For built-environment variables, IRSAD exhibits an overall negative relationship with bookings, with a steeper decline above the population average of ~1000, where the model consistently predicts fewer bookings. Land-use mix performs best in the mid-range (0.4–0.6), while SHAP values are generally negative below ~0.3 and above ~0.7, suggesting an optimal intermediate level of land-use diversity for PT demand. SHAP values for PT stop density are mostly negative at low densities but rise and become positive as density increases, indicating that areas with more frequently-served PT stops are associated with higher predicted trip bookings.

After controlling for nonlinear confounding effects in the DML framework, all subscription patterns exhibit positive and statistically significant causal effects on trip bookings relative to PT7 & PT30 Infrequent users. The largest uplift is observed for PT90 to PT30 Committed (ATE ≈150.3 trips vs. a raw gap of 180.0), followed by PT30 Committed (≈125.9 vs. 144.5), PT7 Dominant (≈85.6 vs. 107.2), PT90 Dominant (≈83.4 vs. 118.2), and Multimodal Committed (≈76.3 vs. 84.3), while PT30 Attrition shows a smaller but still positive effect (≈49.9 vs. 63.5). Comparing these adjusted estimates with their corresponding raw mean differences indicates that confounding explains roughly 10–30% of the initially observed gaps across groups. The remaining 70–90% of each gap persists after adjustment and can be attributed to the causal effects of typical subscription patterns.
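The 10–30% and 70–90% ranges follow directly from the reported figures. The short check below divides each adjusted estimate by its raw gap. Only the numbers come from the abstract, and the variable names are illustrative.

```python
# (adjusted ATE, raw mean difference) relative to PT7 & PT30 Infrequent users
effects = {
    "PT90 to PT30 Committed": (150.3, 180.0),
    "PT30 Committed": (125.9, 144.5),
    "PT7 Dominant": (85.6, 107.2),
    "PT90 Dominant": (83.4, 118.2),
    "Multimodal Committed": (76.3, 84.3),
    "PT30 Attrition": (49.9, 63.5),
}

# Share of each raw gap that remains after adjusting for confounders.
retained = {name: ate / raw for name, (ate, raw) in effects.items()}

for name, share in retained.items():
    print(f"{name}: {share:.1%} retained, {1 - share:.1%} explained by confounding")
```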

Overall, the study makes three key contributions. First, it moves beyond preference-based analyses of bundle subscriptions to quantify how typical bundle choices influence actual PT trip-booking behaviour. Second, it offers a detailed examination of the determinants of participants’ trip-booking frequency on the MaaS platform, uncovering important nonlinearities and threshold effects. Third, it showcases the value of double machine learning and related machine-learning methods for estimating causal effects in the presence of high-dimensional, nonlinear confounding effects. Together, these findings provide insights for MaaS operators and policymakers seeking to design subscription bundles that foster greater PT usage.

09:24
Evaluating Behavioral Responses to Mobility Incentives and Uber Integration in a Public MaaS Platform

ABSTRACT. Mobility-as-a-Service (MaaS) platforms integrate trip planning and payment across multiple modes. Public agencies are investing in MaaS and financial incentives to promote sustainable transport, but critical questions remain about their effectiveness: Do they meaningfully change behavior, and does integrating ride-hailing complement or substitute for public transit? This study uses a unique quasi-experiment and longitudinal telemetry from the Vamos-EZHub platform in San Joaquin County, California, to provide causal evidence on how travelers respond to two sequential policy interventions: (1) a prepaid Mobility Incentives (MI) program, and (2) a transit-triggered $5 Uber credit integrated into the Vamos app.

Our dataset comprises detailed, timestamped logs of user behavior, including route searches, ticket purchases, ticket activations, and Uber trips, from January 2023 to December 2024. The study defines three distinct phases:

• Baseline: Pre-MI/Pre-Uber (Jan-Jun 2023)

• Intervention 1: MI-only period (Jul 2023-Mar 2024)

• Intervention 2: MI + Uber Credit period (Apr-Dec 2024)

We employ a two-way fixed effects regression framework to isolate the impact of these interventions, comparing users to themselves over time while accounting for seasonal trends.

Analysis 1: Platform Engagement. We first assess whether the incentives increased user engagement with the Vamos platform, measured through monthly rates of route searches and transit ticket purchases. The models estimate the incremental effects of the MI program and the Uber integration, testing the hypothesis that financial incentives and new mode options stimulate sustained platform use.

Analysis 2: Complement vs. Substitute. We then investigate the relationship between Uber and transit. Using a rule-based algorithm, we define an "Uber-transit linked trip" as an Uber ride that begins or ends within 250 meters of a transit stop and occurs within 90 minutes of a validated transit ticket activation. We model changes in users' monthly transit activations, Uber trips, and the share of Uber trips linked to transit. This design directly tests whether Uber serves as a complement (increasing or supporting transit use) or a substitute (replacing transit trips). Moreover, we use event-study graphs to visualize the dynamic timing of any behavioral shifts.
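The linking rule can be written as a small predicate. The 250 m and 90 minute thresholds come from the abstract. The record layout, the sample coordinates, and the reading of "within 90 minutes" as applying to either the ride's start or its end are assumptions.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_linked(trip, activations, stops,
              max_dist_m=250, max_gap=timedelta(minutes=90)):
    """True if the Uber trip meets both the distance and the time rule."""
    near_stop = any(
        haversine_m(lat, lon, stop_lat, stop_lon) <= max_dist_m
        for lat, lon in (trip["pickup"], trip["dropoff"])
        for stop_lat, stop_lon in stops
    )
    if not near_stop:
        return False
    # Assumption: the gap is measured from the ride's start or end time.
    return any(
        abs(trip["start"] - t) <= max_gap or abs(trip["end"] - t) <= max_gap
        for t in activations
    )


# Hypothetical data: one stop, one ride picked up roughly 110 m from it.
stops = [(37.9577, -121.2908)]
trip = {
    "pickup": (37.9587, -121.2908),
    "dropoff": (37.9800, -121.3100),
    "start": datetime(2024, 5, 1, 8, 0),
    "end": datetime(2024, 5, 1, 8, 20),
}
print(is_linked(trip, [datetime(2024, 5, 1, 8, 45)], stops))
```

With many stops, a spatial index would replace the nested loop, but the rule itself stays the same.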

Analysis 3: Heterogeneous Effects. Furthermore, by using the linked MI survey subsample, we also explore which users are more likely to use Uber as a complement to transit versus as a substitute. We look into how patterns vary by income, vehicle ownership, disability status, and other reported mobility constraints. This provides context on who benefits from improved connectivity and who might be induced to substitute away from transit.

Expected results include whether MI funds and Uber credits increased MaaS platform engagement, whether Uber complements or substitutes transit trips, and whether the share of Uber trips linked to transit grew as multimodal integration matured. The analyses will clarify when and for whom Uber serves as a first-/last-mile connector versus a transit-replacing mode. By combining automated MaaS telemetry with survey insights, this study will provide transit agencies and policymakers with an evidence-based framework for designing effective mobility incentives and structuring TNC partnerships to maximize public transit ridership and equitable access.

09:00-10:10 Session 22B: Data Management Challenges
Location: Center Room
09:00
Analytical Tools at Société de Transport Montréal: A Comprehensive Business Intelligence Framework for Public Transport Planning

ABSTRACT. (Virtual presentation)

Public transit agencies face challenges in optimizing service delivery while managing resource constraints and evolving passenger demands. This paper presents a comprehensive suite of analytical tools developed for the Planning and Network Development department by the Analytical team at Société de transport de Montréal (STM), demonstrating how integrated business intelligence frameworks can transform transit planning practices. Our approach combines Python-based data processing, SQL database management, machine learning algorithms, and Power BI visualization to create actionable insights across multiple operational dimensions.

The analytical framework addresses four critical domains of transit operations. The stop crowding tool analyzes passenger loading patterns at individual bus stops during peak periods, using automatic passenger counter (APC) data to calculate crowding rates defined as passenger-minutes exceeding established capacity thresholds. Historical data reveal seasonal patterns reflecting ridership fluctuations. The passenger load profile analysis provides route-level visualization of vehicle occupancy patterns across entire alignments, processing APC data to identify critical loading points, maximum load locations, and sections with excess or insufficient capacity. It supports service design decisions including stop spacing, vehicle assignment, and frequency allocation. The interval monitoring system evaluates service reliability by measuring adherence to scheduled headways, using automatic vehicle location (AVL) data to calculate realized intervals between consecutive bus passages, with SARIMA and other algorithms enabling forecasting for scenario planning of service modifications and infrastructure investments. Finally, ArcGIS Maps integration within Power BI provides geospatial context for performance metrics, including commercial speed, dwell time distributions, and person-minute exposure on route segments. This passenger-centric metric combines passenger loads and travel times, enabling prioritization of intervention locations based on passenger exposure.

The technical architecture consists of automated Python scripts for data extraction and transformation, SQL Server for efficient data warehousing and aggregation across temporal and spatial dimensions, and Power BI for interactive visualization. This integrated approach transforms transit planning from reactive operations management to proactive, data-driven decision-making, enabling planners to understand current performance, predict future challenges, and evaluate intervention strategies before implementation.
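Three of the metrics named above can be sketched with simple arithmetic. These are illustrative readings rather than STM's definitions. In particular, treating crowding as passengers above capacity multiplied by minutes is an assumption, as are the function names and sample values.

```python
def realized_headways(passage_minutes):
    """Intervals between consecutive bus passages at a stop (AVL times, minutes)."""
    times = sorted(passage_minutes)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def excess_passenger_minutes(observations, capacity):
    """Passenger-minutes above a capacity threshold.

    observations: (passenger_count, duration_minutes) pairs from APC data.
    Assumed reading: only passengers beyond the threshold are counted.
    """
    return sum(max(0, load - capacity) * minutes
               for load, minutes in observations)


def person_minute_exposure(load, travel_time_min):
    """Passenger load multiplied by segment travel time."""
    return load * travel_time_min


# Made-up example values.
print(realized_headways([0, 10, 25, 31]))
print(excess_passenger_minutes([(12, 5), (25, 10), (18, 5)], capacity=15))
print(person_minute_exposure(40, 3.5))
```

Comparing each realized headway with the scheduled one gives a simple adherence measure, and ranking segments by exposure highlights where delays affect the most passengers.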

09:06
An Enhanced Data Specification for Next-Generation Transit Planning and Operations

ABSTRACT. Public transit planning increasingly depends on standardized data feeds to support service design, operational analysis, and strategic decision-making. While the General Transit Feed Specification (GTFS) has become the industry standard for publishing schedule and network information, it remains limited by its static structure and lack of explicit representations of vehicle circulation and passenger dynamics. This paper introduces the Transit Planning Feed Specification (TPFS), an enhanced data framework that extends GTFS through the integration of three complementary feed categories: Revenue, Non-Revenue, and Ridership feeds. These new feeds capture critical operational and behavioral dimensions, including deadheading movements, passenger demand variations, and vehicle flow, which are not explicitly represented in existing standards. Using the Dallas Area Rapid Transit (DART) system as a case study, we demonstrate a systematic process for transforming GTFS data into TPFS using algorithmic and mathematical modeling approaches. A real-world application to transit electrification planning illustrates the value of TPFS in supporting optimized charging infrastructure placement and energy management strategies. The results highlight the potential of TPFS to bridge key data gaps in transit systems and provide a more comprehensive foundation for advanced public transportation planning and optimization.

09:12
Storing and Sharing Transit Data at Scale: A Workflow for Big Data Without Big Costs

ABSTRACT. (In-person presentation)

Public transit operators collect an enormous amount of operational data each day, mostly through Automatic Vehicle Location (AVL) and Automatic Passenger Counting (APC) systems. The resulting datasets, often consisting of millions of GPS points per week, contain useful information that can help operators improve scheduling, reduce delays, optimize routes, and enhance customer satisfaction. However, much of this data remains locked away in enterprise relational databases or proprietary systems that are not designed for open access. As a result, many of the stakeholders who could benefit most from such data, including academic researchers, students, consultants, and policy makers, face significant barriers in accessing or working with it.

For many transit operators, data is provisioned through ad-hoc requests, extracted manually by technical staff, or shared through limited public dashboards that provide only aggregated summaries. While dashboards can be useful, they do not support deeper investigations into issues such as neighbourhood-level service reliability, equity of access, or operator performance. Moreover, because traditional enterprise databases are optimized for internal operational needs rather than public use, transit operators often struggle to share raw or disaggregated data externally without considerable effort. As transit systems continue to produce high-resolution data, there is an increasing need for cost-effective, scalable, and openly accessible data infrastructure.

This presentation introduces a cloud-native extract-transform-load (ETL) pipeline designed to address these challenges by providing stakeholders with a cost-effective, scalable, and easily replicable method for collecting, storing, and sharing vast amounts of GTFS-realtime vehicle location data. The system uses serverless compute resources on Amazon Web Services (AWS), which allows the pipeline to run efficiently without dedicated servers or complex management overhead. A lightweight event scheduler triggers the ingestion process at frequent intervals, capturing GTFS-realtime vehicle position feeds from transit operators at intervals as short as 10 seconds.

Once the data is captured, the ETL pipeline transforms it into GeoParquet, a modern open-source data format designed specifically for large-scale geospatial and tabular data. Unlike traditional relational database tables, GeoParquet is a columnar storage format, meaning it organizes data by columns rather than rows. This structure offers significant advantages: it compresses very efficiently, supports fast querying, and works well with cloud object storage systems such as Amazon S3. Because GeoParquet is “cloud-native,” it eliminates many of the bottlenecks associated with database-centric workflows. Users can query and work with the data remotely without downloading it, requesting access credentials to a live database, or installing specialized geospatial software.

Another key benefit of GeoParquet is its compatibility with modern analytics tools such as DuckDB, a lightweight in-process analytical engine that allows users to query millions of rows of data from their local machines. DuckDB reads columnar formats natively, so students, researchers, and consultants can perform complex temporal and spatial analysis on large datasets without requiring a high-performance server or cloud compute cluster. This dramatically lowers barriers to entry. Instead of waiting for database exports or navigating restrictive IT systems, stakeholders can access static files stored in public or semi-public cloud storage and work with them directly.

To demonstrate the system’s capabilities, we applied this pipeline to collect GTFS-realtime position data from six Canadian transit operators: Calgary Transit, Edmonton Transit Service, OC Transpo, TransLink, Toronto Transit Commission (TTC) and Société de transport de Montréal (STM). Over the course of the demonstration, which lasted 6 months, the pipeline successfully collected millions of rows of real-time vehicle position data. These data were stored as static GeoParquet files in a cloud storage environment and made available for local querying, aggregation, and visualization.

Beyond storage and querying, the project also demonstrated how accessible visualization tools can be layered on top of this infrastructure. Using browser-based, open-source visualization platforms capable of handling large datasets, users can explore patterns such as route-level reliability, peak-period congestion, service coverage gaps, or variations in travel speeds over time. This type of exploratory analysis is often beyond the capabilities of traditional public transit dashboards but becomes feasible when raw, high-resolution data is available in an efficient, standardized format.

By showcasing a pipeline that is cost-effective, transparent, and reproducible, this work presents a viable alternative for stakeholders to store and query vast amounts of transit location data without time-consuming data requests to transit operators. The architecture requires minimal operational maintenance, scales automatically with data volume, and does not depend on proprietary software. More importantly, this approach empowers a broader community of users to access and work with transit data directly. Academic researchers can conduct longitudinal studies; consultants can analyze performance for planning projects; policy makers can evaluate transit reliability in underserved neighbourhoods; and students can engage with real-world mobility data using only lightweight computing resources.

09:18
AI-Driven Citizen Development: Building Open-Source Transit Tools Faster and Smarter

ABSTRACT. Public transit agencies increasingly seek cost-effective, innovative solutions to leverage data for planning, performance, and customer insights. Our team adopted AI-assisted coding tools, such as GitHub Copilot, to empower data scientists, many without formal software engineering backgrounds, to become citizen developers. This approach enabled rapid learning and accelerated development of high-quality, open-source tools that deliver significant business value at a fraction of traditional outsourcing costs. These solutions are not only low-cost to maintain but also foster organizational agility and technical self-sufficiency.

While using AI to assist coding is not inherently novel, this talk aims to present real-world use cases that demonstrate how pushing boundaries beyond conventional scopes can unlock transformative outcomes. Attendees will gain practical insights into how AI-driven development democratizes technical capability, reduces barriers to innovation, and creates sustainable solutions for transit data challenges, encouraging others to rethink what’s possible when AI becomes a development partner.

09:24
When Counters Miscount: Lessons from Quebec RTC’s Data Journey

ABSTRACT. In-person presentation

RTC has developed a rigorous methodology to assess the accuracy of passenger counter data based on several criteria: the presence of active counters on board, consistency between boardings and alightings, absence of abnormal onboard loads, and alignment with smart card validations. The Mobility Data Analytics team then created a Power BI report that integrates multiple data sources, enabling trip-level analysis and individualized vehicle monitoring. One surprising finding was the wide range of causes behind data issues, affecting both older and newer vehicles. Examples include:

1. Hardware failures
2. Integration issues within the ITS
3. Incorrect configurations
4. Interference from other onboard equipment (door arms, light signals, other IT systems)

These cases (on which we will provide some insight during the presentation) illustrate the complexity of managing passenger counting data and the importance of continuous maintenance and regular monitoring, in collaboration with internal teams (IT, maintenance, management) and suppliers. Passenger counting data quality is a strategic issue for public transit and requires a multidisciplinary, proactive approach. Even when considering a transition to more advanced technologies (such as camera-based counters), the principles of data monitoring and validation will remain essential.
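
The four criteria listed in this abstract can be read as trip-level checks. The following is a minimal sketch of that idea, not RTC's actual methodology: the `TripCounts` record, the 10% balance tolerance, the 1.5x capacity limit, and the rule that validations should not exceed counted boardings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TripCounts:
    """Hypothetical trip-level record combining APC and smart card data."""
    trip_id: str
    counter_active: bool      # at least one counter reported during the trip
    boardings: List[int]      # APC boardings per stop, in stop order
    alightings: List[int]     # APC alightings per stop, in stop order
    card_validations: int     # smart card validations recorded on the trip
    vehicle_capacity: int


def check_trip(trip, balance_tol=0.10, load_factor=1.5):
    """Return the names of failed checks; an empty list means the trip passed.

    balance_tol and load_factor are illustrative thresholds (assumptions).
    """
    failures = []

    # Criterion 1: an active counter must be present on board.
    if not trip.counter_active:
        failures.append("no_active_counter")

    # Criterion 2: total boardings and alightings should roughly balance.
    total_on, total_off = sum(trip.boardings), sum(trip.alightings)
    if abs(total_on - total_off) > balance_tol * max(total_on, total_off, 1):
        failures.append("boarding_alighting_imbalance")

    # Criterion 3: the running load must never be negative or implausibly high.
    load = 0
    abnormal = False
    for on, off in zip(trip.boardings, trip.alightings):
        load += on - off
        if load < 0 or load > load_factor * trip.vehicle_capacity:
            abnormal = True
    if abnormal:
        failures.append("abnormal_load")

    # Criterion 4: counted boardings should not fall below card validations.
    if trip.card_validations > total_on:
        failures.append("fewer_boardings_than_validations")

    return failures


good = TripCounts("t1", True, [10, 5, 0], [0, 5, 10], 12, 60)
bad = TripCounts("t2", True, [2, 0, 0], [0, 9, 0], 5, 60)
print(check_trip(good))
print(check_trip(bad))
```

Returning the list of failed checks, rather than a single pass/fail flag, keeps the cause visible, which matters when the same symptom can come from hardware, configuration, or integration problems.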

(Do not hesitate to contact us if you need more details.)

09:00-10:10 Session 22C: Exploring Fare Policy Impacts and Elasticities
Location: Right Room
09:00
Behavioural Effects of Fare Changes on MaaS Usage of Public Transport

ABSTRACT. Reduced public transport (PT) fares have been widely adopted as a policy tool to enhance urban mobility, alleviate traffic congestion, and promote sustainable transportation. Meanwhile, Mobility as a Service (MaaS) has emerged as a complementary strategy that integrates multiple transport modes into a unified platform, improving PT accessibility and convenience. However, a critical paradox exists: while reduced PT fares aim to increase ridership, they may simultaneously undermine the financial viability of MaaS operations by diminishing the price advantage of MaaS bundles and potentially driving user attrition. Despite this tension, no prior research has empirically examined how reduced PT fares affect MaaS retention. This study addresses this gap by investigating the behavioural and spatial factors that influence users' decisions to continue or discontinue their MaaS subscriptions under a substantially subsidised PT fare environment.

The study draws on the ODIN PASS MaaS trial conducted at the University of Queensland (UQ), Brisbane, Australia. The trial context is particularly relevant given the Queensland government's introduction of a 50-cent flat PT fare on 5 August 2024, applicable to all PT modes regardless of distance or time of day. Trip booking records were collected across Semester 1 (February to May 2024) and Semester 2 (July to October 2024), encompassing the period before and after the fare policy was introduced. After excluding potential graduating students and UQ staff, 3,913 student participants were retained for analysis.

A two-stage sequence analysis was first employed to identify distinct PT travel behaviour. The first-stage analysis identified four clusters: Active Users (27.7%), Price Switchers (26.7%), Inactive Users (37.6%), and New Adopters (7.9%). To enable a rigorous evaluation of MaaS retention, the analysis focused on 2,130 users classified as Active Users or Price Switchers, who exhibited a stable pre-policy travel baseline. A second-stage sequence analysis on this subset yielded three refined clusters: Steady Users (27.8%), Price Switchers (49.1%), and Active Users (21.3%). Nearly half of engaged users discontinued MaaS subscriptions following the introduction of the 50-cent fare, while approximately half maintained their usage.

Users were further stratified into "movers" (those who relocated residences between semesters, n=427) and "non-movers" (those who remained at the same address, n=1,703), given that residential relocation may confound the relationship between built environment factors and MaaS retention. For movers, a multinomial logit (MNL) model was estimated, while for non-movers, an integrated path-choice model was applied to capture both direct and indirect (mediated) effects of built environment and socioeconomic characteristics on MaaS retention, with the joint intensity of discretionary trip booking and travel day ratio serving as the mediating variable.

Results from both models consistently demonstrate that financial benefit and habit strength are the two primary drivers of MaaS retention under reduced PT fares. Specifically, a lower PT-only bundle cost gap, indicating that users still derive net financial benefit from the MaaS subscription even after the 50-cent fare, significantly increases the probability of remaining in the trial. In parallel, a higher maximum number of consecutive travel days, reflecting habitual engagement with the MaaS platform, is positively and significantly associated with retention. These findings suggest that users who developed strong travel routines prior to the fare reduction are more resilient to the price shock and more likely to sustain their MaaS subscriptions. For non-movers, the integrated path-choice model further reveals that built environment and socioeconomic characteristics exert indirect effects on MaaS retention through their influence on discretionary travel intensity. Specifically, residents in areas with higher land use mix and higher socioeconomic advantage (as measured by IRSAD) are less likely to engage in MaaS-mediated discretionary travel, as most destinations remain walkable without PT. Additionally, longer PT travel time from home to the CBD reduces discretionary MaaS usage, which in turn lowers retention probability. These indirect effects highlight the role of the built environment in shaping the perceived value of MaaS beyond mere commuting.

This study provides novel empirical evidence on the interaction between top-down PT subsidy policies and bottom-up commercial MaaS schemes. The findings suggest that MaaS operators and policymakers should prioritise users who exhibit habitual usage patterns and derive clear financial benefits from subscriptions, even under a heavily subsidised PT context. Furthermore, targeting users in areas with lower land use mix and longer PT travel times may help sustain MaaS ridership. These insights offer important implications for the long-term operational sustainability of MaaS in cities implementing broad PT fare reduction policies.

09:06
D.C.'s Parking Cash-Out Policy: Employer Compliance Patterns and Transit Usage Implications

ABSTRACT. The author's preference for presentation is: in-person

Two years after Washington D.C.'s parking cash-out ordinance took effect in January 2023, this study provides the first comprehensive evaluation of implementation patterns and explores potential transit ridership implications. The policy requires organizations with 20+ D.C. employees that offer parking to take one of four compliance paths: implement a cash-out program, which allows employees to trade parking for transit subsidies or cash; develop a Transportation Demand Management (TDM) plan targeting a 10% annual trip reduction; pay a compliance fee of $100 per employee per month; or claim an exemption. Analysis of 948 employers subject to the law reveals significant implementation challenges. Despite parking cash-out's theoretical promise as a cost-effective TDM strategy, D.C.'s experience suggests substantial implementation barriers. We find 76% of employers claimed exemptions, limiting policy reach by excluding major institutional employers. Among the 230 non-exempt employers, only 57% adopted cash-out programs. More critically, we find a substantial implementation-participation gap: among 123 companies offering cash-out, average employee uptake of the programs stands at only 31.9%, with over 40% of employers reporting participation rates below 20%. Compliance strategy also varies systematically by organizational characteristics: small employers show higher per-capita cash-out adoption, while large organizations predominantly choose to implement TDM plans. Compliance fee payment emerges as economically rational for employers whose parking market values significantly exceed the $100 monthly penalty. We plan to complement this employer analysis with an exploratory examination of the Washington Metropolitan Area Transit Authority’s (WMATA) automated fare collection data, focusing on aggregate transit usage patterns at stations serving cash-out employer clusters and changes in the agency’s transit benefits program (SmartBenefits) enrollment.
Given low participation rates and multiple potential uses of cash-out funds (transit, vanpool, bike, or pocketing cash), we focus on identifying organizational and spatial contexts where transit mode shift is most likely, including employer size, industry sector, transit accessibility, and interaction with existing SmartBenefits programs. This research contributes to understanding post-pandemic workplace commuting behavior and TDM policy effectiveness by revealing critical gaps between policy design, employer implementation, and employee participation. Our findings will help identify conditions where parking cash-outs are more likely to succeed by better aligning financial incentives, reducing implementation burdens, strengthening enforcement of exemption criteria, and adopting complementary strategies to promote employee awareness and transit accessibility.

09:12
Estimation of Excess and Foregone Revenue to Support Planning for New Fare Products and Structure

ABSTRACT. Automated fare collection (AFC) systems allow transit agencies to introduce innovative fare structures, such as fare capping or graduated discounts. Fare capping can improve customer experience by guaranteeing customers the best price and eliminating guesswork in product choice, but it also has significant revenue impacts for agencies. Many of TransLink’s customers do not make enough trips each month to make their monthly pass financially worthwhile, so fare capping would reduce revenue. This work uses data from TransLink’s Compass AFC system to estimate the financial impacts for customers and the agency. The methods presented here estimate the “excess revenue” and “foregone revenue” collected from travel product sales. Excess revenue is defined as the difference between actual revenue and the hypothetical revenue under the assumption of optimal customer choices (i.e. the choices that would minimize customer expenditure). Foregone revenue is defined as the value of the travel that monthly pass holders take over and above the financial breakeven point for their passes, that is, the consumer surplus achieved from the purchase of the pass relative to pay-as-you-go prices. These metrics, disaggregated by product type, customer segment, or geography, are being used to help inform TransLink’s fare policy review. The metrics derived from this work will help estimate a) the potential revenue impacts for the agency of structural changes to fare policy, including price capping or travel-frequency-based discounts, and b) the distribution of over- and under-payment by transit customers to help estimate the agency and customer impacts of new products. This work advances important considerations for agencies that are considering structural changes to their fare policy, both in terms of methodology and substantive results from TransLink’s investigations. The presentation will cover source data, methodology, and visualization of results.
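
As a rough illustration of the two metrics defined above, the sketch below computes them for one customer-month. It assumes a single flat pay-as-you-go fare and a single monthly pass; the fare of 3.00 and pass price of 100.00 are hypothetical values chosen for easy arithmetic, not TransLink prices, and the function name is invented for this example.

```python
def excess_and_foregone(trips, has_pass, fare=3.00, pass_price=100.00):
    """Return (excess_revenue, foregone_revenue) for one customer-month.

    Assumes one flat fare and one pass product (illustrative simplification).
    """
    payg_cost = trips * fare                       # cost if every trip were paid individually
    actual = pass_price if has_pass else payg_cost # what the customer actually paid
    optimal = min(payg_cost, pass_price)           # cheapest choice in hindsight

    # Excess revenue: what the agency collected beyond the customer-optimal spend.
    excess = actual - optimal

    # Foregone revenue: value of pass-holder travel above the breakeven point.
    foregone = max(0.0, payg_cost - pass_price) if has_pass else 0.0
    return excess, foregone


print(excess_and_foregone(20, has_pass=True))   # pass holder below breakeven
print(excess_and_foregone(50, has_pass=True))   # pass holder above breakeven
print(excess_and_foregone(40, has_pass=False))  # frequent rider without a pass
```

Under these assumptions, a fare cap set at the pass price would remove the excess revenue in the first and third cases, which is the revenue effect the abstract sets out to quantify.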

10:10-10:40 Coffee Break
10:40-11:40 Session 23: Plenary: Transit Integrated Data Exchange Specification (TIDES) - Status and Implementation

One of the biggest challenges for the transit industry is the variety of data formats being used, most of which are proprietary in nature; this makes managing the data, ensuring quality assurance, and sharing applications extremely difficult. TIDES (the Transit Integrated Data Exchange Specification) is an open-source data specification for transit operations data including vehicle locations (CAD/AVL), passenger counts (APC), and fare transactions (AFC). This session will introduce the specification and outline how it can benefit transit practitioners and academics alike.

· John Levin (retired), Director of Strategic Initiatives, Metro Transit, Minneapolis-St Paul, and TIDES Board Coordinator (and guiding force)

· Christopher Yamas, TIDES Program Manager, Jarvus Innovations

Location: Center Room
12:00-13:15 Lunch Break
13:30-14:30 Session 26A: Analyzing Origin-Destination Patterns (ODX)
Location: Left Room
13:30
Estimating Network-Level Transit Origin-Destination Matrices from Fragmented Automatic Data Sources

ABSTRACT. Presentation preference: In-person

Introduction

Public transportation agencies depend on a comprehensive understanding of passenger movements within their networks. The estimation of origin–destination (OD) matrices provides a fundamental means to achieve this understanding and has been substantially enhanced by advancements in automatic data collection systems, including Automatic Passenger Counters (APC) for recording boardings and alightings, Automated Fare Collection (AFC) for fare validation, and Automatic Vehicle Location (AVL) for fleet tracking. OD estimation based on APC data is highly underdetermined, as multiple OD matrices can satisfy the same marginal totals. Bayesian inference has been widely applied to address this challenge, although most studies have focused on route-level estimation. Constructing network-level OD matrices, which are essential for identifying transfer hubs, supporting multimodal scheduling, and optimizing fleet operations, requires additional information on passenger transfer behavior. AFC data are particularly valuable in this regard, as they record each passenger’s journey across the network, including transfers. Such data typically require trip chaining to infer alightings, since passengers only need to tap in at boardings. Although smart cards remain the primary source of AFC data, alternatives such as mobile ticketing applications and credit-card-based payments are increasingly used. In these environments, the representativeness of OD matrices can be a concern, as adoption is often limited. For example, only 30% of Calgary Transit passengers use the MyFare mobile ticketing application to pay fares.

Existing research often focuses on OD estimation using a single, unified source of mobility data. These approaches tend to fall short in producing reliable OD matrices in fragmented data environments, where multiple mobility sources cover only a portion of the passengers. Analytical approaches, such as trip chaining, generate a seed OD matrix from one source and scale it using APC data, but their accuracy depends heavily on the quality of the seed matrix. Moreover, probabilistic Bayesian methods are generally limited to route-level estimation and lack formulations for integrating multiple data sources.

This study presents a hierarchical Bayesian framework for estimating network-level OD matrices, which is robust to data incompleteness and capable of incorporating multiple data sources. The framework is validated on a simulated Sioux Falls network before being applied to Calgary Transit. While the project focuses on the MyFare AFC system in Calgary Transit, a new fare collection platform with low early adoption, the methodology is also applicable to other transit systems facing fragmented mobility data due to the introduction of open credit-card payment methods, such as Tap and Ride in Bay Area Rapid Transit (BART), California. By integrating multiple data sources with APC records, the framework enables accurate estimation of network-level OD flows from incomplete mobility data.

Methodology

The methodology employs a hierarchical Bayesian framework to estimate route-level OD flows. The estimation problem is inherently underdetermined: while APC data provide the total number of passengers boarding and alighting at each stop, multiple configurations of OD flows can satisfy these marginal totals. To constrain the solution space, the model assumes alighting probabilities at downstream stops remain stable during the study period. Additional data, such as AFC records, also improve identifiability by providing partial OD observations. Route-level flows are modeled multinomially, connected through a Dirichlet prior that governs passenger allocation across destinations. A non-informative hyperprior allows the model to learn uncertainty from data. The full posterior combines likelihood, prior, and hyperprior components, assuming conditional independence between data sources.

A transfer flow refers to passengers alighting from one route and boarding another within a reasonable time and distance. Transfer blocks, which are groups of stops in close proximity where transfers are likely to occur, are first identified. Valid transfer flows are then determined based on spatial, temporal, topological, and logical criteria. APC counts from primary and secondary routes of each transfer are then used to estimate flow. This underdetermined problem is solved using a hierarchical Bayesian model. Transfer probabilities are assumed stable and modeled as multinomial variables drawn from Dirichlet priors. Transfer flows from AFC data strengthen inference, producing posterior distributions jointly estimating flows and probabilities.

After sampling route-level and transfer flows from their posterior distributions, the network-level OD matrix is constructed using a rule-based algorithm. Transfer flows at specific stops are allocated across linked itineraries based on relative frequencies in AFC data. When no linked itineraries exist, flows are uniformly distributed across all associated itineraries. The final OD matrix aggregates all singleton and linked flows, assigning them according to first boarding and final alighting stops to represent passenger movements across the network.
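
The Dirichlet and multinomial pairing described in this abstract can be illustrated with its simplest special case, the conjugate update for a single boarding stop. This is a sketch under strong assumptions and not the authors' hierarchical model: the prior concentration `prior_alpha` is fixed rather than learned through a hyperprior, the APC alighting totals are not enforced, and the function names are invented for this example.

```python
def posterior_alighting_probs(observed_counts, prior_alpha=1.0):
    """Posterior mean alighting probabilities for one boarding stop.

    observed_counts: partial OD observations (e.g. from AFC trip chaining)
    to each downstream stop. With a symmetric Dirichlet(prior_alpha) prior
    and a multinomial likelihood, the posterior is Dirichlet(alpha + counts),
    whose mean is computed here.
    """
    k = len(observed_counts)
    total = sum(observed_counts) + prior_alpha * k
    return [(c + prior_alpha) / total for c in observed_counts]


def expected_flows(apc_boardings, observed_counts, prior_alpha=1.0):
    """Spread the APC boarding total across downstream stops by posterior mean."""
    probs = posterior_alighting_probs(observed_counts, prior_alpha)
    return [apc_boardings * p for p in probs]


# Four AFC-observed trips from this stop, none to the second downstream stop.
afc_counts = [3, 0, 1]
print(posterior_alighting_probs(afc_counts))  # 4/7, 1/7, 2/7
print(expected_flows(70, afc_counts))         # roughly 40, 10, 20
```

Even in this reduced form, a destination with no observed trips keeps a nonzero probability, whereas multiplying a seed matrix by a scaling factor would leave it at zero.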

Results

The methodology has been applied to the simulated Sioux Falls transit network with synthetic AFC and APC data: 24 stops served by five bidirectional bus routes over two hours. Ground-truth OD tables serve as accuracy benchmarks, and the methodology’s performance under varying AFC penetration rates is compared with scaling methods. Results from the hierarchical Bayesian model indicate that route-level OD flows and alighting probabilities are estimated very close to the simulated-truth values across all penetration rates, with low variances. For transfer flows and probabilities, accuracy improves and variance decreases as AFC penetration rates increase, providing additional data. Overall, the method outperforms scaling methods, particularly at AFC penetration rates below 60%. In these cases, seed matrices contain many zeros, which heavily penalize multiplicative factors, whereas probabilistic estimation more effectively captures the underlying flow distributions.

The next stage is to apply the methodology to Calgary Transit, which includes MyFare mobile ticketing and APC records from October 15, 2024. Calgary Transit’s system records purchases (Purchase Table) and onboard validations (Scan Table). Matching these tables identifies multiple tickets purchased by the same user, which reduces the number of single scans and improves alighting inference through trip chaining. Since Calgary Transit lacks external ground-truth OD data, endogenous validation is performed. As most MyFare users are university students, a subregion encompassing the University of Calgary is selected. The OD table derived from scans in this region, scaled to APC counts, serves as ground truth. By reducing scan records to 30 percent, representing the overall MyFare penetration rate, and applying the hierarchical Bayesian model, estimated OD patterns can be compared with this ground truth.

13:36
Where are my passengers going? Reconstructing passenger flows via anonymized mobility data

ABSTRACT. Attendance: In-person

Defining mobility patterns and passenger flows is fundamental for an efficient public transport system. The collection of precise data on demand and the reconstruction of passenger flows result in a higher customer focus regarding services and fare structures. It also leads to enhanced passenger information systems and improved revenue allocation, increasing the overall passenger satisfaction and service levels. Thereby, the generation of origin-destination (OD) matrices is fundamental for modeling and analyzing travel demand and network flows.

While solutions that generate these data as a direct consequence of their use exist, such as Check-In/Check-Out (CICO) and Be-In/Be-Out (BIBO) systems based on Radio Frequency Identification (RFID) or Bluetooth Low Energy (BLE) technologies, not every public transport operator has access to such technologies. Surveys remain a suitable avenue to estimate ridership and passenger flows, but they suffer from limitations such as sample sizes, varying sample representation, and repeatability. In addition to these manual efforts to reconstruct passenger flows, there have been attempts by means of sensor fusion to reconstruct the behavior of passengers, including through the passive monitoring of Wi-Fi and Bluetooth signals to determine passenger flows and ridership. In general, these methods provide information about the ridership of specific vehicles but are largely unable to determine wider mobility patterns. This is due in part to the challenges posed by privacy-preserving measures, such as Media Access Control (MAC) address randomization.

Alternative means to retrieve the missing information to reconstruct passenger flows are therefore crucial to fill these gaps. In the scope of the research project ANYMOS, methodologies to reconstruct passenger flows based on user connection requests, ticket sales data, and auxiliary data such as demographic data have been developed and showcased for the public transport authority of Karlsruhe (KVV) in Germany. Besides the construction of OD matrices, a major focus of the project was the design of a data processing pipeline that could produce the desired results while preserving privacy and anonymizing the original data sources. This step was mandatory since the server logs from the booking app contain all the requests to the backend made by the customers, including connection requests, timestamps, an origin and a destination, as well as GNSS coordinates or a point of interest (POI) such as a bus stop or the customer’s own house. In order to anonymize the dataset and avoid potential reidentification, each request was assigned to the closest public transport stop using a Voronoi-based approach. This subdivision process allowed for a sufficiently fine-grained reconstruction of OD data while not exposing any personalized information of the app users. In a second step, the connection requests were matched to the tickets purchased through the application. Comparing ticket sales with connection queries provides a measure of how many requests convert to sales and trips taken by the public transit users.
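
Assigning a point to its closest stop is equivalent to finding which Voronoi cell it falls into when the stops are the cell sites, so the anonymization step described above can be sketched without constructing the cells explicitly. The example below is an illustration only, not the ANYMOS pipeline: the stop identifiers, coordinates, and request structure are hypothetical, and a brute-force nearest-stop search stands in for whatever spatial index a production system would use.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def snap_to_stop(lat, lon, stops):
    """Return the id of the nearest stop, i.e. the Voronoi cell containing the point."""
    return min(stops, key=lambda s: haversine_m(lat, lon, *stops[s]))


def anonymize_request(request, stops):
    """Replace raw origin/destination coordinates with stop identifiers."""
    return {
        "origin_stop": snap_to_stop(*request["origin"], stops),
        "destination_stop": snap_to_stop(*request["destination"], stops),
    }


# Hypothetical stops and a hypothetical connection request.
stops = {"stop_a": (49.0094, 8.4044), "stop_b": (48.9937, 8.4010)}
request = {"origin": (49.0090, 8.4050), "destination": (48.9940, 8.4000)}
print(anonymize_request(request, stops))
```

Only the stop pair leaves the function, so the raw coordinates (which might identify a home address) are not carried into later processing steps.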

To ensure the statistical representativeness of the data derived from the user requests and the ticket purchases, demographic data from the German National Census were compared with our sample data in order to make appropriate statements about passenger flows across the whole catchment area of the public transport authority.

In order to process and visualize the data, a backend system based on Apache Airflow was additionally developed. The system enabled automatic anonymization, processing and visualization of the user request data. Several dynamic visualizations were also designed in ANYMOS and implemented within an interactive dashboard to visualize the resulting passenger flows. The public transport authority could thus aggregate and visualize passenger flows at various spatial and temporal aggregation levels. Further analytical visualizations include OD matrices and fusion with operational data, allowing more insights into passenger flow data.

The presentation will focus on a methodology to reconstruct passenger flows based on connection and routing requests made by passengers, combined with ticketing and socio-demographic data. In addition, the presentation will showcase the dynamic visualizations of passenger flows and OD matrices and highlight new insights into passenger flow data when combined with additional auxiliary data.

13:42
Validating the Use of APC Data to Monitor Changes in City-wide Transit Origin-destination Flows Using Socioeconomic Variables Data

ABSTRACT. Preference for presentation: In-person

1. Introduction

Automatic Passenger Count (APC) technologies are widely used by transit agencies, primarily for ridership reporting. However, estimating passenger origin-destination (OD) flows from APC data is a cost-effective way for agencies to acquire essential information used for service planning, design, and operations. Given the widespread ongoing collection of APC data across agencies, APC-derived city-wide OD flows could be monitored over time to understand trends and changes in passenger flow patterns. While the use of Automatic Fare Card data would be beneficial when available, such data are not as prevalent and would be best combined with APC data for improved OD flow estimation.

In a previous study, a method was developed for city-wide OD flow monitoring using APC data. The method was illustrated by investigating the impacts of the COVID-19 pandemic on transit travel patterns in Columbus, Ohio. This previous study is unique in investigating city-wide OD transit bus passenger travel using APC data and an Iterative Proportional Fitting (IPF) based methodology accessible for ready implementation at transit agencies. In the study reported here, socioeconomic variables are used to validate changes in the OD flow patterns associated with the COVID-19 pandemic that are identified from using the developed methodology with available APC data.

2. Data

The Central Ohio Transit Authority (COTA) serves the greater Columbus region. COTA APC data collected on 936,141 APC monitored bus trips serving 76 route-directions and more than 3,000 stops from May 2018 through February 2020 (before covid-19 pandemic restrictions) and from September 2020 through July 2021 (during pandemic restrictions) are used in this study to estimate route-direction-period OD flow matrices. APC data collected from March 2020 through August 2020 are omitted from the analysis. The estimation of COTA passenger zonal OD flows during that period would lead to unreliable results because information on transit service provision during this period was highly dynamic and not accurately documented by COTA considering the challenging pandemic lockdown circumstances.

In addition, COTA schedule information – which includes scheduled arrival/departure times for time-points and route patterns – and General Transit Feed Specification (GTFS) – which includes scheduled arrival/departure times for all stops but does not include route pattern information – are used to scale up the route-direction-period OD matrices estimated from APC data available on approximately 20% of the fleet to capture the full bus service operation.

Moreover, 2017-2021 Columbus socioeconomic US census American Community Survey Data are used to validate changes in the estimated passenger zonal OD flows between the pre-covid-19 pandemic and the during-the-pandemic periods. In addition to median zonal household income, the following variables are considered (in the form of percentage within each zone): no vehicle ownership, essential workers related to healthcare and custodial services, and college enrolment.

3. Methodology and validation results

Bus trip stop-to-stop passenger flow estimates from APC data using the IPF method are aggregated across the 7 to 9 am time-of-day period, scaled up to account for bus trips not served by APC equipped buses, and then aggregated across stops within each of 42 contiguous zones for each month of the pre-pandemic and during-the-pandemic periods.

This validation study considered The Ohio State University (OSU) campus and the Columbus Central Business District (CBD) zones. The OSU campus is a large education and employment zone that also includes a major medical center. The Columbus CBD is a large employment zone and serves as a major hub for transfers between COTA routes. From the APC-based estimated monthly volume passenger OD flows, the probabilities of travel from each zone conditional on the OSU campus as the destination zone are calculated for the pre-pandemic and during-the-pandemic periods, and similarly when considering the CBD as the destination zone.

The conditional probabilities greater than or equal to 0.05 are used to identify origin zones for further investigation for each of the destination zones considered (OSU campus and Columbus CBD). The volume OD flows per month for the selected origin zones are compared across the pre-pandemic and during-the-pandemic periods, and the origin zones that exhibited a percentage decrease in flows to each of the two destination zones during the pandemic that is larger than the decrease in flow from all 42 origin zones are distinguished from the origin zones that exhibited a smaller decrease. For each of the two destination zones, the identified origin zones are then compared considering the socioeconomic variables in the origin zones.
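The selection and comparison steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy matrices, and the choice to apply the 0.05 threshold to either period are all assumptions made here.

```python
import numpy as np

def origin_shares(od, dest):
    """P(origin | destination): share of the flow into `dest` coming from each origin."""
    col = od[:, dest].astype(float)
    return col / col.sum()

def flag_origins(od_pre, od_dur, dest, threshold=0.05):
    """Select origins by conditional probability, then compare each origin's
    percentage decrease with the decrease over all origin zones."""
    shares_pre = origin_shares(od_pre, dest)
    shares_dur = origin_shares(od_dur, dest)
    # Assumption: an origin qualifies if its share reaches the threshold in either period.
    selected = np.flatnonzero((shares_pre >= threshold) | (shares_dur >= threshold))
    overall = 1 - od_dur[:, dest].sum() / od_pre[:, dest].sum()
    flags = {}
    for o in selected:
        decrease = 1 - od_dur[o, dest] / od_pre[o, dest]
        flags[int(o)] = "larger" if decrease > overall else "smaller"
    return overall, flags

# Toy monthly flows into destination zone 0 (hypothetical numbers).
od_pre = np.zeros((4, 4))
od_pre[:, 0] = [0, 100, 60, 40]
od_dur = np.zeros((4, 4))
od_dur[:, 0] = [0, 20, 45, 15]

overall, flags = flag_origins(od_pre, od_dur, dest=0)
print(overall, flags)
```

In this toy case the flow from all origins drops by 60%, so origins 1 and 3 fall in the "larger decrease" group and origin 2 in the "smaller decrease" group. These groups are what would then be compared on the socioeconomic variables.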

Several differences are explained by such variables based on a priori expectations. While a priori expectations were not established for other relatively large or small changes in OD flows to the two destination zones considered, once quantified these differences could be explained by some of the socioeconomic variables associated with the origin zones.

Considering the OSU campus zone as a destination, origin zones with large decreases in flows with respect to the decrease in flow from all origin zones had large college enrolment percentages, presumably resulting from most course instruction and work being conducted remotely in the during-the-pandemic period. In contrast, zones with smaller flow decreases had low median incomes, large essential healthcare and custodial worker percentages, and large no auto ownership percentages, presumably reflecting the more transit dependent workers at OSU’s large medical center or campus custodial services.

Considering the CBD zone as a destination, the origin zone with the largest decrease in transit flows had a high median income and a small no auto ownership percentage, presumably because travelers from this zone are less transit dependent and more likely to have the option to work remotely. In contrast, as when considering the OSU campus zone as a destination, zones with smaller decreases in flows had low median incomes, large essential healthcare and custodial worker percentages, and large no auto ownership percentages.

4. Conclusion

The confirmation of the a priori expectations described above, together with the new findings, validates the reliability and effectiveness of using APC data to monitor city-wide transit passenger OD flows.

13:48
Estimating Subway Origin–Destination Matrices in Entry-Only Fare Systems Using Passive Wi-Fi Traces and Station Gate Counts

ABSTRACT. Accurate and timely Origin–Destination (OD) matrices are fundamental to understanding passenger demand patterns and ridership variability across transit networks. They serve as a cornerstone for strategic planning, service scheduling, and operational management of transit systems. Traditionally, OD matrices are constructed using data from large-scale household travel surveys. These surveys are among the most resource-intensive and methodologically complex data collection efforts, but they typically yield comprehensive, high-quality socio-demographic and trip-level information. However, the resulting OD matrices are often sparse, represent a single point in time, and do not reflect changes over time. Additionally, the infrequent execution of those surveys and the relatively long time it takes to validate and release the data reduce the data's timeliness.

The proliferation of automated data collection technologies has prompted a shift toward leveraging passively collected digital data sources. Among these, Automated Fare Collection (AFC) systems offer unprecedented opportunities to analyze travel behavior and spatial interaction patterns at the individual level. This is especially true for “closed” AFC systems, where passengers both tap-in and tap-out with smart cards. However, in “open” AFC systems, such as that used by the Toronto Transit Commission (TTC), the absence of a mandatory tap-out event hinders the direct estimation of alighting stations, complicating efforts to reconstruct full passenger trajectories.

Recent advancements in network connectivity have introduced Wi-Fi probe data as a complementary or alternative source for OD estimation. Many transit agencies provide free Wi-Fi service throughout their networks, and the resulting device-to-access-point connections can be used to trace passenger movement with high temporal granularity. These observations allow for the creation of detailed and accurate OD estimates that are updated frequently. As a result, methods that generate OD matrices from Wi-Fi data can significantly decrease the time needed for demand estimation, producing fine-grained, dynamic, and continuously updated OD matrices at a relatively low cost.

Building upon the growing body of research that integrates multiple data sources for OD estimation and behavioral analysis, this study advances the understanding of how Wi-Fi and AFC data can be operationalized in flat fare Urban Rail Transit (URT) systems that require tap-in at entry. Specifically, the study pursues three objectives:

- Develop a replicable pipeline for generating OD matrices for entry-only urban rail systems using Wi-Fi and AFC data;
- Demonstrate the framework through a comprehensive case study of Toronto’s TTC subway system, comparing model-generated OD matrices with those derived from the Transportation Tomorrow Survey (TTS); and
- Evaluate the potential of extending station-level OD matrices to generate household-level OD matrices that capture passenger flow trends consistent with station-level patterns not captured in TTS data.

The Doubly Constrained Iterative Proportional Fitting (IPF) Method employed in this paper uses a reference seed matrix containing a subset of all trips, which is expanded and balanced to match the total flow counts of the entire population. This requires a seed matrix containing detailed OD flows, recorded from a representative subset of the system users, and disaggregated origin/destination counts, recorded from all trips on the system. By scaling the sample-level seed matrix to the population-level counts, the doubly constrained IPF ensures that the sum of each row/column in the expanded matrix matches the corresponding station entry/exit flow.

For station-level estimation, the seed matrix is the cumulative trips between each station OD pair, captured by the Wi-Fi traces, and is constrained against AFC gate counts for each station. For zonal OD estimation, the seed matrix is extended by distributing the station-level OD matrix over the TAZs using station-to-TAZ counts recorded in the TTS, and is constrained against the demand associated with each TAZ, also recorded in the TTS.
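To make the balancing step concrete, the following is a minimal sketch of doubly constrained IPF at the station level. The seed values, entry counts, and exit counts are hypothetical stand-ins for Wi-Fi trip counts and AFC gate counts. The sketch assumes every row and column of the seed has at least one non-zero cell and that the entry and exit totals are equal.

```python
import numpy as np

def ipf(seed, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Doubly constrained IPF: alternately scale rows and columns of the seed
    until row sums match row_totals and column sums match col_totals."""
    m = np.asarray(seed, dtype=float).copy()
    row_totals = np.asarray(row_totals, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    for _ in range(max_iter):
        m *= (row_totals / m.sum(axis=1))[:, None]   # match station entries
        m *= (col_totals / m.sum(axis=0))[None, :]   # match station exits
        if np.abs(m.sum(axis=1) - row_totals).max() < tol:
            break
    return m

# Hypothetical 3-station example: sampled trips between station pairs (seed)
# and population-level entry/exit counts (both sum to 2700).
seed = np.array([[0, 5, 3],
                 [4, 0, 6],
                 [2, 7, 0]])
entries = np.array([800, 1000, 900])
exits = np.array([600, 1200, 900])

od = ipf(seed, entries, exits)
print(od.round(1))
```

Cells that are zero in the seed stay zero after balancing. A sparse sample can therefore leave genuinely travelled OD pairs empty in the expanded matrix.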

For station-level OD estimation, subway stations are grouped into 10 regions to reflect average morning peak flows in downtown, midtown, and peripheral areas. We generated an estimated OD matrix and compared it with the 2022 TTS subway OD matrix to evaluate system-wide, station-level, and temporal stability. The results show that the estimated matrix is a reasonable proxy for the TTS, exhibiting strong geometric similarity based on cosine similarity for both origins and destinations, with most values close to one.
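The comparison by cosine similarity can be illustrated with a short sketch. The two matrices below are made-up numbers, not TTS or model results. Each origin is compared through its row of flows; destinations can be compared the same way by transposing both matrices.

```python
import numpy as np

def cosine_similarity_rows(a, b):
    """Cosine similarity between matching rows of two matrices."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / norms

# Hypothetical estimated and reference OD matrices (rows = origins).
estimated = np.array([[0, 30, 10],
                      [20, 0, 40],
                      [15, 25, 0]])
reference = np.array([[0, 60, 20],
                      [40, 0, 80],
                      [0, 50, 0]])

origin_sim = cosine_similarity_rows(estimated, reference)
destination_sim = cosine_similarity_rows(estimated.T, reference.T)
print(origin_sim.round(3), destination_sim.round(3))
```

Cosine similarity ignores overall scale: the first two origins score one even though the reference flows are twice as large. It therefore measures agreement in the distribution pattern rather than in total volume.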

Future work will include expanding the analysis to other modes of public transport with Wi-Fi connection service available in Toronto to generate a multi-modal OD-matrix.

13:30-14:30 Session 26B: Understanding/Enhancing Bus Performance
Location: Center Room
13:30
Red light, green light: bus speed profile visualization along congested roads using GTFS-RT data

ABSTRACT. In the last several years there have been significant transformations in daily travel behaviour in the Montreal Metropolitan Area. The widespread adoption of remote work has changed the balance of daily travel demand, which impacts traffic volumes at intersections depending on the day of the week. Additionally, the expansion of biking infrastructure has created a modal shift towards biking, which increases the volumes of bikes at intersections. The on-time performance of the bus network in the Longueuil Agglomeration has taken a hit in this same period, dropping 5% since 2019. Travel time variability has increased on some critical arteries of the road network.

To address this issue, the Réseau de Transport de Longueuil assembled a project team to evaluate potential solutions. An important and problematic artery of the road network was chosen for the pilot project: Chemin de Chambly. One solution considered was to change the way the service is provided along the road, converting local service to an express service redirected on the highway. However, since this approach is a workaround and doesn’t solve the core issue, another solution was also considered. The alternative solution, which will be covered here, consists in analyzing bus speeds along the problematic artery, and coupling them with street elements like traffic lights, reserved lanes, and bus stops. The objective of this analysis is to identify the root causes of travel time variability and then take targeted measures to resolve the specific issues identified.

To this end, several data sources were combined: the road network and reserved lanes, the traffic light locations, the bus stop locations, and the GTFS-RT vehicle position points of the bus network. The GTFS-RT data is publicly available through an API that is managed by our partner. The data can be pulled as frequently as every 5 seconds. For this test, the data was pulled at 60 second intervals. A sample of this dataset was extracted for the afternoon peak on weekdays for the month of September 2025. The data was then filtered by route_id to keep only the routes that operated on Chemin de Chambly. When mapping this subset of data, the first observation was that many data points were not mapped along the planned routes that were used as the filtering criteria. The main reasons for this discrepancy were:
• Wrong route_id assigned to data points within the GTFS-RT data
• Road work / closures that forced vehicles onto another path

After mapping the raw data, the next step was to match every data point to a road segment in order to visualise the speeds along the different sections of the road. First, the planned routes and their shape_id were matched to the road network. This ensured that each data point was associated only to a road segment that shared a corresponding shape_id. Next, the data points were matched using a buffer around each road segment. Finally, for each data point, the percentage of its position from the start of the segment was calculated in order to align each point on an X/Y graph. This step was required since the points are gathered every 60 seconds and do not necessarily align across different trips. Once the data transformation was complete, the median speed was calculated for each segment and illustrated through a color gradient. The data was split into segments, and the data points were mapped along the reserved lanes, the traffic lights, the bus stops, and the daily average activity per bus stop. An X/Y graph was created, plotting the speed against the progression along the road segment, in order to identify the speed profiles for each segment and compare them to the planned speeds.

With this data in hand, it is now possible to answer various questions, including:
• How many times during a trip are the buses stopping at a traffic light?
• Is there more congestion on Tuesdays, Wednesdays, and Thursdays compared to Mondays and Fridays?
• Are there specific times of day or trip directions with more congestion?

It is clear that the GTFS-RT data has a lot of value for the diagnostic of variability in travel times. This data allows us to visualize speeds across road segments and match these speeds with road elements. With this information in hand, it becomes easier to identify corrective measures that could be put in place to help with the regularity in trip times. This is the first step of a network diagnostic that will help us design our trip times, and which will be shared with the city’s road traffic department to find solutions to the growing on-time performance issues we are facing. The next steps would be to:
• Design a production environment to gather and store GTFS-RT data
• Design the data transformation pipeline
• Create an interactive map that would allow us to filter and visualize data at a higher level and extract subsets of data for more detailed analysis.
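The segment matching and speed profile steps in this abstract could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the project's pipeline: coordinates are planar metres, each road segment is a straight line, and the 15 m buffer, the four bins, and all function names are hypothetical.

```python
import numpy as np

def project_fraction(point, seg_start, seg_end):
    """Return (fraction along the segment from its start, distance to the segment)."""
    p, a, b = (np.asarray(v, dtype=float) for v in (point, seg_start, seg_end))
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    dist = np.linalg.norm(p - (a + t * ab))
    return float(t), float(dist)

def speed_profile(points, speeds, seg_start, seg_end, buffer_m=15.0, n_bins=4):
    """Median speed per bin of progression along the segment.
    Points farther than buffer_m from the segment are discarded."""
    bins = [[] for _ in range(n_bins)]
    for pt, v in zip(points, speeds):
        t, dist = project_fraction(pt, seg_start, seg_end)
        if dist <= buffer_m:
            bins[min(int(t * n_bins), n_bins - 1)].append(v)
    return [float(np.median(b)) if b else None for b in bins]

# Hypothetical vehicle positions (metres) and speeds (km/h) on a 400 m segment.
points = [(50, 2), (60, -3), (150, 5), (350, 1), (380, 0), (200, 80)]
speeds = [30, 34, 20, 5, 9, 40]

profile = speed_profile(points, speeds, seg_start=(0, 0), seg_end=(400, 0))
print(profile)  # [32.0, 20.0, None, 7.0]
```

The last point lies 80 m from the segment and is dropped by the buffer, which mirrors how off-route points (wrong route_id, detours) would be excluded. The low median in the final bin is the kind of pattern one would expect just upstream of a traffic light.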

13:36
Envisioning Transit: Simple Questions, Open Data, and What We Can Learn From Them

ABSTRACT. Modern transit systems generate an enormous amount of information about how people move through a city, but surprisingly little of that information is easy to access, understand, or use to tell meaningful stories. Riders, advocates, policymakers, and even planners often find themselves navigating a patchwork of datasets that illuminate certain aspects of the network while obscuring others. Meanwhile, despite advances in data collection, analytical methods, and visualization techniques across both industry and academia, compelling stories that translate this complexity into accessible, evidence-based insights remain notably rare. The disconnect between the data that exists, the analyses we are capable of performing, and the narratives that could help riders and decision-makers make sense of it underscores why the Transit Data symposium remains so important.

Envisioning Transit is an attempt to respond directly to the missing link between what is publicly available and what we can learn from it. The project is an open, exploratory web-based data storytelling platform that asks simple, intuitive questions about how cities move. It uses openly available data, transparent analytical methods, and reproducible visualizations to investigate the answers. Envisioning Transit is motivated by the types of questions that riders, planners, and advocates care about the most: Has transit been getting faster or slower? Do dense neighbourhoods actually receive more frequent service? While these questions are simple in principle and can be theoretically addressed using open data such as GTFS and census information, the process of answering them reveals the challenges of working with operational transit data and, in many cases, the need for more consistent and transparent data publication. Envisioning Transit aims to show what can be learned from public data today, where key limitations constrain deeper understanding, and how a more collaborative data ecosystem could strengthen both analysis and public conversation.

For this project I will produce two data stories, centred around major Canadian cities. The first asks: How fast does a city’s bus network really move, and how has this changed over time? Using historical GTFS archives for multiple Canadian cities, the data pipeline reconstructs route geometries, computes trip-level and route-level scheduled speeds, and examines how these values shift across GTFS dates. To make the results relatable, the analysis begins with a single route that has maintained consistent geometry across multiple years. By tracing its scheduled run times and average speeds, the platform highlights how schedule padding, minor routing adjustments, or shifts in stop spacing can influence overall performance. This micro-level analysis expands to an agency-wide perspective by aggregating average speed across all routes for each feed date, revealing longer-term trends and providing context for system-level conversations about investment, congestion, or service redesigns. The analysis then shifts to the rider: How have travel times between major destinations changed? Has it gotten faster or slower? In this portion, we show how underlying scheduling decisions can translate to overall rider experience, by comparing how trips from 10 years ago might match up to one in the present day. This connects how small, gradual operational changes can evolve to impact the overall usefulness of transit in a city.
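As a rough illustration of the trip-level scheduled speed calculation, the sketch below works on rows of a GTFS stop_times.txt file for a single trip. It assumes the optional shape_dist_traveled field is populated and expressed in kilometres; GTFS leaves its units to the feed publisher. The function names and sample rows are hypothetical and are not taken from the platform's code.

```python
def gtfs_seconds(hms):
    """Convert a GTFS HH:MM:SS time to seconds; hours may exceed 24 for
    trips that run past midnight of the service day."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def scheduled_speed_kmh(stop_times):
    """Scheduled speed of one trip: distance between first and last stop
    divided by the scheduled run time."""
    rows = sorted(stop_times, key=lambda r: int(r["stop_sequence"]))
    first, last = rows[0], rows[-1]
    runtime_h = (gtfs_seconds(last["arrival_time"])
                 - gtfs_seconds(first["departure_time"])) / 3600
    dist_km = float(last["shape_dist_traveled"]) - float(first["shape_dist_traveled"])
    return dist_km / runtime_h

# Hypothetical stop_times rows for one late-evening trip.
trip = [
    {"stop_sequence": "1", "arrival_time": "23:50:00", "departure_time": "23:50:00",
     "shape_dist_traveled": "0.0"},
    {"stop_sequence": "2", "arrival_time": "24:05:00", "departure_time": "24:06:00",
     "shape_dist_traveled": "4.2"},
    {"stop_sequence": "3", "arrival_time": "24:20:00", "departure_time": "24:20:00",
     "shape_dist_traveled": "9.0"},
]

print(scheduled_speed_kmh(trip))  # 18.0
```

Averaging this value over all trips of a route, and then over all routes in a feed, gives the route-level and agency-level figures. Feeds without shape_dist_traveled would need distances reconstructed from shapes.txt first.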

The second story turns to the relationship of land use and transit: Do denser parts of cities see more frequent transit? By pairing GTFS-derived scheduled frequency with publicly available land-use and population density data, the platform evaluates how well service allocation aligns with urban form. The analysis can identify routes in high-density areas that receive less frequent service than expected, corridors that maintain high frequency despite modest density, or emerging growth areas where service has not kept pace with land-use change. Map-based visualizations highlight spatial mismatches that may not be evident in tables or traditional dashboards. These findings illustrate both the strengths and limitations of scheduled GTFS for understanding service allocation and point to the kinds of data (ridership, real-time headway adherence, or parcel-level land use) that would enable deeper analysis. Taken together, these two perspectives provide a unified view of what scheduled service data can reveal about transit performance and how urban context shapes these patterns. Envisioning Transit demonstrates the analytical potential of open data while also highlighting the constraints that prevent a fuller understanding of service quality. The platform is fully open-source and intended for collaboration, enabling agencies, researchers, and community groups to extend its metrics, develop visualizations, integrate new datasets where available, or adapt the narrative structure to their own regions. Ultimately, Envisioning Transit is both an analytical toolkit and a conversation starter about the state of open transit data, the value of transparency, and the role that accessible, reproducible analysis can play in improving mobility systems.

Implementing these stories requires a technical foundation that is open, reproducible, and portable across cities. The platform architecture has two components. First, a Python analysis pipeline processes historical GTFS feeds, reconstructs route geometries, computes metrics such as scheduled speed, trip runtimes, service frequency, and estimated travel times, and stores these indicators in a structured PostGIS database. The pipeline is modular and reproducible, using open-source tools like GTFS-Lite and clear metadata to support multi-year, multi-city comparisons. Second, a web-based visualization environment built with Next.js, Tailwind CSS, Mapbox, and modern React charting libraries presents these metrics through interactive, narrative visualizations. Rather than static dashboards, these data stories help readers explore how transit performance evolves across geographies and time while revealing the limits of open data. The presentation will outline the platform and the results of the case study, and will also highlight the areas where existing transit data in Canada makes answering simple policy questions difficult. Attendees will come away with practical examples, ideas they can adapt in their own work, and a clearer sense of how open transit data can become a foundation for more compelling, transparent communication. The goal is not only to demonstrate what can be built from open data, but to spark new ideas among agencies, researchers, and practitioners about how simple questions can lead to meaningful insights and better public conversations about transit.

13:42
Quantifying Signal-Delay Contributions to Bus Travel Times with Implications for Scheduling

ABSTRACT. Reliable travel time estimates are important for transit agencies and passengers. Agencies depend on accurate estimates for scheduling and resource allocation, while passengers rely on them for trip planning. Current practices often use aggregated analysis levels, typically at timepoint level, masking the detailed contributions to travel times from dwell times, congestion, and signal delays. These simplifications can lead to unreliable schedules and poor passenger experiences.

Recent studies have conducted more granular analyses of individual travel time components, such as dwell time determinants and signal delay impacts. Studies on dwell times showed how they are influenced by ridership variations, fare collection methods, and accessibility needs. Studies on signal delays often focus on signal priority strategies with inconsistent effectiveness results. Some showed positive effects while others did not show significant gains.

Studies have also started to examine the interactions between different components, such as dwell time optimizations to minimize subsequent signal delays, but these methods rarely account for cumulative effects across multiple intersections. Variance decomposition also reveals that interaction effects, especially between departure times and signal timings, significantly influence overall travel time variability. Despite these findings, practical models incorporating such details remain scarce, leaving planners reliant on averages that may misrepresent actual conditions.

Hence, we aim to: (1) integrate detailed components (departure times, inter-stop travel times, dwell times, signal delays, and their interactions) into a travel time estimation model; and (2) demonstrate its application for scenario-based planning, enabling proactive evaluation of congestion, ridership, and signal timing changes.

We propose to analyze three categories of detailed travel time components: inter-stop times, which relate to congestion; dwell times, which relate to passenger activities; and red light waiting times, which relate to traffic signal impacts. Departure times are also used as an input to calculate traffic signal states given the fixed signal timing plans used in Montréal. Departure times and inter-stop times come directly from vehicle logs; dwell times are estimated via a regression model using boarding and alighting data; and signal delays are inferred from estimated cycle parameters using archived bus positions.

To generate the vehicle trajectory, the model would cumulatively sum up the individual travel time components from the given departure. A vehicle would depart at the given departure time from the first stop. Then, it would add the inter-stop travel time to the next location that potentially requires a stop, i.e. a passenger stop or a traffic signal. We check if passengers are waiting or if the light is red given the arrival time, respectively. If passengers are waiting in the given planning scenarios, we add the estimated dwell times. If the traffic signal is red, we add the remaining red time based on estimated signal timings. If there are no passengers or if the light is green, the vehicle would pass right away with no time added. Then, repeatedly, we would add the inter-stop times to the next location to check for passengers or traffic signals.
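The cumulative procedure described above can be sketched as a small simulation. This is a simplified illustration rather than the authors' model: each signal is assumed to run a fixed cycle with one green interval starting at its offset, dwell times are given directly instead of coming from a regression, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Stop:
    dwell_s: float  # estimated dwell time; 0 if no passengers in the scenario

@dataclass
class Signal:
    cycle_s: float         # fixed cycle length
    green_s: float         # green lasts from offset to offset + green_s
    offset_s: float = 0.0  # start of green within the cycle

    def red_wait(self, t):
        """Remaining red time for a bus arriving at time t (0 if green)."""
        phase = (t - self.offset_s) % self.cycle_s
        return 0.0 if phase < self.green_s else self.cycle_s - phase

def simulate_trip(departure_s, legs):
    """legs is a list of (inter_stop_time_s, location) pairs. Returns the time
    at which the bus leaves each location, starting with the departure."""
    t = departure_s
    trajectory = [t]
    for travel_s, loc in legs:
        t += travel_s                      # run time to the next location
        if isinstance(loc, Signal):
            t += loc.red_wait(t)           # wait out the red, if any
        else:
            t += loc.dwell_s               # serve passengers, if any
        trajectory.append(t)
    return trajectory

legs = [
    (40, Signal(cycle_s=60, green_s=30)),
    (50, Stop(dwell_s=15)),
    (30, Signal(cycle_s=60, green_s=30, offset_s=10)),
    (45, Stop(dwell_s=0)),
]

print(simulate_trip(0, legs))   # [0, 60.0, 125.0, 155.0, 200.0]
print(simulate_trip(30, legs))  # [30, 70.0, 135.0, 190.0, 235.0]
```

Departing 30 seconds later changes which signal the bus catches on red, and the trip takes 205 seconds instead of 200. The added time depends on where in the cycle the bus arrives, not only on how much upstream delay it carries.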

Due to the lack of detailed empirical analysis on red times in the literature, we first summarize the signal impacts on Montréal’s bus network. Red times contribute significantly to bus travel times, where they account for 15% of total travel time, equating to 1,624 service hours per weekday and approximately 19,572 passenger-hours daily. Central routes have the highest delays, often exceeding 20% of total travel time, due to dense traffic signals. Suburban areas with fewer traffic signals and highway express routes average below 10%. Given the higher ridership in the central areas, the impacts on passengers are significant. Red time variations also represent 55% of total travel time variability, highlighting their significance in affecting service reliability.

To evaluate our proposed travel time model, which incorporates signal timings, we analyzed Route 27, a 4.5 km straight route linking residential areas to the metro. Using the detailed components (inter-stop times, ridership observations, and estimated signal timings), the model simulated different ridership and congestion scenarios. Estimated trajectories for various ridership and congestion levels (20th, 50th, and 80th percentiles) correspond well to the original observations and show non-linear impacts due to varying signal cycles. Hence, small local changes can cascade into significant delays when buses fall out of sync with green waves.

The model was further tested for one future trip using January 2024 observations, updated departure times, and revised signal timings. Estimated and observed median travel times were compared across eight future sign-up periods. Differences averaged 20 seconds per trip and around 30 seconds to 1 minute at stops, validating the model’s accuracy. Importantly, the observed travel times increased by 1.5 minutes in later periods, and the proposed model was able to capture these increases. If planners rely solely on historical averages, they may not be able to adapt to changing operating conditions, which could affect schedule adherence.

The model results for all westbound Route 27 trips in autumn 2024 also showed improvements compared to simply using historical averages. For 70% of trips, estimates were 1 minute closer to actual times compared to using averages; a few trips improved by 2 minutes, which is important given the 3-minute layover times on this route. However, about 30% of trips did not show improvements, and the results for a few trips worsened due to unmodeled factors like mid-route driver changes and operator behaviour at yellow lights.

Finally, the study explored the possibility of adjusting departure times by a maximum of 2 minutes to better align with signal cycles. Results indicate that 84% of trips could benefit, with potential savings of 1 to 2.5 minutes per trip, which is significant for this 20-minute route. These adjustments could be integrated into interline optimization algorithms.
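One simple way to explore such adjustments is a grid search over candidate shifts, sketched below. The abstract does not say how the study searched for adjustments, so the search strategy is an assumption, as are the 10-second step and the toy travel time function with a single fixed-time signal.

```python
def best_departure_shift(travel_time, scheduled_departure_s,
                         max_shift_s=120, step_s=10):
    """Try shifts within +/- max_shift_s and return the one with the shortest
    travel time, preferring the smallest shift when several tie."""
    candidates = range(-max_shift_s, max_shift_s + 1, step_s)
    return min(candidates,
               key=lambda d: (travel_time(scheduled_departure_s + d), abs(d)))

def toy_travel_time(departure_s):
    """Hypothetical route: 40 s to a signal (90 s cycle, green during the
    first 45 s of each cycle), then 60 s to the terminal."""
    arrival = departure_s + 40
    phase = arrival % 90
    red_wait = 0 if phase < 45 else 90 - phase
    return 40 + red_wait + 60

scheduled = 20
shift = best_departure_shift(toy_travel_time, scheduled)
saving = toy_travel_time(scheduled) - toy_travel_time(scheduled + shift)
print(shift, saving)  # -20 30
```

In this toy case, leaving 20 seconds earlier lets the bus reach the signal during green and saves 30 seconds. In a real setting, any trajectory model that returns travel time for a given departure could be passed in as the travel time function.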

However, some limitations remain. The model does not fully account for mid-route driver changes, operator behaviour variability, or ridership changes due to varying headways. Data constraints, especially on actuated signal timings and detailed ridership patterns, also limit the precision and the model's applicability elsewhere. Future research should continue to incorporate high-resolution vehicle location data, real-time control strategies, and demand forecasting to improve the model.

13:48
Person-Centric Reinforcement Learning for Adaptive Traffic Signal Control: Event-Based Passenger Delay at Stops

ABSTRACT. Preferred presentation mode: in-person (or virtual/undecided)

Most reinforcement learning (RL) adaptive traffic signal control (ATSC) systems for transit-heavy corridors are still optimized and evaluated using vehicle-centric measures such as vehicle delay or at best in-vehicle bus delay. These proxies are convenient for control, but they ignore the “additional” passenger delay that builds up at the downstream bus stop when a bus is delayed. In practice, this downstream passenger delay may justify expediting a late but fairly empty bus at a traffic light, not because of the few passengers already on board but because of the passengers waiting at the next stop. Downstream passenger delay is often ignored in ATSC formulations. This additional passenger delay/waiting time at the stop is distinct from door-to-door travel time, which bundles walking, expected waiting time based on the headway, in-vehicle time, and egress. This work introduces a person-centric RL formulation and a stop-level evaluation suite that explicitly balances traffic delay and in-vehicle transit delay upstream of the signal with passenger delay at the downstream stop, and that redefines how passenger delay is represented in the reward and evaluation of multimodal ATSC controllers. The study uses the Yonge-Steeles Aimsun testbed in North York, Toronto, with five signalized intersections on the Yonge corridor controlled by decentralized RL agents and the remaining intersections running the city’s semi-actuated plan. The traffic demand is calibrated using an origin-destination matrix and turning-movement counts from the City of Toronto, while the transit services represent Toronto Transit Commission (TTC) and York Region Transit (YRT) selected routes and schedules. This yields a demand pattern that closely matches observed volumes along the corridor. On top of this existing infrastructure, we adopt the eMARLIN-T-MM controller developed by Othman et al. 
(2025), which is a decentralized multi-agent RL system where each intersection uses a transformer-based encoder to summarize local traffic history (from cameras and sensors) and neighboring intersection information into an embedding, and a Q-network to select signal phases, with agents trained using a deep Q-learning-based algorithm. Our contribution is a new person-centric reward and evaluation layer. The reward function has three components: (1) person-weighted traffic delay in the detection zones, (2) person-weighted in-vehicle transit delay in the detection zones, and (3) a stop-level event-based term called Passenger Delay at Stop due to Late arrival (PDSL). The first two terms are computed every simulation second exactly as in the work by Othman et al. (2025), so the controller continues to see second-by-second conditions for both cars and buses. The third term is defined and applied only at the instant a bus is expected to reach its next stop late. At a given moment, we know (i) how late the bus is relative to its scheduled headway or arrival and (ii) how many passengers at that stop are affected. Lateness is obtained by comparing the actual headway between consecutive buses crossing a fixed reference point upstream of the stop with the scheduled headway. To determine how many passengers are affected, each stop maintains a simple λ-based passenger accumulator, which is a running counter that increases every second according to an assumed arrival rate (0.333 passengers/second). When a bus serves a stop, if everyone boards, the accumulator is reset; if the bus’s capacity is insufficient, denied-boardings carry-over is added to the next bus’s affected-passenger count. This λ-based process is a deliberate design choice to serve as a placeholder demand model that lets us test person-centric control now and can later be replaced, one-for-one, by stop-level APC/AVL counts as they become available, without changing the reward definition. 
At a late-arrival event, PDSL is computed as

PDSL = (late seconds) × (passengers waiting at arrival + denied-boarding carry-over)

and added to the reward for that step. The passenger accumulator itself is updated every second, so the controller always has an up-to-date estimate of how many riders are currently waiting/delayed at the next stop. Evaluating PDSL when a bus is expected to arrive at a stop late links the penalty directly to the realized additional delay at that stop while keeping the state fully aware of the evolving passenger buildup.

Learning uses the same transformer-based deep Q-learning setup as in the eMARLIN-T-MM controller by Othman et al. (2025), in which they developed decentralized agents with a shared encoder Q-network architecture trained via off-policy temporal-difference updates. In our formulation, the event-based PDSL term simply appears in the reward if a bus is expected to arrive late at its next stop. We call our extended RL-based algorithm embedding-communicated Multi-Agent Reinforcement Learning for Integrated Network with Passenger Delay at Stops (eMARLIN-PDS).

For evaluation, we run two pipelines under identical Toronto demand. The first is a baseline with the existing multimodal reward used in the eMARLIN-T-MM study, and the second uses our proposed person-centric reward with PDSL. Both are assessed with a unified metric suite that reports the following: total person delay, transit-only person delay, car-only person delay, average person delay by mode, denied boardings (by stop and total), bus on-time performance at stops, and per-stop PDSL aggregated over the simulation, showing where and for how many passengers the controller actually reduces late-arrival harm.

Methodologically, the contribution is to layer an event-based, stop-level passenger metric onto a decentralized RL ATSC controller. For agencies and data-driven practitioners, the framework offers a way to move from “expedite the bus no matter what” to “expedite when late-arrival passenger harm is actually expected at the next stop”, while still protecting overall multimodal corridor performance.
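The λ-based accumulator and the event-based PDSL term described in this abstract can be illustrated with a minimal Python sketch. The class and function names (`StopAccumulator`, `lateness`, `spare_capacity`) are hypothetical and are not taken from the authors' implementation. The sketch only computes the magnitude of the PDSL term; how it is signed and scaled inside the reward is not specified here.

```python
from dataclasses import dataclass


@dataclass
class StopAccumulator:
    """Lambda-based passenger accumulator for one stop (illustrative only)."""
    arrival_rate: float       # assumed passenger arrivals per second (lambda)
    waiting: float = 0.0      # passengers accumulated since the last bus
    carry_over: float = 0.0   # denied boardings left behind by the previous bus

    def tick(self, seconds: float = 1.0) -> None:
        """Advance the running counter by the assumed arrival rate."""
        self.waiting += self.arrival_rate * seconds

    def pdsl(self, late_seconds: float) -> float:
        """PDSL = late seconds x (waiting passengers + denied-boarding carry-over)."""
        if late_seconds <= 0:
            return 0.0  # on-time or early buses contribute no PDSL
        return late_seconds * (self.waiting + self.carry_over)

    def serve(self, spare_capacity: float) -> float:
        """Board as many passengers as fit; return the number denied boarding."""
        demand = self.waiting + self.carry_over
        boarded = min(demand, spare_capacity)
        self.carry_over = demand - boarded  # zero when everyone boards
        self.waiting = 0.0                  # counter restarts for the next bus
        return self.carry_over


def lateness(actual_headway: float, scheduled_headway: float) -> float:
    """Seconds late, from the headway comparison at the upstream reference point."""
    return max(0.0, actual_headway - scheduled_headway)


# Example: 60 s of arrivals at 0.333 pax/s, then a bus arriving 60 s late.
stop = StopAccumulator(arrival_rate=0.333)
for _ in range(60):
    stop.tick()
late = lateness(actual_headway=360.0, scheduled_headway=300.0)
penalty = stop.pdsl(late)
denied = stop.serve(spare_capacity=15.0)
```

Keeping the accumulator behind a small interface like this mirrors the stated design intent: the λ-based counter could be swapped for stop-level APC/AVL counts without touching the PDSL definition.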

13:30-14:30 Session 26C: Demand Forecasting
Location: Right Room
13:30
Predicting Bus Station Demand: A Crowdsourced Data Approach

ABSTRACT. (Presentation preference: undecided)

As urban populations grow and traffic congestion increases, public transportation systems offer a sustainable alternative by reducing dependence on private vehicles and lowering carbon emissions. However, understanding station demand is particularly challenging in fare-free systems such as Luxembourg's, where traditional ticketing data is unavailable. This study presents a novel framework for forecasting bus station demand using Google Popular Times (GPT) data through a two-stage deep learning methodology. We develop a predictive Sequence-to-Sequence (Seq2Seq) model that forecasts station occupancy levels over a 24-hour horizon based on the previous 72 hours of data.

Introduction

Bus stations serve as critical points in urban networks. However, accurate demand prediction at the stop level is challenging. Traditional demand forecasting relies heavily on fare transaction data. Alternative approaches using mobile applications and GPS tracking offer improved accuracy, but privacy concerns and cost are their main drawbacks. Smit m [1] proposed a four-stage modeling approach using spatial, demographic, and service-level variables to establish a demand forecasting framework. A multilayer perceptron (MLP) model for bus ridership was proposed by Farahmand et al. [2]. Pelletier et al. [3] demonstrated the utility of smart card data for analyzing boarding and alighting patterns. Other studies, such as Liyanage et al. [4], applied bidirectional long short-term memory (BiLSTM) to smart card data, achieving high accuracy for bus demand prediction. Google Popular Times (GPT) data offers an alternative with broad geographic coverage and hourly temporal resolution. Recent studies, such as TransitCrowd, have demonstrated GPT's effectiveness in predicting passenger volumes at subway stations [5].

Case Study Area, Dataset and Methodology

GPT quantifies activity patterns across points of interest (POIs) using anonymized smartphone data on a 0-100 scale representing relative hourly visits. This study utilized GPT data collected over a three-month period from 92 bus stops located in Esch-sur-Alzette.

Recurrent Neural Network Models

Long Short-Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN), can efficiently capture both long-term and short-term dependencies and can be integrated with Seq2Seq architectures to model temporal patterns effectively [6]. Seq2Seq models are particularly effective for sequence-based tasks, including machine translation and time series forecasting [7]. To forecast the next day's bus station demand, a sliding window approach is employed in which input sequences consist of 72 consecutive hours of historical data (the previous three days). For a given time point t, the input sequence comprises [D_(t-71), …, D_t], with the corresponding target being the subsequent 24 hours of observations [D_(t+1), …, D_(t+24)]. By training on these 72-hour windows of historical GPT data, the model learns to capture temporal dependencies and cyclical patterns within the time series. Model performance was evaluated using RMSE, MAE, and R² metrics on the test set.
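As a rough illustration of the sliding-window construction and the three evaluation metrics, the following standard-library Python sketch builds 72-hour inputs with 24-hour targets from one hourly series. The function names `make_windows` and `regression_metrics` are hypothetical, and the sketch does not reproduce the authors' Seq2Seq model or preprocessing.

```python
import math


def make_windows(series, n_in=72, n_out=24):
    """Pair each n_in-hour input window with the n_out hours that follow it."""
    n_samples = len(series) - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is shorter than n_in + n_out")
    inputs = [list(series[i:i + n_in]) for i in range(n_samples)]
    targets = [list(series[i + n_in:i + n_in + n_out]) for i in range(n_samples)]
    return inputs, targets


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for two equal-length sequences of numbers."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if y_true is constant
    return rmse, mae, r2


# Example with a synthetic hourly series of 200 observations.
hourly = list(range(200))
X, Y = make_windows(hourly)
```

With 200 hourly observations this yields 105 overlapping samples; the first input ends at index 71 and its target starts at index 72, matching the [D_(t-71), …, D_t] and [D_(t+1), …, D_(t+24)] convention in the text.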

Results

The results show that station-specific Seq2Seq models outperformed a single city-wide model (RMSE=11.50, MAE=8.94, R²=0.67), demonstrating the benefits of capturing individual station characteristics.

Figure 5 shows the six best station-specific models, comparing actual vs. predicted demand over 24 hours. The results demonstrate strong predictive accuracy across these top-performing stations. Station Esch, Iewescht Homescht achieved the highest performance with an R² of 0.9359 and MAE of 4.58, exhibiting excellent alignment between predicted and actual demand patterns, particularly capturing the pronounced peak around hour 15. Station Esch, Villa follows with an R² of 0.9172 and MAE of 5.37, successfully tracking the sustained high-demand period between hours 3 and 12. Station Esch, Auszeibreck (R² = 0.9133, MAE = 4.90) demonstrates robust performance in predicting the gradual demand increase and subsequent decline. The visual comparison between actual and predicted values across all six stations reveals that the models effectively capture both overall demand trends and peak patterns, though minor deviations occur during periods of rapid demand fluctuation.

Conclusions

This study introduced a novel framework for predicting bus station demand using GPT data and deep learning models. The approach employed a two-step methodology: detecting outlier data, then training customized Seq2Seq models for each bus station to forecast demand over the next 24 hours. Results demonstrated that the models effectively captured temporal dependencies, with RMSE, MAE, and R² metrics confirming robust performance across stations.

13:36
From Urban Activity to Timetables: Integrating Google Popular Times and GTFS for Station-Level Assessment in Luxembourg City

ABSTRACT. Preference: undecided

Google Popular Times (GPT) measures how busy a place is relative to its peak, and General Transit Feed Specification (GTFS) data describe planned train movements. This paper examines how train frequency and nearby land use relate to station activity in Luxembourg. We construct three data layers for six stations: (i) station-level GPT curves, (ii) weighted activity profiles of nearby points of interest (POIs), and (iii) hourly GTFS-based train frequencies. We find strong correlations between station demand, POI activity, and scheduled service, but supply–demand ratios are highly heterogeneous, indicating likely mismatches between provision and demand.

Introduction

Google Popular Times (GPT) aggregates anonymous smartphone traces into hourly activity indices that, although relative, can approximate passenger flows and station demand patterns [1-4]. The General Transit Feed Specification (GTFS) provides standardized transit timetables. Most public-transport demand studies rely on smart-card and passenger-count data, which are rarely public [5]. Against this background, we combine station- and catchment-level GPT with GTFS to assess how transit supply corresponds to surrounding station activity in Luxembourg City, addressing three questions: (i) whether station GPT mirrors nearby POI activity, (ii) whether scheduled frequencies reflect demand, and (iii) where supply–demand mismatches occur.

Methodology

Luxembourg City has six railway stations. We use GPT data, which is an hourly busyness index (0–100) normalized to each location’s peak, for both stations and nearby POIs. Figure 1 maps the stations, their circular catchment areas, and POIs with available GPT, shown as red triangles.

The empirical design has three components. First, 12 weeks of station GPT data were cleaned and checked to derive representative Station GPT Profiles (“SGP”). Then, for each station, we removed GPT outliers for POIs within 500 m and averaged their 12-week series into a single weekly curve. From this we computed three indicators: mean GPT (average over all hours in a week), duty cycle (share of hours in a week with GPT > 50), and coefficient of variation (spikiness). These indicators define POI weights, producing a Catchment POI Profile (“CPP”) per station. Lastly, GTFS timetables were converted to hourly departure frequencies.
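The three POI indicators can be sketched in a few lines of Python. This is a minimal illustration, assuming a weekly curve of hourly GPT values and the population standard deviation for the coefficient of variation. The function name `weekly_indicators` is hypothetical, and the way the indicators are combined into POI weights is not specified in the abstract, so it is not attempted here.

```python
from statistics import mean, pstdev


def weekly_indicators(weekly_gpt, threshold=50):
    """Compute mean GPT, duty cycle, and coefficient of variation for one POI.

    weekly_gpt: hourly GPT values (0-100) for a typical week.
    threshold:  GPT level above which an hour counts as "busy".
    """
    avg = mean(weekly_gpt)
    busy_hours = sum(1 for value in weekly_gpt if value > threshold)
    duty_cycle = busy_hours / len(weekly_gpt)
    # Population standard deviation is an assumption made for this sketch.
    cv = pstdev(weekly_gpt) / avg if avg > 0 else 0.0
    return {"mean_gpt": avg, "duty_cycle": duty_cycle, "cv": cv}


# Example: a synthetic 168-hour week that is idle half the time and at peak otherwise.
profile = [0] * 84 + [100] * 84
indicators = weekly_indicators(profile)
```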

We computed summary statistics, supply–demand ratio, and pairwise Pearson correlations. We structured the data as an hourly panel, so correlations capture variation across both stations and time. To further test these relationships, we estimated linear regressions with SGP as the dependent variable and the CPP and Freq as predictors, assessing whether station activity responds to catchment land-use intensity and service supply. Figure 2 summarizes the methodological framework.
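For illustration, the pairwise Pearson correlation and a supply–demand ratio could be computed as in the standard-library Python sketch below. The ratio is taken here to be mean hourly frequency divided by mean SGP. That definition is an assumption, since the abstract does not state it explicitly, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def supply_demand_ratio(mean_freq, mean_sgp):
    """Assumed definition: trains per hour divided by mean station GPT."""
    return mean_freq / mean_sgp


# Example on a tiny hourly panel (synthetic values, for illustration only).
sgp = [20, 35, 60, 45]
freq = [2, 4, 8, 5]
r = pearson(sgp, freq)
ratio = supply_demand_ratio(mean_freq=sum(freq) / len(freq), mean_sgp=sum(sgp) / len(sgp))
```

Stacking all stations and hours into the two input sequences corresponds to the hourly panel described above, so the coefficient reflects variation across both stations and time.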

Results

Table 1 summarizes measures derived from processed typical-week profiles. MeanSGP ranges from 31 to 47, whereas the POI activity index varies much more. MeanFreq ranges from about 3 trains per hour at Cents-Hamm to about 28 at Luxembourg. Consequently, the supply–demand ratio is highest at Luxembourg (0.80) and lowest at Cents-Hamm (0.07), with intermediate values at other stations.

An apparently counterintuitive finding is that some minor stations have higher mean SGP than the main hub. Because GPT is normalized to peak activity, low-frequency stops such as Cents-Hamm (meanSGP ≈ 41) show higher averages than the main hub (≈ 35); longer waits inflate relative crowding, whereas high-frequency hubs have lower meanSGP due to off-peak dilution. Furthermore, both correlation and regression analyses show that station activity correlates strongly with catchment busyness and service frequency; the regression identifies land-use intensity as the primary driver and frequency as a secondary but meaningful effect.

Linear regression models with hourly SGP as the dependent variable and CPP and Frequency as predictors confirm these patterns. Both predictors have the expected positive, statistically significant coefficients, with the CPP showing the larger standardized effect, indicating that catchment land-use intensity is the primary driver of station activity, while service frequency exerts a secondary but still meaningful influence on GPT levels.

Figure 3 complements these statistics by plotting the full 168-hour GPT profiles for the six stations. Most show a pronounced bimodal commuting pattern with morning and late-afternoon peaks, but peak height, width, and weekend shoulders differ: some suburban stops sustain relatively high evening and weekend activity, whereas the central hub exhibits sharper, more concentrated peaks, consistent with its transfer role.

Conclusions

This study combines station-level and POI-based GPT data with GTFS schedules to explore supply–demand alignment at Luxembourg rail stations. The GPT indicators reveal coherent activity patterns that track both catchment busyness and scheduled frequencies. However, supply–demand ratios remain highly heterogeneous, with some small stations experiencing sustained relative crowding under low service. Future work will refine the weighting model and test the approach in other urban contexts.

13:42
Innovative Approaches to Comprehensive Ridership Analysis

ABSTRACT. [We have no preference for the presentation format, whether it is in person or virtual.]

Achieving a sustainable balance between service quality and operational costs is a core challenge for municipal transport planners. Service quality encompasses connectivity, comfort, and accessibility, while operational costs include infrastructure, maintenance, staffing, and energy consumption. Striking this balance is essential for delivering efficient, cost-effective transit systems that meet community needs.

To address this challenge, WSP has developed Transkit, a digital decision-support ecosystem designed to analyse existing or proposed public transport networks. Transkit evaluates the balance between service quality and associated capital and operating costs using a unified database.

This presentation will focus on Transkit’s comprehensive ridership analysis module, which diagnoses Level of Service (LOS) issues, such as ride comfort, and assesses rolling stock utilization efficiency. Our methodology introduces a set of evaluation indicators, some based on proprietary algorithms, enabling automated diagnostics of route segments. Unlike traditional ridership analysis, which often centres on passenger counts and occupancy rates, our approach identifies uneven loading patterns, boarding and alighting behaviours, and inefficiencies across route sections. For example, routes with high overall utilization may still exhibit long segments with low load factors, leading to resource waste.

The system automatically flags issues such as:
• Overcrowded or underutilized route sections
• Routes with uneven loading patterns
• Long routes dominated by short passenger trips
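A generic way to flag overcrowded or underutilized sections, given boardings and alightings per stop, is sketched below in Python. This is not Transkit's proprietary algorithm: the function names and the load-factor thresholds (`low=0.3`, `high=1.0`) are hypothetical choices made purely for illustration.

```python
def segment_loads(boardings, alightings):
    """Passenger load on each segment between consecutive stops of one trip."""
    loads = []
    load = 0
    # The last stop has no outgoing segment, so it is excluded.
    for on, off in zip(boardings[:-1], alightings[:-1]):
        load += on - off
        loads.append(load)
    return loads


def flag_segments(loads, capacity, low=0.3, high=1.0):
    """Label each segment by its load factor (load divided by vehicle capacity)."""
    flags = []
    for load in loads:
        load_factor = load / capacity
        if load_factor > high:
            flags.append("overcrowded")
        elif load_factor < low:
            flags.append("underutilized")
        else:
            flags.append("ok")
    return flags


# Example: five stops, vehicle capacity of 40 passengers.
boardings = [30, 25, 0, 5, 0]
alightings = [0, 5, 40, 0, 15]
loads = segment_loads(boardings, alightings)
flags = flag_segments(loads, capacity=40)
```

In this example the route carries a healthy load overall, yet one segment exceeds capacity and another runs nearly empty, which is the kind of uneven loading that an average occupancy figure would hide.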

Planners can then explore corrective strategies using a cluster-based approach, adjusting route frequency and rolling stock capacity, and introducing variations such as:
• Short or long branches
• Express or limited-stop services
• Peak/off-peak adjustments

Each scenario is evaluated through a cost–benefit lens, balancing LOS and operational efficiency. This process enables significant reductions in operating expenses—vehicles, drivers, vehicle-km, and hours—without compromising service quality.

We will also present our proprietary algorithm for constructing detailed origin–destination matrices within individual routes, covering all stop-to-stop pairs. This capability enhances decision-making for route redesign and provides robust insights into route duplication, directness, and transfer patterns. This approach prevents misclassification of routes as inefficient based solely on geographic straightness, recognizing that visually “crooked” routes may be quite straight from the passengers' point of view. Adjustments aimed at improving apparent efficiency can, in fact, degrade network performance if passenger distribution is ignored.

The methodology has been successfully applied in diverse municipalities worldwide, including Barrie (Ontario), St. Albert (Alberta), Almaty (Kazakhstan), and Tel Aviv (Israel), where it identified route patterns with significant LOS and efficiency gaps. Based on these findings, targeted measures were proposed to enhance service using the minimum necessary resources, enabling planners to achieve a rational balance between quality and cost. The session will showcase the implementation of these analytics within Transkit and provide practical examples demonstrating how data-driven diagnostics can optimize transit planning and operations.

13:48
Transforming Passively Collected Smart-card Data to Inputs Reliable for Choice Modeling

ABSTRACT. In-Person Presentation.

This study presents a framework for transforming passively collected smart-card data into inputs that are considered reliable for choice modeling. To do so, we define the criteria for reliability and build a framework for producing reliable inputs.

In Random Utility Maximization (RUM) approaches, utilities are computed for every alternative, including the outcome that would have been achieved had a different alternative been chosen. Consequently, we need a safe approach to gather data that represents what a traveler can plausibly know about all alternatives at the moment of decision. This has been investigated in the literature using different approaches.

Conventional revealed preference (RP) surveys rely on self-reported trips and often suffer from low response rates (around 20–30%), which can bias results. In contrast, smart-card data captures actual travel behavior for nearly all users automatically. Therefore, we can extract actual travel behavior rather than stated intentions. However, smart-card data is constrained by existing network conditions and suffers from issues such as missing sociodemographic information or implausible travel times. These limitations make it difficult to use directly in behavioral models.

This raises an important question: can behavioral choice models be estimated using only smart-card data derived from passive sources? This study argues that there is a gap in the literature to answer this question safely because of common challenges and pitfalls, including causal inference and data leakage. Therefore, we need a safe approach that enables us to transform real-world data and passively collected data into features, provided that the approach is carefully validated and theoretically grounded.

Using smart-card data alone introduces several methodological difficulties. First, the limited variation in some of the observed attributes, such as travel time and fare, reduces the ability to identify behavioral sensitivities. Second, these attributes are not randomly assigned and may be correlated with unobserved individual characteristics, introducing endogeneity. Third, smart-card data rarely includes unchosen alternatives, making it necessary to infer their attributes. Finally, the Multinomial Logit (MNL) model’s Independence of Irrelevant Alternatives (IIA) property may bias estimates when alternatives share unobserved components.
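The IIA property mentioned above can be made concrete with a small Python sketch of MNL choice probabilities. The utility values and mode names are invented for illustration and are not estimates from this study.

```python
import math


def mnl_probabilities(utilities):
    """Multinomial logit: P(i) = exp(V_i) / sum_j exp(V_j)."""
    shift = max(utilities.values())  # subtract the max for numerical stability
    exp_u = {alt: math.exp(v - shift) for alt, v in utilities.items()}
    total = sum(exp_u.values())
    return {alt: e / total for alt, e in exp_u.items()}


# Hypothetical systematic utilities for three modes.
full_set = {"bus": -1.0, "rail": -0.5, "tram": -0.8}
reduced_set = {"bus": -1.0, "rail": -0.5}  # same utilities, tram removed

p_full = mnl_probabilities(full_set)
p_reduced = mnl_probabilities(reduced_set)

# IIA: the bus/rail odds depend only on their own utilities,
# so they are identical whether or not tram is in the choice set.
odds_full = p_full["bus"] / p_full["rail"]
odds_reduced = p_reduced["bus"] / p_reduced["rail"]
```

If tram and rail in fact share unobserved attributes, removing tram should shift riders disproportionately to rail, which the fixed odds above cannot represent.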

To address these challenges, we develop a systematic preprocessing framework that uses data-driven modeling and estimation. The key idea is to reconstruct the full decision context that travelers faced while avoiding the use of information unavailable at the time of choice. As a result, the output dataset should satisfy the basic requirements for discrete choice analysis: individual observations of discrete, mutually exclusive alternatives; measurable explanatory attributes; variation in those attributes; and a defined choice set for each decision-maker.

Because smart card data only records the chosen mode, attributes for unchosen modes are inferred through a structured pipeline. The process involves defining choice sets for each origin–destination (OD) pair, estimating travel distances, developing mode-specific predictive models of travel and waiting times based on OD characteristics and time of day, and predicting missing attributes for unchosen alternatives. These inferred variables complete the comparison among all alternatives in the same context, allowing estimation of behavioral parameters from fully specified choice sets.

A critical aspect of the framework is preventing data leakage, which means using information that was only known after a decision was made. Leakage leads to unrealistic predictive accuracy and biased behavioral parameters. Therefore, all attributes must be reconstructed to represent information that could reasonably have been known to the traveler at the time of decision. Instead of using realized travel durations or departure times, the framework relies on predictions from statistically and behaviorally consistent models, which makes the resulting dataset suitable for causal behavioral analysis and policy evaluation.
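The idea of completing choice sets with predicted rather than realized attributes can be sketched as follows. This is a deliberately simplified Python illustration that fits one straight line of travel time against distance per mode. The framework described above uses richer mode-specific models (OD characteristics, time of day, waiting times), and the field names `mode`, `distance_km`, and `time_min` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_mode_models(trips):
    """Fit one travel-time model per mode from trips observed on that mode."""
    by_mode = {}
    for trip in trips:
        by_mode.setdefault(trip["mode"], []).append(trip)
    return {
        mode: fit_line([t["distance_km"] for t in rows], [t["time_min"] for t in rows])
        for mode, rows in by_mode.items()
    }


def complete_choice_set(distance_km, models):
    """Predict travel time for every mode, including the chosen one.

    Using the prediction for the chosen mode too, rather than its realized
    duration, keeps post-decision information out of the features.
    """
    return {mode: slope * distance_km + intercept
            for mode, (slope, intercept) in models.items()}


# Synthetic observed trips (illustrative values only).
observed = [
    {"mode": "bus", "distance_km": 2.0, "time_min": 10.0},
    {"mode": "bus", "distance_km": 4.0, "time_min": 18.0},
    {"mode": "rail", "distance_km": 2.0, "time_min": 6.0},
    {"mode": "rail", "distance_km": 6.0, "time_min": 14.0},
]
models = fit_mode_models(observed)
alternatives = complete_choice_set(distance_km=5.0, models=models)
```

In practice the models would be fitted on data that excludes the trip being completed, so that a trip's own realized duration cannot leak into its predicted attributes.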

Because smart-card data is anonymous, socio-demographic variation is approximated using contextual proxies. Boarding locations are linked to census-based indicators such as median income and employment levels at each socio-demographic zone. Moreover, fare media type (e.g., student, senior, adult) serves as a proxy for traveler groups. These proxies provide socio-demographic heterogeneity without compromising user privacy.

Another benefit of this pipeline is that it accommodates panel data, with multiple observations from the same individual across several days, which introduces intra-person correlation. Such structure is absent from well-known datasets such as Optima (Switzerland) and LPMC (London), where each person reports travel for a single day only. In those datasets we cannot observe habits and variability over time, which can cause the downstream model to overfit.

Using the processed dataset, we estimated both Multinomial Logit (MNL) and Latent Class models. The estimated coefficients display expected behavioral signs and magnitudes, confirming the internal consistency of the data. Incorporating socio-demographic proxies improved model performance and revealed intuitive behavioral patterns, such as lower sensitivity to waiting time among riders from higher-income zones. Excluding implausible trips, such as those with extreme travel times, increased the magnitude of the travel time coefficient by approximately 20%, suggesting that unfiltered outliers attenuate behavioral sensitivities. The proposed approach, compared with statistical imputation and neural network methods, produced more stable, monotonic, and interpretable predictions. This ensures that estimated attributes align with real-world expectations, such as longer distances corresponding to longer travel times.

Beyond systematic preprocessing, the study shows the robustness of model results and sensitivity of choice probabilities to changes such as the effects of outliers. Moreover, the variability and sample representativeness of the data for different sociodemographic groups will be discussed.

In this study, we provide three main contributions. First, we review common pitfalls while using passively collected data and provide a checklist of potential ways to avoid them. Second, we introduce a reproducible preprocessing framework that transforms noisy and incomplete smart card data into a reliable dataset for discrete choice modeling. Third, we demonstrate that robust behavioral parameters can be estimated using smart-card data alone, provided that the choice context is reconstructed realistically and data leakage is avoided.

Overall, the proposed framework shows that behavioral insights can be credibly estimated using passively collected data from smart card systems, which becomes critical, particularly when traditional survey-based modeling and routing simulation tools are not available.

14:30-15:00 Coffee Break
15:00-16:15 Session 27: Plenary Session: Transit Data Challenge

The Transit Data Challenge is a first-of-its-kind opportunity that invited student teams from Canadian universities to tackle real-world challenges in public transit data. Participants developed innovative solutions at the intersection of transportation, data science, and public policy, showcasing how advanced analytics, artificial intelligence, and modern data infrastructure can make transit systems smarter, more equitable, and more responsive to the communities they serve. This session will present the three Finalist Teams and their applications; more information about the three Finalists is available at https://www.transitdata2026.ca/datachallenge.

Location: Center Room
16:15-17:00 Session 28: Plenary Session: Beyond Transit Data – Looking to the Future

To wrap up the TransitData Symposium, we will broaden the perspective beyond just transit data, in order to look towards a future of broad geospatial data and integrated mobility.

· Arif Rafiq, Industry Manager, Transportation, Esri Canada, Executive Board Director, ITS Canada

· Jesse Coleman, Manager Transportation Data & Analytics, City of Toronto

Location: Center Room