Using the Modern Data Stack to Improve Transit Outcomes in California

Background

The California Integrated Travel Project (Cal-ITP) is a statewide effort to make transit easier to use and more cost-effective for riders in California.

Cal-ITP needed a technology partner to help collect data about California’s 200+ transit agencies and transform that data into usable models that support a wide range of data users and analysis products. Jarvus was selected as that partner and determined that Cal-ITP offered an exciting opportunity to bring modern data stack principles to a public sector context.

Project context

Cal-ITP’s data infrastructure serves a diverse ecosystem of users and project goals, including both internal and external data stakeholders, all of whom have different needs and levels of self-service. For example, project leadership may ask for a quick number that they can share with an outside stakeholder while data analysts are working on long-term research projects.

Graphic showing three colorful shapes with different data users. The blue shape says: 'Data users within Cal-ITP: Project leadership, Data analysts, Operational users, Developers'. The green shape says: 'Transit agencies'. The orange shape says: 'External third parties'.
Internal and external data stakeholders for the Cal-ITP project.

Cal-ITP’s data sources are heterogeneous. In many ways, the project is more similar to a research initiative than to traditional business analytics, because it involves scraping data from open sources that offer fewer guarantees than the structured APIs common in many business contexts. Because the project emphasizes assessing data quality, the ingest pipeline must have extremely reliable uptime and capture rates. The nature of the GTFS data specification means that ingested GTFS data varies in completeness, uses a variety of components and features, and can take multiple approaches to representing the same concept. The project’s data sources also vary widely in size and update frequency, from large real-time data captured every 20 seconds to a very small, manually maintained internal database that is updated perhaps weekly.

Table showing Cal-ITP data sources with data sizes and file formats.
Summary of data sources for the Cal-ITP pipeline.
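
To make the variability of GTFS feeds concrete, the following is a minimal Python sketch, not Cal-ITP's actual pipeline code, that summarizes which files a GTFS Schedule zip archive contains. The function names and the file groupings (CORE_FILES, SERVICE_FILES, OPTIONAL_FILES) are illustrative assumptions; the authoritative list of required and optional files is the GTFS Schedule reference. The sketch also shows one case where the same concept can be represented in more than one way: service days may be defined in calendar.txt, in calendar_dates.txt, or in both.

```python
import io
import zipfile

# Illustrative groupings only (assumptions for this sketch); consult the
# GTFS Schedule reference for the authoritative required/optional status.
CORE_FILES = ("agency.txt", "routes.txt", "trips.txt", "stop_times.txt")
SERVICE_FILES = ("calendar.txt", "calendar_dates.txt")
OPTIONAL_FILES = ("shapes.txt", "frequencies.txt", "feed_info.txt")


def summarize_feed(zip_bytes: bytes) -> dict:
    """Report which GTFS files are present in a zipped feed."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Some feeds nest files inside a folder, so compare base names only.
        names = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return {
        "missing_core": sorted(f for f in CORE_FILES if f not in names),
        "service_defined_by": sorted(f for f in SERVICE_FILES if f in names),
        "optional_present": sorted(f for f in OPTIONAL_FILES if f in names),
    }


def build_example_feed(file_names) -> bytes:
    """Build an in-memory zip of empty files, used only for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in file_names:
            archive.writestr(name, "")
    return buffer.getvalue()


if __name__ == "__main__":
    # A hypothetical feed that omits stop_times.txt and defines service
    # only through calendar_dates.txt.
    feed = build_example_feed(
        ["agency.txt", "routes.txt", "trips.txt", "calendar_dates.txt", "shapes.txt"]
    )
    print(summarize_feed(feed))
```

A summary like this only checks file presence. Assessing whether the contents of each file are valid or complete would require a dedicated GTFS validator.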

One primary goal of the project’s data stakeholders is to create data products that are built and maintained directly by the data users, rather than each being maintained by the Jarvus data services team. The list of data products was not fully defined at the project’s outset, but it was known to include at least:

Over time, the original products have matured and added new feature requirements, and new products have been developed, including a variety of published analyses.

Approach

The first step in developing a solution was to recognize that the data needs of the project would be better served by taking a platform approach than by tackling individual product, source, or user requirements one by one. The project needed an extensible data platform, rather than a set of bespoke pipelines tailored to each individual deliverable.

Graphic showing a reframing of requirements as a list of individual products to the requirement of an extensible data platform that supports different products.
Reframing requirements.

Assessing the requirements for such a platform made it clear that Cal-ITP’s needs align in many ways with the principles of the modern data stack. One clear formulation of those principles comes from Atlan, who define the modern data platform as characterized by:

  1. “Self-service for diverse users”
  2. “Agile data management”
  3. “Flexible, fast, pay as you go”

These features match well with Cal-ITP’s needs to facilitate collaboration, enable self-service, scale flexibly, and be cost-effective.

With these requirements in mind, Jarvus developed the following data platform:

Graphic showing a modern data stack architecture with tools for ingestion, data modeling, and analysis.
Cal-ITP data platform architecture.

Tools were selected with a preference for open-source tooling, and Google Cloud had already been selected as the cloud provider. However, this architecture can also be formulated in a tool-agnostic way:

Graphic showing a modern data stack architecture showing the roles that different tools play, like orchestrator and business intelligence dashboarding tool.
Modern data stack architecture.

The key benefits of this architecture (separating compute from storage, separating raw data from modeling, and using version control) are not tool-specific.
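
As one hedged illustration of separating raw data from modeling, the Python sketch below builds a storage path for a raw capture that depends only on the source, the feed, and the fetch time. The function name raw_object_path, the path layout, and the example source and feed names are all hypothetical and are not taken from Cal-ITP's implementation. The idea is that raw files are written once under a time-partitioned path and never modified, while parsing and modeling happen in separate, version-controlled steps that can be rerun against the same raw files.

```python
from datetime import datetime, timezone


def raw_object_path(source: str, feed_key: str, fetched_at: datetime) -> str:
    """Return a hypothetical storage path for one raw, unmodified capture.

    The path is derived only from capture metadata, so later changes to
    parsing or modeling logic never require rewriting the raw files.
    """
    if fetched_at.tzinfo is None:
        raise ValueError("fetched_at must be timezone-aware")
    ts = fetched_at.astimezone(timezone.utc)
    return (
        f"raw/{source}/dt={ts:%Y-%m-%d}/hour={ts:%H}/"
        f"{feed_key}/{ts:%Y%m%dT%H%M%SZ}.bin"
    )


if __name__ == "__main__":
    # Hypothetical source and feed names, for demonstration only.
    fetched = datetime(2023, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    print(raw_object_path("gtfs_rt", "feed_a", fetched))
```

Partitioning by date and hour is a common layout for high-frequency captures because downstream queries can then read only the time ranges they need.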

Outcomes

With Jarvus’ help, Cal-ITP currently ingests and analyzes over 1 million files per day, and makes that data immediately accessible to data engineers, analysts, and the agencies Cal-ITP supports.

Through the shared BigQuery warehouse, Cal-ITP data users have access to a shared source of truth. Analysts can do complex research tasks using flexible Jupyter notebooks in an entirely browser-based workflow, and they can publish those directly to an analysis site to share their insights. Customer success managers can self-serve dashboards in Metabase to analyze their customers’ data quality and identify where support is needed.

Screenshot of cells from a Jupyter Notebook. Screenshot of a map showing variability in bus speeds by road segment in Marin County, California.
A hosted JupyterHub instance allows analysts to work entirely in the browser, without installing packages locally. They can then publish their analyses publicly using a JupyterBook-based workflow. The top screenshot shows a notebook that drives the bus speeds analysis site shown in the bottom screenshot.

Peer developers can leverage the robust documentation and developer experience of dbt to contribute their own models to the warehouse. And all of those users are looking at the same underlying data, ensuring that Cal-ITP can speak with a unified voice when answering data questions.

Screenshot of a table showing GTFS quality checks with a mix of green checks for success, red X's for failures, and whitespace for missing data.
Quality reporting for agencies at reports.calitp.org
Screenshot of a dashboard showing metrics about number of organizations with various outcomes in GTFS data checks.
Metabase dashboard for policymakers at calitp.org/gtfs-dashboard
Screenshot of California open data portal showing a map with high quality transit areas around the San Francisco Bay Area.
Dataset for the public at data.ca.gov/dataset/ca-hq-transit-areas
A sample of Cal-ITP data products built or maintained by Cal-ITP data product owners using the data infrastructure designed by Jarvus.