
Transit Operations Data Pipeline

A production data pipeline that ingests GTFS and GTFS-RT feeds, transforms them through Bronze / Silver / Gold layers in Postgres, and surfaces KPIs in a Power BI dashboard. Built around live feeds from STM, a transit authority in Quebec.

30s — real-time refresh cycles
99.9% — pipeline uptime

Stack: PostgreSQL · Python · dbt · Power BI · Apache Airflow

Transit Ops

transit_ops is a portfolio-oriented GTFS / GTFS-RT analytics foundation that starts with STM and is designed to stay provider-ready within the GTFS and GTFS-Realtime standards.

What this project is

This repository is the bootstrap and database foundation for an operations-style transit analytics system. The end state is a clean pipeline that stores GTFS schedule and GTFS-RT data in Neon Postgres, models it into Bronze / Silver / Gold layers, and feeds a downstream Power BI dashboard.

This is intentionally not a startup SaaS. V1 is STM-first, single-provider, and portfolio-focused.

V1 scope

The current implemented slices establish the project skeleton, core database foundation, and provider registry seam:

  • Python 3.12 project managed with uv
  • Pydantic settings and logging
  • Runnable Typer CLI
  • Alembic migration setup for the base schemas and metadata tables
  • STM seed data for core.providers and core.feed_endpoints
  • YAML-backed provider manifest loading for STM
  • Bronze static GTFS download, checksumming, R2-first raw archiving, and raw metadata registration
  • Bronze GTFS-RT one-shot snapshot capture, protobuf metadata extraction, R2-first raw archiving, and raw metadata registration
  • Silver static GTFS normalization into canonical Neon tables with dataset-versioned loads
  • Silver GTFS-RT normalization from captured Bronze snapshots into canonical Neon tables
  • Gold route/stop/date dimensions plus realtime snapshot facts and KPI views
  • Explicit static and realtime orchestration commands for the proven Bronze -> Silver -> Gold flow
  • One long-running realtime worker entrypoint for cloud/container deployment
  • GitHub Actions workflows for the daily static refresh and warm rollups, plus a Dockerfile for the realtime worker
  • Architecture and setup documentation

Why STM-first but provider-ready

The implementation starts with STM because the portfolio story is clearer when the first provider is real and well-scoped. The database design and settings model are still provider-ready:

  • all core metadata carries provider_id
  • GTFS source identifiers are preserved
  • feed endpoint registration is normalized in core.feed_endpoints
  • the schema shape can support more GTFS / GTFS-RT providers later

The abstraction target is GTFS and GTFS-Realtime, not arbitrary transit APIs.

The provider registry is a simple set of YAML manifests in config/providers/. STM is the only active manifest in V1, but the seam now exists for additional GTFS providers later.

Why Neon is the reporting core

Neon Postgres is the reporting core because this project is meant to highlight SQL-first analytics engineering:

  • a durable relational source for normalized schedule and realtime data
  • clean separation between raw ingestion metadata and curated marts
  • direct support for downstream BI tools such as Power BI
  • simple local development with a cloud-hosted reporting database

Why Bronze / Silver / Gold exists

The data model is layered on purpose:

  • Bronze preserves raw source artifacts and ingestion traceability
  • Silver holds canonical GTFS / GTFS-RT relational tables
  • Gold exposes BI-friendly facts, dimensions, and KPI views

The current foundation only creates the schemas and base metadata tables needed to grow into that layered design.

Bronze static ingestion

Slice 2 adds the first real ingestion step for STM static GTFS:

  • the static schedule URL comes from the validated STM provider manifest
  • the ZIP is downloaded once on demand through the CLI
  • the file is archived under the configured Bronze storage backend
  • a SHA-256 checksum and byte size are recorded
  • one row is written to raw.ingestion_runs
  • one row is written to raw.ingestion_objects

The current logical Bronze object key pattern is:

provider_id/endpoint_key/ingested_at_utc=YYYY-MM-DD/YYYYMMDDTHHMMSSffffffZ__<checksum12>__<filename>

Example logical key:

stm/static_schedule/ingested_at_utc=2026-03-24/20260324T110203456789Z__aaaaaaaaaaaa__gtfs_stm.zip
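
A minimal sketch of how such a key can be assembled (build_bronze_key is a hypothetical helper, not the repo's actual function):

import hashlib
from datetime import datetime, timezone

def build_bronze_key(provider_id: str, endpoint_key: str, payload: bytes, filename: str) -> str:
    # Assemble a Bronze logical object key matching the documented pattern.
    now = datetime.now(timezone.utc)
    checksum12 = hashlib.sha256(payload).hexdigest()[:12]   # first 12 hex chars of the SHA-256
    stamp = now.strftime("%Y%m%dT%H%M%S%f") + "Z"           # YYYYMMDDTHHMMSSffffffZ
    return (
        f"{provider_id}/{endpoint_key}/"
        f"ingested_at_utc={now:%Y-%m-%d}/"
        f"{stamp}__{checksum12}__{filename}"
    )

# build_bronze_key("stm", "static_schedule", zip_bytes, "gtfs_stm.zip")
# -> stm/static_schedule/ingested_at_utc=.../...__<checksum12>__gtfs_stm.zip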

Backend behavior:

  • local mode stores that logical key under BRONZE_LOCAL_ROOT
  • S3-compatible mode stores that same logical key as the bucket object key
  • raw.ingestion_objects.storage_path always stores the logical key only, never an absolute local path
  • the implementation is intended to stay compatible with Cloudflare R2 while remaining generic S3-compatible

Bronze realtime capture

Slice 3 adds one-shot GTFS-RT snapshot capture for STM; a code sketch follows the list:

  • the trip_updates and vehicle_positions URLs come from the validated STM provider manifest
  • STM realtime access uses the apiKey request header with STM_API_KEY
  • the current Python transport pins TLS 1.2 for compatibility with api.stm.info
  • each command performs one on-demand capture only
  • the raw protobuf payload is archived under the configured Bronze storage backend
  • a SHA-256 checksum and byte size are recorded
  • GTFS-RT metadata is extracted from the payload:
    • feed header timestamp
    • entity count
    • endpoint kind (trip_updates or vehicle_positions)
  • one row is written to raw.ingestion_runs
  • one row is written to raw.ingestion_objects
  • one row is written to raw.realtime_snapshot_index
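
A minimal sketch of that capture path, assuming the standard gtfs-realtime-bindings package (the function is illustrative, not the repo's actual code):

import os
import ssl
import urllib.request
from google.transit import gtfs_realtime_pb2

def capture_snapshot(url: str) -> tuple[bytes, int, int]:
    # Pin TLS 1.2 for api.stm.info compatibility, as described above.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    req = urllib.request.Request(url, headers={"apiKey": os.environ["STM_API_KEY"]})
    with urllib.request.urlopen(req, context=ctx) as resp:
        payload = resp.read()
    # Extract the GTFS-RT metadata recorded in raw.realtime_snapshot_index.
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(payload)
    return payload, feed.header.timestamp, len(feed.entity)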

The current realtime logical object key pattern is:

provider_id/endpoint_key/captured_at_utc=YYYY-MM-DD/YYYYMMDDTHHMMSSffffffZ__<checksum12>__<endpoint_key>.pb

Example logical key:

stm/trip_updates/captured_at_utc=2026-03-24/20260324T121516987654Z__bbbbbbbbbbbb__trip_updates.pb

As with static Bronze storage:

  • local mode stores this logical key under BRONZE_LOCAL_ROOT
  • S3-compatible mode stores this logical key directly in the configured bucket
  • the DB continues to record storage_backend, logical storage_path, byte size, checksum, source URL, and ingestion lineage

This slice intentionally starts with one-shot Bronze capture, but the repo now also includes a separate orchestration layer and a long-running realtime worker that can call those captures continuously on a safe cadence.

The STM shared secret is still not used by the current GTFS-RT capture path.

Bronze storage modes

The current Bronze storage abstraction still supports two backends in code:

  • local
  • s3

The intended durable Bronze mode is now Cloudflare R2 through the S3-compatible backend:

  • set BRONZE_STORAGE_BACKEND=s3
  • set BRONZE_S3_ENDPOINT=https://eccfb9bedd87d413eaf4cac6ae2285d3.r2.cloudflarestorage.com
  • set BRONZE_S3_BUCKET=transit-raw
  • set BRONZE_S3_ACCESS_KEY
  • set BRONZE_S3_SECRET_KEY
  • set BRONZE_S3_REGION=auto

Important R2 rules:

  • BRONZE_S3_ENDPOINT must be the account-level endpoint only
  • do not append /transit-raw or any other path segment to the endpoint
  • pass the bucket separately as BRONZE_S3_BUCKET=transit-raw
  • the implementation uses SigV4 signing and path-style addressing for R2 compatibility (see the client sketch below)
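
A minimal boto3 client sketch that follows these rules (boto3 itself is an assumption about the implementation; the endpoint and bucket values come from the settings above):

import os
import boto3
from botocore.config import Config

# Account-level endpoint only -- never append the bucket as a path segment.
r2 = boto3.client(
    "s3",
    endpoint_url=os.environ["BRONZE_S3_ENDPOINT"],
    aws_access_key_id=os.environ["BRONZE_S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["BRONZE_S3_SECRET_KEY"],
    region_name=os.environ.get("BRONZE_S3_REGION", "auto"),
    config=Config(signature_version="s3v4", s3={"addressing_style": "path"}),
)

# The bucket is passed separately; the logical key becomes the object key as-is.
logical_key = "stm/static_schedule/ingested_at_utc=2026-03-24/20260324T110203456789Z__aaaaaaaaaaaa__gtfs_stm.zip"
with open("gtfs_stm.zip", "rb") as f:
    r2.put_object(Bucket=os.environ["BRONZE_S3_BUCKET"], Key=logical_key, Body=f)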

Local disk is no longer the intended durable Bronze store. It still exists for:

  • local temp staging before upload/download
  • backward-compatible reads for historical local Bronze rows
  • explicit local-only development workflows if you intentionally set BRONZE_STORAGE_BACKEND=local

The current implementation still downloads each artifact to a local temp file first and then persists it through the configured Bronze backend. The orchestration and worker layers sit on top of this storage path without changing the underlying Bronze key semantics.

Silver static normalization

Slice 4 adds the first Silver normalization step for STM static GTFS:

  • the loader finds the latest successfully archived Bronze static ZIP for the provider
  • it reopens the archive through the recorded Bronze storage backend
  • it validates the required GTFS members inside the archive
  • it parses the required core files:
    • routes.txt
    • trips.txt
    • stops.txt
    • stop_times.txt
    • calendar.txt
    • calendar_dates.txt
  • it creates a new core.dataset_versions row for every Silver load
  • it loads canonical rows into:
    • silver.routes
    • silver.trips
    • silver.stops
    • silver.stop_times
    • silver.calendar
    • silver.calendar_dates

Dataset versioning works like this:

  • the latest successful Bronze static archive is treated as the source artifact
  • each Silver load creates a fresh dataset version row
  • Silver rows are written with both provider_id and dataset_version_id
  • prior Silver dataset versions are left intact
  • older dataset version rows are marked is_current = false and the newly loaded version is marked current

This keeps the pipeline append-only at the data level while still letting the repo point to one current static dataset for downstream use.
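
A sketch of the version flip, assuming SQLAlchemy and the table/column names used in this README (the repo's actual DDL and transaction boundaries may differ):

import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ["NEON_DATABASE_URL"])
bronze_checksum = "<sha-256 of the latest Bronze static archive>"

with engine.begin() as conn:  # one transaction: demote prior versions, insert the new current one
    conn.execute(
        text("UPDATE core.dataset_versions SET is_current = false "
             "WHERE provider_id = :pid AND is_current"),
        {"pid": "stm"},
    )
    new_version_id = conn.execute(
        text("INSERT INTO core.dataset_versions (provider_id, content_hash, is_current) "
             "VALUES (:pid, :content_hash, true) RETURNING id"),  # 'id' column name is assumed
        {"pid": "stm", "content_hash": bronze_checksum},
    ).scalar_one()
# Silver rows are then written with both provider_id and new_version_id.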

Silver realtime normalization

Slice 5 adds the first Silver normalization step for GTFS-RT snapshots:

  • the loader finds the latest successful Bronze realtime snapshot for the provider and endpoint
  • it reads the archived protobuf through the recorded Bronze storage backend
  • it parses the payload with gtfs-realtime-bindings
  • it normalizes a minimal V1 field set into:
    • silver.trip_updates
    • silver.trip_update_stop_time_updates
    • silver.vehicle_positions

The current V1 realtime fields are intentionally narrow:

  • trip updates store snapshot linkage plus practical trip-level fields such as trip_id, route_id, direction_id, start_date, vehicle_id, trip_schedule_relationship, delay_seconds, and entity_id
  • stop time updates store only the parent linkage plus practical stop-level fields such as stop_sequence, stop_id, arrival_delay_seconds, arrival_time_utc, departure_delay_seconds, departure_time_utc, and schedule_relationship
  • vehicle positions store snapshot linkage plus practical location fields such as vehicle_id, trip_id, route_id, stop_id, current_stop_sequence, current_status, occupancy_status, latitude, longitude, bearing, speed, and position_timestamp_utc

Bronze-to-Silver linkage is explicit through realtime_snapshot_id, which connects each Silver realtime row back to:

  • raw.realtime_snapshot_index
  • raw.ingestion_runs
  • raw.ingestion_objects

This remains a one-shot load from already captured snapshots, and the new realtime worker simply automates those same explicit load steps on a fixed cadence.
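
A sketch of the trip-update side of that normalization, using standard gtfs-realtime-bindings field names (the exact Silver column mapping is the repo's; this loop shape is illustrative):

from google.transit import gtfs_realtime_pb2

feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(payload)  # payload: the archived Bronze protobuf bytes

rows = []
for entity in feed.entity:
    if not entity.HasField("trip_update"):
        continue
    tu = entity.trip_update
    rows.append({
        "entity_id": entity.id,
        "trip_id": tu.trip.trip_id,
        "route_id": tu.trip.route_id,
        "direction_id": tu.trip.direction_id,
        "start_date": tu.trip.start_date,
        "vehicle_id": tu.vehicle.id,
        "trip_schedule_relationship": tu.trip.schedule_relationship,
        "delay_seconds": tu.delay if tu.HasField("delay") else None,
    })
# Each row lands in silver.trip_updates carrying its realtime_snapshot_id.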

Gold marts and KPI views

Slice 6 adds the first BI-ready Gold layer for STM:

  • gold.dim_route
  • gold.dim_stop
  • gold.dim_date
  • gold.fact_vehicle_snapshot
  • gold.fact_trip_delay_snapshot
  • gold.latest_vehicle_snapshot
  • gold.latest_trip_delay_snapshot

The Gold layer is intentionally explicit and narrow:

  • route and stop dimensions are rebuilt from the current static Silver dataset
  • the date dimension is rebuilt from the current static service calendar range and exceptions
  • vehicle and trip delay facts keep full history, but realtime refresh now upserts only the newest loaded snapshots instead of deleting and rebuilding all provider history every cycle
  • gold.latest_vehicle_snapshot and gold.latest_trip_delay_snapshot keep only the newest snapshot per provider for dashboards, KPI views, and browser inspection
  • KPI views now query the lightweight latest Gold tables directly instead of re-deriving "latest snapshot" from the large history facts

The current KPI views are:

  • gold.kpi_active_vehicles_latest
  • gold.kpi_routes_with_live_vehicles_latest
  • gold.kpi_avg_trip_delay_latest
  • gold.kpi_max_trip_delay_latest
  • gold.kpi_delayed_trip_count_latest

A later slice adds 5-minute warm rollup tables for Power BI historical trend pages:

  • gold.vehicle_summary_5m — vehicle count and observations per 5-minute period and route (warm — 90-day retention)
  • gold.trip_delay_summary_5m — delay statistics per 5-minute period and route, including avg_delay_seconds_capped (abs ≤ 3600s) and outlier_count (warm — 90-day retention)
  • gold.warm_rollup_periods — idempotency tracking table for the rollup build

Raw delay_seconds in gold.fact_trip_delay_snapshot is never clamped or modified. The capped variant lives only in the warm rollup table, making Import mode KPI cards outlier-safe without touching the source facts.
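
A sketch of the 5-minute bucketing and outlier handling, written as SQL issued from the pipeline's Python side (the snapshot timestamp column name is assumed, and "capped" is read here as excluding |delay| > 3600s while counting the outliers):

import os
from sqlalchemy import create_engine, text

ROLLUP_SQL = text("""
    SELECT
        to_timestamp(floor(extract(epoch FROM snapshot_ts) / 300) * 300) AS period_start,
        route_id,
        count(*) AS observations,
        avg(delay_seconds) FILTER (WHERE abs(delay_seconds) <= 3600) AS avg_delay_seconds_capped,
        count(*) FILTER (WHERE abs(delay_seconds) > 3600) AS outlier_count
    FROM gold.fact_trip_delay_snapshot
    WHERE snapshot_ts >= :since
    GROUP BY 1, 2
""")

engine = create_engine(os.environ["NEON_DATABASE_URL"])
with engine.begin() as conn:
    rollup_rows = conn.execute(ROLLUP_SQL, {"since": "2026-03-01"}).all()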

gold.fact_trip_delay_snapshot keeps the trip-level GTFS-RT fields when STM provides them, but it now backfills the most important gaps for BI:

  • vehicle_id falls back to the nearest silver.vehicle_positions row for the same trip_id within a short snapshot-time window
  • delay_seconds falls back to a derived stop-time delay computed from silver.trip_update_stop_time_updates absolute timestamps versus the current static silver.stop_times schedule for the same trip and stop sequence

That means route-level delay KPIs can still populate even when STM omits the top-level trip delay_seconds field. route_id remains useful for grouping and filtering, but it is not used by itself to infer a single vehicle_id because one route can have many concurrent active vehicles.

Gold refresh is now split across three explicit paths:

  • build-gold-marts stm — heavy full-history backfill, manual recovery only
  • refresh-gold-static stm — daily static batch path, replaces only Gold dimensions
  • refresh-gold-realtime stm — 30s realtime path, upserts latest snapshots only

refresh-gold-static stm is called by run-static-pipeline after each daily Silver static load. It replaces dim_route, dim_stop, and dim_date from the current dataset version without touching fact tables or acquiring a table lock. This eliminates lock contention with the concurrent realtime worker.

refresh-gold-realtime stm is the fast path used by the realtime worker and realtime cycle. It upserts only the current trip and vehicle snapshots into the historical Gold facts and then refreshes the small gold.latest_* tables.
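
The upsert shape that fast path implies can be sketched like this (the Gold column list and conflict key are assumptions for illustration):

from sqlalchemy import text

UPSERT_VEHICLE_FACTS = text("""
    INSERT INTO gold.fact_vehicle_snapshot
        (provider_id, realtime_snapshot_id, vehicle_id, route_id, latitude, longitude)
    SELECT provider_id, realtime_snapshot_id, vehicle_id, route_id, latitude, longitude
    FROM silver.vehicle_positions
    WHERE realtime_snapshot_id = :snapshot_id
    ON CONFLICT (provider_id, realtime_snapshot_id, vehicle_id) DO UPDATE
        SET latitude = EXCLUDED.latitude,
            longitude = EXCLUDED.longitude
""")

Only the newest snapshots are touched, so the provider's fact history is never rewritten during a cycle.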

build-gold-marts stm is the explicit full-history backfill command for manual recovery only. It is no longer called by any automated pipeline.

The V1 Power BI operations dashboard is built and published to Power BI Service in DirectQuery mode. Every page load queries Neon live through 15 connected tables (5 KPI views, 2 latest-serving tables, and 8 Gold dimension, fact, and rollup tables). The .pbix file is not checked into this repo. The BI assets under powerbi/ document the semantic design:

  • powerbi/dashboard-spec.md
  • powerbi/build-playbook.md
  • powerbi/field-mapping.md
  • powerbi/dax-measures.md
  • powerbi/sql-validation.sql
  • powerbi/sql-validation.md
  • powerbi/portfolio-notes.md

The published dashboard exposes four pages: Network Overview, Route Performance, Stop Activity, and Live Ops / Freshness.

Pipeline orchestration and automation

The repo now includes explicit orchestration commands that reuse the already proven Bronze, Silver, and Gold services instead of duplicating business logic (the static path's change detection is sketched after this list):

  • run-static-pipeline stm
    • always runs ingest-static stm (Bronze lineage is always recorded)
    • compares the new Bronze checksum to the current core.dataset_versions.content_hash
    • if unchanged: skips Silver and Gold steps; result reports static_changed=false
    • if changed (or no existing version): runs load-static-silver stm then refresh-gold-static stm
  • run-realtime-cycle stm
    • runs capture-realtime stm trip_updates
    • runs capture-realtime stm vehicle_positions
    • runs load-realtime-silver stm trip_updates
    • runs load-realtime-silver stm vehicle_positions
    • runs refresh-gold-realtime stm
    • prunes Silver storage according to the configured retention settings
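
A sketch of the static path's change detection (every function name here is a hypothetical stand-in for the repo's services):

def run_static_pipeline(provider_id: str) -> dict:
    bronze = ingest_static(provider_id)  # always runs; Bronze lineage is always recorded
    current_hash = get_current_content_hash(provider_id)  # core.dataset_versions.content_hash
    static_changed = current_hash is None or bronze.checksum != current_hash
    if static_changed:
        load_static_silver(provider_id)
        refresh_gold_static(provider_id)
    return {"static_changed": static_changed}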

Pausing and resuming the pipeline

Two scripts wrap all three automation surfaces (GH Actions + Railway env var + Railway compute) into a single command:

# Stop everything (GH Actions disabled, Railway worker idles, Railway compute suspended)
bash scripts/pause-pipeline.sh

# Restart everything
bash scripts/resume-pipeline.sh

Railway compute suspension requires a personal API token:

export RAILWAY_TOKEN=<token from https://railway.app/account/tokens>
bash scripts/pause-pipeline.sh

Without RAILWAY_TOKEN, the scripts handle GH Actions and the PIPELINE_PAUSED env var, and print a link to pause Railway compute manually.

The PIPELINE_PAUSED=true env var alone (without compute suspension) makes the Railway worker idle — it sleeps each poll interval without calling STM, Neon, or R2.


Operational rules:

  • Bronze durable storage remains R2-first through BRONZE_STORAGE_BACKEND=s3
  • the orchestration commands keep the existing DB lineage and R2 object key behavior intact
  • run-realtime-cycle stm attempts both realtime endpoints every cycle
  • if one endpoint fails and the other succeeds, the command reports a partial failure explicitly and exits non-zero
  • Gold latest tables are refreshed after any successful realtime endpoint load so downstream BI stays current without rewriting the full provider history
  • old Silver rows are pruned automatically:
    • static Silver keeps only the current dataset version by default
    • realtime Silver keeps the newest two days of snapshots by default

Continuous realtime worker

The repo also now includes one minimal long-running worker entrypoint:

  • run-realtime-worker stm

It is intended for container or cloud deployment and:

  • loops forever
  • runs one realtime cycle per loop (the cadence loop is sketched after this list)
  • logs each cycle clearly
  • exits non-zero on fatal startup/configuration issues
  • keeps running across per-cycle endpoint failures so one bad pull does not corrupt later cycles
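
A sketch of that cadence loop (run_realtime_cycle is a stand-in for the repo's orchestration call):

import time
import logging

def run_worker(poll_seconds: float = 30.0) -> None:
    while True:
        cycle_start = time.monotonic()
        try:
            run_realtime_cycle("stm")  # hypothetical stand-in for one full realtime cycle
        except Exception:
            logging.exception("cycle failed; continuing")  # one bad pull must not stop later cycles
        elapsed = time.monotonic() - cycle_start
        time.sleep(max(0.0, poll_seconds - elapsed))  # start-to-start: sleep only the remainder

This is why the hosted logs show effective_start_to_start_seconds ≈ 30.001 with roughly 21–24 seconds of computed sleep per cycle.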

Worker environment variables:

  • PIPELINE_PAUSED
    • default: false
    • set to true to make the worker idle (sleeps each interval, no cycles run, no DB/R2/API calls)
    • use scripts/pause-pipeline.sh / scripts/resume-pipeline.sh to flip this together with GH Actions and Railway compute
  • REALTIME_POLL_SECONDS
    • default: 300 (production override: 30)
    • controls how often one full realtime cycle starts
  • REALTIME_STARTUP_DELAY_SECONDS
    • default: 0
    • optional startup delay before the first cycle
  • STATIC_DATASET_RETENTION_COUNT
    • default: 1
    • keeps only the newest static Silver dataset version by default
  • SILVER_REALTIME_RETENTION_DAYS
    • default: 2
    • keeps only the newest two days of realtime Silver snapshots by default
  • GOLD_FACT_RETENTION_DAYS
    • default: 2
    • Gold fact rows (fact_trip_delay_snapshot, fact_vehicle_snapshot) older than this are deleted each realtime cycle
  • GOLD_WARM_ROLLUP_RETENTION_DAYS
    • default: 90
    • gold.vehicle_summary_5m, gold.trip_delay_summary_5m, and gold.warm_rollup_periods rows older than this are deleted by prune-warm-rollup-storage
  • BRONZE_REALTIME_RETENTION_DAYS
    • default: 7
    • Bronze realtime R2 objects and Neon metadata eligible for deletion after this many days (requires downstream Silver rows to be gone first)
    • set to 0 to disable Bronze realtime pruning
  • BRONZE_STATIC_RETENTION_DAYS
    • default: 30
    • Bronze static R2 objects and Neon metadata eligible for deletion after this many days (requires core.dataset_versions reference to be gone first)
    • set to 0 to disable Bronze static pruning

Deployment artifacts

The repo ships two daily batch workflows and one container path for the realtime worker.

GitHub Actions workflows

The included workflow files are:

  • .github/workflows/daily-static-pipeline.yml
  • .github/workflows/daily-warm-rollups.yml

Static pipeline workflow

Current behavior:

  • triggers once per day at 06:00 UTC
  • 06:00 UTC corresponds to 2:00 AM Eastern while EDT is in effect
  • GitHub Actions cron is UTC-based, so this may need a seasonal UTC adjustment during EST if the desired local run time remains 2:00 AM Eastern
  • supports manual runs through workflow_dispatch
  • uses timeout-minutes: 30
  • uses concurrency to avoid overlapping static runs
  • keeps GitHub permissions narrowed to contents: read
  • runs:
    • uv sync --locked
    • python -m transit_ops.cli init-db
    • python -m transit_ops.cli seed-core
    • python -m transit_ops.cli run-static-pipeline stm

Exact GitHub Actions secrets required after you push the repo:

  • NEON_DATABASE_URL
  • BRONZE_S3_ACCESS_KEY
  • BRONZE_S3_SECRET_KEY

STM_API_KEY is not required for the daily static workflow because it does not capture GTFS-RT feeds.

Warm rollups workflow

Current behavior:

  • triggers once per day at 07:00 UTC (one hour after the static pipeline)
  • supports manual runs through workflow_dispatch
  • uses timeout-minutes: 15
  • runs:
    • uv sync --locked
    • python -m transit_ops.cli build-warm-rollups stm
    • python -m transit_ops.cli prune-warm-rollup-storage stm

The 07:00 UTC schedule is intentional: GOLD_FACT_RETENTION_DAYS = 2 means warm rollups must be built at least every 2 days to avoid gaps. Running daily at 07:00 UTC ensures the rollups are built while the source fact rows are still inside the 2-day retention window, before Gold fact pruning (which runs each realtime cycle) removes them.

Exact GitHub Actions secrets required (same as static workflow):

  • NEON_DATABASE_URL
  • BRONZE_S3_ACCESS_KEY
  • BRONZE_S3_SECRET_KEY

Realtime worker container

The included container path is:

  • Dockerfile

Current container behavior:

  • builds from python:3.12-slim
  • installs the project with uv sync --locked --no-dev
  • runs as a non-root appuser
  • excludes .env, local Bronze data, Git history, docs, tests, and common log files from the build context through .dockerignore
  • uses:
    • ENTRYPOINT ["python", "-m", "transit_ops.cli"]
    • CMD ["run-realtime-worker", "stm"]

Example local build:

docker build -t transit-ops-worker .

Example bounded local smoke test:

docker run --rm --env-file .env transit-ops-worker run-realtime-worker stm --max-cycles 1

The realtime worker still uses true start-to-start cadence timing in container mode because it reuses the same CLI and orchestration path as local execution.

Hosted realtime worker status

Hosted realtime deployment now runs on Railway.

Current hosted target:

  • project: transit-ops
  • environment: production
  • service: realtime-worker

Current runtime path:

  • Railway detects and builds the existing repo Dockerfile
  • the container starts with:
    • python -m transit_ops.cli run-realtime-worker stm
  • the service variables include:
    • NEON_DATABASE_URL
    • STM_API_KEY
    • BRONZE_S3_ACCESS_KEY
    • BRONZE_S3_SECRET_KEY
    • BRONZE_STORAGE_BACKEND=s3
    • BRONZE_S3_ENDPOINT=https://eccfb9bedd87d413eaf4cac6ae2285d3.r2.cloudflarestorage.com
    • BRONZE_S3_BUCKET=transit-raw
    • BRONZE_S3_REGION=auto
    • REALTIME_POLL_SECONDS=30
    • REALTIME_STARTUP_DELAY_SECONDS=0
    • STATIC_DATASET_RETENTION_COUNT=1
    • SILVER_REALTIME_RETENTION_DAYS=2
    • PROVIDER_TIMEZONE=America/Toronto
    • STM_PROVIDER_ID=stm
    • APP_ENV=production
    • LOG_LEVEL=INFO

What is now proven from the hosted service logs:

  • the worker starts successfully on Railway
  • hosted realtime cycles succeed end to end at 30s cadence
  • Bronze writes remain R2-backed with storage_backend = "s3"
  • Gold refreshes successfully after hosted realtime cycles
  • the hosted worker honors true start-to-start cadence:
    • observed cycle duration: 6.5–8.5 seconds (stable-state with 2-day retention window populated)
    • observed effective_start_to_start_seconds ≈ 30.001
    • computed sleep: ~21–24 seconds per cycle (21.9s minimum headroom observed)

The detailed Railway deployment notes live in:

  • docs/realtime-worker-hosting.md

Freshness and delay expectations

The static and realtime automation paths have different delay expectations:

  • static GTFS is intended to run once per day through GitHub Actions
  • the included GitHub Actions workflow currently schedules the static pipeline daily at 06:00 UTC
  • that currently lines up with 2:00 AM Eastern while EDT is in effect
  • because GitHub Actions cron is UTC-based, the schedule may need a seasonal UTC adjustment during EST if the desired local run time remains 2:00 AM Eastern
  • realtime is intended to run continuously through the worker container
  • the production worker cadence is one full realtime cycle every 30 seconds

For live data, the practical delay is:

  • one polling interval
  • plus the actual time to capture both GTFS-RT endpoints
  • plus the Silver loads
  • plus the Gold rebuild

In practice, that means the current default operating target is roughly sub-minute to low-minute freshness, not instant streaming. If one realtime endpoint fails, the cycle reports the partial failure explicitly instead of pretending both feeds are fresh.
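
As a worked example with the hosted numbers above: at the 30-second production cadence and an observed 6.5–8.5 second cycle duration, data captured at the start of a cycle reaches Gold within that same cycle, and just before the next refresh lands the newest snapshot on a dashboard is roughly 30 + 8 ≈ 38 seconds old, comfortably inside the sub-minute target.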

Intentionally deferred

The following work is intentionally out of scope for V1:

  • Neon Data API exposure
  • public packaging and case study write-up under transit.yesid.dev
  • Power BI "Publish to web" public embed (pending portfolio site update)
  • database-level ET timezone columns in KPI views (DAX -4/24 workaround in place)

Why provider manifests exist

Provider manifests keep provider metadata and feed definitions out of the CLI and away from hardcoded STM constants. They provide one small extension point for:

  • provider metadata
  • GTFS / GTFS-RT feed definitions
  • refresh cadence metadata
  • auth metadata shape
  • future provider expansion without changing the database schema

This is intentionally a small YAML registry, not a plugin framework.
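
A sketch of what the validated manifest shape could look like as a Pydantic model (field names are illustrative, not the repo's actual schema):

from pydantic import BaseModel

class FeedDefinition(BaseModel):
    endpoint_key: str                 # e.g. "static_schedule", "trip_updates"
    url: str
    kind: str                         # e.g. "gtfs_static" or "gtfs_rt"
    refresh_cadence_seconds: int | None = None

class ProviderManifest(BaseModel):
    provider_id: str                  # e.g. "stm"
    name: str
    timezone: str                     # e.g. "America/Toronto"
    auth: dict[str, str] = {}         # auth metadata shape only; secrets stay in env vars
    feeds: list[FeedDefinition]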

Why STM is the only active provider in V1

V1 stays STM-only on purpose:

  • it keeps the portfolio story concrete
  • it avoids building a fake multi-provider platform too early
  • it keeps ingestion and reporting design grounded in one real provider

The code is provider-ready within GTFS / GTFS-RT, but STM is the only manifest the repo actively ships today.

How future providers can be added

Future GTFS providers can be added by:

  1. creating a new manifest under config/providers/
  2. using the same validated manifest structure
  3. reusing the registry and CLI inspection commands
  4. wiring later ingestion slices against the validated manifest data

No database schema changes are required just to register another GTFS provider.

Install and run

  1. Copy .env.example to .env and fill in:
  • NEON_DATABASE_URL
  • BRONZE_S3_ACCESS_KEY
  • BRONZE_S3_SECRET_KEY
  • STM_API_KEY if you plan to run live GTFS-RT capture
  2. Install dependencies:
uv sync
  3. Inspect the current configuration:
uv run transit-ops show-config
  4. Inspect provider manifests:
uv run python -m transit_ops.cli list-providers
uv run python -m transit_ops.cli show-provider stm
  5. Test the Neon connection:
uv run transit-ops db-test
  6. Initialize the database schemas and tables:
uv run transit-ops init-db
  7. Seed STM provider metadata:
uv run transit-ops seed-core
  8. Run one Bronze static STM ingestion:
uv run python -m transit_ops.cli ingest-static stm
  9. Run one Bronze realtime STM capture:
uv run python -m transit_ops.cli capture-realtime stm trip_updates
uv run python -m transit_ops.cli capture-realtime stm vehicle_positions
  10. Load the latest Bronze static STM archive into Silver:
uv run python -m transit_ops.cli load-static-silver stm
  11. Load the latest Bronze realtime STM snapshots into Silver:
uv run python -m transit_ops.cli load-realtime-silver stm trip_updates
uv run python -m transit_ops.cli load-realtime-silver stm vehicle_positions
  12. Rebuild the current Gold marts and KPI-ready tables:
uv run python -m transit_ops.cli build-gold-marts stm
  13. Refresh only the latest Gold realtime state without doing a full backfill:
uv run python -m transit_ops.cli refresh-gold-realtime stm
  14. Prune old Silver storage and optionally compact the large tables:
# preview what would be deleted without executing
uv run python -m transit_ops.cli prune-silver-storage stm --dry-run
uv run python -m transit_ops.cli prune-silver-storage stm

# vacuum all maintenance tables, or target specific tables
uv run python -m transit_ops.cli vacuum-storage stm
uv run python -m transit_ops.cli vacuum-storage stm --table silver.trip_updates --table silver.vehicle_positions

# prune Gold fact rows and Bronze R2 objects (standalone operator commands)
uv run python -m transit_ops.cli prune-gold-storage stm --dry-run
uv run python -m transit_ops.cli prune-gold-storage stm
uv run python -m transit_ops.cli prune-bronze-storage stm --dry-run
uv run python -m transit_ops.cli prune-bronze-storage stm
  15. Build warm rollup tables and prune old rollup history:
# build all 5-minute warm rollup periods not yet computed
uv run python -m transit_ops.cli build-warm-rollups stm

# build from a specific date (useful for backfill or recompute)
uv run python -m transit_ops.cli build-warm-rollups stm --since 2026-03-01

# prune warm rollup rows older than GOLD_WARM_ROLLUP_RETENTION_DAYS
uv run python -m transit_ops.cli prune-warm-rollup-storage stm
  16. Run the one-shot orchestration commands:
uv run python -m transit_ops.cli run-static-pipeline stm
uv run python -m transit_ops.cli run-realtime-cycle stm
  17. Run the continuous realtime worker:
uv run python -m transit_ops.cli run-realtime-worker stm
  18. The module entrypoint also works:
uv run python -m transit_ops.cli --help