# Locomotive: silver layer + BQ governance
This is the data-layer initiative running on top of Wave-1 infrastructure. Wave 1 ships the platform (Airflow on GKE, DataHub Phase 2 next); the locomotive ships the canonical surface of data on top of it.
The locomotive locks the BigQuery surface so writes only land through a reviewed PR, builds 13 curated silver tables as the new source of truth for downstream analytics, and routes future changes through a dbt project living in ume-data-engineering with CODEOWNERS and AI review.
Bad business rules in early silver tables are acceptable. They get fixed through the normal model-update PR flow. The point is to ship the canonical surface and the governance pattern, not to land perfect modelling on day one.
## Initial brief (organized and corrected)
This is the original brief, with corrections applied.
### Goal
Move fast on a data-layer reset that begins paying off in weeks. Pace is measured by silver tables shipped per week and legacy consumers migrated.
### Source state
- The current CDC pipeline outputs raw tables in BigQuery from the production Postgres on GCP.
- Today, anyone with BQ access can create tables and views directly on top of these raw CDC tables, with no governance.
- This has produced fragmentation, abandoned tables, and divergent business rules.
### What we are building (corrected)
- A new GCP project (immediate, not deferred) for the silver layer.
- 13 curated silver tables built directly from raw CDC. The list is locked (below). These are claimed to cover roughly 80% of analytical use cases.
- All future downstream tables and views must source only from silver, never from raw CDC.
Terminology note: at UME, "silver" = these 13 curated tables. Bronze = raw CDC. This is non-standard medallion mapping. Do not assume the usual meaning.
### What we are doing to BigQuery
- Cut BQ data write access. Users can no longer create views or tables directly.
- Pulumi was the original idea; corrected to Terraform (the entire infra repo is Terraform; introducing a second IaC chain has no upside).
- Control dataset creation, user and group policies through Terraform.
- Bronze (raw CDC) reads are eventually disabled once consumers are migrated.
### Developer experience
- Dev-X happens through a dbt monorepo.
- We considered dbt Cloud. Locked decision: stay on Cosmos + dbt-core (already deployed on GKE Airflow). Reconsider dbt Cloud after the first 5 silver tables are live with a measured cost/velocity baseline.
- Reference DAGs repo: `ume-data-dags` (template only; will be retired as Cosmos points at the new dbt project).
- The dbt monorepo lives in `ume-data-engineering` on a long-lived `migration` branch (cut from `dev`). Brand-new dbt project at `app/datalake/transformations/`.
### Collaboration and PR gates
- Repository skills that scaffold new implementations automatically.
- AI bots that review PRs against those skills.
- Static analysis (SQLFluff, dbt-checkpoint, schema lint).
- PR gates: CODEOWNERS by company area + at least one data-platform reviewer at root.
### Ingestion blueprints
- Beyond CDC, we need easy ways to ingest jsonl/json logs, csv, FTP.
- Same pattern as silver: PR gates, AI bot, skills.
- Home: `ume-data-engineering/ingestion/`.
### Observability and governance (later, but designed now)
- Separate GCP project for the silver layer (immediate).
- Metadata-driven observability: what was ingested, when, what failed. Alerting to Slack.
- BigQuery usage datamart with users, tables, scanned bytes, cost, last-time-used. Powers cost dashboards and the lifecycle policy.
- Detect things created outside the expected path (audit-log alerts on raw `CREATE TABLE`).
- Telemetry stack: Metabase only for dashboards and alerts. Looker Studio dropped to keep one tool. Metrics layer = dbt `meta__*` models.
### Domain ownership
- From silver onwards, domain tables are owned by company areas, always committing to the dbt monorepo. Silver itself is owned by the data platform team.
- We are not going atendimento-first; the silver tables themselves drive sequencing.
### Lifecycle
- Tables not used or updated get a lifecycle policy (cold storage or deletion).
- If a table is still referenced in dbt, an automated PR to deprecate it is the trigger.
- Designed in the locomotive, automated in Wave 2.
### New Metabase instance
- Built like `ume-data-infra` (Terraform + Helm + dependabot) so it can be upgraded without downtime.
- Migration of golden-standard dashboards follows silver readiness, not a domain-first plan.
### Governance via DataHub
- DataHub scanning of BigQuery, Airflow, dbt manifests is in scope for the locomotive.
- Similarity scoring + PR-bot to prevent fragmentation (too-similar models, abandoned tables) is Wave 2, with groundwork laid now (DataHub metadata pipeline + simple column-overlap heuristic in PR review).
### Silver list (locked, 13 tables)
First two to attack: `contracts` and `installments`.
## Locked decisions
## Earliest visible milestones
Stakeholder-visible news, in the rough order each can land. Each is a short post in #data-platform. Cadence is the morale signal.
- Day 1: silver list pinned (this doc); metrification dashboard wireframe shared.
- First silver project `terraform plan` clean in CI.
- First silver model materialized in BigQuery (target: `contracts`) via Cosmos against silver-dev.
- First Metabase dashboard reading from `meta__bq_usage`: who scanned how much yesterday, who has shadow tables.
- First sample PR through every gate in `ume-data-engineering`: dbt parse, schema lint, AI similarity check, CODEOWNERS approval.
- First Slack alert fires from Metabase on a forced threshold breach.
- DataHub shows lineage from a silver model back to its raw CDC source.
- First sandbox dataset created for a willing pilot team, with TTL working.
- First non-CDC source onboarded through an ingestion blueprint.
- First legacy consumer migrated off raw CDC onto a silver table.
## Workstreams
Each workstream is sized to be picked up by one specialist. Specialist names match files in `docs/infrastructure/agents/` where applicable; new ones are flagged. Exit criteria are binary.
### WS-1: Silver GCP project (Terraform)
Specialist: infra-terraform (existing).
Scope: new GCP project (`ume-data-silver-prod` and `-dev`), VPC reuse, BigQuery datasets per silver group, service accounts for Airflow-writer and Metabase-reader, IAM groups by domain. New env stacks: `environments/silver-01-base`, `environments/silver-02-bigquery`. Reuse `vpc`, `iap-oauth`; add a `bq-dataset` module if missing.
Exit criteria:
- `terraform apply` from CI provisions silver-dev end-to-end with no manual steps.
- Airflow's existing service account writes into silver-dev via Workload Identity.
- `bq query` from a non-allowlisted user against silver-dev returns a permission error.
- Same stack applied to silver-prod (manual approval gate in CI).
### WS-2: BigQuery surface lockdown
Specialist: bq-governance (new).
Scope:
- Inventory writes via `INFORMATION_SCHEMA.JOBS_BY_PROJECT` over 90 days; produce a register of shadow tables, owners, access paths.
- Communicate the timeline; create per-team `sandbox_<team>` datasets with 30-day partition expiration.
- Stage revoke of `dataEditor` on legacy datasets in waves; gate dataset creation behind Terraform.
- Audit-log alert in Cloud Monitoring on any unexpected `CREATE TABLE` outside silver and sandboxes.
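A minimal sketch of the inventory query, using the documented `JOBS_BY_PROJECT` columns; the region qualifier and the statement-type list are illustrative and would be tuned during the actual audit:

```sql
-- Who wrote what, where, over the last 90 days (input for the shadow-table register).
SELECT
  user_email,
  destination_table.project_id AS project_id,
  destination_table.dataset_id AS dataset_id,
  destination_table.table_id   AS table_id,
  COUNT(*)                     AS write_jobs,
  MAX(creation_time)           AS last_write
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
  AND statement_type IN ('CREATE_TABLE', 'CREATE_TABLE_AS_SELECT', 'INSERT', 'MERGE', 'UPDATE')
GROUP BY 1, 2, 3, 4
ORDER BY write_jobs DESC;
```

Grouping by destination table rather than by job makes the ≥90% attribution target directly checkable from this one result set.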
Exit criteria:
- Shadow-table register published with ≥90% of writes attributed to a known user/team.
- All sandboxes provisioned with TTL verified (a test row vanishes after expiration).
- `dataEditor` removed from at least one legacy dataset without breaking any consumer (verified via 7-day query log).
- Audit-log alert fires on a forced test `CREATE TABLE` and posts to the platform Slack channel.
### WS-3: New dbt project in `ume-data-engineering`
Specialist: dbt-silver (new; or evolve composer-dags into airflow-dags).
Scope: cut `migration` branch on `ume-data-engineering` from `dev`. Brand-new dbt project at `app/datalake/transformations/`: `models/silver/`, `models/meta/`, `sources/cdc.yml` pointing at raw CDC. CI: `dbt parse`, `dbt build` against silver-dev, schema diff comment on PR. Pre-commit: SQLFluff, dbt-checkpoint. Cosmos config in `ume-data-dags` (or its successor) points at this folder. First two silver models built: `contracts` and `installments`.
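As a sketch of what the first silver model could look like (column names are illustrative, not the locked schema; real business rules land through the normal model-update PR flow):

```sql
-- models/silver/contracts.sql (illustrative; assumes a `cdc` source in sources/cdc.yml)
{{ config(materialized='table') }}

with source as (
    select * from {{ source('cdc', 'contracts') }}
)

select
    id as contract_id,
    customer_id,
    status,
    created_at,
    updated_at
from source
-- CDC streams can carry several versions of the same row; keep only the latest.
qualify row_number() over (partition by id order by updated_at desc) = 1
```

The `qualify` dedupe is the piece that makes silver a clean surface over raw CDC even before the business rules are right.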
Exit criteria:
- `migration` branch exists with branch protection (required reviews, required status checks).
- `dbt parse` and `dbt build` both pass in CI on a PR.
- `silver.contracts` and `silver.installments` materialise in `ume-data-silver-dev` via a Cosmos run triggered from Airflow.
- Manifest + run-results uploaded to GCS after every run (input for `meta__dbt_runs`).
### WS-4: Dev-X and PR gates
Specialist: dev-x (new).
Scope: CODEOWNERS by domain folder + a data-platform reviewer at the repo root. Repo skills in `.claude/skills/` for "add silver model", "add meta dashboard", "add ingestion blueprint". AI PR-review bot wired with concrete checks: column-overlap heuristic against existing silver models (similarity-prevention groundwork), schema lint, test-coverage threshold, manifest diff summary. Branch protection enforces required checks + required reviewers.
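The column-overlap heuristic could be approximated in one query against `INFORMATION_SCHEMA.COLUMNS`; the proposed-column list would come from the PR's compiled schema, and the 0.6 threshold is a placeholder, not an agreed number:

```sql
-- Share of each existing silver table's columns that the proposed model duplicates.
WITH proposed AS (
  -- Placeholder values; in the bot these come from the PR's compiled model schema.
  SELECT column_name
  FROM UNNEST(['contract_id', 'customer_id', 'status', 'created_at']) AS column_name
),
existing AS (
  SELECT table_name, column_name
  FROM `ume-data-silver-dev.silver`.INFORMATION_SCHEMA.COLUMNS
)
SELECT
  e.table_name,
  COUNTIF(p.column_name IS NOT NULL) / COUNT(DISTINCT e.column_name) AS overlap_ratio
FROM existing AS e
LEFT JOIN proposed AS p USING (column_name)
GROUP BY e.table_name
HAVING overlap_ratio > 0.6  -- warn threshold; tune during WS-4
ORDER BY overlap_ratio DESC;
```

A name-only overlap is crude, but it is cheap enough to run on every PR and is exactly the groundwork Wave 2's DataHub similarity scoring replaces.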
Exit criteria:
- CODEOWNERS file in `migration` branch covers every silver folder; a push without an approval is blocked.
- Test PR with a deliberately too-similar model triggers the AI bot's similarity warning.
- Test PR with a missing dbt test fails the coverage check.
- At least one repo skill produces a working scaffold ("add silver model" generates a PR-ready file).
### WS-5: Ingestion blueprints
Specialist: data-ingestion (new).
Scope: blueprints for jsonl/json logs, csv, FTP. Each blueprint: a DAG factory, schema contract, landing-zone convention, smoke tests, observability hook (writes to `meta__ingestion_runs`). Same CODEOWNERS + AI review pattern as WS-4. Home: `ume-data-engineering/ingestion/`.
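The landing step of the jsonl blueprint might reduce to a single `LOAD DATA` statement; the dataset, table, and bucket names below are placeholders for whatever the landing-zone convention fixes:

```sql
-- Land a jsonl drop into a raw landing table (names illustrative).
-- The blueprint's DAG factory would template the table and URI per source.
LOAD DATA INTO `ume-data-silver-dev.landing.events_raw`
FROM FILES (
  format = 'NEWLINE_DELIMITED_JSON',
  uris   = ['gs://ume-landing-dev/events/*.jsonl']
);
```

Keeping the load declarative like this leaves the DAG factory responsible only for scheduling, the schema contract, and the `meta__ingestion_runs` emit.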
Exit criteria:
- Three blueprints (jsonl, csv, ftp) committed with a README and a smoke test each.
- One real non-CDC source onboarded through a blueprint, end-to-end into a landing dataset.
- That source emits a row to `meta__ingestion_runs` after each run.
### WS-6: Telemetry datamart + Metabase observability
Specialist: dbt-silver for the meta models, infra-terraform for the Metabase wiring.
Scope: dbt meta models in the silver project:
- `meta__bq_usage` from `INFORMATION_SCHEMA.JOBS_BY_PROJECT` (user, table, scanned bytes, cost, last_run).
- `meta__table_lifecycle` (last_queried, last_modified, query_count_30d, owner). Feeds future lifecycle PRs.
- `meta__ingestion_runs` (DAG, table, rows, duration, status, error_class).
- `meta__dbt_runs` (manifest + run_results loaded by a post-run hook).
Metabase: collection per area, alerts on freshness lag, ingestion failures, cost spikes, "table not queried in N days". Alerts to Slack via Metabase's native integration.
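A sketch of `meta__bq_usage` under the assumptions above; the price constant is the on-demand list rate and would be swapped for the real contract rate:

```sql
-- models/meta/meta__bq_usage.sql (illustrative sketch)
-- Attributes a job's billed bytes to every table it referenced, so a
-- multi-table join double-counts bytes; acceptable for a who-scans-what view.
select
    j.user_email,
    rt.dataset_id,
    rt.table_id,
    date(j.creation_time)                         as usage_date,
    sum(j.total_bytes_billed)                     as bytes_billed,
    -- On-demand pricing assumed at 6.25 USD/TiB; adjust to the actual rate.
    sum(j.total_bytes_billed) / pow(2, 40) * 6.25 as est_cost_usd,
    max(j.creation_time)                          as last_run
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT as j,
     unnest(j.referenced_tables) as rt
where j.creation_time >= timestamp_sub(current_timestamp(), interval 30 day)
group by 1, 2, 3, 4
```

The "Cost yesterday" question and the shadow-table view both fall out of this one model with a date filter.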
Exit criteria:
- All four `meta__*` models build daily and have at least one dbt test each.
- Metabase has a "Cost yesterday" question that loads in <5s.
- At least three Metabase alerts wired to Slack (one freshness, one cost, one ingestion failure).
- A forced threshold breach triggers a Slack message in <10 min.
### WS-7: New Metabase instance
Specialist: infra-terraform.
Scope: adapt the airflow-helm module pattern in ume-data-infra for Metabase. Cloud SQL Postgres backend (HA), IAP fronting, immutable image tags, dependabot bumping chart and image versions through PR. Workload Identity for the read-only BQ service account.
Exit criteria:
- `metabase.<domain>` reachable via IAP for the seed allowlist.
- Dependabot opens a successful image-bump PR within the first 2 weeks of deploy.
- Metabase reads silver and `meta__*` tables; cannot write anywhere.
- Restart of the Cloud SQL Postgres causes <2 min of Metabase downtime.
### WS-8: Legacy consumer migration
Specialist: bq-governance + domain owners.
Scope: catalogue every table/view sourcing from raw CDC; assign each a target silver source; produce a migration runbook + sunset date per consumer. Tracked in the same metrification dashboard as silver delivery.
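The "verified within tolerance" check per consumer could be as small as this; table names are placeholders and the 0.1% tolerance is illustrative, not the agreed SLA:

```sql
-- Compare a repointed consumer's silver source against the raw-CDC original.
WITH counts AS (
  SELECT
    (SELECT COUNT(*) FROM `ume-data-silver-prod.silver.contracts`) AS silver_rows,
    (SELECT COUNT(*) FROM `legacy-project.cdc.contracts`)          AS raw_rows
)
SELECT
  silver_rows,
  raw_rows,
  ABS(silver_rows - raw_rows) / raw_rows <= 0.001 AS within_tolerance
FROM counts;
```

Row counts are only the first gate; a consumer that aggregates would also want a checksum or spot-value comparison before its sunset date is confirmed.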
Exit criteria:
- Register of legacy consumers complete; each row tagged with target silver model or "to sunset".
- At least 5 consumers repointed at silver and verified (row counts within tolerance vs raw-CDC version).
- Sunset dates communicated to all owners.
- BQ lockdown (WS-2) gates moved forward only after all consumers in a wave are repointed or sunset.
### WS-9: DataHub scanning of silver + sources
Specialist: datahub-platform (existing).
Scope: ship DataHub Phase 2 per current roadmap. Wire ingestion recipes for BigQuery (silver project + raw CDC), Airflow DAGs, dbt manifests from ume-data-engineering. Tag silver models as the canonical layer. Lineage between raw CDC and silver visible. No similarity-detection logic yet; that is Wave 2.
Exit criteria:
- DataHub running in dev with all three ingestion recipes scheduled in Airflow.
- Every shipped silver model appears in DataHub with its dbt description, tags, and column-level lineage to raw CDC.
- Search for a raw CDC table shows downstream silver models.
- DataHub UI reachable via IAP for the seed allowlist.
### WS-10: Lifecycle policy (design now, automate Wave 2)
Specialist: dev-x.
Scope: write the rule set (e.g. table not queried in 90 days + not referenced in dbt manifest → deprecation candidate). The read-only report builds in WS-6. Automation (auto-PR to deprecate from ume-data-engineering) is Wave 2.
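The rule set reduces to a query over `meta__table_lifecycle`; the `in_dbt_manifest` flag is hypothetical (it would be joined in from the `meta__dbt_runs` manifest data), and table naming is illustrative:

```sql
-- Deprecation candidates per the draft rule: cold for 90 days AND not in dbt.
SELECT
  table_name,
  owner,
  last_queried,
  query_count_30d
FROM `ume-data-silver-prod.meta.meta__table_lifecycle`
WHERE last_queried < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
  AND NOT in_dbt_manifest  -- hypothetical flag derived from the dbt manifest
ORDER BY last_queried;
```

In the locomotive this only feeds the weekly Metabase report; Wave 2 turns the same result set into auto-PRs.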
Exit criteria:
- Lifecycle rules written in `docs/infrastructure/lifecycle.md` and accepted in a PR.
- Weekly Metabase report listing deprecation candidates ships.
- First review meeting held; at least one candidate confirmed for deprecation.
## Metrification
One dashboard fed by the `meta__*` models. Four counters tracked weekly.
Targets are intentionally absent from the main task list. The curve is the conversation.
## Wave 2 (acknowledged, not scoped here)
- DataHub similarity scoring + PR-bot that flags new silver models with high column overlap to existing ones.
- Lifecycle automation: auto-PR to `ume-data-engineering` to deprecate cold tables.
- dbt Cloud re-evaluation with concrete cost/velocity baseline.
- Per-domain Metabase migrations off the legacy instance.
## Critical files and paths
`ume-data-infra/` (Terraform)
- `environments/silver-01-base/` (new)
- `environments/silver-02-bigquery/` (new)
- `environments/dev-04-metabase/` (new)
- `modules/bq-dataset/` (new, if not present)

`ume-data-engineering/` (new `migration` branch off `dev`)
- `app/datalake/transformations/` (brand-new dbt project)
- `app/datalake/transformations/dbt_project.yml`, `profiles.yml`
- `app/datalake/transformations/models/silver/`, `models/meta/`, `sources/cdc.yml`
- `ingestion/templates/{cdc,jsonl,csv,ftp}/` (WS-5)
- `.github/workflows/{pr-ci.yml,build.yml,manifest-upload.yml}`
- `.github/CODEOWNERS`
- `.claude/skills/` (add-silver-model, add-meta-dashboard, add-ingestion-blueprint)

`ume-data-dags/`
- Stays as the reference template; Cosmos config evolves to point at `ume-data-engineering/app/datalake/transformations`.
## Verification (end-to-end)
- `terraform plan` clean across silver-01-base, silver-02-bigquery, dev-04-metabase.
- One sample PR on the `migration` branch triggers all gates: dbt parse, schema lint, AI similarity check, manifest diff, CODEOWNERS approval.
- Cosmos in dev runs against `ume-data-silver-dev` and produces `silver.contracts` and `silver.installments`.
- `meta__bq_usage` populated; Metabase question shows yesterday's spend; alert fires to Slack on a forced threshold breach.
- DataHub shows the silver model with column lineage back to the raw CDC source.
- BQ audit-log alert fires on a manual `CREATE TABLE` outside the silver project.
## Duration estimate
The main task list is not time-locked. Rough envelope:
- Best case: ~6 weeks to all exit criteria green.
- Realistic: ~10 weeks. WS-2 drags because of communication and consumer migration; WS-9 depends on DataHub Phase 2 stability.
- Worst case: ~16 weeks. DataHub Phase 2 instability, Cloud SQL HA surprises in WS-7, or stakeholder pushback on the lockdown timeline.
Even on the worst-case path, the first 4–5 visible milestones land within the first month.
## Tasks (flat list, prioritized)
Each task is sized to be picked up independently. Exit criteria are restated for clarity.
### Foundation (parallelizable, do these first)
- T-01 — Clone `ume-data-engineering` locally under `~/git/`. Exit: repo present, `dev` branch checked out.
- T-02 — Cut `migration` branch off `dev` on `ume-data-engineering` and push with branch protection (required reviews, required status checks). Exit: branch live on remote with protection rules.
- T-03 — Scaffold `environments/silver-01-base/` in `ume-data-infra` (project, VPC reuse, IAM groups). Exit: `terraform plan` succeeds locally.
- T-04 — Scaffold `environments/silver-02-bigquery/` (datasets per silver group, SAs for Airflow-writer + Metabase-reader). Exit: `terraform plan` succeeds locally.
- T-05 — Add `modules/bq-dataset/` if not already present. Exit: module created, used by silver-02-bigquery.
- T-06 — Open infra PR for silver-01-base + silver-02-bigquery. Exit: PR green, CI plan output posted as comment.
- T-07 — `terraform apply` silver-dev from CI (manual approval). Exit: project + datasets exist; Airflow SA writes a test row via Workload Identity.
### dbt project bootstrap
- T-10 — Create `app/datalake/transformations/` skeleton on `migration` branch: `dbt_project.yml`, `profiles.yml`, `models/silver/`, `models/meta/`, `sources/cdc.yml`. Exit: `dbt parse` passes locally.
- T-11 — Wire CI: `dbt parse` on PR, `dbt build` against silver-dev, schema diff comment, SQLFluff + dbt-checkpoint pre-commit. Exit: a sample PR runs all checks green.
- T-12 — Build `silver.contracts` model + dbt tests (not_null, unique on PK). Bad business rules acceptable; doc TODO inline. Exit: model materialises in silver-dev via local `dbt run`.
- T-13 — Build `silver.installments` model + dbt tests. Exit: model materialises in silver-dev.
- T-14 — Wire Cosmos in `ume-data-dags` (or successor) to point at `app/datalake/transformations`. Exit: an Airflow DAG run produces both silver models in silver-dev.
- T-15 — Manifest + run-results upload to GCS as a post-run Cosmos hook. Exit: artifacts present in GCS bucket after each run.
### Governance + dev-X
- T-20 — Add `.github/CODEOWNERS` covering each silver folder + a data-platform reviewer at root. Exit: a push without approval is blocked on `migration`.
- T-21 — Add `.claude/skills/add-silver-model` skill that scaffolds a new model + sources + tests. Exit: invoking the skill produces a PR-ready file set.
- T-22 — Wire AI PR-review bot: column-overlap similarity check against existing silver models, schema lint, test-coverage threshold, manifest diff summary. Exit: a deliberately-similar test PR triggers the bot's warning.
### BQ lockdown (WS-2)
- T-30 — Build the shadow-table inventory query against `INFORMATION_SCHEMA.JOBS_BY_PROJECT` (90-day window). Publish results as a register doc. Exit: ≥90% of writes attributed to a known user/team.
- T-31 — Provision per-team `sandbox_<team>` datasets via Terraform with 30-day partition expiration. Exit: a test row in a sandbox vanishes after expiration.
- T-32 — Audit-log alert in Cloud Monitoring on `CREATE TABLE` outside silver + sandboxes; route to Slack. Exit: a forced test fires the alert.
- T-33 — Stage-1 revoke of `dataEditor` on one pilot legacy dataset. Exit: revoke applied; 7-day query log shows no consumer breakage.
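For T-31, the Terraform module would set the equivalent of this DDL (project and team names are placeholders):

```sql
-- Sandbox dataset whose day partitions expire after 30 days, so a test row
-- in a partitioned table vanishes once its partition ages out.
CREATE SCHEMA IF NOT EXISTS `ume-data-silver-dev.sandbox_payments`
OPTIONS (default_partition_expiration_days = 30);
```

Note the expiration applies per partition, not per row: the TTL test should insert into a date-partitioned table and check the whole partition disappears.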
### Telemetry datamart + Metabase
- T-40 — Build `meta__bq_usage` (dbt model) + at least one dbt test. Exit: model builds daily, populated for the prior day.
- T-41 — Build `meta__table_lifecycle`. Exit: builds daily; rows for every BQ table in scope.
- T-42 — Build `meta__ingestion_runs` (schema only initially; populated as ingestion blueprints emit). Exit: schema deployed, sample row from a smoke test present.
- T-43 — Build `meta__dbt_runs` from manifest + run-results uploaded by T-15. Exit: rows for every Cosmos run.
- T-44 — Scaffold `environments/dev-04-metabase/` Terraform: Metabase Helm release, Cloud SQL HA backend, IAP fronting, dependabot. Exit: `metabase.<domain>` reachable via IAP for seed allowlist.
- T-45 — First Metabase question + dashboard reading `meta__bq_usage`: "Cost yesterday". Exit: dashboard loads in <5s.
- T-46 — Wire 3 Metabase alerts to Slack: freshness lag, cost spike, ingestion failure. Exit: a forced threshold breach posts to Slack in <10 min.
### Ingestion blueprints
- T-50 — Create `ume-data-engineering/ingestion/` with three blueprints: jsonl, csv, ftp. README + smoke test each. Exit: three blueprints committed and CI-green.
- T-51 — Onboard one real non-CDC source via a blueprint. Exit: source lands in BigQuery and emits to `meta__ingestion_runs`.
### DataHub scanning (WS-9)
- T-60 — DataHub Phase 2 deployed per current roadmap. Exit: UI reachable via IAP for seed allowlist.
- T-61 — DataHub ingestion recipes: BigQuery (silver + raw CDC), Airflow DAGs, dbt manifests. Scheduled in Airflow. Exit: all three recipes run nightly.
- T-62 — Every shipped silver model appears in DataHub with description, tags, column-level lineage to raw CDC. Exit: searching a raw CDC table shows downstream silver models.
### Legacy consumer migration (WS-8)
- T-70 — Catalogue all tables/views sourcing from raw CDC. Tag each with a target silver model or "to sunset". Exit: register published.
- T-71 — Repoint at least 5 consumers at silver, verify with row-count tolerance vs raw-CDC version. Exit: 5 consumers green; sunset dates communicated for the rest.
### Lifecycle (WS-10, design only in locomotive)
- T-80 — Author `docs/infrastructure/lifecycle.md` with the rule set (e.g. not queried in 90 days + not referenced in dbt manifest → deprecation candidate). Exit: PR merged.
- T-81 — Build a weekly Metabase report listing deprecation candidates from `meta__table_lifecycle`. Exit: report runs weekly; first review meeting held.
## Open items still to resolve
- Confirm Metabase domain and IAP allowlist seed.
- Identify 2–3 pilot domains willing to migrate first consumers off legacy CDC.