# Locomotive: silver layer + BQ governance

This is the data-layer initiative running on top of wave-1 infrastructure. Wave-1 ships the platform (Airflow on GKE, DataHub Phase 2 next). The locomotive ships the canonical surface of data on top of it.

The locomotive locks the BigQuery surface so writes only land through a reviewed PR, builds 13 curated silver tables as the new source of truth for downstream analytics, and routes future changes through a dbt project living in ume-data-engineering with CODEOWNERS and AI review.

Bad business rules in early silver tables are acceptable. They get fixed through the normal model-update PR flow. The point is to ship the canonical surface and the governance pattern, not to land perfect modelling on day one.

# Initial brief (organized and corrected)

This is the original brief, with corrections applied.

# Goal

Move fast on a data-layer reset that begins paying off in weeks. Pace is measured by silver tables shipped per week and legacy consumers migrated.

# Source state

  • The current CDC pipeline lands raw tables in BigQuery from the production Postgres on GCP.
  • Today, anyone with BQ access can create tables and views directly on top of these raw CDC tables, with no governance.
  • This produced fragmentation, abandoned tables, and divergent business rules.

# What we are building (corrected)

  • A new GCP project (immediate, not deferred) for the silver layer.
  • 13 curated silver tables built directly from raw CDC. The list is locked (below). These are claimed to cover roughly 80% of analytical use cases.
  • All future downstream tables and views must source only from silver, never from raw CDC.

Terminology note: at UME, "silver" = these 13 curated tables. Bronze = raw CDC. This is non-standard medallion mapping. Do not assume the usual meaning.

# What we are doing to BigQuery

  • Cut BQ data write access. Users can no longer create views or tables directly.
  • Pulumi was the original idea; corrected to Terraform (the entire infra repo is Terraform; introducing a second IaC chain has no upside).
  • Control dataset creation, user and group policies through Terraform.
  • Bronze (raw CDC) reads are eventually disabled once consumers are migrated.

# Developer experience

  • Dev-X happens through a dbt monorepo.
  • We considered dbt Cloud. Locked decision: stay on Cosmos + dbt-core (already deployed on GKE Airflow). Reconsider dbt Cloud after the first 5 silver tables are live with a measured cost/velocity baseline.
  • Reference DAGs repo: ume-data-dags (template only; will be retired as Cosmos points at the new dbt project).
  • The dbt monorepo lives in ume-data-engineering on a long-lived migration branch (cut from dev). Brand-new dbt project at app/datalake/transformations/.

# Collaboration and PR gates

  • Repository skills that scaffold new implementations automatically.
  • AI bots that review PRs against those skills.
  • Static analysis (SQLFluff, dbt-checkpoint, schema lint).
  • PR gates: CODEOWNERS by company area + at least one data-platform reviewer at root.

# Ingestion blueprints

  • Beyond CDC, we need easy ways to ingest jsonl/json logs, csv, FTP.
  • Same pattern as silver: PR gates, AI bot, skills.
  • Home: ume-data-engineering/ingestion/.

# Observability and governance (later, but designed now)

  • Separate GCP project for the silver layer (immediate).
  • Metadata-driven observability: what was ingested, when, what failed. Alerting to Slack.
  • BigQuery usage datamart with users, tables, scanned bytes, cost, and last-used timestamps. Powers cost dashboards and the lifecycle policy.
  • Detect things created outside the expected path (audit-log alerts on raw CREATE TABLE).
  • Telemetry stack: Metabase only for dashboards and alerts. Looker Studio dropped to keep one tool. Metrics layer = dbt meta__* models.

# Domain ownership

  • From silver onwards, domain tables are owned by company areas, always committing to the dbt monorepo. Silver itself is owned by the data platform team.
  • We are not taking an atendimento-first approach; sequencing is driven by the silver tables themselves.

# Lifecycle

  • Tables not used or updated get a lifecycle policy (cold storage or deletion).
  • If a table is still referenced in dbt, the deprecation trigger is an automated PR.
  • Designed in the locomotive, automated in Wave 2.

# New Metabase instance

  • Built like ume-data-infra (Terraform + Helm + dependabot) so it can be upgraded without downtime.
  • Migration of golden-standard dashboards follows silver readiness, not a domain-first plan.

# Governance via DataHub

  • DataHub scanning of BigQuery, Airflow, dbt manifests is in scope for the locomotive.
  • Similarity scoring + PR-bot to prevent fragmentation (too-similar models, abandoned tables) is Wave 2, with groundwork laid now (DataHub metadata pipeline + simple column-overlap heuristic in PR review).

# Silver list (locked, 13 tables)

First two to attack: contracts and installments.

| # | Table | Notes |
|---|-------|-------|
| 1 | source_product | Global rule |
| 2 | contracts | First target |
| 3 | installments | First target |
| 4 | borrowers | |
| 5 | renegotiations | |
| 6 | renegotiated_installments | |
| 7 | invoices | |
| 8 | installments_invoices | |
| 9 | application | |
| 10 | payments | |
| 11 | payment_invoices | |
| 12 | transaction_engine | |
| 13 | application_credit_evaluations | |

# Locked decisions

| Decision | Choice |
|----------|--------|
| IaC | Terraform. New silver project ships as a new env stack inside ume-data-infra. No Pulumi. |
| dbt repo | ume-data-engineering, long-lived migration branch cut from dev, dbt project at app/datalake/transformations/. |
| dbt runtime | Cosmos + dbt-core on the existing GKE Airflow. dbt Cloud reconsidered after the first 5 silver tables ship. |
| Telemetry | Metabase only for dashboards and alerts. Metrics = dbt meta__* models in BigQuery. |
| Data-quality alerts | dbt tests + Cosmos Slack callbacks. Elementary deferred. |
| DataHub in scope | Yes, but scanning only (BQ + Airflow + dbt manifests). Similarity scoring is Wave 2. |
| Ingestion blueprints | ume-data-engineering/ingestion/ (jsonl, csv, ftp). |
| Bad business rules | Acceptable in early silver tables. Fixed via normal PR flow. |
| Sequencing | Silver tables drive everything. No atendimento-first. |

# Earliest visible milestones

Stakeholder-visible news, in the rough order each can land. Each is a short post in #data-platform. Cadence is the morale signal.

  1. Day 1: silver list pinned (this doc); metrification dashboard wireframe shared.
  2. First silver project terraform plan clean in CI.
  3. First silver model materialized in BigQuery (target: contracts) via Cosmos against silver-dev.
  4. First Metabase dashboard reading from meta__bq_usage: who scanned how much yesterday, who has shadow tables.
  5. First sample PR through every gate in ume-data-engineering: dbt parse, schema lint, AI similarity check, CODEOWNERS approval.
  6. First Slack alert fires from Metabase on a forced threshold breach.
  7. DataHub shows lineage from a silver model back to its raw CDC source.
  8. First sandbox dataset created for a willing pilot team, with TTL working.
  9. First non-CDC source onboarded through an ingestion blueprint.
  10. First legacy consumer migrated off raw CDC onto a silver table.

# Workstreams

Each workstream is sized to be picked up by one specialist. Specialist names match files in docs/infrastructure/agents/ where applicable; new ones are flagged. Exit criteria are binary.

# WS-1: Silver GCP project (Terraform)

Specialist: infra-terraform (existing).

Scope: new GCP project (ume-data-silver-prod and -dev), VPC reuse, BigQuery datasets per silver group, service accounts for Airflow-writer and Metabase-reader, IAM groups by domain. New env stacks: environments/silver-01-base, environments/silver-02-bigquery. Reuse vpc, iap-oauth; add bq-dataset module if missing.

Exit criteria:

  • terraform apply from CI provisions silver-dev end-to-end with no manual steps.
  • Airflow's existing service account writes into silver-dev via Workload Identity.
  • bq query from a non-allowlisted user against silver-dev returns a permission error.
  • Same stack applied to silver-prod (manual approval gate in CI).

# WS-2: BigQuery surface lockdown

Specialist: bq-governance (new).

Scope:

  1. Inventory writes via INFORMATION_SCHEMA.JOBS_BY_PROJECT over 90 days; produce a register of shadow tables, owners, access paths.
  2. Communicate the timeline; create per-team sandbox_<team> datasets with 30-day partition expiration.
  3. Stage revoke of dataEditor on legacy datasets in waves; gate dataset creation behind Terraform.
  4. Audit-log alert in Cloud Monitoring on any unexpected CREATE TABLE outside silver and sandboxes.
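The inventory step boils down to classifying each write job's destination and attributing it to a team. A minimal sketch of that attribution logic, assuming job rows shaped like INFORMATION_SCHEMA.JOBS_BY_PROJECT output; the dataset prefixes and user-to-team mapping here are illustrative, not the real conventions:

```python
# Sketch: classify BigQuery write jobs into expected vs shadow writes.
# Job dicts mimic rows from INFORMATION_SCHEMA.JOBS_BY_PROJECT; the
# dataset naming convention and user->team mapping are assumptions.

ALLOWED_PREFIXES = ("silver_", "sandbox_")  # assumed naming convention

def classify_write(job, user_to_team):
    """Return (category, team) for one write job."""
    dataset = job["destination_dataset"]
    team = user_to_team.get(job["user_email"], "unknown")
    if dataset.startswith(ALLOWED_PREFIXES):
        return ("expected", team)
    return ("shadow", team)

def shadow_register(jobs, user_to_team):
    """Aggregate shadow writes per (dataset, team) for the register doc."""
    register = {}
    for job in jobs:
        category, team = classify_write(job, user_to_team)
        if category == "shadow":
            key = (job["destination_dataset"], team)
            register[key] = register.get(key, 0) + 1
    return register

jobs = [
    {"destination_dataset": "silver_core", "user_email": "airflow@ume.dev"},
    {"destination_dataset": "ad_hoc_maria", "user_email": "maria@ume.dev"},
    {"destination_dataset": "ad_hoc_maria", "user_email": "maria@ume.dev"},
]
teams = {"maria@ume.dev": "credit", "airflow@ume.dev": "platform"}
print(shadow_register(jobs, teams))
# {('ad_hoc_maria', 'credit'): 2}
```

The ≥90% attribution exit criterion then falls out of comparing attributed shadow writes against the total.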

Exit criteria:

  • Shadow-table register published with ≥90% of writes attributed to a known user/team.
  • All sandboxes provisioned with TTL verified (a test row vanishes after expiration).
  • dataEditor removed from at least one legacy dataset without breaking any consumer (verified via 7-day query log).
  • Audit-log alert fires on a forced test CREATE TABLE and posts to the platform Slack channel.

# WS-3: New dbt project in ume-data-engineering

Specialist: dbt-silver (new; or evolve composer-dags into airflow-dags).

Scope: cut migration branch on ume-data-engineering from dev. Brand-new dbt project at app/datalake/transformations/. models/silver/, models/meta/, sources/cdc.yml pointing at raw CDC. CI: dbt parse, dbt build against silver-dev, schema diff comment on PR. Pre-commit: SQLFluff, dbt-checkpoint. Cosmos config in ume-data-dags (or its successor) points at this folder. First two silver models built: contracts and installments.
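The schema-diff comment in CI reduces to comparing column sets between the baseline manifest and the PR's manifest. A hedged sketch in plain Python; the flat `{model: columns}` shape is a simplification, since a real dbt manifest.json nests columns under `nodes`:

```python
def schema_diff(base_columns, pr_columns):
    """Diff two {model: set(columns)} mappings into a PR comment body.

    Simplified shape: real dbt manifest.json nests columns under
    nodes['model.<project>.<name>']['columns'].
    """
    lines = []
    for model in sorted(set(base_columns) | set(pr_columns)):
        before = base_columns.get(model, set())
        after = pr_columns.get(model, set())
        for col in sorted(after - before):
            lines.append(f"{model}: + {col}")
        for col in sorted(before - after):
            lines.append(f"{model}: - {col}")
    return "\n".join(lines)

base = {"silver.contracts": {"id", "status"}}
pr = {"silver.contracts": {"id", "status", "signed_at"},
      "silver.installments": {"id", "due_date"}}
print(schema_diff(base, pr))
```

An empty diff means the PR comment step can be skipped entirely.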

Exit criteria:

  • migration branch exists with branch protection (required reviews, required status checks).
  • dbt parse and dbt build both pass in CI on a PR.
  • silver.contracts and silver.installments materialize in ume-data-silver-dev via a Cosmos run triggered from Airflow.
  • Manifest + run-results uploaded to GCS after every run (input for meta__dbt_runs).

# WS-4: Dev-X and PR gates

Specialist: dev-x (new).

Scope: CODEOWNERS by domain folder + a data-platform reviewer at the repo root. Repo skills in .claude/skills/ for "add silver model", "add meta dashboard", "add ingestion blueprint". AI PR-review bot wired with concrete checks: column-overlap heuristic against existing silver models (similarity-prevention groundwork), schema lint, test-coverage threshold, manifest diff summary. Branch protection enforces required checks + required reviewers.
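The column-overlap heuristic named above can be as simple as Jaccard similarity over column names. A minimal sketch; the 0.8 default threshold is an assumed starting point to be tuned against real silver models:

```python
def column_overlap(cols_a, cols_b):
    """Jaccard similarity between two column-name sets."""
    a, b = set(cols_a), set(cols_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

def similarity_warnings(new_model, new_cols, existing, threshold=0.8):
    """Flag existing silver models whose columns overlap too much.

    The 0.8 threshold is an assumption; tune it once real models exist.
    """
    return [
        (name, round(column_overlap(new_cols, cols), 2))
        for name, cols in existing.items()
        if column_overlap(new_cols, cols) >= threshold
    ]

existing = {"silver.contracts": {"contract_id", "borrower_id", "status", "signed_at"}}
warnings = similarity_warnings(
    "silver.contracts_v2",
    {"contract_id", "borrower_id", "status", "amount"},
    existing,
    threshold=0.5,
)
print(warnings)  # [('silver.contracts', 0.6)]
```

The bot would run this against column lists pulled from the dbt manifest and post any hits as a review comment.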

Exit criteria:

  • CODEOWNERS file in migration branch covers every silver folder; a push without an approval is blocked.
  • Test PR with a deliberately too-similar model triggers the AI bot's similarity warning.
  • Test PR with a missing dbt test fails the coverage check.
  • At least one repo skill produces a working scaffold ("add silver model" generates a PR-ready file).

# WS-5: Ingestion blueprints

Specialist: data-ingestion (new).

Scope: blueprints for jsonl/json logs, csv, FTP. Each blueprint: a DAG factory, schema contract, landing-zone convention, smoke tests, observability hook (writes to meta__ingestion_runs). Same CODEOWNERS + AI review pattern as WS-4. Home: ume-data-engineering/ingestion/.
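The schema contract in each blueprint amounts to checking every landed record against declared fields and types before it leaves the landing zone. A sketch under assumed field names (the contract shown is illustrative, not a real UME source):

```python
# Sketch of a blueprint's schema-contract check. The field names and
# types in CONTRACT are hypothetical examples.

CONTRACT = {"event_id": str, "amount_cents": int, "occurred_at": str}

def validate(record, contract=CONTRACT):
    """Return a list of violations for one record (empty = valid)."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

print(validate({"event_id": "e1", "amount_cents": 100, "occurred_at": "2024-01-01"}))
# []
print(validate({"event_id": "e1", "amount_cents": "100"}))
# ['amount_cents: expected int, got str', 'missing field: occurred_at']
```

The observability hook would write the violation count per run into meta__ingestion_runs.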

Exit criteria:

  • Three blueprints (jsonl, csv, ftp) committed with a README and a smoke test each.
  • One real non-CDC source onboarded through a blueprint, end-to-end into a landing dataset.
  • That source emits a row to meta__ingestion_runs after each run.

# WS-6: Telemetry datamart + Metabase observability

Specialist: dbt-silver for the meta models, infra-terraform for the Metabase wiring.

Scope: dbt meta models in the silver project:

  • meta__bq_usage from INFORMATION_SCHEMA.JOBS_BY_PROJECT (user, table, scanned bytes, cost, last_run).
  • meta__table_lifecycle (last_queried, last_modified, query_count_30d, owner). Feeds future lifecycle PRs.
  • meta__ingestion_runs (DAG, table, rows, duration, status, error_class).
  • meta__dbt_runs (manifest + run_results loaded by a post-run hook).

Metabase: collection per area, alerts on freshness lag, ingestion failures, cost spikes, "table not queried in N days". Alerts to Slack via Metabase's native integration.
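The cost column in meta__bq_usage derives from bytes billed times the on-demand rate. A sketch of that arithmetic; the $6.25/TiB rate is an assumption about on-demand pricing and must be checked against the project's actual region and pricing model:

```python
# Sketch of the cost derivation behind meta__bq_usage. The per-TiB
# rate is an assumed on-demand price, not a confirmed figure.

PRICE_PER_TIB_USD = 6.25  # assumption: verify against actual billing
TIB = 1024 ** 4

def query_cost_usd(total_bytes_billed):
    """Approximate on-demand cost of one job from bytes billed."""
    return round(total_bytes_billed / TIB * PRICE_PER_TIB_USD, 4)

def daily_spend(jobs):
    """Aggregate spend per user from job rows, for the cost dashboard."""
    spend = {}
    for job in jobs:
        spend[job["user_email"]] = spend.get(job["user_email"], 0.0) \
            + query_cost_usd(job["total_bytes_billed"])
    return spend

jobs = [
    {"user_email": "maria@ume.dev", "total_bytes_billed": 2 * TIB},
    {"user_email": "joao@ume.dev", "total_bytes_billed": TIB // 2},
]
print(daily_spend(jobs))  # {'maria@ume.dev': 12.5, 'joao@ume.dev': 3.125}
```

In the real model this runs as SQL over INFORMATION_SCHEMA.JOBS_BY_PROJECT; the Python is only to pin down the arithmetic.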

Exit criteria:

  • All four meta__* models build daily and have at least one dbt test each.
  • Metabase has a "Cost yesterday" question that loads in <5s.
  • At least three Metabase alerts wired to Slack (one freshness, one cost, one ingestion failure).
  • A forced threshold breach triggers a Slack message in <10 min.

# WS-7: New Metabase instance

Specialist: infra-terraform.

Scope: adapt the airflow-helm module pattern in ume-data-infra for Metabase. Cloud SQL Postgres backend (HA), IAP fronting, immutable image tags, dependabot bumping chart and image versions through PR. Workload Identity for the read-only BQ service account.

Exit criteria:

  • metabase.<domain> reachable via IAP for the seed allowlist.
  • Dependabot opens a successful image-bump PR within the first 2 weeks of deploy.
  • Metabase reads silver and meta__* tables; cannot write anywhere.
  • Restart of the Cloud SQL Postgres causes <2 min of Metabase downtime.

# WS-8: Legacy consumer migration

Specialist: bq-governance + domain owners.

Scope: catalogue every table/view sourcing from raw CDC; assign each a target silver source; produce a migration runbook + sunset date per consumer. Tracked in the same metrification dashboard as silver delivery.
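The row-count verification in the runbook is a tolerance check between the legacy raw-CDC count and the silver count. A sketch; the 1% default tolerance is an assumption to be tightened per table:

```python
def within_tolerance(legacy_count, silver_count, tolerance=0.01):
    """Check a repointed consumer's silver row count against legacy.

    The 1% default tolerance is an assumed starting point; tighten it
    per table as business rules firm up.
    """
    if legacy_count == 0:
        return silver_count == 0
    return abs(silver_count - legacy_count) / legacy_count <= tolerance

print(within_tolerance(100_000, 99_500))  # True  (0.5% drift)
print(within_tolerance(100_000, 97_000))  # False (3% drift)
```

A consumer only counts toward the "5 repointed" exit criterion once this check passes.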

Exit criteria:

  • Register of legacy consumers complete; each row tagged with target silver model or "to sunset".
  • At least 5 consumers repointed at silver and verified (row counts within tolerance vs raw-CDC version).
  • Sunset dates communicated to all owners.
  • BQ lockdown (WS-2) gates moved forward only after all consumers in a wave are repointed or sunset.

# WS-9: DataHub scanning of silver + sources

Specialist: datahub-platform (existing).

Scope: ship DataHub Phase 2 per current roadmap. Wire ingestion recipes for BigQuery (silver project + raw CDC), Airflow DAGs, dbt manifests from ume-data-engineering. Tag silver models as the canonical layer. Lineage between raw CDC and silver visible. No similarity-detection logic yet; that is Wave 2.

Exit criteria:

  • DataHub running in dev with all three ingestion recipes scheduled in Airflow.
  • Every shipped silver model appears in DataHub with its dbt description, tags, and column-level lineage to raw CDC.
  • Search for a raw CDC table shows downstream silver models.
  • DataHub UI reachable via IAP for the seed allowlist.

# WS-10: Lifecycle policy (design now, automate Wave 2)

Specialist: dev-x.

Scope: write the rule set (e.g. table not queried in 90 days + not referenced in dbt manifest → deprecation candidate). The read-only report builds in WS-6. Automation (auto-PR to deprecate from ume-data-engineering) is Wave 2.
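The rule set sketched above can be stated executably against meta__table_lifecycle rows. A minimal sketch; the field names mirror the planned columns but are assumptions at this design stage:

```python
from datetime import date, timedelta

def deprecation_candidates(tables, today, stale_days=90):
    """Apply the drafted lifecycle rule to meta__table_lifecycle rows.

    Rule: not queried in `stale_days` AND not referenced in the dbt
    manifest -> deprecation candidate. Field names are assumed to
    match the planned meta__table_lifecycle schema.
    """
    cutoff = today - timedelta(days=stale_days)
    return [
        t["table"]
        for t in tables
        if t["last_queried"] < cutoff and not t["referenced_in_dbt"]
    ]

rows = [
    {"table": "legacy.ad_hoc_1", "last_queried": date(2024, 1, 5),
     "referenced_in_dbt": False},
    {"table": "silver.contracts", "last_queried": date(2024, 6, 1),
     "referenced_in_dbt": True},
]
print(deprecation_candidates(rows, today=date(2024, 6, 10)))
# ['legacy.ad_hoc_1']
```

In the locomotive this stays a read-only weekly report; Wave 2 turns the output into auto-PRs.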

Exit criteria:

  • Lifecycle rules written in docs/infrastructure/lifecycle.md and accepted in a PR.
  • Weekly Metabase report listing deprecation candidates ships.
  • First review meeting held; at least one candidate confirmed for deprecation.

# Metrification

One dashboard fed by the meta__* models. Four counters tracked weekly.

| Metric | Source | Why it matters |
|--------|--------|----------------|
| Silver tables live | meta__dbt_runs filtered to silver schema | Pace of the locomotive |
| Legacy consumers migrated | manual register in dbt repo + meta__bq_usage | Progress toward lockdown |
| BQ scanned cost (silver vs legacy) | meta__bq_usage | Budget signal |
| Failed silver runs (7d) | meta__dbt_runs | Reliability signal |

Targets are intentionally absent from the main task list. The curve is the conversation.

# Wave 2 (acknowledged, not scoped here)

  • DataHub similarity scoring + PR-bot that flags new silver models with high column overlap to existing ones.
  • Lifecycle automation: auto-PR to ume-data-engineering to deprecate cold tables.
  • dbt Cloud re-evaluation with concrete cost/velocity baseline.
  • Per-domain Metabase migrations off the legacy instance.

# Critical files and paths

ume-data-infra/ (Terraform)

  • environments/silver-01-base/ (new)
  • environments/silver-02-bigquery/ (new)
  • environments/dev-04-metabase/ (new)
  • modules/bq-dataset/ (new, if not present)

ume-data-engineering/ (new migration branch off dev)

  • app/datalake/transformations/ (brand-new dbt project)
  • app/datalake/transformations/dbt_project.yml, profiles.yml
  • app/datalake/transformations/models/silver/, models/meta/, sources/cdc.yml
  • ingestion/templates/{cdc,jsonl,csv,ftp}/ (WS-5)
  • .github/workflows/{pr-ci.yml,build.yml,manifest-upload.yml}
  • .github/CODEOWNERS
  • .claude/skills/ (add-silver-model, add-meta-dashboard, add-ingestion-blueprint)

ume-data-dags/

  • Stays as the reference template; Cosmos config evolves to point at ume-data-engineering/app/datalake/transformations.

# Verification (end-to-end)

  • terraform plan clean across silver-01-base, silver-02-bigquery, dev-04-metabase.
  • One sample PR on the migration branch triggers all gates: dbt parse, schema lint, AI similarity check, manifest diff, CODEOWNERS approval.
  • Cosmos in dev runs against ume-data-silver-dev and produces silver.contracts and silver.installments.
  • meta__bq_usage populated; Metabase question shows yesterday's spend; alert fires to Slack on a forced threshold breach.
  • DataHub shows the silver model with column lineage back to the raw CDC source.
  • BQ audit-log alert fires on a manual CREATE TABLE outside the silver project.

# Duration estimate

The main task list is not time-locked. Rough envelope:

  • Best case: ~6 weeks to all exit criteria green.
  • Realistic: ~10 weeks. WS-2 drags because of communication and consumer migration; WS-9 depends on DataHub Phase 2 stability.
  • Worst case: ~16 weeks. DataHub Phase 2 instability, Cloud SQL HA surprises in WS-7, or stakeholder pushback on the lockdown timeline.

Even on the worst-case path, the first 4–5 visible milestones land within the first month.

# Tasks (flat list, prioritized)

Each task is sized to be picked up independently. Exit criteria are restated for clarity.

# Foundation (parallelizable, do these first)

  • T-01 — Clone ume-data-engineering locally under ~/git/. Exit: repo present, dev branch checked out.
  • T-02 — Cut migration branch off dev on ume-data-engineering and push with branch protection (required reviews, required status checks). Exit: branch live on remote with protection rules.
  • T-03 — Scaffold environments/silver-01-base/ in ume-data-infra (project, VPC reuse, IAM groups). Exit: terraform plan succeeds locally.
  • T-04 — Scaffold environments/silver-02-bigquery/ (datasets per silver group, SAs for Airflow-writer + Metabase-reader). Exit: terraform plan succeeds locally.
  • T-05 — Add modules/bq-dataset/ if not already present. Exit: module created, used by silver-02-bigquery.
  • T-06 — Open infra PR for silver-01-base + silver-02-bigquery. Exit: PR green, CI plan output posted as comment.
  • T-07 — terraform apply silver-dev from CI (manual approval). Exit: project + datasets exist; Airflow SA writes a test row via Workload Identity.

# dbt project bootstrap

  • T-10 — Create app/datalake/transformations/ skeleton on migration branch: dbt_project.yml, profiles.yml, models/silver/, models/meta/, sources/cdc.yml. Exit: dbt parse passes locally.
  • T-11 — Wire CI: dbt parse on PR, dbt build against silver-dev, schema diff comment, SQLFluff + dbt-checkpoint pre-commit. Exit: a sample PR runs all checks green.
  • T-12 — Build silver.contracts model + dbt tests (not_null, unique on PK). Bad business rules acceptable; doc TODO inline. Exit: model materializes in silver-dev via local dbt run.
  • T-13 — Build silver.installments model + dbt tests. Exit: model materializes in silver-dev.
  • T-14 — Wire Cosmos in ume-data-dags (or successor) to point at app/datalake/transformations. Exit: an Airflow DAG run produces both silver models in silver-dev.
  • T-15 — Manifest + run-results upload to GCS as a post-run Cosmos hook. Exit: artifacts present in GCS bucket after each run.

# Governance + dev-X

  • T-20 — Add .github/CODEOWNERS covering each silver folder + a data-platform reviewer at root. Exit: a push without approval is blocked on migration.
  • T-21 — Add .claude/skills/add-silver-model skill that scaffolds a new model + sources + tests. Exit: invoking the skill produces a PR-ready file set.
  • T-22 — Wire AI PR-review bot: column-overlap similarity check against existing silver models, schema lint, test-coverage threshold, manifest diff summary. Exit: a deliberately-similar test PR triggers the bot's warning.

# BQ lockdown (WS-2)

  • T-30 — Build the shadow-table inventory query against INFORMATION_SCHEMA.JOBS_BY_PROJECT (90-day window). Publish results as a register doc. Exit: ≥90% of writes attributed to a known user/team.
  • T-31 — Provision per-team sandbox_<team> datasets via Terraform with 30-day partition expiration. Exit: a test row in a sandbox vanishes after expiration.
  • T-32 — Audit-log alert in Cloud Monitoring on CREATE TABLE outside silver + sandboxes; route to Slack. Exit: a forced test fires the alert.
  • T-33 — Stage-1 revoke of dataEditor on one pilot legacy dataset. Exit: revoke applied; 7-day query log shows no consumer breakage.

# Telemetry datamart + Metabase

  • T-40 — Build meta__bq_usage (dbt model) + at least one dbt test. Exit: model builds daily, populated for the prior day.
  • T-41 — Build meta__table_lifecycle. Exit: builds daily; rows for every BQ table in scope.
  • T-42 — Build meta__ingestion_runs (schema only initially; populated as ingestion blueprints emit). Exit: schema deployed, sample row from a smoke test present.
  • T-43 — Build meta__dbt_runs from manifest + run-results uploaded by T-15. Exit: rows for every Cosmos run.
  • T-44 — Scaffold environments/dev-04-metabase/ Terraform: Metabase Helm release, Cloud SQL HA backend, IAP fronting, dependabot. Exit: metabase.<domain> reachable via IAP for seed allowlist.
  • T-45 — First Metabase question + dashboard reading meta__bq_usage: "Cost yesterday". Exit: dashboard loads in <5s.
  • T-46 — Wire 3 Metabase alerts to Slack: freshness lag, cost spike, ingestion failure. Exit: a forced threshold breach posts to Slack in <10 min.

# Ingestion blueprints

  • T-50 — Create ume-data-engineering/ingestion/ with three blueprints: jsonl, csv, ftp. README + smoke test each. Exit: three blueprints committed and CI-green.
  • T-51 — Onboard one real non-CDC source via a blueprint. Exit: source lands in BigQuery and emits to meta__ingestion_runs.

# DataHub scanning (WS-9)

  • T-60 — DataHub Phase 2 deployed per current roadmap. Exit: UI reachable via IAP for seed allowlist.
  • T-61 — DataHub ingestion recipes: BigQuery (silver + raw CDC), Airflow DAGs, dbt manifests. Scheduled in Airflow. Exit: all three recipes run nightly.
  • T-62 — Every shipped silver model appears in DataHub with description, tags, column-level lineage to raw CDC. Exit: searching a raw CDC table shows downstream silver models.

# Legacy consumer migration (WS-8)

  • T-70 — Catalogue all tables/views sourcing from raw CDC. Tag each with a target silver model or "to sunset". Exit: register published.
  • T-71 — Repoint at least 5 consumers at silver, verify with row-count tolerance vs raw-CDC version. Exit: 5 consumers green; sunset dates communicated for the rest.

# Lifecycle (WS-10, design only in locomotive)

  • T-80 — Author docs/infrastructure/lifecycle.md with the rule set (e.g. not queried in 90 days + not referenced in dbt manifest → deprecation candidate). Exit: PR merged.
  • T-81 — Build a weekly Metabase report listing deprecation candidates from meta__table_lifecycle. Exit: report runs weekly; first review meeting held.

# Open items still to resolve

  • Confirm Metabase domain and IAP allowlist seed.
  • Identify 2–3 pilot domains willing to migrate first consumers off legacy CDC.