# Story status

Architecture pivot (2026-04-15): Airflow replaces Cloud Composer — deployed on GKE Standard via the official Helm chart with CeleryExecutor. Stories restructured into Phase 1 (Airflow on GKE, Stories 0-5) and Phase 2 (DataHub, Stories 6-13). Story 4 decomposed into 4a-4d (2026-04-16): scaffolding, Helm release, ingress + auth, custom image. GKE + Cloud SQL are Phase 1 infrastructure (reused in Phase 2). See Deployment Stories for the current sequence.

# Story 0 — Repository scaffold

Status: done (minimal) Date: 2026-04-15

Implemented as a minimal smoke-test scaffold rather than the full pre-baked structure from the spec. Directories and files for layers, environments, and modules will be created as each story needs them.

# What was created

  • .gitignore — standard Terraform ignores
  • .pre-commit-config.yaml — terraform fmt, validate, tflint via antonbabenko/pre-commit-terraform
  • CLAUDE.md — agent instructions, invariants, links to docs
  • README.md — repo overview and getting-started commands
  • .github/workflows/terraform-plan.yml — runs on PR: fmt check, detect changed stacks, init -backend=false, validate
  • .github/workflows/terraform-apply.yml — manual dispatch only until Story 1 wires up WIF
  • .github/workflows/terraform-drift.yml — manual dispatch only until Story 1 wires up WIF
  • scripts/detect-changed-stacks.sh — working script: git diff to stack paths, module-to-stack dependency resolution
  • layers/00-bootstrap/ — single valid stack (versions.tf, variables.tf, outputs.tf, main.tf, backend.hcl)

# What was deferred (created per-story instead)

  • layers/10-platform-shared/ — deferred to Phase 2 (Story 6)
  • environments/dev-*, environments/prod-* — Stories 3+
  • modules/* — created per story as environment-scoped resources are added (module strategy revised in Story 3d)

# Verification

  • terraform fmt -check -recursive passes
  • terraform init -backend=false && terraform validate passes on layers/00-bootstrap
  • All 3 workflow YAML files are syntactically valid
  • scripts/detect-changed-stacks.sh runs without error
  • PR merged to main

# Story 1 — Bootstrap

Status: done Date: 2026-04-15

# What was created

  • layers/00-bootstrap/main.tf — full implementation: API enablement (7+4 APIs via for_each), GCS state bucket (ume-tf-state-poc-ume-data), Artifact Registry (ume-composer-images), WIF pool + provider (ume-datainfra-github / ume-datainfra-github-provider), CI service accounts (ume-datainfra-tf-plan, ume-datainfra-tf-apply), IAM bindings (project-level, bucket-level, WIF-to-SA)
  • layers/00-bootstrap/locals.tf — common labels (env=shared, layer=bootstrap, owner=platform-team, cost_center=data-platform)
  • layers/00-bootstrap/terraform.tfvars — project_id and region (us-east1)
  • layers/00-bootstrap/variables.tf — added github_org, github_repo, environment variables
  • layers/00-bootstrap/outputs.tf — wired all 4 required outputs + SA emails
  • .github/workflows/terraform-plan.yml — WIF auth enabled, plan + PR comment steps active, SA = ume-datainfra-tf-plan
  • .github/workflows/terraform-apply.yml — WIF auth enabled, triggered on push to main, SA = ume-datainfra-tf-apply
  • .github/workflows/terraform-drift.yml — WIF auth enabled, cron schedule active, SA = ume-datainfra-tf-apply
  • Docs updated: us-central1 changed to us-east1 across 06-composer.md, 07-gke-platform.md, 10-operations.md, 11-deployment-stories.md

# Key decisions

  • Direct resources, no modules: AR and WIF are only used by bootstrap (one-off layer, never replicated across environments). Module extraction not needed.
  • roles/editor for tf-apply-sa: PoC project, granular role list documented in main.tf for prod hardening. Inspired by frontera-infra pattern where broad roles are used initially, then tightened.
  • WIF attribute condition: repo-only on the provider (assertion.repository == "1edata/ume-data-infra"). Branch restriction is on the SA binding: tf-apply-sa only allows refs/heads/main, tf-plan-sa allows any branch.
  • SA naming: ume-datainfra-tf-plan / ume-datainfra-tf-apply — repo-specific to avoid collision with other repos' CI SAs.
  • State bucket: versioning enabled, no lifecycle rules (files too small to matter on cost), uniform bucket-level access.
  • Custom role for plan SA state access: roles/storage.objectViewer was insufficient — Terraform needs to create/delete .tflock files. Created a custom project role (tfStateLocker) with get, list, create, delete permissions. This avoids granting objectAdmin which would let the plan SA overwrite state files. (Sketched after this list.)
  • Region: us-east1 (changed from us-central1 across all docs).
  • Workflow WIF_PROVIDER: set to FILL_AFTER_BOOTSTRAP_APPLY placeholder. After manual apply, operator grabs the value from terraform output wif_provider_name and updates all 3 workflow files.
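
A minimal sketch of that custom role and its bucket-level grant, assuming the bootstrap stack names its state bucket resource `tf_state` and its plan SA resource `tf_plan`; the shipped main.tf may differ in naming and in where the role is bound.

```hcl
# Hypothetical shape of the tfStateLocker role: enough to read state and manage
# .tflock objects, without a blanket objectAdmin grant.
resource "google_project_iam_custom_role" "tf_state_locker" {
  project = var.project_id
  role_id = "tfStateLocker"
  title   = "Terraform state locker"
  permissions = [
    "storage.objects.get",
    "storage.objects.list",
    "storage.objects.create",
    "storage.objects.delete",
  ]
}

# Bound on the state bucket only, so the plan SA cannot touch other buckets.
resource "google_storage_bucket_iam_member" "plan_sa_state_lock" {
  bucket = google_storage_bucket.tf_state.name # assumed resource name
  role   = google_project_iam_custom_role.tf_state_locker.id
  member = "serviceAccount:${google_service_account.tf_plan.email}" # assumed resource name
}
```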

# Bootstrap procedure (for brand-new deployment)

cd layers/00-bootstrap
# 1. Comment out backend "gcs" {} in versions.tf
# 2. terraform init && terraform apply
# 3. Restore backend "gcs" {} in versions.tf
# 4. terraform init -backend-config=backend.hcl -migrate-state
# 5. terraform output  (grab wif_provider_name, SA emails)
# 6. Update WIF_PROVIDER in all 3 workflow files
# 7. Push PR

# What was deferred

  • Workflow WIF_PROVIDER values: filled after manual apply (not a code task, operator step)
  • Granular IAM roles for tf-apply-sa: documented target list, implement before prod

# Verification

  • terraform fmt -check -recursive passes
  • terraform init -backend=false && terraform validate passes
  • State bucket exists: gsutil ls gs://ume-tf-state-poc-ume-data/
  • Artifact Registry repo exists: gcloud artifacts repositories list --project=poc-ume-data
  • WIF pool exists (ACTIVE): gcloud iam workload-identity-pools list --location=global --project=poc-ume-data
  • Both SAs exist: ume-datainfra-tf-plan@poc-ume-data.iam and ume-datainfra-tf-apply@poc-ume-data.iam
  • CI plan acquires state lock successfully (custom role tfStateLocker applied)

# Story 1-fix01 — Add missing APIs to bootstrap

Status: done Date: 2026-04-15

Added compute.googleapis.com and servicenetworking.googleapis.com to the required_apis set in layers/00-bootstrap/main.tf. These are required by Stories 3a/3b for VPC provisioning and Private Service Access (Cloud SQL private IP).

# What changed

  • layers/00-bootstrap/main.tf — added 2 APIs to the required_apis local

# Verification

  • terraform plan shows 2 new google_project_service resources
  • terraform apply succeeds
  • gcloud services list --project=poc-ume-data | grep -E 'compute|servicenetworking'

# Story 2 — Platform Shared (Airflow-focused) → Doc Restructure

Status: done Date: 2026-04-15

# What was created

No Terraform resources. This story became a documentation restructure after planning revealed that Airflow SAs are environment-scoped, not shared.

Files modified:

  • docs/infrastructure/11-deployment-stories.md — rewrote Story 2, updated Stories 3/4/6
  • docs/infrastructure/04-terraform-structure.md — updated layers table, SA table, inter-stack contracts
  • docs/infrastructure/06-airflow.md — fixed SA/KSA names in Helm values and WI table
  • docs/infrastructure/07-gke-platform.md — fixed WI bindings table
  • docs/infrastructure/03-architecture.md — updated repo layout description

# Key decisions

  • SAs moved to environments/dev-01-base/: Airflow service accounts are environment-scoped because their Workload Identity bindings reference a specific project's identity pool ({project}.svc.id.goog). In the multi-project future, each project gets its own SAs for its own cluster. layers/ is reserved for resources shared across all environments and projects. (The binding shape is sketched after this list.)
  • layers/10-platform-shared/ deferred to Phase 2: No cross-environment resources exist in Phase 1. Created in Story 6 when DataHub work begins.
  • SA naming: ume-airflow and ume-airflow-kpo (follows the ume-{purpose} convention). All doc references updated.
  • KSA naming: airflow (not airflow-scheduler). The Helm chart applies one KSA to all components (scheduler, worker, webserver, triggerer). A generic name is accurate.
  • storage.objectAdmin project-wide for PoC: Scoping to specific buckets deferred to Story 4 as a hardening task (the log bucket doesn't exist until then).
  • Inter-stack contract simplified: runtime stack reads SA emails from dev-01-base remote state (one source instead of two). Originally named dev-03-runtime, renamed to dev-02-runtime in Story 4 decomposition.
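
A minimal sketch of why the binding is environment-scoped: the Workload Identity member string embeds the project's identity pool. Resource and variable names are assumptions, not the shipped iam.tf.

```hcl
# Hypothetical WI binding: the KSA airflow/airflow may impersonate the
# ume-airflow GSA. The member string hardcodes this project's identity pool,
# which is why the SA cannot live in a shared layer.
resource "google_service_account_iam_member" "airflow_workload_identity" {
  service_account_id = google_service_account.airflow.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[airflow/airflow]"
}
```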

# What was deferred

  • layers/10-platform-shared/ creation — Story 6
  • Bucket-scoped storage.objectAdmin — Story 4 hardening note

# Story 3a — Networking

Status: done Date: 2026-04-15

# What was created

  • environments/dev-01-base/versions.tf — Terraform >= 1.5, google/google-beta ~> 5.0, GCS backend
  • environments/dev-01-base/variables.tf — project_id, region, zone (us-east1-b), environment, state_bucket
  • environments/dev-01-base/locals.tf — common labels (env, layer=base, owner=platform-team, cost_center=data-platform)
  • environments/dev-01-base/terraform.tfvars — poc-ume-data, us-east1, us-east1-b
  • environments/dev-01-base/backend.hcl — state prefix environments/dev-01-base
  • environments/dev-01-base/data.tf — terraform_remote_state for 00-bootstrap
  • environments/dev-01-base/networking.tf — VPC (ume-data-dev-vpc), subnet (ume-data-dev-gke-nodes with secondary ranges for pods/services), static IP (ume-data-dev-nat-ip), Cloud Router (ume-data-dev-router), Cloud NAT (ume-data-dev-nat)
  • environments/dev-01-base/outputs.tf — vpc_id, vpc_self_link, subnet_self_link, pod/service range names, nat_ip_address
  • Updated docs/infrastructure/04-terraform-structure.md — new naming convention ume-data-{env}-{purpose}, added Cloud Router/NAT to naming table
  • Updated naming references across 06-airflow.md, 10-operations.md, 11-deployment-stories.md

# Key decisions

  • ume-data-{env} naming prefix: Changed from ume-{env} to avoid generic collisions in shared GCP projects. All environment-scoped resources use this prefix. Global resources (bootstrap) keep the shorter ume- prefix.
  • Direct resources (modularized in Story 3d): Originally used native Terraform resources directly. Extracted into modules/vpc/ in Story 3d via moved blocks.
  • Static NAT IP: Reserved a google_compute_address for predictable egress IP, enabling allowlisting by external services.
  • ALL_SUBNETWORKS_ALL_IP_RANGES for NAT: No public subnets planned. Cloud NAT only affects VMs without external IPs, so this is safe unconditionally.
  • Remote state in data.tf: Separated from networking.tf because it is a stack-level concern shared by Stories 3b-3d.
  • Zone variable in scaffolding: us-east1-b included in variables.tf now for Story 3d's zonal GKE cluster.
  • Subnet CIDRs: nodes 10.0.0.0/20, pods 10.4.0.0/14, services 10.8.0.0/20. Standard GKE VPC-native ranges, no overlaps. (Subnet shape sketched after this list.)
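
A sketch of the node subnet with the secondary ranges above; the secondary range names and exact argument set are illustrative, the shipped networking.tf (later modules/vpc/) is the source of truth.

```hcl
# GKE node subnet with VPC-native secondary ranges for pods and services.
resource "google_compute_subnetwork" "gke_nodes" {
  name                     = "ume-data-dev-gke-nodes"
  project                  = var.project_id
  region                   = var.region
  network                  = google_compute_network.vpc.id
  ip_cidr_range            = "10.0.0.0/20" # nodes
  private_ip_google_access = true

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```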

# What was deferred

  • Nothing. Story 3a is self-contained.

# Verification

  • terraform fmt -check -recursive passes
  • terraform init -backend-config=backend.hcl succeeds (GCS backend)
  • terraform validate passes
  • terraform plan shows 5 resources to create (VPC, subnet, static IP, router, NAT)
  • VPC and subnets exist
  • Private Google Access enabled
  • Cloud NAT configured
  • Static IP reserved

# Story 3b — Cloud SQL

Status: done Date: 2026-04-16

# What was created

  • environments/dev-01-base/cloud-sql.tf — PSA peering (ume-data-dev-psa-range, 10.64.0.0/20), PostgreSQL 16 instance (ume-data-dev-airflow-pg, db-g1-small), airflow database, Secret Manager secret shell (ume-data-dev-cloudsql-admin-password)
  • environments/dev-01-base/outputs.tf — added sql_connection_name, sql_private_ip, sql_instance_name
  • docs/infrastructure/04-terraform-structure.md — updated layout to show cloud-sql.tf instead of persistence.tf
  • docs/infrastructure/11-deployment-stories.md — updated Story 3b spec with concrete resource names, design decisions, and refined verification checklist

# Key decisions

  • PostgreSQL 16: Latest GA on Cloud SQL. Improved query performance over 15. Fully supported by Airflow.
  • PSA range /20 at 10.64.0.0: Hardcoded for deterministic plans. /20 (not /24) avoids painful range expansion later — expanding PSA requires deleting/recreating the peering connection (downtime). Zero cost difference. (Sketched after this list.)
  • File name cloud-sql.tf (not persistence.tf): More specific, consistent with networking.tf / gke.tf naming pattern. Structure doc updated.
  • airflow database created here: Story 4's Helm chart expects metadataConnection.db: airflow. Creating alongside the instance avoids a manual prerequisite.
  • No Terraform-managed admin user: Default postgres user is created automatically. Break-glass access uses postgres + password from Secret Manager.
  • disk_autoresize_limit = 50: Safety cap for PoC. Prevents runaway growth from unbounded auto-increase.
  • deletion_protection = false: PoC instance. Must be set to true for prod.
  • No labels on PSA range: google_compute_global_address with purpose = VPC_PEERING rejects labels (GCP API limitation). Documented in code.
  • IAM auth flag only, bindings in 3c: The cloudsql.iam_authentication = on flag is set on the instance. The actual google_sql_user (IAM type) and roles/cloudsql.client binding are deferred to Story 3c, which creates the ume-airflow SA.
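
A sketch of the PSA allocation and peering described above, assuming the VPC resource is named `vpc` in the stack; the shipped cloud-sql.tf may differ.

```hcl
# Hardcoded /20 PSA range; note VPC_PEERING addresses reject labels (GCP API limitation).
resource "google_compute_global_address" "psa_range" {
  name          = "ume-data-dev-psa-range"
  project       = var.project_id
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  address       = "10.64.0.0"
  prefix_length = 20
  network       = google_compute_network.vpc.id
}

# The private services access peering Cloud SQL connects through.
resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}
```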

# What was deferred

  • IAM database user (google_sql_user) and roles/cloudsql.client binding — Story 3c (depends on SA creation)
  • PITR (point-in-time recovery) — prod hardening
  • HA (regional availability) — prod
  • Cloud SQL monitoring alerts — Story 4
  • modules/cloud-sql-postgres/ — done (extracted in Story 3d via moved blocks)

# Verification

  • terraform fmt -check -recursive passes
  • terraform validate passes
  • Cloud SQL running
  • Private IP assigned, no public
  • PSA range allocated
  • airflow database exists
  • Secret shell exists

# Story 3c — Airflow IAM

Status: done Date: 2026-04-16

# What was created

  • environments/dev-01-base/iam.tf — ume-airflow SA (4 project-level roles), ume-airflow-kpo SA (2 project-level roles), Workload Identity bindings for both (airflow/airflow → ume-airflow, airflow-kpo/airflow-kpo → ume-airflow-kpo), Cloud SQL IAM database user for ume-airflow
  • environments/dev-01-base/outputs.tf — added airflow_sa_email, airflow_kpo_sa_email
  • docs/infrastructure/04-terraform-structure.md — added iam.tf to the dev-01-base layout tree
  • docs/infrastructure/11-deployment-stories.md — updated Story 3c spec with design decisions and refined verification checklist
  • layers/00-bootstrap/main.tf — added roles/servicenetworking.networksAdmin to tf-apply-sa (Story 3b fixup, needed for PSA peering)

# Key decisions

  • google_sql_user in iam.tf, not cloud-sql.tf: IAM concern (granting SA database auth). Keeps Story 3c PR self-contained. Cross-file reference to google_sql_database_instance.airflow.name is a normal intra-stack reference.
  • google_project_iam_member (additive): Same pattern as bootstrap. Authoritative (google_project_iam_binding) would revoke other members from shared roles like roles/bigquery.dataEditor.
  • for_each over role sets: 6 role bindings from 2 toset() locals. Keys are full role strings (e.g., google_project_iam_member.airflow["roles/cloudsql.client"]). Adding/removing roles is a one-line change.
  • trimsuffix for SQL user name: GCP API expects the SA email without .gserviceaccount.com. Using trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com") maintains the Terraform dependency graph (if SA name changes, this updates automatically). (Sketched after this list, together with the for_each bindings.)
  • No labels: None of the resource types (google_service_account, google_project_iam_member, google_service_account_iam_member, google_sql_user) support GCP labels. Not a label-invariant violation.
  • WI bindings depend on GKE: GCP validates the Workload Identity pool ({project}.svc.id.goog) exists — the pool is created when GKE enables Workload Identity. Added depends_on = [module.gke] in Story 3d. GCP does NOT validate KSA existence (Story 4 creates them via Helm).
  • Broad permissions flagged for later scoping: roles/storage.objectAdmin and roles/secretmanager.secretAccessor are project-wide for PoC. Inline TODO(narrow-scope) comments mark these for Story 4 / future hardening.
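
A sketch of the two patterns above: for_each over a role set, and the trimsuffix'd IAM database user. The role list shown is illustrative; iam.tf carries the real set.

```hcl
locals {
  # Illustrative role set; the shipped list lives in iam.tf.
  airflow_roles = toset([
    "roles/cloudsql.client",
    "roles/storage.objectAdmin",          # TODO(narrow-scope): bucket-level
    "roles/secretmanager.secretAccessor", # TODO(narrow-scope): per-secret
    "roles/bigquery.dataEditor",
  ])
}

# Additive bindings keyed by the full role string, e.g.
# google_project_iam_member.airflow["roles/cloudsql.client"].
resource "google_project_iam_member" "airflow" {
  for_each = local.airflow_roles
  project  = var.project_id
  role     = each.value
  member   = "serviceAccount:${google_service_account.airflow.email}"
}

# IAM DB user: the SA email minus the .gserviceaccount.com suffix.
resource "google_sql_user" "airflow_iam" {
  project  = var.project_id
  instance = google_sql_database_instance.airflow.name
  name     = trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```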

# What was deferred

  • Scoping roles/storage.objectAdmin to specific buckets — Story 4 (log bucket doesn't exist yet)
  • Scoping roles/secretmanager.secretAccessor to specific secrets — future hardening when secret set stabilizes

# Verification

  • terraform fmt -check -recursive passes
  • terraform validate passes
  • SAs created: gcloud iam service-accounts list --project=poc-ume-data | grep ume-airflow
  • Project IAM role bindings applied (6 bindings)
  • WI bindings exist: gcloud iam service-accounts get-iam-policy ume-airflow@poc-ume-data.iam.gserviceaccount.com
  • WI bindings exist: gcloud iam service-accounts get-iam-policy ume-airflow-kpo@poc-ume-data.iam.gserviceaccount.com
  • Cloud SQL IAM user exists: gcloud sql users list --instance=ume-data-dev-airflow-pg --project=poc-ume-data

# Story 3d — GKE Cluster + Module Extraction

Status: done Date: 2026-04-16

# What was created

Module extraction (applied first via moved blocks):

  • modules/vpc/ — reusable VPC module (main.tf, variables.tf, outputs.tf, versions.tf). Encapsulates VPC, subnet with GKE secondary ranges, Cloud NAT, Cloud Router. Single network_cidr_base (/12) parameter derives all CIDRs via cidrsubnet().
  • modules/cloud-sql-postgres/ — reusable Cloud SQL module (main.tf, variables.tf, outputs.tf, versions.tf). Encapsulates PSA peering, Cloud SQL instance, database, admin password Secret Manager secret. All cost/topology settings exposed as variables.
  • environments/dev-01-base/moved.tf — 10 moved blocks migrating flat resources to module addresses. Removed after successful apply.
  • environments/dev-01-base/networking.tf — replaced 5 flat resources with module.vpc call
  • environments/dev-01-base/cloud-sql.tf — replaced 5 flat resources with module.airflow_sql call
  • environments/dev-01-base/iam.tf — updated google_sql_user reference to module.airflow_sql.instance_name
  • environments/dev-01-base/outputs.tf — updated value expressions from flat refs to module outputs

GKE (applied second):

  • modules/gke-standard/ — reusable GKE module (main.tf, variables.tf, outputs.tf, versions.tf). Encapsulates cluster + dynamic node pools + naming + labels + security defaults.
  • environments/dev-01-base/gke.tf — module call for ume-data-dev-gke (zonal, us-east1-b), node pools via var.gke_node_pools
  • environments/dev-01-base/variables.tf — added gke_node_pools (full type signature), gke_master_authorized_cidr_blocks
  • environments/dev-01-base/terraform.tfvars — node pool definitions (default-pool, kpo-pool)
  • environments/dev-01-base/outputs.tf — added GKE outputs
  • environments/dev-01-base/iam.tf — added depends_on = [module.gke] to WI bindings

Bootstrap fixes:

  • layers/00-bootstrap/main.tf — added custom role tfIamPolicyAdmin with 4 permissions (iam.serviceAccounts.{get,set}IamPolicy, resourcemanager.projects.{get,set}IamPolicy). Narrower than roles/iam.serviceAccountAdmin + roles/resourcemanager.projectIamAdmin.

Docs:

  • docs/infrastructure/04-terraform-structure.md — rewrote module strategy, updated module catalog (vpc + cloud-sql-postgres marked Created)
  • docs/infrastructure/07-gke-platform.md — updated Calico to Dataplane V2, Terraform Configuration section
  • docs/infrastructure/agents/infra-terraform.md — updated invariants
  • CLAUDE.md — updated invariants: replaced "2+ callers" with forward-looking module strategy

# Key decisions

  • Module extraction before GKE: Networking and Cloud SQL resources were already applied as flat resources (Stories 3a-3c). Extracted into modules using Terraform moved blocks — declarative state migration via CI, no manual terraform state mv. All 10 moves applied with zero resource recreation.
  • VPC module uses cidrsubnet(): Single network_cidr_base (/12) parameter. Node subnet, pod range, and service range derived automatically. New environment = change one value for non-overlapping ranges. Pattern inspired by frontera-infra. (Sketched after this list, together with a moved block.)
  • PSA inside Cloud SQL module, not VPC: PSA's sole purpose is Cloud SQL private networking. Keeping it in the Cloud SQL module means the module handles its own connectivity end-to-end.
  • IAM stays flat: Policy layer, not infrastructure. Roles change per workload, not per environment. A module would just wrap for_each with no encapsulation benefit.
  • Node pools as variables: Moved from inline in gke.tf to var.gke_node_pools + terraform.tfvars. Prod can override machine types, counts, spot settings via tfvars alone.
  • WI bindings depend on GKE: GCP validates the Workload Identity pool ({project}.svc.id.goog) exists. The pool is created by GKE when Workload Identity is enabled. Added depends_on = [module.gke] to both WI binding resources so they're created after the cluster.
  • Custom tfIamPolicyAdmin role: roles/editor omits {get,set}IamPolicy on both projects and service accounts. Rather than granting broad predefined roles (roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin), created a custom role with exactly 4 permissions. CI can manage IAM bindings but can't create/delete SAs or escalate its own access.
  • Dataplane V2 over Calico: Irreversible. Cilium/eBPF over iptables. Built-in network policy enforcement.
  • kpo-pool max=3: Tightened from 10 for dev. Limits cost from runaway DAGs.
  • deletion_protection = true: Deliberate two-step teardown.
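
Two of the mechanics above, sketched: a moved block migrating a flat resource into the new module address, and cidrsubnet() deriving the three ranges from one base. Resource addresses inside the module and the newbits/netnum offsets are assumptions chosen to reproduce the Story 3a CIDRs.

```hcl
# In the stack (environments/dev-01-base/moved.tf): declarative state migration,
# no manual terraform state mv.
moved {
  from = google_compute_network.vpc
  to   = module.vpc.google_compute_network.vpc
}

# In modules/vpc/: one base value drives all three non-overlapping ranges.
variable "network_cidr_base" {
  type    = string
  default = "10.0.0.0/12"
}

locals {
  nodes_cidr    = cidrsubnet(var.network_cidr_base, 8, 0)   # 10.0.0.0/20
  pods_cidr     = cidrsubnet(var.network_cidr_base, 2, 1)   # 10.4.0.0/14
  services_cidr = cidrsubnet(var.network_cidr_base, 8, 128) # 10.8.0.0/20
}
```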

# What was deferred

  • workload-pool node pool — Phase 2 (Story 7, DataHub)
  • Restricting authorized networks to specific CIDRs — when Cloudflare WARP or VPN is set up
  • Regional cluster — prod
  • Remove moved.tf — follow-up cleanup commit

# Verification

  • terraform fmt -check -recursive passes
  • terraform validate passes
  • Module migration: 10 moved operations, 0 creates, 0 destroys for existing resources
  • GKE cluster running: gcloud container clusters list --project=poc-ume-data
  • Both pools listed: gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b
  • Project IAM bindings applied (6 bindings)
  • WI bindings applied (after GKE, via depends_on)
  • Bootstrap custom role tfIamPolicyAdmin applied

# Story 4a — Runtime Stack Scaffolding + GCS Buckets

Status: done Date: 2026-04-17

# What was created

New module — modules/gcs-bucket/:

  • main.tf — google_storage_bucket with dynamic lifecycle rules, uniform bucket-level access, configurable versioning and force_destroy
  • variables.tf — name, project_id, location, storage_class, versioning, force_destroy, lifecycle_rules (list of objects: Delete/SetStorageClass actions with age, created_before, num_newer_versions, with_state conditions), labels
  • outputs.tf — name, url, self_link
  • versions.tf — Terraform >= 1.5, google ~> 5.0

New stack — environments/dev-02-runtime/:

  • versions.tf — Terraform + google + google-beta + kubernetes + helm providers. K8s/Helm auth via google_client_config access token + GKE endpoint/CA from remote state.
  • variables.tf — Active: project_id, environment, region, zone, state_bucket. Commented out for later stories: airflow_image_repository, airflow_image_tag, domain_name, airflow_subdomain.
  • outputs.tf — airflow_logs_bucket, airflow_dags_bucket
  • locals.tf — common labels (layer=runtime)
  • data.tf — google_client_config + remote state for dev-01-base and 00-bootstrap
  • backend.hcl — GCS backend at environments/dev-02-runtime
  • terraform.tfvars — dev values
  • buckets.tf — Two module calls: ume-airflow-logs-poc-ume-data (90-day delete lifecycle) and ume-airflow-dags-poc-ume-data (versioning, no lifecycle)

Modified — modules/gke-standard/:

  • main.tf — Added addons_config { gcs_fuse_csi_driver_config } block
  • variables.tf — Added gcs_fuse_csi_enabled variable (default true)

Modified — environments/dev-01-base/:

  • outputs.tf — Added gke_cluster_name, gke_endpoint, gke_ca_cert (sensitive) outputs, mapping to module.gke outputs
  • moved.tf — Deleted (moves from Story 3d already applied, file was dead weight)

Modified — layers/00-bootstrap/:

  • main.tf — Added roles/container.viewer IAM binding for plan SA

Docs:

  • docs/infrastructure/11-deployment-stories.md — Updated Story 4a spec with design decisions
  • docs/infrastructure/04-terraform-structure.md — Updated gcs-bucket module status to Created

# Key decisions

  • Full lifecycle rule support: lifecycle_rules variable accepts a list of objects with action type (Delete/SetStorageClass) and multiple conditions (age, created_before, num_newer_versions, with_state). Handles tiering rules from the start rather than refactoring later.
  • force_destroy as variable (default false): Module invariant says expose all configurable settings. Dev can override for easy teardown.
  • roles/container.viewer on plan SA: roles/viewer does not map to any k8s RBAC role. Plan SA needs k8s API read access for terraform plan on kubernetes/helm resources (drift detection). Added to bootstrap as a new IAM binding.
  • Missing GKE outputs fixed: dev-01-base was not exporting gke_cluster_name, gke_endpoint, gke_ca_cert despite story-status for 3d claiming they were added. Fixed as a prerequisite. Output names match the inter-stack contracts table in 04-terraform-structure.md.
  • Provider auth pattern: kubernetes/helm providers use data.google_client_config.default.access_token + endpoint/CA from remote state. No gcloud get-credentials needed. Providers initialize lazily, so Story 4a (no k8s resources) passes validate without cluster connectivity. (Sketched after this list.)
  • Two remote state sources in dev-02-runtime: Reads from both dev-01-base (GKE, SQL, SA outputs) and 00-bootstrap (AR URL, state bucket). Clear provenance over pass-through outputs.
  • Commented-out variables: airflow_image_* and domain_name/airflow_subdomain are defined but commented out. Each is wired when its story needs it. Avoids unused-variable noise in validate.
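
A sketch of that provider auth pattern as it would sit in dev-02-runtime, using the gke_endpoint and gke_ca_cert outputs named above; block details may differ from the shipped stack.

```hcl
# Short-lived token for the CI/operator identity running Terraform.
data "google_client_config" "default" {}

data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "environments/dev-01-base"
  }
}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
}

provider "helm" {
  kubernetes {
    host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
  }
}
```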

# What was deferred

  • airflow_namespace output — wired in Story 4b when the namespace is created by the Helm release
  • Bucket-scoped IAM for ume-airflow SA — Story 4b hardening note (currently project-wide roles/storage.objectAdmin)
  • Lock file (environments/dev-02-runtime/.terraform.lock.hcl) — generated locally, committed with the PR

# Verification

  • terraform fmt -check -recursive passes across all changed stacks
  • terraform init -backend=false && terraform validate passes on modules/gcs-bucket
  • terraform init -backend=false && terraform validate passes on environments/dev-01-base
  • terraform init -backend=false && terraform validate passes on environments/dev-02-runtime
  • terraform init -backend=false && terraform validate passes on layers/00-bootstrap
  • After CI apply: buckets exist
  • After CI apply: GCS FUSE CSI enabled on cluster
  • After CI apply: plan SA has roles/container.viewer

# Story 4b — Airflow Helm Release (Stock Image, Port-Forward)

Status: done Date: 2026-04-17

# What was created

New module -- modules/airflow-helm/:

  • main.tf -- locals (Auth Proxy sidecar with --private-ip + --auto-iam-authn, native sidecar variant for Jobs, GCS FUSE config with resource overrides, shared service account reference), kubernetes_namespace_v1, kubernetes_service_account_v1 (shared KSA with WI annotation), 6 kubernetes_secret_v1 resources (metadata connection, result backend, API secret key, JWT secret, admin password, SQL admin password), bootstrap kubernetes_job_v1 (grants + migrate), helm_release with Airflow 3 values (CeleryExecutor, apiServer, dagProcessor, Auth Proxy sidecars, GCS FUSE DAG mount, remote GCS logging), standalone cleanup CronJob
  • variables.tf -- all configurable settings with sensible defaults: image (apache/airflow:3.2.0), chart version (1.20.0), per-component resources, GCS FUSE resource overrides, worker replicas, Airflow config overrides, cleanup schedule/retention, admin user, Cloud SQL Auth Proxy image, SQL admin password secret ID
  • outputs.tf -- namespace, release_name, release_status
  • versions.tf -- Terraform >= 1.5, kubernetes ~> 2.35, helm ~> 2.17, random ~> 3.0

Modified -- modules/cloud-sql-postgres/:

  • main.tf -- automated postgres admin password: random_password, google_sql_user for built-in postgres user, google_secret_manager_secret_version to store password
  • outputs.tf -- added admin_password (sensitive)

Modified -- environments/dev-01-base/:

  • iam.tf -- added roles/cloudsql.instanceUser to Airflow SA roles (required for IAM DB auth, separate from roles/cloudsql.client)
  • outputs.tf -- added sql_admin_password (sensitive)
  • terraform.tfvars -- default-pool max_count raised from 2 to 3

Modified -- environments/dev-02-runtime/:

  • airflow.tf -- single module "airflow" call passing remote state refs including sql_admin_password
  • versions.tf -- added hashicorp/random ~> 3.0 provider
  • variables.tf -- uncommented airflow_image_repository, airflow_image_tag; added airflow_chart_version
  • terraform.tfvars -- Airflow 3 values (apache/airflow:3.2.0, chart 1.20.0)
  • outputs.tf -- added airflow_namespace output

# Key decisions

  • Airflow 3.2.0 / chart 1.20.0: Story was written for Airflow 2.10.3 / chart 1.15.0. Jumped to Airflow 3 (latest stable at deployment time), which forced the architectural changes below.
  • apiServer replaces webserver: Chart 1.20.0 uses semver gates -- apiServer templates render for Airflow >= 3.0.0, webserver templates render for < 3.0.0. The webserver block is kept only for defaultUser config consumed by createUserJob.
  • dagProcessor is mandatory in Airflow 3: Standalone component that parses DAG files. Previously handled by the scheduler in Airflow 2.
  • Shared KSA, not per-component: Chart 1.20.0 creates per-component KSAs (airflow-scheduler, airflow-api-server, etc.) by default, none of which carry the Workload Identity annotation. A single kubernetes_service_account_v1 is created in Terraform with the WI annotation, and all components reference it with serviceAccount = { create = false, name = "airflow" }. The base layer's WI binding targets [airflow/airflow].
  • Terraform bootstrap Job: kubernetes_job_v1.db_bootstrap runs before the Helm release (depends_on). Steps: (1) Cloud SQL Auth Proxy native sidecar, (2) grants init container connects as postgres admin and GRANTs privileges to the IAM user, (3) migrate init container runs airflow db migrate. Needed because the chart's migrateDatabaseJob hook runs after the release resources and failed silently when privileges didn't exist. Chart migration job disabled (migrateDatabaseJob.enabled = false).
  • Cloud SQL Auth Proxy --private-ip: The Cloud SQL instance has only a private IP (PSA networking). Without this flag, the proxy defaults to public IP and fails with "instance does not have IP of type PUBLIC".
  • roles/cloudsql.instanceUser: Required for IAM database authentication. roles/cloudsql.client only allows the proxy to connect to the instance; instanceUser provides cloudsql.instances.login for the actual IAM token-based DB login.
  • Automated postgres admin password: cloud-sql-postgres module generates a random_password, sets it on the built-in postgres user via google_sql_user, stores it in Secret Manager. The runtime layer passes the Secret Manager secret ID to the airflow module. Bootstrap Job fetches the password at runtime via Workload Identity -- no credential stored in Kubernetes.
  • GCS FUSE resource overrides: The GKE webhook injects a sidecar requesting 250m CPU / 256Mi memory / 5Gi ephemeral per pod -- way more than a read-only DAG mount needs. Pod annotations override it to 10m / 64Mi / 256Mi, freeing ~960m CPU requests across 4 pods (the difference between fitting on 2 nodes and needing 3).
  • Probe timeout tuning: Chart probes run airflow jobs check, which imports the full Airflow framework every time. Takes >20s on e2-standard-2. Timeouts raised to 60s on scheduler, worker, triggerer, dag-processor. Startup failureThreshold set to 20 on scheduler and api-server (200s total).
  • Scheduler CPU limit 1000m: At the 500m limit the scheduler was throttled during Python import -- zero log output for 4+ minutes. 1000m lets it burst through startup.
  • Node pool max raised to 3: 7 Airflow pods (each with a Cloud SQL proxy sidecar, 4 with GCS FUSE sidecar) are tight on 2x e2-standard-2. Third node gives the autoscaler room during startup when all pods compete for CPU.
  • Pre-built connection Secrets with URL encoding: The IAM DB user ume-airflow@poc-ume-data.iam contains @ which breaks the Helm chart's URI template. Pre-built kubernetes_secret_v1 with urlencode() referenced via data.metadataSecretName. (Sketched after this list.)
  • waitForMigrations disabled on all Deployments: The chart places extraInitContainers after the wait-for-airflow-migrations init container, so a native sidecar proxy there wouldn't be running when the migration check executes. Disabled since the Terraform bootstrap Job already handles migrations.
  • Helm timeout 900s: Airflow 3 components are heavy Python apps. On e2-standard-2, the full stack takes 4-5 minutes to start. The default 600s was too tight when combined with the bootstrap Job.
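
A sketch of the pre-built metadata connection Secret with the URL-encoded IAM user. The variable name, Secret name, and exact URI shape are assumptions; with --auto-iam-authn the proxy supplies the credential, so only the user needs encoding.

```hcl
locals {
  # The IAM DB user is the SA email minus .gserviceaccount.com; it contains "@",
  # hence the urlencode().
  airflow_db_user = urlencode(trimsuffix(var.airflow_sa_email, ".gserviceaccount.com"))

  # Connects to the Cloud SQL Auth Proxy sidecar on localhost; no password in
  # the URI because the proxy handles IAM auth.
  airflow_metadata_conn = "postgresql://${local.airflow_db_user}@127.0.0.1:5432/airflow"
}

resource "kubernetes_secret_v1" "metadata_connection" {
  metadata {
    name      = "airflow-metadata-connection" # assumed name
    namespace = kubernetes_namespace_v1.airflow.metadata[0].name
  }
  data = {
    connection = local.airflow_metadata_conn # provider base64-encodes values in `data`
  }
}
```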

# What was deferred

  • Bucket-scoped IAM for ume-airflow SA (currently project-wide roles/storage.objectAdmin)
  • Hello-world DAG push and end-to-end verification
  • Investigate chart 1.20.0's intended pattern for Cloud SQL IAM auth + private IP (see backlog) -- the Terraform bootstrap Job is a workaround
  • GCS FUSE mount on api-server for the "Code" tab in the UI (not added since dag-processor handles parsing)

# Verification

  • terraform fmt -check -recursive passes
  • terraform init -backend=false && terraform validate passes on dev-02-runtime
  • terraform plan clean (no changes) on both base and runtime stacks
  • All Airflow pods running: api-server 2/2, scheduler 4/4, dag-processor 4/4, triggerer 4/4, worker 4/4, redis 1/1, statsd 1/1
  • Auth Proxy sidecars running in each pod with successful DB connections
  • Bootstrap Job completed: grants applied, migrations ran
  • Airflow UI accessible via kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow
  • DAG sync via GCS FUSE works (hello-world DAG push pending)
  • Logs appear in GCS bucket (pending first DAG run)
  • Cleanup CronJob created (currently disabled, var.cleanup_enabled = false)

# Story 4c — Ingress + TLS + DNS + IAP (Gateway API, three layers)

Status: done Date: 2026-04-17 / 2026-04-18 Shipped as: PRs 1, 2, 3a, 3b.1, 3b.2, 3c, 3c-fix, 3c-fix2, 3c-fix3, 3c-fix4, 3c-fix5, 3c-fix6 on top of Story 4b. (The long tail of fixes is faithfully recorded — each pinned down a different layer of how IAP + Gateway API + Airflow 3 interact in practice.)

# What was built

The story was re-planned mid-execution. The spec called for classic GKE Ingress + Flask-AppBuilder OAuth in webserver_config.py, but Airflow 3's pluggable auth manager made that impractical without pulling the custom image work forward, and "shared IP + wildcard DNS + per-app ingress" forced GKE Gateway API over classic Ingress. A new dev-02-k8s-base platform layer was introduced and dev-02-runtime renamed to dev-03-runtime.

Final layer split:

  • environments/dev-01-base/ owns DNS zone, shared static IP, wildcard Certificate Manager cert, wildcard A record (pure GCP, no k8s providers).
  • environments/dev-02-k8s-base/ (new) owns the shared Gateway namespace, Gateway, and HTTP→HTTPS redirect HTTPRoute.
  • environments/dev-03-runtime/ (renamed from dev-02-runtime) owns apps: Airflow Helm release, per-app HTTPRoute, per-app IAP wiring.

# PRs in order

  • PR 1 — Bootstrap perms. Added dns.googleapis.com + iap.googleapis.com APIs, granted roles/iap.admin to tf-apply-sa.
  • PR 2 — DNS zone. Created google_dns_managed_zone ume-data-dev-zone in dev-02-runtime (temporarily; moved to base in PR 3a). Operator then pasted the 4 Google NS records into GoDaddy under umedev.marpont.es.
  • PR 3a — Base DNS/cert absorb. Moved the DNS zone from runtime to dev-01-base via removed { lifecycle { destroy = false } } + import blocks. Added shared static IP (ume-data-dev-ingress-ip), wildcard A record (*.umedev.marpont.es), Certificate Manager DNS-01 authorization + CNAME + wildcard managed cert + certificate map + entry. Bumped required_version to >= 1.7 repo-wide and CI terraform_version to ~1.7 (needed for removed blocks).
  • PR 3b.1 — Enable Gateway API. Added gateway_api_config { channel } to modules/gke-standard/ with variable default CHANNEL_STANDARD. Non-disruptive cluster update installed Gateway/HTTPRoute v1 CRDs required by the next PR.
  • PR 3b.2 — dev-02-k8s-base stack. New stack: kubernetes + helm providers wired via remote_state from dev-01-base; gateway namespace ume-data-dev-gateway; kubernetes_manifest Gateway (gatewayClassName = gke-l7-global-external-managed, NamedAddress to base's static IP, HTTPS + HTTP listeners with allowedRoutes.namespaces.from = All, networking.gke.io/certmap annotation to base's cert map); HTTPRoute on :80 that 301-redirects every request to https.
  • PR 3c — Rename runtime + IAP + HTTPRoute. git mv environments/dev-02-runtime → environments/dev-03-runtime, state migrated via gsutil cp of default.tfstate to the new prefix. New modules/iap-oauth/ (per-service OAuth client, k8s secret with client_id/client_secret, GCPBackendPolicy targeting the Service, for_each IAM bindings). Extended modules/airflow-helm/ with an optional HTTPRoute (httproute_enabled, gateway_name, gateway_namespace, hostname). Runtime stack wired them together.
  • PR 3c-fix — Manual brand. google_iap_brand was dropped from code after first apply failed with HTTP 400: the IAP brand API rejects programmatic creation for projects outside a Workspace org, and even for in-org projects the IAP OAuth Admin API is being shut down. Operator created the OAuth consent screen manually in Console (Internal audience, support email ext_marcello.pontes@ume.com.br). Brand name projects/1079167949878/brands/1079167949878 passed in via var.iap_brand_name.
  • PR 3c-fix2 — Per-user allow-list. Two clean retries of google_project_iam_member.iap_access["domain:ume.com.br"] rolled back with "Provider produced inconsistent result after apply" (google provider bug on conditional IAM member create with domain: members). Switched to per-user allow-list: ext_marcello.pontes@ume.com.br, wagner.jorge@ume.com.br, leonardo.luiz@ume.com.br.
  • PR 3c-fix3 — Plan SA IAP read role. CI plan hit 403 refreshing google_iap_client. roles/viewer does not cover clientauthconfig.*. Added a new custom role tfIapReader in bootstrap with clientauthconfig.brands.{get,list} + clientauthconfig.clients.{getWithSecret,listWithSecrets} and bound it to tf-plan-sa. Also added invariant #11 to CLAUDE.md: always verify plan-SA + apply-SA permission coverage before landing new GCP resource types downstream.
  • PR 3c-fix4 — GCPBackendPolicy shape + listener binding. GKE Gateway controller rejected the BackendPolicy with "Oauth2ClientSecret specified without ClientID" and "must have exactly 1 key-value pair in field Data, found 2". Split the OAuth credentials: spec.default.iap.clientID now carries the plain client ID, the referenced kubernetes_secret_v1 holds a single key with only the client secret. Same PR pinned the Airflow HTTPRoute to sectionName = "https" (otherwise it bound to both listeners and beat the redirect HTTPRoute to :80 traffic) and added an explicit / path match to the redirect HTTPRoute so it claims everything on :80.
  • PR 3c-fix5 — Drop IAP IAM condition. Google sign-in succeeded but IAP still denied users with "You don't have access". Reason: IAP's authorization path for Gateway-API backends reads the IAP-resource-level policy on the backend service, not project-level IAM with IAM conditions. The conditional grant was inert. Dropped the condition; per-user allow-list remained the tight scoping.
  • PR 3c-fix6 — Kill double-login. Enabled [core] simple_auth_manager_all_admins = true so SimpleAuthManager treats every request as admin. Pinned [core] auth_manager = airflow.api_fastapi.auth.managers.simple.simple_auth_manager.SimpleAuthManager in the same config block because the default image ships apache-airflow-providers-fab and FAB otherwise wins get_auth_manager() — SimpleAuthManager's middleware then hands a SimpleAuthManagerUser to FAB's serialize_user which crashes on .id. Last piece: auto-disable the chart's createUserJob when all-admins is on, since airflow users create calls FAB's AirflowSecurityManagerV2.find_role which doesn't exist under SimpleAuthManager — Job was crash-looping and blocking the Helm upgrade.

# Key decisions

  • Gateway API over classic Ingress. Classic GKE Ingress creates one GCLB per Ingress — cannot share a static IP across services. Gateway API (gke-l7-global-external-managed) supports one Gateway → one IP → many HTTPRoutes, which fits the shared-IP + wildcard-DNS + per-app-ingress model.
  • Wildcard Certificate Manager cert with DNS-01. ManagedCertificate CRD is HTTP-01 only and doesn't support wildcards. Certificate Manager's DNS-01 challenge runs against our own Cloud DNS zone — activation bounded by zone propagation (minutes), not external registrar propagation. Covers every *.umedev.marpont.es subdomain for both Airflow now and DataHub in Phase 2.
  • New dev-02-k8s-base platform layer. The original spec put Gateway in the runtime stack; that muddled app/platform concerns. Pulled Story 8's layer forward. DataHub and future platform services (Prometheus, CSI) will land in this layer.
  • DNS in dev-01-base. Zero Kubernetes dependency; keeps k8s providers out of the base stack.
  • Rename runtime to dev-03-runtime. Numbering stays monotonic (01 base, 02 k8s-base, 03 runtime). State migrated by gsutil cp of default.tfstate once at the new prefix — no local terraform.
  • Per-service IAP module, brand in stack. IAP brand is project-singleton and in this case a one-time manual Console step; every app consumes it via var.iap_brand_name. modules/iap-oauth/ creates OAuth client, k8s secret, GCPBackendPolicy, and IAM bindings — reusable for DataHub.
  • HTTPRoute inside modules/airflow-helm/. Apps own their ingress wiring. Gateway is shared and passed in by name. (Sketched after this list.)
  • Gateway in its own namespace with allowedRoutes.from = All. Avoids ReferenceGrant for cross-namespace HTTPRoute attachment. Backend references stay intra-namespace.
  • Per-user IAP allow-list. Three user members on roles/iap.httpsResourceAccessor — tight enough for PoC without needing IAM conditions (which turned out not to work anyway for Gateway-API backends).
  • No IAM condition on the IAP grant. Tried resource.type == "iap.googleapis.com/WebBackendService" to scope the role; the binding created cleanly but was invisible to IAP at authorization time. IAP for Gateway API reads the IAP-resource-level policy on the backend, not project-level conditional IAM. Dropped the condition. When a second IAP-protected backend lands, switch to google_iap_web_backend_service_iam_member scoped per service.
  • IAP brand created manually in Console. google_iap_brand resource can't create brands outside Workspace orgs and the API is being phased out. Documented as a prerequisite in the stack's iap.tf header.
  • SimpleAuthManager + all-admins, no second login. Once IAP enforces identity at the LB, double-authing through Airflow's login screen adds no security and confuses users. simple_auth_manager_all_admins = true skips it; the module pins auth_manager = SimpleAuthManager automatically and disables the chart's createUserJob so no FAB-specific code paths run.
  • Plan SA needs permissions beyond roles/viewer. tfK8sSecretsReader (from Story 4b era) + new tfIapReader (this story) + roles/container.viewer + roles/secretmanager.secretAccessor. The invariant added to CLAUDE.md says to check this before every new downstream resource type.
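
A sketch of the per-app HTTPRoute the airflow-helm module creates when httproute_enabled is set, pinned to the https listener per PR 3c-fix4. Field values (route name, backend Service name, port) are assumptions based on the chart's api-server Service.

```hcl
resource "kubernetes_manifest" "airflow_httproute" {
  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "HTTPRoute"
    metadata = {
      name      = "airflow"
      namespace = kubernetes_namespace_v1.airflow.metadata[0].name
    }
    spec = {
      hostnames = [var.hostname]
      parentRefs = [{
        name        = var.gateway_name
        namespace   = var.gateway_namespace
        sectionName = "https" # avoids also binding to :80 and shadowing the redirect route
      }]
      rules = [{
        backendRefs = [{
          name = "airflow-api-server"
          port = 8080
        }]
      }]
    }
  }
}
```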

# What was deferred

  • Switching roles/iap.httpsResourceAccessor bindings from project-scope to google_iap_web_backend_service_iam_member scoped to each backend service, once more than one IAP-protected backend exists and the Gateway-generated names are stable.
  • Replacing roles/iap.admin on tf-apply-sa with a tighter custom role.
  • Narrowing roles/storage.objectAdmin for ume-airflow to specific buckets (Story 4b hardening note).
  • Enabling the metadata-db cleanup CronJob.
  • Cleaning up the orphan domain:scudra.com binding on roles/iap.httpsResourceAccessor.
  • Deleting the old gs://ume-tf-state-poc-ume-data/environments/dev-02-runtime/ state prefix after a safety period.
  • Promoting the Gateway (currently inline in dev-02-k8s-base) into a modules/gke-gateway/ module when prod forces replication.
  • Flipping admin_user_enabled = false explicitly on the runtime Airflow module call (the module auto-disables when all_admins is on, but belt-and-braces explicitness is nicer).

# Verification

  • terraform fmt -check -recursive passes across all changed stacks and modules
  • terraform validate passes on all four stacks
  • DNS zone delegated; dig NS umedev.marpont.es @8.8.8.8 returns the 4 Google name servers
  • Certificate Manager wildcard cert reaches ACTIVE
  • Gateway PROGRAMMED=True in ume-data-dev-gateway namespace
  • kubernetes_manifest.httproute in airflow namespace accepted
  • GCPBackendPolicy attached to airflow-api-server Service; kubernetes_secret_v1 with client_id/client_secret present
  • IAP brand visible: gcloud iap oauth-brands list --project=poc-ume-data
  • 3 user IAM bindings on roles/iap.httpsResourceAccessor (unconditional, per PR 3c-fix5)
  • curl -sI http://airflow.umedev.marpont.es/ returns 301 to https
  • curl -sI https://airflow.umedev.marpont.es/ returns 302 to accounts.google.com/o/oauth2/v2/auth?client_id=...
  • Browser sign-in as allow-listed ume.com.br user reaches the Airflow UI directly — no second login
  • Port-forward break-glass works (SimpleAuthManager trusts every request as admin)

# Story 4d + 5 — Custom Airflow Image + First Cosmos DAG

Status: done (validated end-to-end; content later moved to ume-data-dags — see below) Date: 2026-04-18 Depends on: Story 4c Bundled: Stories 4d and 5 combined — Story 4d is only meaningfully "done" once Story 5 proves the image works end-to-end. Ships as two PRs because the tfvars airflow_image_tag value needs an image that only exists after PR 1's merge.

# Two-phase deployment

PR 1 (this commit) — image builder, content, IAM, bootstrap, docs:

  • resources/docker/Dockerfile — extends apache/airflow:3.2.0; installs astronomer-cosmos~=1.14, dbt-core~=1.9, dbt-bigquery~=1.9 against Airflow's Python 3.12 constraint set. Build-time guardrails (which dbt, import cosmos, FAB-provider check).
  • resources/docker/requirements.txt, .dockerignore.
  • resources/scripts/build-image.sh — local build helper with identical tag convention to CI.
  • resources/dbt/dbt_project.yml, profiles.yml (BQ OAuth via workload identity), two example models with a ref() edge, schema.yml.
  • resources/dags/cosmos_dbt_dag.py — Cosmos DbtDag in LOCAL mode. schedule=None, is_paused_upon_creation=True, default_args with owner/retries.
  • .github/workflows/airflow-image.yml — build + push on resources/docker/ changes; tags 3.2.0-<merge-sha>; authenticates via existing tf-apply-sa WIF binding (scoped to refs/heads/main).
  • .github/workflows/dag-sync.yml — gcloud storage rsync --delete-unmatched-destination-objects for resources/dags/ and resources/dbt/ on merge to main.
  • .github/workflows/resources-ci.yml — PR lint: hadolint on Dockerfile, python -m py_compile on DAGs, dbt parse on the dbt project. No GCP auth needed.
  • .github/workflows/terraform-apply.yml — new pre-apply step for environments/dev-03-runtime: waits up to 15 min for the expected image tag to appear in AR. Dormant when the runtime still points at apache/airflow (PR 1's state).
  • layers/00-bootstrap/main.tf — docker_config { immutable_tags = true } on the AR repo; tags become tamper-proof.
  • environments/dev-01-base/iam.tf — roles/bigquery.jobUser added to both ume-airflow and ume-airflow-kpo. Without it, dbt-bigquery cannot create BigQuery jobs (bigquery.dataEditor doesn't grant bigquery.jobs.create).
  • modules/airflow-helm/variables.tf — image_repository description tightened to call out Artifact Registry paths.
  • Doc fixes: 06-airflow.md Cosmos example (/opt/airflow/dags/dbt + /home/airflow/.local/bin/dbt), 05-ci-cd.md (stale dev-02-runtime reference), agents/composer-dags.md (rewrite to FUSE reality), 11-deployment-stories.md (Story 4d + 5 specs updated to match shipped reality).
  • backlog.md — follow-ups: dedicated content-push SA for prod, scoped storage.objectAdmin, worker-memory monitoring.

PR 2 (operator action after PR 1 merge) — tfvars bump:

  • Grab the tag from the airflow-image.yml run summary on main (format 3.2.0-<sha>).
  • Update environments/dev-03-runtime/terraform.tfvars:

    airflow_image_repository = "us-east1-docker.pkg.dev/poc-ume-data/ume-composer-images/airflow"
    airflow_image_tag        = "3.2.0-<sha-from-ar>"
  • On merge: terraform-apply's wait-for-image gate confirms the tag (instant, since it's been in AR since PR 1), then terraform apply rolls the pods.
  • Un-pause ume_dbt_example in the UI; trigger; verify the Story 5 checklist.

# Key decisions (captured in the master plan)

  • Bundled Stories 4d + 5 into one feature (two PRs) because validating 4d requires running the DAG from 5. The image is only "done" when it runs a real workload.
  • astronomer-cosmos 1.14+ is required for Airflow 3.2 — 1.11 and earlier predate that support. Build-time import cosmos + pip show apache-airflow-providers-fab checks catch drift.
  • Cosmos LOCAL mode over the read-only FUSE mount is safe — Cosmos copies the project to a per-task tmp dir before invoking dbt. No DBT_LOG_PATH/DBT_TARGET_PATH overrides needed.
  • Wait-for-image gate in terraform-apply.yml replaces the original "accept the image-pull race" stance. Fails fast at 15 min if the image workflow didn't produce the expected tag.
  • Reusing tf-apply-sa for image push and DAG sync for now. Prod will get a dedicated ume-datainfra-content-push SA; backlog.md documents the shape.
  • Two-model dbt example (ref() edge) is the minimum that proves Cosmos's task-graph rendering. Single SELECT 1 didn't.
  • docker_config.immutable_tags = true on the AR repo — tags become a one-way door, matching the immutability invariant in the image lifecycle. (Sketched after this list.)
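
The immutable-tags setting is a one-block addition on the Artifact Registry repo; a sketch, assuming the bootstrap resource is named `images`:

```hcl
resource "google_artifact_registry_repository" "images" {
  project       = var.project_id
  location      = var.region
  repository_id = "ume-composer-images"
  format        = "DOCKER"

  docker_config {
    immutable_tags = true # a pushed tag can never be re-pointed at a different digest
  }
}
```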

# Verification (planned)

After PR 1 merge:

  • airflow-image.yml green — image present in AR (gcloud artifacts docker images list).
  • dag-sync.yml green — gsutil ls gs://ume-airflow-dags-poc-ume-data/{dags,dbt}/.
  • terraform-apply.yml green for bootstrap + dev-01-base. gcloud artifacts repositories describe ume-composer-images --location=us-east1 --format='value(dockerConfig.immutableTags)' → True. roles/bigquery.jobUser visible in project IAM.

After PR 2 merge:

  • terraform-apply.yml gate passes; Helm upgrade completes.
  • Pods running the custom image.
  • kubectl exec deploy/airflow-worker -n airflow -c worker -- python -c 'import cosmos; print(cosmos.__version__)' → ≥ 1.14.
  • IAP sign-in at https://airflow.umedev.marpont.es/ still lands on Airflow UI (no regression of SimpleAuthManager).
  • Un-pause + trigger ume_dbt_example; both dbt tasks succeed; BQ tables exist; task logs in GCS.

# Gotchas discovered during rollout

  1. Airflow 3.2's constraints file clashes with dbt-core on pathspec and protobuf. Fix: install Cosmos in the Airflow Python env (constrained) and dbt in an isolated /home/airflow/dbt-venv/ (unconstrained). dbt_executable_path updated to /home/airflow/dbt-venv/bin/dbt.
  2. GCS FUSE default ImplicitDirs=false hid bucket prefixes — dag-processor saw 0 files. Fix: mountOptions = "implicit-dirs" on the volume attributes in modules/airflow-helm.
  3. Airflow default dagbag_import_timeout=30s was shorter than Cosmos's first-parse dbt ls (~38 s measured). Fix: raised module default to 180 s, exposed via airflow_config.dagbag_import_timeout.
  4. apache/airflow:3.2.0 base image is Python 3.13, not 3.12.

# Story 4d + 5 migration — content to ume-data-dags

Status: done Date: 2026-04-18

After the bundled implementation was validated end-to-end, the three resources/ subtrees (docker, dags, dbt) and the three content-side workflows moved out of ume-data-infra into a dedicated ume-data-dags repo. ume-data-infra now carries only the wait-for-image gate and the airflow_image_tag line that bot-PRs bump on every DAGs-repo merge.

# What moved where

  • resources/docker/ → ume-data-dags/docker/
  • resources/dags/ → ume-data-dags/dags/
  • resources/dbt/ → ume-data-dags/dbt/
  • resources/scripts/build-image.sh → ume-data-dags/scripts/
  • .github/workflows/airflow-image.yml → ume-data-dags/.github/workflows/image.yml
  • .github/workflows/dag-sync.yml → ume-data-dags/.github/workflows/dag-sync.yml
  • .github/workflows/resources-ci.yml → ume-data-dags/.github/workflows/pr-ci.yml
  • New in ume-data-dags: bot-pr.yml — uses a fine-grained PAT (INFRA_PR_TOKEN) scoped to ume-data-infra only, to open tfvars-bump PRs on this repo after a successful image build.

# What changed on the infra side

  • Bootstrap: new ume-datainfra-content-push SA with narrow scopes (AR writer on ume-composer-images only, bucket-level storage.objectAdmin on ume-airflow-dags-poc-ume-data only, WIF binding to 1edata/ume-data-dags). WIF provider's attribute_condition updated to accept both repos.

  • Bootstrap: three narrow custom roles for tf-apply-sa — tfWifProviderUpdater, tfCustomRoleManager, tfArRepoIamAdmin. Needed once to break the chicken-and-egg for the new SA + custom role + AR IAM resources, then self-sustaining via CI.

  • terraform-apply.yml: wait-for-image gate retained. Still essential — every future bot-PR merge pokes it.

  • Docs: 06-airflow.md, 05-ci-cd.md, agents/composer-dags.md, 11-deployment-stories.md updated to reference the new repo.

# End-to-end rollout, validated

ume-data-dags commit to main (touching docker/)
    → image.yml pushes 3.2.0-<sha> to AR
    → bot-pr.yml opens PR on ume-data-infra bumping airflow_image_tag
    → human merges the bot-PR
    → terraform-apply wait-for-image gate confirms the tag in AR
    → Helm rolls scheduler / worker / dag-processor / triggerer /
      api-server onto the new image

First real run: 3.2.0-38e8a3d pushed from ume-data-dags, bot-PR #53 opened on ume-data-infra, merged, pods rolled successfully.

# Plan doc

Not in this repo (kept private per request); the migration followed the design captured in the earlier migrate-to-ume-data-dags.md working doc, with the GitHub App replaced by a fine-grained PAT for simpler ops.


# Story 6 — Workload Pool + DataHub SQL + Password Secret

Status: done Date: 2026-04-18 PR: #55 (merge commit 2ca13f0) Plan doc: plans/story-06-workload-pool-datahub-sql.md

Foundation slice of Phase 2 (DataHub): dedicated node pool for stateful workloads, a second logical database on the shared Cloud SQL instance with password-based auth, and the first Cloud SQL observability alert.

# What changed

  • environments/dev-01-base/terraform.tfvars — added workload-pool (e2-standard-4, min 1 / max 4, label pool=workload, no spot, no taint) to the gke_node_pools map.

  • modules/cloud-sql-db/ (new) — wraps the five-resource logical-DB bundle: random_password, google_sql_database, google_sql_user (BUILT_IN), google_secret_manager_secret, and google_secret_manager_secret_version. Explicit hashicorp/random ~> 3.0 in required_providers (the older cloud-sql-postgres gets away with implicit resolution; the new module does it the right way). Outputs database_name, user_name, password_secret_id — deliberately no password output; consumers resolve the secret at runtime. (Resource shape sketched after this list.)

  • environments/dev-01-base/cloud-sql.tf — single module "datahub_db" call creating DB datahub on module.airflow_sql.instance_name.

  • environments/dev-01-base/outputs.tf — added datahub_db_name, datahub_db_user, datahub_db_host (= module.airflow_sql.private_ip), datahub_db_password_secret_id.

  • environments/dev-02-k8s-base/alerts.tf (new) — first alerts file for this layer. One google_monitoring_alert_policy: Cloud SQL disk utilization > 0.75 for 10 min on ume-data-dev-airflow-pg. Notification channels [] (wired in Story 13).
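
A sketch of the five-resource shape inside modules/cloud-sql-db/; variable names are assumptions standing in for the module's real interface.

```hcl
resource "random_password" "db" {
  length  = 32
  special = false
}

resource "google_sql_database" "db" {
  project  = var.project_id
  instance = var.instance_name
  name     = var.database_name # "datahub" for this story
}

resource "google_sql_user" "db" {
  project  = var.project_id
  instance = var.instance_name
  name     = var.user_name
  type     = "BUILT_IN"
  password = random_password.db.result
}

resource "google_secret_manager_secret" "db_password" {
  project   = var.project_id
  secret_id = var.password_secret_id # e.g. ume-data-dev-datahub-db-password
  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "db_password" {
  secret      = google_secret_manager_secret.db_password.id
  secret_data = random_password.db.result
}
```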

# Key decisions

  • Wrap DB-bound resources in their own module (cloud-sql-db) instead of inlining them in the stack. Challenged during planning — the original master plan §2 said stack-level. Every new app DB is the same five-resource cookie-cutter, and invariant #8 says env-scoped → module from day one; the new module is the correct level of reuse. Leaves cloud-sql-postgres (instance-level) clean.

  • Password auth, not IAM auth. DataHub's five JVM pods would each need a Cloud SQL Auth Proxy sidecar under IAM — ~1.75 vCPU + 1.1 GiB overhead. Password + private IP removes the sidecar.

  • Shared Cloud SQL instance. DataHub's dev metadata is small; reuses Airflow's db-g1-small. Saves ~$26/mo.

  • Workload pool label-only selector (no taint). Kafka / OS / DataHub pin via nodeSelector: { pool: workload }. Airflow doesn't need a taint keeping it off — Airflow has its own kpo-pool plus default-pool scheduling.

  • Alert in dev-02-k8s-base, not dev-01-base. The metric targets a dev-01 instance, but the plan consolidates all Phase 2 alert policies in one file per the master plan §5 so Stories 9/10/13 extend this file. Remote-state already links the stacks.

# Invariant #11 — bootstrap CI IAM

Walked every new resource type against layers/00-bootstrap/main.tf. All covered: node pool / SQL DB / SQL user / secret / secret version / alert policy already exercised by Airflow + existing grants. Secret version payload reads need roles/secretmanager.secretAccessor on both tf-plan-sa and tf-apply-sa — bootstrap lines 179–183 and 200–204 already grant exactly that. No bootstrap delta this story.

# Gotchas

  • The Story 6 spec referenced module.airflow_sql.private_ip_address; the actual module output is private_ip. Implementation uses the correct name.

  • gh pr merge --squash auto-pulls main locally via rebase, which tripped on pre-existing unstaged changes in the working tree. Merge itself succeeded server-side; local sync done with git merge --ff-only origin/main afterward.

  • gcloud alpha monitoring policies list requires an install prompt on this machine — the GA gcloud monitoring policies list returns the same data without interactive install.

# Verification (post-apply)

  • gcloud container node-pools list → workload-pool, e2-standard-4, min=1, max=4.

  • gcloud sql databases list → datahub (UTF8) alongside airflow + postgres.

  • gcloud sql users list → datahub (BUILT_IN) alongside postgres (BUILT_IN) + ume-airflow@... (CLOUD_IAM_SERVICE_ACCOUNT).

  • gcloud secrets versions list ume-data-dev-datahub-db-password → exactly one enabled version (payload not accessed).

  • gcloud monitoring policies list → Cloud SQL disk > 75% — ume-data-dev-airflow-pg, threshold 0.75, enabled.

# Then

Story 7 installs the Secrets Store CSI Driver + GCP provider so DataHub pods can mount the password secret as an env var at runtime.


# Story 7 — Secrets Store CSI Driver

Status: done Date: 2026-04-18 PR: #57 (merge commit 16242e9) Plan doc: plans/story-07-secrets-store-csi.md

Platform plumbing slice of Phase 2 (DataHub): the base Secrets Store CSI Driver plus the Google Cloud Secret Manager provider, both as DaemonSets in kube-system. Sets up the runtime path Stories 11 and 12 use to mount Secret Manager secrets as env vars on DataHub pods.

# What changed

  • modules/secrets-store-csi/ (new) — wraps two helm_releases:

    • helm_release.driver installs chart secrets-store-csi-driver v1.5.6 from the public kubernetes-sigs Helm repo. Values: syncSecret.enabled = true, enableSecretRotation = false, rotationPollInterval = 2m. Upstream default tolerations (operator: Exists) kept — no node selector so the DaemonSet runs on every pool.

    • helm_release.gcp_provider installs the vendored chart at chart-gcp-provider/ (pointed at ${path.module}/chart-gcp-provider). Values override: tolerations: [{operator: Exists}] because upstream default is [] and the DaemonSet must schedule on Airflow's tainted kpo-pool.

  • modules/secrets-store-csi/chart-gcp-provider/ (new) — verbatim copy of upstream charts/secrets-store-csi-driver-provider-gcp/ at tag v1.12.0 (appVersion 1.12.0, chart version 0.1.0). 7 files: Chart.yaml, values.yaml, templates/{_helpers.tpl, serviceaccount, clusterrole, clusterrolebinding, daemonset}.yaml.

  • modules/secrets-store-csi/README.md (new) — upstream sync procedure, upgrade notes for both charts, Helm v3 CRD-upgrade caveat.

  • environments/dev-02-k8s-base/secrets-store-csi.tf (new) — one-line module "secrets_store_csi" call with labels = local.common_labels.

  • backlog.md — added chart-drift watcher entry (scheduled workflow to open PRs syncing the vendored chart on new upstream tags).
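
Sketched below is roughly what the two helm_release blocks inside modules/secrets-store-csi/ look like per the bullets above (chart versions and value keys come from this story; the release names and the kubernetes-sigs repository URL are assumptions):

```hcl
# Sketch of modules/secrets-store-csi/: two releases, one upstream and one vendored.
resource "helm_release" "driver" {
  name       = "secrets-store-csi-driver"
  namespace  = "kube-system"
  repository = "https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts"
  chart      = "secrets-store-csi-driver"
  version    = "1.5.6"

  values = [yamlencode({
    syncSecret           = { enabled = true }
    enableSecretRotation = false
    rotationPollInterval = "2m"
    # No nodeSelector: the DaemonSet must run on every pool; upstream tolerations already cover taints.
  })]
}

resource "helm_release" "gcp_provider" {
  name      = "secrets-store-csi-driver-provider-gcp"
  namespace = "kube-system"
  chart     = "${path.module}/chart-gcp-provider" # vendored copy of the upstream chart at tag v1.12.0

  values = [yamlencode({
    # Upstream default is []; without this the provider skips Airflow's tainted kpo-pool.
    tolerations = [{ operator = "Exists" }]
  })]
}
```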

# Key decisions

  • Vendor the GCP provider chart. Discovery during planning: the upstream URL in the original Story 7 spec (https://googlecloudplatform.github.io/secrets-store-csi-driver-provider-gcp) 404s — Google does not publish a Helm repo or an OCI chart for the provider. The chart lives only inside the git tree. Copying it in at a pinned tag is the clean way to stay on the native Helm install pattern; drift is mitigated by the backlog-tracked watcher.

  • Module-first from day one. Two releases + a vendored chart directory warrant encapsulation; the caller stays a one-liner. Prod-02-k8s-base will replicate this call unchanged.

  • No nodeSelector on the driver. Pinning to workload-pool would break CSI mounts for any future Airflow pod on default-pool or kpo-pool. Default tolerations already tolerate every taint.

  • Explicit tolerations on the GCP provider. Upstream default is []; without the operator: Exists override the provider DaemonSet skips tainted pools and renders Secret Manager reads unavailable to pods on those pools.

  • syncSecret.enabled = true. DataHub's chart expects env-var references via secretKeyRef on a native k8s Secret; CSI mounts alone would not satisfy the chart. Sync mode projects mounts into real Secrets.

  • Rotation off. DataHub password is Terraform-generated and stable. Revisit in Story 13.

# Invariant #11 — bootstrap CI IAM

Walked through before PR open. No delta required.

  • Helm / Kubernetes providers authenticate to the GKE API with the existing data.google_client_config.default.access_token + remote-state endpoint pathway already wired in environments/dev-02-k8s-base/versions.tf.

  • Once authenticated, every Helm-installed object (DaemonSet, SA, ClusterRole, ClusterRoleBinding, CRD) is authorized by k8s RBAC, not GCP IAM. tf-plan-sa's roles/viewer + roles/container.viewer and tf-apply-sa's roles/editor + roles/container.admin are sufficient — the Airflow Helm release and Gateway manifests already exercise the same pathway in CI.

# Gotchas

  • Helm repo search confusion. Initial attempt to verify upstream versions turned up only the base driver chart in the standard Helm repo. The GCP provider is a different project at a different repo; helm repo add against any GoogleCloudPlatform URL returns 404. This is the signal that forced the vendoring decision.

  • GKE cluster zone vs. region. gcloud container clusters get-credentials ume-data-dev-gke --region us-east1 404s — the cluster is zonal, not regional. Use --zone us-east1-b.

  • Workload-pool has zero nodes right now. min=1 is the autoscaler floor, not a permanent 1-node baseline. The DaemonSets schedule on whichever pools actually have nodes (2 default-pool nodes at apply time → 2 driver pods + 2 provider pods).

# Verification (post-apply)

  • kubectl -n kube-system get pods -l app=secrets-store-csi-driver → 2 pods, 3/3 Running per pod (driver + node-driver-registrar + liveness-probe containers).

  • kubectl -n kube-system get pods -l app=csi-secrets-store-provider-gcp → 2 pods, 1/1 Running per pod.

  • kubectl get crd secretproviderclasses.secrets-store.csi.x-k8s.io → present (v1).

  • kubectl -n kube-system get ds secrets-store-csi-driver -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled}' → 2/2.

  • kubectl -n kube-system get ds csi-secrets-store-provider-gcp ... → 2/2.

# Then

Story 8 installs the Strimzi Kafka operator in strimzi-system, cluster-scoped watch.


# Story 8 — Strimzi Kafka Operator

Status: done Date: 2026-04-18 PR: #59 (merge commit 83deb1b) Plan doc: plans/story-08-strimzi-operator.md

Platform prerequisite for Phase 2's Kafka cluster. Installs the Strimzi cluster operator on GKE with cluster-wide watch, pinned to workload-pool. Establishes the kafka.strimzi.io CRDs Story 9 needs to declare the Kafka CR.

# What changed

  • modules/strimzi-kafka-operator/ (new) — wraps one kubernetes_namespace_v1 + one helm_release:
    • kubernetes_namespace_v1.strimzistrimzi-system namespace with labels = merge(common, { service = "kafka" }).

    • helm_release.operator — chart strimzi-kafka-operator v0.51.0 from https://strimzi.io/charts/. Values: watchAnyNamespace = true, nodeSelector = { pool = "workload" }, tolerations = []. atomic, cleanup_on_fail, wait = true, timeout = 600s.

    • Variables cover namespace, chart_version, watch scope, node selector, tolerations, timeout — all defaulted to the master plan §4 shape; prod will replicate the caller unchanged.

    • Outputs: namespace, chart_version (audit).

  • modules/strimzi-kafka-operator/README.md (new) — chart source, inputs/outputs tables, CRD list, Helm-v3 CRD-upgrade caveat with the manual kubectl apply -f crds/ procedure for schema-changing bumps.
  • environments/dev-02-k8s-base/strimzi.tf (new) — one-line module "strimzi_kafka_operator" call with labels = local.common_labels.
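
A minimal sketch of the module internals described above (chart version, repository, values, and Helm flags are from the bullets; resource and variable names are illustrative):

```hcl
# Sketch of modules/strimzi-kafka-operator/: namespace plus one Helm release.
resource "kubernetes_namespace_v1" "strimzi" {
  metadata {
    name   = var.namespace # "strimzi-system"
    labels = merge(var.labels, { service = "kafka" })
  }
}

resource "helm_release" "operator" {
  name            = "strimzi-kafka-operator"
  namespace       = kubernetes_namespace_v1.strimzi.metadata[0].name
  repository      = "https://strimzi.io/charts/"
  chart           = "strimzi-kafka-operator"
  version         = var.chart_version # 0.51.0
  atomic          = true
  cleanup_on_fail = true
  wait            = true
  timeout         = 600

  values = [yamlencode({
    watchAnyNamespace = true                  # Story 9 places the Kafka CR in its own namespace
    nodeSelector      = { pool = "workload" } # keep default-pool for Airflow and platform addons
    tolerations       = []                    # workload-pool carries no taint
  })]
}
```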

# Key decisions

  • Module, not flat Helm release. Invariant #8 + prior correction on flat Airflow releases: env-scoped platform components ship as modules from day one when prod replication is on the horizon. The caller stays one line.

  • Chart 0.51.0 confirmed live. curl https://strimzi.io/charts/index.yaml returned HTTP 200 / 70 KB before any code change. Values schema (watchAnyNamespace, nodeSelector, tolerations, resources) confirmed against the pulled 0.51.0 tarball, not guessed.

  • No resources override. Upstream default (requests: {200m, 384Mi}, limits: {1000m, 384Mi}) already matches master plan §4 sizing; overriding would be diff noise.

  • Cluster-wide watch. watchAnyNamespace = true. Story 9 can place the Kafka CR in its own kafka namespace without bouncing the operator or editing a watchNamespaces list.

  • Operator pinned to workload-pool. Keeps default-pool reserved for Airflow / platform addons. workload-pool has no taint → empty tolerations.

# Invariant #11 — bootstrap CI IAM

Walked before PR open. No delta required. Same pathway as Stories 4b (Airflow Helm), 4c (Gateway), 7 (CSI driver):

  • Helm + Kubernetes providers authenticate via data.google_client_config.default.access_token + data.terraform_remote_state.base.outputs.gke_endpoint already wired in environments/dev-02-k8s-base/versions.tf.

  • Once authenticated, namespace creation, Helm release lifecycle, Deployment / ClusterRole / ClusterRoleBinding / CRD installs are all k8s RBAC authorized by the cluster — roles/container.viewer (tf-plan-sa, bootstrap line 162) and roles/container.admin (tf-apply-sa, bootstrap line 209) are sufficient.

# Gotchas

  • tf plan job is named validate in CI. The workflow lists one check per stack labelled validate (environments/<stack>) that actually runs terraform plan. Easy to misread as "plan didn't run". The log confirms Plan: 2 to add, 0 to change, 0 to destroy.

  • Helm release creation is slow on cold workload-pool. First apply took 2m01s for helm_release.operator because the pool had zero nodes at the time and the autoscaler had to land one. Within the 600s timeout but worth flagging — subsequent Story 9/10 applies will see similar waits if the pool scales to zero between stories.

  • Strimzi chart serves tarballs from GitHub releases, not from the repo host. https://strimzi.io/charts/index.yaml is the index; the tarball URL inside it points at github.com/strimzi/strimzi-kafka-operator/releases/download/.... Helm resolves this transparently; caller only passes repository.

# Verification (post-apply)

  • kubectl get ns strimzi-system --show-labels → Active, labels env=dev, layer=k8s-base, owner=platform-team, cost_center=data-platform, service=kafka.

  • kubectl -n strimzi-system get pods -o wide → strimzi-cluster-operator-8686cb4f64-w8tnv 1/1 Running on gke-ume-data-dev-gke-workload-pool-f8275362-gdjc.

  • kubectl get node gke-ume-data-dev-gke-workload-pool-... -o jsonpath='{.metadata.labels.pool}' → workload.

  • kubectl -n strimzi-system get deploy strimzi-cluster-operator -o jsonpath='{.spec.template.spec.nodeSelector}' → {"pool":"workload"}.

  • kubectl get crd -o name | grep strimzi.io → 10 CRDs including kafkas, kafkanodepools, kafkatopics, kafkausers, strimzipodsets, kafkaconnects, kafkarebalances, kafkamirrormaker2s, kafkabridges, kafkaconnectors.

  • ✓ Operator logs: Starting ClusterOperator for namespace *, followed by Opened watch for Kafka/KafkaConnect/KafkaBridge/KafkaMirrorMaker2/KafkaRebalance/KafkaNodePool operator.

# Then

Story 9 adds modules/strimzi-kafka/ and declares the Kafka cluster CR (KRaft, 3 controllers + 2 brokers) in dev-03-runtime, plus the PV-utilisation alert in dev-02-k8s-base/alerts.tf.


# Story 9 — Kafka Cluster (KRaft, 3 Controllers + 2 Brokers)

Status: done Date: 2026-04-18 PRs: #61 (merge commit 8c69310) + follow-up #62 (merge commit 36584ab) Plan doc: plans/story-09-kafka-cluster.md

Declared the event bus for DataHub: a KRaft Kafka cluster with 3 dedicated controllers + 2 brokers in a new kafka namespace, managed by the Strimzi operator that landed in Story 8. Also added the broker-PVC utilisation alert the master plan §5 called for.

# What changed

  • modules/strimzi-kafka/ (new) — environment-scoped module wrapping kubernetes_namespace_v1.kafka + 2 x kubernetes_manifest (KafkaNodePool, roles controller and broker) + 1 x kubernetes_manifest (Kafka CR).

    • Kafka 4.2.0 (Strimzi 0.51.0 default, verified against upstream kafka-versions.yaml), metadata version 4.2.
    • Controllers: 3 replicas, 100m/256Mi requests, 1 GiB standard-rwo PVs. Brokers: 2 replicas, 500m/1.5Gi, 10 GiB premium-rwo.
    • Cluster config: default.replication.factor=2, min.insync.replicas=1, log.retention.hours=72, log.retention.bytes=8589934592 (8 GiB cap), log.segment.bytes=536870912, auto.create.topics.enable=false.
    • Internal plaintext listener on 9092; no TLS / SASL (DataHub is the only consumer, same cluster).
    • entityOperator.topicOperator on (no userOperator) — small overhead, future-optional topic-as-code.
    • Soft anti-affinity on kubernetes.io/hostname for both pools.
    • Variables cover every knob the story spec called for plus controller_storage_class (defaults to standard-rwo) and broker_storage_class (premium-rwo).
  • modules/strimzi-kafka/README.md (new) — topology table, inputs, upgrade notes for Kafka version bumps + PVC expansion + broker scale-out, links to the pinned kafka-versions.yaml.

  • environments/dev-03-runtime/kafka.tf (new) — one-line module call, cluster_name = ume-data-dev-kafka, namespace kafka.

  • environments/dev-02-k8s-base/alerts.tf — appended google_monitoring_alert_policy.kafka_broker_pv. Metric kubernetes.io/pod/volume/utilization filtered by namespace_name=kafka, threshold 0.70 for 10 minutes. Notification channels empty — wired in Story 13.
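
As a sketch, the broker KafkaNodePool from the modules/strimzi-kafka bullet above could be expressed as a kubernetes_manifest like this (replica count, resources, storage class, and deleteClaim follow this story; the controller pool and the Kafka CR carrying listeners, cluster config, and entityOperator follow the same pattern and are omitted):

```hcl
# Sketch of the broker KafkaNodePool inside modules/strimzi-kafka/.
resource "kubernetes_manifest" "broker_pool" {
  manifest = {
    apiVersion = "kafka.strimzi.io/v1beta2"
    kind       = "KafkaNodePool"
    metadata = {
      name      = "brokers"
      namespace = var.namespace
      # Binding label on the CR metadata only; it must not leak into pod templates (see Gotchas).
      labels = { "strimzi.io/cluster" = var.cluster_name }
    }
    spec = {
      replicas = 2
      roles    = ["broker"]
      resources = {
        requests = { cpu = "500m", memory = "1.5Gi" }
      }
      storage = {
        type = "jbod"
        volumes = [{
          id          = 0
          type        = "persistent-claim"
          size        = "10Gi"
          class       = "premium-rwo"
          deleteClaim = false # guard against terraform destroy wiping Kafka data
        }]
      }
    }
  }
}
```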

# Key decisions

  • Dedicated controllers + KRaft. Three controllers give an odd-quorum that tolerates one outage; combined-role with 2 brokers would have left the cluster unable to elect a leader on a single controller pod failure. Added ~0.8 GiB RAM total for a major availability win (master plan §4).

  • Module from day one. Invariant #8 plus precedent from the Airflow-module correction: env-scoped platform components get a module even with a single current caller, because prod-03-runtime will replicate the call unchanged.

  • PD-SSD for brokers, pd-balanced for controllers. Kafka is IOPS-sensitive on retention sweeps; KRaft controllers write tiny sequential metadata and do not need SSD. Cuts cost on the 3 controllers while keeping broker IO responsive.

  • RF=2 with min.insync.replicas=1. Hedged against the user's prior 1-replica CNPG incident: one broker can be offline during a rolling upgrade without losing write availability. Prod bumps to RF=3 with 3 brokers (backlog).

  • Internal plaintext listener. DataHub is the only consumer and runs in the same cluster. TLS/SASL adds cert rotation plumbing for no threat-model gain at this stage — deferred to Story 13.

  • deleteClaim = false on both node pools. Explicit guard against a terraform destroy wiping Kafka data. Same principle as prevent_destroy on stateful GCP resources.

  • auto.create.topics.enable=false. DataHub's kafka-setup Job creates its topics explicitly; auto-create masks config mistakes.

  • Alert threshold at 70%. log.retention.bytes=8 GiB caps broker PVC growth near 80%; firing at 70% gives time to bump PVC size before retention stops reclaiming.

  • Ship the initial Kafka CR even with no pods yet. Strimzi reports READY=True on first reconciliation after spec validation; pods land seconds later once the operator generates the StrimziPodSets. Pattern matches Story 8's 2m first-apply note on a cold workload-pool.

# Invariant #11 — bootstrap CI IAM

Walked before PR open. No delta required.

  • kubernetes_manifest resources targeting Strimzi CRDs use the same Helm/Kubernetes provider pathway as Story 4c (Gateway) and Story 4b (Airflow HTTPRoute): data.google_client_config token + remote-state GKE endpoint, with object-level authorization via k8s RBAC.

  • roles/container.viewer (plan SA) suffices for GET + CRD-schema discovery on plan. roles/container.admin (apply SA) covers CREATE/PATCH on the kafkas.kafka.strimzi.io + kafkanodepools.kafka.strimzi.io custom resources, confirmed empirically by the CI apply succeeding on first run.

  • google_monitoring_alert_policy is a GCP resource — covered by plan-SA roles/viewer + apply-SA roles/editor, same pathway as Story 6's Cloud SQL disk alert.

# Gotchas

  • Strimzi reserves every strimzi.io/* label on objects the operator creates. First apply looked clean (CR READY=True, 5 PVCs Pending as expected with WaitForFirstConsumer storage classes) but no pods ever scheduled. Operator log showed InvalidResourceException: User provided labels or annotations includes a Strimzi annotation: [strimzi.io/cluster] on every reconciliation. Cause: strimzi.io/cluster was present in the pod template metadata.labels via a shared local.node_pool_labels. Fix (PR #62): split the locals — pool_binding_labels (with strimzi.io/cluster) stays on KafkaNodePool top-level metadata only; pod templates get kafka_labels without any strimzi.io/* key. Cluster reconciled in under 2 minutes after the fix apply.

  • spec.kafka.replicas + spec.kafka.storage are no longer required. Initial Kafka CR carried replicas: 1 + storage: { type: ephemeral } stubs on the theory that the CRD schema still required them (older Strimzi docs say so). In 0.51.0 these fields emit DeprecatedFields warnings and are otherwise ignored when KafkaNodePools drive topology. Removed in the follow-up PR.

  • The workload-pool min=1 floor was in play. The pool was still warm from Story 8, so broker pods landed immediately without waiting on the autoscaler. The five Kafka pods + 1 entity-operator all scheduled on the single workload-pool node. Budget still matches the master plan §1 resource table.

  • validate job is the plan job (same as Story 8). Plan output in the validate (environments/...) logs confirmed Plan: 4 to add, 0 to change, 0 to destroy for dev-03-runtime and Plan: 1 to add, 0 to change, 0 to destroy for dev-02-k8s-base on PR #61. Follow-up #62 showed Plan: 0 to add, 3 to change, 0 to destroy on dev-03-runtime — the expected in-place update to the three CRs.
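
The label split from the first gotcha above (PR #62) might look roughly like this; the local names appear in that gotcha, the exact merge contents are illustrative:

```hcl
# Labels safe for pod templates vs. labels that bind a node pool to its Kafka CR.
locals {
  kafka_labels = merge(var.labels, { service = "kafka" }) # no strimzi.io/* keys, safe on pod templates

  pool_binding_labels = merge(local.kafka_labels, {
    "strimzi.io/cluster" = var.cluster_name # reserved label: KafkaNodePool top-level metadata only
  })
}
```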

# Verification (post-apply)

  • kubectl get ns kafka --show-labels → Active with labels env=dev, layer=runtime, service=kafka, owner=platform-team, cost_center=data-platform.
  • kubectl -n kafka get kafka → ume-data-dev-kafka READY=True (no warnings).
  • kubectl -n kafka get kafkanodepool → brokers DESIRED=2 ROLES=[broker] NODEIDS=[0,1], controllers DESIRED=3 ROLES=[controller] NODEIDS=[2,3,4].
  • kubectl -n kafka get strimzipodset → ume-data-dev-kafka-brokers 2/2, ume-data-dev-kafka-controllers 3/3.
  • kubectl -n kafka get pods → 3 controllers + 2 brokers + 1 entity-operator, all 1/1 Running, 0 restarts.
  • kubectl -n kafka get pvc → 5 PVCs Bound: data-ume-data-dev-kafka-brokers-0,1 (10Gi premium-rwo) and data-ume-data-dev-kafka-controllers-2,3,4 (1Gi standard-rwo).
  • kubectl -n kafka get pod ume-data-dev-kafka-brokers-0 -o jsonpath='{.spec.nodeName}' → gke-ume-data-dev-gke-workload-pool-f8275362-gdjc — workload-pool placement confirmed.
  • kubectl -n kafka get svc → ume-data-dev-kafka-kafka-bootstrap ClusterIP on 9091/9092, plus the headless ume-data-dev-kafka-kafka-brokers service.
  • ✓ Alert policy landed via CI apply on environments/dev-02-k8s-base (Plan: 1 to add, 0 to change, 0 to destroy — the google_monitoring_alert_policy.kafka_broker_pv resource). No notification channels — wired in Story 13.

# Then

Story 10 provisions the OpenSearch operator in dev-02-k8s-base, the single-node OpenSearch cluster + snapshot CronJob in dev-03-runtime, and the snapshot bucket + ume-opensearch-snapshot GSA in dev-01-base.


# Story 10 — OpenSearch operator + cluster (snapshot scaffolding)

Status: done Date: 2026-04-18 PRs: #64 (operator + scaffolding) + #65 (webhook) + #66 (cluster CR) + #67 (API group) + #68 (bootstrap env) + #69 + #70 (force_conflicts) + #71 (gotchas doc) + #72 + #73 (self-bootstrap) + this one (status entry) Plan doc: plans/story-10-opensearch.md

Landed the metadata search + graph-index backend for DataHub: a single-node OpenSearch 2.19.5 cluster in a new opensearch namespace, managed by the opensearch-k8s-operator 2.8.4 Helm chart. Also shipped the snapshot bucket, ume-opensearch-snapshot GSA, bucket-scoped roles/storage.objectAdmin, and a Workload Identity binding to opensearch/opensearch-snapshot KSA — all as scaffolding for a future snapshot CronJob. The CronJob itself is out of scope this story (see below).

# What changed

  • modules/opensearch-operator/ (new) — environment-scoped module wrapping kubernetes_namespace_v1.operator + helm_release.opensearch_operator.
    • Chart opensearch-operator 2.8.4 from https://opensearch-project.github.io/opensearch-k8s-operator/.

    • webhook.enabled = false to skip the cert-manager-backed ValidatingWebhookConfiguration (we don't run cert-manager).

    • Operator pinned to the workload pool via manager.nodeSelector.

  • modules/opensearch-cluster/ (new) — wraps the opensearch namespace + the snapshot KSA (WI annotation to the GSA) + the OpenSearchCluster CR + the OpenSearchISMPolicy CR.
    • OpenSearch 2.19.5, 1 data node (cluster_manager + data + ingest), JVM heap 512m, 5Gi premium-rwo PVC, security plugin disabled (plugins.security.disabled = "true" in additionalConfig + env on both bootstrap and data pods).
    • Dashboards off (DataHub has its own UI).
    • field_manager.force_conflicts = true on both CRs because the operator owns spec.nodePools, spec.bootstrap.diskSize, and spec.states after create.
    • cluster.initial_master_nodes overridden via nodePools[0].env to ${cluster}-nodes-0 so the data node self-bootstraps as initial master (see Gotchas).
    • ISM policy ume-retention deletes indices older than 30 days.
  • environments/dev-01-base/buckets.tf (new) + environments/dev-01-base/iam.tf (append) + environments/dev-01-base/outputs.tf (append) — snapshot bucket ume-opensearch-snapshots-poc-ume-data (35d lifecycle delete, versioning off), ume-opensearch-snapshot GSA, bucket-scoped roles/storage.objectAdmin, WI binding to opensearch/opensearch-snapshot KSA, two new outputs.
  • environments/dev-02-k8s-base/opensearch.tf (new) — one-line module call for the operator.
  • environments/dev-02-k8s-base/alerts.tf (append) — google_monitoring_alert_policy.opensearch_pv: metric kubernetes.io/pod/volume/utilization filtered by namespace_name=opensearch, threshold 0.70 for 10 minutes.
  • environments/dev-03-runtime/opensearch.tf (new) — one-line module call consuming opensearch_snapshot_sa_email from remote state.
  • modules/opensearch-cluster/README.md documents the three non-obvious requirements learned during bring-up: force_conflicts, mirrored bootstrap env, opensearch.org API group.
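
A sketch of the dev-01-base snapshot scaffolding described above (bucket name, GSA, bucket-scoped role, and the opensearch/opensearch-snapshot KSA binding are from this story; the lifecycle block shape and variable names are illustrative):

```hcl
# Sketch of the snapshot scaffolding in dev-01-base.
resource "google_storage_bucket" "opensearch_snapshots" {
  name     = "ume-opensearch-snapshots-poc-ume-data"
  location = "us-east1"

  lifecycle_rule {
    condition {
      age = 35
    }
    action {
      type = "Delete"
    }
  }
}

resource "google_service_account" "opensearch_snapshot" {
  account_id = "ume-opensearch-snapshot"
}

resource "google_storage_bucket_iam_member" "snapshot_object_admin" {
  bucket = google_storage_bucket.opensearch_snapshots.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.opensearch_snapshot.email}"
}

# Workload Identity: lets the opensearch/opensearch-snapshot KSA act as the GSA, no key file needed.
resource "google_service_account_iam_member" "snapshot_wi" {
  service_account_id = google_service_account.opensearch_snapshot.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[opensearch/opensearch-snapshot]"
}
```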

# Key decisions

  • Snapshots deferred. repository-gcs does not support Workload Identity upstream — it requires an SA JSON key in the OpenSearch keystore, which conflicts with invariant #5 ("no service-account key files"). Ship the bucket + GSA + WI binding + KSA as scaffolding so a future story can wire a credential path (SM-CSI-mounted JSON, external-dump CronJob, …) without a new IAM change. The ISM policy handles local retention; indices are rebuildable from Kafka MAE replay / BigQuery lineage for the PoC-scale dataset.

  • Two-PR split for operator + cluster (mirrors Story 8 → 9). kubernetes_manifest validates against cluster-live CRD schemas at plan time; the operator's CRDs land only after its Helm release applies on main, so a single PR would fail plan-on-PR for dev-03-runtime.

  • opensearch.org API group, not opensearch.opster.io. The operator ships both; the opster.io group is deprecated and runs a migration-only controller that sits idle until an opensearch.org CR exists. Initial CRs were on opster.io; PR #67 switched them.

  • Security plugin off, no Dashboards. Dev-only; hardening in Story 13 (basic auth or mTLS). Dashboards saves a pod + GSA + route since DataHub has its own UI.

  • Module-from-day-one, two modules. opensearch-operator + opensearch-cluster parallel the strimzi split. Prod-02-k8s-base and prod-03-runtime will call them unchanged per invariant #8.

  • premium-rwo for data. 5Gi pd-ssd; OpenSearch indexing is IOPS-sensitive.

# Invariant #11 — bootstrap CI IAM

Walked before PR open. No delta required.

  • google_storage_bucket + google_storage_bucket_iam_member: plan covered by tfResourceIamReader (storage.buckets.getIamPolicy); apply by roles/editor. Precedent: Story 4's Airflow buckets.

  • google_service_account + google_service_account_iam_member (WI binding): plan-SA refresh works today on the airflow + airflow-kpo WI bindings so the pathway is proven; apply covered by the tfIamPolicyAdmin custom role on iam.serviceAccounts.setIamPolicy.

  • helm_release + kubernetes_manifest + kubernetes_namespace_v1 + kubernetes_service_account_v1: roles/container.viewer + roles/container.admin, same as Stories 4/7/8/9.

  • google_monitoring_alert_policy: covered by roles/viewer + roles/editor, same as the Kafka alert.

# Gotchas

  • Chart 2.8.x requires cert-manager by default. First apply failed with no matches for kind "Certificate" in version "cert-manager.io/v1" from the operator's ValidatingWebhookConfiguration. Fix (PR #65): webhook.enabled = false. Trade-off is loss of admission validation on OpenSearchCluster + OpenSearchISMPolicy — safe because Terraform is the only client mutating them.

  • opensearch.opster.io is deprecated and runs migration-only. Initial CRs applied under opster.io/v1; operator logs read "DEPRECATION WARNING: opensearch.opster.io API group is deprecated" and "Old cluster is not ready, skipping migration", and no primary controller ran. Fix (PR #67): switch apiVersion to opensearch.org/v1. The two groups share CRD schemas and kind names exactly.

  • kubernetes_manifest schema-merge fails across apiVersion changes. Plan errored with "Failed to update proposed state from prior state" on the apiVersion change. Worked around by renaming the Terraform resource addresses (opensearch_cluster → cluster, opensearch_ism_retention → ism_retention) to force a destroy+create rather than an in-place update.

  • Kind names are case-sensitive on ISM. OpenSearchISMPolicy (not OpensearchISMPolicy). kubectl get crd confirms via spec.names.kind — consult that before writing the manifest.

  • general.additionalConfig applies to ALL pods, but nodePools[].env does NOT cover the bootstrap pod. Without DISABLE_INSTALL_DEMO_CONFIG on the bootstrap pod, docker-entrypoint ran the demo-security setup, tripped over the disabled security plugin, and the pod died before registering as cluster-manager. Fix (PR #68): mirror the env onto spec.bootstrap.env and pin spec.bootstrap.nodeSelector = { pool = "workload" }.

  • Operator claims ownership of several subfields via SSA. Apply failed on field-manager conflicts for spec.nodePools, spec.bootstrap.diskSize (cluster CR, PR #69) and spec.states (ISM CR, PR #70). Added field_manager { force_conflicts = true } on both resources so Terraform re-asserts its declared shape without oscillating.

  • discovery.type: single-node is incompatible with the operator-injected cluster.initial_master_nodes env. OpenSearch explicitly errors out with setting [cluster.initial_master_nodes] is not allowed when [discovery.type] is set to [single-node]. Fix (PR #73): drop the single-node setting and override the env to ${cluster}-nodes-0 — duplicate env vars resolve last-write-wins, so the override beats the operator's default and the data pod self-bootstraps as initial master.

  • The operator kills the bootstrap pod prematurely on a single-node cluster. The operator creates bootstrap-0, waits only for the first StatefulSet replica to be Ready (tcp probe), then deletes bootstrap. The data pod then loops forever on cluster_manager_not_discovered_exception because its env still references bootstrap-0. The self-bootstrap env override above sidesteps the race entirely. With 3+ data nodes the operator's bootstrap flow works, but for a 1-node cluster you must self-bootstrap.

  • Recovery from wedged state requires CR deletion. Once the operator marks status.phase=RUNNING, initialized=true, it won't re-run bootstrap. Deleting the StatefulSet + PVC alone doesn't help; the operator recreates them with the same stale env. The only clean recovery is kubectl delete opensearchcluster.opensearch.org/<name>, let Terraform recreate on next apply.

  • ISM CR managedCluster reference is sticky. After the cluster CR is destroyed and recreated, the ISM CR's status.managedCluster still points at the old UID and the operator errors with "cannot change the cluster a resource refers to". Cleared by kubectl delete opensearchismpolicy.opensearch.org/<name>; next apply recreates the CR bound to the current cluster.

  • ConfigMap mount propagation has kubelet lag. Changing general.additionalConfig updates the ume-data-dev-opensearch-config ConfigMap, but the running pod's mounted opensearch.yml can be ~1 minute stale. If the change requires a new pod (env change, not just YAML), kubelet sync lag can mean the restarted pod reads stale YAML for its first boot.
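
The force_conflicts fix from the SSA gotcha above reduces to a field_manager block on the kubernetes_manifest resources, roughly like this (the CR spec itself is elided; local and variable names are illustrative):

```hcl
# Sketch of the PR #69/#70 fix. The operator SSA-owns spec.nodePools and spec.bootstrap.diskSize on the
# cluster CR (and spec.states on the ISM CR) after create; force_conflicts lets Terraform re-assert
# its declared shape instead of failing on field-manager conflicts.
resource "kubernetes_manifest" "cluster" {
  manifest = {
    apiVersion = "opensearch.org/v1"
    kind       = "OpenSearchCluster"
    metadata   = { name = var.cluster_name, namespace = var.namespace }
    spec       = local.cluster_spec # nodePools, bootstrap env, additionalConfig as described above
  }

  field_manager {
    force_conflicts = true
  }
}
```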

# Verification (post-apply)

  • gcloud storage buckets describe gs://ume-opensearch-snapshots-poc-ume-data → location us-east1, 35d delete lifecycle, versioning off, 5 labels incl. service=opensearch.
  • gcloud iam service-accounts get-iam-policy ume-opensearch-snapshot@poc-ume-data.iam.gserviceaccount.com → roles/iam.workloadIdentityUser member serviceAccount:poc-ume-data.svc.id.goog[opensearch/opensearch-snapshot].
  • gcloud storage buckets get-iam-policy gs://ume-opensearch-snapshots-poc-ume-data → ume-opensearch-snapshot@… bound as roles/storage.objectAdmin.
  • kubectl get ns opensearch-operator opensearch --show-labels → both Active with the 6 mandatory labels.
  • kubectl -n opensearch-operator get pods -o wide → operator pod Running 1/1 on a pool=workload node.
  • kubectl get crd | grep opensearch → 20+ CRDs including opensearchclusters.opensearch.org and opensearchismpolicies.opensearch.org.
  • kubectl -n opensearch get opensearchcluster.opensearch.org → HEALTH=green, NODES=1, VERSION=2.19.5, PHASE=RUNNING.
  • kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- curl -s localhost:9200/_cluster/health → status=green, number_of_nodes=1, discovered_cluster_manager=true, active_primary_shards=3, active_shards_percent_as_number=100.0.
  • kubectl -n opensearch get pods -o wide → 1 data pod Running 1/1 on pool=workload.
  • kubectl -n opensearch get pvc → 1 PVC bound, 5Gi, premium-rwo.
  • kubectl -n opensearch get sa opensearch-snapshot -o yaml → annotation iam.gke.io/gcp-service-account = ume-opensearch-snapshot@….
  • kubectl -n opensearch get opensearchismpolicy.opensearch.org → ume-retention present (recreated clean after UID drift cleanup).
  • gcloud monitoring policies list --project=poc-ume-data → OpenSearch PV > 70% — opensearch namespace enabled=True.

# Then

Story 11 lands DataHub via modules/datahub-helm/ in dev-03-runtime, reading bootstrap_servers from modules/strimzi-kafka and the OpenSearch service_host from modules/opensearch-cluster, plus the Cloud SQL password from Story 6's Secret Manager CSI mount. Story 10's snapshot scaffolding stays inert until a dedicated follow-up story wires a credential path for repository-gcs or swaps to an external-dump CronJob.


# Story 11 — DataHub Dry-Run (no IAP)

Status: done Date: 2026-04-19 PRs: #76 (preflight — narrow OpenSearch ISM to DataHub time-series indices) + #77 (main — modules/datahub-helm/, dev-03-runtime/datahub.tf, OpenSearch 3-node migration) + #78 (fix — ZooKeeper placeholder for the chart's kafka-setup template) + #79 (fix — pin elasticsearch-setup image tag v1.4.0.3; chart default v1.5.0.1 unpublished) + #80 (fix — implementation: "opensearch" + USE_AWS_ELASTICSEARCH=true so DataHub targets ISM not ILM) + #83 (fix — ume-datahub GSA + WI binding in dev-01-base) + #84 (fix — annotate datahub/datahub KSA with the GSA) + #85 (fix — mounter Job must run as datahub KSA, not default) + #86 (fix — global.sql.datasource.host must be host:port for the chart's tcp wait) + #87 (fix — point every DataHub subchart at the WI-annotated KSA) + this one (status entry).

Plan doc: plans/story-11-datahub-dryrun.md

Landed DataHub v1.5.0 on the existing cluster via chart 0.9.10 (datahub from helm.datahubproject.io). GMS + frontend serving over kubectl port-forward; the stock JAAS datahub/datahub login gates the UI (no IAP, no OIDC yet — Story 12). System-update + setup jobs completed cleanly; Postgres, Kafka (KRaft), and OpenSearch 2.19.5 are all wired. Also bundled the long-overdue OpenSearch 1 → 3 node migration so the self-bootstrap env hack is gone before steady-state ingestion.

# What changed

  • modules/datahub-helm/ (new) — wraps the upstream datahub chart.

    • kubernetes_namespace_v1.datahub + kubernetes_service_account_v1.datahub (annotated with iam.gke.io/gcp-service-account = ume-datahub@… so the Secrets Store CSI driver's datahub-db-password fetch resolves under Workload Identity).

    • kubernetes_manifest.datahub_db_secret_provider_class — CSI SecretProviderClass with syncSecret enabled so the driver materialises a k8s Secret named datahub-db-password the first time a pod mounts it.

    • kubernetes_job_v1.db_secret_mounter — tiny one-shot Job that mounts the SPC as the datahub KSA. wait_for_completion = true makes Terraform hold helm_release.datahub until the k8s Secret exists — the chart's datahub-system-update pre-install hook reads the password via secretKeyRef and wedges in CreateContainerConfigError otherwise.

    • helm_release.datahub — pinned to chart 0.9.10 (appVersion v1.5.0, verified against helm.datahubproject.io/index.yaml on 2026-04-19). Key value overrides baked into the module:

      • global.sql.datasource.host = "<ip>:5432" (chart convention — see quickstart values; the upgrade image parses it as a tcp target), hostForMysqlClient host-only, port separate, url a full JDBC URL, driver = org.postgresql.Driver, password.secretRef + secretKey pointing at the CSI-synced Secret.

      • global.kafka.bootstrap.server = <strimzi bootstrap> plus global.kafka.zookeeper.server = <same placeholder> — the chart's kafka-setup-job.yml template unconditionally dereferences zookeeper.server at render time even in KRaft mode.

      • global.elasticsearch: host, port: 9200, useSSL: false, skipcheck: true, implementation: "opensearch" (GMS + consumer side).

      • elasticsearchSetupJob: enabled: true, image.tag: "v1.4.0.3" (chart default v1.5.0.1 was never pushed to acryldata/datahub-elasticsearch-setup), extraEnvs: USE_AWS_ELASTICSEARCH=true so the setup targets ISM not ILM.

      • kafkaSetupJob: enabled: true (chart default tag v1.2.0.1 is fine).

      • Every subchart (datahub-gms, datahub-frontend, datahub-mae-consumer, datahub-mce-consumer): replicaCount: 1, nodeSelector: { pool = "workload" }, and serviceAccount: { create: false, name: datahub } — each subchart's SA default is create: true with no WI annotation, which broke the CSI mount on GMS.

      • datahub-gms additionally mounts the SPC via extraVolumes + extraVolumeMounts so the mount triggers syncSecret on the first GMS start too (redundant with the mounter Job but harmless).

      • datahub-ingestion-cron: enabled: false and acryl-datahub-actions: enabled: false — Story 12/13.

  • environments/dev-03-runtime/datahub.tf (new) — single module "datahub" call wiring SQL + Kafka + OpenSearch from remote state.

  • modules/opensearch-cluster/ 1 → 3 node migration — data_replicas default 1 → 3, drop the duplicate-env self-bootstrap override on spec.nodePools[0].env. README + 10-operations.md "Current shape (dev)" block updated to match.

  • environments/dev-01-base/iam.tf — new ume-datahub GSA, project-scoped roles/secretmanager.secretAccessor, WI binding from datahub/datahub KSA. Output datahub_sa_email exported for dev-03-runtime.

  • modules/opensearch-cluster/main.tf preflight (PR #76) — ume-retention ISM indexPatterns narrowed from ["*"] to ["datahub_usage_event*", "*_timeseries_v1*"]. Backlog "URGENT" entry retired in the same PR, matching known-issue bullet removed from 10-operations.md.

  • backlog.md — 3-node migration + URGENT ISM entries retired; new entry added for scoping the DataHub GSA's secretmanager.secretAccessor binding to the specific secret (currently project-wide for parity with Airflow).
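
A sketch of the SecretProviderClass described above, with syncSecret projecting the Cloud SQL password into a native Secret (the synced Secret name datahub-db-password comes from this story; the SPC name, object path, and key names are illustrative):

```hcl
# Sketch of the SecretProviderClass: the CSI mount pulls the Secret Manager version, syncSecret
# projects it into a native k8s Secret the chart can reference via secretKeyRef.
resource "kubernetes_manifest" "datahub_db_secret_provider_class" {
  manifest = {
    apiVersion = "secrets-store.csi.x-k8s.io/v1"
    kind       = "SecretProviderClass"
    metadata   = { name = "datahub-db-password", namespace = "datahub" }
    spec = {
      provider = "gcp"
      parameters = {
        secrets = yamlencode([{
          resourceName = "projects/${var.project_id}/secrets/${var.db_password_secret_id}/versions/latest"
          path         = "db-password"
        }])
      }
      secretObjects = [{
        secretName = "datahub-db-password" # the Secret the chart's secretKeyRef points at
        type       = "Opaque"
        data       = [{ objectName = "db-password", key = "password" }]
      }]
    }
  }
}
```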

# Key decisions

  • Preflight PR for the ISM scope fix. Narrowing indexPatterns was a 3-line change that blocked Story 11 ingestion safety. Shipped separately so the DataHub PR review surface stayed focused and the ISM-breaking risk was off the DataHub critical path.

  • OpenSearch 3-node migration bundled. The single-node cluster.initial_master_nodes env hack was fragile and diverged from the operator's happy path. Landing the migration inside Story 11 meant DataHub met a prod-shaped cluster on day one; the rolling restart completed cleanly without hitting the cluster_manager_not_discovered wedge.

  • Module-first. modules/datahub-helm/ ships every knob as a variable (invariant #9); dev-03-runtime/datahub.tf is a single module "datahub" block. Prod replication is the justification, not future callers (invariant #8).

  • Chart-native password.secretRef + secretKey instead of the spec's extraEnvs form — propagates through all subcharts via the chart's datasource stanza and keeps the password out of Helm values rendering.

  • Mounter-Job pattern for CSI sync. DataHub's datahub-system-update pre-install hook reads the SQL password via secretKeyRef — the k8s Secret has to exist BEFORE Helm starts the install. A busybox kubernetes_job_v1 mounts the SPC, verifies the file, exits; Terraform waits for completion, then helm_release fires.

  • JAAS login for the dry-run. oidcAuthentication.enabled: false (chart default). Reachable only via port-forward; OIDC + IAP land in Story 12.
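
A sketch of the mounter-Job pattern from the Key decisions above (the Job name and the datahub KSA are from this story; the image, command, and timeout are illustrative):

```hcl
# Sketch of the mounter Job: mount the SPC once so the synced Secret exists before Helm's
# pre-install hooks try to read it.
resource "kubernetes_job_v1" "db_secret_mounter" {
  metadata {
    name      = "datahub-db-password-mounter"
    namespace = "datahub"
  }

  spec {
    template {
      metadata {}
      spec {
        # Must be the WI-annotated KSA; the namespace default KSA has no GSA binding (see Gotchas).
        service_account_name = "datahub"
        restart_policy       = "Never"

        container {
          name    = "mounter"
          image   = "busybox:1.36"
          command = ["sh", "-c", "test -s /mnt/secrets/db-password"]

          volume_mount {
            name       = "db-password"
            mount_path = "/mnt/secrets"
            read_only  = true
          }
        }

        volume {
          name = "db-password"
          csi {
            driver    = "secrets-store.csi.k8s.io"
            read_only = true
            volume_attributes = {
              secretProviderClass = "datahub-db-password"
            }
          }
        }
      }
    }
  }

  # Terraform holds helm_release.datahub behind this Job, so the synced Secret exists before install.
  wait_for_completion = true
  timeouts {
    create = "5m"
  }
}
```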

# Invariant #11 — bootstrap CI IAM

Walked before each PR opened. One gap found and closed:

  • google_service_account + google_service_account_iam_member (WI binding) — already covered (Airflow precedent). Plan-SA via project-level refresh; apply-SA via the tfIamPolicyAdmin custom role.

  • google_project_iam_member (roles/secretmanager.secretAccessor) — covered by the existing roles/editor grant on apply-SA (secretmanager admin subset) and roles/viewer on plan-SA. No delta.

  • kubernetes_namespace_v1, kubernetes_service_account_v1, kubernetes_manifest (SecretProviderClass), kubernetes_job_v1, helm_release — covered by roles/container.viewer + tfK8sSecretsReader on plan-SA; roles/container.admin on apply-SA. Same coverage that Stories 4 / 7 / 8 / 9 / 10 use.

No layers/00-bootstrap/ changes required.

# Gotchas (the long list)

Every one of these cost at least one PR to uncover:

  • Chart template requires global.kafka.zookeeper.server even when the target Kafka is KRaft. The kafka-setup-job.yml template dereferences it at render time; an omitted field throws nil pointer evaluating interface {}.server. Setting it to the bootstrap address is a safe placeholder — the setup image routes everything through the Kafka Admin API.

  • acryldata/datahub-elasticsearch-setup has no v1.5.x semver tag. Chart 0.9.10 defaults elasticsearchSetupJob.image.tag to global.datahub.version (v1.5.0.1), which was never pushed for this image. Latest published semver is v1.4.0.3. The chart's own kafkaSetupJob.image.tag is hard-pinned to v1.2.0.1 for the same reason. Fix: explicit elasticsearchSetupJob.image.tag: "v1.4.0.3" override.

  • DataHub speaks Elasticsearch ILM by default, not OpenSearch ISM. The setup job hits GET _ilm/policy/datahub_usage_event_policy which 400s on OpenSearch. Two switches needed: global.elasticsearch.implementation: "opensearch" (GMS + consumers) and USE_AWS_ELASTICSEARCH=true on the setup job (extraEnvs). The AWS in the name is misleading — it's the chart-wide OpenSearch toggle.

  • The Secrets Store CSI driver needs Workload Identity on the mounting pod's KSA, not on some shared driver identity. Without a GSA binding on the datahub/datahub KSA the driver falls back to the node GSA and hits secretmanager.versions.access denied. Module now requires gsa_email.

  • Pod service_account_name has to be set explicitly on the mounter Job. kubernetes_job_v1 without service_account_name runs pods as the namespace default KSA — the WI annotation on datahub KSA never applies and the CSI driver hits the same 403. Had to explicitly set spec.template.spec.service_account_name in the Job.

  • Each DataHub subchart creates its own serviceAccount by default. Setting the chart-top-level serviceAccount.name only affects the chart's own templates (setup jobs, mounter-free paths). The datahub-gms, datahub-frontend, datahub-mae-consumer, and datahub-mce-consumer subcharts each default to create: true with no annotation. Fix: override serviceAccount inside each subchart's block.

  • global.sql.datasource.host must be host:port, not just host. The upgrade / system-update image uses the value directly as a tcp target (go-dockerize style); just an IP gives dial tcp: address 10.64.0.3: missing port in address and the pre-install hook hangs. The chart's quickstart values show the convention (host: "prerequisites-mysql:3306").

  • Cancelling a Helm install mid-flight leaves the release pending-install forever. Subsequent applies fail with another operation (install/upgrade/rollback) is in progress. Recovery: helm uninstall datahub -n datahub (leaves Terraform-owned resources alone — namespace, KSA, SPC, mounter Job, CSI-synced Secret); the next apply starts clean. Happened three times this session.

  • Terraform cancellation inside a Helm wait can orphan a GCS state lock. gsutil rm gs://ume-tf-state-poc-ume-data/environments/dev-03-runtime/default.tflock is the recovery; the state file itself is untouched as long as the Helm wait was the only operation in flight.

  • Cross-stack remote_state outputs block plan. Splitting PR #83 (dev-01-base output) from #84 (dev-03-runtime consumer) was mandatory — terraform-plan.yml reads remote state from GCS, which only contains outputs after the producer stack applies. Same pattern as Story 10's two-PR split for operator vs. cluster.
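
Pulling the chart-value gotchas above together, the helm_release values block ends up roughly like the sketch below (only the keys discussed in this story are shown; variable names and exact value types are illustrative, everything else stays at chart defaults):

```hcl
# Sketch of the value overrides on helm_release.datahub.
resource "helm_release" "datahub" {
  name       = "datahub"
  namespace  = "datahub"
  repository = "https://helm.datahubproject.io"
  chart      = "datahub"
  version    = "0.9.10"

  values = [yamlencode({
    global = {
      sql = {
        datasource = {
          host   = "${var.sql_private_ip}:5432" # must be host:port, the upgrade image dials it as a tcp target
          driver = "org.postgresql.Driver"
          # hostForMysqlClient, port, url, and the password secretRef/secretKey are set alongside.
        }
      }
      kafka = {
        bootstrap = { server = var.kafka_bootstrap }
        zookeeper = { server = var.kafka_bootstrap } # placeholder: the kafka-setup template dereferences it even in KRaft mode
      }
      elasticsearch = {
        host           = var.opensearch_host
        port           = 9200
        useSSL         = false
        skipcheck      = true
        implementation = "opensearch" # ISM, not ILM
      }
    }
    elasticsearchSetupJob = {
      enabled   = true
      image     = { tag = "v1.4.0.3" } # chart default v1.5.0.1 was never published
      extraEnvs = [{ name = "USE_AWS_ELASTICSEARCH", value = "true" }]
    }
    "datahub-gms" = {
      replicaCount   = 1
      nodeSelector   = { pool = "workload" }
      serviceAccount = { create = false, name = "datahub" } # keep every pod on the WI-annotated KSA
    }
    # datahub-frontend and both consumers get the same replicaCount / nodeSelector / serviceAccount overrides.
  })]
}
```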

# Verification (post-apply)

  • gh run view <apply> → conclusion=success.
  • helm -n datahub list → datahub revision 1 deployed.
  • kubectl -n datahub get pods: datahub-datahub-frontend and datahub-datahub-gms Running 1/1 on pool=workload; datahub-db-password-mounter, datahub-elasticsearch-setup-job, datahub-kafka-setup-job, datahub-system-update, datahub-system-update-nonblk all Completed.
  • kubectl -n datahub logs deploy/datahub-datahub-gms → Ready: tcp://10.64.0.3:5432 (SQL) and Ready: tcp://ume-data-dev-kafka-kafka-bootstrap.kafka.svc:9092 (Kafka) early in the startup sequence; no ILM / CSI / secretKeyRef errors.
  • kubectl -n opensearch get opensearchcluster.opensearch.org → HEALTH=green, NODES=3, VERSION=2.19.5 after the 1 → 3 migration.
  • kubectl -n opensearch get pods -o wide → 3 data pods Running 1/1 on pool=workload; 3 PVCs bound 5 Gi premium-rwo.
  • kubectl -n datahub get sa datahub -o jsonpath='{.metadata.annotations}' → {"iam.gke.io/gcp-service-account":"ume-datahub@poc-ume-data.iam.gserviceaccount.com"}.
  • gcloud iam service-accounts get-iam-policy ume-datahub@… → roles/iam.workloadIdentityUser member serviceAccount:poc-ume-data.svc.id.goog[datahub/datahub].
  • kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.spec.ismTemplate.indexPatterns}' → narrowed list; status.state=CREATED.
  • Port-forward + browser verification is operator sign-off — not scripted here.

# Then

Story 12 wires IAP + HTTPRoute on the shared Gateway + DataHub OIDC against Google + the groups-and-policies bootstrap, replacing the datahub/datahub JAAS fallback. Story 13 hardens cost + ops (label audit, budget alerts, PDB verification, runbook drill).


# Story 12 — DataHub IAP + HTTPRoute + OIDC Auth

Status: done Date: 2026-04-22 PRs: #89 (main — modules/datahub-helm adds HTTPRoute + OIDC surface, dev-03-runtime wires datahub_iap, imports the OIDC secret container) + #90 (fix — real chart service name is datahub-datahub-frontend; flip frontend + GMS Services to ClusterIP so IAP is the only way in) + this one (status entry).

DataHub is now reachable at https://datahub.umedev.marpont.es behind IAP (perimeter) and DataHub's own Google OIDC (in-app identity). JAAS stays on so the built-in datahub user can still bootstrap the first Admin over port-forward — the proper groups / policies / admin-promotion bootstrap lives in Story 13. Frontend and GMS Services flipped from LoadBalancer to ClusterIP in the same round, so the two public IPs the chart provisioned by default are gone.

# What changed

  • modules/datahub-helm/:

    • httproute.tf (new) — optional HTTPRoute attached to the shared Gateway, targeting the real chart-generated Service. Mirrors modules/airflow-helm/httproute.tf in shape.

    • oidc.tf (new) — SecretProviderClass for the OIDC client secret + a kubernetes_job_v1 mounter that forces the CSI driver to materialise the backing k8s Secret before datahub-frontend starts. Same pre-install-hook-avoidance pattern as the DB password mounter from Story 11.

    • main.tf — local.frontend_service_name = "${var.release_name}-datahub-frontend" so both the HTTPRoute backend and the IAP target reference the chart's actual Service name (see gotcha below). Subchart overrides add service = { type = "ClusterIP" } on datahub-frontend and datahub-gms; chart default is LoadBalancer, which gives each Service a public IP that sits outside IAP. extraEnvs, extraVolumes, and extraVolumeMounts on datahub-frontend are populated unconditionally (see gotcha on the tuple-length flag drop).

    • variables.tf — httproute_enabled + gateway_* + hostname (mirrors the Airflow module). OIDC vars (oidc_client_id, oidc_client_secret_secret_id, oidc_base_url; oidc_discovery_uri defaults to Google, oidc_user_name_claim=email, oidc_scopes=openid profile email, oidc_extract_groups_enabled=false for Phase 1). Client ID is not sensitive — shipped as a plain env var; client secret flows through CSI.

    • outputs.tf — frontend_service_name now returns the derived <release>-datahub-frontend instead of the lie from Story 11.

  • environments/dev-03-runtime/:

    • datahub.tf — module "datahub" gets httproute_enabled = true + the gateway refs from dev-02-k8s-base remote state + the OIDC inputs. A google_secret_manager_secret.datahub_oidc_client_secret resource + import block adopts the human-created Secret Manager container on first apply (labels + replication reconcile, version stays out-of-band forever). oidc_client_secret_secret_id reads from this resource so the module call tracks the tf-managed name.
    • iap.tf — module "datahub_iap" mirrors airflow_iap verbatim (same modules/iap-oauth/, same brand, same allow-list). Target service reads from module.datahub.frontend_service_name, not a hardcoded string.

    • variables.tf + terraform.tfvars — datahub_subdomain, datahub_oidc_client_id, datahub_oidc_client_secret_secret_id.

    • outputs.tf — datahub_url, datahub_namespace, datahub_iap_client_id (parity with the airflow outputs).
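
A sketch of the PR #90 shape: derive the chart-prefixed Service name once and point the HTTPRoute backend at it (the 9002 port matches the backend-service name in the verification below; gateway variable names and the HTTPRoute metadata are illustrative):

```hcl
# Derive the chart-prefixed Service name once; both the HTTPRoute backend and the IAP policy use it.
locals {
  frontend_service_name = "${var.release_name}-datahub-frontend"
}

resource "kubernetes_manifest" "httproute" {
  count = var.httproute_enabled ? 1 : 0

  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "HTTPRoute"
    metadata   = { name = "datahub", namespace = "datahub" }
    spec = {
      parentRefs = [{
        name      = var.gateway_name
        namespace = var.gateway_namespace
      }]
      hostnames = [var.hostname]
      rules = [{
        backendRefs = [{
          name = local.frontend_service_name # datahub-datahub-frontend, not datahub-frontend
          port = 9002
        }]
      }]
    }
  }

  depends_on = [helm_release.datahub]
}
```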

# Key decisions

  • IAP at the perimeter, DataHub OIDC inside. IAP alone collapses to all-admin-or-all-reader; DataHub's role + policy + ownership layer does per-user and per-dataset work. Two OAuth clients on the same brand — IAP client managed by modules/iap-oauth/, DataHub OIDC client created manually in the Console (Story 12 spec).

  • Reused modules/iap-oauth/ verbatim. Same brand, same IAM grants, same GCPBackendPolicy shape — no new bootstrap work. iap_allowed_users stays shared with Airflow until there's a reason to diverge.

  • Terraform adopts the Secret Manager container, not the version. import block brings the human-created container under tf management so labels + replication drift is caught. The value is never read — feedback_never_fetch_secrets.md stands.

  • Client ID as a plain env var, not through CSI. OAuth 2.0 client IDs are public (visible in every redirect URL); no reason to make them mount-time dependencies.

  • OIDC always on in the module; no oidc_enabled feature flag. First shipped with a var.oidc_enabled ? [...] : [] conditional; terraform rejected it because the two branches are tuples of different lengths and can't be unified. Dropped the flag since every caller enables OIDC anyway — re-introduce as list-typed locals if a no-auth dry-run path is ever needed again.

  • Both frontend and GMS flipped to ClusterIP. Discovered mid-apply that the chart provisions type: LoadBalancer for both services, which allocates public IPs that completely bypass IAP. Closed the hole in PR #90 — only ingress path is now Gateway API → HTTPRoute → IAP → ClusterIP Service.

  • JAAS left enabled for Story 12. The built-in datahub local user still works via port-forward — it's the only way to bootstrap the first Admin in a fresh install. Story 13 replaces this with a policies-as-code Admin grant and can then turn JAAS off.

# Invariant #11 — bootstrap CI IAM

No layers/00-bootstrap/ changes required. DataHub IAP reuses the same resource types as the Airflow precedent:

  • google_iap_client + google_project_iam_member (iap.httpsResourceAccessor) — covered by tf_apply_iap_admin (roles/iap.admin) on apply-SA and tfIapReader on plan-SA (now reads two IAP clients instead of one; same permissions cover both).

  • kubernetes_secret_v1 (IAP OAuth secret), kubernetes_manifest (GCPBackendPolicy, HTTPRoute, SecretProviderClass), kubernetes_job_v1 (OIDC mounter) — covered by roles/container.admin on apply-SA and tfK8sSecretsReader on plan-SA.

  • google_secret_manager_secret (OIDC client secret container, via import block) — roles/editor on apply-SA covers secret create/update, roles/secretmanager.secretAccessor on plan-SA covers refresh.

# Gotchas

  • DataHub chart prefixes Service names with the release name. The real Service is datahub-datahub-frontend, not datahub-frontend. First apply landed HTTPRoute + GCPBackendPolicy pointing at the short name; kubectl describe gcpbackendpolicy showed TargetNotFound, the L7 LB returned 404, and the frontend never saw a request. Airflow happens to name its Service airflow-api-server (no release prefix) so the same pattern worked there blind. Fix in PR #90: derive the name from var.release_name inside the module instead of hardcoding.

  • type: LoadBalancer is the chart's default for frontend and GMS. Those two public IPs were provisioned from Story 11 onwards and sit outside the IAP perimeter entirely. Fixed in PR #90 by injecting service = { type = "ClusterIP" } on both subcharts. gcloud compute forwarding-rules list is worth running after any chart-based service landing in this repo.

  • Terraform tuples can't be unified across different lengths. A var.oidc_enabled ? [... 8 envs ...] : [] ternary hit Inconsistent conditional result types: tuple length 8 vs 0 in validate. tolist() needs homogeneous element types; our envs mix value and valueFrom fields. Cleanest fix was to drop the feature flag — every caller turns OIDC on anyway.

  • fault filter abort / 500 for ~2 minutes after the IAP policy replacement. Renaming the IAP OAuth client (display name embeds the service name) forces a google_iap_client replacement, which also re-creates the k8s Secret and GCPBackendPolicy. While the L7 LB's Envoy config reshuffles, unauthenticated requests get 500 with body fault filter abort instead of the expected 302 to accounts.google.com. Cleared on its own around T+2 min; no config knob to twiddle — just wait.

  • depends_on = [helm_release.datahub] on the HTTPRoute means ServiceName drift hides behind the longest Helm step. When PR #90's apply ran, the GCPBackendPolicy and IAP client recreated immediately (new name) while the HTTPRoute waited for helm to finish upgrading — so for ~7 minutes the policy reported GatewayNotFound against a still-old HTTPRoute backendRef. Not a bug, just a noisy log window. Checking kubectl -n datahub get httproute datahub -o jsonpath='{.spec.rules[0].backendRefs[0].name}' pins down whether the flip has happened yet.

  • Manual OIDC client creation is unavoidable today. google_iap_client only creates IAP-flavoured OAuth clients (no control over redirect URI). DataHub's own OIDC needs a "Web application" client with a redirect URI under the DataHub hostname, which is a Console-only step on a brand outside a Workspace org. Steps codified in the header of environments/dev-03-runtime/datahub.tf. Same shape as the IAP brand prerequisite in iap.tf.

# Verification (post-apply)

  • gh run view <apply> → conclusion=success for both PRs.
  • helm list -n datahub → datahub revision 3 deployed after PR #90. Revision 2 rolled frontend with OIDC envs; revision 3 added the service-type flip.
  • kubectl -n datahub get httproute datahub → Accepted=True, backend datahub-datahub-frontend.
  • kubectl -n datahub describe gcpbackendpolicy datahub-datahub-frontend-iap → Attached=True.
  • gcloud compute backend-services list --format='table(name,iap.enabled)' → iap.enabled=True on gkegw1-89a1-datahub-datahub-datahub-frontend-9002-*.
  • kubectl -n datahub get svc → both datahub-datahub-frontend and datahub-datahub-gms type=ClusterIP, no EXTERNAL-IP.
  • kubectl -n datahub get pod <frontend> -o jsonpath='{.spec.containers[0].env[?(@.valueFrom)]}' → AUTH_OIDC_CLIENT_SECRET wired via secretKeyRef {name: datahub-oidc-secret, key: client_secret}.
  • curl -sI http://datahub.umedev.marpont.es/ → 301 to https.
  • curl -sI https://datahub.umedev.marpont.es/ → 302 to accounts.google.com with the DataHub IAP client_id in the consent URL.
  • Browser sign-in (operator sign-off) and non-allowlisted 403 — not scripted here.

# Then

Story 13 hardens cost + ops (label audit, budget alerts, PDB verification, maintenance window drill) and lands the groups / domains / policies-as-code bootstrap that replaces the manual Admin-promotion step — at which point the local datahub JAAS user can be retired.