# Story status
Architecture pivot (2026-04-15): Airflow replaces Cloud Composer — deployed on GKE Standard via the official Helm chart with CeleryExecutor. Stories restructured into Phase 1 (Airflow on GKE, Stories 0-5) and Phase 2 (DataHub, Stories 6-13). Story 4 decomposed into 4a-4d (2026-04-16): scaffolding, Helm release, ingress + auth, custom image. GKE + Cloud SQL are Phase 1 infrastructure (reused in Phase 2). See Deployment Stories for the current sequence.
## Story 0 — Repository scaffold

Status: done (minimal)
Date: 2026-04-15
Implemented as a minimal smoke-test scaffold rather than the full pre-baked structure from the spec. Directories and files for layers, environments, and modules will be created as each story needs them.
### What was created

- `.gitignore` — standard Terraform ignores
- `.pre-commit-config.yaml` — terraform fmt, validate, tflint via `antonbabenko/pre-commit-terraform`
- `CLAUDE.md` — agent instructions, invariants, links to docs
- `README.md` — repo overview and getting-started commands
- `.github/workflows/terraform-plan.yml` — runs on PR: fmt check, detect changed stacks, `init -backend=false`, validate
- `.github/workflows/terraform-apply.yml` — manual dispatch only until Story 1 wires up WIF
- `.github/workflows/terraform-drift.yml` — manual dispatch only until Story 1 wires up WIF
- `scripts/detect-changed-stacks.sh` — working script: git diff to stack paths, module-to-stack dependency resolution
- `layers/00-bootstrap/` — single valid stack (versions.tf, variables.tf, outputs.tf, main.tf, backend.hcl)
### What was deferred (created per-story instead)

- `layers/10-platform-shared/` — deferred to Phase 2 (Story 6)
- `environments/dev-*`, `environments/prod-*` — Stories 3+
- `modules/*` — created per story as environment-scoped resources are added (module strategy revised in Story 3d)
### Verification

- `terraform fmt -check -recursive` passes
- `terraform init -backend=false && terraform validate` passes on `layers/00-bootstrap`
- All 3 workflow YAML files are syntactically valid
- `scripts/detect-changed-stacks.sh` runs without error
- PR merged to `main`
## Story 1 — Bootstrap

Status: done
Date: 2026-04-15
### What was created

- `layers/00-bootstrap/main.tf` — full implementation: API enablement (7+4 APIs via `for_each`), GCS state bucket (`ume-tf-state-poc-ume-data`), Artifact Registry (`ume-composer-images`), WIF pool + provider (`ume-datainfra-github` / `ume-datainfra-github-provider`), CI service accounts (`ume-datainfra-tf-plan`, `ume-datainfra-tf-apply`), IAM bindings (project-level, bucket-level, WIF-to-SA)
- `layers/00-bootstrap/locals.tf` — common labels (env=shared, layer=bootstrap, owner=platform-team, cost_center=data-platform)
- `layers/00-bootstrap/terraform.tfvars` — project_id and region (us-east1)
- `layers/00-bootstrap/variables.tf` — added github_org, github_repo, environment variables
- `layers/00-bootstrap/outputs.tf` — wired all 4 required outputs + SA emails
- `.github/workflows/terraform-plan.yml` — WIF auth enabled, plan + PR comment steps active, SA = ume-datainfra-tf-plan
- `.github/workflows/terraform-apply.yml` — WIF auth enabled, triggered on push to main, SA = ume-datainfra-tf-apply
- `.github/workflows/terraform-drift.yml` — WIF auth enabled, cron schedule active, SA = ume-datainfra-tf-apply
- Docs updated: us-central1 changed to us-east1 across 06-composer.md, 07-gke-platform.md, 10-operations.md, 11-deployment-stories.md
### Key decisions

- Direct resources, no modules: AR and WIF are only used by bootstrap (a one-off layer, never replicated across environments), so module extraction is not needed.
- roles/editor for tf-apply-sa: PoC project; the granular role list is documented in main.tf for prod hardening. Inspired by the frontera-infra pattern where broad roles are used initially, then tightened.
- WIF attribute condition: repo-only on the provider (`assertion.repository == "1edata/ume-data-infra"`). Branch restriction lives on the SA binding: tf-apply-sa only allows `refs/heads/main`, tf-plan-sa allows any branch (see the sketch after this list).
- SA naming: `ume-datainfra-tf-plan` / `ume-datainfra-tf-apply` — repo-specific to avoid collision with other repos' CI SAs.
- State bucket: versioning enabled, no lifecycle rules (files too small to matter on cost), uniform bucket-level access.
- Custom role for plan SA state access: `roles/storage.objectViewer` was insufficient — Terraform needs to create/delete `.tflock` files. Created a custom project role (tfStateLocker) with `get`, `list`, `create`, `delete` permissions. This avoids granting `objectAdmin`, which would let the plan SA overwrite state files.
- Region: us-east1 (changed from us-central1 across all docs).
- Workflow WIF_PROVIDER: set to the `FILL_AFTER_BOOTSTRAP_APPLY` placeholder. After the manual apply, the operator grabs the value from `terraform output wif_provider_name` and updates all 3 workflow files.
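A minimal sketch of the repo-only provider condition plus the branch-restricted apply binding, assuming the standard GitHub OIDC setup; resource names and the exact attribute mapping are illustrative, not copied from `layers/00-bootstrap/main.tf`.

```hcl
resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "ume-datainfra-github-provider"

  # Repo-only condition on the provider; branch scoping happens on the SA binding below.
  attribute_condition = "assertion.repository == \"1edata/ume-data-infra\""

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Apply SA is only impersonable from workflow runs on refs/heads/main.
resource "google_service_account_iam_member" "apply_wif" {
  service_account_id = google_service_account.tf_apply.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principal://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/subject/repo:1edata/ume-data-infra:ref:refs/heads/main"
}
```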
### Bootstrap procedure (for a brand-new deployment)

```bash
cd layers/00-bootstrap
# 1. Comment out backend "gcs" {} in versions.tf
# 2. terraform init && terraform apply
# 3. Restore backend "gcs" {} in versions.tf
# 4. terraform init -backend-config=backend.hcl -migrate-state
# 5. terraform output (grab wif_provider_name, SA emails)
# 6. Update WIF_PROVIDER in all 3 workflow files
# 7. Push PR
```
### What was deferred

- Workflow WIF_PROVIDER values: filled after the manual apply (an operator step, not a code task)
- Granular IAM roles for tf-apply-sa: target list documented; implement before prod
### Verification

- `terraform fmt -check -recursive` passes
- `terraform init -backend=false && terraform validate` passes
- State bucket exists: `gsutil ls gs://ume-tf-state-poc-ume-data/`
- Artifact Registry repo exists: `gcloud artifacts repositories list --project=poc-ume-data`
- WIF pool exists (ACTIVE): `gcloud iam workload-identity-pools list --location=global --project=poc-ume-data`
- Both SAs exist: `ume-datainfra-tf-plan@poc-ume-data.iam` and `ume-datainfra-tf-apply@poc-ume-data.iam`
- CI plan acquires the state lock successfully (custom role `tfStateLocker` applied)
## Story 1-fix01 — Add missing APIs to bootstrap

Status: done
Date: 2026-04-15
Added compute.googleapis.com and servicenetworking.googleapis.com to the required_apis set in layers/00-bootstrap/main.tf. These are required by Stories 3a/3b for VPC provisioning and Private Service Access (Cloud SQL private IP).
### What changed

- `layers/00-bootstrap/main.tf` — added 2 APIs to the `required_apis` local
### Verification

- `terraform plan` shows 2 new `google_project_service` resources
- `terraform apply` succeeds
- `gcloud services list --project=poc-ume-data | grep -E 'compute|servicenetworking'`
## Story 2 — Platform Shared (Airflow-focused) → Doc Restructure

Status: done
Date: 2026-04-15
### What was created

No Terraform resources. This story became a documentation restructure after planning revealed that Airflow SAs are environment-scoped, not shared.

Files modified:

- `docs/infrastructure/11-deployment-stories.md` — rewrote Story 2, updated Stories 3/4/6
- `docs/infrastructure/04-terraform-structure.md` — updated layers table, SA table, inter-stack contracts
- `docs/infrastructure/06-airflow.md` — fixed SA/KSA names in Helm values and WI table
- `docs/infrastructure/07-gke-platform.md` — fixed WI bindings table
- `docs/infrastructure/03-architecture.md` — updated repo layout description
### Key decisions

- SAs moved to `environments/dev-01-base/`: Airflow service accounts are environment-scoped because their Workload Identity bindings reference a specific project's identity pool (`{project}.svc.id.goog`). In the multi-project future, each project gets its own SAs for its own cluster. `layers/` is reserved for resources shared across all environments and projects.
- `layers/10-platform-shared/` deferred to Phase 2: no cross-environment resources exist in Phase 1. Created in Story 6 when DataHub work begins.
- SA naming: `ume-airflow` and `ume-airflow-kpo` (follows the `ume-{purpose}` convention). All doc references updated.
- KSA naming: `airflow` (not `airflow-scheduler`). The Helm chart applies one KSA to all components (scheduler, worker, webserver, triggerer), so a generic name is accurate.
- `storage.objectAdmin` project-wide for PoC: scoping to specific buckets deferred to Story 4 as a hardening task (the log bucket doesn't exist until then).
- Inter-stack contract simplified: the runtime stack reads SA emails from `dev-01-base` remote state (one source instead of two). Originally named `dev-03-runtime`, renamed to `dev-02-runtime` in the Story 4 decomposition.
### What was deferred

- `layers/10-platform-shared/` creation — Story 6
- Bucket-scoped `storage.objectAdmin` — Story 4 hardening note
## Story 3a — Networking

Status: done
Date: 2026-04-15
### What was created

- `environments/dev-01-base/versions.tf` — Terraform >= 1.5, google/google-beta ~> 5.0, GCS backend
- `environments/dev-01-base/variables.tf` — project_id, region, zone (us-east1-b), environment, state_bucket
- `environments/dev-01-base/locals.tf` — common labels (env, layer=base, owner=platform-team, cost_center=data-platform)
- `environments/dev-01-base/terraform.tfvars` — poc-ume-data, us-east1, us-east1-b
- `environments/dev-01-base/backend.hcl` — state prefix `environments/dev-01-base`
- `environments/dev-01-base/data.tf` — `terraform_remote_state` for 00-bootstrap
- `environments/dev-01-base/networking.tf` — VPC (`ume-data-dev-vpc`), subnet (`ume-data-dev-gke-nodes` with secondary ranges for pods/services), static IP (`ume-data-dev-nat-ip`), Cloud Router (`ume-data-dev-router`), Cloud NAT (`ume-data-dev-nat`)
- `environments/dev-01-base/outputs.tf` — vpc_id, vpc_self_link, subnet_self_link, pod/service range names, nat_ip_address
- Updated `docs/infrastructure/04-terraform-structure.md` — new naming convention `ume-data-{env}-{purpose}`, added Cloud Router/NAT to the naming table
- Updated naming references across `06-airflow.md`, `10-operations.md`, `11-deployment-stories.md`
### Key decisions

- `ume-data-{env}` naming prefix: changed from `ume-{env}` to avoid generic collisions in shared GCP projects. All environment-scoped resources use this prefix. Global resources (bootstrap) keep the shorter `ume-` prefix.
- Direct resources (modularized in Story 3d): originally used native Terraform resources directly. Extracted into `modules/vpc/` in Story 3d via `moved` blocks.
- Static NAT IP: reserved a `google_compute_address` for a predictable egress IP, enabling allowlisting by external services.
- ALL_SUBNETWORKS_ALL_IP_RANGES for NAT: no public subnets are planned, and Cloud NAT only affects VMs without external IPs, so this is safe unconditionally.
- Remote state in `data.tf`: separated from networking.tf because it is a stack-level concern shared by Stories 3b-3d.
- Zone variable in scaffolding: `us-east1-b` included in variables.tf now for Story 3d's zonal GKE cluster.
- Subnet CIDRs: nodes `10.0.0.0/20`, pods `10.4.0.0/14`, services `10.8.0.0/20`. Standard GKE VPC-native ranges, no overlaps (see the sketch after this list).
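A minimal sketch of a VPC-native subnet matching the ranges above; resource names and the secondary range names are illustrative, not copied from `environments/dev-01-base/networking.tf` (later `modules/vpc/`).

```hcl
resource "google_compute_subnetwork" "gke_nodes" {
  name                     = "ume-data-dev-gke-nodes"
  region                   = "us-east1"
  network                  = google_compute_network.vpc.id
  ip_cidr_range            = "10.0.0.0/20" # nodes
  private_ip_google_access = true

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```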
### What was deferred

- Nothing. Story 3a is self-contained.
### Verification

- `terraform fmt -check -recursive` passes
- `terraform init -backend-config=backend.hcl` succeeds (GCS backend)
- `terraform validate` passes
- `terraform plan` shows 5 resources to create (VPC, subnet, static IP, router, NAT)
- VPC and subnets exist
- Private Google Access enabled
- Cloud NAT configured
- Static IP reserved
## Story 3b — Cloud SQL

Status: done
Date: 2026-04-16
### What was created

- `environments/dev-01-base/cloud-sql.tf` — PSA peering (`ume-data-dev-psa-range`, `10.64.0.0/20`), PostgreSQL 16 instance (`ume-data-dev-airflow-pg`, `db-g1-small`), `airflow` database, Secret Manager secret shell (`ume-data-dev-cloudsql-admin-password`)
- `environments/dev-01-base/outputs.tf` — added `sql_connection_name`, `sql_private_ip`, `sql_instance_name`
- `docs/infrastructure/04-terraform-structure.md` — updated layout to show `cloud-sql.tf` instead of `persistence.tf`
- `docs/infrastructure/11-deployment-stories.md` — updated Story 3b spec with concrete resource names, design decisions, and a refined verification checklist
### Key decisions

- PostgreSQL 16: latest GA on Cloud SQL, improved query performance over 15, fully supported by Airflow.
- PSA range /20 at `10.64.0.0`: hardcoded for deterministic plans. /20 (not /24) avoids painful range expansion later — expanding PSA requires deleting and recreating the peering connection (downtime). Zero cost difference. (See the sketch after this list.)
- File name `cloud-sql.tf` (not `persistence.tf`): more specific, and consistent with the `networking.tf` / `gke.tf` naming pattern. Structure doc updated.
- `airflow` database created here: Story 4's Helm chart expects `metadataConnection.db: airflow`. Creating it alongside the instance avoids a manual prerequisite.
- No Terraform-managed admin user: the default `postgres` user is created automatically. Break-glass access uses `postgres` + the password from Secret Manager.
- `disk_autoresize_limit = 50`: safety cap for the PoC. Prevents runaway growth from unbounded auto-increase.
- `deletion_protection = false`: PoC instance. Must be set to `true` for prod.
- No labels on the PSA range: `google_compute_global_address` with `purpose = VPC_PEERING` rejects labels (GCP API limitation). Documented in code.
- IAM auth flag only, bindings in 3c: the `cloudsql.iam_authentication = on` flag is set on the instance. The actual `google_sql_user` (IAM type) and `roles/cloudsql.client` binding are deferred to Story 3c, which creates the `ume-airflow` SA.
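A hedged sketch of the PSA wiring behind those decisions; the range name and CIDR match the story, the remaining arguments are assumptions rather than the exact `cloud-sql.tf` contents.

```hcl
resource "google_compute_global_address" "psa_range" {
  name          = "ume-data-dev-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  address       = "10.64.0.0"
  prefix_length = 20
  network       = google_compute_network.vpc.id
  # No labels: the API rejects labels when purpose = VPC_PEERING.
}

resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}
```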
### What was deferred

- IAM database user (`google_sql_user`) and `roles/cloudsql.client` binding — Story 3c (depends on SA creation)
- PITR (point-in-time recovery) — prod hardening
- HA (regional availability) — prod
- Cloud SQL monitoring alerts — Story 4
- `modules/cloud-sql-postgres/` — done (extracted in Story 3d via `moved` blocks)
### Verification

- `terraform fmt -check -recursive` passes
- `terraform validate` passes
- Cloud SQL instance running
- Private IP assigned, no public IP
- PSA range allocated
- `airflow` database exists
- Secret shell exists
## Story 3c — Airflow IAM

Status: done
Date: 2026-04-16
### What was created

- `environments/dev-01-base/iam.tf` — `ume-airflow` SA (4 project-level roles), `ume-airflow-kpo` SA (2 project-level roles), Workload Identity bindings for both (`airflow/airflow` → ume-airflow, `airflow-kpo/airflow-kpo` → ume-airflow-kpo), Cloud SQL IAM database user for `ume-airflow`
- `environments/dev-01-base/outputs.tf` — added `airflow_sa_email`, `airflow_kpo_sa_email`
- `docs/infrastructure/04-terraform-structure.md` — added `iam.tf` to the dev-01-base layout tree
- `docs/infrastructure/11-deployment-stories.md` — updated Story 3c spec with design decisions and a refined verification checklist
- `layers/00-bootstrap/main.tf` — added `roles/servicenetworking.networksAdmin` to tf-apply-sa (Story 3b fixup, needed for PSA peering)
### Key decisions

- `google_sql_user` in `iam.tf`, not `cloud-sql.tf`: it is an IAM concern (granting the SA database auth) and keeps the Story 3c PR self-contained. The cross-file reference to `google_sql_database_instance.airflow.name` is a normal intra-stack reference.
- `google_project_iam_member` (additive): same pattern as bootstrap. Authoritative (`google_project_iam_binding`) would revoke other members from shared roles like `roles/bigquery.dataEditor`.
- `for_each` over role sets: 6 role bindings from 2 `toset()` locals. Keys are full role strings (e.g., `google_project_iam_member.airflow["roles/cloudsql.client"]`), so adding or removing a role is a one-line change (see the sketch after this list).
- `trimsuffix` for the SQL user name: the GCP API expects the SA email without `.gserviceaccount.com`. Using `trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com")` maintains the Terraform dependency graph (if the SA name changes, this updates automatically).
- No labels: none of the resource types (`google_service_account`, `google_project_iam_member`, `google_service_account_iam_member`, `google_sql_user`) support GCP labels. Not a label-invariant violation.
- WI bindings depend on GKE: GCP validates that the Workload Identity pool (`{project}.svc.id.goog`) exists — the pool is created when GKE enables Workload Identity. Added `depends_on = [module.gke]` in Story 3d. GCP does NOT validate KSA existence (Story 4 creates them via Helm).
- Broad permissions flagged for later scoping: `roles/storage.objectAdmin` and `roles/secretmanager.secretAccessor` are project-wide for the PoC. Inline `TODO(narrow-scope)` comments mark these for Story 4 / future hardening.
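A hedged sketch of the `for_each` binding pattern and the `trimsuffix` SQL user; the role list here is illustrative, not the full set from `iam.tf`.

```hcl
locals {
  airflow_roles = toset([
    "roles/cloudsql.client",
    "roles/storage.objectAdmin",
  ])
}

resource "google_project_iam_member" "airflow" {
  for_each = local.airflow_roles

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.airflow.email}"
}

# IAM database user: Cloud SQL expects the SA email without the
# .gserviceaccount.com suffix.
resource "google_sql_user" "airflow_iam" {
  project  = var.project_id
  instance = google_sql_database_instance.airflow.name
  name     = trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```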
### What was deferred

- Scoping `roles/storage.objectAdmin` to specific buckets — Story 4 (log bucket doesn't exist yet)
- Scoping `roles/secretmanager.secretAccessor` to specific secrets — future hardening once the secret set stabilizes
### Verification

- `terraform fmt -check -recursive` passes
- `terraform validate` passes
- SAs created: `gcloud iam service-accounts list --project=poc-ume-data | grep ume-airflow`
- Project IAM role bindings applied (6 bindings)
- WI bindings exist: `gcloud iam service-accounts get-iam-policy ume-airflow@poc-ume-data.iam.gserviceaccount.com`
- WI bindings exist: `gcloud iam service-accounts get-iam-policy ume-airflow-kpo@poc-ume-data.iam.gserviceaccount.com`
- Cloud SQL IAM user exists: `gcloud sql users list --instance=ume-data-dev-airflow-pg --project=poc-ume-data`
## Story 3d — GKE Cluster + Module Extraction

Status: done
Date: 2026-04-16
### What was created

Module extraction (applied first via moved blocks):

- `modules/vpc/` — reusable VPC module (main.tf, variables.tf, outputs.tf, versions.tf). Encapsulates VPC, subnet with GKE secondary ranges, Cloud NAT, Cloud Router. A single `network_cidr_base` (/12) parameter derives all CIDRs via `cidrsubnet()`.
- `modules/cloud-sql-postgres/` — reusable Cloud SQL module (main.tf, variables.tf, outputs.tf, versions.tf). Encapsulates PSA peering, Cloud SQL instance, database, admin password Secret Manager secret. All cost/topology settings exposed as variables.
- `environments/dev-01-base/moved.tf` — 10 `moved` blocks migrating flat resources to module addresses. Removed after a successful apply.
- `environments/dev-01-base/networking.tf` — replaced 5 flat resources with a `module.vpc` call
- `environments/dev-01-base/cloud-sql.tf` — replaced 5 flat resources with a `module.airflow_sql` call
- `environments/dev-01-base/iam.tf` — updated the `google_sql_user` reference to `module.airflow_sql.instance_name`
- `environments/dev-01-base/outputs.tf` — updated value expressions from flat refs to module outputs

GKE (applied second):

- `modules/gke-standard/` — reusable GKE module (main.tf, variables.tf, outputs.tf, versions.tf). Encapsulates cluster + dynamic node pools + naming + labels + security defaults.
- `environments/dev-01-base/gke.tf` — module call for `ume-data-dev-gke` (zonal, us-east1-b), node pools via `var.gke_node_pools`
- `environments/dev-01-base/variables.tf` — added `gke_node_pools` (full type signature), `gke_master_authorized_cidr_blocks`
- `environments/dev-01-base/terraform.tfvars` — node pool definitions (`default-pool`, `kpo-pool`)
- `environments/dev-01-base/outputs.tf` — added GKE outputs
- `environments/dev-01-base/iam.tf` — added `depends_on = [module.gke]` to WI bindings

Bootstrap fixes:

- `layers/00-bootstrap/main.tf` — added custom role `tfIamPolicyAdmin` with 4 permissions (`iam.serviceAccounts.{get,set}IamPolicy`, `resourcemanager.projects.{get,set}IamPolicy`). Narrower than `roles/iam.serviceAccountAdmin` + `roles/resourcemanager.projectIamAdmin`.

Docs:

- `docs/infrastructure/04-terraform-structure.md` — rewrote the module strategy, updated the module catalog (vpc + cloud-sql-postgres marked Created)
- `docs/infrastructure/07-gke-platform.md` — updated Calico to Dataplane V2, Terraform Configuration section
- `docs/infrastructure/agents/infra-terraform.md` — updated invariants
- `CLAUDE.md` — updated invariants: replaced "2+ callers" with a forward-looking module strategy
### Key decisions

- Module extraction before GKE: networking and Cloud SQL resources were already applied as flat resources (Stories 3a-3c). Extracted into modules using Terraform `moved` blocks — declarative state migration via CI, no manual `terraform state mv`. All 10 moves applied with zero resource recreation.
- VPC module uses `cidrsubnet()`: a single `network_cidr_base` (/12) parameter; the node subnet, pod range, and service range are derived automatically. A new environment changes one value to get non-overlapping ranges (see the sketch after this list). Pattern inspired by frontera-infra.
- PSA inside the Cloud SQL module, not the VPC module: PSA's sole purpose is Cloud SQL private networking. Keeping it in the Cloud SQL module means the module handles its own connectivity end-to-end.
- IAM stays flat: it is a policy layer, not infrastructure. Roles change per workload, not per environment. A module would just wrap `for_each` with no encapsulation benefit.
- Node pools as variables: moved from inline in `gke.tf` to `var.gke_node_pools` + `terraform.tfvars`. Prod can override machine types, counts, and spot settings via tfvars alone.
- WI bindings depend on GKE: GCP validates that the Workload Identity pool (`{project}.svc.id.goog`) exists; the pool is created by GKE when Workload Identity is enabled. Added `depends_on = [module.gke]` to both WI binding resources so they're created after the cluster.
- Custom `tfIamPolicyAdmin` role: `roles/editor` omits `{get,set}IamPolicy` on both projects and service accounts. Rather than granting broad predefined roles (`roles/iam.serviceAccountAdmin`, `roles/resourcemanager.projectIamAdmin`), created a custom role with exactly 4 permissions. CI can manage IAM bindings but can't create/delete SAs or escalate its own access.
- Dataplane V2 over Calico: an irreversible choice. Cilium/eBPF over iptables, with built-in network policy enforcement.
- kpo-pool max=3: tightened from 10 for dev. Limits cost from runaway DAGs.
- `deletion_protection = true`: deliberate two-step teardown.
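An illustrative sketch of the single-parameter CIDR derivation and one `moved` block. The `cidrsubnet()` offsets shown reproduce the Story 3a ranges, but the variable name and exact offsets used inside `modules/vpc/` are assumptions.

```hcl
variable "network_cidr_base" {
  description = "A /12 block from which node, pod, and service ranges are derived."
  type        = string
  default     = "10.0.0.0/12"
}

locals {
  node_cidr    = cidrsubnet(var.network_cidr_base, 8, 0)   # 10.0.0.0/20
  pod_cidr     = cidrsubnet(var.network_cidr_base, 2, 1)   # 10.4.0.0/14
  service_cidr = cidrsubnet(var.network_cidr_base, 8, 128) # 10.8.0.0/20
}

# Declarative state migration used in Story 3d: the resource keeps its
# state entry, only its address changes.
moved {
  from = google_compute_network.vpc
  to   = module.vpc.google_compute_network.vpc
}
```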
### What was deferred

- `workload-pool` node pool — Phase 2 (Story 7, DataHub)
- Restricting authorized networks to specific CIDRs — when Cloudflare WARP or a VPN is set up
- Regional cluster — prod
- Removing `moved.tf` — follow-up cleanup commit
### Verification

- `terraform fmt -check -recursive` passes
- `terraform validate` passes
- Module migration: 10 moved operations, 0 creates, 0 destroys for existing resources
- GKE cluster running: `gcloud container clusters list --project=poc-ume-data`
- Both pools listed: `gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b`
- Project IAM bindings applied (6 bindings)
- WI bindings applied (after GKE, via `depends_on`)
- Bootstrap custom role `tfIamPolicyAdmin` applied
## Story 4a — Runtime Stack Scaffolding + GCS Buckets

Status: done
Date: 2026-04-17
### What was created

New module — `modules/gcs-bucket/`:

- `main.tf` — `google_storage_bucket` with dynamic lifecycle rules, uniform bucket-level access, configurable versioning and force_destroy
- `variables.tf` — `name`, `project_id`, `location`, `storage_class`, `versioning`, `force_destroy`, `lifecycle_rules` (list of objects: Delete/SetStorageClass actions with age, created_before, num_newer_versions, with_state conditions), `labels`
- `outputs.tf` — `name`, `url`, `self_link`
- `versions.tf` — Terraform >= 1.5, google ~> 5.0

New stack — `environments/dev-02-runtime/`:

- `versions.tf` — Terraform + google + google-beta + kubernetes + helm providers. K8s/Helm auth via `google_client_config` access token + GKE endpoint/CA from remote state.
- `variables.tf` — active: `project_id`, `environment`, `region`, `zone`, `state_bucket`. Commented out for later stories: `airflow_image_repository`, `airflow_image_tag`, `domain_name`, `airflow_subdomain`.
- `outputs.tf` — `airflow_logs_bucket`, `airflow_dags_bucket`
- `locals.tf` — common labels (layer=runtime)
- `data.tf` — `google_client_config` + remote state for `dev-01-base` and `00-bootstrap`
- `backend.hcl` — GCS backend at `environments/dev-02-runtime`
- `terraform.tfvars` — dev values
- `buckets.tf` — two module calls: `ume-airflow-logs-poc-ume-data` (90-day delete lifecycle) and `ume-airflow-dags-poc-ume-data` (versioning, no lifecycle)

Modified — `modules/gke-standard/`:

- `main.tf` — added `addons_config { gcs_fuse_csi_driver_config }` block
- `variables.tf` — added `gcs_fuse_csi_enabled` variable (default `true`)

Modified — `environments/dev-01-base/`:

- `outputs.tf` — added `gke_cluster_name`, `gke_endpoint`, `gke_ca_cert` (sensitive) outputs, mapping to module.gke outputs
- `moved.tf` — deleted (the moves from Story 3d were already applied; the file was dead weight)

Modified — `layers/00-bootstrap/`:

- `main.tf` — added `roles/container.viewer` IAM binding for the plan SA

Docs:

- `docs/infrastructure/11-deployment-stories.md` — updated Story 4a spec with design decisions
- `docs/infrastructure/04-terraform-structure.md` — updated gcs-bucket module status to Created
### Key decisions

- Full lifecycle rule support: the `lifecycle_rules` variable accepts a list of objects with an action type (Delete/SetStorageClass) and multiple conditions (age, created_before, num_newer_versions, with_state). Handles tiering rules from the start rather than refactoring later.
- `force_destroy` as a variable (default false): the module invariant says expose all configurable settings. Dev can override for easy teardown.
- `roles/container.viewer` on the plan SA: `roles/viewer` does not map to any k8s RBAC role, and the plan SA needs k8s API read access for `terraform plan` on kubernetes/helm resources (drift detection). Added to bootstrap as a new IAM binding.
- Missing GKE outputs fixed: `dev-01-base` was not exporting `gke_cluster_name`, `gke_endpoint`, `gke_ca_cert`, despite the Story 3d status claiming they were added. Fixed as a prerequisite. Output names match the inter-stack contracts table in 04-terraform-structure.md.
- Provider auth pattern: the kubernetes/helm providers use `data.google_client_config.default.access_token` + endpoint/CA from remote state. No `gcloud get-credentials` needed. Providers initialize lazily, so Story 4a (no k8s resources) passes validate without cluster connectivity (see the sketch after this list).
- Two remote state sources in dev-02-runtime: reads from both `dev-01-base` (GKE, SQL, SA outputs) and `00-bootstrap` (AR URL, state bucket). Clear provenance over pass-through outputs.
- Commented-out variables: `airflow_image_*` and `domain_name` / `airflow_subdomain` are defined but commented out. Each is wired when its story needs it. Avoids unused-variable noise in validate.
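A hedged sketch of the provider auth pattern described above; the remote-state output names match the inter-stack contract but are assumptions, not copied from `versions.tf` / `data.tf`.

```hcl
data "google_client_config" "default" {}

data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "environments/dev-01-base"
  }
}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
}

provider "helm" {
  kubernetes {
    host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
  }
}
```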
### What was deferred

- `airflow_namespace` output — wired in Story 4b when the namespace is created by the Helm release
- Bucket-scoped IAM for the `ume-airflow` SA — Story 4b hardening note (currently project-wide `roles/storage.objectAdmin`)
- Lock file (`environments/dev-02-runtime/.terraform.lock.hcl`) — generated locally, committed with the PR
### Verification

- `terraform fmt -check -recursive` passes across all changed stacks
- `terraform init -backend=false && terraform validate` passes on modules/gcs-bucket
- `terraform init -backend=false && terraform validate` passes on environments/dev-01-base
- `terraform init -backend=false && terraform validate` passes on environments/dev-02-runtime
- `terraform init -backend=false && terraform validate` passes on layers/00-bootstrap
- After CI apply: buckets exist
- After CI apply: GCS FUSE CSI enabled on the cluster
- After CI apply: plan SA has `roles/container.viewer`
## Story 4b — Airflow Helm Release (Stock Image, Port-Forward)

Status: done
Date: 2026-04-17
### What was created

New module — `modules/airflow-helm/`:

- `main.tf` — locals (Auth Proxy sidecar with `--private-ip` + `--auto-iam-authn`, native sidecar variant for Jobs, GCS FUSE config with resource overrides, shared service account reference), `kubernetes_namespace_v1`, `kubernetes_service_account_v1` (shared KSA with WI annotation), 6 `kubernetes_secret_v1` resources (metadata connection, result backend, API secret key, JWT secret, admin password, SQL admin password), bootstrap `kubernetes_job_v1` (grants + migrate), `helm_release` with Airflow 3 values (CeleryExecutor, apiServer, dagProcessor, Auth Proxy sidecars, GCS FUSE DAG mount, remote GCS logging), standalone cleanup CronJob
- `variables.tf` — all configurable settings with sensible defaults: image (apache/airflow:3.2.0), chart version (1.20.0), per-component resources, GCS FUSE resource overrides, worker replicas, Airflow config overrides, cleanup schedule/retention, admin user, Cloud SQL Auth Proxy image, SQL admin password secret ID
- `outputs.tf` — `namespace`, `release_name`, `release_status`
- `versions.tf` — Terraform >= 1.5, kubernetes ~> 2.35, helm ~> 2.17, random ~> 3.0

Modified — `modules/cloud-sql-postgres/`:

- `main.tf` — automated postgres admin password: `random_password`, `google_sql_user` for the built-in postgres user, `google_secret_manager_secret_version` to store the password
- `outputs.tf` — added `admin_password` (sensitive)

Modified — `environments/dev-01-base/`:

- `iam.tf` — added `roles/cloudsql.instanceUser` to the Airflow SA roles (required for IAM DB auth, separate from `roles/cloudsql.client`)
- `outputs.tf` — added `sql_admin_password` (sensitive)
- `terraform.tfvars` — `default-pool` max_count raised from 2 to 3

Modified — `environments/dev-02-runtime/`:

- `airflow.tf` — single `module "airflow"` call passing remote state refs including `sql_admin_password`
- `versions.tf` — added `hashicorp/random ~> 3.0` provider
- `variables.tf` — uncommented `airflow_image_repository`, `airflow_image_tag`; added `airflow_chart_version`
- `terraform.tfvars` — Airflow 3 values (apache/airflow:3.2.0, chart 1.20.0)
- `outputs.tf` — added `airflow_namespace` output
### Key decisions

- Airflow 3.2.0 / chart 1.20.0: the story was written for Airflow 2.10.3 / chart 1.15.0. Jumped to Airflow 3 (latest stable at deployment time), which forced the architectural changes below.
- `apiServer` replaces `webserver`: chart 1.20.0 uses semver gates — `apiServer` templates render for Airflow >= 3.0.0, `webserver` templates render for < 3.0.0. The `webserver` block is kept only for `defaultUser` config consumed by `createUserJob`.
- `dagProcessor` is mandatory in Airflow 3: a standalone component that parses DAG files, previously handled by the scheduler in Airflow 2.
- Shared KSA, not per-component: chart 1.20.0 creates per-component KSAs (`airflow-scheduler`, `airflow-api-server`, etc.) by default, none of which carry the Workload Identity annotation. A single `kubernetes_service_account_v1` is created in Terraform with the WI annotation, and all components reference it with `serviceAccount = { create = false, name = "airflow" }`. The base layer's WI binding targets `[airflow/airflow]` (see the sketch after this list).
- Terraform bootstrap Job: `kubernetes_job_v1.db_bootstrap` runs before the Helm release (`depends_on`). Steps: (1) Cloud SQL Auth Proxy native sidecar, (2) `grants` init container connects as the postgres admin and GRANTs privileges to the IAM user, (3) `migrate` init container runs `airflow db migrate`. Needed because the chart's `migrateDatabaseJob` hook runs after the release resources and failed silently when privileges didn't exist. The chart migration job is disabled (`migrateDatabaseJob.enabled = false`).
- Cloud SQL Auth Proxy `--private-ip`: the Cloud SQL instance has only a private IP (PSA networking). Without this flag, the proxy defaults to public IP and fails with "instance does not have IP of type PUBLIC".
- `roles/cloudsql.instanceUser`: required for IAM database authentication. `roles/cloudsql.client` only allows the proxy to connect to the instance; `instanceUser` provides `cloudsql.instances.login` for the actual IAM token-based DB login.
- Automated postgres admin password: the `cloud-sql-postgres` module generates a `random_password`, sets it on the built-in postgres user via `google_sql_user`, and stores it in Secret Manager. The runtime layer passes the Secret Manager secret ID to the airflow module. The bootstrap Job fetches the password at runtime via Workload Identity — no credential stored in Kubernetes.
- GCS FUSE resource overrides: the GKE webhook injects a sidecar requesting 250m CPU / 256Mi memory / 5Gi ephemeral per pod — far more than a read-only DAG mount needs. Pod annotations override it to 10m / 64Mi / 256Mi, freeing ~960m of CPU requests across 4 pods (the difference between fitting on 2 nodes and needing 3).
- Probe timeout tuning: chart probes run `airflow jobs check`, which imports the full Airflow framework every time and takes >20s on e2-standard-2. Timeouts raised to 60s on scheduler, worker, triggerer, and dag-processor. Startup `failureThreshold` set to 20 on scheduler and api-server (200s total).
- Scheduler CPU limit 1000m: at the 500m limit the scheduler was throttled during Python import — zero log output for 4+ minutes. 1000m lets it burst through startup.
- Node pool max raised to 3: 7 Airflow pods (each with a Cloud SQL proxy sidecar, 4 with a GCS FUSE sidecar) are tight on 2x e2-standard-2. A third node gives the autoscaler room during startup when all pods compete for CPU.
- Pre-built connection Secrets with URL encoding: the IAM DB user `ume-airflow@poc-ume-data.iam` contains `@`, which breaks the Helm chart's URI template. Pre-built `kubernetes_secret_v1` with `urlencode()`, referenced via `data.metadataSecretName`.
- `waitForMigrations` disabled on all Deployments: the chart places `extraInitContainers` after the `wait-for-airflow-migrations` init container, so a native sidecar proxy there wouldn't be running when the migration check executes. Disabled since the Terraform bootstrap Job already handles migrations.
- Helm timeout 900s: Airflow 3 components are heavy Python apps. On e2-standard-2, the full stack takes 4-5 minutes to start. The default 600s was too tight when combined with the bootstrap Job.
### What was deferred

- Bucket-scoped IAM for the `ume-airflow` SA (currently project-wide `roles/storage.objectAdmin`)
- Hello-world DAG push and end-to-end verification
- Investigate chart 1.20.0's intended pattern for Cloud SQL IAM auth + private IP (see backlog) — the Terraform bootstrap Job is a workaround
- GCS FUSE mount on the api-server for the "Code" tab in the UI (not added since the dag-processor handles parsing)
### Verification

- `terraform fmt -check -recursive` passes
- `terraform init -backend=false && terraform validate` passes on dev-02-runtime
- `terraform plan` clean (no changes) on both base and runtime stacks
- All Airflow pods running: api-server 2/2, scheduler 4/4, dag-processor 4/4, triggerer 4/4, worker 4/4, redis 1/1, statsd 1/1
- Auth Proxy sidecars running in each pod with successful DB connections
- Bootstrap Job completed: grants applied, migrations ran
- Airflow UI accessible via `kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow`
- DAG sync via GCS FUSE works (hello-world DAG push pending)
- Logs appear in the GCS bucket (pending first DAG run)
- Cleanup CronJob created (currently disabled, `var.cleanup_enabled = false`)
## Story 4c — Ingress + TLS + DNS + IAP (Gateway API, three layers)

Status: done
Date: 2026-04-17 / 2026-04-18
Shipped as: PRs 1, 2, 3a, 3b.1, 3b.2, 3c, 3c-fix, 3c-fix2, 3c-fix3, 3c-fix4, 3c-fix5, 3c-fix6 on top of Story 4b. (The long tail of fixes is faithfully recorded — each pinned down a different layer of how IAP + Gateway API + Airflow 3 interact in practice.)
### What was built

The story was re-planned mid-execution. The spec called for classic GKE Ingress + Flask-AppBuilder OAuth in webserver_config.py, but Airflow 3's pluggable auth manager made that impractical without pulling the custom-image work forward, and "shared IP + wildcard DNS + per-app ingress" forced GKE Gateway API over classic Ingress. A new dev-02-k8s-base platform layer was introduced and dev-02-runtime was renamed to dev-03-runtime.

Final layer split:

- `environments/dev-01-base/` owns the DNS zone, shared static IP, wildcard Certificate Manager cert, and wildcard A record (pure GCP, no k8s providers).
- `environments/dev-02-k8s-base/` (new) owns the shared Gateway namespace, the Gateway, and the HTTP→HTTPS redirect HTTPRoute.
- `environments/dev-03-runtime/` (renamed from dev-02-runtime) owns apps: the Airflow Helm release, per-app HTTPRoute, per-app IAP wiring.
### PRs in order

- PR 1 — Bootstrap perms. Added the `dns.googleapis.com` + `iap.googleapis.com` APIs, granted `roles/iap.admin` to tf-apply-sa.
- PR 2 — DNS zone. Created `google_dns_managed_zone` `ume-data-dev-zone` in dev-02-runtime (temporarily; moved to base in PR 3a). The operator then pasted the 4 Google NS records into GoDaddy under `umedev.marpont.es`.
- PR 3a — Base DNS/cert absorb. Moved the DNS zone from runtime to dev-01-base via `removed { lifecycle { destroy = false } }` + `import` blocks. Added the shared static IP (`ume-data-dev-ingress-ip`), wildcard A record (`*.umedev.marpont.es`), Certificate Manager DNS-01 authorization + CNAME + wildcard managed cert + certificate map + entry. Bumped `required_version` to `>= 1.7` repo-wide and CI `terraform_version` to `~1.7` (needed for `removed` blocks).
- PR 3b.1 — Enable Gateway API. Added `gateway_api_config { channel }` to `modules/gke-standard/` with variable default `CHANNEL_STANDARD`. A non-disruptive cluster update installed the Gateway/HTTPRoute v1 CRDs required by the next PR.
- PR 3b.2 — `dev-02-k8s-base` stack. New stack: kubernetes + helm providers wired via remote_state from dev-01-base; gateway namespace `ume-data-dev-gateway`; `kubernetes_manifest` Gateway (`gatewayClassName = gke-l7-global-external-managed`, `NamedAddress` to base's static IP, HTTPS + HTTP listeners with `allowedRoutes.namespaces.from = All`, `networking.gke.io/certmap` annotation to base's cert map); HTTPRoute on :80 that 301-redirects every request to https.
- PR 3c — Rename runtime + IAP + HTTPRoute. `git mv environments/dev-02-runtime → environments/dev-03-runtime`, state migrated via `gsutil cp` of `default.tfstate` to the new prefix. New `modules/iap-oauth/` (per-service OAuth client, k8s secret with client_id/client_secret, GCPBackendPolicy targeting the Service, `for_each` IAM bindings). Extended `modules/airflow-helm/` with an optional HTTPRoute (`httproute_enabled`, `gateway_name`, `gateway_namespace`, `hostname`). The runtime stack wired them together.
- PR 3c-fix — Manual brand. `google_iap_brand` was dropped from code after the first apply failed with HTTP 400: the IAP brand API rejects programmatic creation for projects outside a Workspace org, and even for in-org projects the IAP OAuth Admin API is being shut down. The operator created the OAuth consent screen manually in the Console (Internal audience, supported via ext_marcello.pontes@ume.com.br). Brand name `projects/1079167949878/brands/1079167949878` passed in via `var.iap_brand_name`.
- PR 3c-fix2 — Per-user allow-list. Two clean retries of `google_project_iam_member.iap_access["domain:ume.com.br"]` rolled back with "Provider produced inconsistent result after apply" (a google provider bug on conditional IAM member creation with `domain:` members). Switched to a per-user allow-list: ext_marcello.pontes@ume.com.br, wagner.jorge@ume.com.br, leonardo.luiz@ume.com.br.
- PR 3c-fix3 — Plan SA IAP read role. CI plan hit a 403 refreshing `google_iap_client`; `roles/viewer` does not cover `clientauthconfig.*`. Added a new custom role `tfIapReader` in bootstrap with `clientauthconfig.brands.{get,list}` + `clientauthconfig.clients.{getWithSecret,listWithSecrets}` and bound it to tf-plan-sa. Also added invariant #11 to `CLAUDE.md`: always verify plan-SA + apply-SA permission coverage before landing new GCP resource types downstream.
- PR 3c-fix4 — GCPBackendPolicy shape + listener binding. The GKE Gateway controller rejected the BackendPolicy with "Oauth2ClientSecret specified without ClientID" and "must have exactly 1 key-value pair in field Data, found 2". Split the OAuth credentials: `spec.default.iap.clientID` now carries the plain client ID, and the referenced `kubernetes_secret_v1` holds a single key with only the client secret. The same PR pinned the Airflow HTTPRoute to `sectionName = "https"` (otherwise it bound to both listeners and beat the redirect HTTPRoute to :80 traffic) and added an explicit `/` path match to the redirect HTTPRoute so it claims everything on :80.
- PR 3c-fix5 — Drop IAP IAM condition. Google sign-in succeeded but IAP still denied users with "You don't have access". Reason: IAP's authorization path for Gateway-API backends reads the IAP-resource-level policy on the backend service, not project-level IAM with IAM conditions. The conditional grant was inert. Dropped the condition; the per-user allow-list remained the tight scoping.
- PR 3c-fix6 — Kill double-login. Enabled `[core] simple_auth_manager_all_admins = true` so SimpleAuthManager treats every request as admin. Pinned `[core] auth_manager = airflow.api_fastapi.auth.managers.simple.simple_auth_manager.SimpleAuthManager` in the same config block because the default image ships `apache-airflow-providers-fab` and FAB otherwise wins `get_auth_manager()` — SimpleAuthManager's middleware then hands a `SimpleAuthManagerUser` to FAB's `serialize_user`, which crashes on `.id`. Last piece: auto-disable the chart's `createUserJob` when all-admins is on, since `airflow users create` calls FAB's `AirflowSecurityManagerV2.find_role`, which doesn't exist under SimpleAuthManager — the Job was crash-looping and blocking the Helm upgrade.
### Key decisions

- Gateway API over classic Ingress. Classic GKE Ingress creates one GCLB per Ingress — it cannot share a static IP across services. Gateway API (`gke-l7-global-external-managed`) supports one Gateway → one IP → many HTTPRoutes, which fits the shared-IP + wildcard-DNS + per-app-ingress model.
- Wildcard Certificate Manager cert with DNS-01. The `ManagedCertificate` CRD is HTTP-01 only and doesn't support wildcards. Certificate Manager's DNS-01 challenge runs against our own Cloud DNS zone — activation is bounded by zone propagation (minutes), not external registrar propagation. Covers every `*.umedev.marpont.es` subdomain for both Airflow now and DataHub in Phase 2.
- New `dev-02-k8s-base` platform layer. The original spec put the Gateway in the runtime stack; that muddled app/platform concerns. Pulled Story 8's layer forward. DataHub and future platform services (Prometheus, CSI) will land in this layer.
- DNS in dev-01-base. Zero Kubernetes dependency; keeps k8s providers out of the base stack.
- Rename runtime to dev-03-runtime. Numbering stays monotonic (01 base, 02 k8s-base, 03 runtime). State migrated by `gsutil cp` of `default.tfstate` once at the new prefix — no local terraform.
- Per-service IAP module, brand in the stack. The IAP brand is project-singleton and in this case a one-time manual Console step; every app consumes it via `var.iap_brand_name`. `modules/iap-oauth/` creates the OAuth client, k8s secret, GCPBackendPolicy, and IAM bindings — reusable for DataHub.
- HTTPRoute inside `modules/airflow-helm/`. Apps own their ingress wiring. The Gateway is shared and passed in by name (see the sketch after this list).
- Gateway in its own namespace with `allowedRoutes.from = All`. Avoids a `ReferenceGrant` for cross-namespace HTTPRoute attachment. Backend references stay intra-namespace.
- Per-user IAP allow-list. Three user members on `roles/iap.httpsResourceAccessor` — tight enough for the PoC without needing IAM conditions (which turned out not to work anyway for Gateway-API backends).
- No IAM condition on the IAP grant. Tried `resource.type == "iap.googleapis.com/WebBackendService"` to scope the role; the binding created cleanly but was invisible to IAP at authorization time. IAP for Gateway API reads the IAP-resource-level policy on the backend, not project-level conditional IAM. Dropped the condition. When a second IAP-protected backend lands, switch to `google_iap_web_backend_service_iam_member` scoped per service.
- IAP brand created manually in the Console. The `google_iap_brand` resource can't create brands outside Workspace orgs and the API is being phased out. Documented as a prerequisite in the stack's `iap.tf` header.
- SimpleAuthManager + all-admins, no second login. Once IAP enforces identity at the LB, double-authing through Airflow's login screen adds no security and confuses users. `simple_auth_manager_all_admins = true` skips it; the module pins `auth_manager = SimpleAuthManager` automatically and disables the chart's `createUserJob` so no FAB-specific code paths run.
- Plan SA needs permissions beyond `roles/viewer`. `tfK8sSecretsReader` (from the Story 4b era) + the new `tfIapReader` (this story) + `roles/container.viewer` + `roles/secretmanager.secretAccessor`. The invariant added to CLAUDE.md says to check this before every new downstream resource type.
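A hedged sketch of the per-app HTTPRoute that `modules/airflow-helm/` renders via `kubernetes_manifest`; route/service names and the backend port are assumptions, only the `sectionName = "https"` pinning mirrors the decision above.

```hcl
resource "kubernetes_manifest" "airflow_httproute" {
  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "HTTPRoute"
    metadata = {
      name      = "airflow"
      namespace = "airflow"
    }
    spec = {
      parentRefs = [{
        name        = var.gateway_name
        namespace   = var.gateway_namespace
        # Bind only to the HTTPS listener; the :80 listener keeps the
        # shared redirect HTTPRoute.
        sectionName = "https"
      }]
      hostnames = [var.hostname]
      rules = [{
        backendRefs = [{
          name = "airflow-api-server"
          port = 8080
        }]
      }]
    }
  }
}
```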
### What was deferred

- Switching `roles/iap.httpsResourceAccessor` bindings from project scope to `google_iap_web_backend_service_iam_member` scoped to each backend service, once more than one IAP-protected backend exists and the Gateway-generated names are stable.
- Replacing `roles/iap.admin` on tf-apply-sa with a tighter custom role.
- Narrowing `roles/storage.objectAdmin` for `ume-airflow` to specific buckets (Story 4b hardening note).
- Enabling the metadata-db cleanup CronJob.
- Cleaning up the orphan `domain:scudra.com` binding on `roles/iap.httpsResourceAccessor`.
- Deleting the old `gs://ume-tf-state-poc-ume-data/environments/dev-02-runtime/` state prefix after a safety period.
- Promoting the Gateway (currently inline in `dev-02-k8s-base`) into a `modules/gke-gateway/` module when prod forces replication.
- Flipping `admin_user_enabled = false` explicitly on the runtime Airflow module call (the module auto-disables it when `all_admins` is on, but belt-and-braces explicitness is nicer).
### Verification

- `terraform fmt -check -recursive` passes across all changed stacks and modules
- `terraform validate` passes on all four stacks
- DNS zone delegated; `dig NS umedev.marpont.es @8.8.8.8` returns the 4 Google name servers
- Certificate Manager wildcard cert reaches ACTIVE
- Gateway `PROGRAMMED=True` in the ume-data-dev-gateway namespace
- `kubernetes_manifest.httproute` in the airflow namespace accepted
- `GCPBackendPolicy` attached to the `airflow-api-server` Service; `kubernetes_secret_v1` with client_id/client_secret present
- IAP brand visible: `gcloud iap oauth-brands list --project=poc-ume-data`
- 3 user IAM bindings on `roles/iap.httpsResourceAccessor` (unconditional, per PR 3c-fix5)
- `curl -sI http://airflow.umedev.marpont.es/` returns 301 to https
- `curl -sI https://airflow.umedev.marpont.es/` returns 302 to `accounts.google.com/o/oauth2/v2/auth?client_id=...`
- Browser sign-in as an allow-listed ume.com.br user reaches the Airflow UI directly — no second login
- Port-forward break-glass works (SimpleAuthManager trusts every request as admin)
## Story 4d + 5 — Custom Airflow Image + First Cosmos DAG
Status: done (validated end-to-end; content later moved to ume-data-dags — see below)
Date: 2026-04-18
Depends on: Story 4c
Bundled: Stories 4d and 5 combined — Story 4d is only meaningfully "done" once Story 5 proves the image works end-to-end. Ships as two PRs because the tfvars airflow_image_tag value needs an image that only exists after PR 1's merge.
### Two-phase deployment

PR 1 (this commit) — image builder, content, IAM, bootstrap, docs:

- `resources/docker/Dockerfile` — extends `apache/airflow:3.2.0`; installs `astronomer-cosmos~=1.14`, `dbt-core~=1.9`, `dbt-bigquery~=1.9` against Airflow's Python 3.12 constraint set. Build-time guardrails (`which dbt`, `import cosmos`, FAB-provider check).
- `resources/docker/requirements.txt`, `.dockerignore`.
- `resources/scripts/build-image.sh` — local build helper with an identical tag convention to CI.
- `resources/dbt/` — `dbt_project.yml`, `profiles.yml` (BQ OAuth via workload identity), two example models with a `ref()` edge, `schema.yml`.
- `resources/dags/cosmos_dbt_dag.py` — Cosmos DbtDag in LOCAL mode. `schedule=None`, `is_paused_upon_creation=True`, `default_args` with owner/retries.
- `.github/workflows/airflow-image.yml` — build + push on `resources/docker/` changes; tags `3.2.0-<merge-sha>`; authenticates via the existing tf-apply-sa WIF binding (scoped to `refs/heads/main`).
- `.github/workflows/dag-sync.yml` — `gcloud storage rsync --delete-unmatched-destination-objects` for `resources/dags/` and `resources/dbt/` on merge to main.
- `.github/workflows/resources-ci.yml` — PR lint: hadolint on the Dockerfile, `python -m py_compile` on DAGs, `dbt parse` on the dbt project. No GCP auth needed.
- `.github/workflows/terraform-apply.yml` — new pre-apply step for `environments/dev-03-runtime`: waits up to 15 min for the expected image tag to appear in AR. Dormant while the runtime still points at `apache/airflow` (PR 1's state).
- `layers/00-bootstrap/main.tf` — `docker_config { immutable_tags = true }` on the AR repo; tags become tamper-proof.
- `environments/dev-01-base/iam.tf` — `roles/bigquery.jobUser` added to both `ume-airflow` and `ume-airflow-kpo`. Without it, dbt-bigquery cannot create BigQuery jobs (`bigquery.dataEditor` doesn't grant `bigquery.jobs.create`).
- `modules/airflow-helm/variables.tf` — `image_repository` description tightened to call out Artifact Registry paths.
- Doc fixes: `06-airflow.md` Cosmos example (`/opt/airflow/dags/dbt` + `/home/airflow/.local/bin/dbt`), `05-ci-cd.md` (stale `dev-02-runtime` reference), `agents/composer-dags.md` (rewrite to FUSE reality), `11-deployment-stories.md` (Story 4d + 5 specs updated to match shipped reality).
- `backlog.md` — follow-ups: dedicated content-push SA for prod, scoped storage.objectAdmin, worker-memory monitoring.

PR 2 (operator action after PR 1 merge) — tfvars bump:

- Grab the tag from the `airflow-image.yml` run summary on main (format `3.2.0-<sha>`).
- Update `environments/dev-03-runtime/terraform.tfvars`:
  `airflow_image_repository = "us-east1-docker.pkg.dev/poc-ume-data/ume-composer-images/airflow"`
  `airflow_image_tag = "3.2.0-<sha-from-ar>"`
- On merge: terraform-apply's wait-for-image gate confirms the tag (instant, since it has been in AR since PR 1), then `terraform apply` rolls the pods.
- Un-pause `ume_dbt_example` in the UI; trigger; verify the Story 5 checklist.
### Key decisions (captured in the master plan)

- Bundled Stories 4d + 5 into one feature (two PRs) because validating 4d requires running the DAG from 5. The image is only "done" when it runs a real workload.
- astronomer-cosmos 1.14+ is required for Airflow 3.2 — 1.11 and earlier predate that support. Build-time `import cosmos` + `pip show apache-airflow-providers-fab` checks catch drift.
- Cosmos LOCAL mode over the read-only FUSE mount is safe — Cosmos copies the project to a per-task tmp dir before invoking dbt. No `DBT_LOG_PATH` / `DBT_TARGET_PATH` overrides needed.
- The wait-for-image gate in `terraform-apply.yml` replaces the original "accept the image-pull race" stance. Fails fast at 15 min if the image workflow didn't produce the expected tag.
- Reusing tf-apply-sa for image push and DAG sync for now. Prod will get a dedicated `ume-datainfra-content-push` SA; `backlog.md` documents the shape.
- Two-model dbt example (`ref()` edge) is the minimum that proves Cosmos's task-graph rendering. A single `SELECT 1` didn't.
- `docker_config.immutable_tags = true` on the AR repo — tags become a one-way door, matching the immutability invariant in the image lifecycle (see the sketch after this list).
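A minimal sketch of the immutable-tags setting on the Artifact Registry repo; the surrounding arguments are assumptions, not the exact bootstrap definition.

```hcl
resource "google_artifact_registry_repository" "airflow_images" {
  project       = var.project_id
  location      = "us-east1"
  repository_id = "ume-composer-images"
  format        = "DOCKER"

  docker_config {
    # Once pushed, a tag can never be re-pointed at a different digest.
    immutable_tags = true
  }
}
```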
### Verification (planned)

After PR 1 merge:

- `airflow-image.yml` green — image present in AR (`gcloud artifacts docker images list`).
- `dag-sync.yml` green — `gsutil ls gs://ume-airflow-dags-poc-ume-data/{dags,dbt}/`.
- `terraform-apply.yml` green for bootstrap + dev-01-base.
- `gcloud artifacts repositories describe ume-composer-images --location=us-east1 --format='value(dockerConfig.immutableTags)'` → True.
- `roles/bigquery.jobUser` visible in project IAM.

After PR 2 merge:

- `terraform-apply.yml` gate passes; Helm upgrade completes.
- Pods running the custom image.
- `kubectl exec deploy/airflow-worker -n airflow -c worker -- python -c 'import cosmos; print(cosmos.__version__)'` ≥ 1.14.
- IAP sign-in at https://airflow.umedev.marpont.es/ still lands on the Airflow UI (no regression of SimpleAuthManager).
- Un-pause + trigger `ume_dbt_example`; both dbt tasks succeed; BQ tables exist; task logs in GCS.
### Gotchas discovered during rollout

- Airflow 3.2's constraints file clashes with dbt-core on `pathspec` and `protobuf`. Fix: install Cosmos in the Airflow Python env (constrained) and dbt in an isolated `/home/airflow/dbt-venv/` (unconstrained). `dbt_executable_path` updated to `/home/airflow/dbt-venv/bin/dbt`.
- GCS FUSE default `ImplicitDirs=false` hid bucket prefixes — the dag-processor saw 0 files. Fix: `mountOptions = "implicit-dirs"` on the volume attributes in `modules/airflow-helm`.
- Airflow's default `dagbag_import_timeout=30s` was shorter than Cosmos's first-parse `dbt ls` (~38 s measured). Fix: raised the module default to 180 s, exposed via `airflow_config.dagbag_import_timeout`.
- The `apache/airflow:3.2.0` base image is Python 3.13, not 3.12.
## Story 4d + 5 migration — content to ume-data-dags

Status: done
Date: 2026-04-18
Once the bundled implementation was validated end-to-end, the three `resources/` subtrees (docker, dags, dbt) plus the three content-side workflows moved out of ume-data-infra into a dedicated ume-data-dags repo. ume-data-infra now only carries the wait-for-image gate and the airflow_image_tag line that bot-PRs bump on every DAGs-repo merge.
### What moved where

- `resources/docker/` → `ume-data-dags/docker/`
- `resources/dags/` → `ume-data-dags/dags/`
- `resources/dbt/` → `ume-data-dags/dbt/`
- `resources/scripts/build-image.sh` → `ume-data-dags/scripts/`
- `.github/workflows/airflow-image.yml` → `ume-data-dags/.github/workflows/image.yml`
- `.github/workflows/dag-sync.yml` → `ume-data-dags/.github/workflows/dag-sync.yml`
- `.github/workflows/resources-ci.yml` → `ume-data-dags/.github/workflows/pr-ci.yml`
- New in ume-data-dags: `bot-pr.yml` — uses a fine-grained PAT (`INFRA_PR_TOKEN`) scoped to ume-data-infra only, to open tfvars-bump PRs on this repo after a successful image build.
### What changed on the infra side

- Bootstrap: new `ume-datainfra-content-push` SA with narrow scopes (AR writer on `ume-composer-images` only, bucket-level `storage.objectAdmin` on `ume-airflow-dags-poc-ume-data` only, WIF binding to `1edata/ume-data-dags`). The WIF provider's `attribute_condition` updated to accept both repos.
- Bootstrap: three narrow custom roles for tf-apply-sa — `tfWifProviderUpdater`, `tfCustomRoleManager`, `tfArRepoIamAdmin`. Needed once to break the chicken-and-egg for the new SA + custom role + AR IAM resources, then self-sustaining via CI.
- `terraform-apply.yml`: wait-for-image gate retained. Still essential — every future bot-PR merge pokes it.
- Docs: `06-airflow.md`, `05-ci-cd.md`, `agents/composer-dags.md`, `11-deployment-stories.md` updated to reference the new repo.
### End-to-end rollout, validated

```text
ume-data-dags commit to main (touching docker/)
→ image.yml pushes 3.2.0-<sha> to AR
→ bot-pr.yml opens a PR on ume-data-infra bumping airflow_image_tag
→ human merges the bot-PR
→ terraform-apply wait-for-image gate confirms the tag in AR
→ Helm rolls scheduler / worker / dag-processor / triggerer / api-server onto the new image
```

First real run: 3.2.0-38e8a3d pushed from ume-data-dags, bot-PR #53 opened on ume-data-infra, merged, pods rolled successfully.
### Plan doc
Not in this repo (kept private per request); the migration followed the
design captured in the earlier migrate-to-ume-data-dags.md working
doc, with the GitHub App replaced by a fine-grained PAT for simpler
ops.
## Story 6 — Workload Pool + DataHub SQL + Password Secret
Status: done
Date: 2026-04-18
PR: #55 (merge commit 2ca13f0)
Plan doc: plans/story-06-workload-pool-datahub-sql.md
Foundation slice of Phase 2 (DataHub): dedicated node pool for stateful workloads, a second logical database on the shared Cloud SQL instance with password-based auth, and the first Cloud SQL observability alert.
### What changed

- `environments/dev-01-base/terraform.tfvars` — added `workload-pool` (e2-standard-4, min 1 / max 4, label `pool=workload`, no spot, no taint) to the `gke_node_pools` map.
- `modules/cloud-sql-db/` (new) — wraps the five-resource logical-DB cluster: `random_password`, `google_sql_database`, `google_sql_user` (BUILT_IN), `google_secret_manager_secret`, and `google_secret_manager_secret_version`. Explicit `hashicorp/random ~> 3.0` in `required_providers` (the older `cloud-sql-postgres` gets away with implicit resolution; the new module does it the right way). Outputs `database_name`, `user_name`, `password_secret_id` — deliberately no `password` output; consumers resolve the secret at runtime.
- `environments/dev-01-base/cloud-sql.tf` — single `module "datahub_db"` call creating DB `datahub` on `module.airflow_sql.instance_name`.
- `environments/dev-01-base/outputs.tf` — added `datahub_db_name`, `datahub_db_user`, `datahub_db_host` (= `module.airflow_sql.private_ip`), `datahub_db_password_secret_id`.
- `environments/dev-02-k8s-base/alerts.tf` (new) — first alerts file for this layer. One `google_monitoring_alert_policy`: Cloud SQL disk utilization > 0.75 for 10 min on `ume-data-dev-airflow-pg`. Notification channels `[]` (wired in Story 13).
### Key decisions

- Wrap DB-bound resources in their own module (`cloud-sql-db`) instead of inlining them in the stack. Challenged during planning — the original master plan §2 said stack-level. Every new app DB is the same five-resource cookie-cutter, and invariant #8 says env-scoped → module from day one; the new module is the correct level of reuse. Leaves `cloud-sql-postgres` (instance-level) clean. See the sketch after this list.
- Password auth, not IAM auth. DataHub's five JVM pods would each need a Cloud SQL Auth Proxy sidecar under IAM — roughly 1.75 vCPU + 1.1 GiB of overhead. Password + private IP removes the sidecar.
- Shared Cloud SQL instance. DataHub's dev metadata is small; it reuses Airflow's db-g1-small. Saves ~$26/mo.
- Workload pool with a label-only selector (no taint). Kafka / OS / DataHub pin via `nodeSelector: { pool: workload }`. Airflow doesn't need a taint keeping it off — Airflow has its own kpo-pool plus default-pool scheduling.
- Alert in `dev-02-k8s-base`, not `dev-01-base`. The metric targets a dev-01 instance, but the plan consolidates all Phase 2 alert policies in one file per master plan §5, so Stories 9/10/13 extend this file. Remote state already links the stacks.
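A hedged sketch of the five-resource shape that `modules/cloud-sql-db` wraps; variable names and the secret naming are assumptions, not the module's actual interface.

```hcl
resource "random_password" "db" {
  length  = 32
  special = false
}

resource "google_sql_database" "db" {
  project  = var.project_id
  instance = var.instance_name
  name     = var.database_name
}

resource "google_sql_user" "db" {
  project  = var.project_id
  instance = var.instance_name
  name     = var.user_name
  type     = "BUILT_IN"
  password = random_password.db.result
}

resource "google_secret_manager_secret" "db_password" {
  project   = var.project_id
  secret_id = "${var.name_prefix}-${var.database_name}-db-password"

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "db_password" {
  secret      = google_secret_manager_secret.db_password.id
  secret_data = random_password.db.result
}
```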
### Invariant #11 — bootstrap CI IAM

Walked every new resource type against layers/00-bootstrap/main.tf. All covered: node pool / SQL DB / SQL user / secret / secret version / alert policy are already exercised by Airflow + existing grants. Secret version payload reads need roles/secretmanager.secretAccessor on both tf-plan-sa and tf-apply-sa — bootstrap lines 179–183 and 200–204 already grant exactly that. No bootstrap delta this story.
### Gotchas

- The Story 6 spec referenced `module.airflow_sql.private_ip_address`; the actual module output is `private_ip`. The implementation uses the correct name.
- `gh pr merge --squash` auto-pulls main locally via rebase, which tripped on pre-existing unstaged changes in the working tree. The merge itself succeeded server-side; local sync was done with `git merge --ff-only origin/main` afterward.
- `gcloud alpha monitoring policies list` requires an install prompt on this machine — the GA `gcloud monitoring policies list` returns the same data without an interactive install.
### Verification (post-apply)

- ✓ `gcloud container node-pools list` → `workload-pool`, e2-standard-4, min=1, max=4.
- ✓ `gcloud sql databases list` → `datahub` (UTF8) alongside `airflow` + `postgres`.
- ✓ `gcloud sql users list` → `datahub` (BUILT_IN) alongside `postgres` (BUILT_IN) + `ume-airflow@...` (CLOUD_IAM_SERVICE_ACCOUNT).
- ✓ `gcloud secrets versions list ume-data-dev-datahub-db-password` → exactly one enabled version (payload not accessed).
- ✓ `gcloud monitoring policies list` → "Cloud SQL disk > 75% — ume-data-dev-airflow-pg", threshold 0.75, enabled.
#
Then
Story 7 installs the Secrets Store CSI Driver + GCP provider so DataHub pods can mount the password secret as an env var at runtime.
#
Story 7 — Secrets Store CSI Driver
Status: done
Date: 2026-04-18
PR: #57 (merge commit 16242e9)
Plan doc: plans/story-07-secrets-store-csi.md
Platform plumbing slice of Phase 2 (DataHub): the base Secrets Store CSI
Driver plus the Google Cloud Secret Manager provider, both as DaemonSets
in kube-system. Sets up the runtime path Stories 11 and 12 use to
mount Secret Manager secrets as env vars on DataHub pods.
#
What changed
- modules/secrets-store-csi/ (new) — wraps two helm_releases (a sketch follows this list):
  - helm_release.driver installs chart secrets-store-csi-driver v1.5.6 from the public kubernetes-sigs Helm repo. Values: syncSecret.enabled = true, enableSecretRotation = false, rotationPollInterval = 2m. Upstream default tolerations (operator: Exists) kept — no node selector, so the DaemonSet runs on every pool.
  - helm_release.gcp_provider installs the vendored chart at chart-gcp-provider/ (pointed at ${path.module}/chart-gcp-provider). Values override: tolerations: [{operator: Exists}] because the upstream default is [] and the DaemonSet must schedule on Airflow's tainted kpo-pool.
- modules/secrets-store-csi/chart-gcp-provider/ (new) — verbatim copy of upstream charts/secrets-store-csi-driver-provider-gcp/ at tag v1.12.0 (appVersion 1.12.0, chart version 0.1.0). 7 files: Chart.yaml, values.yaml, templates/{_helpers.tpl, serviceaccount, clusterrole, clusterrolebinding, daemonset}.yaml.
- modules/secrets-store-csi/README.md (new) — upstream sync procedure, upgrade notes for both charts, Helm v3 CRD-upgrade caveat.
- environments/dev-02-k8s-base/secrets-store-csi.tf (new) — one-line module "secrets_store_csi" call with labels = local.common_labels.
- backlog.md — added chart-drift watcher entry (scheduled workflow to open PRs syncing the vendored chart on new upstream tags).
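A condensed sketch of the two releases; the repository URL and value keys follow the upstream charts' documented defaults and are assumptions, not a paste of the module:

```hcl
resource "helm_release" "driver" {
  name       = "secrets-store-csi-driver"
  namespace  = "kube-system"
  repository = "https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts"
  chart      = "secrets-store-csi-driver"
  version    = "1.5.6"

  values = [yamlencode({
    # upstream default tolerations (operator: Exists) are kept; no nodeSelector,
    # so the DaemonSet lands on every pool
    syncSecret           = { enabled = true }
    enableSecretRotation = false
    rotationPollInterval = "2m"
  })]
}

resource "helm_release" "gcp_provider" {
  name      = "secrets-store-csi-driver-provider-gcp"
  namespace = "kube-system"
  chart     = "${path.module}/chart-gcp-provider" # vendored copy at tag v1.12.0

  values = [yamlencode({
    # upstream default is []; without this the provider skips Airflow's tainted kpo-pool
    tolerations = [{ operator = "Exists" }]
  })]
}
```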
#
Key decisions
- Vendor the GCP provider chart. Discovery during planning: the upstream URL in the original Story 7 spec (https://googlecloudplatform.github.io/secrets-store-csi-driver-provider-gcp) 404s — Google does not publish a Helm repo or an OCI chart for the provider. The chart lives only inside the git tree. Copying it in at a pinned tag is the clean way to stay on the native Helm install pattern; drift is mitigated by the backlog-tracked watcher.
- Module-first from day one. Two releases + a vendored chart directory warrant encapsulation; the caller stays a one-liner. Prod-02-k8s-base will replicate this call unchanged.
- No nodeSelector on the driver. Pinning to workload-pool would break CSI mounts for any future Airflow pod on default-pool or kpo-pool. Default tolerations already tolerate every taint.
- Explicit tolerations on the GCP provider. Upstream default is []; without the operator: Exists override the provider DaemonSet skips tainted pools and renders Secret Manager reads unavailable to pods on those pools.
- syncSecret.enabled = true. DataHub's chart expects env-var references via secretKeyRef on a native k8s Secret; CSI mounts alone would not satisfy the chart. Sync mode projects mounts into real Secrets.
- Rotation off. DataHub password is Terraform-generated and stable. Revisit in Story 13.
#
Invariant #11 — bootstrap CI IAM
Walked through before PR open. No delta required.
- Helm / Kubernetes providers authenticate to the GKE API with the existing data.google_client_config.default.access_token + remote-state endpoint pathway already wired in environments/dev-02-k8s-base/versions.tf.
- Once authenticated, every Helm-installed object (DaemonSet, SA, ClusterRole, ClusterRoleBinding, CRD) is authorized by k8s RBAC, not GCP IAM.
- tf-plan-sa's roles/viewer + roles/container.viewer and tf-apply-sa's roles/editor + roles/container.admin are sufficient — the Airflow Helm release and Gateway manifests already exercise the same pathway in CI.
#
Gotchas
- Helm repo search confusion. Initial attempt to verify upstream versions turned up only the base driver chart in the standard Helm repo. The GCP provider is a different project at a different repo; helm repo add against any GoogleCloudPlatform URL returns 404. This is the signal that forced the vendoring decision.
- GKE cluster zone vs. region. gcloud container clusters get-credentials ume-data-dev-gke --region us-east1 404s — the cluster is zonal, not regional. Use --zone us-east1-b.
- Workload-pool has zero nodes right now. min=1 is the autoscaler floor, not a permanent 1-node baseline. The DaemonSets schedule on whichever pools actually have nodes (2 default-pool nodes at apply time → 2 driver pods + 2 provider pods).
#
Verification (post-apply)
- ✓ kubectl -n kube-system get pods -l app=secrets-store-csi-driver → 2 pods, 3/3 Running per pod (driver + node-driver-registrar + liveness-probe containers).
- ✓ kubectl -n kube-system get pods -l app=csi-secrets-store-provider-gcp → 2 pods, 1/1 Running per pod.
- ✓ kubectl get crd secretproviderclasses.secrets-store.csi.x-k8s.io → present (v1).
- ✓ kubectl -n kube-system get ds secrets-store-csi-driver -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled}' → 2/2.
- ✓ kubectl -n kube-system get ds csi-secrets-store-provider-gcp ... → 2/2.
#
Then
Story 8 installs the Strimzi Kafka operator in strimzi-system,
cluster-scoped watch.
#
Story 8 — Strimzi Kafka Operator
Status: done
Date: 2026-04-18
PR: #59 (merge commit 83deb1b)
Plan doc: plans/story-08-strimzi-operator.md
Platform prerequisite for Phase 2's Kafka cluster. Installs the
Strimzi cluster operator on GKE with cluster-wide watch, pinned to
workload-pool. Establishes the kafka.strimzi.io CRDs Story 9
needs to declare the Kafka CR.
#
What changed
- modules/strimzi-kafka-operator/ (new) — wraps one kubernetes_namespace_v1 + one helm_release (a sketch follows this list):
  - kubernetes_namespace_v1.strimzi — strimzi-system namespace with labels = merge(common, { service = "kafka" }).
  - helm_release.operator — chart strimzi-kafka-operator v0.51.0 from https://strimzi.io/charts/. Values: watchAnyNamespace = true, nodeSelector = { pool = "workload" }, tolerations = []. atomic, cleanup_on_fail, wait = true, timeout = 600s.
  - Variables cover namespace, chart_version, watch scope, node selector, tolerations, timeout — all defaulted to the master plan §4 shape; prod will replicate the caller unchanged.
  - Outputs: namespace, chart_version (audit).
- modules/strimzi-kafka-operator/README.md (new) — chart source, inputs/outputs tables, CRD list, Helm-v3 CRD-upgrade caveat with the manual kubectl apply -f crds/ procedure for schema-changing bumps.
- environments/dev-02-k8s-base/strimzi.tf (new) — one-line module "strimzi_kafka_operator" call with labels = local.common_labels.
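A minimal sketch of the module body, with the variable wiring trimmed; treat the value keys as the 0.51.0 chart's documented names rather than this repo's exact code:

```hcl
variable "labels" { type = map(string) }

resource "kubernetes_namespace_v1" "strimzi" {
  metadata {
    name   = "strimzi-system"
    labels = merge(var.labels, { service = "kafka" })
  }
}

resource "helm_release" "operator" {
  name       = "strimzi-kafka-operator"
  namespace  = kubernetes_namespace_v1.strimzi.metadata[0].name
  repository = "https://strimzi.io/charts/"
  chart      = "strimzi-kafka-operator"
  version    = "0.51.0"

  atomic          = true
  cleanup_on_fail = true
  wait            = true
  timeout         = 600

  values = [yamlencode({
    watchAnyNamespace = true
    nodeSelector      = { pool = "workload" }
    tolerations       = [] # workload-pool carries no taint
  })]
}
```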
#
Key decisions
- Module, not flat Helm release. Invariant #8 + prior correction on flat Airflow releases: env-scoped platform components ship as modules from day one when prod replication is on the horizon. The caller stays one line.
- Chart 0.51.0 confirmed live. curl https://strimzi.io/charts/index.yaml returned HTTP 200 / 70 KB before any code change. Values schema (watchAnyNamespace, nodeSelector, tolerations, resources) confirmed against the pulled 0.51.0 tarball, not guessed.
- No resources override. Upstream default (requests: {200m, 384Mi}, limits: {1000m, 384Mi}) already matches master plan §4 sizing; overriding would be diff noise.
- Cluster-wide watch. watchAnyNamespace = true. Story 9 can place the Kafka CR in its own kafka namespace without bouncing the operator or editing a watchNamespaces list.
- Operator pinned to workload-pool. Keeps default-pool reserved for Airflow / platform addons. workload-pool has no taint → empty tolerations.
#
Invariant #11 — bootstrap CI IAM
Walked before PR open. No delta required. Same pathway as Stories 4b (Airflow Helm), 4c (Gateway), 7 (CSI driver):
- Helm + Kubernetes providers authenticate via data.google_client_config.default.access_token + data.terraform_remote_state.base.outputs.gke_endpoint already wired in environments/dev-02-k8s-base/versions.tf.
- Once authenticated, namespace creation, Helm release lifecycle, and Deployment / ClusterRole / ClusterRoleBinding / CRD installs are all k8s RBAC authorized by the cluster — roles/container.viewer (tf-plan-sa, bootstrap line 162) and roles/container.admin (tf-apply-sa, bootstrap line 209) are sufficient.
#
Gotchas
- The tf plan job is named validate in CI. The workflow lists one check per stack labelled validate (environments/<stack>) that actually runs terraform plan. Easy to misread as "plan didn't run". The log confirms Plan: 2 to add, 0 to change, 0 to destroy.
- Helm release creation is slow on a cold workload-pool. First apply took 2m01s for helm_release.operator because the pool had zero nodes at the time and the autoscaler had to land one. Within the 600s timeout but worth flagging — subsequent Story 9/10 applies will see similar waits if the pool scales to zero between stories.
- Strimzi chart serves tarballs from GitHub releases, not from the repo host. https://strimzi.io/charts/index.yaml is the index; the tarball URL inside it points at github.com/strimzi/strimzi-kafka-operator/releases/download/.... Helm resolves this transparently; the caller only passes repository.
#
Verification (post-apply)
- ✓ kubectl get ns strimzi-system --show-labels → Active, labels env=dev, layer=k8s-base, owner=platform-team, cost_center=data-platform, service=kafka.
- ✓ kubectl -n strimzi-system get pods -o wide → strimzi-cluster-operator-8686cb4f64-w8tnv 1/1 Running on gke-ume-data-dev-gke-workload-pool-f8275362-gdjc.
- ✓ kubectl get node gke-ume-data-dev-gke-workload-pool-... -o jsonpath='{.metadata.labels.pool}' → workload.
- ✓ kubectl -n strimzi-system get deploy strimzi-cluster-operator -o jsonpath='{.spec.template.spec.nodeSelector}' → {"pool":"workload"}.
- ✓ kubectl get crd -o name | grep strimzi.io → 10 CRDs including kafkas, kafkanodepools, kafkatopics, kafkausers, strimzipodsets, kafkaconnects, kafkarebalances, kafkamirrormaker2s, kafkabridges, kafkaconnectors.
- ✓ Operator logs: Starting ClusterOperator for namespace *, followed by Opened watch for Kafka/KafkaConnect/KafkaBridge/KafkaMirrorMaker2/KafkaRebalance/KafkaNodePool operator.
#
Then
Story 9 adds modules/strimzi-kafka/ and declares the Kafka cluster
CR (KRaft, 3 controllers + 2 brokers) in dev-03-runtime, plus the
PV-utilisation alert in dev-02-k8s-base/alerts.tf.
#
Story 9 — Kafka Cluster (KRaft, 3 Controllers + 2 Brokers)
Status: done
Date: 2026-04-18
PRs: #61 (merge commit 8c69310) + follow-up #62 (merge commit 36584ab)
Plan doc: plans/story-09-kafka-cluster.md
Declared the event bus for DataHub: a KRaft Kafka cluster with 3
dedicated controllers + 2 brokers in a new kafka namespace,
managed by the Strimzi operator that landed in Story 8. Also added
the broker-PVC utilisation alert the master plan §5 called for.
#
What changed
- modules/strimzi-kafka/ (new) — environment-scoped module wrapping kubernetes_namespace_v1.kafka + 2 x kubernetes_manifest (KafkaNodePool, roles controller and broker) + 1 x kubernetes_manifest (Kafka CR). A sketch of the CR shape follows this list.
  - Kafka 4.2.0 (Strimzi 0.51.0 default, verified against upstream kafka-versions.yaml), metadata version 4.2.
  - Controllers: 3 replicas, 100m/256Mi requests, 1 GiB standard-rwo PVs. Brokers: 2 replicas, 500m/1.5Gi, 10 GiB premium-rwo.
  - Cluster config: default.replication.factor=2, min.insync.replicas=1, log.retention.hours=72, log.retention.bytes=8589934592 (8 GiB cap), log.segment.bytes=536870912, auto.create.topics.enable=false.
  - Internal plaintext listener on 9092; no TLS / SASL (DataHub is the only consumer, same cluster).
  - entityOperator.topicOperator on (no userOperator) — small overhead, future-optional topic-as-code.
  - Soft anti-affinity on kubernetes.io/hostname for both pools.
  - Variables cover every knob the story spec called for plus controller_storage_class (defaults to standard-rwo) and broker_storage_class (premium-rwo).
- modules/strimzi-kafka/README.md (new) — topology table, inputs, upgrade notes for Kafka version bumps + PVC expansion + broker scale-out, links to the pinned kafka-versions.yaml.
- environments/dev-03-runtime/kafka.tf (new) — one-line module call, cluster_name = ume-data-dev-kafka, namespace kafka.
- environments/dev-02-k8s-base/alerts.tf — appended google_monitoring_alert_policy.kafka_broker_pv. Metric kubernetes.io/pod/volume/utilization filtered by namespace_name=kafka, threshold 0.70 for 10 minutes. Notification channels empty — wired in Story 13.
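A trimmed sketch of the kubernetes_manifest bodies (broker pool shown; the controller pool differs only in roles, replicas, and storage class). Field names follow the Strimzi v1beta2 API as documented upstream; treat the exact shape as indicative, not a paste of the module:

```hcl
variable "namespace"    { type = string }
variable "cluster_name" { type = string }

resource "kubernetes_manifest" "broker_pool" {
  manifest = {
    apiVersion = "kafka.strimzi.io/v1beta2"
    kind       = "KafkaNodePool"
    metadata = {
      name      = "brokers"
      namespace = var.namespace
      # binding label lives on the pool's top-level metadata only (see Gotchas)
      labels = { "strimzi.io/cluster" = var.cluster_name }
    }
    spec = {
      replicas = 2
      roles    = ["broker"]
      storage = {
        type = "jbod"
        volumes = [{
          id          = 0
          type        = "persistent-claim"
          size        = "10Gi"
          class       = "premium-rwo"
          deleteClaim = false
        }]
      }
    }
  }
}
# controllers pool: replicas = 3, roles = ["controller"], 1Gi standard-rwo — omitted for brevity.

resource "kubernetes_manifest" "kafka" {
  manifest = {
    apiVersion = "kafka.strimzi.io/v1beta2"
    kind       = "Kafka"
    metadata = {
      name      = var.cluster_name
      namespace = var.namespace
      annotations = {
        "strimzi.io/node-pools" = "enabled"
        "strimzi.io/kraft"      = "enabled"
      }
    }
    spec = {
      kafka = {
        version         = "4.2.0"
        metadataVersion = "4.2"
        listeners = [{ name = "plain", port = 9092, type = "internal", tls = false }]
        config = {
          "default.replication.factor" = 2
          "min.insync.replicas"        = 1
          "log.retention.hours"        = 72
          "log.retention.bytes"        = 8589934592
          "auto.create.topics.enable"  = false
        }
      }
      entityOperator = { topicOperator = {} }
    }
  }
}
```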
#
Key decisions
- Dedicated controllers + KRaft. Three controllers give an odd-sized quorum that tolerates one outage; combined-role with 2 brokers would have left the cluster unable to elect a leader on a single controller pod failure. Added ~0.8 GiB RAM total for a major availability win (master plan §4).
- Module from day one. Invariant #8 plus precedent from the Airflow-module correction: env-scoped platform components get a module even with a single current caller, because prod-03-runtime will replicate the call unchanged.
- PD-SSD for brokers, pd-balanced for controllers. Kafka is IOPS-sensitive on retention sweeps; KRaft controllers write tiny sequential metadata and do not need SSD. Cuts cost on the 3 controllers while keeping broker IO responsive.
- RF=2 with min.insync.replicas=1. Hedged against the user's prior 1-replica CNPG incident: one broker can be offline during a rolling upgrade without losing write availability. Prod bumps to RF=3 with 3 brokers (backlog).
- Internal plaintext listener. DataHub is the only consumer and runs in the same cluster. TLS/SASL adds cert rotation plumbing for no threat-model gain at this stage — deferred to Story 13.
- deleteClaim = false on both node pools. Explicit guard against a terraform destroy wiping Kafka data. Same principle as prevent_destroy on stateful GCP resources.
- auto.create.topics.enable=false. DataHub's kafka-setup Job creates its topics explicitly; auto-create masks config mistakes.
- Alert threshold at 70%. log.retention.bytes=8 GiB caps broker PVC growth near 80%; firing at 70% gives time to bump PVC size before retention stops reclaiming. (The alert policy's shape is sketched after this list.)
- Ship the initial Kafka CR even with no pods yet. Strimzi reports READY=True on first reconciliation after spec validation; pods land seconds later once the operator generates the StrimziPodSets. Pattern matches Story 8's 2m first-apply note on a cold workload-pool.
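Roughly what the appended alert policy looks like; the metric filter and aggregation fields here are assumptions from the Cloud Monitoring resource schema, not the repo's exact HCL:

```hcl
resource "google_monitoring_alert_policy" "kafka_broker_pv" {
  display_name = "Kafka broker PV > 70% — kafka namespace"
  combiner     = "OR"

  conditions {
    display_name = "pod volume utilization"

    condition_threshold {
      filter          = "metric.type = \"kubernetes.io/pod/volume/utilization\" AND resource.type = \"k8s_pod\" AND resource.labels.namespace_name = \"kafka\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.70
      duration        = "600s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [] # wired in Story 13
}
```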
#
Invariant #11 — bootstrap CI IAM
Walked before PR open. No delta required.
kubernetes_manifest resources targeting Strimzi CRDs use the same Helm/Kubernetes provider pathway as Story 4c (Gateway) and Story 4b (Airflow HTTPRoute): the data.google_client_config token + the remote-state GKE endpoint, with kubernetes_manifest object-level authorization via k8s RBAC.
- roles/container.viewer (plan SA) suffices for GET + CRD-schema discovery on plan.
- roles/container.admin (apply SA) covers CREATE/PATCH on the kafkas.kafka.strimzi.io + kafkanodepools.kafka.strimzi.io custom resources, confirmed empirically by the CI apply succeeding on first run.
- google_monitoring_alert_policy is a GCP resource — covered by plan-SA roles/viewer + apply-SA roles/editor, same pathway as Story 6's Cloud SQL disk alert.
#
Gotchas
- Strimzi reserves every strimzi.io/* label on objects the operator creates. First apply looked clean (CR READY=True, 5 PVCs Pending as expected with WaitForFirstConsumer storage classes) but no pods ever scheduled. Operator log showed InvalidResourceException: User provided labels or annotations includes a Strimzi annotation: [strimzi.io/cluster] on every reconciliation. Cause: strimzi.io/cluster was present in the pod template metadata.labels via a shared local.node_pool_labels. Fix (PR #62): split the locals — pool_binding_labels (with strimzi.io/cluster) stays on KafkaNodePool top-level metadata only; pod templates get kafka_labels without any strimzi.io/* key. Cluster reconciled in under 2 minutes after the fix apply.
- spec.kafka.replicas + spec.kafka.storage are no longer required. Initial Kafka CR carried replicas: 1 + storage: { type: ephemeral } stubs on the theory that the CRD schema still required them (older Strimzi docs say so). In 0.51.0 these fields emit DeprecatedFields warnings and are otherwise ignored when KafkaNodePools drive topology. Removed in the follow-up PR.
- Workload-pool min=1 kept the pool warm. The pool was already up from Story 8, so broker pods landed immediately without waiting on the autoscaler. The five Kafka pods + 1 entity-operator all scheduled on the single workload-pool node. Budget still matches the master plan §1 resource table.
- The validate job is the plan job (same as Story 8). Plan output in the validate (environments/...) logs confirmed Plan: 4 to add, 0 to change, 0 to destroy for dev-03-runtime and Plan: 1 to add, 0 to change, 0 to destroy for dev-02-k8s-base on PR #61. Follow-up #62 showed Plan: 0 to add, 3 to change, 0 to destroy on dev-03-runtime — the expected in-place update to the three CRs.
#
Verification (post-apply)
- ✓ kubectl get ns kafka --show-labels → Active with labels env=dev, layer=runtime, service=kafka, owner=platform-team, cost_center=data-platform.
- ✓ kubectl -n kafka get kafka → ume-data-dev-kafka READY=True (no warnings).
- ✓ kubectl -n kafka get kafkanodepool → brokers DESIRED=2 ROLES=[broker] NODEIDS=[0,1], controllers DESIRED=3 ROLES=[controller] NODEIDS=[2,3,4].
- ✓ kubectl -n kafka get strimzipodset → ume-data-dev-kafka-brokers 2/2, ume-data-dev-kafka-controllers 3/3.
- ✓ kubectl -n kafka get pods → 3 controllers + 2 brokers + 1 entity-operator, all 1/1 Running, 0 restarts.
- ✓ kubectl -n kafka get pvc → 5 PVCs Bound: data-ume-data-dev-kafka-brokers-0,1 (10Gi premium-rwo) and data-ume-data-dev-kafka-controllers-2,3,4 (1Gi standard-rwo).
- ✓ kubectl -n kafka get pod ume-data-dev-kafka-brokers-0 -o jsonpath='{.spec.nodeName}' → gke-ume-data-dev-gke-workload-pool-f8275362-gdjc — workload-pool placement confirmed.
- ✓ kubectl -n kafka get svc → ume-data-dev-kafka-kafka-bootstrap ClusterIP on 9091/9092, plus the headless ume-data-dev-kafka-kafka-brokers service.
- ✓ Alert policy landed via CI apply on environments/dev-02-k8s-base (Plan: 1 to add, 0 to change, 0 to destroy — the google_monitoring_alert_policy.kafka_broker_pv resource). No notification channels — wired in Story 13.
#
Then
Story 10 provisions the OpenSearch operator in dev-02-k8s-base,
the single-node OpenSearch cluster + snapshot CronJob in
dev-03-runtime, and the snapshot bucket + ume-opensearch-snapshot
GSA in dev-01-base.
#
Story 10 — OpenSearch operator + cluster (snapshot scaffolding)
Status: done
Date: 2026-04-18
PRs: #64 (operator + scaffolding) + #65 (webhook) + #66 (cluster CR) + #67 (API group) + #68 (bootstrap env) + #69 + #70 (force_conflicts) + #71 (gotchas doc) + #72 + #73 (self-bootstrap) + this one (status entry)
Plan doc: plans/story-10-opensearch.md
Landed the metadata search + graph-index backend for DataHub: a
single-node OpenSearch 2.19.5 cluster in a new opensearch
namespace, managed by the opensearch-k8s-operator 2.8.4 Helm
chart. Also shipped the snapshot bucket, ume-opensearch-snapshot
GSA, bucket-scoped roles/storage.objectAdmin, and a Workload
Identity binding to opensearch/opensearch-snapshot KSA — all as
scaffolding for a future snapshot CronJob. The CronJob itself is
out of scope this story (see below).
#
What changed
- modules/opensearch-operator/ (new) — environment-scoped module wrapping kubernetes_namespace_v1.operator + helm_release.opensearch_operator.
  - Chart opensearch-operator 2.8.4 from https://opensearch-project.github.io/opensearch-k8s-operator/.
  - webhook.enabled = false to skip the cert-manager-backed ValidatingWebhookConfiguration (we don't run cert-manager).
  - Operator pinned to the workload pool via manager.nodeSelector.
- modules/opensearch-cluster/ (new) — wraps the opensearch namespace + the snapshot KSA (WI annotation to the GSA) + the OpenSearchCluster CR + the OpenSearchISMPolicy CR.
  - OpenSearch 2.19.5, 1 data node (cluster_manager + data + ingest), JVM heap 512m, 5Gi premium-rwo PVC, security plugin disabled (plugins.security.disabled = "true" in additionalConfig + env on both bootstrap and data pods).
  - Dashboards off (DataHub has its own UI).
  - field_manager.force_conflicts = true on both CRs because the operator owns spec.nodePools, spec.bootstrap.diskSize, and spec.states after create.
  - cluster.initial_master_nodes overridden via nodePools[0].env to ${cluster}-nodes-0 so the data node self-bootstraps as initial master (see Gotchas).
  - ISM policy ume-retention deletes indices older than 30 days.
- environments/dev-01-base/buckets.tf (new) + environments/dev-01-base/iam.tf (append) + environments/dev-01-base/outputs.tf (append) — snapshot bucket ume-opensearch-snapshots-poc-ume-data (35d lifecycle delete, versioning off), ume-opensearch-snapshot GSA, bucket-scoped roles/storage.objectAdmin, WI binding to opensearch/opensearch-snapshot KSA, two new outputs. A sketch of this scaffolding follows the list.
- environments/dev-02-k8s-base/opensearch.tf (new) — one-line module call for the operator.
- environments/dev-02-k8s-base/alerts.tf (append) — google_monitoring_alert_policy.opensearch_pv: metric kubernetes.io/pod/volume/utilization filtered by namespace_name=opensearch, threshold 0.70 for 10 minutes.
- environments/dev-03-runtime/opensearch.tf (new) — one-line module call consuming opensearch_snapshot_sa_email from remote state.
- modules/opensearch-cluster/README.md documents the three non-obvious requirements learned during bring-up: force_conflicts, mirrored bootstrap env, opensearch.org API group.
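A condensed sketch of the dev-01-base snapshot scaffolding, with resource addresses chosen for illustration rather than copied from the repo:

```hcl
variable "project_id" { type = string }

resource "google_service_account" "opensearch_snapshot" {
  account_id   = "ume-opensearch-snapshot"
  display_name = "OpenSearch snapshot writer"
}

resource "google_storage_bucket" "opensearch_snapshots" {
  name     = "ume-opensearch-snapshots-poc-ume-data"
  location = "us-east1"

  lifecycle_rule {
    condition { age = 35 }
    action    { type = "Delete" }
  }
}

resource "google_storage_bucket_iam_member" "snapshot_writer" {
  bucket = google_storage_bucket.opensearch_snapshots.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.opensearch_snapshot.email}"
}

# Workload Identity: let the opensearch/opensearch-snapshot KSA impersonate the GSA.
resource "google_service_account_iam_member" "wi_binding" {
  service_account_id = google_service_account.opensearch_snapshot.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[opensearch/opensearch-snapshot]"
}
```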
#
Key decisions
- Snapshots deferred. repository-gcs does not support Workload Identity upstream — it requires an SA JSON key in the OpenSearch keystore, which conflicts with invariant #5 ("no service-account key files"). Ship the bucket + GSA + WI binding + KSA as scaffolding so a future story can wire a credential path (SM-CSI-mounted JSON, external-dump CronJob, …) without a new IAM change. The ISM policy handles local retention; indices are rebuildable from Kafka MAE replay / BigQuery lineage for the PoC-scale dataset.
- Two-PR split for operator + cluster (mirrors Story 8 → 9). kubernetes_manifest validates against cluster-live CRD schemas at plan time; the operator's CRDs land only after its Helm release applies on main, so a single PR would fail plan-on-PR for dev-03-runtime.
- opensearch.org API group, not opensearch.opster.io. The operator ships both; the opster.io group is deprecated and runs a migration-only controller that sits idle until an opensearch.org CR exists. Initial CRs were on opster.io; PR #67 switched them.
- Security plugin off, no Dashboards. Dev-only; hardening in Story 13 (basic auth or mTLS). Dropping Dashboards saves a pod + GSA + route since DataHub has its own UI.
- Module-from-day-one, two modules. opensearch-operator + opensearch-cluster parallel the strimzi split. Prod-02-k8s-base and prod-03-runtime will call them unchanged per invariant #8.
- premium-rwo for data. 5Gi pd-ssd; OpenSearch indexing is IOPS-sensitive.
#
Invariant #11 — bootstrap CI IAM
Walked before PR open. No delta required.
- google_storage_bucket + google_storage_bucket_iam_member: plan covered by tfResourceIamReader (storage.buckets.getIamPolicy); apply by roles/editor. Precedent: Story 4's Airflow buckets.
- google_service_account + google_service_account_iam_member (WI binding): plan-SA refresh works today on the airflow + airflow-kpo WI bindings so the pathway is proven; apply covered by the tfIamPolicyAdmin custom role on iam.serviceAccounts.setIamPolicy.
- helm_release + kubernetes_manifest + kubernetes_namespace_v1 + kubernetes_service_account_v1: roles/container.viewer + roles/container.admin, same as Stories 4/7/8/9.
- google_monitoring_alert_policy: covered by roles/viewer + roles/editor, same as the Kafka alert.
#
Gotchas
- Chart 2.8.x requires cert-manager by default. First apply failed with no matches for kind "Certificate" in version "cert-manager.io/v1" from the operator's ValidatingWebhookConfiguration. Fix (PR #65): webhook.enabled = false. Trade-off is loss of admission validation on OpenSearchCluster + OpenSearchISMPolicy — safe because Terraform is the only client mutating them.
- opensearch.opster.io is deprecated and runs migration-only. Initial CRs applied under opster.io/v1; operator logs read "DEPRECATION WARNING: opensearch.opster.io API group is deprecated" and "Old cluster is not ready, skipping migration", and no primary controller ran. Fix (PR #67): switch apiVersion to opensearch.org/v1. The two groups share CRD schemas and kind names exactly.
- kubernetes_manifest schema-merge fails across apiVersion changes. Plan errored with "Failed to update proposed state from prior state" on the apiVersion change. Worked around by renaming the Terraform resource addresses (opensearch_cluster → cluster, opensearch_ism_retention → ism_retention) to force a destroy+create rather than an in-place update.
- Kind names are case-sensitive on ISM. OpenSearchISMPolicy (not OpensearchISMPolicy). kubectl get crd confirms via spec.names.kind — consult that before writing the manifest.
- general.additionalConfig applies to ALL pods, but nodePools[].env does NOT cover the bootstrap pod. Without DISABLE_INSTALL_DEMO_CONFIG on the bootstrap pod, docker-entrypoint runs the demo-security setup, trips over the disabled security plugin, and the pod dies before registering as cluster-manager. Fix (PR #68): mirror the env onto spec.bootstrap.env and pin spec.bootstrap.nodeSelector = { pool = "workload" }.
- Operator claims ownership of several subfields via SSA. Apply failed on field-manager conflicts for spec.nodePools, spec.bootstrap.diskSize (cluster CR, PR #69) and spec.states (ISM CR, PR #70). Added field_manager { force_conflicts = true } on both resources so Terraform re-asserts its declared shape without oscillating. (A sketch follows this list.)
- discovery.type: single-node is incompatible with the operator-injected cluster.initial_master_nodes env. OpenSearch explicitly errors out with setting [cluster.initial_master_nodes] is not allowed when [discovery.type] is set to [single-node]. Fix (PR #73): drop the single-node setting and override the env to ${cluster}-nodes-0 — duplicate env vars resolve last-write-wins, so the override beats the operator's default and the data pod self-bootstraps as initial master.
- The operator kills the bootstrap pod prematurely on a single-node cluster. The operator creates bootstrap-0, waits only for the first StatefulSet replica to be Ready (tcp probe), then deletes bootstrap. The data pod then loops forever on cluster_manager_not_discovered_exception because its env still references bootstrap-0. The self-bootstrap env override above sidesteps the race entirely. With 3+ data nodes the operator's bootstrap flow works, but for a 1-node cluster you must self-bootstrap.
- Recovery from a wedged state requires CR deletion. Once the operator marks status.phase=RUNNING, initialized=true, it won't re-run bootstrap. Deleting the StatefulSet + PVC alone doesn't help; the operator recreates them with the same stale env. The only clean recovery is kubectl delete opensearchcluster.opensearch.org/<name>, then let Terraform recreate on the next apply.
- ISM CR managedCluster reference is sticky. After the cluster CR is destroyed and recreated, the ISM CR's status.managedCluster still points at the old UID and the operator errors with "cannot change the cluster a resource refers to". Cleared by kubectl delete opensearchismpolicy.opensearch.org/<name>; next apply recreates the CR bound to the current cluster.
- ConfigMap mount propagation has kubelet lag. Changing general.additionalConfig updates the ume-data-dev-opensearch-config ConfigMap, but the running pod's mounted opensearch.yml can be ~1 minute stale. If the change requires a new pod (env change, not just YAML), kubelet sync lag can mean the restarted pod reads stale YAML for its first boot.
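For reference, the force_conflicts shape on a kubernetes_manifest resource, shown on the ISM CR; the spec body here is a stand-in, not the module's real policy:

```hcl
variable "ism_policy_spec" { type = any }

resource "kubernetes_manifest" "ism_retention" {
  manifest = {
    apiVersion = "opensearch.org/v1"
    kind       = "OpenSearchISMPolicy"
    metadata   = { name = "ume-retention", namespace = "opensearch" }
    spec       = var.ism_policy_spec # stand-in for the inline 30-day delete policy
  }

  # The operator's server-side-apply field manager claims spec.states after create;
  # force_conflicts lets Terraform re-assert its declared shape instead of erroring.
  field_manager {
    force_conflicts = true
  }
}
```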
#
Verification (post-apply)
- ✓ gcloud storage buckets describe gs://ume-opensearch-snapshots-poc-ume-data → location us-east1, 35d delete lifecycle, versioning off, 5 labels, service=opensearch.
- ✓ gcloud iam service-accounts get-iam-policy ume-opensearch-snapshot@poc-ume-data.iam.gserviceaccount.com → roles/iam.workloadIdentityUser member serviceAccount:poc-ume-data.svc.id.goog[opensearch/opensearch-snapshot].
- ✓ gcloud storage buckets get-iam-policy gs://ume-opensearch-snapshots-poc-ume-data → ume-opensearch-snapshot@… bound as roles/storage.objectAdmin.
- ✓ kubectl get ns opensearch-operator opensearch --show-labels → both Active with the 6 mandatory labels.
- ✓ kubectl -n opensearch-operator get pods -o wide → operator pod Running 1/1 on a pool=workload node.
- ✓ kubectl get crd | grep opensearch → 20+ CRDs including opensearchclusters.opensearch.org and opensearchismpolicies.opensearch.org.
- ✓ kubectl -n opensearch get opensearchcluster.opensearch.org → HEALTH=green, NODES=1, VERSION=2.19.5, PHASE=RUNNING.
- ✓ kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- curl -s localhost:9200/_cluster/health → status=green, number_of_nodes=1, discovered_cluster_manager=true, active_primary_shards=3, active_shards_percent_as_number=100.0.
- ✓ kubectl -n opensearch get pods -o wide → 1 data pod Running 1/1 on pool=workload.
- ✓ kubectl -n opensearch get pvc → 1 PVC bound, 5Gi, premium-rwo.
- ✓ kubectl -n opensearch get sa opensearch-snapshot -o yaml → annotation iam.gke.io/gcp-service-account = ume-opensearch-snapshot@….
- ✓ kubectl -n opensearch get opensearchismpolicy.opensearch.org → ume-retention present (recreated clean after UID drift cleanup).
- ✓ gcloud monitoring policies list --project=poc-ume-data → OpenSearch PV > 70% — opensearch namespace, enabled=True.
#
Then
Story 11 lands DataHub via modules/datahub-helm/ in dev-03-runtime,
reading bootstrap_servers from modules/strimzi-kafka and the
OpenSearch service_host from modules/opensearch-cluster, plus
the Cloud SQL password from Story 6's Secret Manager CSI mount.
Story 10's snapshot scaffolding stays inert until a dedicated
follow-up story wires a credential path for repository-gcs or
swaps to an external-dump CronJob.
#
Story 11 — DataHub Dry-Run (no IAP)
Status: done
Date: 2026-04-19
PRs:
#76 (preflight — narrow OpenSearch ISM to DataHub time-series indices) +
#77 (main — modules/datahub-helm/, dev-03-runtime/datahub.tf, OpenSearch 3-node migration) +
#78 (fix — ZooKeeper placeholder for the chart's kafka-setup template) +
#79 (fix — pin elasticsearch-setup image tag v1.4.0.3; chart default v1.5.0.1 unpublished) +
#80 (fix — implementation: "opensearch" + USE_AWS_ELASTICSEARCH=true so DataHub targets ISM not ILM) +
#83 (fix — ume-datahub GSA + WI binding in dev-01-base) +
#84 (fix — annotate datahub/datahub KSA with the GSA) +
#85 (fix — mounter Job must run as datahub KSA, not default) +
#86 (fix — global.sql.datasource.host must be host:port for the chart's tcp wait) +
#87 (fix — point every DataHub subchart at the WI-annotated KSA)
- this one (status entry).
Plan doc: plans/story-11-datahub-dryrun.md
Landed DataHub v1.5.0 on the existing cluster via chart 0.9.10 (datahub from
helm.datahubproject.io). GMS + frontend serving over kubectl port-forward; the stock JAAS datahub/datahub login gates the UI (no IAP,
no OIDC yet — Story 12). System-update + setup jobs completed cleanly;
Postgres, Kafka (KRaft), and OpenSearch 2.19.5 are all wired. Also bundled
the long-overdue OpenSearch 1 → 3 node migration so the self-bootstrap
env hack is gone before steady-state ingestion.
#
What changed
- modules/datahub-helm/ (new) — wraps the upstream datahub chart. (A sketch of the SecretProviderClass + mounter-Job pattern follows this list.)
  - kubernetes_namespace_v1.datahub + kubernetes_service_account_v1.datahub (annotated with iam.gke.io/gcp-service-account = ume-datahub@… so the Secrets Store CSI driver's datahub-db-password fetch resolves under Workload Identity).
  - kubernetes_manifest.datahub_db_secret_provider_class — CSI SecretProviderClass with syncSecret enabled so the driver materialises a k8s Secret named datahub-db-password the first time a pod mounts it.
  - kubernetes_job_v1.db_secret_mounter — tiny one-shot Job that mounts the SPC as the datahub KSA. wait_for_completion = true makes Terraform hold helm_release.datahub until the k8s Secret exists — the chart's datahub-system-update pre-install hook reads the password via secretKeyRef and wedges in CreateContainerConfigError otherwise.
  - helm_release.datahub — pinned to chart 0.9.10 (appVersion v1.5.0, verified against helm.datahubproject.io/index.yaml on 2026-04-19). Key value overrides baked into the module:
    - global.sql.datasource.host = "<ip>:5432" (chart convention — see quickstart values; the upgrade image parses it as a tcp target), hostForMysqlClient host-only, port separate, url a full JDBC URL, driver = org.postgresql.Driver, password.secretRef + secretKey pointing at the CSI-synced Secret.
    - global.kafka.bootstrap.server = <strimzi bootstrap> plus global.kafka.zookeeper.server = <same placeholder> — the chart's kafka-setup-job.yml template unconditionally dereferences zookeeper.server at render time even in KRaft mode.
    - global.elasticsearch: host, port: 9200, useSSL: false, skipcheck: true, implementation: "opensearch" (GMS + consumer side).
    - elasticsearchSetupJob: enabled: true, image.tag: "v1.4.0.3" (chart default v1.5.0.1 was never pushed to acryldata/datahub-elasticsearch-setup), extraEnvs: USE_AWS_ELASTICSEARCH=true so the setup targets ISM not ILM.
    - kafkaSetupJob: enabled: true (chart default tag v1.2.0.1 is fine).
    - Every subchart (datahub-gms, datahub-frontend, datahub-mae-consumer, datahub-mce-consumer): replicaCount: 1, nodeSelector: { pool = "workload" }, and serviceAccount: { create: false, name: datahub } — each subchart's SA default is create: true with no WI annotation, which broke the CSI mount on GMS.
    - datahub-gms additionally mounts the SPC via extraVolumes + extraVolumeMounts so the mount triggers syncSecret on the first GMS start too (redundant with the mounter Job but harmless).
    - datahub-ingestion-cron: enabled: false and acryl-datahub-actions: enabled: false — Story 12/13.
- environments/dev-03-runtime/datahub.tf (new) — single module "datahub" call wiring SQL + Kafka + OpenSearch from remote state.
- modules/opensearch-cluster/ 1 → 3 node migration — data_replicas default 1 → 3, drop the duplicate-env self-bootstrap override on spec.nodePools[0].env. README + 10-operations.md "Current shape (dev)" block updated to match.
- environments/dev-01-base/iam.tf — new ume-datahub GSA, project-scoped roles/secretmanager.secretAccessor, WI binding from datahub/datahub KSA. Output datahub_sa_email exported for dev-03-runtime.
- modules/opensearch-cluster/main.tf preflight (PR #76) — ume-retention ISM indexPatterns narrowed from ["*"] to ["datahub_usage_event*", "*_timeseries_v1*"]. Backlog "URGENT" entry retired in the same PR, matching known-issue bullet removed from 10-operations.md.
- backlog.md — 3-node migration + URGENT ISM entries retired; new entry added for scoping the DataHub GSA's secretmanager.secretAccessor binding to the specific secret (currently project-wide for parity with Airflow).
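A condensed sketch of the SecretProviderClass + mounter-Job pairing. The GCP provider's parameter keys follow its upstream examples and the image/command are placeholders, so treat this as the pattern rather than the module's exact code:

```hcl
variable "project_id"            { type = string }
variable "db_password_secret_id" { type = string }

# SecretProviderClass: fetches the Secret Manager version and, via secretObjects,
# projects it into a k8s Secret named datahub-db-password on first mount.
resource "kubernetes_manifest" "datahub_db_secret_provider_class" {
  manifest = {
    apiVersion = "secrets-store.csi.x-k8s.io/v1"
    kind       = "SecretProviderClass"
    metadata   = { name = "datahub-db", namespace = "datahub" }
    spec = {
      provider = "gcp"
      parameters = {
        secrets = yamlencode([{
          resourceName = "projects/${var.project_id}/secrets/${var.db_password_secret_id}/versions/latest"
          path         = "db-password"
        }])
      }
      secretObjects = [{
        secretName = "datahub-db-password"
        type       = "Opaque"
        data       = [{ objectName = "db-password", key = "password" }]
      }]
    }
  }
}

# One-shot Job: mounts the SPC as the WI-annotated `datahub` KSA so the CSI driver
# syncs the Secret before Helm's pre-install hooks need it.
resource "kubernetes_job_v1" "db_secret_mounter" {
  metadata {
    name      = "datahub-db-password-mounter"
    namespace = "datahub"
  }
  spec {
    template {
      metadata {}
      spec {
        service_account_name = "datahub" # the default KSA would 403 on Secret Manager
        restart_policy       = "Never"
        container {
          name    = "mounter"
          image   = "busybox:1.36"
          command = ["sh", "-c", "test -s /secrets/db-password"]
          volume_mount {
            name       = "db-secret"
            mount_path = "/secrets"
            read_only  = true
          }
        }
        volume {
          name = "db-secret"
          csi {
            driver    = "secrets-store.csi.k8s.io"
            read_only = true
            volume_attributes = {
              secretProviderClass = "datahub-db"
            }
          }
        }
      }
    }
  }
  wait_for_completion = true
}
```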
#
Key decisions
- Preflight PR for the ISM scope fix. Narrowing indexPatterns was a 3-line change that blocked Story 11 ingestion safety. Shipped separately so the DataHub PR review surface stayed focused and the ISM-breaking risk was off the DataHub critical path.
- OpenSearch 3-node migration bundled. The single-node cluster.initial_master_nodes env hack was fragile and diverged from the operator's happy path. Landing the migration inside Story 11 meant DataHub met a prod-shaped cluster on day one; the rolling restart completed cleanly without hitting the cluster_manager_not_discovered wedge.
- Module-first. modules/datahub-helm/ ships every knob as a variable (invariant #9); dev-03-runtime/datahub.tf is a single module "datahub" block. Prod replication is the justification, not future callers (invariant #8).
- Chart-native password.secretRef + secretKey instead of the spec's extraEnvs form — propagates through all subcharts via the chart's datasource stanza and keeps the password out of Helm values rendering.
- Mounter-Job pattern for CSI sync. DataHub's datahub-system-update pre-install hook reads the SQL password via secretKeyRef — the k8s Secret has to exist BEFORE Helm starts the install. A busybox kubernetes_job_v1 mounts the SPC, verifies the file, exits; Terraform waits for completion, then helm_release fires.
- JAAS login for the dry-run. oidcAuthentication.enabled: false (chart default). Reachable only via port-forward; OIDC + IAP land in Story 12.
#
Invariant #11 — bootstrap CI IAM
Walked before each PR opened. One gap found and closed:
- google_service_account + google_service_account_iam_member (WI binding) — already covered (Airflow precedent). Plan-SA via project-level refresh; apply-SA via the tfIamPolicyAdmin custom role.
- google_project_iam_member (roles/secretmanager.secretAccessor) — covered by the existing roles/editor grant on apply-SA (secretmanager admin subset) and roles/viewer on plan-SA. No delta.
- kubernetes_namespace_v1, kubernetes_service_account_v1, kubernetes_manifest (SecretProviderClass), kubernetes_job_v1, helm_release — covered by roles/container.viewer + tfK8sSecretsReader on plan-SA; roles/container.admin on apply-SA. Same coverage that Stories 4 / 7 / 8 / 9 / 10 use.
No layers/00-bootstrap/ changes required.
#
Gotchas (the long list)
Every one of these cost at least one PR to uncover:
- Chart template requires global.kafka.zookeeper.server even when the target Kafka is KRaft. The kafka-setup-job.yml template dereferences it at render time; an omitted field throws nil pointer evaluating interface {}.server. Setting it to the bootstrap address is a safe placeholder — the setup image routes everything through the Kafka Admin API.
- acryldata/datahub-elasticsearch-setup has no v1.5.x semver tag. Chart 0.9.10 defaults elasticsearchSetupJob.image.tag to global.datahub.version (v1.5.0.1), which was never pushed for this image. Latest published semver is v1.4.0.3. The chart's own kafkaSetupJob.image.tag is hard-pinned to v1.2.0.1 for the same reason. Fix: explicit elasticsearchSetupJob.image.tag: "v1.4.0.3" override.
- DataHub speaks Elasticsearch ILM by default, not OpenSearch ISM. The setup job hits GET _ilm/policy/datahub_usage_event_policy which 400s on OpenSearch. Two switches needed: global.elasticsearch.implementation: "opensearch" (GMS + consumers) and USE_AWS_ELASTICSEARCH=true on the setup job (extraEnvs). The AWS in the name is misleading — it's the chart-wide OpenSearch toggle.
- The Secrets Store CSI driver needs Workload Identity on the mounting pod's KSA, not on some shared driver identity. Without a GSA binding on the datahub/datahub KSA the driver falls back to the node GSA and hits secretmanager.versions.access denied. The module now requires gsa_email.
- Pod service_account_name has to be set explicitly on the mounter Job. kubernetes_job_v1 without service_account_name runs pods as the namespace default KSA — the WI annotation on the datahub KSA never applies and the CSI driver hits the same 403. Had to explicitly set spec.template.spec.service_account_name in the Job.
- Each DataHub subchart creates its own serviceAccount by default. Setting the chart-top-level serviceAccount.name only affects the chart's own templates (setup jobs, mounter-free paths). The datahub-gms, datahub-frontend, datahub-mae-consumer, and datahub-mce-consumer subcharts each default to create: true with no annotation. Fix: override serviceAccount inside each subchart's block.
- global.sql.datasource.host must be host:port, not just host. The upgrade / system-update image uses the value directly as a tcp target (go-dockerize style); just an IP gives dial tcp: address 10.64.0.3: missing port in address and the pre-install hook hangs. The chart's quickstart values show the convention (host: "prerequisites-mysql:3306").
- Cancelling a Helm install mid-flight leaves the release pending-install forever. Subsequent applies fail with another operation (install/upgrade/rollback) is in progress. Recovery: helm uninstall datahub -n datahub (leaves Terraform-owned resources alone — namespace, KSA, SPC, mounter Job, CSI-synced Secret); the next apply starts clean. Happened three times this session.
- Terraform cancellation inside a Helm wait can orphan a GCS state lock. gsutil rm gs://ume-tf-state-poc-ume-data/environments/dev-03-runtime/default.tflock is the recovery; the state file itself is untouched as long as the Helm wait was the only operation in flight.
- Cross-stack remote_state outputs block plan. Splitting PR #83 (dev-01-base output) from #84 (dev-03-runtime consumer) was mandatory — terraform-plan.yml reads remote state from GCS, which only contains outputs after the producer stack applies. Same pattern as Story 10's two-PR split for operator vs. cluster.
#
Verification (post-apply)
- ✓ gh run view <apply> → conclusion=success.
- ✓ helm -n datahub list → datahub revision 1 deployed.
- ✓ kubectl -n datahub get pods: datahub-datahub-frontend and datahub-datahub-gms Running 1/1 on pool=workload; datahub-db-password-mounter, datahub-elasticsearch-setup-job, datahub-kafka-setup-job, datahub-system-update, datahub-system-update-nonblk all Completed.
- ✓ kubectl -n datahub logs deploy/datahub-datahub-gms → Ready: tcp://10.64.0.3:5432 (SQL) and Ready: tcp://ume-data-dev-kafka-kafka-bootstrap.kafka.svc:9092 (Kafka) early in the startup sequence; no ILM / CSI / secretKeyRef errors.
- ✓ kubectl -n opensearch get opensearchcluster.opensearch.org → HEALTH=green, NODES=3, VERSION=2.19.5 after the 1 → 3 migration.
- ✓ kubectl -n opensearch get pods -o wide → 3 data pods Running 1/1 on pool=workload; 3 PVCs bound, 5Gi premium-rwo.
- ✓ kubectl -n datahub get sa datahub -o jsonpath='{.metadata.annotations}' → {"iam.gke.io/gcp-service-account":"ume-datahub@poc-ume-data.iam.gserviceaccount.com"}.
- ✓ gcloud iam service-accounts get-iam-policy ume-datahub@… → roles/iam.workloadIdentityUser member serviceAccount:poc-ume-data.svc.id.goog[datahub/datahub].
- ✓ kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.spec.ismTemplate.indexPatterns}' → narrowed list; status.state=CREATED.
- Port-forward + browser verification is operator sign-off — not scripted here.
#
Then
Story 12 wires IAP + HTTPRoute on the shared Gateway + DataHub OIDC
against Google + the groups-and-policies bootstrap, replacing the
datahub/datahub JAAS fallback. Story 13 hardens cost + ops (label
audit, budget alerts, PDB verification, runbook drill).
#
Story 12 — DataHub IAP + HTTPRoute + OIDC Auth
Status: done
Date: 2026-04-22
PRs:
#89 (main — modules/datahub-helm adds HTTPRoute + OIDC surface, dev-03-runtime wires datahub_iap, imports the OIDC secret container) +
#90 (fix — real chart service name is datahub-datahub-frontend; flip frontend + GMS Services to ClusterIP so IAP is the only way in) +
this one (status entry).
DataHub is now reachable at https://datahub.umedev.marpont.es behind
IAP (perimeter) and DataHub's own Google OIDC (in-app identity). JAAS
stays on so the built-in datahub user can still bootstrap the first
Admin over port-forward — the proper groups / policies / admin-promotion
bootstrap lives in Story 13. Frontend and GMS Services flipped from
LoadBalancer to ClusterIP in the same round, so the two public IPs the
chart provisioned by default are gone.
#
What changed
- modules/datahub-helm/:
  - httproute.tf (new) — optional HTTPRoute attached to the shared Gateway, targeting the real chart-generated Service. Mirrors modules/airflow-helm/httproute.tf in shape. (A sketch of the route follows the Key decisions below.)
  - oidc.tf (new) — SecretProviderClass for the OIDC client secret + a kubernetes_job_v1 mounter that forces the CSI driver to materialise the backing k8s Secret before datahub-frontend starts. Same pre-install-hook-avoidance pattern as the DB password mounter from Story 11.
  - main.tf — local.frontend_service_name = "${var.release_name}-datahub-frontend" so both the HTTPRoute backend and the IAP target reference the chart's actual Service name (see gotcha below). Subchart overrides add service = { type = "ClusterIP" } on datahub-frontend and datahub-gms; chart default is LoadBalancer, which gives each Service a public IP that sits outside IAP. extraEnvs, extraVolumes, and extraVolumeMounts on datahub-frontend are populated unconditionally (see gotcha on the tuple-length flag drop).
  - variables.tf — httproute_enabled + gateway_* + hostname (mirrors the Airflow module). OIDC vars: oidc_client_id, oidc_client_secret_secret_id, oidc_base_url, oidc_discovery_uri (defaults to Google), oidc_user_name_claim=email, oidc_scopes=openid profile email, oidc_extract_groups_enabled=false for Phase 1. Client ID is not sensitive — shipped as a plain env var; the client secret flows through CSI.
  - outputs.tf — frontend_service_name now returns the derived <release>-datahub-frontend instead of the lie from Story 11.
- environments/dev-03-runtime/:
  - datahub.tf — module "datahub" gets httproute_enabled = true, the gateway refs from dev-02-k8s-base remote state, and the OIDC inputs. A google_secret_manager_secret.datahub_oidc_client_secret resource + import block adopts the human-created Secret Manager container on first apply (labels + replication reconcile, the version stays out-of-band forever). oidc_client_secret_secret_id reads from this resource so the module call tracks the tf-managed name. (An import-block sketch follows this list.)
  - iap.tf — module "datahub_iap" mirrors airflow_iap verbatim (same modules/iap-oauth/, same brand, same allow-list). Target service reads from module.datahub.frontend_service_name, not a hardcoded string.
  - variables.tf + terraform.tfvars — datahub_subdomain, datahub_oidc_client_id, datahub_oidc_client_secret_secret_id.
  - outputs.tf — datahub_url, datahub_namespace, datahub_iap_client_id (parity with the airflow outputs).
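Roughly, the adoption shape; the secret_id shown is hypothetical, and only the container is imported, never a version:

```hcl
import {
  to = google_secret_manager_secret.datahub_oidc_client_secret
  id = "projects/poc-ume-data/secrets/datahub-oidc-client-secret" # hypothetical secret name
}

resource "google_secret_manager_secret" "datahub_oidc_client_secret" {
  secret_id = "datahub-oidc-client-secret" # hypothetical; must match the hand-created container
  labels    = { env = "dev" }              # the real stack passes local.common_labels

  replication {
    auto {}
  }

  # The secret *version* (the OIDC client secret value) is never created or read by Terraform.
}
```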
#
Key decisions
- IAP at the perimeter, DataHub OIDC inside. IAP alone collapses to all-admin-or-all-reader; DataHub's role + policy + ownership layer does per-user and per-dataset work. Two OAuth clients on the same brand — IAP client managed by modules/iap-oauth/, DataHub OIDC client created manually in the Console (Story 12 spec).
- Reused modules/iap-oauth/ verbatim. Same brand, same IAM grants, same GCPBackendPolicy shape — no new bootstrap work. iap_allowed_users stays shared with Airflow until there's a reason to diverge.
- Terraform adopts the Secret Manager container, not the version. The import block brings the human-created container under tf management so labels + replication drift is caught. The value is never read — feedback_never_fetch_secrets.md stands.
- Client ID as a plain env var, not through CSI. OAuth 2.0 client IDs are public (visible in every redirect URL); no reason to make them mount-time dependencies.
- OIDC always on in the module; no oidc_enabled feature flag. First shipped with a var.oidc_enabled ? [...] : [] conditional; terraform rejected it because the two branches are tuples of different lengths and can't be unified. Dropped the flag since every caller enables OIDC anyway — re-introduce as list-typed locals if a no-auth dry-run path is ever needed again.
- Both frontend and GMS flipped to ClusterIP. Discovered mid-apply that the chart provisions type: LoadBalancer for both services, which allocates public IPs that completely bypass IAP. Closed the hole in PR #90 — the only ingress path is now Gateway API → HTTPRoute → IAP → ClusterIP Service.
- JAAS left enabled for Story 12. The built-in datahub local user still works via port-forward — it's the only way to bootstrap the first Admin in a fresh install. Story 13 replaces this with a policies-as-code Admin grant and can then turn JAAS off.
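A trimmed sketch of the httproute.tf shape, assuming the derived backend name and the frontend's 9002 service port; treat it as indicative rather than the module's exact code:

```hcl
variable "release_name"      { type = string } # "datahub"
variable "namespace"         { type = string }
variable "hostname"          { type = string } # datahub.umedev.marpont.es
variable "gateway_name"      { type = string }
variable "gateway_namespace" { type = string }
variable "httproute_enabled" { type = bool }

locals {
  # the chart prefixes Service names with the release name: datahub-datahub-frontend
  frontend_service_name = "${var.release_name}-datahub-frontend"
}

resource "kubernetes_manifest" "httproute" {
  count = var.httproute_enabled ? 1 : 0

  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "HTTPRoute"
    metadata   = { name = "datahub", namespace = var.namespace }
    spec = {
      parentRefs = [{ name = var.gateway_name, namespace = var.gateway_namespace }]
      hostnames  = [var.hostname]
      rules = [{
        backendRefs = [{ name = local.frontend_service_name, port = 9002 }]
      }]
    }
  }
}
```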
#
Invariant #11 — bootstrap CI IAM
No layers/00-bootstrap/ changes required. DataHub IAP reuses the same
resource types as the Airflow precedent:
- google_iap_client + google_project_iam_member (iap.httpsResourceAccessor) — covered by tf_apply_iap_admin (roles/iap.admin) on apply-SA and tfIapReader on plan-SA (now reads two IAP clients instead of one; same permissions cover both).
- kubernetes_secret_v1 (IAP OAuth secret), kubernetes_manifest (GCPBackendPolicy, HTTPRoute, SecretProviderClass), kubernetes_job_v1 (OIDC mounter) — covered by roles/container.admin on apply-SA and tfK8sSecretsReader on plan-SA.
- google_secret_manager_secret (OIDC client secret container, via import block) — roles/editor on apply-SA covers secret create/update, roles/secretmanager.secretAccessor on plan-SA covers refresh.
#
Gotchas
- DataHub chart prefixes Service names with the release name. The real Service is datahub-datahub-frontend, not datahub-frontend. First apply landed HTTPRoute + GCPBackendPolicy pointing at the short name; kubectl describe gcpbackendpolicy showed TargetNotFound, the L7 LB returned 404, and the frontend never saw a request. Airflow happens to name its Service airflow-api-server (no release prefix) so the same pattern worked there blind. Fix in PR #90: derive the name from var.release_name inside the module instead of hardcoding.
- type: LoadBalancer is the chart's default for frontend and GMS. Those two public IPs were provisioned from Story 11 onwards and sit outside the IAP perimeter entirely. Fixed in PR #90 by injecting service = { type = "ClusterIP" } on both subcharts. gcloud compute forwarding-rules list is worth running after any chart-based service landing in this repo.
- Terraform tuples can't be unified across different lengths. A var.oidc_enabled ? [... 8 envs ...] : [] ternary hit Inconsistent conditional result types: tuple length 8 vs 0 in validate. tolist() needs homogeneous element types; our envs mix value and valueFrom fields. Cleanest fix was to drop the feature flag — every caller turns OIDC on anyway. (A small illustration follows this list.)
- fault filter abort / 500 for ~2 minutes after the IAP policy replacement. Renaming the IAP OAuth client (display name embeds the service name) forces a google_iap_client replacement, which also re-creates the k8s Secret and GCPBackendPolicy. While the L7 LB's Envoy config reshuffles, unauthenticated requests get 500 with body fault filter abort instead of the expected 302 to accounts.google.com. Cleared on its own around T+2 min; no config knob to twiddle — just wait.
- depends_on = [helm_release.datahub] on the HTTPRoute means ServiceName drift hides behind the longest Helm step. When PR #90's apply ran, the GCPBackendPolicy and IAP client recreated immediately (new name) while the HTTPRoute waited for helm to finish upgrading — so for ~7 minutes the policy reported GatewayNotFound against a still-old HTTPRoute backendRef. Not a bug, just a noisy log window. Checking kubectl -n datahub get httproute datahub -o jsonpath='{.spec.rules[0].backendRefs[0].name}' pins down whether the flip has happened yet.
- Manual OIDC client creation is unavoidable today. google_iap_client only creates IAP-flavoured OAuth clients (no control over redirect URI). DataHub's own OIDC needs a "Web application" client with a redirect URI under the DataHub hostname, which is a Console-only step on a brand outside a Workspace org. Steps codified in the header of environments/dev-03-runtime/datahub.tf. Same shape as the IAP brand prerequisite in iap.tf.
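A minimal illustration of the type error and the list-typed-locals escape hatch mentioned above, with variable and env names simplified:

```hcl
# Fails `terraform validate`: the arms are tuples whose element types differ
# (env objects mixing `value` and `valueFrom`), so Terraform cannot unify them.
#
#   extraEnvs = var.oidc_enabled ? [ <8 mixed-shape env objects> ] : []
#   -> Error: Inconsistent conditional result types

# If a flag is ever reintroduced: give every element the same shape so the
# conditional's two arms share one list type.
variable "oidc_enabled" {
  type    = bool
  default = true
}

locals {
  oidc_envs = [
    { name = "AUTH_OIDC_ENABLED", value = "true", valueFrom = null },
    { name = "AUTH_OIDC_USER_NAME_CLAIM", value = "email", valueFrom = null },
  ]
  frontend_extra_envs = var.oidc_enabled ? local.oidc_envs : []
}
```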
#
Verification (post-apply)
- ✓ gh run view <apply> → conclusion=success for both PRs.
- ✓ helm list -n datahub → datahub revision 3 deployed after PR #90. Revision 2 rolled frontend with OIDC envs; revision 3 added the service-type flip.
- ✓ kubectl -n datahub get httproute datahub → Accepted=True, backend datahub-datahub-frontend.
- ✓ kubectl -n datahub describe gcpbackendpolicy datahub-datahub-frontend-iap → Attached=True.
- ✓ gcloud compute backend-services list --format='table(name,iap.enabled)' → iap.enabled=True on gkegw1-89a1-datahub-datahub-datahub-frontend-9002-*.
- ✓ kubectl -n datahub get svc → both datahub-datahub-frontend and datahub-datahub-gms type=ClusterIP, no EXTERNAL-IP.
- ✓ kubectl -n datahub get pod <frontend> -o jsonpath='{.spec.containers[0].env[?(@.valueFrom)]}' → AUTH_OIDC_CLIENT_SECRET wired via secretKeyRef {name: datahub-oidc-secret, key: client_secret}.
- ✓ curl -sI http://datahub.umedev.marpont.es/ → 301 to https.
- ✓ curl -sI https://datahub.umedev.marpont.es/ → 302 to accounts.google.com with the DataHub IAP client_id in the consent URL.
- Browser sign-in (operator sign-off) and non-allowlisted 403 — not scripted here.
#
Then
Story 13 hardens cost + ops (label audit, budget alerts, PDB
verification, maintenance window drill) and lands the
groups / domains / policies-as-code bootstrap that replaces the
manual Admin-promotion step — at which point the local datahub
JAAS user can be retired.