# Deployment Stories

This section defines the implementation sequence for wave-1. Each story is designed to be a single PR (or a small set of closely related PRs) that delivers a verifiable outcome.

Stories are ordered by dependency: each story builds on the output of the previous ones. Do not skip ahead.


## Phase 1 — Airflow on GKE

Phase 1 provisions a GKE Standard cluster and deploys Airflow via the official Apache Airflow Helm chart with CeleryExecutor + Redis. DataHub and its dependencies (Kafka, OpenSearch) are deferred to Phase 2 — they'll be added to the same cluster.

Why GKE Standard instead of Cloud Composer: Composer 3's minimum dev cost floor is ~$300-400/mo. Airflow on a single e2-standard-2 node + Cloud SQL db-g1-small costs ~$81/mo. The 4-5x cost difference is the primary driver. The operational burden is acceptable because the GKE cluster is already planned for DataHub, and all Phase 1 infrastructure is reused in Phase 2 — no throwaway work.

Content repo: DAG, dbt, and Docker image work lives in a sibling repo ume-data-dags. Merges to that repo build and push the custom Airflow image, rsync dags/ and dbt/ to the GCS DAGs bucket, and auto-open a tfvars-bump PR on this repo via the INFRA_PR_TOKEN-authenticated bot-PR workflow. ume-data-infra now only tracks the image tag in environments/dev-03-runtime/terraform.tfvars.

Initial Phase 1 content was scaffolded here under resources/ (Stories 4d + 5) and moved out once validated. See story-status.md for the migration record.


## Story 0 — Repository Scaffold

Repo: github.com/1edata/ume-data-infra Agent: infra-terraform Status: DONE

Initialize the ume-data-infra repository with directory skeleton, CI workflows, and the bootstrap stack stub. See story-status.md for details.


## Story 1 — Bootstrap

Stack: layers/00-bootstrap/ Agent: infra-terraform Status: DONE

Terraform state bucket, Artifact Registry, WIF pool + provider, CI service accounts, API enablement. See story-status.md for details.


## Story 2 — Platform Shared (Airflow-focused) → Doc Restructure

Scope: Documentation only (no Terraform resources) Agent: docs-infra Status: DONE

### What happened

Airflow service accounts are environment-scoped, not shared. The Workload Identity bindings reference a specific project's identity pool ({project}.svc.id.goog), and in the multi-project future each project gets its own SAs for its own cluster.

Decision: SA + WI binding creation moved to Story 3c (environments/dev-01-base/). layers/10-platform-shared/ deferred to Phase 2 when cross-environment resources appear (DataHub SA, KMS, logging sink).

### What this story delivered

  • Updated all docs to reflect the restructured SA location
  • Fixed SA naming to follow the ume-{purpose} convention: ume-airflow, ume-airflow-kpo
  • Fixed KSA naming: airflow (not airflow-scheduler — the Helm chart applies one KSA to all components)
  • Updated inter-stack contracts: dev-01-base exports SA emails, dev-02-runtime reads from one stack
  • Updated Story 3c spec to absorb SA + WI binding creation

### Design decisions

  • SA naming: ume-airflow and ume-airflow-kpo (follows ume-{purpose} convention from naming table)
  • KSA naming: airflow — the Helm chart's serviceAccount.name applies to scheduler, worker, webserver, and triggerer. A generic name is accurate.
  • SAs belong in environments/, not layers/: In the multi-project setup, each project has its own SAs for its own cluster. layers/ is for resources shared across all environments and projects (state bucket, WIF, AR).
  • layers/10-platform-shared/ deferred: No cross-environment resources exist in Phase 1. Created in Story 6 when DataHub work begins.
  • storage.objectAdmin project-wide for PoC: The log bucket doesn't exist until Story 4. Scope to specific buckets as a hardening task in Story 4.

### Then

Stories 3a–3d provision networking, Cloud SQL, Airflow IAM, and GKE (one PR each).


## Story 3a — Networking

Stack: environments/dev-01-base/ Agent: infra-terraform

### What to build

Creates the environments/dev-01-base/ directory with stack scaffolding and networking resources.

Stack scaffolding: versions.tf, variables.tf, outputs.tf, locals.tf, backend.hcl, terraform.tfvars, data.tf

Networking (networking.tf):

  • VPC ume-data-dev-vpc (custom mode, regional routing).
  • Subnet ume-data-dev-gke-nodes (10.0.0.0/20) with secondary ranges: gke-pods (10.4.0.0/14), gke-services (10.8.0.0/20).
  • Private Google Access enabled on subnet (for GCS, AR, Secret Manager, BigQuery API access).
  • Static IP ume-data-dev-nat-ip for Cloud NAT egress.
  • Cloud Router ume-data-dev-router + Cloud NAT ume-data-dev-nat for outbound internet from GKE nodes (private cluster, no public IPs). NAT applies to all subnets, error-only logging enabled.
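
A sketch of the subnet shape this implies, in the direct-resource form that predates the Story 3d module extraction (the `vpc` resource name is illustrative):

```hcl
resource "google_compute_subnetwork" "gke_nodes" {
  name                     = "ume-data-dev-gke-nodes"
  network                  = google_compute_network.vpc.id
  region                   = "us-east1"
  ip_cidr_range            = "10.0.0.0/20"
  private_ip_google_access = true # GCS, AR, Secret Manager, BigQuery without NAT

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```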

Remote state (data.tf):

  • terraform_remote_state data source reading 00-bootstrap outputs. Separated from networking.tf because it is a stack-level concern shared by Stories 3b-3d.
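
A minimal sketch of that shared data source, assuming the bootstrap stack writes its state under a layers/00-bootstrap prefix (the prefix and the referenced output name are illustrative):

```hcl
data "terraform_remote_state" "bootstrap" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "layers/00-bootstrap"
  }
}

# Downstream files in this stack then reference bootstrap outputs as, e.g.:
# data.terraform_remote_state.bootstrap.outputs.artifact_registry_url
```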

### Design decisions

  • Direct resources (modularized in Story 3d): Originally used direct resources. Extracted into modules/vpc/ in Story 3d via moved blocks.
  • ume-data-{env} naming prefix: Changed from ume-{env} to avoid generic collisions in shared GCP projects. Updated naming table in 04-terraform-structure.md.
  • Static NAT IP: Reserved google_compute_address for predictable egress. Allows allowlisting by external services.
  • ALL_SUBNETWORKS_ALL_IP_RANGES: No public subnets planned. Cloud NAT only affects VMs without external IPs, so this is safe even if public-IP VMs are added later.
  • Remote state in data.tf: Stack-level concern. Stories 3b-3d will add files to this stack that reference bootstrap outputs. Shared data source avoids duplication.
  • Zone variable in scaffolding: zone = us-east1-b included in variables.tf for Story 3d's zonal GKE cluster.
  • No composer subnet: Composer is not used. VPC design only needs GKE subnets.
  • No Private Service Access (PSA) here: PSA is only needed for Cloud SQL private IP — provisioned in Story 3b alongside the SQL instance.

### Outputs to export

  • vpc_id, vpc_self_link, subnet_self_link, pod_secondary_range_name, service_secondary_range_name, nat_ip_address

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform init -backend-config=backend.hcl && terraform validate passes
  • terraform plan shows 5 resources (VPC, subnet, static IP, router, NAT)
  • After CI apply: VPC and subnets exist: gcloud compute networks subnets list --project=poc-ume-data
  • After CI apply: Private Google Access enabled: gcloud compute networks subnets describe ume-data-dev-gke-nodes --region=us-east1 --format='value(privateIpGoogleAccess)'
  • After CI apply: Cloud NAT configured: gcloud compute routers list --project=poc-ume-data
  • After CI apply: Static IP reserved: gcloud compute addresses list --project=poc-ume-data --filter='name=ume-data-dev-nat-ip'

### Then

Story 3b adds Cloud SQL on this network.


## Story 3b — Cloud SQL

Stack: environments/dev-01-base/ Agent: infra-terraform Depends on: Story 3a (VPC for PSA peering)

### What to build

Cloud SQL (cloud-sql.tf):

  • Private Service Access (PSA) — google_compute_global_address (ume-data-dev-psa-range, 10.64.0.0/20) + google_service_networking_connection. PSA is only needed for Cloud SQL private IP; GCS/AR/Secret Manager use Private Google Access (enabled in Story 3a), not PSA.
  • PostgreSQL 16 instance ume-data-dev-airflow-pg, tier db-g1-small (shared core, 1.7 GB RAM).
  • Private IP via PSA (no public IP). enable_private_path_for_google_cloud_services = true.
  • IAM authentication flag enabled (cloudsql.iam_authentication = on). The actual IAM user (google_sql_user) and roles/cloudsql.client binding are created in Story 3c alongside the ume-airflow SA.
  • 10 GB SSD, auto-increase enabled, limit 50 GB (safety cap).
  • Automated daily backups at 3 AM UTC, 7-day retention. No PITR (deferred to prod).
  • Maintenance window: Sunday 4 AM UTC, stable track.
  • deletion_protection = false (PoC only).
  • airflow database created via google_sql_database so Story 4's Helm chart can connect immediately.
  • Break-glass admin password: google_secret_manager_secret shell (ume-data-dev-cloudsql-admin-password). Value populated out-of-band. Default postgres user password set manually — no separate Terraform-managed admin user.
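
A sketch of the PSA pairing plus the instance flags named above (abbreviated to the settings this story calls out; resource names illustrative):

```hcl
resource "google_compute_global_address" "psa_range" {
  name          = "ume-data-dev-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  address       = "10.64.0.0"
  prefix_length = 20
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}

resource "google_sql_database_instance" "airflow" {
  name             = "ume-data-dev-airflow-pg"
  database_version = "POSTGRES_16"
  depends_on       = [google_service_networking_connection.psa]

  settings {
    tier = "db-g1-small"

    ip_configuration {
      ipv4_enabled                                  = false
      private_network                               = google_compute_network.vpc.id
      enable_private_path_for_google_cloud_services = true
    }

    database_flags {
      name  = "cloudsql.iam_authentication"
      value = "on"
    }
  }

  deletion_protection = false # PoC only
}

resource "google_sql_database" "airflow" {
  name     = "airflow"
  instance = google_sql_database_instance.airflow.name
}
```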

### Design decisions

  • db-g1-small over db-f1-micro: db-f1-micro has 614 MB RAM — OOM risk under write load. db-g1-small at 1.7 GB is sufficient for Airflow metadata. Cost: $26 vs $8/mo.
  • PostgreSQL 16: Latest GA on Cloud SQL with improved query performance. Airflow supports 12-16.
  • PSA range /20 not /24: Zero cost difference (just an IP allocation). Expanding PSA ranges later requires deleting/recreating the peering connection (downtime). /20 is future-proof for DataHub, replicas.
  • PSA range hardcoded at 10.64.0.0: Deterministic, reproducible plans. Safely outside all existing allocations (nodes 10.0.0.0/20, pods 10.4.0.0/14, services 10.8.0.0/20).
  • airflow database created here, not in Story 4: Story 4's Helm chart expects metadataConnection.db: airflow. Creating the database alongside the instance avoids a manual prerequisite.
  • No google_sql_user for admin: The default postgres user is created automatically by Cloud SQL. Break-glass access uses postgres + password from Secret Manager.
  • disk_autoresize_limit = 50: Safety cap prevents runaway growth on a PoC instance.
  • File name cloud-sql.tf (not persistence.tf): More specific, consistent with networking.tf and gke.tf. Updated 04-terraform-structure.md to match.
  • No labels on PSA range: google_compute_global_address with purpose = VPC_PEERING rejects labels (GCP API limitation).
  • Shared instance strategy: When DataHub arrives in Phase 2, evaluate whether to create a second logical database on this instance (cheaper) or a separate instance (better isolation).

### Outputs to export (added)

  • sql_connection_name, sql_private_ip, sql_instance_name

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform validate passes
  • After CI apply: Cloud SQL running: gcloud sql instances list --project=poc-ume-data
  • After CI apply: Private IP assigned (no public): gcloud sql instances describe ume-data-dev-airflow-pg --format='value(ipAddresses)'
  • After CI apply: PSA range allocated: gcloud compute addresses list --global --filter='purpose=VPC_PEERING' --project=poc-ume-data
  • After CI apply: airflow database exists: gcloud sql databases list --instance=ume-data-dev-airflow-pg --project=poc-ume-data
  • After CI apply: Secret shell exists: gcloud secrets list --project=poc-ume-data --filter='name:cloudsql-admin-password'

### Then

Story 3c creates the Airflow service accounts.


## Story 3c — Airflow IAM

Stack: environments/dev-01-base/ Agent: infra-terraform Depends on: Story 3a (stack scaffolding), Story 3b (Cloud SQL instance for IAM database user)

### What to build

Airflow service accounts and IAM (iam.tf):

  • ume-airflow service account with roles/bigquery.dataEditor, roles/cloudsql.client, roles/secretmanager.secretAccessor, roles/storage.objectAdmin (project-wide for PoC; scope to specific buckets in Story 4).
  • ume-airflow-kpo service account with roles/bigquery.dataEditor, roles/storage.objectViewer (scoped identity for KPO tasks — separate from main Airflow SA for security isolation).
  • Workload Identity bindings for both SAs (depends_on = [module.gke] — GCP validates the WI pool exists, so these must wait for the cluster):
    • airflow KSA in airflow namespace → ume-airflow GSA
    • airflow-kpo KSA in airflow-kpo namespace → ume-airflow-kpo GSA
  • Cloud SQL IAM database user (google_sql_user with type = CLOUD_IAM_SERVICE_ACCOUNT) for the ume-airflow SA — deferred from Story 3b.
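
A sketch of the pieces above, assuming local role sets and the flat resource names from Stories 3a-3b (addresses illustrative):

```hcl
locals {
  airflow_roles = toset([
    "roles/bigquery.dataEditor",
    "roles/cloudsql.client",
    "roles/secretmanager.secretAccessor",
    "roles/storage.objectAdmin", # TODO(narrow-scope): bucket-level in Story 4
  ])
}

resource "google_project_iam_member" "airflow" {
  for_each = local.airflow_roles
  project  = var.project_id
  role     = each.value
  member   = "serviceAccount:${google_service_account.airflow.email}"
}

# WI binding: waits for the cluster so the WI pool exists.
resource "google_service_account_iam_member" "airflow_wi" {
  service_account_id = google_service_account.airflow.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[airflow/airflow]"
  depends_on         = [module.gke]
}

# IAM database user: SA email with the .gserviceaccount.com suffix trimmed.
resource "google_sql_user" "airflow_iam" {
  instance = google_sql_database_instance.airflow.name
  name     = trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```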

### Design decisions

  • google_sql_user in iam.tf, not cloud-sql.tf: IAM concern (granting SA database auth). Keeps Story 3c's PR self-contained.
  • google_project_iam_member (additive): Same pattern as bootstrap. Authoritative bindings would revoke other members from shared roles like roles/bigquery.dataEditor.
  • for_each over role sets: Role bindings use toset() locals with for_each. Adding/removing a role is a one-line change. Plan output is self-documenting (keys are full role strings).
  • trimsuffix for SQL user name: The GCP API expects the SA email without .gserviceaccount.com. Using trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com") maintains the Terraform dependency graph.
  • No labels on any resources: google_service_account, google_project_iam_member, google_service_account_iam_member, and google_sql_user do not support GCP labels. Not a label-invariant violation.
  • WI bindings depend on GKE: GCP validates the Workload Identity pool ({project}.svc.id.goog) exists — it is created when a GKE cluster enables Workload Identity. The bindings use depends_on = [module.gke] to ensure correct ordering. GCP does NOT validate that the KSA exists (Story 4 creates them via Helm).
  • Broad permissions flagged for scoping: roles/storage.objectAdmin and roles/secretmanager.secretAccessor are project-wide for PoC. Inline TODO(narrow-scope) comments mark these for Story 4 / future hardening.

### Outputs to export (added)

  • airflow_sa_email, airflow_kpo_sa_email

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform validate passes
  • After CI apply: gcloud iam service-accounts list --project=poc-ume-data | grep ume-airflow
  • After CI apply: Both SAs created with correct roles
  • After CI apply: Workload Identity bindings exist: gcloud iam service-accounts get-iam-policy ume-airflow@poc-ume-data.iam.gserviceaccount.com
  • After CI apply: Workload Identity bindings exist: gcloud iam service-accounts get-iam-policy ume-airflow-kpo@poc-ume-data.iam.gserviceaccount.com
  • After CI apply: Cloud SQL IAM user exists: gcloud sql users list --instance=ume-data-dev-airflow-pg --project=poc-ume-data

### Then

Story 3d provisions the GKE cluster.


## Story 3d — GKE Cluster + Module Extraction

Stack: environments/dev-01-base/ + modules/gke-standard/ + modules/vpc/ + modules/cloud-sql-postgres/ Agent: infra-terraform Depends on: Story 3a (VPC subnets for nodes/pods/services)

### What to build

Module extraction (applied first): Extract existing flat resources from Stories 3a-3c into reusable modules. State migrated via moved blocks (declarative, CI-friendly — no manual terraform state mv).

  • modules/vpc/ — VPC, subnet with GKE secondary ranges, Cloud NAT, Cloud Router. Single network_cidr_base (/12) parameter derives all CIDRs via cidrsubnet().
  • modules/cloud-sql-postgres/ — PSA peering, Cloud SQL instance, database, admin password secret. Includes PSA because its sole purpose is Cloud SQL private networking.
  • IAM stays flat in the env layer (policy layer, not infrastructure pattern).
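
A sketch of the two mechanisms named above: a moved block for the state migration, and the cidrsubnet() derivation inside modules/vpc/ (netnum values assume a 10.0.0.0/12 base; resource addresses illustrative):

```hcl
# environments/dev-01-base/moved.tf (deleted once applied; see Story 4a)
moved {
  from = google_compute_network.vpc
  to   = module.vpc.google_compute_network.vpc
}

# modules/vpc/locals.tf
locals {
  nodes_cidr    = cidrsubnet(var.network_cidr_base, 8, 0)   # 10.0.0.0/20
  pods_cidr     = cidrsubnet(var.network_cidr_base, 2, 1)   # 10.4.0.0/14
  services_cidr = cidrsubnet(var.network_cidr_base, 8, 128) # 10.8.0.0/20
}
```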

Bootstrap fix: Custom role tfIamPolicyAdmin on tf-apply-sa with {get,set}IamPolicy for both projects and service accounts. roles/editor omits these permissions, which are needed for google_project_iam_member and google_service_account_iam_member. Applied manually before CI can manage IAM bindings.

GKE module (modules/gke-standard/): Reusable module encapsulating cluster creation, node pool management, naming, labels, and security defaults. All settings exposed as variables with sensible defaults. Called from environments/dev-01-base/gke.tf.

GKE cluster (via module):

  • Cluster ume-data-dev-gke, zonal (us-east1-b) for dev PoC. Regional deferred to prod.
  • Private cluster: private nodes, public endpoint with authorized networks (default 0.0.0.0/0 for dev, variable-driven for future Cloudflare WARP/VPN restriction).
  • Master CIDR: 172.16.0.0/28 (control plane VPC peering, outside all existing allocations).
  • Workload Identity enabled (${project_id}.svc.id.goog).
  • Release channel: Regular.
  • Dataplane V2 (ADVANCED_DATAPATH) for built-in network policy enforcement via Cilium/eBPF. Chosen over Calico (spec's original choice) because it is Google's strategic direction and avoids LEGACY_DATAPATH.
  • Maintenance window: weekdays 02:00-06:00 UTC.
  • deletion_protection = true.

Node pools:

| Pool | Machine | Disk | Min | Max | Spot | Taints | Purpose |
|------|---------|------|-----|-----|------|--------|---------|
| default-pool | e2-standard-2 | 100 GB pd-balanced | 1 | 2 | No | None | Airflow + system services |
| kpo-pool | e2-standard-2 | 100 GB pd-balanced | 0 | 3 | Yes | workload=kpo:NoSchedule | KPO batch tasks (scale-to-zero) |

Both pools: shielded instances (secure boot + integrity monitoring), Workload Identity metadata mode, legacy metadata endpoint disabled, surge upgrade (max_surge=1, max_unavailable=0).

The kpo-pool scales to zero nodes when idle. When Airflow triggers a KPO task, the pod is created with a toleration for the workload=kpo:NoSchedule taint and a nodeSelector for pool: kpo. The Cluster Autoscaler detects the pending pod and provisions a spot node (~60-90s cold start). After ~10 minutes idle, the node is removed. Max 3 nodes in dev (tightened from 10 to limit blast radius from runaway DAGs).
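
Inside modules/gke-standard/, the kpo pool would take roughly this shape (a sketch; attribute names follow the google provider, the resource address is illustrative):

```hcl
resource "google_container_node_pool" "kpo" {
  name     = "kpo-pool"
  cluster  = google_container_cluster.this.id
  location = var.zone

  autoscaling {
    min_node_count = 0 # scale-to-zero when no KPO tasks are pending
    max_node_count = 3
  }

  node_config {
    machine_type = "e2-standard-2"
    disk_type    = "pd-balanced"
    disk_size_gb = 100
    spot         = true
    labels       = { pool = "kpo" }

    # Only pods carrying a matching toleration (KPO tasks) schedule here.
    taint {
      key    = "workload"
      value  = "kpo"
      effect = "NO_SCHEDULE"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }
}
```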

### Design decisions

  • Local module (modules/gke-standard/): Encapsulates cluster + node pools + naming + labels + security defaults. Environment stacks call the module with different parameters (machine types, node counts, location). Prod replication requires changing ~10 values in the module call instead of duplicating 160 lines of Terraform. All settings exposed as variables with defaults for maximum configurability per environment.
  • Zonal cluster for dev: Halves node count vs regional. Regional deferred to prod when HA is required.
  • e2-standard-2 is the smallest viable machine: Shared-core machines (e2-small, e2-medium) lose ~1060m to flat CPU reservation. With e2-standard-2 (2 vCPU, 8 GiB), allocatable is ~1930m CPU / ~6.1 GiB RAM.
  • Dataplane V2 over Calico: Irreversible choice (requires cluster recreation to change). Cilium/eBPF is more performant than iptables-based Calico. Built-in network policy enforcement without a separate network_policy block. Known limitations reviewed: anetd CPU usage under high TCP churn (not applicable for Airflow), no manual internal passthrough NLBs (not needed).
  • Authorized networks 0.0.0.0/0: API server still requires authentication regardless. Variable-driven list(object) makes restricting to Cloudflare WARP CIDRs a one-line tfvars change.
  • Master CIDR 172.16.0.0/28 hardcoded: Architectural decision, not per-environment. In a different RFC 1918 block from all existing allocations (nodes 10.0.0.0/20, pods 10.4.0.0/14, services 10.8.0.0/20, PSA 10.64.0.0/20).
  • kpo-pool max=3: Tightened from 10 for dev PoC. Limits cost exposure from runaway DAGs while allowing some parallelism.
  • deletion_protection = true: Deliberate two-step teardown (flip flag, then destroy). Safer default even for PoC.
  • oauth_scopes = ["cloud-platform"]: Broad scope is standard practice because Workload Identity provides fine-grained pod-level auth. Node-level scopes are a legacy mechanism.

### Phase 1 resource budget (1x e2-standard-2 default-pool)

| Consumer | CPU request | Memory request |
|----------|-------------|----------------|
| Airflow scheduler | 500m | 1.5 Gi |
| Celery worker (1) | 250m | 1 Gi |
| Airflow webserver | 250m | 512 Mi |
| Airflow triggerer | 100m | 256 Mi |
| Redis | 50m | 128 Mi |
| System pods (kube-system) | ~300m | ~400 Mi |
| Used | ~1450m | ~3.8 Gi |
| Remaining headroom | ~480m | ~2.3 Gi |

Snug but workable — dbt-bigquery is I/O-bound (submits SQL and waits). See Airflow on GKE — Scaling signals for when to upgrade to e2-standard-4.

### Outputs to export

  • gke_cluster_name, gke_endpoint, gke_ca_cert (sensitive)

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform validate passes
  • After CI apply: GKE cluster running: gcloud container clusters list --project=poc-ume-data
  • After CI apply: kubectl works: gcloud container clusters get-credentials ume-data-dev-gke --zone=us-east1-b --project=poc-ume-data && kubectl get nodes
  • After CI apply: One default-pool node visible, zero kpo-pool nodes
  • After CI apply: Both pools listed: gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b --project=poc-ume-data

### Then

Story 4 deploys Airflow onto the cluster.


## Story 4a — Runtime Stack Scaffolding + GCS Buckets

Stack: environments/dev-02-runtime/ + modules/gcs-bucket/ + updates to modules/gke-standard/, environments/dev-01-base/, layers/00-bootstrap/ Agent: infra-terraform Depends on: Story 3d (dev-01-base complete)

### What to build

New module — modules/gcs-bucket/:

  • google_storage_bucket with configurable name, location, storage class, lifecycle rules, versioning.
  • Hardcoded: uniform bucket-level access.
  • Variables: name, project_id, location, storage_class, versioning (bool), force_destroy (bool, default false), lifecycle_rules (list of objects supporting Delete and SetStorageClass actions with age, created_before, num_newer_versions, with_state conditions), labels.

GKE module update — modules/gke-standard/:

  • Add gcs_fuse_csi_enabled variable (default true).
  • Enable gcs_fuse_csi_driver_config add-on on the cluster via addons_config block. Required for GCS-based DAG sync in Story 4b.

Prerequisite fixes (gaps from Story 3d):

  • environments/dev-01-base/outputs.tf — Add missing GKE outputs: gke_cluster_name, gke_endpoint, gke_ca_cert (sensitive). Required by dev-02-runtime's kubernetes/helm providers via remote state.
  • environments/dev-01-base/moved.tf — Delete (moves applied in Story 3d, file is dead weight).
  • layers/00-bootstrap/main.tf — Add roles/container.viewer to plan SA. Required for terraform plan on kubernetes/helm resources (Story 4b onward). roles/viewer does not grant k8s API access.

Stack scaffolding — environments/dev-02-runtime/:

  • versions.tf — Terraform + google + google-beta + kubernetes + helm providers. Kubernetes and Helm providers use data.google_client_config.default.access_token for auth and read endpoint + CA cert from dev-01-base remote state.
  • variables.tf — Active: project_id, environment, region, zone, state_bucket. Commented out (wired by later stories): airflow_image_repository, airflow_image_tag, domain_name, airflow_subdomain.
  • outputs.tf — airflow_logs_bucket, airflow_dags_bucket.
  • locals.tf — common_labels (layer=runtime).
  • backend.hcl — GCS backend: ume-tf-state-poc-ume-data/environments/dev-02-runtime/.
  • terraform.tfvars — dev values.
  • data.tf — terraform_remote_state reading dev-01-base + 00-bootstrap, plus google_client_config for access token.
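
A sketch of that provider wiring, assuming the remote-state data source is named base (the gke_endpoint / gke_ca_cert outputs are the ones added to dev-01-base in this story):

```hcl
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
}

provider "helm" {
  kubernetes {
    host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
  }
}
```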

GCS buckets (buckets.tf):

  • Log bucket via module: ume-airflow-logs-poc-ume-data, 90-day delete lifecycle, no versioning.
  • DAGs bucket via module: ume-airflow-dags-poc-ume-data, no lifecycle (synced from CI), versioning enabled (rollback support).
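
A sketch of the log-bucket call against the variable surface described above (assumes the module's lifecycle_rules type marks unused condition fields optional):

```hcl
module "airflow_logs" {
  source     = "../../modules/gcs-bucket"
  name       = "ume-airflow-logs-poc-ume-data"
  project_id = var.project_id
  location   = var.region
  versioning = false
  labels     = local.common_labels

  lifecycle_rules = [{
    action    = { type = "Delete" }
    condition = { age = 90 } # days
  }]
}
```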

### Design decisions

  • modules/gcs-bucket/ module: Log bucket, DAG bucket, and future data buckets share the same pattern (lifecycle, labels, uniform access, versioning). Module-first strategy, justified by multiple callers within Phase 1 alone.
  • Full lifecycle rule support: lifecycle_rules variable accepts a list of objects with action type (Delete/SetStorageClass) and multiple condition types. Handles tiering rules, not just age-based delete.
  • force_destroy as variable: Module invariant says expose all configurable settings. Defaults to false (safe), dev can override for easy teardown.
  • GCS FUSE CSI over git-sync: Workload Identity handles auth to GCS (already configured in Story 3c). No tokens, SSH keys, or deploy keys needed. GCS FUSE is a native GKE add-on. See Story 4b for the mount configuration.
  • Layer named dev-02-runtime (was dev-03-runtime): The dev-02-k8s-base layer was planned for Phase 2 (Strimzi, OpenSearch, ingress). Skipping from dev-01 to dev-03 is confusing when dev-02 doesn't exist. Renumber if Phase 2 needs an intermediate layer.
  • Provider auth pattern: kubernetes/helm providers use data.google_client_config.default.access_token + GKE endpoint/CA from remote state. No gcloud get-credentials calls. Providers initialize lazily, so Story 4a (no k8s resources) doesn't require cluster connectivity during plan.
  • roles/container.viewer on plan SA: roles/viewer does not map to any k8s RBAC role, so the plan SA cannot read k8s state for drift detection. roles/container.viewer grants read-only k8s API access via the view ClusterRole.
  • Two remote state sources: dev-02-runtime reads from both dev-01-base (GKE, SQL, SA outputs) and 00-bootstrap (AR URL, state bucket). Clear provenance over pass-through outputs.

### What to verify

  • terraform fmt -check -recursive passes across all changed stacks
  • terraform init -backend=false && terraform validate passes on modules/gcs-bucket, environments/dev-01-base, environments/dev-02-runtime, layers/00-bootstrap
  • terraform plan shows: 2 GCS buckets + GKE cluster update (FUSE add-on) + 3 new outputs on dev-01-base + 1 new IAM binding on bootstrap
  • After CI apply: buckets exist: gsutil ls gs://ume-airflow-logs-poc-ume-data/ and gsutil ls gs://ume-airflow-dags-poc-ume-data/
  • After CI apply: GCS FUSE CSI enabled on cluster

### Then

Story 4b deploys Airflow onto the cluster.


## Story 4b — Airflow Helm Release (Stock Image, Port-Forward)

Stack: environments/dev-02-runtime/ + modules/airflow-helm/, with base-layer changes in environments/dev-01-base/ and modules/cloud-sql-postgres/ Agent: infra-terraform Depends on: Story 4a (buckets created, providers configured, GCS FUSE enabled)

### What to build

New module — modules/airflow-helm/: Namespace, shared service account, connection secrets, DB bootstrap Job, and Helm release. All settings exposed as variables with defaults. Called from environments/dev-02-runtime/airflow.tf as module "airflow".

Airflow Helm release (via module):

  • Official Apache Airflow Helm chart 1.20.0 deployed via helm_release.
  • Stock image: apache/airflow:3.2.0 (parametrized via var.airflow_image_repository + var.airflow_image_tag). Custom image with Cosmos/dbt added in Story 4d.
  • Executor: CeleryExecutor with Redis.
  • 1 Celery worker (min=1, always on).
  • Triggerer enabled (for deferrable operators).
  • DAG processor enabled (mandatory standalone component in Airflow 3).
  • API server enabled (replaces webserver in Airflow 3 — serves UI and REST API).
  • Namespace: airflow.
  • No external auth — basic admin user created via Helm createUserJob. Port-forward access is already gated by kubectl / GKE IAM.

Airflow 3 component changes (vs. Airflow 2):

  • Chart 1.20.0 uses semver gates in templates: apiServer renders for Airflow >= 3.0.0, webserver renders for < 3.0.0.
  • dagProcessor is mandatory — DAG parsing moved out of the scheduler into a standalone process.
  • webserver block kept only for defaultUser config consumed by createUserJob. Its deployment template does not render.

Workload Identity:

  • Chart 1.20.0 creates per-component KSAs by default (airflow-scheduler, airflow-api-server, etc.), none of which carry the WI annotation.
  • A single kubernetes_service_account_v1 is created in Terraform with the WI annotation, and every component references it via serviceAccount = { create = false, name = "airflow" }.
  • The base layer's WI binding targets [airflow/airflow].

Cloud SQL connection (via Auth Proxy sidecar):

  • Cloud SQL Auth Proxy gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.3 added as extraContainers on scheduler, workers, api-server, triggerer, dag-processor.
  • Proxy flags: --structured-logs --auto-iam-authn --private-ip --port=5432.
  • --private-ip is required because the Cloud SQL instance has only a private IP (PSA networking).
  • --auto-iam-authn lets the proxy handle IAM token refresh via Workload Identity.
  • Connection string: Pre-built kubernetes_secret_v1 with URL-encoded IAM user (the @ in ume-airflow@poc-ume-data.iam breaks the Helm chart's URI template). Referenced via data.metadataSecretName / data.resultBackendSecretName.
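
A sketch of that pre-built Secret; the connection key name is an assumption to verify against chart 1.20.0's metadataSecretName handling:

```hcl
resource "kubernetes_secret_v1" "airflow_metadata" {
  metadata {
    name      = "airflow-metadata-connection"
    namespace = "airflow"
  }
  data = {
    # '@' in the IAM user must be percent-encoded (%40), or the URI parser
    # splits the host at the wrong place. The proxy listens on localhost:5432.
    connection = "postgresql://ume-airflow%40poc-ume-data.iam@127.0.0.1:5432/airflow"
  }
}
```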

Bootstrap Job (kubernetes_job_v1.db_bootstrap):

  • Runs before the Helm release via depends_on.
  • Cloud SQL Auth Proxy native sidecar (init container with restartPolicy: Always).
  • Step 1 (grants init container): connects as postgres admin, GRANTs privileges to the IAM user on the airflow database. Cloud SQL IAM users are created without any DB privileges.
  • Step 2 (migrate init container): runs airflow db migrate as the IAM user via the proxy.
  • The chart's migrateDatabaseJob is disabled because its hook runs only after the main release resources are created, and it failed when the IAM user's privileges didn't exist yet.
  • The postgres admin password is fetched at runtime from Secret Manager via Workload Identity. No long-lived credentials in Kubernetes.

Base-layer changes (required for the bootstrap to work):

  • roles/cloudsql.instanceUser added to the Airflow SA. This is required for IAM database authentication (cloudsql.instances.login), separate from roles/cloudsql.client which only allows proxy connections.
  • cloud-sql-postgres module: automated postgres admin password via random_password + google_sql_user + google_secret_manager_secret_version. No manual password setup.
  • Default pool max_count raised from 2 to 3 (7 Airflow pods with sidecars need room on e2-standard-2 nodes).

DAG sync via GCS FUSE:

  • dags.gitSync.enabled = false.
  • Per-component extraVolumes + extraVolumeMounts on scheduler, workers, triggerer, dag-processor.
  • Pod annotations override GCS FUSE sidecar resources: GKE default injection is 250m CPU / 256Mi memory / 5Gi ephemeral, overridden to 10m / 64Mi / 256Mi (read-only DAG mount barely uses any CPU). Frees ~960m CPU requests across 4 pods.
  • Mounted at /opt/airflow/dags/ (read-only).

Remote logging to GCS (hybrid with Cloud Logging):

  • Container stdout/stderr goes to Cloud Logging automatically (GKE default, zero config).
  • Airflow task execution logs go to GCS via built-in remote_logging:
```yaml
env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: "gs://ume-airflow-logs-poc-ume-data/logs"
  - name: AIRFLOW__LOGGING__DELETE_LOCAL_LOGS
    value: "True"
```

Probe tuning: Chart default probes run airflow jobs check which imports the full Python framework on every invocation. On e2-standard-2 nodes this takes >20s. Startup probe failureThreshold set to 20 on scheduler and api-server. Liveness probe timeoutSeconds raised to 60 on scheduler, worker, triggerer, dag-processor.

Cleanup: Standalone kubernetes_cron_job_v1 (disabled by default, var.cleanup_enabled = false). The chart's built-in cleanup section doesn't support extraInitContainers, so the Cloud SQL Auth Proxy can't be injected there.

Resource requests (dev PoC — 2-3x e2-standard-2 nodes):

```yaml
scheduler:
  resources:
    requests: { cpu: 200m, memory: 512Mi }
    limits: { cpu: "1", memory: 1Gi }

apiServer:
  resources:
    requests: { cpu: 250m, memory: 512Mi }
    limits: { cpu: 500m, memory: 1Gi }

dagProcessor:
  resources:
    requests: { cpu: 150m, memory: 384Mi }
    limits: { cpu: 500m, memory: 1Gi }

workers:
  replicas: 1
  resources:
    requests: { cpu: 500m, memory: 1536Mi }
    limits: { cpu: "1.5", memory: 3Gi }

triggerer:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 250m, memory: 512Mi }

redis:
  enabled: true
  resources:
    requests: { cpu: 50m, memory: 64Mi }

postgresql:
  enabled: false  # external Cloud SQL
```

Hardening note: ume-airflow has project-wide roles/storage.objectAdmin. After this story, scope the grant to the specific log and DAG buckets via bucket-level IAM.

### Design decisions

  • Airflow 3.2.0 / chart 1.20.0: Spec was for 2.10.3 / 1.15.0. Upgraded because 3.2.0 was latest stable at deployment time, which forced the apiServer, dagProcessor, shared KSA, and bootstrap Job changes below.
  • Stock image first: Validates the platform before adding Cosmos/dbt. Custom image in Story 4d.
  • Cloud SQL Auth Proxy sidecar (not Python connector): Stock Airflow image lacks cloud-sql-python-connector. Auth Proxy handles IAM token refresh as a sidecar with no image dependencies.
  • Shared KSA: Chart 1.20.0 creates per-component KSAs, none with WI. One Terraform-managed kubernetes_service_account_v1 avoids N separate WI bindings and keeps the base layer's [airflow/airflow] binding working.
  • Terraform bootstrap Job: The chart's migrateDatabaseJob is a post-install hook — runs after the release resources exist. Cloud SQL IAM users start with zero DB privileges, so the hook fails on first install. The Terraform Job runs grants + migrate before the Helm release, then disables the chart's migration job. See backlog for investigating the chart's intended pattern.
  • waitForMigrations disabled: Chart 1.20.0 places extraInitContainers after the wait-for-airflow-migrations init container, so a native sidecar proxy there wouldn't be running when the check executes. Safe to disable because the Terraform bootstrap Job already ran migrations.
  • --private-ip: Cloud SQL instance is private-only (PSA). Without this flag the proxy tries public IP and fails.
  • GCS FUSE resource overrides: Default injection (250m CPU / 256Mi memory / 5Gi ephemeral per pod) is overkill for a read-only DAG mount. Annotations bring it down to 10m / 64Mi / 256Mi.
  • Probe timeout 60s: airflow jobs check imports the full framework. 20s is not enough on e2-standard-2.
  • Scheduler CPU limit 1000m: At 500m the scheduler was throttled during Python import and couldn't start within the probe window.
  • Pre-built connection Secrets: IAM DB user ume-airflow@poc-ume-data.iam has @ which breaks standard URI parsing in the Helm chart's template.
  • Port-forward for initial access: No ingress, DNS, or TLS on the critical path. Port-forward is already gated by kubectl / GKE IAM. External access in Story 4c.
  • Hybrid logging: Container logs go to Cloud Logging, task execution logs go to GCS (Airflow UI reads them natively).
  • GCS FUSE over git-sync: Auth handled by Workload Identity, no tokens or keys. CI pushes DAGs to GCS on merge to main.

### What to verify

  • terraform fmt -check -recursive passes
  • terraform init -backend=false && terraform validate passes on dev-02-runtime
  • terraform plan clean on both base and runtime stacks
  • All Airflow pods running: api-server 2/2, scheduler 4/4, dag-processor 4/4, triggerer 4/4, worker 4/4, redis 1/1, statsd 1/1
  • Auth Proxy sidecars running with successful DB connections: kubectl logs deploy/airflow-scheduler -c cloud-sql-proxy -n airflow
  • Bootstrap Job completed (grants + migrations)
  • Airflow UI accessible: kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow
  • Push a hello-world DAG to GCS DAGs bucket, appears in Airflow UI
  • Hello-world DAG runs on the Celery worker
  • Logs appear in GCS: gsutil ls gs://ume-airflow-logs-poc-ume-data/logs/
  • Cloud Logging shows container logs from airflow namespace

### Outputs to export

  • airflow_namespace
  • airflow_logs_bucket (GCS log bucket name)
  • airflow_dags_bucket (GCS DAGs bucket name)

### Then

Story 4c adds ingress, TLS, DNS, and OIDC authentication for the API server.


## Story 4c — Ingress + TLS + DNS + IAP (Gateway API, three layers)

Stacks: layers/00-bootstrap/, environments/dev-01-base/, environments/dev-02-k8s-base/ (new), environments/dev-03-runtime/ (renamed from dev-02-runtime/) Agent: infra-terraform Status: DONE Depends on: Story 4b (Airflow running)

The original spec called for classic GKE Ingress + Flask-AppBuilder OAuth in webserver_config.py. Both were abandoned during execution: classic Ingress can't share a static IP across services (precluding shared-IP + wildcard DNS + per-app ingress), and Airflow 3 replaced Flask-AppBuilder auth with a pluggable auth_manager. The shipped design uses GKE Gateway API with IAP at the load balancer. See story-status.md for the PR-by-PR account.

# What was built

Layer structure reshuffle. New environments/dev-02-k8s-base/ platform layer (pulled forward from Story 8). Old dev-02-runtime/ renamed to dev-03-runtime/. DNS + shared static IP + wildcard cert moved to dev-01-base/ (zero k8s provider dependency).

layers/00-bootstrap/:

  • dns.googleapis.com, iap.googleapis.com APIs enabled.
  • roles/iap.admin on tf-apply-sa (brand/client write path).
  • Custom role tfIapReader on tf-plan-sa with clientauthconfig.{brands,clients}.{get,list}WithSecret variants (plan refresh).
  • Invariant added to CLAUDE.md: verify plan-SA + apply-SA permission coverage before every new downstream resource type.

environments/dev-01-base/:

  • google_dns_managed_zone ume-data-${env}-zone (delegated from GoDaddy).
  • google_compute_global_address ume-data-${env}-ingress-ip (shared across every service on the Gateway).
  • Wildcard A record *.${domain} → shared IP.
  • Certificate Manager DNS-01 authorization + auth CNAME + wildcard managed cert + certificate map + entry — all *.${domain} coverage.
  • New outputs: domain_name, dns_zone_name, dns_zone_nameservers, ingress_ip_name, ingress_ip_address, certificate_map_name.
  • modules/gke-standard/ gained gateway_api_config { channel = "CHANNEL_STANDARD" } (installs Gateway/HTTPRoute v1 CRDs on the cluster).

environments/dev-02-k8s-base/ (new stack):

  • google + kubernetes + helm providers wired via remote_state from dev-01-base.
  • Gateway namespace ume-data-${env}-gateway.
  • kubernetes_manifest Gateway: gatewayClassName = gke-l7-global-external-managed, NamedAddress to base's static IP, listeners https:443 and http:80 both with allowedRoutes.namespaces.from = All, annotation networking.gke.io/certmap to base's cert map.
  • kubernetes_manifest HTTPRoute on :80 with a catch-all PathPrefix: / match and a RequestRedirect filter (scheme https, 301).
  • Outputs: gateway_name, gateway_namespace.
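
A sketch of the :80 redirect route (Gateway name illustrative; the shape follows Gateway API v1):

```hcl
resource "kubernetes_manifest" "http_redirect" {
  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "HTTPRoute"
    metadata = {
      name      = "http-to-https"
      namespace = "ume-data-dev-gateway"
    }
    spec = {
      parentRefs = [{
        name        = "ume-gateway" # illustrative Gateway name
        sectionName = "http"        # attach only to the :80 listener
      }]
      rules = [{
        matches = [{ path = { type = "PathPrefix", value = "/" } }]
        filters = [{
          type            = "RequestRedirect"
          requestRedirect = { scheme = "https", statusCode = 301 }
        }]
      }]
    }
  }
}
```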

New modules/iap-oauth/:

  • google_iap_client under a caller-provided brand (brand stays in the stack as a project singleton).
  • kubernetes_secret_v1 with exactly one key key = <oauth client secret> (GCPBackendPolicy expects a single-key secret).
  • kubernetes_manifest GCPBackendPolicy with spec.default.iap.{enabled, clientID, oauth2ClientSecret.name} and targetRef to the app Service.
  • google_project_iam_member unconditional bindings on roles/iap.httpsResourceAccessor for each member of the UNION of iap_allowed_domains/groups/users.
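
Roughly what the module emits for the policy attachment, using the field paths listed above (variable and resource names illustrative):

```hcl
resource "kubernetes_manifest" "iap_policy" {
  manifest = {
    apiVersion = "networking.gke.io/v1"
    kind       = "GCPBackendPolicy"
    metadata = {
      name      = "${var.app_name}-iap"
      namespace = var.namespace
    }
    spec = {
      default = {
        iap = {
          enabled            = true
          clientID           = google_iap_client.this.client_id
          oauth2ClientSecret = { name = kubernetes_secret_v1.iap.metadata[0].name }
        }
      }
      targetRef = {
        group = ""
        kind  = "Service"
        name  = var.service_name
      }
    }
  }
}
```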

Extended modules/airflow-helm/:

  • Optional HTTPRoute (httproute_enabled) attaching to the shared Gateway via cross-namespace parentRef with sectionName = "https" (pins Airflow to the HTTPS listener, leaves :80 for the redirect HTTPRoute).
  • airflow_config.simple_auth_manager_all_admins flag. When true, the module also pins [core] auth_manager = SimpleAuthManager and force-disables the chart's createUserJob (both required to avoid FAB/SimpleAuthManager conflicts).

environments/dev-03-runtime/:

  • IAP brand passed in via var.iap_brand_name (brand is created manually in the GCP Console — see the iap.tf header for the runbook).
  • module "airflow_iap" wires IAP to airflow-api-server with per-user allow-list (ext_marcello.pontes@ume.com.br, wagner.jorge@ume.com.br, leonardo.luiz@ume.com.br).
  • Airflow HTTPRoute on https://airflow.${domain}.
  • airflow_config.simple_auth_manager_all_admins = true — users signed in through IAP land straight on the Airflow UI.

### Prerequisites (one-time manual)

  1. GCP Console → APIs & Services → OAuth consent screen. For Workspace-owned projects pick Internal; for standalone projects pick External. App name, support email, developer contact. The IAP brand is auto-created.
  2. gcloud iap oauth-brands list --project=<id> --format='value(name)' → paste into iap_brand_name in the runtime tfvars.
  3. Delegate ${domain} NS records to Google from the apex registrar (GoDaddy in our case). Fetch nameservers with terraform output -raw dns_zone_nameservers on dev-01-base.

### Design decisions

  • Gateway API over classic Ingress. Enables shared IP + wildcard DNS + per-app ingress (classic Ingress pins one GCLB per Ingress, cannot share).
  • Wildcard Certificate Manager cert (DNS-01). Covers every *.${domain} subdomain. DNS-01 against our own zone activates in minutes. ManagedCertificate CRD is HTTP-01-only and doesn't support wildcards.
  • Three-layer split. 01-base pure GCP; 02-k8s-base k8s-platform singletons (Gateway today, Prometheus/CSI in Phase 2); 03-runtime apps. DNS in base keeps the k8s providers out of the base plan.
  • IAP at GCLB over Airflow-native OIDC. Airflow 3 pluggable auth would require the FAB provider + custom image. IAP is zero-image-change and aligns with DataHub's future auth.
  • Per-user IAP allow-list, unconditional binding. IAM conditions do not propagate to IAP's authorization path for Gateway-API backends — tried and rejected. Tight scoping via the allow-list.
  • simple_auth_manager_all_admins = true with auth_manager pinned to SimpleAuthManager. One login (IAP) is enough; the module pins both configs together to avoid FAB/Simple conflicts and also disables createUserJob.
  • IAP brand stays manual. google_iap_brand doesn't work for non-Workspace projects and the IAP OAuth Admin API is being phased out. Stack accepts the brand as an input.
  • Orthogonal module boundaries. iap-oauth is per-service (reused by DataHub in Phase 2). Gateway sits inline in dev-02-k8s-base for now (extract into modules/gke-gateway/ when prod replicates).

### What to verify

  • terraform fmt -check -recursive + terraform validate pass across all changed stacks and modules.
  • DNS: dig NS umedev.marpont.es @8.8.8.8 returns 4 Google nameservers.
  • Cert: gcloud certificate-manager certificates describe ume-data-dev-wildcard --location=global reaches state: ACTIVE.
  • Gateway: kubectl get gateway -n ume-data-dev-gateway shows PROGRAMMED=True.
  • HTTPRoute: kubectl get httproute -n airflow shows airflow accepted (bound to https section).
  • BackendPolicy: kubectl describe gcpbackendpolicy airflow-api-server-iap -n airflow shows Type: Attached, Status: True.
  • Backend service: gcloud compute backend-services list --format='table(name,iap.enabled)' shows iap.enabled = True on gkegw1-…-airflow-api-server-….
  • IAM: three user bindings on roles/iap.httpsResourceAccessor, unconditional.
  • HTTP redirect: curl -sI http://airflow.umedev.marpont.es/ → 301 to https.
  • IAP: curl -sI https://airflow.umedev.marpont.es/ → 302 to accounts.google.com/o/oauth2/v2/auth?client_id=....
  • Browser sign-in as allow-listed user lands on the Airflow UI with no second login.

### Then

Story 4d adds the custom Airflow image with Cosmos and dbt.


## Story 4d — Custom Airflow Image + Cosmos/dbt

Location: now in the ume-data-dags repo (docker/, scripts/, .github/workflows/image.yml, .github/workflows/bot-pr.yml). In this repo: the wait-for-image gate in .github/workflows/terraform-apply.yml and the airflow_image_tag line in environments/dev-03-runtime/terraform.tfvars. Agent: airflow-dags (image + requirements, in ume-data-dags) + infra-terraform (bootstrap SA + WIF, tfvars plumbing, in ume-data-infra) Depends on: Story 4c (Airflow running with ingress + auth)

Spec rewritten 2026-04-18 to match what actually shipped. Original spec targeted apache/airflow:2.10.3 and environments/dev-02-runtime/; Story 4b deployed Airflow 3.2.0 and Story 4c renamed the runtime stack to dev-03-runtime. The 4d base image must extend the deployed 3.2.0 image. Content was initially scaffolded under resources/ in this repo and moved to ume-data-dags once validated.

### What to build

Custom Docker image (ume-data-dags/docker/):

  • Dockerfile extending apache/airflow:3.2.0. Installs astronomer-cosmos~=1.14 in the Airflow Python env (constrained) and dbt-core~=1.9 + dbt-bigquery~=1.9 in an isolated /home/airflow/dbt-venv/ (required because Airflow 3.2's constraints clash with dbt-core on pathspec/protobuf).
  • Build-time guardrails (which dbt, import cosmos, FAB-provider check) fail fast on drift.
  • scripts/build-image.sh — local build helper that tags with the same 3.2.0-<sha> convention as CI.

CI workflows (in ume-data-dags):

  • .github/workflows/image.yml — builds + pushes 3.2.0-<sha> on merge to main when docker/ changes.
  • .github/workflows/dag-sync.yml — gcloud storage rsyncs dags/ + dbt/ to the bucket on merge when those paths change.
  • .github/workflows/pr-ci.yml — PR lint (hadolint + python -m py_compile + dbt parse); no GCP auth needed.
  • .github/workflows/bot-pr.yml — after image.yml succeeds on main, uses INFRA_PR_TOKEN (fine-grained PAT scoped to ume-data-infra only) to open a tfvars-bump PR on this repo.

CI workflows (in ume-data-infra):

  • .github/workflows/terraform-apply.yml — wait-for-image gate before the runtime apply (15-min poll) so Helm never starts a rollout against a missing tag.

Bootstrap and base-layer changes (ume-data-infra):

  • layers/00-bootstrap/main.tf — docker_config { immutable_tags = true } on the AR repo. A content-push SA (ume-datainfra-content-push) scoped to AR writer on ume-composer-images + WIF bound to 1edata/ume-data-dags. Three narrow custom roles on tf-apply-sa (tfWifProviderUpdater, tfCustomRoleManager, tfArRepoIamAdmin) for self-management of these resource types.
  • environments/dev-01-base/iam.tf — roles/bigquery.jobUser on ume-airflow and ume-airflow-kpo. Without it, dbt-bigquery cannot submit queries (bigquery.jobs.create denied; bigquery.dataEditor does not include it).
  • environments/dev-03-runtime/buckets.tf — bucket-scoped roles/storage.objectAdmin for the content-push SA on the dev DAGs bucket.

Runtime rollouts (continuous, via the bot-PR loop):

  • airflow_image_repository is set to the AR URL once; airflow_image_tag is bumped on every DAGs-repo merge by the bot-PR workflow. Merging the bot-PR triggers terraform-apply's wait-for-image gate, then Helm rolls the pods.

Ownership model:

```text
ume-data-dags repo:
  └── docker/Dockerfile + dbt venv
  └── dags/ + dbt/
  └── CI: build image + push to AR
  └── CI: gcloud storage rsync dags/ + dbt/ → GCS DAGs bucket
  └── CI: open bot-PR against ume-data-infra bumping airflow_image_tag

ume-data-infra repo (this repo):
  └── environments/dev-03-runtime/terraform.tfvars
      └── airflow_image_repository = "us-east1-docker.pkg.dev/.../ume-composer-images/airflow"
      └── airflow_image_tag = "3.2.0-<sha>"  ← bumped by bot-PR
  └── .github/workflows/terraform-apply.yml
      └── wait-for-image gate before Helm rollout
```

Tag format: <airflow-version>-<commit-sha> (e.g., 3.2.0-a1b2c3d). Immutable — AR's docker_config.immutable_tags rejects overwrites.

Rollback: revert airflow_image_tag in tfvars to the previous value and apply. The db_bootstrap Job re-runs with the rolled-back image; airflow db migrate is idempotent but the target image must be known-good (otherwise the init container fails and the bootstrap stays Pending).

### What to verify

  • ume-data-dags's pr-ci.yml green on PR (hadolint, py_compile, dbt parse).
  • On merge in ume-data-dags: image present in AR (gcloud artifacts docker images list us-east1-docker.pkg.dev/poc-ume-data/ume-composer-images).
  • Immutable tags enabled (gcloud artifacts repositories describe ume-composer-images --location=us-east1 --format='value(dockerConfig.immutableTags)' → True).
  • roles/bigquery.jobUser present on both Airflow SAs.
  • Bot-PR opened on this repo, merged, pods restart with new image.
  • astronomer-cosmos importable (≥ 1.14); dbt --version works at /home/airflow/dbt-venv/bin/dbt.
  • No regression of IAP + SimpleAuthManager: browser sign-in at https://airflow.umedev.marpont.es/ lands on the UI with no Airflow-side login.
  • Cosmos execution mode (local) functional — validated in Story 5.

### Then

Story 5 is bundled — ume_dbt_example DAG is already in ume-data-dags/dags/.


## Story 5 — First Cosmos-Powered dbt DAG

Location: ume-data-dags/dags/ and ume-data-dags/dbt/ Agent: airflow-dags (in ume-data-dags) Depends on: Story 4d (custom image with Cosmos + dbt installed) Bundled with: Story 4d — initially shipped together under resources/ in this repo; content moved to ume-data-dags once validated.

### What to build

In ume-data-dags/dbt/:

  • dbt_project.yml with project configuration.
  • profiles.yml configured for BigQuery OAuth using the Airflow SA identity (workload identity, oauth method).
  • models/example/:
    • ume_hello_world.sql — materialized as table, SELECT CURRENT_TIMESTAMP(), message, sentinel.
    • ume_hello_world_downstream.sql — depends on ume_hello_world via {{ ref(...) }}. Having a ref() edge proves Cosmos renders the task-graph with a dependency, not just "dbt ran."
    • schema.yml documenting both.

In ume-data-dags/dags/:

  • cosmos_dbt_dag.py — a Cosmos DAG using local execution mode (ExecutionMode.LOCAL). The DAG renders the dbt project as individual Airflow tasks, each dispatched to the Celery worker. Cosmos copies the project to a per-task tmp directory before invoking dbt, so the read-only GCS FUSE mount is not a problem.
    • dbt_project_path = /opt/airflow/dags/dbt (GCS FUSE mounts the bucket root at /opt/airflow/dags/, so dbt/ is a sibling of dags/).
    • dbt_executable_path = /home/airflow/dbt-venv/bin/dbt (isolated venv; see Story 4d note about Airflow 3.2 constraints vs dbt-core).
    • is_paused_upon_creation = True, schedule = None, default_args with owner, retries=1.

### What to verify

  • DAGs and dbt project synced to GCS bucket: gsutil ls gs://ume-airflow-dags-poc-ume-data/dags/ gs://ume-airflow-dags-poc-ume-data/dbt/
  • Files visible in worker filesystem: kubectl exec deploy/airflow-worker -n airflow -c worker -- ls /opt/airflow/dags/dbt/
  • ume_dbt_example DAG visible in Airflow UI with two dbt-model tasks and a dependency edge (ume_hello_world → ume_hello_world_downstream)
  • Un-pause and trigger the DAG manually — all dbt tasks run successfully
  • bq show --format=prettyjson poc-ume-data:dbt_dev.ume_hello_world and ...ume_hello_world_downstream return expected schemas
  • Airflow task logs show dbt output (both Airflow UI and gs://ume-airflow-logs-poc-ume-data/logs/)
  • Tasks execute on the Celery worker (not scheduler) — verify in task instance details
  • Re-trigger the DAG once; materialized: table replaces tables idempotently (no accidental appends)
  • kubectl top pod -n airflow during the run — worker RSS stays well below the 3 Gi limit

### Then

Phase 1 is complete. The data pipeline (Airflow + dbt + BigQuery) is operational on GKE, and the content pipeline is split into a dedicated ume-data-dags repo. Next steps:

  1. Scope roles/storage.objectAdmin on ume-airflow SA to specific buckets (see backlog).
  2. Extend the DAGs repo workflows to cover prod when the prod project is provisioned (matrix or split workflow files).
  3. Begin Phase 2 (DataHub) when priorities allow.

## Phase 2 — DataHub & Additional Infrastructure

Phase 2 adds DataHub to the existing GKE cluster with Strimzi Kafka and self-hosted OpenSearch as backing services. The GKE cluster, VPC, shared Cloud SQL instance, Gateway, wildcard cert, and IAP brand from Phase 1 are all reused.

Master plan: plans/datahub-deployment-plan.md — read this first. It covers architecture decisions, node-pool strategy, disk sizing, alerting, and the per-story execution strategy (one autonomous session per story, restricted profile).

Each story below is sized for one session. Specs below are the implementation contract; design rationale lives in the master plan.


## Story 6 — Workload Pool + DataHub SQL + Password Secret

Stack: environments/dev-01-base/ (update) + layers/00-bootstrap/ (CI IAM coverage, if needed) Agent: infra-terraform Depends on: Phase 1 complete

### What to build

Node pool (environments/dev-01-base/terraform.tfvars):

Add a new entry to gke_node_pools:

```hcl
workload-pool = {
  machine_type = "e2-standard-4"
  min_count    = 1
  max_count    = 4
  spot         = false
  extra_labels = { pool = "workload" }
  # No taint — workload selector (pool=workload) is enough.
}
```

DataHub database + user + password (environments/dev-01-base/cloud-sql.tf):

  • google_sql_database.datahub — name = "datahub", instance = module.airflow_sql.instance_name.
  • random_password.datahub_db — length 32, special=false (avoids JDBC URL-encoding traps).
  • google_sql_user.datahub — type = BUILT_IN (password auth), name = "datahub", password = random_password.datahub_db.result.
  • google_secret_manager_secret.datahub_db_password — secret_id = "ume-data-dev-datahub-db-password", automatic replication.
  • google_secret_manager_secret_version.datahub_db_password_v1 — secret_data = random_password.datahub_db.result.
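
The same list as a sketch (replication syntax assumes a current google provider):

```hcl
resource "google_sql_database" "datahub" {
  name     = "datahub"
  instance = module.airflow_sql.instance_name
}

resource "random_password" "datahub_db" {
  length  = 32
  special = false # JDBC URLs choke on unescaped special characters
}

resource "google_sql_user" "datahub" {
  instance = module.airflow_sql.instance_name
  name     = "datahub"
  type     = "BUILT_IN"
  password = random_password.datahub_db.result
}

resource "google_secret_manager_secret" "datahub_db_password" {
  secret_id = "ume-data-dev-datahub-db-password"
  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "datahub_db_password_v1" {
  secret      = google_secret_manager_secret.datahub_db_password.id
  secret_data = random_password.datahub_db.result
}
```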

Outputs (environments/dev-01-base/outputs.tf):

  • datahub_db_name = "datahub"
  • datahub_db_user = google_sql_user.datahub.name
  • datahub_db_host = module.airflow_sql.private_ip_address
  • datahub_db_password_secret_id = google_secret_manager_secret.datahub_db_password.secret_id

Cloud Monitoring alert (environments/dev-02-k8s-base/alerts.tf — new file):

  • Policy "Cloud SQL disk > 75%" on metric cloudsql.googleapis.com/database/disk/utilization, filter instance ume-data-dev-airflow-pg, threshold 0.75, duration 10m.

Bootstrap CI IAM check (invariant #11):

Verify tf-plan-sa can read google_secret_manager_secret_version data sources (needed downstream by Story 11's Helm release). The existing tfK8sSecretsReader role (Story 4b era) covers secretmanager.versions.* — confirm during planning; add a custom role if a gap is found.

### Design decisions

Canonical in plans/datahub-deployment-plan.md §1, §2, §3, §5. Key points:

  • Shared SQL instance, not a new one. Saves ~$26/mo; dev workload fits.
  • Password auth, not IAM auth. Skips 5 Cloud SQL Auth Proxy sidecars in DataHub pods.
  • Secret Manager (not plaintext in Helm values). DataHub pods mount via Secrets Store CSI (Story 7 + Story 11).
  • workload-pool distinct from default-pool. Stateful workloads on their own nodes.
  • min=1 with soft anti-affinity for Kafka/OS pods — cold-start fits one node, scales out as needed.

# What to verify

  • terraform fmt -check -recursive + validate pass.
  • After CI apply: gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b shows workload-pool with locations=us-east1-b, machineType=e2-standard-4.
  • gcloud sql databases list --instance=ume-data-dev-airflow-pg shows datahub.
  • gcloud sql users list --instance=ume-data-dev-airflow-pg shows datahub user (type BUILT_IN).
  • gcloud secrets versions list ume-data-dev-datahub-db-password returns exactly one version.
  • gcloud alpha monitoring policies list shows the Cloud SQL disk policy.

### Then

Story 7 installs the Secret Manager CSI driver.


## Story 7 — Secrets Store CSI Driver

Stack: environments/dev-02-k8s-base/ Agent: infra-terraform Depends on: Story 6

### What to build

environments/dev-02-k8s-base/secrets-store-csi.tf (new file):

  • helm_release.secrets_store_csi_driver — chart secrets-store-csi-driver from https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts, namespace kube-system, pinned chart version (verify latest at story time).
  • helm_release.secrets_store_csi_driver_gcp — chart secrets-store-csi-driver-provider-gcp from https://googlecloudplatform.github.io/secrets-store-csi-driver-provider-gcp, namespace kube-system, pinned chart version.
  • Values: syncSecret.enabled = true on the base driver (so mounted secrets can also be synced to native k8s Secrets — DataHub's chart expects env-var refs to k8s Secrets, not file paths).

Outputs: none needed (driver exposes cluster-wide SecretProviderClass CRD).

# Design decisions

  • kube-system namespace. The driver is a DaemonSet that must run on every node pool; standard convention places it in kube-system.
  • GCP provider alongside the base driver. The base driver is generic; the GCP provider is the Secret Manager plugin. Both are required.
  • syncSecret.enabled = true. DataHub's Helm chart and most upstream charts consume passwords via env.valueFrom.secretKeyRef, which requires a k8s Secret object. Sync mode creates one from the CSI mount.
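
For orientation ahead of Story 11, a hedged sketch of the SecretProviderClass shape the GCP provider consumes, with secretObjects doing the sync (names and the datahub namespace are assumptions):

```hcl
resource "kubernetes_manifest" "datahub_db_password" {
  manifest = {
    apiVersion = "secrets-store.csi.x-k8s.io/v1"
    kind       = "SecretProviderClass"
    metadata   = { name = "datahub-db", namespace = "datahub" }
    spec = {
      provider = "gcp"
      parameters = {
        # The GCP provider takes its secret list as an embedded YAML string.
        secrets = yamlencode([{
          resourceName = "projects/poc-ume-data/secrets/ume-data-dev-datahub-db-password/versions/latest"
          path         = "password"
        }])
      }
      # syncSecret: materialize the mounted file as a native k8s Secret,
      # so charts can use env.valueFrom.secretKeyRef.
      secretObjects = [{
        secretName = "datahub-db"
        type       = "Opaque"
        data       = [{ objectName = "password", key = "password" }]
      }]
    }
  }
}
```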

### What to verify

  • kubectl -n kube-system get pods -l app=secrets-store-csi-driver all Running.
  • kubectl get crd secretproviderclasses.secrets-store.csi.x-k8s.io exists.
  • kubectl -n kube-system get pods -l app=csi-secrets-store-provider-gcp all Running.

### Then

Story 8 installs the Strimzi operator.


## Story 8 — Strimzi Kafka Operator

Stack: environments/dev-02-k8s-base/ Agent: infra-terraform Depends on: Story 6 (workload-pool exists; the operator itself can run anywhere, but the Kafka clusters it watches are pinned to that pool)

### What to build

environments/dev-02-k8s-base/strimzi.tf (new file):

  • kubernetes_namespace_v1.strimzi_system — strimzi-system namespace with common labels.
  • helm_release.strimzi_kafka_operator — chart strimzi-kafka-operator from https://strimzi.io/charts/, pinned chart version (verify latest at story time).
  • Values:
    • watchAnyNamespace: true — cluster-wide watch.
    • resources.requests: { cpu: 200m, memory: 384Mi } — operator itself is small.
    • nodeSelector: { pool: workload } — pin operator to workload-pool.
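
Sketched as HCL (chart version is a placeholder, to pin at story time):

```hcl
resource "helm_release" "strimzi_kafka_operator" {
  name       = "strimzi-kafka-operator"
  repository = "https://strimzi.io/charts/"
  chart      = "strimzi-kafka-operator"
  namespace  = kubernetes_namespace_v1.strimzi_system.metadata[0].name
  version    = "0.40.0" # placeholder; verify latest at story time

  values = [yamlencode({
    watchAnyNamespace = true # cluster-wide watch
    resources = {
      requests = { cpu = "200m", memory = "384Mi" }
    }
    nodeSelector = { pool = "workload" } # keep operator off default-pool
  })]
}
```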

# Design decisions

  • Cluster-wide watch. Matches our shared-Gateway pattern — one operator, many namespaces possible later.
  • Operator on workload-pool. Keeps default-pool free of operator pods.
  • No Kafka CR yet. That's Story 9. Keeping the operator install in its own PR means any CRD/operator upgrade can be rolled back cleanly.

# What to verify

  • kubectl -n strimzi-system get pods shows operator Running.
  • CRDs installed: kubectl get crd | grep strimzi.io lists kafkas, kafkanodepools, kafkatopics, kafkausers.
  • Operator scheduled on workload-pool: kubectl -n strimzi-system get pods -o wide → node has label pool=workload.

# Then

Story 9 provisions the Kafka cluster.


# Story 9 — Kafka Cluster (KRaft, 3 Controllers + 2 Brokers)

Stack: environments/dev-03-runtime/ + new modules/strimzi-kafka/ Agent: infra-terraform Depends on: Story 8

# What to build

modules/strimzi-kafka/ (new):

  • main.tf — namespace + KafkaNodePool (controllers) + KafkaNodePool (brokers) + Kafka CR via kubernetes_manifest.
  • variables.tf — namespace, cluster_name, kafka_version, controller_replicas (default 3), controller_memory (default 256Mi), controller_storage_size (default 1Gi), broker_replicas (default 2), broker_memory (default 1.5Gi), broker_cpu (default 500m), broker_storage_size (default 10Gi), broker_storage_class (default premium-rwo), log_retention_hours (default 72), log_retention_bytes (default 8589934592 = 8 GiB), min_insync_replicas (default 1), node_selector (default { pool = "workload" }).
  • outputs.tf — bootstrap_servers (= <cluster_name>-kafka-bootstrap.<namespace>.svc:9092), namespace, cluster_name.

environments/dev-03-runtime/kafka.tf (new file):

  • module "kafka" call with defaults; cluster_name = "ume-data-dev-kafka", namespace = "kafka".

Alert (environments/dev-02-k8s-base/alerts.tf):

  • Policy "Kafka broker PV > 70%" on metric kubernetes.io/node/persistentvolume/volume/used_bytes / capacity_bytes filter namespace kafka.

# Design decisions

Canonical in plans/datahub-deployment-plan.md §4, §5.

  • KRaft, not ZooKeeper. Strimzi 0.38+ supports KRaft; one fewer moving part.
  • Dedicated controllers. A 2-broker combined-role cluster gives an even-sized KRaft quorum, which can't tolerate any controller loss; 3 tiny dedicated controllers solve it.
  • Retention + size caps together. Time-based retention (72h) + byte-based cap (8 GiB) ensures the PV never fills even under a burst.
  • Soft anti-affinity. preferredDuringSchedulingIgnoredDuringExecution on kubernetes.io/hostname. Lets brokers co-locate when there's only one node; spreads them when the autoscaler adds more (stanza sketched after this list).
  • PD-SSD. Kafka is IOPS-sensitive; pd-balanced is cheaper but can stall during retention sweeps.
  • No Cruise Control. Added to backlog for prod.
  • min.insync.replicas = 1. With RF=2, one broker can be down during rolling upgrade without losing write availability.
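
The anti-affinity decision translates to a pod-template stanza like this inside the broker KafkaNodePool (shape follows Strimzi's template API; shown as an illustrative HCL local, and the label selector is an assumption):

```hcl
# Fragment carried by the broker KafkaNodePool manifest under spec.template.
locals {
  broker_pod_template = {
    pod = {
      affinity = {
        podAntiAffinity = {
          # Preferred, not required: a single node can still host both brokers.
          preferredDuringSchedulingIgnoredDuringExecution = [{
            weight = 100
            podAffinityTerm = {
              topologyKey = "kubernetes.io/hostname"
              labelSelector = {
                matchLabels = { "strimzi.io/cluster" = "ume-data-dev-kafka" }
              }
            }
          }]
        }
      }
    }
  }
}
```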

# What to verify

  • kubectl -n kafka get kafka ume-data-dev-kafka → READY=True.
  • kubectl -n kafka get pods shows 3 -controllers-* and 2 -brokers-* pods Running.
  • Brokers scheduled on workload-pool nodes.
  • kubectl -n kafka get pvc shows 5 PVCs bound (3 controller + 2 broker).
  • Bootstrap service reachable in-cluster: kubectl -n kafka run kcat --rm -it --image=edenhill/kcat:1.7.1 --restart=Never -- -b ume-data-dev-kafka-kafka-bootstrap:9092 -L (metadata listing).
  • PV alert policy exists.

# Then

Story 10 provisions OpenSearch.


# Story 10 — OpenSearch + Snapshots

Stack: environments/dev-02-k8s-base/ (operator) + environments/dev-03-runtime/ (cluster) + environments/dev-01-base/ (snapshot bucket) Agent: infra-terraform Depends on: Story 8 (pattern proven; independent of Kafka at runtime)

# What to build

Snapshot bucket (environments/dev-01-base/buckets.tf — new file, or append to an existing one):

  • Module call to modules/gcs-bucket/ for ume-opensearch-snapshots-poc-ume-data:
    • versioning = false
    • Lifecycle: delete objects older than 35 days.
    • Expose in outputs as opensearch_snapshots_bucket.

OpenSearch GSA (environments/dev-01-base/iam.tf):

  • google_service_account.opensearch_snapshot — ume-opensearch-snapshot.
  • Bucket-scoped roles/storage.objectAdmin on the snapshot bucket.
  • Workload Identity binding: opensearch/opensearch-snapshot KSA → ume-opensearch-snapshot GSA.
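
The identity wiring, sketched in HCL (var.project_id is assumed to exist in the stack):

```hcl
resource "google_service_account" "opensearch_snapshot" {
  account_id   = "ume-opensearch-snapshot"
  display_name = "OpenSearch GCS snapshot writer"
}

# Bucket-scoped, not project-scoped: objectAdmin on the snapshot bucket only.
resource "google_storage_bucket_iam_member" "opensearch_snapshot_writer" {
  bucket = "ume-opensearch-snapshots-poc-ume-data"
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.opensearch_snapshot.email}"
}

# Workload Identity: let the opensearch/opensearch-snapshot KSA impersonate the GSA.
resource "google_service_account_iam_member" "opensearch_snapshot_wi" {
  service_account_id = google_service_account.opensearch_snapshot.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[opensearch/opensearch-snapshot]"
}
```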

Operator (environments/dev-02-k8s-base/opensearch.tf — new file):

  • kubernetes_namespace_v1.opensearch_operator — opensearch-operator namespace.
  • helm_release.opensearch_operator — chart opensearch-operator from https://opensearch-project.github.io/opensearch-k8s-operator/, pinned chart version.
  • Values: operator pinned to workload-pool.

Cluster (environments/dev-03-runtime/opensearch.tf — new file):

  • kubernetes_namespace_v1.opensearch — opensearch namespace.
  • kubernetes_service_account_v1.opensearch_snapshot — with WI annotation.
  • OpenSearchCluster CR via kubernetes_manifest:
    • 1 data node (also master-eligible), 512Mi JVM heap, 1 CPU, 1.5Gi memory request.
    • 5 GiB PD-SSD storage.
    • nodeSelector: { pool: workload }.
    • Security plugin disabled (dev only; Story 13 hardens with basic auth or mTLS).
  • SecretProviderClass (CSI) — mounts the bucket name (plain config, not a secret; optional, a direct env var works too).
  • kubernetes_manifest ISM policy (JSON CRD) — delete indices > 30 days.
  • kubernetes_cron_job_v1.opensearch_snapshot — daily at 04:00 UTC, runs curl -XPUT opensearch-cluster/_snapshot/gcs_backup/$(date +%Y%m%d). Uses opensearch-snapshot KSA.
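
A sketch of that CronJob (the curl image and endpoint are assumptions; the snapshot repo name matches the story):

```hcl
resource "kubernetes_cron_job_v1" "opensearch_snapshot" {
  metadata {
    name      = "opensearch-snapshot"
    namespace = kubernetes_namespace_v1.opensearch.metadata[0].name
  }
  spec {
    schedule = "0 4 * * *" # daily at 04:00 UTC
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            service_account_name = kubernetes_service_account_v1.opensearch_snapshot.metadata[0].name
            restart_policy       = "OnFailure"
            container {
              name  = "snapshot"
              image = "curlimages/curl:8.7.1" # assumed utility image
              command = ["sh", "-c",
                "curl -sf -XPUT http://opensearch-cluster.opensearch.svc:9200/_snapshot/gcs_backup/$(date +%Y%m%d)"
              ]
            }
          }
        }
      }
    }
  }
}
```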

Alert (environments/dev-02-k8s-base/alerts.tf):

  • Policy "OpenSearch PV > 70%" (namespace opensearch).

# Design decisions

  • Single data node in dev. 3-node minimum is a prod concern; dev can take unassigned-shard risk. Snapshots provide the durability backstop.
  • OpenSearch 2.x. DataHub supports both ES 7.10+ and OS 2.x; OS has no license friction.
  • GCS snapshots over cross-zone replication. Cheaper, simpler, and the ops story is clear (restore from snapshot).
  • ISM + bucket lifecycle both. Indices deleted at 30 days inside OS; snapshots deleted at 35 days in GCS. Always a 5-day overlap for recovery.
  • Security plugin off in dev. Keeps the story small. Story 13 re-evaluates.

# What to verify

  • kubectl -n opensearch-operator get pods shows operator Running.
  • kubectl -n opensearch get opensearchcluster → READY.
  • kubectl -n opensearch get pods shows 1 data node Running on workload-pool.
  • gsutil ls gs://ume-opensearch-snapshots-poc-ume-data/ (may be empty before first run).
  • First CronJob run logs show a successful snapshot API call.
  • ISM policy exists: kubectl -n opensearch get opensearchismpolicy.

# Then

Story 11 deploys DataHub.


# Story 11 — DataHub Dry-Run

Stack: environments/dev-03-runtime/ + new modules/datahub-helm/ Agent: datahub-platform Depends on: Stories 6, 7, 9, 10

# What to build

modules/datahub-helm/ (new):

  • Wraps the upstream acryldata/datahub chart. Verify latest chart version at story time (the verify_versions invariant).
  • main.tf — namespace + KSA (no WI binding yet; ingestion adds it) + SecretProviderClass (CSI, syncs Secret Manager datahub-db-password → k8s Secret) + helm_release.
  • Helm values set via module:
    • datahub-gms.replicaCount, datahub-frontend.replicaCount, datahub-mae-consumer.replicaCount, datahub-mce-consumer.replicaCount = 1 each.
    • All pod nodeSelector: { pool: workload }.
    • global.sql.datasource:
      • host: <sql_private_ip>
      • hostForMysqlClient: <sql_private_ip> (chart quirk; still set for postgres paths).
      • port: 5432
      • database: datahub
      • url: jdbc:postgresql://<ip>:5432/datahub
      • driver: org.postgresql.Driver
      • username: datahub
      • extraEnvs: [{ name: DATAHUB_DB_PASSWORD, valueFrom: { secretKeyRef: { name: datahub-db-password, key: password } } }]
    • global.kafka.bootstrap.server: <kafka.bootstrap_servers>.
    • global.elasticsearch.host: opensearch-cluster.opensearch.svc, port: 9200, useSSL: false, skipcheck: true (disables X-Pack check since OS isn't ES).
    • elasticsearchSetupJob.enabled: true — creates DataHub indices.
    • kafkaSetupJob.enabled: true — creates DataHub topics.
  • variables.tf — all knobs exposed (replicas, resources, versions, backing endpoints).
  • outputs.tf — namespace, release_name, frontend_service_name.
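
Condensing the values above into the module's release (a sketch: var.* names are illustrative, and chart repo/key nesting should be confirmed against the upstream chart at story time):

```hcl
# modules/datahub-helm/main.tf — sketch of the values wiring.
resource "helm_release" "datahub" {
  name       = "datahub"
  repository = "https://helm.datahubproject.io/" # assumed upstream repo
  chart      = "datahub"
  namespace  = kubernetes_namespace_v1.datahub.metadata[0].name
  version    = var.chart_version # verify_versions invariant

  values = [yamlencode({
    global = {
      sql = {
        datasource = {
          host               = var.sql_private_ip
          hostForMysqlClient = var.sql_private_ip # chart quirk; set even on postgres
          port               = "5432"
          database           = "datahub"
          url                = "jdbc:postgresql://${var.sql_private_ip}:5432/datahub"
          driver             = "org.postgresql.Driver"
          username           = "datahub"
          extraEnvs = [{
            name      = "DATAHUB_DB_PASSWORD"
            valueFrom = { secretKeyRef = { name = "datahub-db-password", key = "password" } }
          }]
        }
      }
      kafka = { bootstrap = { server = var.kafka_bootstrap_servers } }
      elasticsearch = {
        host      = "opensearch-cluster.opensearch.svc"
        port      = "9200"
        useSSL    = "false"
        skipcheck = "true" # OS 2.x, not ES: skip the X-Pack check
      }
    }
  })]
}
```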

environments/dev-03-runtime/datahub.tf (new):

  • module "datahub" call wiring remote_state refs from dev-01-base (SQL) and reading Kafka/OpenSearch service DNS directly (same cluster, well-known names).

environments/dev-03-runtime/data.tf — add outputs passthrough if needed.

No IAP yet. Verify via kubectl port-forward svc/datahub-frontend 9002:9002 -n datahub.

# Design decisions

Canonical in plans/datahub-deployment-plan.md §7.

  • Module over inline. Env-scoped resource, replicates to prod.
  • CSI-synced k8s Secret for DB password. DataHub chart expects secretKeyRef; syncSecret fills it from Secret Manager.
  • Port-forward verification step. No ingress wiring yet — Story 12 adds it. Keeps each PR small.
  • elasticsearch.skipcheck: true — required when pointing DataHub at OpenSearch 2.x (X-Pack check fails otherwise).
  • No KSA → GSA WI binding yet. DataHub GMS does not make GCP API calls; ingestion recipes (in ume-data-dags) do. Adding the binding here would grant permissions nothing uses.

# What to verify

  • kubectl -n datahub get pods shows all DataHub pods Running, setup jobs Completed.
  • kubectl -n datahub logs deploy/datahub-gms shows successful SQL connection, Kafka producer connected, OpenSearch client initialized.
  • kubectl port-forward -n datahub svc/datahub-frontend 9002:9002 + browser http://localhost:9002 loads the UI.
  • datahub DB schema populated: a direct check (gcloud sql connect ume-data-dev-airflow-pg --database=datahub --user=datahub, then \dt) is prohibited per session rules, so verify via GMS logs instead.
  • Kafka topics created: kubectl exec -n kafka ume-data-dev-kafka-brokers-0 -- bin/kafka-topics.sh --list --bootstrap-server localhost:9092 lists MetadataChangeLog_Versioned_v1 etc.
  • OpenSearch indices created: visit /_cat/indices via port-forward.

# Then

Story 12 wires IAP and public ingress.


# Story 12 — DataHub IAP + HTTPRoute + OIDC Auth

Stack: environments/dev-03-runtime/ (update) + small modules/datahub-helm/ addition Agent: datahub-platform Depends on: Story 11 Status: DONE — see story-status.md for the post-mortem. First-admin bootstrap still manual (local datahub JAAS user); groups/policies-as-code lands in Story 13.

# What to build

modules/datahub-helm/ — add:

  • httproute_enabled, gateway_name, gateway_namespace, hostname variables (match modules/airflow-helm/ surface).
  • Optional HTTPRoute resource attached to datahub-frontend Service on :9002.
  • DataHub OIDC values passthrough (see "DataHub OIDC" below).

environments/dev-03-runtime/datahub.tf — extend module call with HTTPRoute params + an iap-oauth module call:

  • module "datahub_iap" (new, uses modules/iap-oauth/):
    • service_name = "datahub-frontend"
    • namespace = "datahub"
    • allowed_users = var.iap_allowed_users (same list as Airflow initially).

environments/dev-03-runtime/terraform.tfvars — add datahub_subdomain = "datahub".

DataHub OIDC (in-app identity, not the perimeter):

IAP alone collapses to "all-admin or all-reader" — it can't express per-user / per-dataset stewardship. Keep IAP as the perimeter (who can reach the host) and layer DataHub OIDC inside it for in-app identity + roles.

  • Separate OAuth client from the IAP client, created on the same GCP OAuth consent screen. clientId / clientSecret land in Secret Manager and mount into datahub-frontend via Secrets Store CSI (Story 7 driver).

  • Helm values on the frontend chart:

    • authentication.enabled = true / authentication.provider = oidc
    • oidcAuthentication.discoveryUri = https://accounts.google.com/.well-known/openid-configuration
    • oidcAuthentication.userNameClaim = email
    • oidcAuthentication.scopes = "openid profile email"
    • oidcAuthentication.extractGroupsEnabled = false (Phase 1 — see "Phased migration" below).
  • JIT user provisioning is on by default; a new Google account landing through IAP becomes a DataHub user record on first login.
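
The passthrough could be appended to the chart values like this (illustrative local; the exact nesting under the frontend subchart is an assumption to confirm at story time):

```hcl
# Frontend OIDC values; clientId / clientSecret arrive via the Story 7
# CSI-synced k8s Secret, never as plaintext values.
locals {
  datahub_oidc_values = yamlencode({
    "datahub-frontend" = {
      authentication = { enabled = true, provider = "oidc" }
      oidcAuthentication = {
        discoveryUri         = "https://accounts.google.com/.well-known/openid-configuration"
        userNameClaim        = "email"
        scopes               = "openid profile email"
        extractGroupsEnabled = false # Phase 1; flip to true + set groupsClaimName in Phase 2
      }
    }
  })
}
```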

DataHub groups + policies bootstrap (idempotent, driven from a DataHub policies-as-code file checked into this repo or ume-data-dags — final home decided in Story 13 alongside ingestion recipes):

Groups (pre-create, membership managed by admins until Phase 2):

  • platform-admins
  • data-stewards (per-domain children: finance-stewards, marketing-stewards, …)
  • viewers

Domains: one per business area. Each domain has an owner group from data-stewards. Datasets join a domain via ingestion metadata (dbt tags / BigQuery labels / source-system owners surfaced through the recipe).

Policies (all bound to groups, never user URNs — see design decisions):

  • Platform: platform-admins → Admin role.
  • Platform: data-stewards → Editor role.
  • Platform: viewers → Reader (or rely on the Reader default).
  • Platform: finance-stewards → Manage Domain scoped to urn:li:domain:finance (templated per-domain via for_each).
  • Metadata: per-domain "edit metadata where domain=…", bound to the matching steward group.
  • Platform: ingestion SA (Airflow) → Manage Ingestion Sources + Manage Secrets. Runs unattended; no human role.
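
The per-domain templating mentioned above could look like this (illustrative locals only: the privilege identifier is assumed, and the apply mechanism, CLI or API, is decided in Story 13):

```hcl
# Generate one scoped "Manage Domain" policy definition per business domain.
locals {
  domains = toset(["finance", "marketing"])

  manage_domain_policies = {
    for d in local.domains : d => {
      name       = "manage-domain-${d}"
      actors     = { groups = ["${d}-stewards"] } # group-bound, never user URNs
      privileges = ["MANAGE_DOMAINS"]             # assumed privilege identifier
      resource   = "urn:li:domain:${d}"
    }
  }
}
```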

# Access control model

Stewardship on a specific dataset is Ownership of that entity with ownershipType = DATA_STEWARD — DataHub has no global "Steward" role. The global Editor role just gates who can propose edits at all; ownership gates which assets they can touch.

| Actor | Setup | Can do |
| --- | --- | --- |
| Platform admin | platform-admins group → Admin role | all policies, ingestion, user/group/domain admin |
| Ingestion SA (Airflow) | platform policy: Manage Ingestion + Manage Secrets | trigger ingestion runs, create/update recipes |
| Data steward (per domain) | Editor role + Owner of a Domain | edit tags / terms / documentation on their domain's assets; propose changes elsewhere |
| Domain owner | scoped "Manage Domain" policy | add/remove owners inside their domain without being a platform admin |
| Viewer | Reader (default) | browse, search, read |

Ownership on a dataset is assigned by (a) admins via UI, (b) domain owners within their scope, (c) ingestion recipes carrying owners metadata. (c) is the scalable path — don't expect to click-assign owners on hundreds of datasets.

# Phased migration to Workspace groups

Google's accounts.google.com OIDC issuer has a fixed claim set — no custom per-user claims, and no groups claim outside Workspace. Plan around that.

Phase 1 (now, no Workspace access):

  • DataHub OIDC → Google; user identity via email.
  • Admins manually add users to the Phase-1 groups on first login.
  • Policies + domains + ownerships already bind to groups, so none of the Phase-1 work is throwaway.

Phase 2 (Workspace access returned):

  • Recreate the same group names as Google Groups under the Workspace domain (platform-admins@…, finance-stewards@…, …).

  • Flip oidcAuthentication.extractGroupsEnabled = true and set oidcAuthentication.groupsClaimName = groups. DataHub syncs group membership on each login.

  • Optional cleanup: remove the Phase-1 manual group memberships (dual membership is harmless during transition).

  • No policy rewrites — because nothing binds to user URNs.

# Design decisions

  • Reuse modules/iap-oauth/ verbatim for the perimeter. Confirmed working for Airflow; parameterized per service.
  • Same IAP allow-list initially. Expand in tfvars when needed.
  • Wildcard cert already covers datahub.umedev.marpont.es. No Certificate Manager changes.
  • IAP at perimeter + DataHub OIDC inside. IAP alone is binary; DataHub's role+policy+ownership layer does per-user and per-dataset work.
  • Bind every policy to a group, never to a user URN. Phase-2 migration to Workspace groups becomes a rename, not a rewrite.
  • Stewardship = Ownership on entity + Editor role, not a global role. Matches DataHub's data model and makes domain-based delegation natural.
  • Policies as code, not click-ops. The group/domain/policy bootstrap lives in a checked-in config so Phase 2 and prod rebuilds are deterministic. Exact location (this repo vs ume-data-dags) decided in Story 13 when ingestion recipes land.

# What to verify

  • kubectl -n datahub get httproute shows datahub accepted.
  • kubectl -n datahub describe gcpbackendpolicy datahub-frontend-iap → Attached.
  • gcloud compute backend-services list --format='table(name,iap.enabled)' shows iap.enabled = True on the DataHub backend.
  • curl -sI http://datahub.umedev.marpont.es/ → 301 to https.
  • curl -sI https://datahub.umedev.marpont.es/ → 302 to accounts.google.com.
  • Browser sign-in as allow-listed user lands on DataHub UI.
  • DataHub /login shows "Sign in with Google" after OIDC config applies.
  • A non-allowlisted Google account hits IAP 403 before reaching DataHub (perimeter works independently of DataHub OIDC).
  • Allowlisted user signs in → DataHub user record auto-created with their email as userName.
  • Admin user can reach /settings/policies and create a policy; non-admin user gets 403 on the same path.
  • Steward user can edit a tag on a dataset inside their domain; cannot edit a tag on a dataset outside it.

# Then

Story 13 hardens cost + ops and finalizes where the policies-as-code bootstrap lives.


# Story 13 — Cost + Operations Hardening

Stacks: all dev stacks + ingestion cross-repo coordination Agent: infra-terraform + docs-infra Depends on: Story 12

# What to build

  • Label audit across all Terraform-managed resources (fail CI if labels missing).
  • Budget alerts at 50 / 80 / 100% of target in Cloud Billing.
  • PDB verification: simulate a node drain on workload-pool, confirm DataHub, Kafka, OpenSearch survive.
  • Maintenance window verification on the GKE cluster and Cloud SQL instance.
  • Ingestion DAGs added to ume-data-dags (BigQuery, Airflow, dbt) — cross-repo work, tracked here as coordination.
  • Runbook drill: at least one end-to-end scenario (e.g. Kafka broker restart, OpenSearch snapshot restore).
  • Consider re-enabling OpenSearch security plugin with basic auth backed by Secret Manager CSI.
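
For the budget-alert item above, a sketch with google_billing_budget (billing account, project number, and the target amount are assumptions):

```hcl
resource "google_billing_budget" "dev" {
  billing_account = var.billing_account_id # assumed variable
  display_name    = "ume-data-dev monthly"

  budget_filter {
    projects = ["projects/${var.project_number}"] # assumed variable
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "200" # placeholder target; set to the agreed Phase 2 figure
    }
  }

  # 50 / 80 / 100% of target, per the story.
  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }
}
```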

# What to verify

  • CI label-lint passes on every stack.
  • Budget alert emails received at 50%.
  • kubectl drain <workload-pool-node> — no DataHub/Kafka/OS service disruption.
  • Runbook entry for at least one end-to-end recovery scenario merged.

# Monthly Cost Summary

# Phase 1 — Airflow only (~$81/mo)

| Resource | Spec | Monthly |
| --- | --- | --- |
| GKE cluster mgmt | Free tier (zonal) | $0 |
| default-pool | 1x e2-standard-2 | $49 |
| kpo-pool | 0 nodes idle; spot when active | $0 idle |
| Cloud SQL | db-g1-small | $26 |
| Cloud SQL storage | 10 GiB SSD | $2 |
| Boot + PD | 20 GiB pd-balanced | $2 |
| Cloud NAT | 2 NICs min | $2 |
| Total | | ~$81 |

# Phase 2 — Add DataHub (~$105-400/mo incremental, depending on autoscaler)

| Addition | Spec | Monthly |
| --- | --- | --- |
| workload-pool steady state | 1x e2-standard-4 | +$98 |
| workload-pool scaled up | to 4x e2-standard-4 | up to +$392 |
| PD-SSD (Kafka 2×10 GiB + OS 5 GiB + controllers 3×1 GiB) | ~28 GiB | +$6 |
| Cloud SQL — shared instance, no tier bump | same instance | +$0 |
| Snapshot bucket | Standard, 35-day retention | ~$1 |
| Phase 2 incremental (steady) | | ~$105 |
| Phase 2 incremental (worst case) | | ~$400 |

Savings vs the original plan: ~$100/mo, from reusing Cloud SQL, dropping the Auth Proxy sidecars, dev-sizing Kafka (2 brokers vs 3, no Cruise Control), and running single-node OpenSearch.

Note: GKE free tier covers one zonal cluster. Regional cluster in prod costs an additional ~$74/mo.


# After Phase 2

Once all stories are completed and verified on dev:

  1. Review lessons learned. Update docs where reality diverged from plan.
  2. Provision prod GCP projects (externally, by org admin).
  3. Create prod-01-base, prod-02-runtime stacks (mirror dev structure, different terraform.tfvars).
  4. Execute Phase 2 stories against prod, with the GitHub Environment approval gate.
  5. Promote the dev-validated custom Airflow image tag to prod.
  6. Enable DataHub ingestion recipes against prod BigQuery datasets.