# Deployment Stories

This section defines the implementation sequence for wave-1. Each story is designed to be a single PR (or a small set of closely related PRs) that delivers a verifiable outcome.

Stories are ordered by dependency: each story builds on the output of the previous ones. Do not skip ahead.


## Phase 1 — Airflow on GKE

Phase 1 provisions a GKE Standard cluster and deploys Airflow via the official Apache Airflow Helm chart with CeleryExecutor + Redis. DataHub and its dependencies (Kafka, OpenSearch) are deferred to Phase 2 — they'll be added to the same cluster.

Why GKE Standard instead of Cloud Composer: Composer 3's minimum dev cost floor is ~$300-400/mo. Airflow on a single e2-standard-2 node + Cloud SQL db-g1-small costs ~$81/mo. The 4-5x cost difference is the primary driver. The operational burden is acceptable because the GKE cluster is already planned for DataHub, and all Phase 1 infrastructure is reused in Phase 2 — no throwaway work.

Content repo: DAG, dbt, and Docker image work lives in a sibling repo ume-data-dags. Merges to that repo build and push the custom Airflow image, rsync dags/ and dbt/ to the GCS DAGs bucket, and auto-open a tfvars-bump PR on this repo via the INFRA_PR_TOKEN-authenticated bot-PR workflow. ume-data-infra now only tracks the image tag in environments/dev-03-runtime/terraform.tfvars.

Initial Phase 1 content was scaffolded here under resources/ (Stories 4d + 5) and moved out once validated. See story-status.md for the migration record.


## Story 0 — Repository Scaffold

Repo: github.com/1edata/ume-data-infra Agent: infra-terraform Status: DONE

Initialize the ume-data-infra repository with directory skeleton, CI workflows, and the bootstrap stack stub. See story-status.md for details.


## Story 1 — Bootstrap

Stack: layers/00-bootstrap/ Agent: infra-terraform Status: DONE

Terraform state bucket, Artifact Registry, WIF pool + provider, CI service accounts, API enablement. See story-status.md for details.


## Story 2 — Platform Shared (Airflow-focused) → Doc Restructure

Scope: Documentation only (no Terraform resources) Agent: docs-infra Status: DONE

### What happened

Airflow service accounts are environment-scoped, not shared. The Workload Identity bindings reference a specific project's identity pool ({project}.svc.id.goog), and in the multi-project future each project gets its own SAs for its own cluster.

Decision: SA + WI binding creation moved to Story 3c (environments/dev-01-base/). layers/10-platform-shared/ deferred to Phase 2 when cross-environment resources appear (DataHub SA, KMS, logging sink).

### What this story delivered

  • Updated all docs to reflect the restructured SA location
  • Fixed SA naming to follow the ume-{purpose} convention: ume-airflow, ume-airflow-kpo
  • Fixed KSA naming: airflow (not airflow-scheduler — the Helm chart applies one KSA to all components)
  • Updated inter-stack contracts: dev-01-base exports SA emails, dev-02-runtime reads from one stack
  • Updated Story 3c spec to absorb SA + WI binding creation

### Design decisions

  • SA naming: ume-airflow and ume-airflow-kpo (follows ume-{purpose} convention from naming table)
  • KSA naming: airflow — the Helm chart's serviceAccount.name applies to scheduler, worker, webserver, and triggerer. A generic name is accurate.
  • SAs belong in environments/, not layers/: In the multi-project setup, each project has its own SAs for its own cluster. layers/ is for resources shared across all environments and projects (state bucket, WIF, AR).
  • layers/10-platform-shared/ deferred: No cross-environment resources exist in Phase 1. Created in Story 6 when DataHub work begins.
  • storage.objectAdmin project-wide for PoC: The log bucket doesn't exist until Story 4. Scope to specific buckets as a hardening task in Story 4.

### Then

Stories 3a–3d provision networking, Cloud SQL, Airflow IAM, and GKE (one PR each).


## Story 3a — Networking

Stack: environments/dev-01-base/ Agent: infra-terraform

### What to build

Creates the environments/dev-01-base/ directory with stack scaffolding and networking resources.

Stack scaffolding: versions.tf, variables.tf, outputs.tf, locals.tf, backend.hcl, terraform.tfvars, data.tf

Networking (networking.tf):

  • VPC ume-data-dev-vpc (custom mode, regional routing).
  • Subnet ume-data-dev-gke-nodes (10.0.0.0/20) with secondary ranges: gke-pods (10.4.0.0/14), gke-services (10.8.0.0/20).
  • Private Google Access enabled on subnet (for GCS, AR, Secret Manager, BigQuery API access).
  • Static IP ume-data-dev-nat-ip for Cloud NAT egress.
  • Cloud Router ume-data-dev-router + Cloud NAT ume-data-dev-nat for outbound internet from GKE nodes (private cluster, no public IPs). NAT applies to all subnets, error-only logging enabled.
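
A sketch of the subnet shape this implies, in the direct-resource form that predates the Story 3d module extraction (the `vpc` resource name is illustrative):

```hcl
resource "google_compute_subnetwork" "gke_nodes" {
  name                     = "ume-data-dev-gke-nodes"
  network                  = google_compute_network.vpc.id
  region                   = "us-east1"
  ip_cidr_range            = "10.0.0.0/20"
  private_ip_google_access = true # GCS, AR, Secret Manager, BigQuery without NAT

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```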

Remote state (data.tf):

  • terraform_remote_state data source reading 00-bootstrap outputs. Separated from networking.tf because it is a stack-level concern shared by Stories 3b-3d.
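
A minimal sketch of that shared data source, assuming the bootstrap stack writes its state under a layers/00-bootstrap prefix (the prefix and the referenced output name are illustrative):

```hcl
data "terraform_remote_state" "bootstrap" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "layers/00-bootstrap"
  }
}

# Downstream files in this stack then reference bootstrap outputs as, e.g.:
# data.terraform_remote_state.bootstrap.outputs.artifact_registry_url
```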

### Design decisions

  • Direct resources (modularized in Story 3d): Originally used direct resources. Extracted into modules/vpc/ in Story 3d via moved blocks.
  • ume-data-{env} naming prefix: Changed from ume-{env} to avoid generic collisions in shared GCP projects. Updated naming table in 04-terraform-structure.md.
  • Static NAT IP: Reserved google_compute_address for predictable egress. Allows allowlisting by external services.
  • ALL_SUBNETWORKS_ALL_IP_RANGES: No public subnets planned. Cloud NAT only affects VMs without external IPs, so this is safe even if public-IP VMs are added later.
  • Remote state in data.tf: Stack-level concern. Stories 3b-3d will add files to this stack that reference bootstrap outputs. Shared data source avoids duplication.
  • Zone variable in scaffolding: zone = us-east1-b included in variables.tf for Story 3d's zonal GKE cluster.
  • No composer subnet: Composer is not used. VPC design only needs GKE subnets.
  • No Private Service Access (PSA) here: PSA is only needed for Cloud SQL private IP — provisioned in Story 3b alongside the SQL instance.

### Outputs to export

  • vpc_id, vpc_self_link, subnet_self_link, pod_secondary_range_name, service_secondary_range_name, nat_ip_address

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform init -backend-config=backend.hcl && terraform validate passes
  • terraform plan shows 5 resources (VPC, subnet, static IP, router, NAT)
  • After CI apply: VPC and subnets exist: gcloud compute networks subnets list --project=poc-ume-data
  • After CI apply: Private Google Access enabled: gcloud compute networks subnets describe ume-data-dev-gke-nodes --region=us-east1 --format='value(privateIpGoogleAccess)'
  • After CI apply: Cloud NAT configured: gcloud compute routers list --project=poc-ume-data
  • After CI apply: Static IP reserved: gcloud compute addresses list --project=poc-ume-data --filter='name=ume-data-dev-nat-ip'

### Then

Story 3b adds Cloud SQL on this network.


## Story 3b — Cloud SQL

Stack: environments/dev-01-base/ Agent: infra-terraform Depends on: Story 3a (VPC for PSA peering)

### What to build

Cloud SQL (cloud-sql.tf):

  • Private Service Access (PSA) — google_compute_global_address (ume-data-dev-psa-range, 10.64.0.0/20) + google_service_networking_connection. PSA is only needed for Cloud SQL private IP; GCS/AR/Secret Manager use Private Google Access (enabled in Story 3a), not PSA.
  • PostgreSQL 16 instance ume-data-dev-airflow-pg, tier db-g1-small (shared core, 1.7 GB RAM).
  • Private IP via PSA (no public IP). enable_private_path_for_google_cloud_services = true.
  • IAM authentication flag enabled (cloudsql.iam_authentication = on). The actual IAM user (google_sql_user) and roles/cloudsql.client binding are created in Story 3c alongside the ume-airflow SA.
  • 10 GB SSD, auto-increase enabled, limit 50 GB (safety cap).
  • Automated daily backups at 3 AM UTC, 7-day retention. No PITR (deferred to prod).
  • Maintenance window: Sunday 4 AM UTC, stable track.
  • deletion_protection = false (PoC only).
  • airflow database created via google_sql_database so Story 4's Helm chart can connect immediately.
  • Break-glass admin password: google_secret_manager_secret shell (ume-data-dev-cloudsql-admin-password). Value populated out-of-band. Default postgres user password set manually — no separate Terraform-managed admin user.
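
A sketch of the PSA pairing plus the instance flags named above (abbreviated to the settings this story calls out; resource names illustrative):

```hcl
resource "google_compute_global_address" "psa_range" {
  name          = "ume-data-dev-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  address       = "10.64.0.0"
  prefix_length = 20
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}

resource "google_sql_database_instance" "airflow" {
  name             = "ume-data-dev-airflow-pg"
  database_version = "POSTGRES_16"
  depends_on       = [google_service_networking_connection.psa]

  settings {
    tier = "db-g1-small"

    ip_configuration {
      ipv4_enabled                                  = false
      private_network                               = google_compute_network.vpc.id
      enable_private_path_for_google_cloud_services = true
    }

    database_flags {
      name  = "cloudsql.iam_authentication"
      value = "on"
    }
  }

  deletion_protection = false # PoC only
}

resource "google_sql_database" "airflow" {
  name     = "airflow"
  instance = google_sql_database_instance.airflow.name
}
```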

### Design decisions

  • db-g1-small over db-f1-micro: db-f1-micro has 614 MB RAM — OOM risk under write load. db-g1-small at 1.7 GB is sufficient for Airflow metadata. Cost: $26 vs $8/mo.
  • PostgreSQL 16: Latest GA on Cloud SQL with improved query performance. Airflow supports 12-16.
  • PSA range /20 not /24: Zero cost difference (just an IP allocation). Expanding PSA ranges later requires deleting/recreating the peering connection (downtime). /20 is future-proof for DataHub, replicas.
  • PSA range hardcoded at 10.64.0.0: Deterministic, reproducible plans. Safely outside all existing allocations (nodes 10.0.0.0/20, pods 10.4.0.0/14, services 10.8.0.0/20).
  • airflow database created here, not in Story 4: Story 4's Helm chart expects metadataConnection.db: airflow. Creating the database alongside the instance avoids a manual prerequisite.
  • No google_sql_user for admin: The default postgres user is created automatically by Cloud SQL. Break-glass access uses postgres + password from Secret Manager.
  • disk_autoresize_limit = 50: Safety cap prevents runaway growth on a PoC instance.
  • File name cloud-sql.tf (not persistence.tf): More specific, consistent with networking.tf and gke.tf. Updated 04-terraform-structure.md to match.
  • No labels on PSA range: google_compute_global_address with purpose = VPC_PEERING rejects labels (GCP API limitation).
  • Shared instance strategy: When DataHub arrives in Phase 2, evaluate whether to create a second logical database on this instance (cheaper) or a separate instance (better isolation).

### Outputs to export (added)

  • sql_connection_name, sql_private_ip, sql_instance_name

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform validate passes
  • After CI apply: Cloud SQL running: gcloud sql instances list --project=poc-ume-data
  • After CI apply: Private IP assigned (no public): gcloud sql instances describe ume-data-dev-airflow-pg --format='value(ipAddresses)'
  • After CI apply: PSA range allocated: gcloud compute addresses list --global --filter='purpose=VPC_PEERING' --project=poc-ume-data
  • After CI apply: airflow database exists: gcloud sql databases list --instance=ume-data-dev-airflow-pg --project=poc-ume-data
  • After CI apply: Secret shell exists: gcloud secrets list --project=poc-ume-data --filter='name:cloudsql-admin-password'

### Then

Story 3c creates the Airflow service accounts.


## Story 3c — Airflow IAM

Stack: environments/dev-01-base/ Agent: infra-terraform Depends on: Story 3a (stack scaffolding), Story 3b (Cloud SQL instance for IAM database user)

### What to build

Airflow service accounts and IAM (iam.tf):

  • ume-airflow service account with roles/bigquery.dataEditor, roles/cloudsql.client, roles/secretmanager.secretAccessor, roles/storage.objectAdmin (project-wide for PoC; scope to specific buckets in Story 4).
  • ume-airflow-kpo service account with roles/bigquery.dataEditor, roles/storage.objectViewer (scoped identity for KPO tasks — separate from main Airflow SA for security isolation).
  • Workload Identity bindings for both SAs (depends_on = [module.gke] — GCP validates the WI pool exists, so these must wait for the cluster):
    • airflow KSA in airflow namespace → ume-airflow GSA
    • airflow-kpo KSA in airflow-kpo namespace → ume-airflow-kpo GSA
  • Cloud SQL IAM database user (google_sql_user with type = CLOUD_IAM_SERVICE_ACCOUNT) for the ume-airflow SA — deferred from Story 3b.
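
A sketch of the pieces above, assuming local role sets and the flat resource names from Stories 3a-3b (addresses illustrative):

```hcl
locals {
  airflow_roles = toset([
    "roles/bigquery.dataEditor",
    "roles/cloudsql.client",
    "roles/secretmanager.secretAccessor",
    "roles/storage.objectAdmin", # TODO(narrow-scope): bucket-level in Story 4
  ])
}

resource "google_project_iam_member" "airflow" {
  for_each = local.airflow_roles
  project  = var.project_id
  role     = each.value
  member   = "serviceAccount:${google_service_account.airflow.email}"
}

# WI binding: waits for the cluster so the WI pool exists.
resource "google_service_account_iam_member" "airflow_wi" {
  service_account_id = google_service_account.airflow.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[airflow/airflow]"
  depends_on         = [module.gke]
}

# IAM database user: SA email with the .gserviceaccount.com suffix trimmed.
resource "google_sql_user" "airflow_iam" {
  instance = google_sql_database_instance.airflow.name
  name     = trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```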

### Design decisions

  • google_sql_user in iam.tf, not cloud-sql.tf: IAM concern (granting SA database auth). Keeps Story 3c's PR self-contained.
  • google_project_iam_member (additive): Same pattern as bootstrap. Authoritative bindings would revoke other members from shared roles like roles/bigquery.dataEditor.
  • for_each over role sets: Role bindings use toset() locals with for_each. Adding/removing a role is a one-line change. Plan output is self-documenting (keys are full role strings).
  • trimsuffix for SQL user name: The GCP API expects the SA email without .gserviceaccount.com. Using trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com") maintains the Terraform dependency graph.
  • No labels on any resources: google_service_account, google_project_iam_member, google_service_account_iam_member, and google_sql_user do not support GCP labels. Not a label-invariant violation.
  • WI bindings depend on GKE: GCP validates the Workload Identity pool ({project}.svc.id.goog) exists — it is created when a GKE cluster enables Workload Identity. The bindings use depends_on = [module.gke] to ensure correct ordering. GCP does NOT validate that the KSA exists (Story 4 creates them via Helm).
  • Broad permissions flagged for scoping: roles/storage.objectAdmin and roles/secretmanager.secretAccessor are project-wide for PoC. Inline TODO(narrow-scope) comments mark these for Story 4 / future hardening.

### Outputs to export (added)

  • airflow_sa_email, airflow_kpo_sa_email

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform validate passes
  • After CI apply: gcloud iam service-accounts list --project=poc-ume-data | grep ume-airflow
  • After CI apply: Both SAs created with correct roles
  • After CI apply: Workload Identity bindings exist: gcloud iam service-accounts get-iam-policy ume-airflow@poc-ume-data.iam.gserviceaccount.com
  • After CI apply: Workload Identity bindings exist: gcloud iam service-accounts get-iam-policy ume-airflow-kpo@poc-ume-data.iam.gserviceaccount.com
  • After CI apply: Cloud SQL IAM user exists: gcloud sql users list --instance=ume-data-dev-airflow-pg --project=poc-ume-data

### Then

Story 3d provisions the GKE cluster.


## Story 3d — GKE Cluster + Module Extraction

Stack: environments/dev-01-base/ + modules/gke-standard/ + modules/vpc/ + modules/cloud-sql-postgres/ Agent: infra-terraform Depends on: Story 3a (VPC subnets for nodes/pods/services)

### What to build

Module extraction (applied first): Extract existing flat resources from Stories 3a-3c into reusable modules. State migrated via moved blocks (declarative, CI-friendly — no manual terraform state mv).

  • modules/vpc/ — VPC, subnet with GKE secondary ranges, Cloud NAT, Cloud Router. Single network_cidr_base (/12) parameter derives all CIDRs via cidrsubnet().
  • modules/cloud-sql-postgres/ — PSA peering, Cloud SQL instance, database, admin password secret. Includes PSA because its sole purpose is Cloud SQL private networking.
  • IAM stays flat in the env layer (policy layer, not infrastructure pattern).
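
A sketch of the two mechanisms named above: a moved block for the state migration, and the cidrsubnet() derivation inside modules/vpc/ (netnum values assume a 10.0.0.0/12 base; resource addresses illustrative):

```hcl
# environments/dev-01-base/moved.tf (deleted once applied; see Story 4a)
moved {
  from = google_compute_network.vpc
  to   = module.vpc.google_compute_network.vpc
}

# modules/vpc/locals.tf
locals {
  nodes_cidr    = cidrsubnet(var.network_cidr_base, 8, 0)   # 10.0.0.0/20
  pods_cidr     = cidrsubnet(var.network_cidr_base, 2, 1)   # 10.4.0.0/14
  services_cidr = cidrsubnet(var.network_cidr_base, 8, 128) # 10.8.0.0/20
}
```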

Bootstrap fix: Custom role tfIamPolicyAdmin on tf-apply-sa with {get,set}IamPolicy for both projects and service accounts. roles/editor omits these permissions, which are needed for google_project_iam_member and google_service_account_iam_member. Applied manually before CI can manage IAM bindings.

GKE module (modules/gke-standard/): Reusable module encapsulating cluster creation, node pool management, naming, labels, and security defaults. All settings exposed as variables with sensible defaults. Called from environments/dev-01-base/gke.tf.

GKE cluster (via module):

  • Cluster ume-data-dev-gke, zonal (us-east1-b) for dev PoC. Regional deferred to prod.
  • Private cluster: private nodes, public endpoint with authorized networks (default 0.0.0.0/0 for dev, variable-driven for future Cloudflare WARP/VPN restriction).
  • Master CIDR: 172.16.0.0/28 (control plane VPC peering, outside all existing allocations).
  • Workload Identity enabled (${project_id}.svc.id.goog).
  • Release channel: Regular.
  • Dataplane V2 (ADVANCED_DATAPATH) for built-in network policy enforcement via Cilium/eBPF. Chosen over Calico (spec's original choice) because it is Google's strategic direction and avoids LEGACY_DATAPATH.
  • Maintenance window: weekdays 02:00-06:00 UTC.
  • deletion_protection = true.

Node pools:

| Pool | Machine | Disk | Min | Max | Spot | Taints | Purpose |
|------|---------|------|-----|-----|------|--------|---------|
| default-pool | e2-standard-2 | 100 GB pd-balanced | 1 | 2 | No | None | Airflow + system services |
| kpo-pool | e2-standard-2 | 100 GB pd-balanced | 0 | 3 | Yes | workload=kpo:NoSchedule | KPO batch tasks (scale-to-zero) |

Both pools: shielded instances (secure boot + integrity monitoring), Workload Identity metadata mode, legacy metadata endpoint disabled, surge upgrade (max_surge=1, max_unavailable=0).

The kpo-pool scales to zero nodes when idle. When Airflow triggers a KPO task, the pod is created with a toleration for the workload=kpo:NoSchedule taint and a nodeSelector for pool: kpo. The Cluster Autoscaler detects the pending pod and provisions a spot node (~60-90s cold start). After ~10 minutes idle, the node is removed. Max 3 nodes in dev (tightened from 10 to limit blast radius from runaway DAGs).
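
Inside modules/gke-standard/, the kpo pool would take roughly this shape (a sketch; attribute names follow the google provider, the resource address is illustrative):

```hcl
resource "google_container_node_pool" "kpo" {
  name     = "kpo-pool"
  cluster  = google_container_cluster.this.id
  location = var.zone

  autoscaling {
    min_node_count = 0 # scale-to-zero when no KPO tasks are pending
    max_node_count = 3
  }

  node_config {
    machine_type = "e2-standard-2"
    disk_type    = "pd-balanced"
    disk_size_gb = 100
    spot         = true
    labels       = { pool = "kpo" }

    # Only pods carrying a matching toleration (KPO tasks) schedule here.
    taint {
      key    = "workload"
      value  = "kpo"
      effect = "NO_SCHEDULE"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }
}
```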

### Design decisions

  • Local module (modules/gke-standard/): Encapsulates cluster + node pools + naming + labels + security defaults. Environment stacks call the module with different parameters (machine types, node counts, location). Prod replication requires changing ~10 values in the module call instead of duplicating 160 lines of Terraform. All settings exposed as variables with defaults for maximum configurability per environment.
  • Zonal cluster for dev: Halves node count vs regional. Regional deferred to prod when HA is required.
  • e2-standard-2 is the smallest viable machine: Shared-core machines (e2-small, e2-medium) lose ~1060m to flat CPU reservation. With e2-standard-2 (2 vCPU, 8 GiB), allocatable is ~1930m CPU / ~6.1 GiB RAM.
  • Dataplane V2 over Calico: Irreversible choice (requires cluster recreation to change). Cilium/eBPF is more performant than iptables-based Calico. Built-in network policy enforcement without a separate network_policy block. Known limitations reviewed: anetd CPU usage under high TCP churn (not applicable for Airflow), no manual internal passthrough NLBs (not needed).
  • Authorized networks 0.0.0.0/0: API server still requires authentication regardless. Variable-driven list(object) makes restricting to Cloudflare WARP CIDRs a one-line tfvars change.
  • Master CIDR 172.16.0.0/28 hardcoded: Architectural decision, not per-environment. In a different RFC 1918 block from all existing allocations (nodes 10.0.0.0/20, pods 10.4.0.0/14, services 10.8.0.0/20, PSA 10.64.0.0/20).
  • kpo-pool max=3: Tightened from 10 for dev PoC. Limits cost exposure from runaway DAGs while allowing some parallelism.
  • deletion_protection = true: Deliberate two-step teardown (flip flag, then destroy). Safer default even for PoC.
  • oauth_scopes = ["cloud-platform"]: Broad scope is standard practice because Workload Identity provides fine-grained pod-level auth. Node-level scopes are a legacy mechanism.

### Phase 1 resource budget (1x e2-standard-2 default-pool)

| Consumer | CPU request | Memory request |
|----------|-------------|----------------|
| Airflow scheduler | 500m | 1.5 Gi |
| Celery worker (1) | 250m | 1 Gi |
| Airflow webserver | 250m | 512 Mi |
| Airflow triggerer | 100m | 256 Mi |
| Redis | 50m | 128 Mi |
| System pods (kube-system) | ~300m | ~400 Mi |
| Used | ~1450m | ~3.8 Gi |
| Remaining headroom | ~480m | ~2.3 Gi |

Snug but workable — dbt-bigquery is I/O-bound (submits SQL and waits). See Airflow on GKE — Scaling signals for when to upgrade to e2-standard-4.

### Outputs to export

  • gke_cluster_name, gke_endpoint, gke_ca_cert (sensitive)

### What to verify

  • terraform fmt -check -recursive environments/dev-01-base/
  • terraform validate passes
  • After CI apply: GKE cluster running: gcloud container clusters list --project=poc-ume-data
  • After CI apply: kubectl works: gcloud container clusters get-credentials ume-data-dev-gke --zone=us-east1-b --project=poc-ume-data && kubectl get nodes
  • After CI apply: One default-pool node visible, zero kpo-pool nodes
  • After CI apply: Both pools listed: gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b --project=poc-ume-data

### Then

Story 4 deploys Airflow onto the cluster.


## Story 4a — Runtime Stack Scaffolding + GCS Buckets

Stack: environments/dev-02-runtime/ + modules/gcs-bucket/ + updates to modules/gke-standard/, environments/dev-01-base/, layers/00-bootstrap/ Agent: infra-terraform Depends on: Story 3d (dev-01-base complete)

### What to build

New module — modules/gcs-bucket/:

  • google_storage_bucket with configurable name, location, storage class, lifecycle rules, versioning.
  • Hardcoded: uniform bucket-level access.
  • Variables: name, project_id, location, storage_class, versioning (bool), force_destroy (bool, default false), lifecycle_rules (list of objects supporting Delete and SetStorageClass actions with age, created_before, num_newer_versions, with_state conditions), labels.

GKE module update — modules/gke-standard/:

  • Add gcs_fuse_csi_enabled variable (default true).
  • Enable gcs_fuse_csi_driver_config add-on on the cluster via addons_config block. Required for GCS-based DAG sync in Story 4b.

Prerequisite fixes (gaps from Story 3d):

  • environments/dev-01-base/outputs.tf — Add missing GKE outputs: gke_cluster_name, gke_endpoint, gke_ca_cert (sensitive). Required by dev-02-runtime's kubernetes/helm providers via remote state.
  • environments/dev-01-base/moved.tf — Delete (moves applied in Story 3d, file is dead weight).
  • layers/00-bootstrap/main.tf — Add roles/container.viewer to plan SA. Required for terraform plan on kubernetes/helm resources (Story 4b onward). roles/viewer does not grant k8s API access.

Stack scaffolding — environments/dev-02-runtime/:

  • versions.tf — Terraform + google + google-beta + kubernetes + helm providers. Kubernetes and Helm providers use data.google_client_config.default.access_token for auth and read endpoint + CA cert from dev-01-base remote state.
  • variables.tf — Active: project_id, environment, region, zone, state_bucket. Commented out (wired by later stories): airflow_image_repository, airflow_image_tag, domain_name, airflow_subdomain.
  • outputs.tf — airflow_logs_bucket, airflow_dags_bucket.
  • locals.tf — common_labels (layer=runtime).
  • backend.hcl — GCS backend: ume-tf-state-poc-ume-data/environments/dev-02-runtime/.
  • terraform.tfvars — dev values.
  • data.tf — terraform_remote_state reading dev-01-base + 00-bootstrap, plus google_client_config for access token.
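
A sketch of that provider wiring, assuming the remote-state data source is named base (the gke_endpoint / gke_ca_cert outputs are the ones added to dev-01-base in this story):

```hcl
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
}

provider "helm" {
  kubernetes {
    host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
  }
}
```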

GCS buckets (buckets.tf):

  • Log bucket via module: ume-airflow-logs-poc-ume-data, 90-day delete lifecycle, no versioning.
  • DAGs bucket via module: ume-airflow-dags-poc-ume-data, no lifecycle (synced from CI), versioning enabled (rollback support).
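
A sketch of the log-bucket call against the variable surface described above (assumes the module's lifecycle_rules type marks unused condition fields optional):

```hcl
module "airflow_logs" {
  source     = "../../modules/gcs-bucket"
  name       = "ume-airflow-logs-poc-ume-data"
  project_id = var.project_id
  location   = var.region
  versioning = false
  labels     = local.common_labels

  lifecycle_rules = [{
    action    = { type = "Delete" }
    condition = { age = 90 } # days
  }]
}
```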

### Design decisions

  • modules/gcs-bucket/ module: Log bucket, DAG bucket, and future data buckets share the same pattern (lifecycle, labels, uniform access, versioning). Module-first strategy, justified by multiple callers within Phase 1 alone.
  • Full lifecycle rule support: lifecycle_rules variable accepts a list of objects with action type (Delete/SetStorageClass) and multiple condition types. Handles tiering rules, not just age-based delete.
  • force_destroy as variable: Module invariant says expose all configurable settings. Defaults to false (safe), dev can override for easy teardown.
  • GCS FUSE CSI over git-sync: Workload Identity handles auth to GCS (already configured in Story 3c). No tokens, SSH keys, or deploy keys needed. GCS FUSE is a native GKE add-on. See Story 4b for the mount configuration.
  • Layer named dev-02-runtime (was dev-03-runtime): The dev-02-k8s-base layer was planned for Phase 2 (Strimzi, OpenSearch, ingress). Skipping from dev-01 to dev-03 is confusing when dev-02 doesn't exist. Renumber if Phase 2 needs an intermediate layer.
  • Provider auth pattern: kubernetes/helm providers use data.google_client_config.default.access_token + GKE endpoint/CA from remote state. No gcloud get-credentials calls. Providers initialize lazily, so Story 4a (no k8s resources) doesn't require cluster connectivity during plan.
  • roles/container.viewer on plan SA: roles/viewer does not map to any k8s RBAC role, so the plan SA cannot read k8s state for drift detection. roles/container.viewer grants read-only k8s API access via the view ClusterRole.
  • Two remote state sources: dev-02-runtime reads from both dev-01-base (GKE, SQL, SA outputs) and 00-bootstrap (AR URL, state bucket). Clear provenance over pass-through outputs.

### What to verify

  • terraform fmt -check -recursive passes across all changed stacks
  • terraform init -backend=false && terraform validate passes on modules/gcs-bucket, environments/dev-01-base, environments/dev-02-runtime, layers/00-bootstrap
  • terraform plan shows: 2 GCS buckets + GKE cluster update (FUSE add-on) + 3 new outputs on dev-01-base + 1 new IAM binding on bootstrap
  • After CI apply: buckets exist: gsutil ls gs://ume-airflow-logs-poc-ume-data/ and gsutil ls gs://ume-airflow-dags-poc-ume-data/
  • After CI apply: GCS FUSE CSI enabled on cluster

### Then

Story 4b deploys Airflow onto the cluster.


## Story 4b — Airflow Helm Release (Stock Image, Port-Forward)

Stack: environments/dev-02-runtime/ + modules/airflow-helm/, with base-layer changes in environments/dev-01-base/ and modules/cloud-sql-postgres/ Agent: infra-terraform Depends on: Story 4a (buckets created, providers configured, GCS FUSE enabled)

### What to build

New module — modules/airflow-helm/: Namespace, shared service account, connection secrets, DB bootstrap Job, and Helm release. All settings exposed as variables with defaults. Called from environments/dev-02-runtime/airflow.tf as module "airflow".

Airflow Helm release (via module):

  • Official Apache Airflow Helm chart 1.20.0 deployed via helm_release.
  • Stock image: apache/airflow:3.2.0 (parametrized via var.airflow_image_repository + var.airflow_image_tag). Custom image with Cosmos/dbt added in Story 4d.
  • Executor: CeleryExecutor with Redis.
  • 1 Celery worker (min=1, always on).
  • Triggerer enabled (for deferrable operators).
  • DAG processor enabled (mandatory standalone component in Airflow 3).
  • API server enabled (replaces webserver in Airflow 3 — serves UI and REST API).
  • Namespace: airflow.
  • No external auth — basic admin user created via Helm createUserJob. Port-forward access is already gated by kubectl / GKE IAM.

Airflow 3 component changes (vs. Airflow 2):

  • Chart 1.20.0 uses semver gates in templates: apiServer renders for Airflow >= 3.0.0, webserver renders for < 3.0.0.
  • dagProcessor is mandatory — DAG parsing moved out of the scheduler into a standalone process.
  • webserver block kept only for defaultUser config consumed by createUserJob. Its deployment template does not render.

Workload Identity:

  • Chart 1.20.0 creates per-component KSAs by default (airflow-scheduler, airflow-api-server, etc.), none of which carry the WI annotation.
  • A single kubernetes_service_account_v1 is created in Terraform with the WI annotation, and every component references it via serviceAccount = { create = false, name = "airflow" }.
  • The base layer's WI binding targets [airflow/airflow].

Cloud SQL connection (via Auth Proxy sidecar):

  • Cloud SQL Auth Proxy gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.3 added as extraContainers on scheduler, workers, api-server, triggerer, dag-processor.
  • Proxy flags: --structured-logs --auto-iam-authn --private-ip --port=5432.
  • --private-ip is required because the Cloud SQL instance has only a private IP (PSA networking).
  • --auto-iam-authn lets the proxy handle IAM token refresh via Workload Identity.
  • Connection string: Pre-built kubernetes_secret_v1 with URL-encoded IAM user (the @ in ume-airflow@poc-ume-data.iam breaks the Helm chart's URI template). Referenced via data.metadataSecretName / data.resultBackendSecretName.
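
A sketch of that pre-built Secret; the connection key name is an assumption to verify against chart 1.20.0's metadataSecretName handling:

```hcl
resource "kubernetes_secret_v1" "airflow_metadata" {
  metadata {
    name      = "airflow-metadata-connection"
    namespace = "airflow"
  }
  data = {
    # '@' in the IAM user must be percent-encoded (%40), or the URI parser
    # splits the host at the wrong place. The proxy listens on localhost:5432.
    connection = "postgresql://ume-airflow%40poc-ume-data.iam@127.0.0.1:5432/airflow"
  }
}
```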

Bootstrap Job (kubernetes_job_v1.db_bootstrap):

  • Runs before the Helm release via depends_on.
  • Cloud SQL Auth Proxy native sidecar (init container with restartPolicy: Always).
  • Step 1 (grants init container): connects as postgres admin, GRANTs privileges to the IAM user on the airflow database. Cloud SQL IAM users are created without any DB privileges.
  • Step 2 (migrate init container): runs airflow db migrate as the IAM user via the proxy.
  • The chart's migrateDatabaseJob is disabled because its hook runs only after the main release resources are created, and it failed when the IAM user's privileges didn't exist yet.
  • The postgres admin password is fetched at runtime from Secret Manager via Workload Identity. No long-lived credentials in Kubernetes.

Base-layer changes (required for the bootstrap to work):

  • roles/cloudsql.instanceUser added to the Airflow SA. This is required for IAM database authentication (cloudsql.instances.login), separate from roles/cloudsql.client which only allows proxy connections.
  • cloud-sql-postgres module: automated postgres admin password via random_password + google_sql_user + google_secret_manager_secret_version. No manual password setup.
  • Default pool max_count raised from 2 to 3 (7 Airflow pods with sidecars need room on e2-standard-2 nodes).

DAG sync via GCS FUSE:

  • dags.gitSync.enabled = false.
  • Per-component extraVolumes + extraVolumeMounts on scheduler, workers, triggerer, dag-processor.
  • Pod annotations override GCS FUSE sidecar resources: GKE default injection is 250m CPU / 256Mi memory / 5Gi ephemeral, overridden to 10m / 64Mi / 256Mi (read-only DAG mount barely uses any CPU). Frees ~960m CPU requests across 4 pods.
  • Mounted at /opt/airflow/dags/ (read-only).

Remote logging to GCS (hybrid with Cloud Logging):

  • Container stdout/stderr goes to Cloud Logging automatically (GKE default, zero config).
  • Airflow task execution logs go to GCS via built-in remote_logging:
```yaml
env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: "gs://ume-airflow-logs-poc-ume-data/logs"
  - name: AIRFLOW__LOGGING__DELETE_LOCAL_LOGS
    value: "True"
```

Probe tuning: Chart default probes run airflow jobs check which imports the full Python framework on every invocation. On e2-standard-2 nodes this takes >20s. Startup probe failureThreshold set to 20 on scheduler and api-server. Liveness probe timeoutSeconds raised to 60 on scheduler, worker, triggerer, dag-processor.

Cleanup: Standalone kubernetes_cron_job_v1 (disabled by default, var.cleanup_enabled = false). The chart's built-in cleanup section doesn't support extraInitContainers, so the Cloud SQL Auth Proxy can't be injected there.

Resource requests (dev PoC — 2-3x e2-standard-2 nodes):

```yaml
scheduler:
  resources:
    requests: { cpu: 200m, memory: 512Mi }
    limits: { cpu: "1", memory: 1Gi }

apiServer:
  resources:
    requests: { cpu: 250m, memory: 512Mi }
    limits: { cpu: 500m, memory: 1Gi }

dagProcessor:
  resources:
    requests: { cpu: 150m, memory: 384Mi }
    limits: { cpu: 500m, memory: 1Gi }

workers:
  replicas: 1
  resources:
    requests: { cpu: 500m, memory: 1536Mi }
    limits: { cpu: "1.5", memory: 3Gi }

triggerer:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 250m, memory: 512Mi }

redis:
  enabled: true
  resources:
    requests: { cpu: 50m, memory: 64Mi }

postgresql:
  enabled: false  # external Cloud SQL
```

Hardening note: ume-airflow has project-wide roles/storage.objectAdmin. After this story, scope the grant to the specific log and DAG buckets via bucket-level IAM.

### Design decisions

  • Airflow 3.2.0 / chart 1.20.0: Spec was for 2.10.3 / 1.15.0. Upgraded because 3.2.0 was latest stable at deployment time, which forced the apiServer, dagProcessor, shared KSA, and bootstrap Job changes below.
  • Stock image first: Validates the platform before adding Cosmos/dbt. Custom image in Story 4d.
  • Cloud SQL Auth Proxy sidecar (not Python connector): Stock Airflow image lacks cloud-sql-python-connector. Auth Proxy handles IAM token refresh as a sidecar with no image dependencies.
  • Shared KSA: Chart 1.20.0 creates per-component KSAs, none with WI. One Terraform-managed kubernetes_service_account_v1 avoids N separate WI bindings and keeps the base layer's [airflow/airflow] binding working.
  • Terraform bootstrap Job: The chart's migrateDatabaseJob is a post-install hook — runs after the release resources exist. Cloud SQL IAM users start with zero DB privileges, so the hook fails on first install. The Terraform Job runs grants + migrate before the Helm release, then disables the chart's migration job. See backlog for investigating the chart's intended pattern.
  • waitForMigrations disabled: Chart 1.20.0 places extraInitContainers after the wait-for-airflow-migrations init container, so a native sidecar proxy there wouldn't be running when the check executes. Safe to disable because the Terraform bootstrap Job already ran migrations.
  • --private-ip: Cloud SQL instance is private-only (PSA). Without this flag the proxy tries public IP and fails.
  • GCS FUSE resource overrides: Default injection (250m CPU / 256Mi memory / 5Gi ephemeral per pod) is overkill for a read-only DAG mount. Annotations bring it down to 10m / 64Mi / 256Mi.
  • Probe timeout 60s: airflow jobs check imports the full framework. 20s is not enough on e2-standard-2.
  • Scheduler CPU limit 1000m: At 500m the scheduler was throttled during Python import and couldn't start within the probe window.
  • Pre-built connection Secrets: IAM DB user ume-airflow@poc-ume-data.iam has @ which breaks standard URI parsing in the Helm chart's template.
  • Port-forward for initial access: No ingress, DNS, or TLS on the critical path. Port-forward is already gated by kubectl / GKE IAM. External access in Story 4c.
  • Hybrid logging: Container logs go to Cloud Logging, task execution logs go to GCS (Airflow UI reads them natively).
  • GCS FUSE over git-sync: Auth handled by Workload Identity, no tokens or keys. CI pushes DAGs to GCS on merge to main.

### What to verify

  • terraform fmt -check -recursive passes
  • terraform init -backend=false && terraform validate passes on dev-02-runtime
  • terraform plan clean on both base and runtime stacks
  • All Airflow pods running: api-server 2/2, scheduler 4/4, dag-processor 4/4, triggerer 4/4, worker 4/4, redis 1/1, statsd 1/1
  • Auth Proxy sidecars running with successful DB connections: kubectl logs deploy/airflow-scheduler -c cloud-sql-proxy -n airflow
  • Bootstrap Job completed (grants + migrations)
  • Airflow UI accessible: kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow
  • Push a hello-world DAG to GCS DAGs bucket, appears in Airflow UI
  • Hello-world DAG runs on the Celery worker
  • Logs appear in GCS: gsutil ls gs://ume-airflow-logs-poc-ume-data/logs/
  • Cloud Logging shows container logs from airflow namespace

### Outputs to export

  • airflow_namespace
  • airflow_logs_bucket (GCS log bucket name)
  • airflow_dags_bucket (GCS DAGs bucket name)

### Then

Story 4c adds ingress, TLS, DNS, and OIDC authentication for the API server.


## Story 4c — Ingress + TLS + DNS + IAP (Gateway API, three layers)

Stacks: layers/00-bootstrap/, environments/dev-01-base/, environments/dev-02-k8s-base/ (new), environments/dev-03-runtime/ (renamed from dev-02-runtime/) Agent: infra-terraform Status: DONE Depends on: Story 4b (Airflow running)

The original spec called for classic GKE Ingress + Flask-AppBuilder OAuth in webserver_config.py. Both were abandoned during execution: classic Ingress can't share a static IP across services (precluding shared-IP + wildcard DNS + per-app ingress), and Airflow 3 replaced Flask-AppBuilder auth with a pluggable auth_manager. The shipped design uses GKE Gateway API with IAP at the load balancer. See story-status.md for the PR-by-PR account.

# What was built

Layer structure reshuffle. New environments/dev-02-k8s-base/ platform layer (pulled forward from Story 8). Old dev-02-runtime/ renamed to dev-03-runtime/. DNS + shared static IP + wildcard cert moved to dev-01-base/ (zero k8s provider dependency).

layers/00-bootstrap/:

  • dns.googleapis.com, iap.googleapis.com APIs enabled.
  • roles/iap.admin on tf-apply-sa (brand/client write path).
  • Custom role tfIapReader on tf-plan-sa with clientauthconfig.{brands,clients}.{get,list}WithSecret variants (plan refresh).
  • Invariant added to CLAUDE.md: verify plan-SA + apply-SA permission coverage before every new downstream resource type.

environments/dev-01-base/:

  • google_dns_managed_zone ume-data-${env}-zone (delegated from GoDaddy).
  • google_compute_global_address ume-data-${env}-ingress-ip (shared across every service on the Gateway).
  • Wildcard A record *.${domain} → shared IP.
  • Certificate Manager DNS-01 authorization + auth CNAME + wildcard managed cert + certificate map + entry — all *.${domain} coverage.
  • New outputs: domain_name, dns_zone_name, dns_zone_nameservers, ingress_ip_name, ingress_ip_address, certificate_map_name.
  • modules/gke-standard/ gained gateway_api_config { channel = "CHANNEL_STANDARD" } (installs Gateway/HTTPRoute v1 CRDs on the cluster).

environments/dev-02-k8s-base/ (new stack):

  • google + kubernetes + helm providers wired via remote_state from dev-01-base.
  • Gateway namespace ume-data-${env}-gateway.
  • kubernetes_manifest Gateway: gatewayClassName = gke-l7-global-external-managed, NamedAddress to base's static IP, listeners https:443 and http:80 both with allowedRoutes.namespaces.from = All, annotation networking.gke.io/certmap to base's cert map.
  • kubernetes_manifest HTTPRoute on :80 with a catch-all PathPrefix: / match and a RequestRedirect filter (scheme https, 301).
  • Outputs: gateway_name, gateway_namespace.
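
A sketch of the :80 redirect route (Gateway name illustrative; the shape follows Gateway API v1):

```hcl
resource "kubernetes_manifest" "http_redirect" {
  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "HTTPRoute"
    metadata = {
      name      = "http-to-https"
      namespace = "ume-data-dev-gateway"
    }
    spec = {
      parentRefs = [{
        name        = "ume-gateway" # illustrative Gateway name
        sectionName = "http"        # attach only to the :80 listener
      }]
      rules = [{
        matches = [{ path = { type = "PathPrefix", value = "/" } }]
        filters = [{
          type            = "RequestRedirect"
          requestRedirect = { scheme = "https", statusCode = 301 }
        }]
      }]
    }
  }
}
```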

New modules/iap-oauth/:

  • google_iap_client under a caller-provided brand (brand stays in the stack as a project singleton).
  • kubernetes_secret_v1 with exactly one key key = <oauth client secret> (GCPBackendPolicy expects a single-key secret).
  • kubernetes_manifest GCPBackendPolicy with spec.default.iap.{enabled, clientID, oauth2ClientSecret.name} and targetRef to the app Service.
  • google_project_iam_member unconditional bindings on roles/iap.httpsResourceAccessor for each member of the UNION of iap_allowed_domains/groups/users.
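
Roughly what the module emits for the policy attachment, using the field paths listed above (variable and resource names illustrative):

```hcl
resource "kubernetes_manifest" "iap_policy" {
  manifest = {
    apiVersion = "networking.gke.io/v1"
    kind       = "GCPBackendPolicy"
    metadata = {
      name      = "${var.app_name}-iap"
      namespace = var.namespace
    }
    spec = {
      default = {
        iap = {
          enabled            = true
          clientID           = google_iap_client.this.client_id
          oauth2ClientSecret = { name = kubernetes_secret_v1.iap.metadata[0].name }
        }
      }
      targetRef = {
        group = ""
        kind  = "Service"
        name  = var.service_name
      }
    }
  }
}
```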

Extended modules/airflow-helm/:

  • Optional HTTPRoute (httproute_enabled) attaching to the shared Gateway via cross-namespace parentRef with sectionName = "https" (pins Airflow to the HTTPS listener, leaves :80 for the redirect HTTPRoute).
  • airflow_config.simple_auth_manager_all_admins flag. When true, the module also pins [core] auth_manager = SimpleAuthManager and force-disables the chart's createUserJob (both required to avoid FAB/SimpleAuthManager conflicts).

environments/dev-03-runtime/:

  • IAP brand passed in via var.iap_brand_name (brand is created manually in the GCP Console — see the iap.tf header for the runbook).
  • module "airflow_iap" wires IAP to airflow-api-server with per-user allow-list (ext_marcello.pontes@ume.com.br, wagner.jorge@ume.com.br, leonardo.luiz@ume.com.br).
  • Airflow HTTPRoute on https://airflow.${domain}.
  • airflow_config.simple_auth_manager_all_admins = true — users signed in through IAP land straight on the Airflow UI.

### Prerequisites (one-time manual)

  1. GCP Console → APIs & Services → OAuth consent screen. For Workspace-owned projects pick Internal; for standalone projects pick External. App name, support email, developer contact. The IAP brand is auto-created.
  2. gcloud iap oauth-brands list --project=<id> --format='value(name)' → paste into iap_brand_name in the runtime tfvars.
  3. Delegate ${domain} NS records to Google from the apex registrar (GoDaddy in our case). Fetch nameservers with terraform output -raw dns_zone_nameservers on dev-01-base.

### Design decisions

  • Gateway API over classic Ingress. Enables shared IP + wildcard DNS + per-app ingress (classic Ingress pins one GCLB per Ingress, cannot share).
  • Wildcard Certificate Manager cert (DNS-01). Covers every *.${domain} subdomain. DNS-01 against our own zone activates in minutes. ManagedCertificate CRD is HTTP-01-only and doesn't support wildcards.
  • Three-layer split. 01-base pure GCP; 02-k8s-base k8s-platform singletons (Gateway today, Prometheus/CSI in Phase 2); 03-runtime apps. DNS in base keeps the k8s providers out of the base plan.
  • IAP at GCLB over Airflow-native OIDC. Airflow 3 pluggable auth would require the FAB provider + custom image. IAP is zero-image-change and aligns with DataHub's future auth.
  • Per-user IAP allow-list, unconditional binding. IAM conditions do not propagate to IAP's authorization path for Gateway-API backends — tried and rejected. Tight scoping via the allow-list.
  • simple_auth_manager_all_admins = true with auth_manager pinned to SimpleAuthManager. One login (IAP) is enough; the module pins both configs together to avoid FAB/Simple conflicts and also disables createUserJob.
  • IAP brand stays manual. google_iap_brand doesn't work for non-Workspace projects and the IAP OAuth Admin API is being phased out. Stack accepts the brand as an input.
  • Orthogonal module boundaries. iap-oauth is per-service (reused by DataHub in Phase 2). Gateway sits inline in dev-02-k8s-base for now (extract into modules/gke-gateway/ when prod replicates).

### What to verify

  • terraform fmt -check -recursive + terraform validate pass across all changed stacks and modules.
  • DNS: dig NS umedev.marpont.es @8.8.8.8 returns 4 Google nameservers.
  • Cert: gcloud certificate-manager certificates describe ume-data-dev-wildcard --location=global reaches state: ACTIVE.
  • Gateway: kubectl get gateway -n ume-data-dev-gateway shows PROGRAMMED=True.
  • HTTPRoute: kubectl get httproute -n airflow shows airflow accepted (bound to https section).
  • BackendPolicy: kubectl describe gcpbackendpolicy airflow-api-server-iap -n airflow shows Type: Attached, Status: True.
  • Backend service: gcloud compute backend-services list --format='table(name,iap.enabled)' shows iap.enabled = True on gkegw1-…-airflow-api-server-….
  • IAM: three user bindings on roles/iap.httpsResourceAccessor, unconditional.
  • HTTP redirect: curl -sI http://airflow.umedev.marpont.es/ → 301 to https.
  • IAP: curl -sI https://airflow.umedev.marpont.es/ → 302 to accounts.google.com/o/oauth2/v2/auth?client_id=....
  • Browser sign-in as allow-listed user lands on the Airflow UI with no second login.

### Then

Story 4d adds the custom Airflow image with Cosmos and dbt.


## Story 4d — Custom Airflow Image + Cosmos/dbt

Location: now in the ume-data-dags repo (docker/, scripts/, .github/workflows/image.yml, .github/workflows/bot-pr.yml). In this repo: the wait-for-image gate in .github/workflows/terraform-apply.yml and the airflow_image_tag line in environments/dev-03-runtime/terraform.tfvars. Agent: airflow-dags (image + requirements, in ume-data-dags) + infra-terraform (bootstrap SA + WIF, tfvars plumbing, in ume-data-infra) Depends on: Story 4c (Airflow running with ingress + auth)

Spec rewritten 2026-04-18 to match what actually shipped. Original spec targeted apache/airflow:2.10.3 and environments/dev-02-runtime/; Story 4b deployed Airflow 3.2.0 and Story 4c renamed the runtime stack to dev-03-runtime. The 4d base image must extend the deployed 3.2.0 image. Content was initially scaffolded under resources/ in this repo and moved to ume-data-dags once validated.

### What to build

Custom Docker image (ume-data-dags/docker/):

  • Dockerfile extending apache/airflow:3.2.0. Installs astronomer-cosmos~=1.14 in the Airflow Python env (constrained) and dbt-core~=1.9 + dbt-bigquery~=1.9 in an isolated /home/airflow/dbt-venv/ (required because Airflow 3.2's constraints clash with dbt-core on pathspec/protobuf).
  • Build-time guardrails (which dbt, import cosmos, FAB-provider check) fail fast on drift.
  • scripts/build-image.sh — local build helper that tags with the same 3.2.0-<sha> convention as CI.

CI workflows (in ume-data-dags):

  • .github/workflows/image.yml — builds + pushes 3.2.0-<sha> on merge to main when docker/ changes.
  • .github/workflows/dag-sync.yml — gcloud storage rsyncs dags/ + dbt/ to the bucket on merge when those paths change.
  • .github/workflows/pr-ci.yml — PR lint (hadolint + python -m py_compile + dbt parse); no GCP auth needed.
  • .github/workflows/bot-pr.yml — after image.yml succeeds on main, uses INFRA_PR_TOKEN (fine-grained PAT scoped to ume-data-infra only) to open a tfvars-bump PR on this repo.

CI workflows (in ume-data-infra):

  • .github/workflows/terraform-apply.yml — wait-for-image gate before the runtime apply (15-min poll) so Helm never starts a rollout against a missing tag.

Bootstrap and base-layer changes (ume-data-infra):

  • layers/00-bootstrap/main.tf — docker_config { immutable_tags = true } on the AR repo. A content-push SA (ume-datainfra-content-push) scoped to AR writer on ume-composer-images + WIF bound to 1edata/ume-data-dags. Three narrow custom roles on tf-apply-sa (tfWifProviderUpdater, tfCustomRoleManager, tfArRepoIamAdmin) for self-management of these resource types.
  • environments/dev-01-base/iam.tf — roles/bigquery.jobUser on ume-airflow and ume-airflow-kpo. Without it, dbt-bigquery cannot submit queries (bigquery.jobs.create denied; bigquery.dataEditor does not include it).
  • environments/dev-03-runtime/buckets.tf — bucket-scoped roles/storage.objectAdmin for the content-push SA on the dev DAGs bucket.

Runtime rollouts (continuous, via the bot-PR loop):

  • airflow_image_repository is set to the AR URL once; airflow_image_tag is bumped on every DAGs-repo merge by the bot-PR workflow. Merging the bot-PR triggers terraform-apply's wait-for-image gate, then Helm rolls the pods.

Ownership model:

```text
ume-data-dags repo:
  └── docker/Dockerfile + dbt venv
  └── dags/ + dbt/
  └── CI: build image + push to AR
  └── CI: gcloud storage rsync dags/ + dbt/ → GCS DAGs bucket
  └── CI: open bot-PR against ume-data-infra bumping airflow_image_tag

ume-data-infra repo (this repo):
  └── environments/dev-03-runtime/terraform.tfvars
      └── airflow_image_repository = "us-east1-docker.pkg.dev/.../ume-composer-images/airflow"
      └── airflow_image_tag = "3.2.0-<sha>"  ← bumped by bot-PR
  └── .github/workflows/terraform-apply.yml
      └── wait-for-image gate before Helm rollout
```

Tag format: <airflow-version>-<commit-sha> (e.g., 3.2.0-a1b2c3d). Immutable — AR's docker_config.immutable_tags rejects overwrites.

Rollback: revert airflow_image_tag in tfvars to the previous value and apply. The db_bootstrap Job re-runs with the rolled-back image; airflow db migrate is idempotent but the target image must be known-good (otherwise the init container fails and the bootstrap stays Pending).

### What to verify

  • ume-data-dags's pr-ci.yml green on PR (hadolint, py_compile, dbt parse).
  • On merge in ume-data-dags: image present in AR (gcloud artifacts docker images list us-east1-docker.pkg.dev/poc-ume-data/ume-composer-images).
  • Immutable tags enabled (gcloud artifacts repositories describe ume-composer-images --location=us-east1 --format='value(dockerConfig.immutableTags)' → True).
  • roles/bigquery.jobUser present on both Airflow SAs.
  • Bot-PR opened on this repo, merged, pods restart with new image.
  • astronomer-cosmos importable (≥ 1.14); dbt --version works at /home/airflow/dbt-venv/bin/dbt.
  • No regression of IAP + SimpleAuthManager: browser sign-in at https://airflow.umedev.marpont.es/ lands on the UI with no Airflow-side login.
  • Cosmos execution mode (local) functional — validated in Story 5.

### Then

Story 5 is bundled — ume_dbt_example DAG is already in ume-data-dags/dags/.


## Story 5 — First Cosmos-Powered dbt DAG

Location: ume-data-dags/dags/ and ume-data-dags/dbt/ Agent: airflow-dags (in ume-data-dags) Depends on: Story 4d (custom image with Cosmos + dbt installed) Bundled with: Story 4d — initially shipped together under resources/ in this repo; content moved to ume-data-dags once validated.

### What to build

In ume-data-dags/dbt/:

  • dbt_project.yml with project configuration.
  • profiles.yml configured for BigQuery OAuth using the Airflow SA identity (workload identity, oauth method).
  • models/example/:
    • ume_hello_world.sql — materialized as table, SELECT CURRENT_TIMESTAMP(), message, sentinel.
    • ume_hello_world_downstream.sql — depends on ume_hello_world via {{ ref(...) }}. Having a ref() edge proves Cosmos renders the task-graph with a dependency, not just "dbt ran."
    • schema.yml documenting both.

In ume-data-dags/dags/:

  • cosmos_dbt_dag.py — a Cosmos DAG using local execution mode (ExecutionMode.LOCAL). The DAG renders the dbt project as individual Airflow tasks, each dispatched to the Celery worker. Cosmos copies the project to a per-task tmp directory before invoking dbt, so the read-only GCS FUSE mount is not a problem.
    • dbt_project_path = /opt/airflow/dags/dbt (GCS FUSE mounts the bucket root at /opt/airflow/dags/, so dbt/ is a sibling of dags/).
    • dbt_executable_path = /home/airflow/dbt-venv/bin/dbt (isolated venv; see Story 4d note about Airflow 3.2 constraints vs dbt-core).
    • is_paused_upon_creation = True, schedule = None, default_args with owner, retries=1.

### What to verify

  • DAGs and dbt project synced to GCS bucket: gsutil ls gs://ume-airflow-dags-poc-ume-data/dags/ gs://ume-airflow-dags-poc-ume-data/dbt/
  • Files visible in worker filesystem: kubectl exec deploy/airflow-worker -n airflow -c worker -- ls /opt/airflow/dags/dbt/
  • ume_dbt_example DAG visible in Airflow UI with two dbt-model tasks and a dependency edge (ume_hello_world → ume_hello_world_downstream)
  • Un-pause and trigger the DAG manually — all dbt tasks run successfully
  • bq show --format=prettyjson poc-ume-data:dbt_dev.ume_hello_world and ...ume_hello_world_downstream return expected schemas
  • Airflow task logs show dbt output (both Airflow UI and gs://ume-airflow-logs-poc-ume-data/logs/)
  • Tasks execute on the Celery worker (not scheduler) — verify in task instance details
  • Re-trigger the DAG once; materialized: table replaces tables idempotently (no accidental appends)
  • kubectl top pod -n airflow during the run — worker RSS stays well below the 3 Gi limit

### Then

Phase 1 is complete. The data pipeline (Airflow + dbt + BigQuery) is operational on GKE, and the content pipeline is split into a dedicated ume-data-dags repo. Next steps:

  1. Scope roles/storage.objectAdmin on ume-airflow SA to specific buckets (see backlog).
  2. Extend the DAGs repo workflows to cover prod when the prod project is provisioned (matrix or split workflow files).
  3. Begin Phase 2 (DataHub) when priorities allow.

## Phase 2 — DataHub & Additional Infrastructure

Phase 2 adds DataHub to the existing GKE cluster with Strimzi Kafka and self-hosted OpenSearch as backing services. The GKE cluster, VPC, shared Cloud SQL instance, Gateway, wildcard cert, and IAP brand from Phase 1 are all reused.

Master plan: plans/datahub-deployment-plan.md — read this first. It covers architecture decisions, node-pool strategy, disk sizing, alerting, and the per-story execution strategy (one autonomous session per story, restricted profile).

Each story below is sized for one session. Specs below are the implementation contract; design rationale lives in the master plan.


## Story 6 — Workload Pool + DataHub SQL + Password Secret

Stack: environments/dev-01-base/ (update) + layers/00-bootstrap/ (CI IAM coverage, if needed) Agent: infra-terraform Depends on: Phase 1 complete

### What to build

Node pool (environments/dev-01-base/terraform.tfvars):

Add a new entry to gke_node_pools:

```hcl
workload-pool = {
  machine_type = "e2-standard-4"
  min_count    = 1
  max_count    = 4
  spot         = false
  extra_labels = { pool = "workload" }
  # No taint — workload selector (pool=workload) is enough.
}
```

DataHub database + user + password (environments/dev-01-base/cloud-sql.tf):

  • google_sql_database.datahub — name = "datahub", instance = module.airflow_sql.instance_name.
  • random_password.datahub_db — length 32, special=false (avoids JDBC URL-encoding traps).
  • google_sql_user.datahub — type = BUILT_IN (password auth), name = "datahub", password = random_password.datahub_db.result.
  • google_secret_manager_secret.datahub_db_password — secret_id = "ume-data-dev-datahub-db-password", automatic replication.
  • google_secret_manager_secret_version.datahub_db_password_v1 — secret_data = random_password.datahub_db.result.
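
The same list as a sketch (replication syntax assumes a current google provider):

```hcl
resource "google_sql_database" "datahub" {
  name     = "datahub"
  instance = module.airflow_sql.instance_name
}

resource "random_password" "datahub_db" {
  length  = 32
  special = false # JDBC URLs choke on unescaped special characters
}

resource "google_sql_user" "datahub" {
  instance = module.airflow_sql.instance_name
  name     = "datahub"
  type     = "BUILT_IN"
  password = random_password.datahub_db.result
}

resource "google_secret_manager_secret" "datahub_db_password" {
  secret_id = "ume-data-dev-datahub-db-password"
  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "datahub_db_password_v1" {
  secret      = google_secret_manager_secret.datahub_db_password.id
  secret_data = random_password.datahub_db.result
}
```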

Outputs (environments/dev-01-base/outputs.tf):

  • datahub_db_name = "datahub"
  • datahub_db_user = google_sql_user.datahub.name
  • datahub_db_host = module.airflow_sql.private_ip_address
  • datahub_db_password_secret_id = google_secret_manager_secret.datahub_db_password.secret_id

Cloud Monitoring alert (environments/dev-02-k8s-base/alerts.tf — new file):

  • Policy "Cloud SQL disk > 75%" on metric cloudsql.googleapis.com/database/disk/utilization, filter instance ume-data-dev-airflow-pg, threshold 0.75, duration 10m.

Bootstrap CI IAM check (invariant #11):

Verify tf-plan-sa can read google_secret_manager_secret_version data sources (needed downstream by Story 11's Helm release). The existing tfK8sSecretsReader role (Story 4b era) covers secretmanager.versions.* — confirm during planning; add a custom role if a gap is found.

### Design decisions

Canonical in plans/datahub-deployment-plan.md §1, §2, §3, §5. Key points:

  • Shared SQL instance, not a new one. Saves ~$26/mo; dev workload fits.
  • Password auth, not IAM auth. Skips 5 Cloud SQL Auth Proxy sidecars in DataHub pods.
  • Secret Manager (not plaintext in Helm values). DataHub pods mount via Secrets Store CSI (Story 7 + Story 11).
  • workload-pool distinct from default-pool. Stateful workloads on their own nodes.
  • min=1 with soft anti-affinity for Kafka/OS pods — cold-start fits one node, scales out as needed.

# What to verify

  • terraform fmt -check -recursive + validate pass.
  • After CI apply: gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b shows workload-pool with locations=us-east1-b, machineType=e2-standard-4.
  • gcloud sql databases list --instance=ume-data-dev-airflow-pg shows datahub.
  • gcloud sql users list --instance=ume-data-dev-airflow-pg shows datahub user (type BUILT_IN).
  • gcloud secrets versions list ume-data-dev-datahub-db-password returns exactly one version.
  • gcloud alpha monitoring policies list shows the Cloud SQL disk policy.

### Then

Story 7 installs the Secret Manager CSI driver.


## Story 7 — Secrets Store CSI Driver

Stack: environments/dev-02-k8s-base/ Agent: infra-terraform Depends on: Story 6

### What to build

environments/dev-02-k8s-base/secrets-store-csi.tf (new file):

  • helm_release.secrets_store_csi_driver — chart secrets-store-csi-driver from https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts, namespace kube-system, pinned chart version (verify latest at story time).
  • helm_release.secrets_store_csi_driver_gcp — chart secrets-store-csi-driver-provider-gcp from https://googlecloudplatform.github.io/secrets-store-csi-driver-provider-gcp, namespace kube-system, pinned chart version.
  • Values: syncSecret.enabled = true on the base driver (so mounted secrets can also be synced to native k8s Secrets — DataHub's chart expects env-var refs to k8s Secrets, not file paths).

Outputs: none needed (driver exposes cluster-wide SecretProviderClass CRD).

# Design decisions

  • kube-system namespace. The driver is a DaemonSet that must run on every node pool; standard convention places it in kube-system.
  • GCP provider alongside the base driver. The base driver is generic; the GCP provider is the Secret Manager plugin. Both are required.
  • syncSecret.enabled = true. DataHub's Helm chart and most upstream charts consume passwords via env.valueFrom.secretKeyRef, which requires a k8s Secret object. Sync mode creates one from the CSI mount.
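
For orientation ahead of Story 11, a hedged sketch of the SecretProviderClass shape the GCP provider consumes, with secretObjects doing the sync (names and the datahub namespace are assumptions):

```hcl
resource "kubernetes_manifest" "datahub_db_password" {
  manifest = {
    apiVersion = "secrets-store.csi.x-k8s.io/v1"
    kind       = "SecretProviderClass"
    metadata   = { name = "datahub-db", namespace = "datahub" }
    spec = {
      provider = "gcp"
      parameters = {
        # The GCP provider takes its secret list as an embedded YAML string.
        secrets = yamlencode([{
          resourceName = "projects/poc-ume-data/secrets/ume-data-dev-datahub-db-password/versions/latest"
          path         = "password"
        }])
      }
      # syncSecret: materialize the mounted file as a native k8s Secret,
      # so charts can use env.valueFrom.secretKeyRef.
      secretObjects = [{
        secretName = "datahub-db"
        type       = "Opaque"
        data       = [{ objectName = "password", key = "password" }]
      }]
    }
  }
}
```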

### What to verify

  • kubectl -n kube-system get pods -l app=secrets-store-csi-driver all Running.
  • kubectl get crd secretproviderclasses.secrets-store.csi.x-k8s.io exists.
  • kubectl -n kube-system get pods -l app=csi-secrets-store-provider-gcp all Running.

### Then

Story 8 installs the Strimzi operator.


## Story 8 — Strimzi Kafka Operator

Stack: environments/dev-02-k8s-base/ Agent: infra-terraform Depends on: Story 6 (workload-pool exists; the operator itself can run anywhere, but the Kafka clusters it watches are pinned to that pool)

### What to build

environments/dev-02-k8s-base/strimzi.tf (new file):

  • kubernetes_namespace_v1.strimzi_system — strimzi-system namespace with common labels.
  • helm_release.strimzi_kafka_operator — chart strimzi-kafka-operator from https://strimzi.io/charts/, pinned chart version (verify latest at story time).
  • Values:
    • watchAnyNamespace: true — cluster-wide watch.
    • resources.requests: { cpu: 200m, memory: 384Mi } — operator itself is small.
    • nodeSelector: { pool: workload } — pin operator to workload-pool.
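
Sketched as HCL (chart version is a placeholder, to pin at story time):

```hcl
resource "helm_release" "strimzi_kafka_operator" {
  name       = "strimzi-kafka-operator"
  repository = "https://strimzi.io/charts/"
  chart      = "strimzi-kafka-operator"
  namespace  = kubernetes_namespace_v1.strimzi_system.metadata[0].name
  version    = "0.40.0" # placeholder; verify latest at story time

  values = [yamlencode({
    watchAnyNamespace = true # cluster-wide watch
    resources = {
      requests = { cpu = "200m", memory = "384Mi" }
    }
    nodeSelector = { pool = "workload" } # keep operator off default-pool
  })]
}
```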

# Design decisions

  • Cluster-wide watch. Matches our shared-Gateway pattern — one operator, many namespaces possible later.
  • Operator on workload-pool. Keeps default-pool free of operator pods.
  • No Kafka CR yet. That's Story 9. Keeping the operator install in its own PR means any CRD/operator upgrade can be rolled back cleanly.

# What to verify

  • kubectl -n strimzi-system get pods shows operator Running.
  • CRDs installed: kubectl get crd | grep strimzi.io lists kafkas, kafkanodepools, kafkatopics, kafkausers.
  • Operator scheduled on workload-pool: kubectl -n strimzi-system get pods -o wide → node has label pool=workload.

# Then

Story 9 provisions the Kafka cluster.


# Story 9 — Kafka Cluster (KRaft, 3 Controllers + 2 Brokers)

Stack: environments/dev-03-runtime/ + new modules/strimzi-kafka/ Agent: infra-terraform Depends on: Story 8

# What to build

modules/strimzi-kafka/ (new):

  • main.tf — namespace + KafkaNodePool (controllers) + KafkaNodePool (brokers) + Kafka CR via kubernetes_manifest.
  • variables.tf — namespace, cluster_name, kafka_version, controller_replicas (default 3), controller_memory (default 256Mi), controller_storage_size (default 1Gi), broker_replicas (default 2), broker_memory (default 1.5Gi), broker_cpu (default 500m), broker_storage_size (default 10Gi), broker_storage_class (default premium-rwo), log_retention_hours (default 72), log_retention_bytes (default 8589934592 = 8 GiB), min_insync_replicas (default 1), node_selector (default { pool = "workload" }).
  • outputs.tf — bootstrap_servers (= <cluster_name>-kafka-bootstrap.<namespace>.svc:9092), namespace, cluster_name.

environments/dev-03-runtime/kafka.tf (new file):

  • module "kafka" call with defaults; cluster_name = "ume-data-dev-kafka", namespace = "kafka".

Alert (environments/dev-02-k8s-base/alerts.tf):

  • Policy "Kafka broker PV > 70%" on metric kubernetes.io/node/persistentvolume/volume/used_bytes / capacity_bytes filter namespace kafka.

# Design decisions

Canonical in plans/datahub-deployment-plan.md §4, §5.

  • KRaft, not ZooKeeper. Strimzi 0.38+ supports KRaft; one fewer moving part.
  • Dedicated controllers. A 2-broker combined-role cluster gives an even-sized KRaft quorum, which can't tolerate any controller loss; 3 tiny dedicated controllers solve it.
  • Retention + size caps together. Time-based retention (72h) + byte-based cap (8 GiB) ensures the PV never fills even under a burst.
  • Soft anti-affinity. preferredDuringSchedulingIgnoredDuringExecution on kubernetes.io/hostname. Lets brokers co-locate when there's only one node; spreads them when the autoscaler adds more (stanza sketched after this list).
  • PD-SSD. Kafka is IOPS-sensitive; pd-balanced is cheaper but can stall during retention sweeps.
  • No Cruise Control. Added to backlog for prod.
  • min.insync.replicas = 1. With RF=2, one broker can be down during rolling upgrade without losing write availability.
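
The anti-affinity decision translates to a pod-template stanza like this inside the broker KafkaNodePool (shape follows Strimzi's template API; shown as an illustrative HCL local, and the label selector is an assumption):

```hcl
# Fragment carried by the broker KafkaNodePool manifest under spec.template.
locals {
  broker_pod_template = {
    pod = {
      affinity = {
        podAntiAffinity = {
          # Preferred, not required: a single node can still host both brokers.
          preferredDuringSchedulingIgnoredDuringExecution = [{
            weight = 100
            podAffinityTerm = {
              topologyKey = "kubernetes.io/hostname"
              labelSelector = {
                matchLabels = { "strimzi.io/cluster" = "ume-data-dev-kafka" }
              }
            }
          }]
        }
      }
    }
  }
}
```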

# What to verify

  • kubectl -n kafka get kafka ume-data-dev-kafka → READY=True.
  • kubectl -n kafka get pods shows 3 -controllers-* and 2 -brokers-* pods Running.
  • Brokers scheduled on workload-pool nodes.
  • kubectl -n kafka get pvc shows 5 PVCs bound (3 controller + 2 broker).
  • Bootstrap service reachable in-cluster: kubectl -n kafka run kcat --rm -it --image=edenhill/kcat:1.7.1 --restart=Never -- -b ume-data-dev-kafka-kafka-bootstrap:9092 -L (metadata listing).
  • PV alert policy exists.

# Then

Story 10 provisions OpenSearch.


# Story 10 — OpenSearch + Snapshots

Stack: environments/dev-02-k8s-base/ (operator) + environments/dev-03-runtime/ (cluster) + environments/dev-01-base/ (snapshot bucket) Agent: infra-terraform Depends on: Story 8 (pattern proven; independent of Kafka at runtime)

# What to build

Snapshot bucket (environments/dev-01-base/buckets.tf — new file, or append to an existing one):

  • Module call to modules/gcs-bucket/ for ume-opensearch-snapshots-poc-ume-data:
    • versioning = false
    • Lifecycle: delete objects older than 35 days.
    • Expose in outputs as opensearch_snapshots_bucket.

OpenSearch GSA (environments/dev-01-base/iam.tf):

  • google_service_account.opensearch_snapshot — ume-opensearch-snapshot.
  • Bucket-scoped roles/storage.objectAdmin on the snapshot bucket.
  • Workload Identity binding: opensearch/opensearch-snapshot KSA → ume-opensearch-snapshot GSA.
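
The identity wiring, sketched in HCL (var.project_id is assumed to exist in the stack):

```hcl
resource "google_service_account" "opensearch_snapshot" {
  account_id   = "ume-opensearch-snapshot"
  display_name = "OpenSearch GCS snapshot writer"
}

# Bucket-scoped, not project-scoped: objectAdmin on the snapshot bucket only.
resource "google_storage_bucket_iam_member" "opensearch_snapshot_writer" {
  bucket = "ume-opensearch-snapshots-poc-ume-data"
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.opensearch_snapshot.email}"
}

# Workload Identity: let the opensearch/opensearch-snapshot KSA impersonate the GSA.
resource "google_service_account_iam_member" "opensearch_snapshot_wi" {
  service_account_id = google_service_account.opensearch_snapshot.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[opensearch/opensearch-snapshot]"
}
```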

Operator (environments/dev-02-k8s-base/opensearch.tf — new file):

  • kubernetes_namespace_v1.opensearch_operator — opensearch-operator namespace.
  • helm_release.opensearch_operator — chart opensearch-operator from https://opensearch-project.github.io/opensearch-k8s-operator/, pinned chart version.
  • Values: operator pinned to workload-pool.

Cluster (environments/dev-03-runtime/opensearch.tf — new file):

  • kubernetes_namespace_v1.opensearch — opensearch namespace.
  • kubernetes_service_account_v1.opensearch_snapshot — with WI annotation.
  • OpenSearchCluster CR via kubernetes_manifest:
    • 1 data node (also master-eligible), 512Mi JVM heap, 1 CPU, 1.5Gi memory request.
    • 5 GiB PD-SSD storage.
    • nodeSelector: { pool: workload }.
    • Security plugin disabled (dev only; Story 13 hardens with basic auth or mTLS).
  • SecretProviderClass (CSI) — mounts the bucket name (plain config, not a secret; optional, a direct env var works too).
  • kubernetes_manifest ISM policy (JSON CRD) — delete indices > 30 days.
  • kubernetes_cron_job_v1.opensearch_snapshot — daily at 04:00 UTC, runs curl -XPUT opensearch-cluster/_snapshot/gcs_backup/$(date +%Y%m%d). Uses opensearch-snapshot KSA.
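
A sketch of that CronJob (the curl image and endpoint are assumptions; the snapshot repo name matches the story):

```hcl
resource "kubernetes_cron_job_v1" "opensearch_snapshot" {
  metadata {
    name      = "opensearch-snapshot"
    namespace = kubernetes_namespace_v1.opensearch.metadata[0].name
  }
  spec {
    schedule = "0 4 * * *" # daily at 04:00 UTC
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            service_account_name = kubernetes_service_account_v1.opensearch_snapshot.metadata[0].name
            restart_policy       = "OnFailure"
            container {
              name  = "snapshot"
              image = "curlimages/curl:8.7.1" # assumed utility image
              command = ["sh", "-c",
                "curl -sf -XPUT http://opensearch-cluster.opensearch.svc:9200/_snapshot/gcs_backup/$(date +%Y%m%d)"
              ]
            }
          }
        }
      }
    }
  }
}
```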

Alert (environments/dev-02-k8s-base/alerts.tf):

  • Policy "OpenSearch PV > 70%" (namespace opensearch).

# Design decisions

  • Single data node in dev. 3-node minimum is a prod concern; dev can take unassigned-shard risk. Snapshots provide the durability backstop.
  • OpenSearch 2.x. DataHub supports both ES 7.10+ and OS 2.x; OS has no license friction.
  • GCS snapshots over cross-zone replication. Cheaper, simpler, and the ops story is clear (restore from snapshot).
  • ISM + bucket lifecycle both. Indices deleted at 30 days inside OS; snapshots deleted at 35 days in GCS. Always a 5-day overlap for recovery.
  • Security plugin off in dev. Keeps the story small. Story 13 re-evaluates.

# What to verify

  • kubectl -n opensearch-operator get pods shows operator Running.
  • kubectl -n opensearch get opensearchcluster → READY.
  • kubectl -n opensearch get pods shows 1 data node Running on workload-pool.
  • gsutil ls gs://ume-opensearch-snapshots-poc-ume-data/ (may be empty before first run).
  • First CronJob run logs show a successful snapshot API call.
  • ISM policy exists: kubectl -n opensearch get opensearchismpolicy.

# Then

Story 11 deploys DataHub.


# Story 11 — DataHub Dry-Run

Stack: environments/dev-03-runtime/ + new modules/datahub-helm/ Agent: datahub-platform Depends on: Stories 6, 7, 9, 10

# What to build

modules/datahub-helm/ (new):

  • Wraps the upstream acryldata/datahub chart. Verify latest chart version at story time (the verify_versions invariant).
  • main.tf — namespace + KSA (no WI binding yet; ingestion adds it) + SecretProviderClass (CSI, syncs Secret Manager datahub-db-password → k8s Secret) + helm_release.
  • Helm values set via module:
    • datahub-gms.replicaCount, datahub-frontend.replicaCount, datahub-mae-consumer.replicaCount, datahub-mce-consumer.replicaCount = 1 each.
    • All pod nodeSelector: { pool: workload }.
    • global.sql.datasource:
      • host: <sql_private_ip>
      • hostForMysqlClient: <sql_private_ip> (chart quirk; still set for postgres paths).
      • port: 5432
      • database: datahub
      • url: jdbc:postgresql://<ip>:5432/datahub
      • driver: org.postgresql.Driver
      • username: datahub
      • extraEnvs: [{ name: DATAHUB_DB_PASSWORD, valueFrom: { secretKeyRef: { name: datahub-db-password, key: password } } }]
    • global.kafka.bootstrap.server: <kafka.bootstrap_servers>.
    • global.elasticsearch.host: opensearch-cluster.opensearch.svc, port: 9200, useSSL: false, skipcheck: true (disables X-Pack check since OS isn't ES).
    • elasticsearchSetupJob.enabled: true — creates DataHub indices.
    • kafkaSetupJob.enabled: true — creates DataHub topics.
  • variables.tf — all knobs exposed (replicas, resources, versions, backing endpoints).
  • outputs.tf — namespace, release_name, frontend_service_name.
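
Condensing the values above into the module's release (a sketch: var.* names are illustrative, and chart repo/key nesting should be confirmed against the upstream chart at story time):

```hcl
# modules/datahub-helm/main.tf — sketch of the values wiring.
resource "helm_release" "datahub" {
  name       = "datahub"
  repository = "https://helm.datahubproject.io/" # assumed upstream repo
  chart      = "datahub"
  namespace  = kubernetes_namespace_v1.datahub.metadata[0].name
  version    = var.chart_version # verify_versions invariant

  values = [yamlencode({
    global = {
      sql = {
        datasource = {
          host               = var.sql_private_ip
          hostForMysqlClient = var.sql_private_ip # chart quirk; set even on postgres
          port               = "5432"
          database           = "datahub"
          url                = "jdbc:postgresql://${var.sql_private_ip}:5432/datahub"
          driver             = "org.postgresql.Driver"
          username           = "datahub"
          extraEnvs = [{
            name      = "DATAHUB_DB_PASSWORD"
            valueFrom = { secretKeyRef = { name = "datahub-db-password", key = "password" } }
          }]
        }
      }
      kafka = { bootstrap = { server = var.kafka_bootstrap_servers } }
      elasticsearch = {
        host      = "opensearch-cluster.opensearch.svc"
        port      = "9200"
        useSSL    = "false"
        skipcheck = "true" # OS 2.x, not ES: skip the X-Pack check
      }
    }
  })]
}
```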

environments/dev-03-runtime/datahub.tf (new):

  • module "datahub" call wiring remote_state refs from dev-01-base (SQL) and reading Kafka/OpenSearch service DNS directly (same cluster, well-known names).

environments/dev-03-runtime/data.tf — add outputs passthrough if needed.

No IAP yet. Verify via kubectl port-forward svc/datahub-frontend 9002:9002 -n datahub.

# Design decisions

Canonical in plans/datahub-deployment-plan.md §7.

  • Module over inline. Env-scoped resource, replicates to prod.
  • CSI-synced k8s Secret for DB password. DataHub chart expects secretKeyRef; syncSecret fills it from Secret Manager.
  • Port-forward verification step. No ingress wiring yet — Story 12 adds it. Keeps each PR small.
  • elasticsearch.skipcheck: true — required when pointing DataHub at OpenSearch 2.x (X-Pack check fails otherwise).
  • No KSA → GSA WI binding yet. DataHub GMS does not make GCP API calls; ingestion recipes (in ume-data-dags) do. Adding the binding here would grant permissions nothing uses.

# What to verify

  • kubectl -n datahub get pods shows all DataHub pods Running, setup jobs Completed.
  • kubectl -n datahub logs deploy/datahub-gms shows successful SQL connection, Kafka producer connected, OpenSearch client initialized.
  • kubectl port-forward -n datahub svc/datahub-frontend 9002:9002 + browser http://localhost:9002 loads the UI.
  • datahub DB schema populated: a direct check (gcloud sql connect ume-data-dev-airflow-pg --database=datahub --user=datahub, then \dt) is prohibited per session rules, so verify via GMS logs instead.
  • Kafka topics created: kubectl exec -n kafka ume-data-dev-kafka-brokers-0 -- bin/kafka-topics.sh --list --bootstrap-server localhost:9092 lists MetadataChangeLog_Versioned_v1 etc.
  • OpenSearch indices created: visit /_cat/indices via port-forward.

# Then

Story 12 wires IAP and public ingress.


# Story 12 — DataHub IAP + HTTPRoute + OIDC Auth

Stack: environments/dev-03-runtime/ (update) + small modules/datahub-helm/ addition Agent: datahub-platform Depends on: Story 11 Status: DONE — see story-status.md for the post-mortem. First-admin bootstrap still manual (local datahub JAAS user); groups/policies-as-code lands in Story 13.

# What to build

modules/datahub-helm/ — add:

  • httproute_enabled, gateway_name, gateway_namespace, hostname variables (match modules/airflow-helm/ surface).
  • Optional HTTPRoute resource attached to datahub-frontend Service on :9002.
  • DataHub OIDC values passthrough (see "DataHub OIDC" below).

environments/dev-03-runtime/datahub.tf — extend module call with HTTPRoute params + an iap-oauth module call:

  • module "datahub_iap" (new, uses modules/iap-oauth/):
    • service_name = "datahub-frontend"
    • namespace = "datahub"
    • allowed_users = var.iap_allowed_users (same list as Airflow initially).

environments/dev-03-runtime/terraform.tfvars — add datahub_subdomain = "datahub".

DataHub OIDC (in-app identity, not the perimeter):

IAP alone collapses to "all-admin or all-reader" — it can't express per-user / per-dataset stewardship. Keep IAP as the perimeter (who can reach the host) and layer DataHub OIDC inside it for in-app identity + roles.

  • Separate OAuth client from the IAP client, created on the same GCP OAuth consent screen. clientId / clientSecret land in Secret Manager and mount into datahub-frontend via Secrets Store CSI (Story 7 driver).

  • Helm values on the frontend chart:

    • authentication.enabled = true / authentication.provider = oidc
    • oidcAuthentication.discoveryUri = https://accounts.google.com/.well-known/openid-configuration
    • oidcAuthentication.userNameClaim = email
    • oidcAuthentication.scopes = "openid profile email"
    • oidcAuthentication.extractGroupsEnabled = false (Phase 1 — see "Phased migration" below).
  • JIT user provisioning is on by default; a new Google account landing through IAP becomes a DataHub user record on first login.
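
The passthrough could be appended to the chart values like this (illustrative local; the exact nesting under the frontend subchart is an assumption to confirm at story time):

```hcl
# Frontend OIDC values; clientId / clientSecret arrive via the Story 7
# CSI-synced k8s Secret, never as plaintext values.
locals {
  datahub_oidc_values = yamlencode({
    "datahub-frontend" = {
      authentication = { enabled = true, provider = "oidc" }
      oidcAuthentication = {
        discoveryUri         = "https://accounts.google.com/.well-known/openid-configuration"
        userNameClaim        = "email"
        scopes               = "openid profile email"
        extractGroupsEnabled = false # Phase 1; flip to true + set groupsClaimName in Phase 2
      }
    }
  })
}
```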

DataHub groups + policies bootstrap (idempotent, driven from a DataHub policies-as-code file checked into this repo or ume-data-dags — final home decided in Story 13 alongside ingestion recipes):

Groups (pre-create, membership managed by admins until Phase 2):

  • platform-admins
  • data-stewards (per-domain children: finance-stewards, marketing-stewards, …)
  • viewers

Domains: one per business area. Each domain has an owner group from data-stewards. Datasets join a domain via ingestion metadata (dbt tags / BigQuery labels / source-system owners surfaced through the recipe).

Policies (all bound to groups, never user URNs — see design decisions):

  • Platform: platform-admins → Admin role.
  • Platform: data-stewards → Editor role.
  • Platform: viewers → Reader (or rely on the Reader default).
  • Platform: finance-stewards → Manage Domain scoped to urn:li:domain:finance (templated per-domain via for_each).
  • Metadata: per-domain "edit metadata where domain=…", bound to the matching steward group.
  • Platform: ingestion SA (Airflow) → Manage Ingestion Sources + Manage Secrets. Runs unattended; no human role.
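
The per-domain templating mentioned above could look like this (illustrative locals only: the privilege identifier is assumed, and the apply mechanism, CLI or API, is decided in Story 13):

```hcl
# Generate one scoped "Manage Domain" policy definition per business domain.
locals {
  domains = toset(["finance", "marketing"])

  manage_domain_policies = {
    for d in local.domains : d => {
      name       = "manage-domain-${d}"
      actors     = { groups = ["${d}-stewards"] } # group-bound, never user URNs
      privileges = ["MANAGE_DOMAINS"]             # assumed privilege identifier
      resource   = "urn:li:domain:${d}"
    }
  }
}
```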

# Access control model

Stewardship on a specific dataset is Ownership of that entity with ownershipType = DATA_STEWARD — DataHub has no global "Steward" role. The global Editor role just gates who can propose edits at all; ownership gates which assets they can touch.

| Actor | Setup | Can do |
| --- | --- | --- |
| Platform admin | platform-admins group → Admin role | all policies, ingestion, user/group/domain admin |
| Ingestion SA (Airflow) | platform policy: Manage Ingestion + Manage Secrets | trigger ingestion runs, create/update recipes |
| Data steward (per domain) | Editor role + Owner of a Domain | edit tags / terms / documentation on their domain's assets; propose changes elsewhere |
| Domain owner | scoped "Manage Domain" policy | add/remove owners inside their domain without being a platform admin |
| Viewer | Reader (default) | browse, search, read |

Ownership on a dataset is assigned by (a) admins via UI, (b) domain owners within their scope, (c) ingestion recipes carrying owners metadata. (c) is the scalable path — don't expect to click-assign owners on hundreds of datasets.

# Phased migration to Workspace groups

Google's accounts.google.com OIDC issuer has a fixed claim set — no custom per-user claims, and no groups claim outside Workspace. Plan around that.

Phase 1 (now, no Workspace access):

  • DataHub OIDC → Google; user identity via email.
  • Admins manually add users to the Phase-1 groups on first login.
  • Policies + domains + ownerships already bind to groups, so none of the Phase-1 work is throwaway.

Phase 2 (Workspace access returned):

  • Recreate the same group names as Google Groups under the Workspace domain (platform-admins@…, finance-stewards@…, …).

  • Flip oidcAuthentication.extractGroupsEnabled = true and set oidcAuthentication.groupsClaimName = groups. DataHub syncs group membership on each login.

  • Optional cleanup: remove the Phase-1 manual group memberships (dual membership is harmless during transition).

  • No policy rewrites — because nothing binds to user URNs.

# Design decisions

  • Reuse modules/iap-oauth/ verbatim for the perimeter. Confirmed working for Airflow; parameterized per service.
  • Same IAP allow-list initially. Expand in tfvars when needed.
  • Wildcard cert already covers datahub.umedev.marpont.es. No Certificate Manager changes.
  • IAP at perimeter + DataHub OIDC inside. IAP alone is binary; DataHub's role+policy+ownership layer does per-user and per-dataset work.
  • Bind every policy to a group, never to a user URN. Phase-2 migration to Workspace groups becomes a rename, not a rewrite.
  • Stewardship = Ownership on entity + Editor role, not a global role. Matches DataHub's data model and makes domain-based delegation natural.
  • Policies as code, not click-ops. The group/domain/policy bootstrap lives in a checked-in config so Phase 2 and prod rebuilds are deterministic. Exact location (this repo vs ume-data-dags) decided in Story 13 when ingestion recipes land.

# What to verify

  • kubectl -n datahub get httproute shows datahub accepted.
  • kubectl -n datahub describe gcpbackendpolicy datahub-frontend-iap → Attached.
  • gcloud compute backend-services list --format='table(name,iap.enabled)' shows iap.enabled = True on the DataHub backend.
  • curl -sI http://datahub.umedev.marpont.es/ → 301 to https.
  • curl -sI https://datahub.umedev.marpont.es/ → 302 to accounts.google.com.
  • Browser sign-in as allow-listed user lands on DataHub UI.
  • DataHub /login shows "Sign in with Google" after OIDC config applies.
  • A non-allowlisted Google account hits IAP 403 before reaching DataHub (perimeter works independently of DataHub OIDC).
  • Allowlisted user signs in → DataHub user record auto-created with their email as userName.
  • Admin user can reach /settings/policies and create a policy; non-admin user gets 403 on the same path.
  • Steward user can edit a tag on a dataset inside their domain; cannot edit a tag on a dataset outside it.

# Then

Story 13 hardens cost + ops and finalizes where the policies-as-code bootstrap lives.


# Story 13 — Cost + Operations Hardening

Stacks: all dev stacks + ingestion cross-repo coordination Agent: infra-terraform + docs-infra Depends on: Story 12

# What to build

  • Label audit across all Terraform-managed resources (fail CI if labels missing).
  • Budget alerts at 50 / 80 / 100% of target in Cloud Billing.
  • PDB verification: simulate a node drain on workload-pool, confirm DataHub, Kafka, OpenSearch survive.
  • Maintenance window verification on the GKE cluster and Cloud SQL instance.
  • Ingestion DAGs added to ume-data-dags (BigQuery, Airflow, dbt) — cross-repo work, tracked here as coordination.
  • Runbook drill: at least one end-to-end scenario (e.g. Kafka broker restart, OpenSearch snapshot restore).
  • Consider re-enabling OpenSearch security plugin with basic auth backed by Secret Manager CSI.
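
For the budget-alert item above, a sketch with google_billing_budget (billing account, project number, and the target amount are assumptions):

```hcl
resource "google_billing_budget" "dev" {
  billing_account = var.billing_account_id # assumed variable
  display_name    = "ume-data-dev monthly"

  budget_filter {
    projects = ["projects/${var.project_number}"] # assumed variable
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "200" # placeholder target; set to the agreed Phase 2 figure
    }
  }

  # 50 / 80 / 100% of target, per the story.
  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }
}
```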

# What to verify

  • CI label-lint passes on every stack.
  • Budget alert emails received at 50%.
  • kubectl drain <workload-pool-node> — no DataHub/Kafka/OS service disruption.
  • Runbook entry for at least one end-to-end recovery scenario merged.

# Monthly Cost Summary

# Phase 1 — Airflow only (~$81/mo)

| Resource | Spec | Monthly |
| --- | --- | --- |
| GKE cluster mgmt | Free tier (zonal) | $0 |
| default-pool | 1x e2-standard-2 | $49 |
| kpo-pool | 0 nodes idle; spot when active | $0 idle |
| Cloud SQL | db-g1-small | $26 |
| Cloud SQL storage | 10 GiB SSD | $2 |
| Boot + PD | 20 GiB pd-balanced | $2 |
| Cloud NAT | 2 NICs min | $2 |
| Total | | ~$81 |

# Phase 2 — Add DataHub (~$105-400/mo incremental, depending on autoscaler)

| Addition | Spec | Monthly |
| --- | --- | --- |
| workload-pool steady state | 1x e2-standard-4 | +$98 |
| workload-pool scaled up | to 4x e2-standard-4 | up to +$392 |
| PD-SSD (Kafka 2×10 GiB + OS 5 GiB + controllers 3×1 GiB) | ~28 GiB | +$6 |
| Cloud SQL — shared instance, no tier bump | same instance | +$0 |
| Snapshot bucket | Standard, 35-day retention | ~$1 |
| Phase 2 incremental (steady) | | ~$105 |
| Phase 2 incremental (worst case) | | ~$400 |

Savings vs the original plan: ~$100/mo, from reusing Cloud SQL, dropping the Auth Proxy sidecars, dev-sizing Kafka (2 brokers vs 3, no Cruise Control), and running single-node OpenSearch.

Note: GKE free tier covers one zonal cluster. Regional cluster in prod costs an additional ~$74/mo.


# After Phase 2

Once all stories are completed and verified on dev:

  1. Review lessons learned. Update docs where reality diverged from plan.
  2. Provision prod GCP projects (externally, by org admin).
  3. Create prod-01-base, prod-02-runtime stacks (mirror dev structure, different terraform.tfvars).
  4. Execute Phase 2 stories against prod, with the GitHub Environment approval gate.
  5. Promote the dev-validated custom Airflow image tag to prod.
  6. Enable DataHub ingestion recipes against prod BigQuery datasets.