# Deployment Stories
This section defines the implementation sequence for wave-1. Each story is designed to be a single PR (or a small set of closely related PRs) that delivers a verifiable outcome.
Stories are ordered by dependency: each story builds on the output of the previous ones. Do not skip ahead.
These stories are designed to be executable by both humans and Claude Code agents. Each story specifies: context, what to build, what to verify, and which agent (if any) should be invoked.
## Phase 1 — Airflow on GKE
Phase 1 provisions a GKE Standard cluster and deploys Airflow via the official Apache Airflow Helm chart with CeleryExecutor + Redis. DataHub and its dependencies (Kafka, OpenSearch) are deferred to Phase 2 — they'll be added to the same cluster.
Why GKE Standard instead of Cloud Composer: Composer 3's minimum dev cost floor is ~$300-400/mo. Airflow on a single e2-standard-2 node + Cloud SQL db-g1-small costs ~$81/mo. The 4-5x cost difference is the primary driver. The operational burden is acceptable because the GKE cluster is already planned for DataHub, and all Phase 1 infrastructure is reused in Phase 2 — no throwaway work.
Content repo: DAG, dbt, and Docker image work lives in a sibling repo, `ume-data-dags`. Merges to main in that repo build + push the custom Airflow image, rsync `dags/` and `dbt/` to the GCS DAGs bucket, and auto-open a tfvars-bump PR on this repo via the `INFRA_PR_TOKEN`-authenticated bot-PR workflow. `ume-data-infra` now only tracks the image tag in `environments/dev-03-runtime/terraform.tfvars`. Initial Phase 1 content was scaffolded here under `resources/` (Stories 4d + 5) and moved out once validated. See story-status.md for the migration record.
### Story 0 — Repository Scaffold
Repo: github.com/1edata/ume-data-infra
Agent: infra-terraform
Status: DONE
Initialize the ume-data-infra repository with directory skeleton, CI workflows, and the bootstrap stack stub. See story-status.md for details.
### Story 1 — Bootstrap
Stack: layers/00-bootstrap/
Agent: infra-terraform
Status: DONE
Terraform state bucket, Artifact Registry, WIF pool + provider, CI service accounts, API enablement. See story-status.md for details.
### Story 2 — Platform Shared (Airflow-focused) → Doc Restructure
Scope: Documentation only (no Terraform resources)
Agent: docs-infra
Status: DONE
#### What happened
Airflow service accounts are environment-scoped, not shared. The Workload Identity bindings reference a specific project's identity pool ({project}.svc.id.goog), and in the multi-project future each project gets its own SAs for its own cluster.
Decision: SA + WI binding creation moved to Story 3c (environments/dev-01-base/). layers/10-platform-shared/ deferred to Phase 2 when cross-environment resources appear (DataHub SA, KMS, logging sink).
#### What this story delivered
- Updated all docs to reflect the restructured SA location
- Fixed SA naming to follow the `ume-{purpose}` convention: `ume-airflow`, `ume-airflow-kpo`
- Fixed KSA naming: `airflow` (not `airflow-scheduler` — the Helm chart applies one KSA to all components)
- Updated inter-stack contracts: `dev-01-base` exports SA emails, `dev-02-runtime` reads from one stack
- Updated Story 3c spec to absorb SA + WI binding creation
#### Design decisions
- SA naming: `ume-airflow` and `ume-airflow-kpo` (follows the `ume-{purpose}` convention from the naming table)
- KSA naming: `airflow` — the Helm chart's `serviceAccount.name` applies to scheduler, worker, webserver, and triggerer. A generic name is accurate.
- SAs belong in `environments/`, not `layers/`: In the multi-project setup, each project has its own SAs for its own cluster. `layers/` is for resources shared across all environments and projects (state bucket, WIF, AR).
- `layers/10-platform-shared/` deferred: No cross-environment resources exist in Phase 1. Created in Story 6 when DataHub work begins.
- `storage.objectAdmin` project-wide for PoC: The log bucket doesn't exist until Story 4. Scope to specific buckets as a hardening task in Story 4.
#### Then
Stories 3a–3d provision networking, Cloud SQL, Airflow IAM, and GKE (one PR each).
### Story 3a — Networking
Stack: environments/dev-01-base/
Agent: infra-terraform
#### What to build
Creates the environments/dev-01-base/ directory with stack scaffolding and networking resources.
Stack scaffolding: versions.tf, variables.tf, outputs.tf, locals.tf, backend.hcl, terraform.tfvars, data.tf
Networking (`networking.tf`):
- VPC `ume-data-dev-vpc` (custom mode, regional routing).
- Subnet `ume-data-dev-gke-nodes` (10.0.0.0/20) with secondary ranges: `gke-pods` (10.4.0.0/14), `gke-services` (10.8.0.0/20) — see the sketch after this list.
- Private Google Access enabled on the subnet (for GCS, AR, Secret Manager, BigQuery API access).
- Static IP `ume-data-dev-nat-ip` for Cloud NAT egress.
- Cloud Router `ume-data-dev-router` + Cloud NAT `ume-data-dev-nat` for outbound internet from GKE nodes (private cluster, no public IPs). NAT applies to all subnets, error-only logging enabled.
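A minimal Terraform sketch of the subnet shape above, assuming the flat resources of the original Story 3a layout (they were later extracted into `modules/vpc/` in Story 3d); names and argument layout are illustrative:

```hcl
# Sketch only: illustrative flat resources; the shipped code was later moved into modules/vpc/.
resource "google_compute_network" "vpc" {
  name                    = "ume-data-dev-vpc"
  auto_create_subnetworks = false
  routing_mode            = "REGIONAL"
}

resource "google_compute_subnetwork" "gke_nodes" {
  name                     = "ume-data-dev-gke-nodes"
  region                   = "us-east1"
  network                  = google_compute_network.vpc.id
  ip_cidr_range            = "10.0.0.0/20"
  private_ip_google_access = true # GCS, AR, Secret Manager, BigQuery without public IPs

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```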
Remote state (`data.tf`): `terraform_remote_state` data source reading `00-bootstrap` outputs. Separated from `networking.tf` because it is a stack-level concern shared by Stories 3b-3d.
#### Design decisions
- Direct resources (modularized in Story 3d): Originally used direct resources. Extracted into `modules/vpc/` in Story 3d via `moved` blocks.
- `ume-data-{env}` naming prefix: Changed from `ume-{env}` to avoid generic collisions in shared GCP projects. Updated the naming table in `04-terraform-structure.md`.
- Static NAT IP: Reserved `google_compute_address` for predictable egress. Allows allowlisting by external services.
- `ALL_SUBNETWORKS_ALL_IP_RANGES`: No public subnets planned. Cloud NAT only affects VMs without external IPs, so this is safe even if public-IP VMs are added later.
- Remote state in `data.tf`: Stack-level concern. Stories 3b-3d will add files to this stack that reference bootstrap outputs. A shared data source avoids duplication.
- Zone variable in scaffolding: `zone = us-east1-b` included in `variables.tf` for Story 3d's zonal GKE cluster.
- No `composer` subnet: Composer is not used. The VPC design only needs GKE subnets.
- No Private Service Access (PSA) here: PSA is only needed for Cloud SQL private IP — provisioned in Story 3b alongside the SQL instance.
#### Outputs to export
`vpc_id`, `vpc_self_link`, `subnet_self_link`, `pod_secondary_range_name`, `service_secondary_range_name`, `nat_ip_address`
#### What to verify
- `terraform fmt -check -recursive environments/dev-01-base/`
- `terraform init -backend-config=backend.hcl && terraform validate` passes
- `terraform plan` shows 5 resources (VPC, subnet, static IP, router, NAT)
- After CI apply: VPC and subnets exist: `gcloud compute networks subnets list --project=poc-ume-data`
- After CI apply: Private Google Access enabled: `gcloud compute networks subnets describe ume-data-dev-gke-nodes --region=us-east1 --format='value(privateIpGoogleAccess)'`
- After CI apply: Cloud NAT configured: `gcloud compute routers list --project=poc-ume-data`
- After CI apply: Static IP reserved: `gcloud compute addresses list --project=poc-ume-data --filter='name=ume-data-dev-nat-ip'`
#### Then
Story 3b adds Cloud SQL on this network.
### Story 3b — Cloud SQL
Stack: environments/dev-01-base/
Agent: infra-terraform
Depends on: Story 3a (VPC for PSA peering)
#### What to build
Cloud SQL (`cloud-sql.tf`):
- Private Service Access (PSA) — `google_compute_global_address` (`ume-data-dev-psa-range`, 10.64.0.0/20) + `google_service_networking_connection`. PSA is only needed for Cloud SQL private IP; GCS/AR/Secret Manager use Private Google Access (enabled in Story 3a), not PSA. See the sketch after this list.
- PostgreSQL 16 instance `ume-data-dev-airflow-pg`, tier `db-g1-small` (shared core, 1.7 GB RAM).
- Private IP via PSA (no public IP). `enable_private_path_for_google_cloud_services = true`.
- IAM authentication flag enabled (`cloudsql.iam_authentication = on`). The actual IAM user (`google_sql_user`) and `roles/cloudsql.client` binding are created in Story 3c alongside the `ume-airflow` SA.
- 10 GB SSD, auto-increase enabled, limit 50 GB (safety cap).
- Automated daily backups at 3 AM UTC, 7-day retention. No PITR (deferred to prod).
- Maintenance window: Sunday 4 AM UTC, stable track.
- `deletion_protection = false` (PoC only).
- `airflow` database created via `google_sql_database` so Story 4's Helm chart can connect immediately.
- Break-glass admin password: `google_secret_manager_secret` shell (`ume-data-dev-cloudsql-admin-password`). Value populated out-of-band. Default `postgres` user password set manually — no separate Terraform-managed admin user.
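A hedged sketch of the PSA + instance wiring described above; resource names and the exact settings layout are illustrative, not the shipped `cloud-sql.tf`:

```hcl
# Sketch only: values mirror the bullets above; the VPC reference assumes the flat Story 3a resource.
resource "google_compute_global_address" "psa_range" {
  name          = "ume-data-dev-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  address       = "10.64.0.0"
  prefix_length = 20
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "psa" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.psa_range.name]
}

resource "google_sql_database_instance" "airflow" {
  name                = "ume-data-dev-airflow-pg"
  database_version    = "POSTGRES_16"
  region              = "us-east1"
  deletion_protection = false # PoC only

  settings {
    tier                  = "db-g1-small"
    disk_size             = 10
    disk_autoresize       = true
    disk_autoresize_limit = 50

    ip_configuration {
      ipv4_enabled                                  = false
      private_network                               = google_compute_network.vpc.id
      enable_private_path_for_google_cloud_services = true
    }

    database_flags {
      name  = "cloudsql.iam_authentication"
      value = "on"
    }
  }

  depends_on = [google_service_networking_connection.psa]
}
```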
#### Design decisions
- `db-g1-small` over `db-f1-micro`: `db-f1-micro` has 614 MB RAM — OOM risk under write load. `db-g1-small` at 1.7 GB is sufficient for Airflow metadata. Cost: $26 vs $8/mo.
- PostgreSQL 16: Latest GA on Cloud SQL with improved query performance. Airflow supports 12-16.
- PSA range /20 not /24: Zero cost difference (just an IP allocation). Expanding PSA ranges later requires deleting/recreating the peering connection (downtime). /20 is future-proof for DataHub, replicas.
- PSA range hardcoded at `10.64.0.0`: Deterministic, reproducible plans. Safely outside all existing allocations (nodes 10.0.0.0/20, pods 10.4.0.0/14, services 10.8.0.0/20).
- `airflow` database created here, not in Story 4: Story 4's Helm chart expects `metadataConnection.db: airflow`. Creating the database alongside the instance avoids a manual prerequisite.
- No `google_sql_user` for admin: The default `postgres` user is created automatically by Cloud SQL. Break-glass access uses `postgres` + password from Secret Manager.
- `disk_autoresize_limit = 50`: Safety cap prevents runaway growth on a PoC instance.
- File name `cloud-sql.tf` (not `persistence.tf`): More specific, consistent with `networking.tf` and `gke.tf`. Updated `04-terraform-structure.md` to match.
- No labels on PSA range: `google_compute_global_address` with `purpose = VPC_PEERING` rejects labels (GCP API limitation).
- Shared instance strategy: When DataHub arrives in Phase 2, evaluate whether to create a second logical database on this instance (cheaper) or a separate instance (better isolation).
#### Outputs to export (added)
`sql_connection_name`, `sql_private_ip`, `sql_instance_name`
#### What to verify
- `terraform fmt -check -recursive environments/dev-01-base/`
- `terraform validate` passes
- After CI apply: Cloud SQL running: `gcloud sql instances list --project=poc-ume-data`
- After CI apply: Private IP assigned (no public): `gcloud sql instances describe ume-data-dev-airflow-pg --format='value(ipAddresses)'`
- After CI apply: PSA range allocated: `gcloud compute addresses list --global --filter='purpose=VPC_PEERING' --project=poc-ume-data`
- After CI apply: `airflow` database exists: `gcloud sql databases list --instance=ume-data-dev-airflow-pg --project=poc-ume-data`
- After CI apply: Secret shell exists: `gcloud secrets list --project=poc-ume-data --filter='name:cloudsql-admin-password'`
#### Then
Story 3c creates the Airflow service accounts.
### Story 3c — Airflow IAM
Stack: environments/dev-01-base/
Agent: infra-terraform
Depends on: Story 3a (stack scaffolding), Story 3b (Cloud SQL instance for IAM database user)
#### What to build
Airflow service accounts and IAM (`iam.tf`):
- `ume-airflow` service account with `roles/bigquery.dataEditor`, `roles/cloudsql.client`, `roles/secretmanager.secretAccessor`, `roles/storage.objectAdmin` (project-wide for PoC; scope to specific buckets in Story 4).
- `ume-airflow-kpo` service account with `roles/bigquery.dataEditor`, `roles/storage.objectViewer` (scoped identity for KPO tasks — separate from the main Airflow SA for security isolation).
- Workload Identity bindings for both SAs (`depends_on = [module.gke]` — GCP validates the WI pool exists, so these must wait for the cluster):
  - `airflow` KSA in `airflow` namespace → `ume-airflow` GSA
  - `airflow-kpo` KSA in `airflow-kpo` namespace → `ume-airflow-kpo` GSA
- Cloud SQL IAM database user (`google_sql_user` with `type = CLOUD_IAM_SERVICE_ACCOUNT`) for the `ume-airflow` SA — deferred from Story 3b. A sketch of the pattern follows this list.
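A hedged sketch of the patterns this story relies on (`toset()`/`for_each` role bindings, the WI binding gated on the cluster, and the `trimsuffix` SQL user); local and resource names are illustrative:

```hcl
# Sketch only: demonstrates the patterns, not the shipped iam.tf verbatim.
locals {
  airflow_roles = toset([
    "roles/bigquery.dataEditor",
    "roles/cloudsql.client",
    "roles/secretmanager.secretAccessor",
    "roles/storage.objectAdmin", # TODO(narrow-scope): bucket-level in Story 4
  ])
}

resource "google_service_account" "airflow" {
  account_id   = "ume-airflow"
  display_name = "Airflow (dev)"
}

resource "google_project_iam_member" "airflow" {
  for_each = local.airflow_roles
  project  = var.project_id
  role     = each.value
  member   = "serviceAccount:${google_service_account.airflow.email}"
}

# KSA airflow/airflow may impersonate the GSA via Workload Identity.
resource "google_service_account_iam_member" "airflow_wi" {
  service_account_id = google_service_account.airflow.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[airflow/airflow]"
  depends_on         = [module.gke]
}

# Cloud SQL expects the SA email without the .gserviceaccount.com suffix.
resource "google_sql_user" "airflow_iam" {
  instance = google_sql_database_instance.airflow.name
  name     = trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```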
#### Design decisions
- `google_sql_user` in `iam.tf`, not `cloud-sql.tf`: IAM concern (granting the SA database auth). Keeps Story 3c's PR self-contained.
- `google_project_iam_member` (additive): Same pattern as bootstrap. Authoritative bindings would revoke other members from shared roles like `roles/bigquery.dataEditor`.
- `for_each` over role sets: Role bindings use `toset()` locals with `for_each`. Adding/removing a role is a one-line change. Plan output is self-documenting (keys are full role strings).
- `trimsuffix` for the SQL user name: The GCP API expects the SA email without `.gserviceaccount.com`. Using `trimsuffix(google_service_account.airflow.email, ".gserviceaccount.com")` maintains the Terraform dependency graph.
- No labels on any resources: `google_service_account`, `google_project_iam_member`, `google_service_account_iam_member`, and `google_sql_user` do not support GCP labels. Not a label-invariant violation.
- WI bindings depend on GKE: GCP validates that the Workload Identity pool (`{project}.svc.id.goog`) exists — it is created when a GKE cluster enables Workload Identity. The bindings use `depends_on = [module.gke]` to ensure correct ordering. GCP does NOT validate that the KSA exists (Story 4 creates them via Helm).
- Broad permissions flagged for scoping: `roles/storage.objectAdmin` and `roles/secretmanager.secretAccessor` are project-wide for PoC. Inline `TODO(narrow-scope)` comments mark these for Story 4 / future hardening.
#### Outputs to export (added)
`airflow_sa_email`, `airflow_kpo_sa_email`
#### What to verify
- `terraform fmt -check -recursive environments/dev-01-base/`
- `terraform validate` passes
- After CI apply: `gcloud iam service-accounts list --project=poc-ume-data | grep ume-airflow`
- After CI apply: Both SAs created with correct roles
- After CI apply: Workload Identity bindings exist: `gcloud iam service-accounts get-iam-policy ume-airflow@poc-ume-data.iam.gserviceaccount.com`
- After CI apply: Workload Identity bindings exist: `gcloud iam service-accounts get-iam-policy ume-airflow-kpo@poc-ume-data.iam.gserviceaccount.com`
- After CI apply: Cloud SQL IAM user exists: `gcloud sql users list --instance=ume-data-dev-airflow-pg --project=poc-ume-data`
#### Then
Story 3d provisions the GKE cluster.
### Story 3d — GKE Cluster + Module Extraction
Stack: environments/dev-01-base/ + modules/gke-standard/ + modules/vpc/ + modules/cloud-sql-postgres/
Agent: infra-terraform
Depends on: Story 3a (VPC subnets for nodes/pods/services)
#### What to build
Module extraction (applied first): Extract existing flat resources from Stories 3a-3c into reusable modules. State migrated via `moved` blocks (declarative, CI-friendly — no manual `terraform state mv`). A minimal sketch follows the list below.
- `modules/vpc/` — VPC, subnet with GKE secondary ranges, Cloud NAT, Cloud Router. A single `network_cidr_base` (/12) parameter derives all CIDRs via `cidrsubnet()`.
- `modules/cloud-sql-postgres/` — PSA peering, Cloud SQL instance, database, admin password secret. Includes PSA because its sole purpose is Cloud SQL private networking.
- IAM stays flat in the env layer (policy layer, not infrastructure pattern).
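A minimal sketch of the `moved`-block migration, assuming the flat resource addresses used in Stories 3a-3c (the real refactor declares one block per extracted resource):

```hcl
# Sketch only: one moved block per extracted resource; addresses are illustrative.
moved {
  from = google_compute_network.vpc
  to   = module.vpc.google_compute_network.vpc
}

moved {
  from = google_compute_subnetwork.gke_nodes
  to   = module.vpc.google_compute_subnetwork.gke_nodes
}
```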
Bootstrap fix: Custom role tfIamPolicyAdmin on tf-apply-sa with {get,set}IamPolicy for both projects and service accounts. roles/editor omits these permissions, which are needed for google_project_iam_member and google_service_account_iam_member. Applied manually before CI can manage IAM bindings.
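A hedged sketch of that custom role, assuming the permission set is exactly the `{get,set}IamPolicy` pairs named above; role and SA addresses are illustrative:

```hcl
# Sketch only: role id and permissions mirror the paragraph above; project/SA references are illustrative.
resource "google_project_iam_custom_role" "tf_iam_policy_admin" {
  role_id = "tfIamPolicyAdmin"
  title   = "Terraform IAM Policy Admin"
  permissions = [
    "resourcemanager.projects.getIamPolicy",
    "resourcemanager.projects.setIamPolicy",
    "iam.serviceAccounts.getIamPolicy",
    "iam.serviceAccounts.setIamPolicy",
  ]
}

resource "google_project_iam_member" "tf_apply_iam_policy_admin" {
  project = var.project_id
  role    = google_project_iam_custom_role.tf_iam_policy_admin.id
  member  = "serviceAccount:${google_service_account.tf_apply.email}"
}
```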
GKE module (modules/gke-standard/): Reusable module encapsulating cluster creation, node pool management, naming, labels, and security defaults. All settings exposed as variables with sensible defaults. Called from environments/dev-01-base/gke.tf.
GKE cluster (via module):
- Cluster `ume-data-dev-gke`, zonal (us-east1-b) for dev PoC. Regional deferred to prod.
- Private cluster: private nodes, public endpoint with authorized networks (default 0.0.0.0/0 for dev, variable-driven for future Cloudflare WARP/VPN restriction).
- Master CIDR: 172.16.0.0/28 (control plane VPC peering, outside all existing allocations).
- Workload Identity enabled (`${project_id}.svc.id.goog`).
- Release channel: Regular.
- Dataplane V2 (`ADVANCED_DATAPATH`) for built-in network policy enforcement via Cilium/eBPF. Chosen over Calico (the spec's original choice) because it is Google's strategic direction and avoids `LEGACY_DATAPATH`.
- Maintenance window: weekdays 02:00-06:00 UTC.
- `deletion_protection = true`.
Node pools:
Both pools: shielded instances (secure boot + integrity monitoring), Workload Identity metadata mode, legacy metadata endpoint disabled, surge upgrade (max_surge=1, max_unavailable=0).
The kpo-pool scales to zero nodes when idle. When Airflow triggers a KPO task, the pod is created with a toleration for the workload=kpo:NoSchedule taint and a nodeSelector for pool: kpo. The Cluster Autoscaler detects the pending pod and provisions a spot node (~60-90s cold start). After ~10 minutes idle, the node is removed. Max 3 nodes in dev (tightened from 10 to limit blast radius from runaway DAGs).
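A sketch of the kpo-pool shape implied above, written as a flat resource for readability (the shipped pool lives inside `modules/gke-standard/`); the machine type for this pool is an assumption since the prose doesn't state it:

```hcl
# Sketch only: illustrative kpo-pool; cluster reference and machine type are assumptions.
resource "google_container_node_pool" "kpo" {
  name     = "kpo-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-east1-b"

  autoscaling {
    min_node_count = 0 # scale-to-zero when idle
    max_node_count = 3 # tightened from 10 to limit blast radius
  }

  node_config {
    machine_type = "e2-standard-2" # assumption: not stated for this pool
    spot         = true

    labels = { pool = "kpo" }

    taint {
      key    = "workload"
      value  = "kpo"
      effect = "NO_SCHEDULE"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }

  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }
}
```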
#### Design decisions
- Local module (`modules/gke-standard/`): Encapsulates cluster + node pools + naming + labels + security defaults. Environment stacks call the module with different parameters (machine types, node counts, location). Prod replication requires changing ~10 values in the module call instead of duplicating 160 lines of Terraform. All settings exposed as variables with defaults for maximum configurability per environment.
- Zonal cluster for dev: Halves node count vs regional. Regional deferred to prod when HA is required.
- `e2-standard-2` is the smallest viable machine: Shared-core machines (`e2-small`, `e2-medium`) lose ~1060m to flat CPU reservation. With `e2-standard-2` (2 vCPU, 8 GiB), allocatable is ~1930m CPU / ~6.1 GiB RAM.
- Dataplane V2 over Calico: Irreversible choice (requires cluster recreation to change). Cilium/eBPF is more performant than iptables-based Calico. Built-in network policy enforcement without a separate `network_policy` block. Known limitations reviewed: `anetd` CPU usage under high TCP churn (not applicable for Airflow), no manual internal passthrough NLBs (not needed).
- Authorized networks 0.0.0.0/0: The API server still requires authentication regardless. A variable-driven `list(object)` makes restricting to Cloudflare WARP CIDRs a one-line tfvars change.
- Master CIDR 172.16.0.0/28 hardcoded: Architectural decision, not per-environment. In a different RFC 1918 block from all existing allocations (nodes 10.0.0.0/20, pods 10.4.0.0/14, services 10.8.0.0/20, PSA 10.64.0.0/20).
- kpo-pool max=3: Tightened from 10 for dev PoC. Limits cost exposure from runaway DAGs while allowing some parallelism.
- `deletion_protection = true`: Deliberate two-step teardown (flip flag, then destroy). Safer default even for PoC.
- `oauth_scopes = ["cloud-platform"]`: Broad scope is standard practice because Workload Identity provides fine-grained pod-level auth. Node-level scopes are a legacy mechanism.
#### Phase 1 resource budget (1x e2-standard-2 default-pool)
Snug but workable — dbt-bigquery is I/O-bound (submits SQL and waits). See Airflow on GKE — Scaling signals for when to upgrade to e2-standard-4.
#### Outputs to export
`gke_cluster_name`, `gke_endpoint`, `gke_ca_cert` (sensitive)
#### What to verify
- `terraform fmt -check -recursive environments/dev-01-base/`
- `terraform validate` passes
- After CI apply: GKE cluster running: `gcloud container clusters list --project=poc-ume-data`
- After CI apply: kubectl works: `gcloud container clusters get-credentials ume-data-dev-gke --zone=us-east1-b --project=poc-ume-data && kubectl get nodes`
- After CI apply: One default-pool node visible, zero kpo-pool nodes
- After CI apply: Both pools listed: `gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b --project=poc-ume-data`
#### Then
Story 4 deploys Airflow onto the cluster.
### Story 4a — Runtime Stack Scaffolding + GCS Buckets
Stack: environments/dev-02-runtime/ + modules/gcs-bucket/ + updates to modules/gke-standard/, environments/dev-01-base/, layers/00-bootstrap/
Agent: infra-terraform
Depends on: Story 3d (dev-01-base complete)
#### What to build
New module — `modules/gcs-bucket/`:
- `google_storage_bucket` with configurable name, location, storage class, lifecycle rules, versioning.
- Hardcoded: uniform bucket-level access.
- Variables: `name`, `project_id`, `location`, `storage_class`, `versioning` (bool), `force_destroy` (bool, default false), `lifecycle_rules` (list of objects supporting Delete and SetStorageClass actions with age, created_before, num_newer_versions, with_state conditions), `labels`.
GKE module update — modules/gke-standard/:
- Add `gcs_fuse_csi_enabled` variable (default `true`).
- Enable the `gcs_fuse_csi_driver_config` add-on on the cluster via the `addons_config` block. Required for GCS-based DAG sync in Story 4b.
Prerequisite fixes (gaps from Story 3d):
- `environments/dev-01-base/outputs.tf` — Add missing GKE outputs: `gke_cluster_name`, `gke_endpoint`, `gke_ca_cert` (sensitive). Required by dev-02-runtime's kubernetes/helm providers via remote state.
- `environments/dev-01-base/moved.tf` — Delete (moves applied in Story 3d, file is dead weight).
- `layers/00-bootstrap/main.tf` — Add `roles/container.viewer` to the plan SA. Required for `terraform plan` on kubernetes/helm resources (Story 4b onward). `roles/viewer` does not grant k8s API access.
Stack scaffolding — environments/dev-02-runtime/:
- `versions.tf` — Terraform + google + google-beta + kubernetes + helm providers. The kubernetes and helm providers use `data.google_client_config.default.access_token` for auth and read the endpoint + CA cert from `dev-01-base` remote state.
- `variables.tf` — Active: `project_id`, `environment`, `region`, `zone`, `state_bucket`. Commented out (wired by later stories): `airflow_image_repository`, `airflow_image_tag`, `domain_name`, `airflow_subdomain`.
- `outputs.tf` — `airflow_logs_bucket`, `airflow_dags_bucket`.
- `locals.tf` — `common_labels` (layer=runtime).
- `backend.hcl` — GCS backend: `ume-tf-state-poc-ume-data/environments/dev-02-runtime/`.
- `terraform.tfvars` — dev values.
- `data.tf` — `terraform_remote_state` reading `dev-01-base` + `00-bootstrap`, plus `google_client_config` for the access token.
GCS buckets (`buckets.tf`):
- Log bucket via module: `ume-airflow-logs-poc-ume-data`, 90-day delete lifecycle, no versioning (see the sketch after this list).
- DAGs bucket via module: `ume-airflow-dags-poc-ume-data`, no lifecycle (synced from CI), versioning enabled (rollback support).
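A hedged sketch of one bucket-module call matching the log-bucket bullet; the `lifecycle_rules` object shape follows the variable description above, but the exact attribute names inside the module are assumptions:

```hcl
# Sketch only: module interface assumed from the variable list in this story.
module "airflow_logs_bucket" {
  source     = "../../modules/gcs-bucket"
  name       = "ume-airflow-logs-poc-ume-data"
  project_id = var.project_id
  location   = var.region
  versioning = false
  labels     = local.common_labels

  lifecycle_rules = [{
    action    = { type = "Delete" }
    condition = { age = 90 } # days
  }]
}
```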
#### Design decisions
- `modules/gcs-bucket/` module: Log bucket, DAG bucket, and future data buckets share the same pattern (lifecycle, labels, uniform access, versioning). Module-first strategy, justified by multiple callers within Phase 1 alone.
- Full lifecycle rule support: The `lifecycle_rules` variable accepts a list of objects with action type (Delete/SetStorageClass) and multiple condition types. Handles tiering rules, not just age-based delete.
- `force_destroy` as variable: The module invariant says expose all configurable settings. Defaults to false (safe); dev can override for easy teardown.
- GCS FUSE CSI over git-sync: Workload Identity handles auth to GCS (already configured in Story 3c). No tokens, SSH keys, or deploy keys needed. GCS FUSE is a native GKE add-on. See Story 4b for the mount configuration.
- Layer named `dev-02-runtime` (was `dev-03-runtime`): The `dev-02-k8s-base` layer was planned for Phase 2 (Strimzi, OpenSearch, ingress). Skipping from `dev-01` to `dev-03` is confusing when `dev-02` doesn't exist. Renumber if Phase 2 needs an intermediate layer.
- Provider auth pattern: The kubernetes/helm providers use `data.google_client_config.default.access_token` + the GKE endpoint/CA from remote state. No `gcloud get-credentials` calls. Providers initialize lazily, so Story 4a (no k8s resources) doesn't require cluster connectivity during plan.
- `roles/container.viewer` on the plan SA: `roles/viewer` does not map to any k8s RBAC role, so the plan SA cannot read k8s state for drift detection. `roles/container.viewer` grants read-only k8s API access via the `view` ClusterRole.
- Two remote state sources: dev-02-runtime reads from both `dev-01-base` (GKE, SQL, SA outputs) and `00-bootstrap` (AR URL, state bucket). Clear provenance over pass-through outputs.
#### What to verify
- `terraform fmt -check -recursive` passes across all changed stacks
- `terraform init -backend=false && terraform validate` passes on modules/gcs-bucket, environments/dev-01-base, environments/dev-02-runtime, layers/00-bootstrap
- `terraform plan` shows: 2 GCS buckets + GKE cluster update (FUSE add-on) + 3 new outputs on dev-01-base + 1 new IAM binding on bootstrap
- After CI apply: buckets exist: `gsutil ls gs://ume-airflow-logs-poc-ume-data/` and `gsutil ls gs://ume-airflow-dags-poc-ume-data/`
- After CI apply: GCS FUSE CSI enabled on the cluster
#### Then
Story 4b deploys Airflow onto the cluster.
### Story 4b — Airflow Helm Release (Stock Image, Port-Forward)
Stack: environments/dev-02-runtime/ + modules/airflow-helm/, with base-layer changes in environments/dev-01-base/ and modules/cloud-sql-postgres/
Agent: infra-terraform
Depends on: Story 4a (buckets created, providers configured, GCS FUSE enabled)
#### What to build
New module — `modules/airflow-helm/`: Namespace, shared service account, connection secrets, DB bootstrap Job, and Helm release. All settings exposed as variables with defaults. Called from `environments/dev-02-runtime/airflow.tf` as `module "airflow"`.
Airflow Helm release (via module):
- Official Apache Airflow Helm chart 1.20.0 deployed via `helm_release`.
- Stock image: `apache/airflow:3.2.0` (parametrized via `var.airflow_image_repository` + `var.airflow_image_tag`). Custom image with Cosmos/dbt added in Story 4d.
- Executor: CeleryExecutor with Redis.
- 1 Celery worker (min=1, always on).
- Triggerer enabled (for deferrable operators).
- DAG processor enabled (mandatory standalone component in Airflow 3).
- API server enabled (replaces the webserver in Airflow 3 — serves the UI and REST API).
- Namespace: `airflow`.
- No external auth — basic admin user created via the Helm `createUserJob`. Port-forward access is already gated by kubectl / GKE IAM.
Airflow 3 component changes (vs. Airflow 2):
- Chart 1.20.0 uses semver gates in templates: `apiServer` renders for Airflow >= 3.0.0, `webserver` renders for < 3.0.0.
- `dagProcessor` is mandatory — DAG parsing moved out of the scheduler into a standalone process.
- The `webserver` block is kept only for the `defaultUser` config consumed by `createUserJob`. Its deployment template does not render.
Workload Identity:
- Chart 1.20.0 creates per-component KSAs by default (`airflow-scheduler`, `airflow-api-server`, etc.), none of which carry the WI annotation.
- A single `kubernetes_service_account_v1` is created in Terraform with the WI annotation, and every component references it via `serviceAccount = { create = false, name = "airflow" }` (see the sketch after this list).
- The base layer's WI binding targets `[airflow/airflow]`.
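A minimal sketch of the shared KSA, assuming the SA email arrives as a variable from the base layer's remote state and the namespace is managed in the same module:

```hcl
# Sketch only: the shared KSA carrying the Workload Identity annotation.
resource "kubernetes_service_account_v1" "airflow" {
  metadata {
    name      = "airflow"
    namespace = kubernetes_namespace_v1.airflow.metadata[0].name
    annotations = {
      "iam.gke.io/gcp-service-account" = var.airflow_sa_email # ume-airflow GSA from dev-01-base
    }
  }
}
```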
Cloud SQL connection (via Auth Proxy sidecar):
- Cloud SQL Auth Proxy `gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.3` added as `extraContainers` on scheduler, workers, api-server, triggerer, dag-processor.
- Proxy flags: `--structured-logs --auto-iam-authn --private-ip --port=5432`. `--private-ip` is required because the Cloud SQL instance has only a private IP (PSA networking). `--auto-iam-authn` lets the proxy handle IAM token refresh via Workload Identity.
- Connection string: Pre-built `kubernetes_secret_v1` with the URL-encoded IAM user (the `@` in `ume-airflow@poc-ume-data.iam` breaks the Helm chart's URI template). Referenced via `data.metadataSecretName` / `data.resultBackendSecretName`.
Bootstrap Job (`kubernetes_job_v1.db_bootstrap`):
- Runs before the Helm release via `depends_on`.
- Cloud SQL Auth Proxy as a native sidecar (init container with `restartPolicy: Always`).
- Step 1 (`grants` init container): connects as the postgres admin, GRANTs privileges to the IAM user on the `airflow` database. Cloud SQL IAM users are created without any DB privileges.
- Step 2 (`migrate` init container): runs `airflow db migrate` as the IAM user via the proxy.
- The chart's `migrateDatabaseJob` is disabled because the chart's hook runs after the main release resources and failed when privileges didn't exist.
- The postgres admin password is fetched at runtime from Secret Manager via Workload Identity. No long-lived credentials in Kubernetes.
Base-layer changes (required for the bootstrap to work):
- `roles/cloudsql.instanceUser` added to the Airflow SA. This is required for IAM database authentication (`cloudsql.instances.login`), separate from `roles/cloudsql.client` which only allows proxy connections.
- `cloud-sql-postgres` module: automated postgres admin password via `random_password` + `google_sql_user` + `google_secret_manager_secret_version`. No manual password setup.
- Default pool `max_count` raised from 2 to 3 (7 Airflow pods with sidecars need room on e2-standard-2 nodes).
DAG sync via GCS FUSE:
- `dags.gitSync.enabled = false`.
- Per-component `extraVolumes` + `extraVolumeMounts` on scheduler, workers, triggerer, dag-processor.
- Pod annotations override the GCS FUSE sidecar resources: GKE default injection is 250m CPU / 256Mi memory / 5Gi ephemeral, overridden to 10m / 64Mi / 256Mi (a read-only DAG mount barely uses any CPU). Frees ~960m of CPU requests across 4 pods.
- Mounted at `/opt/airflow/dags/` (read-only).
Remote logging to GCS (hybrid with Cloud Logging):
- Container stdout/stderr goes to Cloud Logging automatically (GKE default, zero config).
- Airflow task execution logs go to GCS via the built-in `remote_logging`:
```yaml
env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: "gs://ume-airflow-logs-poc-ume-data/logs"
  - name: AIRFLOW__LOGGING__DELETE_LOCAL_LOGS
    value: "True"
```
Probe tuning: Chart default probes run airflow jobs check which imports the full Python framework on every invocation. On e2-standard-2 nodes this takes >20s. Startup probe failureThreshold set to 20 on scheduler and api-server. Liveness probe timeoutSeconds raised to 60 on scheduler, worker, triggerer, dag-processor.
Cleanup: Standalone kubernetes_cron_job_v1 (disabled by default, var.cleanup_enabled = false). The chart's built-in cleanup section doesn't support extraInitContainers, so the Cloud SQL Auth Proxy can't be injected there.
Resource requests (dev PoC -- 2-3x e2-standard-2 nodes):
```yaml
scheduler:
  resources:
    requests: { cpu: 200m, memory: 512Mi }
    limits: { cpu: "1", memory: 1Gi }
apiServer:
  resources:
    requests: { cpu: 250m, memory: 512Mi }
    limits: { cpu: 500m, memory: 1Gi }
dagProcessor:
  resources:
    requests: { cpu: 150m, memory: 384Mi }
    limits: { cpu: 500m, memory: 1Gi }
workers:
  replicas: 1
  resources:
    requests: { cpu: 500m, memory: 1536Mi }
    limits: { cpu: "1.5", memory: 3Gi }
triggerer:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 250m, memory: 512Mi }
redis:
  enabled: true
  resources:
    requests: { cpu: 50m, memory: 64Mi }
postgresql:
  enabled: false # external Cloud SQL
```
Hardening note: ume-airflow has project-wide roles/storage.objectAdmin. After this story, scope the grant to the specific log and DAG buckets via bucket-level IAM.
#### Design decisions
- Airflow 3.2.0 / chart 1.20.0: Spec was for 2.10.3 / 1.15.0. Upgraded because 3.2.0 was latest stable at deployment time, which forced the apiServer, dagProcessor, shared KSA, and bootstrap Job changes below.
- Stock image first: Validates the platform before adding Cosmos/dbt. Custom image in Story 4d.
- Cloud SQL Auth Proxy sidecar (not the Python connector): The stock Airflow image lacks `cloud-sql-python-connector`. The Auth Proxy handles IAM token refresh as a sidecar with no image dependencies.
- Shared KSA: Chart 1.20.0 creates per-component KSAs, none with WI. One Terraform-managed `kubernetes_service_account_v1` avoids N separate WI bindings and keeps the base layer's `[airflow/airflow]` binding working.
- Terraform bootstrap Job: The chart's `migrateDatabaseJob` is a post-install hook — it runs after the release resources exist. Cloud SQL IAM users start with zero DB privileges, so the hook fails on first install. The Terraform Job runs grants + migrate before the Helm release, then disables the chart's migration job. See the backlog for investigating the chart's intended pattern.
- `waitForMigrations` disabled: Chart 1.20.0 places `extraInitContainers` after the `wait-for-airflow-migrations` init container, so a native sidecar proxy there wouldn't be running when the check executes. Safe to disable because the Terraform bootstrap Job already ran migrations.
- `--private-ip`: The Cloud SQL instance is private-only (PSA). Without this flag the proxy tries the public IP and fails.
- GCS FUSE resource overrides: Default injection (250m CPU / 256Mi memory / 5Gi ephemeral per pod) is overkill for a read-only DAG mount. Annotations bring it down to 10m / 64Mi / 256Mi.
- Probe timeout 60s: `airflow jobs check` imports the full framework. 20s is not enough on e2-standard-2.
- Scheduler CPU limit 1000m: At 500m the scheduler was throttled during Python import and couldn't start within the probe window.
- Pre-built connection Secrets: The IAM DB user `ume-airflow@poc-ume-data.iam` has an `@` which breaks standard URI parsing in the Helm chart's template.
- Port-forward for initial access: No ingress, DNS, or TLS on the critical path. Port-forward is already gated by kubectl / GKE IAM. External access in Story 4c.
- Hybrid logging: Container logs go to Cloud Logging, task execution logs go to GCS (Airflow UI reads them natively).
- GCS FUSE over git-sync: Auth handled by Workload Identity, no tokens or keys. CI pushes DAGs to GCS on merge to main.
#### What to verify
- `terraform fmt -check -recursive` passes
- `terraform init -backend=false && terraform validate` passes on dev-02-runtime
- `terraform plan` clean on both base and runtime stacks
- All Airflow pods running: api-server 2/2, scheduler 4/4, dag-processor 4/4, triggerer 4/4, worker 4/4, redis 1/1, statsd 1/1
- Auth Proxy sidecars running with successful DB connections: `kubectl logs deploy/airflow-scheduler -c cloud-sql-proxy -n airflow`
- Bootstrap Job completed (grants + migrations)
- Airflow UI accessible: `kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow`
- Push a hello-world DAG to the GCS DAGs bucket; it appears in the Airflow UI
- Hello-world DAG runs on the Celery worker
- Logs appear in GCS: `gsutil ls gs://ume-airflow-logs-poc-ume-data/logs/`
- Cloud Logging shows container logs from the airflow namespace
#### Outputs to export
- `airflow_namespace`
- `airflow_logs_bucket` (GCS log bucket name)
- `airflow_dags_bucket` (GCS DAGs bucket name)
#### Then
Story 4c adds ingress, TLS, DNS, and OIDC authentication for the API server.
### Story 4c — Ingress + TLS + DNS + IAP (Gateway API, three layers)
Stacks: layers/00-bootstrap/, environments/dev-01-base/, environments/dev-02-k8s-base/ (new), environments/dev-03-runtime/ (renamed from dev-02-runtime/)
Agent: infra-terraform
Status: DONE
Depends on: Story 4b (Airflow running)
The original spec called for classic GKE Ingress + Flask-AppBuilder OAuth in webserver_config.py. Both were abandoned during execution: classic Ingress can't share a static IP across services (precluding shared-IP + wildcard DNS + per-app ingress), and Airflow 3 replaced Flask-AppBuilder auth with a pluggable auth_manager. The shipped design uses GKE Gateway API with IAP at the load balancer. See story-status.md for the PR-by-PR account.
#### What was built
Layer structure reshuffle. New environments/dev-02-k8s-base/ platform layer (pulled forward from Story 8). Old dev-02-runtime/ renamed to dev-03-runtime/. DNS + shared static IP + wildcard cert moved to dev-01-base/ (zero k8s provider dependency).
layers/00-bootstrap/:
- `dns.googleapis.com`, `iap.googleapis.com` APIs enabled.
- `roles/iap.admin` on `tf-apply-sa` (brand/client write path).
- Custom role `tfIapReader` on `tf-plan-sa` with `clientauthconfig.{brands,clients}.{get,list}` and the `WithSecret` variants (plan refresh).
- Invariant added to `CLAUDE.md`: verify plan-SA + apply-SA permission coverage before every new downstream resource type.
environments/dev-01-base/:
- `google_dns_managed_zone` `ume-data-${env}-zone` (delegated from GoDaddy).
- `google_compute_global_address` `ume-data-${env}-ingress-ip` (shared across every service on the Gateway).
- Wildcard A record `*.${domain}` → shared IP.
- Certificate Manager DNS-01 authorization + auth CNAME + wildcard managed cert + certificate map + entry — all with `*.${domain}` coverage.
- New outputs: `domain_name`, `dns_zone_name`, `dns_zone_nameservers`, `ingress_ip_name`, `ingress_ip_address`, `certificate_map_name`.
- `modules/gke-standard/` gained `gateway_api_config { channel = "CHANNEL_STANDARD" }` (installs the Gateway/HTTPRoute v1 CRDs on the cluster).
environments/dev-02-k8s-base/ (new stack):
- `google` + `kubernetes` + `helm` providers wired via remote state from `dev-01-base`.
- Gateway namespace `ume-data-${env}-gateway`.
- `kubernetes_manifest` Gateway: `gatewayClassName = gke-l7-global-external-managed`, `NamedAddress` pointing at base's static IP, listeners `https:443` and `http:80` both with `allowedRoutes.namespaces.from = All`, annotation `networking.gke.io/certmap` pointing at base's cert map (see the sketch after this list).
- `kubernetes_manifest` HTTPRoute on `:80` with a catch-all `PathPrefix: /` match and a `RequestRedirect` filter (scheme https, 301).
- Outputs: `gateway_name`, `gateway_namespace`.
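A hedged sketch of the Gateway manifest shape described above; field layout follows the Gateway API v1 schema plus GKE's `NamedAddress`/certmap extensions, and the remote-state references are assumptions:

```hcl
# Sketch only: illustrative Gateway manifest; real names and remote-state wiring may differ.
resource "kubernetes_manifest" "gateway" {
  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "Gateway"
    metadata = {
      name      = "ume-data-dev-gateway"
      namespace = "ume-data-dev-gateway"
      annotations = {
        "networking.gke.io/certmap" = data.terraform_remote_state.base.outputs.certificate_map_name
      }
    }
    spec = {
      gatewayClassName = "gke-l7-global-external-managed"
      addresses = [{
        type  = "NamedAddress"
        value = data.terraform_remote_state.base.outputs.ingress_ip_name
      }]
      listeners = [
        {
          name          = "https"
          protocol      = "HTTPS"
          port          = 443
          allowedRoutes = { namespaces = { from = "All" } }
        },
        {
          name          = "http"
          protocol      = "HTTP"
          port          = 80
          allowedRoutes = { namespaces = { from = "All" } }
        },
      ]
    }
  }
}
```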
New modules/iap-oauth/:
- `google_iap_client` under a caller-provided brand (the brand stays in the stack as a project singleton).
- `kubernetes_secret_v1` with exactly one key `key = <oauth client secret>` (GCPBackendPolicy expects a single-key secret).
- `kubernetes_manifest` GCPBackendPolicy with `spec.default.iap.{enabled, clientID, oauth2ClientSecret.name}` and a `targetRef` to the app Service.
- `google_project_iam_member` unconditional bindings on `roles/iap.httpsResourceAccessor` for each member of the UNION of `iap_allowed_domains`/`groups`/`users`.
Extended modules/airflow-helm/:
- Optional HTTPRoute (`httproute_enabled`) attaching to the shared Gateway via a cross-namespace parentRef with `sectionName = "https"` (pins Airflow to the HTTPS listener, leaves `:80` for the redirect HTTPRoute).
- `airflow_config.simple_auth_manager_all_admins` flag. When `true`, the module also pins `[core] auth_manager = SimpleAuthManager` and force-disables the chart's `createUserJob` (both required to avoid FAB/SimpleAuthManager conflicts).
environments/dev-03-runtime/:
- IAP brand passed in via `var.iap_brand_name` (the brand is created manually in the GCP Console — see the `iap.tf` header for the runbook).
- `module "airflow_iap"` wires IAP to `airflow-api-server` with a per-user allow-list (`ext_marcello.pontes@ume.com.br`, `wagner.jorge@ume.com.br`, `leonardo.luiz@ume.com.br`).
- Airflow HTTPRoute on `https://airflow.${domain}`.
- `airflow_config.simple_auth_manager_all_admins = true` — users signed in through IAP land straight on the Airflow UI.
#### Prerequisites (one-time manual)
- GCP Console → APIs & Services → OAuth consent screen. For Workspace-owned projects pick Internal; for standalone projects pick External. App name, support email, developer contact. The IAP brand is auto-created.
- `gcloud iap oauth-brands list --project=<id> --format='value(name)'` → paste into `iap_brand_name` in the runtime tfvars.
- Delegate `${domain}` NS records to Google from the apex registrar (GoDaddy in our case). Fetch the nameservers with `terraform output -raw dns_zone_nameservers` on `dev-01-base`.
#### Design decisions
- Gateway API over classic Ingress. Enables shared IP + wildcard DNS + per-app ingress (classic Ingress pins one GCLB per Ingress, cannot share).
- Wildcard Certificate Manager cert (DNS-01). Covers every `*.${domain}` subdomain. DNS-01 against our own zone activates in minutes. The `ManagedCertificate` CRD is HTTP-01-only and doesn't support wildcards.
- Three-layer split. `01-base` pure GCP; `02-k8s-base` k8s-platform singletons (Gateway today, Prometheus/CSI in Phase 2); `03-runtime` apps. DNS in base keeps the k8s providers out of the base plan.
- IAP at GCLB over Airflow-native OIDC. Airflow 3 pluggable auth would require the FAB provider + a custom image. IAP is zero-image-change and aligns with DataHub's future auth.
- Per-user IAP allow-list, unconditional binding. IAM conditions do not propagate to IAP's authorization path for Gateway-API backends — tried and rejected. Tight scoping via the allow-list.
- `simple_auth_manager_all_admins = true` with `auth_manager` pinned to SimpleAuthManager. One login (IAP) is enough; the module pins both configs together to avoid FAB/Simple conflicts and also disables `createUserJob`.
- IAP brand stays manual. `google_iap_brand` doesn't work for non-Workspace projects and the IAP OAuth Admin API is being phased out. The stack accepts the brand as an input.
- Orthogonal module boundaries. `iap-oauth` is per-service (reused by DataHub in Phase 2). The Gateway sits inline in `dev-02-k8s-base` for now (extract into `modules/gke-gateway/` when prod replicates).
#### What to verify
- `terraform fmt -check -recursive` + `terraform validate` pass across all changed stacks and modules.
- DNS: `dig NS umedev.marpont.es @8.8.8.8` returns 4 Google nameservers.
- Cert: `gcloud certificate-manager certificates describe ume-data-dev-wildcard --location=global` reaches `state: ACTIVE`.
- Gateway: `kubectl get gateway -n ume-data-dev-gateway` shows `PROGRAMMED=True`.
- HTTPRoute: `kubectl get httproute -n airflow` shows `airflow` accepted (bound to the `https` section).
- BackendPolicy: `kubectl describe gcpbackendpolicy airflow-api-server-iap -n airflow` shows `Type: Attached, Status: True`.
- Backend service: `gcloud compute backend-services list --format='table(name,iap.enabled)'` shows `iap.enabled = True` on `gkegw1-…-airflow-api-server-…`.
- IAM: three user bindings on `roles/iap.httpsResourceAccessor`, unconditional.
- HTTP redirect: `curl -sI http://airflow.umedev.marpont.es/` → 301 to https.
- IAP: `curl -sI https://airflow.umedev.marpont.es/` → 302 to `accounts.google.com/o/oauth2/v2/auth?client_id=...`.
- Browser sign-in as an allow-listed user lands on the Airflow UI with no second login.
#### Then
Story 4d adds the custom Airflow image with Cosmos and dbt.
### Story 4d — Custom Airflow Image + Cosmos/dbt
Location: today, the ume-data-dags repo (docker/, scripts/, .github/workflows/image.yml, .github/workflows/bot-pr.yml). On this side: the wait-for-image gate in .github/workflows/terraform-apply.yml and the airflow_image_tag line in environments/dev-03-runtime/terraform.tfvars.
Agent: airflow-dags (image + requirements, in ume-data-dags) + infra-terraform (bootstrap SA + WIF, tfvars plumbing, in ume-data-infra)
Depends on: Story 4c (Airflow running with ingress + auth)
Spec rewritten 2026-04-18 to match what actually shipped. Original spec
targeted apache/airflow:2.10.3 and environments/dev-02-runtime/; Story
4b deployed Airflow 3.2.0 and Story 4c renamed the runtime stack to
dev-03-runtime. The 4d base image must extend the deployed 3.2.0 image.
Content was initially scaffolded under resources/ in this repo and moved
to ume-data-dags once validated.
#### What to build
Custom Docker image (`ume-data-dags/docker/`):
- `Dockerfile` extending `apache/airflow:3.2.0`. Installs `astronomer-cosmos~=1.14` in the Airflow Python env (constrained) and `dbt-core~=1.9` + `dbt-bigquery~=1.9` in an isolated `/home/airflow/dbt-venv/` (required because Airflow 3.2's constraints clash with dbt-core on `pathspec`/`protobuf`).
- Build-time guardrails (`which dbt`, `import cosmos`, FAB-provider check) fail fast on drift.
- `scripts/build-image.sh` — local build helper that tags with the same `3.2.0-<sha>` convention as CI.
CI workflows (in ume-data-dags):
- `.github/workflows/image.yml` — builds + pushes `3.2.0-<sha>` on merge to main when `docker/` changes.
- `.github/workflows/dag-sync.yml` — `gcloud storage rsync`s `dags/` + `dbt/` to the bucket on merge when those paths change.
- `.github/workflows/pr-ci.yml` — PR lint (hadolint + `python -m py_compile` + `dbt parse`); no GCP auth needed.
- `.github/workflows/bot-pr.yml` — after `image.yml` succeeds on main, uses `INFRA_PR_TOKEN` (a fine-grained PAT scoped to `ume-data-infra` only) to open a tfvars-bump PR on this repo.
CI workflows (in ume-data-infra):
- `.github/workflows/terraform-apply.yml` — wait-for-image gate before the runtime apply (15-min poll) so Helm never starts a rollout against a missing tag.
Bootstrap and base-layer changes (ume-data-infra):
- `layers/00-bootstrap/main.tf` — `docker_config { immutable_tags = true }` on the AR repo. A content-push SA (`ume-datainfra-content-push`) scoped to AR writer on `ume-composer-images` + WIF bound to `1edata/ume-data-dags`. Three narrow custom roles on `tf-apply-sa` (`tfWifProviderUpdater`, `tfCustomRoleManager`, `tfArRepoIamAdmin`) for self-management of these resource types.
- `environments/dev-01-base/iam.tf` — `roles/bigquery.jobUser` on `ume-airflow` and `ume-airflow-kpo`. Without it, dbt-bigquery cannot submit queries (`bigquery.jobs.create` denied; `bigquery.dataEditor` does not include it).
- `environments/dev-03-runtime/buckets.tf` — bucket-scoped `roles/storage.objectAdmin` for the content-push SA on the dev DAGs bucket.
Runtime rollouts (continuous, via the bot-PR loop):
`airflow_image_repository` is set to the AR URL once; `airflow_image_tag` is bumped on every DAGs-repo merge by the bot-PR workflow. Merging the bot-PR triggers terraform-apply's wait-for-image gate, then Helm rolls the pods.
Ownership model:
```
ume-data-dags repo:
└── docker/            Dockerfile + dbt venv
└── dags/ + dbt/
└── CI: build image + push to AR
└── CI: gcloud storage rsync dags/ + dbt/ → GCS DAGs bucket
└── CI: open bot-PR against ume-data-infra bumping airflow_image_tag

ume-data-infra repo (this repo):
└── environments/dev-03-runtime/terraform.tfvars
    └── airflow_image_repository = "us-east1-docker.pkg.dev/.../ume-composer-images/airflow"
    └── airflow_image_tag = "3.2.0-<sha>"   ← bumped by bot-PR
└── .github/workflows/terraform-apply.yml
    └── wait-for-image gate before Helm rollout
```
Tag format: `<airflow-version>-<commit-sha>` (e.g., `3.2.0-a1b2c3d`). Immutable — AR's `docker_config.immutable_tags` rejects overwrites.
Rollback: revert airflow_image_tag in tfvars to the previous value and apply. The db_bootstrap Job re-runs with the rolled-back image; airflow db migrate is idempotent but the target image must be known-good (otherwise the init container fails and the bootstrap stays Pending).
#### What to verify
- `ume-data-dags`'s `pr-ci.yml` green on PR (hadolint, py_compile, dbt parse).
- On merge in ume-data-dags: image present in AR (`gcloud artifacts docker images list us-east1-docker.pkg.dev/poc-ume-data/ume-composer-images`).
- Immutable tags enabled (`gcloud artifacts repositories describe ume-composer-images --location=us-east1 --format='value(dockerConfig.immutableTags)'` → `True`).
- `roles/bigquery.jobUser` present on both Airflow SAs.
- Bot-PR opened on this repo, merged, pods restart with the new image.
- `astronomer-cosmos` importable (≥ 1.14); `dbt --version` works at `/home/airflow/dbt-venv/bin/dbt`.
- No regression of IAP + SimpleAuthManager: browser sign-in at `https://airflow.umedev.marpont.es/` lands on the UI with no Airflow-side login.
- Cosmos execution mode (local) functional — validated in Story 5.
#### Then
Story 5 is bundled — ume_dbt_example DAG is already in ume-data-dags/dags/.
### Story 5 — First Cosmos-Powered dbt DAG
Location: ume-data-dags/dags/ and ume-data-dags/dbt/
Agent: airflow-dags (in ume-data-dags)
Depends on: Story 4d (custom image with Cosmos + dbt installed)
Bundled with: Story 4d — initially shipped together under resources/ in this repo; content moved to ume-data-dags once validated.
#### What to build
In `ume-data-dags/dbt/`:
- `dbt_project.yml` with project configuration.
- `profiles.yml` configured for BigQuery OAuth using the Airflow SA identity (Workload Identity, `oauth` method).
- `models/example/`:
  - `ume_hello_world.sql` — materialized as `table`, `SELECT CURRENT_TIMESTAMP(), message, sentinel`.
  - `ume_hello_world_downstream.sql` — depends on `ume_hello_world` via `{{ ref(...) }}`. Having a `ref()` edge proves Cosmos renders the task graph with a dependency, not just "dbt ran."
  - `schema.yml` documenting both.
In ume-data-dags/dags/:
- `cosmos_dbt_dag.py` — a Cosmos DAG using local execution mode (`ExecutionMode.LOCAL`). The DAG renders the dbt project as individual Airflow tasks, each dispatched to the Celery worker. Cosmos copies the project to a per-task tmp directory before invoking dbt, so the read-only GCS FUSE mount is not a problem.
- `dbt_project_path = /opt/airflow/dags/dbt` (GCS FUSE mounts the bucket root at `/opt/airflow/dags/`, so `dbt/` is a sibling of `dags/` in the bucket).
- `dbt_executable_path = /home/airflow/dbt-venv/bin/dbt` (isolated venv; see the Story 4d note about Airflow 3.2 constraints vs dbt-core).
- `is_paused_upon_creation = True`, `schedule = None`, `default_args` with `owner`, `retries=1`.
#### What to verify
- DAGs and dbt project synced to the GCS bucket: `gsutil ls gs://ume-airflow-dags-poc-ume-data/dags/ gs://ume-airflow-dags-poc-ume-data/dbt/`
- Files visible in the worker filesystem: `kubectl exec deploy/airflow-worker -n airflow -c worker -- ls /opt/airflow/dags/dbt/`
- `ume_dbt_example` DAG visible in the Airflow UI with two dbt-model tasks and a dependency edge (`ume_hello_world` → `ume_hello_world_downstream`)
- Un-pause and trigger the DAG manually — all dbt tasks run successfully
- `bq show --format=prettyjson poc-ume-data:dbt_dev.ume_hello_world` and `...ume_hello_world_downstream` return expected schemas
- Airflow task logs show dbt output (both the Airflow UI and `gs://ume-airflow-logs-poc-ume-data/logs/`)
- Tasks execute on the Celery worker (not the scheduler) — verify in task instance details
- Re-trigger the DAG once; `materialized: table` replaces tables idempotently (no accidental appends)
- `kubectl top pod -n airflow` during the run — worker RSS stays well below the 3 Gi limit
#### Then
Phase 1 is complete. The data pipeline (Airflow + dbt + BigQuery) is operational on GKE, and the content pipeline is split into a dedicated ume-data-dags repo. Next steps:
- Scope `roles/storage.objectAdmin` on the `ume-airflow` SA to specific buckets (see backlog).
- Extend the DAGs repo workflows to cover prod when the prod project is provisioned (matrix or split workflow files).
- Begin Phase 2 (DataHub) when priorities allow.
## Phase 2 — DataHub & Additional Infrastructure
Phase 2 adds DataHub to the existing GKE cluster with Strimzi Kafka and self-hosted OpenSearch as backing services. The GKE cluster, VPC, shared Cloud SQL instance, Gateway, wildcard cert, and IAP brand from Phase 1 are all reused.
Master plan: plans/datahub-deployment-plan.md
— read this first. It covers architecture decisions, node-pool strategy,
disk sizing, alerting, and the per-story execution strategy (one
autonomous session per story, restricted profile).
Each story below is sized for one session. Specs below are the implementation contract; design rationale lives in the master plan.
### Story 6 — Workload Pool + DataHub SQL + Password Secret
Stack: environments/dev-01-base/ (update) + layers/00-bootstrap/ (CI IAM coverage, if needed)
Agent: infra-terraform
Depends on: Phase 1 complete
#### What to build
Node pool (environments/dev-01-base/terraform.tfvars):
Add a new entry to gke_node_pools:
```hcl
workload-pool = {
  machine_type = "e2-standard-4"
  min_count    = 1
  max_count    = 4
  spot         = false
  extra_labels = { pool = "workload" }
  # No taint — workload selector (pool=workload) is enough.
}
```
DataHub database + user + password (environments/dev-01-base/cloud-sql.tf):
- `google_sql_database.datahub` — `name = "datahub"`, `instance = module.airflow_sql.instance_name` (see the sketch after this list).
- `random_password.datahub_db` — length 32, `special = false` (avoids JDBC URL-encoding traps).
- `google_sql_user.datahub` — `type = BUILT_IN` (password auth), `name = "datahub"`, `password = random_password.datahub_db.result`.
- `google_secret_manager_secret.datahub_db_password` — `secret_id = "ume-data-dev-datahub-db-password"`, automatic replication.
- `google_secret_manager_secret_version.datahub_db_password_v1` — `secret_data = random_password.datahub_db.result`.
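A hedged sketch of the database + user + secret chain above; resource addresses mirror the bullets, and the module output name is taken from the story text:

```hcl
# Sketch only: values mirror the bullets above.
resource "google_sql_database" "datahub" {
  name     = "datahub"
  instance = module.airflow_sql.instance_name
}

resource "random_password" "datahub_db" {
  length  = 32
  special = false # avoids JDBC URL-encoding traps
}

resource "google_sql_user" "datahub" {
  name     = "datahub"
  instance = module.airflow_sql.instance_name
  type     = "BUILT_IN"
  password = random_password.datahub_db.result
}

resource "google_secret_manager_secret" "datahub_db_password" {
  secret_id = "ume-data-dev-datahub-db-password"
  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "datahub_db_password_v1" {
  secret      = google_secret_manager_secret.datahub_db_password.id
  secret_data = random_password.datahub_db.result
}
```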
Outputs (environments/dev-01-base/outputs.tf):
- `datahub_db_name = "datahub"`
- `datahub_db_user = google_sql_user.datahub.name`
- `datahub_db_host = module.airflow_sql.private_ip_address`
- `datahub_db_password_secret_id = google_secret_manager_secret.datahub_db_password.secret_id`
Cloud Monitoring alert (environments/dev-02-k8s-base/alerts.tf — new file):
- Policy "Cloud SQL disk > 75%" on metric `cloudsql.googleapis.com/database/disk/utilization`, filter instance `ume-data-dev-airflow-pg`, threshold 0.75, duration 10m.
Bootstrap CI IAM check (invariant #11):
Verify tf-plan-sa can read google_secret_manager_secret_version data
sources (needed downstream by Story 11's Helm release). The existing
tfK8sSecretsReader role (Story 4b era) covers secretmanager.versions.*
— confirm during planning; add a custom role if gap found.
#### Design decisions
Canonical in plans/datahub-deployment-plan.md §1, §2, §3, §5. Key points:
- Shared SQL instance, not a new one. Saves ~$26/mo; dev workload fits.
- Password auth, not IAM auth. Skips 5 Cloud SQL Auth Proxy sidecars in DataHub pods.
- Secret Manager (not plaintext in Helm values). DataHub pods mount via Secrets Store CSI (Story 7 + Story 11).
- `workload-pool` distinct from `default-pool`. Stateful workloads on their own nodes. `min=1` with soft anti-affinity for Kafka/OS pods — cold-start fits one node, scales out as needed.
#### What to verify
- `terraform fmt -check -recursive` + `validate` pass.
- After CI apply: `gcloud container node-pools list --cluster=ume-data-dev-gke --zone=us-east1-b` shows `workload-pool` with `locations=us-east1-b`, `machineType=e2-standard-4`.
- `gcloud sql databases list --instance=ume-data-dev-airflow-pg` shows `datahub`.
- `gcloud sql users list --instance=ume-data-dev-airflow-pg` shows the `datahub` user (type `BUILT_IN`).
- `gcloud secrets versions list ume-data-dev-datahub-db-password` returns exactly one version.
- `gcloud alpha monitoring policies list` shows the Cloud SQL disk policy.
#### Then
Story 7 installs the Secret Manager CSI driver.
### Story 7 — Secrets Store CSI Driver
Stack: environments/dev-02-k8s-base/
Agent: infra-terraform
Depends on: Story 6
#### What to build
`environments/dev-02-k8s-base/secrets-store-csi.tf` (new file):
- `helm_release.secrets_store_csi_driver` — chart `secrets-store-csi-driver` from `https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts`, namespace `kube-system`, pinned chart version (verify latest at story time). See the sketch after this list.
- `helm_release.secrets_store_csi_driver_gcp` — chart `secrets-store-csi-driver-provider-gcp` from `https://googlecloudplatform.github.io/secrets-store-csi-driver-provider-gcp`, namespace `kube-system`, pinned chart version.
- Values: `syncSecret.enabled = true` on the base driver (so mounted secrets can also be synced to native k8s Secrets — DataHub's chart expects env-var refs to k8s Secrets, not file paths).
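A minimal sketch of the base-driver release, assuming Helm provider v2 `set` blocks; the chart version is a placeholder to be pinned at story time:

```hcl
# Sketch only: version is a placeholder; pin the actual chart version at story time.
resource "helm_release" "secrets_store_csi_driver" {
  name       = "secrets-store-csi-driver"
  repository = "https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts"
  chart      = "secrets-store-csi-driver"
  version    = "1.4.0" # placeholder
  namespace  = "kube-system"

  set {
    name  = "syncSecret.enabled"
    value = "true"
  }
}
```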
Outputs: none needed (driver exposes cluster-wide SecretProviderClass CRD).
#### Design decisions
- `kube-system` namespace. The driver is a `DaemonSet` that must run on every node pool; standard convention places it in `kube-system`.
- GCP provider alongside the base driver. The base driver is generic; the GCP provider is the Secret Manager plugin. Both are required.
- `syncSecret.enabled = true`. DataHub's Helm chart and most upstream charts consume passwords via `env.valueFrom.secretKeyRef`, which requires a k8s `Secret` object. Sync mode creates one from the CSI mount.
#### What to verify
- `kubectl -n kube-system get pods -l app=secrets-store-csi-driver` all Running.
- `kubectl get crd secretproviderclasses.secrets-store.csi.x-k8s.io` exists.
- `kubectl -n kube-system get pods -l app=csi-secrets-store-provider-gcp` all Running.
#### Then
Story 8 installs the Strimzi operator.
### Story 8 — Strimzi Kafka Operator
Stack: environments/dev-02-k8s-base/
Agent: infra-terraform
Depends on: Story 6 (workload-pool exists; the operator itself can run anywhere, but the Kafka clusters it manages will target that pool)
#### What to build
`environments/dev-02-k8s-base/strimzi.tf` (new file):
- `kubernetes_namespace_v1.strimzi_system` — `strimzi-system` namespace with common labels.
- `helm_release.strimzi_kafka_operator` — chart `strimzi-kafka-operator` from `https://strimzi.io/charts/`, pinned chart version (verify latest at story time).
- Values:
  - `watchAnyNamespace: true` — cluster-wide watch.
  - `resources.requests: { cpu: 200m, memory: 384Mi }` — the operator itself is small.
  - `nodeSelector: { pool: workload }` — pin the operator to workload-pool.
#### Design decisions
- Cluster-wide watch. Matches our shared-Gateway pattern — one operator, many namespaces possible later.
- Operator on workload-pool. Keeps default-pool free of operator pods.
- No Kafka CR yet. That's Story 9. Keeping the operator install in its own PR means any CRD/operator upgrade rolls back cleanly.
#### What to verify
- `kubectl -n strimzi-system get pods` shows the operator Running.
- CRDs installed: `kubectl get crd | grep strimzi.io` lists `kafkas`, `kafkanodepools`, `kafkatopics`, `kafkausers`.
- Operator scheduled on workload-pool: `kubectl -n strimzi-system get pods -o wide` → node has label `pool=workload`.
#### Then
Story 9 provisions the Kafka cluster.
### Story 9 — Kafka Cluster (KRaft, 3 Controllers + 2 Brokers)
Stack: environments/dev-03-runtime/ + new modules/strimzi-kafka/
Agent: infra-terraform
Depends on: Story 8
#### What to build
`modules/strimzi-kafka/` (new):
- `main.tf` — namespace + `KafkaNodePool` (controllers) + `KafkaNodePool` (brokers) + `Kafka` CR via `kubernetes_manifest`.
- `variables.tf` — `namespace`, `cluster_name`, `kafka_version`, `controller_replicas` (default 3), `controller_memory` (default 256Mi), `controller_storage_size` (default 1Gi), `broker_replicas` (default 2), `broker_memory` (default 1.5Gi), `broker_cpu` (default 500m), `broker_storage_size` (default 10Gi), `broker_storage_class` (default `premium-rwo`), `log_retention_hours` (default 72), `log_retention_bytes` (default 8589934592 = 8 GiB), `min_insync_replicas` (default 1), `node_selector` (default `{ pool = "workload" }`).
- `outputs.tf` — `bootstrap_servers` (= `<cluster_name>-kafka-bootstrap.<namespace>.svc:9092`), `namespace`, `cluster_name`.
`environments/dev-03-runtime/kafka.tf` (new file):
- `module "kafka"` call with defaults; `cluster_name = "ume-data-dev-kafka"`, `namespace = "kafka"` (see the sketch below).
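A minimal sketch of the module call, assuming the variable names defined in the `modules/strimzi-kafka/` spec above:

```hcl
# Sketch only: everything not set here falls back to the module defaults described in this story.
module "kafka" {
  source       = "../../modules/strimzi-kafka"
  cluster_name = "ume-data-dev-kafka"
  namespace    = "kafka"
}
```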
Alert (environments/dev-02-k8s-base/alerts.tf):
- Policy "Kafka broker PV > 70%" on metric `kubernetes.io/node/persistentvolume/volume/used_bytes / capacity_bytes`, filter namespace `kafka`.
#### Design decisions
Canonical in plans/datahub-deployment-plan.md §4, §5.
- KRaft, not ZooKeeper. Strimzi 0.38+ supports KRaft; one fewer moving part.
- Dedicated controllers. 2-broker combined-role clusters can't form an odd-quorum. 3 tiny controllers solve it.
- Retention + size caps together. Time-based retention (72h) + byte-based cap (8 GiB) ensures the PV never fills even under a burst.
- Soft anti-affinity. `preferredDuringSchedulingIgnoredDuringExecution` on `topology.kubernetes.io/hostname`. Lets brokers co-locate when there's only one node; spreads them when the autoscaler adds more.
- PD-SSD. Kafka is IOPS-sensitive; pd-balanced is cheaper but can stall during retention sweeps.
- No Cruise Control. Added to backlog for prod.
- `min.insync.replicas = 1`. With RF=2, one broker can be down during a rolling upgrade without losing write availability.
#
What to verify
- `kubectl -n kafka get kafka ume-data-dev-kafka` → `READY=True`.
- `kubectl -n kafka get pods` shows 3 `-controllers-*` and 2 `-brokers-*` pods Running.
- Brokers scheduled on workload-pool nodes.
- `kubectl -n kafka get pvc` shows 5 PVCs bound (3 controller + 2 broker).
- Bootstrap service reachable in-cluster: `kubectl -n kafka run kcat --rm -it --image=edenhill/kcat:1.7.1 --restart=Never -- -b ume-data-dev-kafka-kafka-bootstrap:9092 -L` (metadata listing).
- PV alert policy exists.
#
Then
Story 10 provisions OpenSearch.
#
Story 10 — OpenSearch + Snapshots
Stack: environments/dev-02-k8s-base/ (operator) + environments/dev-03-runtime/ (cluster) + environments/dev-01-base/ (snapshot bucket)
Agent: infra-terraform
Depends on: Story 8 (pattern proven; independent of Kafka at runtime)
#
What to build
Snapshot bucket (`environments/dev-01-base/buckets.tf` — new file, or append to an existing one):
- Module call to `modules/gcs-bucket/` for `ume-opensearch-snapshots-poc-ume-data`: `versioning = false`.
- Lifecycle: delete objects older than 35 days.
- Expose in outputs as `opensearch_snapshots_bucket`.
OpenSearch GSA (`environments/dev-01-base/iam.tf`):
- `google_service_account.opensearch_snapshot` — `ume-opensearch-snapshot`.
- Bucket-scoped `roles/storage.objectAdmin` on the snapshot bucket.
- Workload Identity binding: `opensearch/opensearch-snapshot` KSA → `ume-opensearch-snapshot` GSA.
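A sketch of the IAM pieces; the snapshot-bucket module output name is hypothetical.

```hcl
resource "google_service_account" "opensearch_snapshot" {
  account_id   = "ume-opensearch-snapshot"
  display_name = "OpenSearch GCS snapshots"
}

resource "google_storage_bucket_iam_member" "opensearch_snapshot_object_admin" {
  bucket = module.opensearch_snapshots_bucket.name # hypothetical output name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.opensearch_snapshot.email}"
}

resource "google_service_account_iam_member" "opensearch_snapshot_wi" {
  service_account_id = google_service_account.opensearch_snapshot.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[opensearch/opensearch-snapshot]"
}
```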
Operator (`environments/dev-02-k8s-base/opensearch.tf` — new file):
- `kubernetes_namespace_v1.opensearch_operator` — `opensearch-operator` namespace.
- `helm_release.opensearch_operator` — chart `opensearch-operator` from `https://opensearch-project.github.io/opensearch-k8s-operator/`, pinned chart version.
- Values: operator pinned to workload-pool.
Cluster (`environments/dev-03-runtime/opensearch.tf` — new file):
- `kubernetes_namespace_v1.opensearch` — `opensearch` namespace.
- `kubernetes_service_account_v1.opensearch_snapshot` — with WI annotation.
- `OpenSearchCluster` CR via `kubernetes_manifest`:
  - 1 data node (also master-eligible), 512Mi JVM heap, 1 CPU, 1.5Gi memory request.
  - 5 GiB PD-SSD storage.
  - `nodeSelector: { pool: workload }`.
  - Security plugin disabled (dev only; Story 13 hardens with basic auth or mTLS).
- `SecretProviderClass` (CSI) — mounts the bucket name (not a secret, just config; optional, can use a direct env var).
- `kubernetes_manifest` ISM policy (JSON CRD) — delete indices > 30 days.
- `kubernetes_cron_job_v1.opensearch_snapshot` — daily at 04:00 UTC, runs `curl -XPUT opensearch-cluster/_snapshot/gcs_backup/$(date +%Y%m%d)`. Uses the `opensearch-snapshot` KSA (see the sketch after this list).
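The snapshot CronJob might look roughly like this; the image tag, in-cluster service DNS, and snapshot path are illustrative, and the `gcs_backup` repository must be registered before the first run.

```hcl
resource "kubernetes_cron_job_v1" "opensearch_snapshot" {
  metadata {
    name      = "opensearch-snapshot"
    namespace = kubernetes_namespace_v1.opensearch.metadata[0].name
  }
  spec {
    schedule = "0 4 * * *" # daily at 04:00 UTC
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            service_account_name = kubernetes_service_account_v1.opensearch_snapshot.metadata[0].name
            restart_policy       = "Never"
            container {
              name  = "snapshot"
              image = "curlimages/curl:8.5.0" # illustrative image/tag
              command = [
                "sh", "-c",
                "curl -sf -XPUT http://opensearch-cluster.opensearch.svc:9200/_snapshot/gcs_backup/$(date +%Y%m%d)",
              ]
            }
          }
        }
      }
    }
  }
}
```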
Alert (`environments/dev-02-k8s-base/alerts.tf`):
- Policy "OpenSearch PV > 70%" (namespace `opensearch`).
#
Design decisions
- Single data node in dev. 3-node minimum is a prod concern; dev can take unassigned-shard risk. Snapshots provide the durability backstop.
- OpenSearch 2.x. DataHub supports both ES 7.10+ and OS 2.x; OS has no license friction.
- GCS snapshots over cross-zone replication. Cheaper, simpler, and the ops story is clear (restore from snapshot).
- ISM + bucket lifecycle both. Indices deleted at 30 days inside OS; snapshots deleted at 35 days in GCS. Always a 5-day overlap for recovery.
- Security plugin off in dev. Keeps the story small. Story 13 re-evaluates.
#
What to verify
- `kubectl -n opensearch-operator get pods` shows operator Running.
- `kubectl -n opensearch get opensearchcluster` → `READY`.
- `kubectl -n opensearch get pods` shows 1 data node Running on workload-pool.
- `gsutil ls gs://ume-opensearch-snapshots-poc-ume-data/` (may be empty before first run).
- First CronJob run logs show a successful snapshot API call.
- ISM policy exists: `kubectl -n opensearch get opensearchismpolicy`.
#
Then
Story 11 deploys DataHub.
#
Story 11 — DataHub Dry-Run
Stack: environments/dev-03-runtime/ + new modules/datahub-helm/
Agent: datahub-platform
Depends on: Stories 6, 7, 9, 10
#
What to build
`modules/datahub-helm/` (new):
- Wraps the upstream `acryldata/datahub` chart. Verify latest chart version at story time (the `verify_versions` invariant).
- `main.tf` — namespace + KSA (no WI binding yet; ingestion adds it) + `SecretProviderClass` (CSI, syncs Secret Manager `datahub-db-password` → k8s Secret) + `helm_release`.
- Helm values set via module:
  - `datahub-gms.replicaCount`, `datahub-frontend.replicaCount`, `datahub-mae-consumer.replicaCount`, `datahub-mce-consumer.replicaCount` = 1 each.
  - All pod `nodeSelector: { pool: workload }`.
  - `global.sql.datasource`:
    - `host: <sql_private_ip>`
    - `hostForMysqlClient: <sql_private_ip>` (chart quirk; still set for postgres paths).
    - `port: 5432`
    - `database: datahub`
    - `url: jdbc:postgresql://<ip>:5432/datahub`
    - `driver: org.postgresql.Driver`
    - `username: datahub`
    - `extraEnvs: [{ name: DATAHUB_DB_PASSWORD, valueFrom: { secretKeyRef: { name: datahub-db-password, key: password } } }]`
  - `global.kafka.bootstrap.server: <kafka.bootstrap_servers>`.
  - `global.elasticsearch.host: opensearch-cluster.opensearch.svc`, `port: 9200`, `useSSL: false`, `skipcheck: true` (disables the X-Pack check since OS isn't ES).
  - `elasticsearchSetupJob.enabled: true` — creates DataHub indices.
  - `kafkaSetupJob.enabled: true` — creates DataHub topics.
- `variables.tf` — all knobs exposed (replicas, resources, versions, backing endpoints).
- `outputs.tf` — `namespace`, `release_name`, `frontend_service_name`.
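An abridged sketch of the values wiring inside the module; variable names such as `sql_private_ip` and `kafka_bootstrap_servers` are hypothetical, and both the key paths and the chart repository URL should be confirmed against the pinned chart version.

```hcl
resource "helm_release" "datahub" {
  name       = "datahub"
  namespace  = "datahub"
  repository = "https://helm.datahubproject.io/"
  chart      = "datahub"
  version    = var.chart_version

  values = [yamlencode({
    global = {
      sql = {
        datasource = {
          host               = var.sql_private_ip
          hostForMysqlClient = var.sql_private_ip
          port               = "5432"
          database           = "datahub"
          url                = "jdbc:postgresql://${var.sql_private_ip}:5432/datahub"
          driver             = "org.postgresql.Driver"
          username           = "datahub"
          # password is injected via the CSI-synced Secret / extraEnvs, as above
        }
      }
      kafka = { bootstrap = { server = var.kafka_bootstrap_servers } }
      elasticsearch = {
        host      = "opensearch-cluster.opensearch.svc"
        port      = "9200"
        useSSL    = "false"
        skipcheck = "true"
      }
    }
    elasticsearchSetupJob = { enabled = true }
    kafkaSetupJob         = { enabled = true }
  })]
}
```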
`environments/dev-03-runtime/datahub.tf` (new):
- `module "datahub"` call wiring remote_state refs from `dev-01-base` (SQL) and reading Kafka/OpenSearch service DNS directly (same cluster, well-known names).
`environments/dev-03-runtime/data.tf` — add outputs passthrough if needed.
No IAP yet. Verify via `kubectl port-forward svc/datahub-frontend 9002:9002 -n datahub`.
#
Design decisions
Canonical in plans/datahub-deployment-plan.md §7.
- Module over inline. Env-scoped resource, replicates to prod.
- CSI-synced k8s Secret for DB password. DataHub chart expects `secretKeyRef`; syncSecret fills it from Secret Manager.
- Port-forward verification step. No ingress wiring yet — Story 12 adds it. Keeps each PR small.
- `elasticsearch.skipcheck: true` — required when pointing DataHub at OpenSearch 2.x (the X-Pack check fails otherwise).
- No KSA → GSA WI binding yet. DataHub GMS does not make GCP API calls; ingestion recipes (in `ume-data-dags`) do. Adding the binding here would grant permissions nothing uses.
#
What to verify
- `kubectl -n datahub get pods` shows all DataHub pods Running, setup jobs Completed.
- `kubectl -n datahub logs deploy/datahub-gms` shows successful SQL connection, Kafka producer connected, OpenSearch client initialized.
- `kubectl port-forward -n datahub svc/datahub-frontend 9002:9002` + browser `http://localhost:9002` loads the UI.
- `datahub` DB schema populated: `gcloud sql connect ume-data-dev-airflow-pg --database=datahub --user=datahub` → `\dt` (read-only check — prohibited per session rules; instead verify via GMS logs).
- Kafka topics created: `kubectl exec -n kafka ume-data-dev-kafka-brokers-0 -- bin/kafka-topics.sh --list --bootstrap-server localhost:9092` lists `MetadataChangeLog_Versioned_v1` etc.
- OpenSearch indices created: visit `/_cat/indices` via port-forward.
#
Then
Story 12 wires IAP and public ingress.
#
Story 12 — DataHub IAP + HTTPRoute + OIDC Auth
Stack: environments/dev-03-runtime/ (update) + small modules/datahub-helm/ addition
Agent: datahub-platform
Depends on: Story 11
Status: DONE — see story-status.md for the post-mortem. First-admin bootstrap still manual (local datahub JAAS user); groups/policies-as-code lands in Story 13.
#
What to build
`modules/datahub-helm/` — add:
- `httproute_enabled`, `gateway_name`, `gateway_namespace`, `hostname` variables (match the `modules/airflow-helm/` surface).
- Optional `HTTPRoute` resource attached to the `datahub-frontend` Service on `:9002`.
- DataHub OIDC values passthrough (see "DataHub OIDC" below).
`environments/dev-03-runtime/datahub.tf` — extend the module call with HTTPRoute params + an iap-oauth module call (see the sketch after this list):
- `module "datahub_iap"` (new, uses `modules/iap-oauth/`):
  - `service_name = "datahub-frontend"`
  - `namespace = "datahub"`
  - `allowed_users = var.iap_allowed_users` (same list as Airflow initially).
`environments/dev-03-runtime/terraform.tfvars` — add `datahub_subdomain = "datahub"`.
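Roughly, the `datahub.tf` additions could look like this; the gateway and DNS variable names are hypothetical and should mirror whatever the Airflow wiring already uses.

```hcl
module "datahub" {
  source = "../../modules/datahub-helm"
  # ...Story 11 inputs unchanged...

  httproute_enabled = true
  gateway_name      = var.gateway_name      # hypothetical variable names
  gateway_namespace = var.gateway_namespace
  hostname          = "${var.datahub_subdomain}.${var.dns_domain}"
}

module "datahub_iap" {
  source        = "../../modules/iap-oauth"
  service_name  = "datahub-frontend"
  namespace     = "datahub"
  allowed_users = var.iap_allowed_users
}
```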
DataHub OIDC (in-app identity, not the perimeter):
IAP alone collapses to "all-admin or all-reader" — doesn't meet per-user / per-dataset stewardship. Keep IAP as the perimeter (who can reach the host) and layer DataHub OIDC inside it for in-app identity + roles.
Separate OAuth client from the IAP client, created on the same GCP OAuth consent screen.
- `clientId` / `clientSecret` land in Secret Manager and mount into `datahub-frontend` via Secrets Store CSI (Story 7 driver).
- Helm values on the frontend chart:
  - `authentication.enabled = true` / `authentication.provider = oidc`
  - `oidcAuthentication.discoveryUri = https://accounts.google.com/.well-known/openid-configuration`
  - `oidcAuthentication.userNameClaim = email`
  - `oidcAuthentication.scopes = "openid profile email"`
  - `oidcAuthentication.extractGroupsEnabled = false` (Phase 1 — see "Phased migration" below).
JIT user provisioning is on by default; a new Google account landing through IAP becomes a DataHub user record on first login.
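If the module passes these through as a values overlay, the shape might be roughly as follows, assuming the keys nest under the `datahub-frontend` subchart; confirm the exact key paths against the pinned chart version.

```hcl
locals {
  datahub_oidc_values = yamlencode({
    "datahub-frontend" = {
      authentication = { enabled = true, provider = "oidc" }
      oidcAuthentication = {
        discoveryUri         = "https://accounts.google.com/.well-known/openid-configuration"
        userNameClaim        = "email"
        scopes               = "openid profile email"
        extractGroupsEnabled = false
        # clientId / clientSecret come from the CSI-mounted Secret Manager entries
      }
    }
  })
}
```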
DataHub groups + policies bootstrap (idempotent, driven from a
DataHub policies-as-code file checked into this repo or ume-data-dags
— final home decided in Story 13 alongside ingestion recipes):
Groups (pre-create, membership managed by admins until Phase 2):
- `platform-admins`
- `data-stewards` (per-domain children: `finance-stewards`, `marketing-stewards`, …)
- `viewers`
Domains: one per business area. Each domain has an owner group from
data-stewards. Datasets join a domain via ingestion metadata (dbt
tags / BigQuery labels / source-system owners surfaced through the
recipe).
Policies (all bound to groups, never user URNs — see design decisions):
- Platform: `platform-admins` → `Admin` role.
- Platform: `data-stewards` → `Editor` role.
- Platform: `viewers` → `Reader` (or rely on the Reader default).
- Platform: `finance-stewards` → `Manage Domains` scoped to `urn:li:domain:finance` (templated per-domain via `for_each`; see the sketch after this list).
- Metadata: per-domain "edit metadata where domain=…", bound to the matching steward group.
- Platform: ingestion SA (Airflow) → `Manage Ingestion Sources` + `Manage Secrets`. Runs unattended; no human role.
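If the bootstrap ends up Terraform-rendered (Story 13 decides the final home and format), the per-domain templating could look roughly like this; the policy object shape below is illustrative, not the DataHub API schema.

```hcl
locals {
  steward_domains = {
    finance   = "finance-stewards"
    marketing = "marketing-stewards"
  }

  domain_policies = {
    for domain, group in local.steward_domains : domain => {
      name       = "${group}-manage-domain"
      actors     = { groups = ["urn:li:corpGroup:${group}"] }
      privileges = ["MANAGE_DOMAINS"] # illustrative privilege name
      resources  = { domain = "urn:li:domain:${domain}" }
    }
  }
}

resource "local_file" "datahub_domain_policies" {
  filename = "${path.module}/generated/datahub-domain-policies.yaml"
  content  = yamlencode(local.domain_policies)
}
```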
#
Access control model
Stewardship on a specific dataset is Ownership of that entity with
ownershipType = DATA_STEWARD — DataHub has no global "Steward" role.
The global Editor role just gates who can propose edits at all;
ownership gates which assets they can touch.
Ownership on a dataset is assigned by (a) admins via UI, (b) domain
owners within their scope, (c) ingestion recipes carrying owners
metadata. (c) is the scalable path — don't expect to click-assign
owners on hundreds of datasets.
#
Phased migration to Workspace groups
Google's accounts.google.com OIDC issuer has a fixed claim set — no
custom per-user claims, and no groups claim outside Workspace. Plan
around that.
Phase 1 (now, no Workspace access):
- DataHub OIDC → Google; user identity via `email`.
- Admins manually add users to the Phase-1 groups on first login.
- Policies + domains + ownerships already bind to groups, so the Phase-1 work is throwaway-free.
Phase 2 (Workspace access returned):
- Recreate the same group names as Google Groups under the Workspace domain (`platform-admins@…`, `finance-stewards@…`, …).
- Flip `oidcAuthentication.extractGroupsEnabled = true` and set `oidcAuthentication.groupsClaimName = groups`. DataHub syncs group membership on each login.
- Optional cleanup: remove the Phase-1 manual group memberships (dual membership is harmless during transition).
- No policy rewrites — because nothing binds to user URNs.
#
Design decisions
- Reuse `modules/iap-oauth/` verbatim for the perimeter. Confirmed working for Airflow; parameterized per service.
- Same IAP allow-list initially. Expand in tfvars when needed.
- Wildcard cert already covers `datahub.umedev.marpont.es`. No Certificate Manager changes.
- IAP at perimeter + DataHub OIDC inside. IAP alone is binary; DataHub's role+policy+ownership layer does per-user and per-dataset work.
- Bind every policy to a group, never to a user URN. Phase-2 migration to Workspace groups becomes a rename, not a rewrite.
- Stewardship = Ownership on entity + Editor role, not a global role. Matches DataHub's data model and makes domain-based delegation natural.
- Policies as code, not click-ops. The group/domain/policy bootstrap lives in a checked-in config so Phase 2 and prod rebuilds are deterministic. Exact location (this repo vs `ume-data-dags`) decided in Story 13 when ingestion recipes land.
#
What to verify
- `kubectl -n datahub get httproute` shows `datahub` accepted.
- `kubectl -n datahub describe gcpbackendpolicy datahub-frontend-iap` → `Attached`.
- `gcloud compute backend-services list --format='table(name,iap.enabled)'` shows `iap.enabled = True` on the DataHub backend.
- `curl -sI http://datahub.umedev.marpont.es/` → 301 to https.
- `curl -sI https://datahub.umedev.marpont.es/` → 302 to `accounts.google.com`.
- Browser sign-in as an allow-listed user lands on the DataHub UI.
- DataHub `/login` shows "Sign in with Google" after OIDC config applies.
- A non-allowlisted Google account hits IAP 403 before reaching DataHub (perimeter works independently of DataHub OIDC).
- Allowlisted user signs in → DataHub user record auto-created with their email as `userName`.
- Admin user can reach `/settings/policies` and create a policy; non-admin user gets 403 on the same path.
- Steward user can edit a tag on a dataset inside their domain; cannot edit a tag on a dataset outside it.
#
Then
Story 13 hardens cost + ops and finalizes where the policies-as-code bootstrap lives.
#
Story 13 — Cost + Operations Hardening
Stacks: all dev stacks + ingestion cross-repo coordination
Agent: infra-terraform + docs-infra
Depends on: Story 12
#
What to build
- Label audit across all Terraform-managed resources (fail CI if labels missing).
- Budget alerts at 50 / 80 / 100% of the target in Cloud Billing (see the sketch after this list).
- PDB verification: simulate a node drain on workload-pool, confirm DataHub, Kafka, OpenSearch survive.
- Maintenance window verification on the GKE cluster and Cloud SQL instance.
- Ingestion DAGs added to `ume-data-dags` (BigQuery, Airflow, dbt) — cross-repo work, tracked here as coordination.
- Consider re-enabling OpenSearch security plugin with basic auth backed by Secret Manager CSI.
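The budget piece might look roughly like this; the billing account, target amount, and notification channel are placeholders.

```hcl
resource "google_billing_budget" "dev_target" {
  billing_account = var.billing_account_id
  display_name    = "ume-data dev monthly target"

  budget_filter {
    projects = ["projects/${var.project_number}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "300" # illustrative monthly target
    }
  }

  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }

  all_updates_rule {
    monitoring_notification_channels = [var.budget_notification_channel]
  }
}
```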
#
What to verify
- CI label-lint passes on every stack.
- Budget alert emails received at 50%.
- `kubectl drain <workload-pool-node>` — no DataHub/Kafka/OS service disruption.
- Runbook entry for at least one end-to-end recovery scenario merged.
#
Monthly Cost Summary
#
Phase 1 — Airflow only (~$81/mo)
#
Phase 2 — Add DataHub (~$200-310/mo incremental, depending on autoscaler)
Savings vs original plan: ~$100/mo by reusing Cloud SQL + dropping the Auth Proxy sidecars + dev-sized Kafka (2 brokers vs 3, no Cruise Control) + single-node OpenSearch.
Note: GKE free tier covers one zonal cluster. Regional cluster in prod costs an additional ~$74/mo.
#
After Phase 2
Once all stories are completed and verified on dev:
- Review lessons learned. Update docs where reality diverged from plan.
- Provision prod GCP projects (externally, by org admin).
- Create `prod-01-base`, `prod-02-runtime` stacks (mirror dev structure, different `terraform.tfvars`).
- Execute Phase 2 stories against prod, with the GitHub Environment approval gate.
- Promote the dev-validated custom Airflow image tag to prod.
- Enable DataHub ingestion recipes against prod BigQuery datasets.