# Terraform Structure
This section is the canonical reference for how the ume-data-infra repository is organized. Agents and engineers should read this before proposing any Terraform changes.
## Repository Layout
```
ume-data-infra/
├── .github/
│   └── workflows/
│       ├── terraform-plan.yml
│       ├── terraform-apply.yml
│       └── terraform-drift.yml
├── scripts/
│   └── detect-changed-stacks.sh     # Identifies which stacks changed in a PR
├── layers/                          # Global, applied once across the org
│   ├── 00-bootstrap/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── backend.hcl
│   └── 10-platform-shared/          # Phase 2 — created when cross-env resources appear
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── versions.tf
│       └── backend.hcl
├── environments/                    # Per environment
│   ├── dev-01-base/                 # Phase 1 — pure GCP, no k8s providers
│   │   ├── networking.tf
│   │   ├── gke.tf
│   │   ├── cloud-sql.tf
│   │   ├── iam.tf                   # Airflow SAs, WI bindings, SQL IAM user
│   │   ├── dns.tf                   # DNS zone, shared ingress IP, wildcard A (Story 4c / PR 3a)
│   │   ├── certificate.tf           # Certificate Manager wildcard cert + map (Story 4c / PR 3a)
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── locals.tf                # Standard labels, naming
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── dev-02-k8s-base/             # Phase 1 — Kubernetes platform layer (Story 4c / PR 3b.2)
│   │   ├── gateway.tf               # Shared GKE Gateway + redirect HTTPRoute
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf              # google + kubernetes + helm providers
│   │   ├── locals.tf
│   │   ├── data.tf                  # Reads dev-01-base remote state
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── dev-03-runtime/              # Phase 1 (Airflow), Phase 2 (+ DataHub)
│   │   ├── airflow.tf
│   │   ├── iap.tf                   # Per-app IAP module call (Story 4c / PR 3c)
│   │   ├── buckets.tf               # GCS log + DAG buckets (via modules/gcs-bucket)
│   │   ├── datahub.tf               # Added in Phase 2
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── locals.tf
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── prod-01-base/
│   ├── prod-02-k8s-base/
│   └── prod-03-runtime/
├── resources/                       # DAGs, dbt, Docker (temp; ports to own repo)
│   ├── docker/
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   ├── dags/
│   ├── dbt/
│   └── scripts/
│       └── build-image.sh
└── modules/
    ├── vpc/
    ├── gke-standard/
    ├── cloud-sql-postgres/
    ├── gcs-bucket/                  # Story 4a
    ├── airflow-helm/                # Story 4b (evaluate)
    ├── datahub-helm/                # Phase 2
    ├── artifact-registry/
    ├── wif-pool/
    └── iam-grants/
```
## Layers vs Environments

### Layers (`layers/`)
Layers are global resources applied once, independent of any specific environment. They set up the shared foundation that all environments depend on.
Key constraint: `00-bootstrap` is bootstrapped with a local backend (`terraform init` without `-backend-config`), then migrated to GCS after the state bucket exists. This is a one-time manual operation documented in Deployment Stories.
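The bootstrap-then-migrate flow can be sketched as follows (a sketch only — it assumes the backend block lives in `versions.tf`; the exact file placement may differ):

```hcl
# Step 1 — no backend block yet: terraform init / terraform apply run with
# local state, creating the GCS state bucket among other bootstrap resources.

# Step 2 — add the (initially empty) backend block:
terraform {
  backend "gcs" {}
}

# Step 3 — migrate the local state into the new bucket:
#   terraform init -backend-config=backend.hcl -migrate-state
```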
### Environments (`environments/`)
Environments contain per-environment stacks. Each stack is an independent Terraform root module with its own state.
The three-stack split reflects lifecycle differences and provider boundaries:
- `01-base` is pure GCP (no k8s providers) — survives months untouched; changes rarely.
- `02-k8s-base` is the first layer that pulls in the `kubernetes` + `helm` providers; contains environment-singleton k8s platform infra shared across all apps (Gateway, future cert-manager add-ons, observability).
- `03-runtime` is per-app workloads; churns with image bumps and DAG syncs.

`dev-03-runtime` was renamed from `dev-02-runtime` in Story 4c when the k8s-base layer was introduced.
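The provider boundary between the first two stacks can be sketched as follows — `02-k8s-base` configures its `kubernetes` provider from `dev-01-base` outputs (output names taken from the remote-state example later in this document; the actual wiring lives in `data.tf` / `versions.tf` and may differ):

```hcl
# Sketch — reads the cluster connection details exported by dev-01-base.
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "environments/dev-01-base"
  }
}

# Short-lived access token from the ambient Google credentials (WIF in CI).
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
  cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
  token                  = data.google_client_config.default.access_token
}
```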
## Modules

### Module strategy
Local modules are the standard pattern for environment-scoped resources. The justification is forward-looking: if a resource will be replicated to prod, it gets a module now — don't wait for a second caller to exist, because prod is already planned.
Why modules from the start:
- Reusability — prod calls the same module with different parameters instead of copying hundreds of lines.
- Encapsulation — naming prefixes (`ume-data-{env}-{purpose}`), label merges, and security defaults are written once inside the module, not repeated in every resource.
- Conciseness — the environment stack becomes a short module call that shows only what matters for that environment, making differences between dev and prod visible at a glance.
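As an illustration of the conciseness point, a dev stack's module call might look like this (the variable names here are hypothetical, not the actual `gke-standard` interface):

```hcl
module "gke" {
  source      = "../../modules/gke-standard"
  environment = "dev"
  name_prefix = "ume-data"

  # Dev-only overrides; prod passes different values for the same variables.
  machine_type = "e2-standard-4"   # hypothetical variable
  node_count   = 2                 # hypothetical variable
}
```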
Design guidelines:
- Expose all configurable settings as variables with sensible defaults. Different environments have different needs. Auto-repair, auto-upgrade, machine types, node counts, maintenance windows — all exposed. Don't force callers into the module code to change a setting.
- Modules encapsulate repetition. The caller passes `environment = "dev"` and `name_prefix = "ume-data"` once; the module computes `ume-data-dev-gke`, applies labels to every resource, and enforces security baselines.
- Use upstream `terraform-google-modules/*` when they add value. Evaluate per module. If the upstream handles complexity we'd otherwise reimplement (VPC peering edge cases, NAT quirks), wrap it inside our local module. If our needs are straightforward (a GKE cluster with a few pools), direct resources inside our module are cleaner. Either way, the caller's interface stays the same — switching from direct resources to upstream is an internal refactor.
- Layer-scoped one-offs stay direct. Bootstrap (state bucket, WIF, AR) is applied once and never replicated across environments. Direct resources are fine there.
Module sources are local paths in wave-1 (e.g., `source = "../../modules/gke-standard"`). Before prod rollout, evaluate migrating to tagged git refs for pinning.
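If that migration happens, callers would switch to Terraform's git source syntax, along these lines (repository URL and tag are hypothetical):

```hcl
# Local path (wave-1):
# source = "../../modules/gke-standard"

# Tagged git ref (post-prod-rollout option) —
# note the double slash separating the repo from the module subdirectory:
# source = "git::https://github.com/example-org/ume-data-infra.git//modules/gke-standard?ref=v1.0.0"
```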
### Module catalog
Inlined in `{env}-02-k8s-base/gateway.tf` (pending module extraction): the shared Gateway + redirect HTTPRoute are direct `kubernetes_manifest` resources rather than a module today — singleton per env and five resources total, so flat is clearer. Extract into `modules/gke-gateway/` when prod adds a second Gateway.

Strimzi and OpenSearch are consumed as upstream Helm charts directly in `{env}-02-k8s-base/` (Phase 2). They will be wrapped into local modules when environment replication begins.
## State Management

### Backend configuration
Every stack has a `backend.hcl` file:

```hcl
bucket = "ume-tf-state-poc-ume-data"
prefix = "environments/dev-01-base"
```

Applied via: `terraform init -backend-config=backend.hcl`
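This works because each stack declares an empty `gcs` backend block that the `-backend-config` file completes (a sketch, assuming the block lives in `versions.tf`):

```hcl
# versions.tf — bucket and prefix are deliberately omitted here;
# terraform init -backend-config=backend.hcl supplies them.
terraform {
  backend "gcs" {}
}
```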
### State layout in GCS
```
gs://ume-tf-state-poc-ume-data/
  layers/00-bootstrap/default.tfstate
  layers/10-platform-shared/default.tfstate      # Phase 2
  environments/dev-01-base/default.tfstate
  environments/dev-02-k8s-base/default.tfstate
  environments/dev-03-runtime/default.tfstate
```
The environments/dev-02-runtime/ prefix from before the Story 4c rename is retained under GCS versioning until a cleanup window; it is no longer read or written.
Prod will use a separate state bucket in the shared-services project when it exists.
### Locking
GCS backend provides native state locking via object generation numbers. No external locking service (e.g., DynamoDB) is needed.
### State access
- `tf-plan-sa` has `roles/storage.objectViewer` on the state bucket (read-only for plans).
- `tf-apply-sa` has `roles/storage.objectAdmin` on the state bucket (read-write for applies).
## Inter-Stack Contracts
Stacks communicate exclusively via `terraform_remote_state` data sources. No output value is duplicated into another stack's tfvars.

Example: `dev-03-runtime` reads from `dev-01-base`:
```hcl
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "environments/dev-01-base"
  }
}

# Usage
locals {
  gke_cluster_name = data.terraform_remote_state.base.outputs.gke_cluster_name
  gke_endpoint     = data.terraform_remote_state.base.outputs.gke_endpoint
  gke_ca_cert      = data.terraform_remote_state.base.outputs.gke_ca_cert
  vpc_id           = data.terraform_remote_state.base.outputs.vpc_id
}
```
### Required outputs per stack
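For `dev-01-base`, the outputs consumed downstream (per the remote-state example above) would be declared along these lines — a sketch only; the resource addresses are hypothetical:

```hcl
# outputs.tf in dev-01-base — resource names are illustrative.
output "gke_cluster_name" {
  value = google_container_cluster.main.name
}

output "gke_endpoint" {
  value = google_container_cluster.main.endpoint
}

output "gke_ca_cert" {
  value = google_container_cluster.main.master_auth[0].cluster_ca_certificate
}

output "vpc_id" {
  value = google_compute_network.main.id
}
```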
## Label Conventions
Every Terraform-managed GCP resource must carry a standard set of labels, defined in each stack's `locals.tf`:
```hcl
locals {
  common_labels = {
    env         = var.environment
    layer       = "base"
    owner       = "platform-team"
    cost_center = "data-platform"
  }
}
```
Individual resources merge `local.common_labels` with a service-specific label:

```hcl
labels = merge(local.common_labels, { service = "gke" })
```
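In context, the merge looks like this on a full resource (bucket name, variables, and the `airflow` service label are hypothetical):

```hcl
resource "google_storage_bucket" "airflow_logs" {
  name     = "${var.name_prefix}-${var.environment}-airflow-logs"  # hypothetical naming
  location = var.region

  # Standard labels plus the per-service discriminator.
  labels = merge(local.common_labels, { service = "airflow" })
}
```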
CI will lint for missing labels via a pre-plan check (documented in CI/CD).
## Security
Security is embedded in the Terraform structure, not bolted on. This section covers the security posture enforced by the infrastructure code.
### Identity and access

#### Workload Identity Federation (WIF)
CI/CD authenticates to GCP via WIF. No service-account key files exist anywhere in the system.
- WIF pool + provider are created in `00-bootstrap`, trusting the `ume-data-infra` GitHub repository.
- Attribute conditions restrict which repo, branch, and workflow can assume which service account.
- Known gotcha: if the GitHub repo moves orgs or is renamed, the WIF provider's `attribute_condition` must be updated. Document this as a runbook in Operations.
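The repo-scoping condition sits on the provider resource, roughly like this (pool/provider IDs and the org name are hypothetical):

```hcl
resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = "ci-pool"       # hypothetical
  workload_identity_pool_provider_id = "github-oidc"   # hypothetical

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # This string must change if the repo is renamed or moves orgs.
  attribute_condition = "assertion.repository == \"example-org/ume-data-infra\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}
```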
#### Service accounts
Least-privilege: each SA gets only the roles it needs. Broad roles like `roles/owner` or `roles/editor` are granted only to `tf-apply-sa` and scoped to the project level.
#### Cloud SQL IAM authentication
Service accounts (Airflow SA, DataHub SA) authenticate to Cloud SQL via IAM — no passwords for programmatic access. A break-glass admin password is stored in Secret Manager for manual intervention.
#### Secrets management
- Source of truth: Google Secret Manager.
- Terraform's role: create the secret resource (name + IAM bindings). Never store secret values in Terraform state or code. Values are populated out-of-band via `gcloud secrets versions add`.
- In-cluster injection: the Secret Manager CSI driver mounts secrets as files into pods.
- Rotation: documented runbook; not automated in wave-1.
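The division of labor can be sketched as follows (secret name and the SA-email variable are hypothetical):

```hcl
# Terraform creates the container and access bindings — never the value.
resource "google_secret_manager_secret" "airflow_conn" {
  secret_id = "airflow-connections"   # hypothetical name

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_iam_member" "airflow_reader" {
  secret_id = google_secret_manager_secret.airflow_conn.id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${var.airflow_sa_email}"   # hypothetical variable
}

# The value is added out-of-band, e.g.:
#   gcloud secrets versions add airflow-connections --data-file=./conn.json
```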
### Network security
- Private GKE nodes: no public IPs on nodes; outbound via Cloud NAT.
- Private Cloud SQL: accessible only via Private Service Access within the VPC.
- GKE authorized networks: API server restricted to CI runner IPs and operator IPs.
- No public Kafka or OpenSearch endpoints: these services are cluster-internal only.
- GKE Ingress (GCLB): the only public-facing endpoint; protected by TLS (Certificate Manager) and optionally by IAP.
### Encryption
- At rest: GCP default encryption (Google-managed keys). Optionally CMEK via the KMS keyring in `10-platform-shared` for Cloud SQL and GCS buckets if compliance requires it.
- In transit: TLS everywhere. GCLB terminates TLS at the load balancer; in-cluster traffic uses GKE's built-in mTLS via Workload Identity certificates where applicable.
## Naming Conventions
Environment-scoped resources use the `ume-data-{env}-{purpose}` prefix to distinguish data platform infrastructure from other resources in shared GCP projects. Global resources (bootstrap layer) keep the shorter `ume-` prefix since they are org-wide.
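A minimal sketch of how a stack's `locals.tf` might centralize that prefix (illustrative, not the actual file contents):

```hcl
locals {
  # ume-data-dev, ume-data-prod, …
  name_prefix = "ume-data-${var.environment}"
}

# Callers then derive names such as:
#   "${local.name_prefix}-gke"      → ume-data-dev-gke
#   "${local.name_prefix}-airflow"  → ume-data-dev-airflow
```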