# Terraform Structure

This section is the canonical reference for how the ume-data-infra repository is organized. Agents and engineers should read this before proposing any Terraform changes.

## Repository Layout

```
ume-data-infra/
├── .github/
│   └── workflows/
│       ├── terraform-plan.yml
│       ├── terraform-apply.yml
│       └── terraform-drift.yml
├── scripts/
│   └── detect-changed-stacks.sh       # Identifies which stacks changed in a PR
├── layers/                             # Global, applied once across the org
│   ├── 00-bootstrap/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── backend.hcl
│   └── 10-platform-shared/             # Phase 2 — created when cross-env resources appear
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── versions.tf
│       └── backend.hcl
├── environments/                       # Per environment
│   ├── dev-01-base/                    # Phase 1 — pure GCP, no k8s providers
│   │   ├── networking.tf
│   │   ├── gke.tf
│   │   ├── cloud-sql.tf
│   │   ├── iam.tf                      # Airflow SAs, WI bindings, SQL IAM user
│   │   ├── dns.tf                      # DNS zone, shared ingress IP, wildcard A (Story 4c / PR 3a)
│   │   ├── certificate.tf              # Certificate Manager wildcard cert + map (Story 4c / PR 3a)
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── locals.tf                   # Standard labels, naming
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── dev-02-k8s-base/                # Phase 1 — Kubernetes platform layer (Story 4c / PR 3b.2)
│   │   ├── gateway.tf                  # Shared GKE Gateway + redirect HTTPRoute
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf                 # google + kubernetes + helm providers
│   │   ├── locals.tf
│   │   ├── data.tf                     # Reads dev-01-base remote state
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── dev-03-runtime/                 # Phase 1 (Airflow), Phase 2 (+ DataHub)
│   │   ├── airflow.tf
│   │   ├── iap.tf                      # Per-app IAP module call (Story 4c / PR 3c)
│   │   ├── buckets.tf                  # GCS log + DAG buckets (via modules/gcs-bucket)
│   │   ├── datahub.tf                  # Added in Phase 2
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── locals.tf
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── prod-01-base/
│   ├── prod-02-k8s-base/
│   └── prod-03-runtime/
├── resources/                          # DAGs, dbt, Docker (temporary; will move to its own repo)
│   ├── docker/
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   ├── dags/
│   ├── dbt/
│   └── scripts/
│       └── build-image.sh
└── modules/
    ├── vpc/
    ├── gke-standard/
    ├── cloud-sql-postgres/
    ├── gcs-bucket/                     # Story 4a
    ├── airflow-helm/                   # Story 4b (evaluate)
    ├── datahub-helm/                   # Phase 2
    ├── artifact-registry/
    ├── wif-pool/
    └── iam-grants/
```

## Layers vs Environments

### Layers (layers/)

Layers are global resources applied once, independent of any specific environment. They set up the shared foundation that all environments depend on.

| Layer | Purpose | Apply frequency |
|---|---|---|
| 00-bootstrap | Terraform state bucket, Artifact Registry, Workload Identity Federation pool + provider, CI runner service account, GCP API enablement | Once at repo inception; rarely changes |
| 10-platform-shared | Cross-environment resources: datahub-sa, KMS, secrets, logging (Phase 2). Airflow SAs live in {env}-01-base because they are environment-scoped. | Phase 2, when cross-env resources appear |

Key constraint: 00-bootstrap is bootstrapped with a local backend (terraform init without -backend-config), then migrated to GCS after the state bucket exists. This is a one-time manual operation documented in Deployment Stories.
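A minimal sketch of what the two-step flow implies for 00-bootstrap's Terraform configuration, assuming the standard partial-configuration pattern (the exact file contents and version pin are not specified in this document):

```hcl
# Hedged sketch of layers/00-bootstrap/versions.tf after migration.
# Step 1: with no backend block present, `terraform init && terraform apply`
#         creates the state bucket using a local state file.
# Step 2: add this empty gcs backend block, then run
#         `terraform init -backend-config=backend.hcl -migrate-state`.
terraform {
  required_version = ">= 1.5"   # assumed; not stated in this doc

  backend "gcs" {}              # bucket/prefix supplied via backend.hcl
}
```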

### Environments (environments/)

Environments contain per-environment stacks. Each stack is an independent Terraform root module with its own state.

| Stack | Contents | Lifecycle | Phase |
|---|---|---|---|
| {env}-01-base | VPC, GKE cluster, Cloud SQL, Airflow SAs + Workload Identity bindings, DNS zone, shared ingress IP, wildcard TLS cert (Certificate Manager) | Slow-changing structural infra | 1 |
| {env}-02-k8s-base | Shared GKE Gateway + redirect HTTPRoute; platform-level k8s singletons (future: Managed Prometheus, Secret Manager CSI, Strimzi, OpenSearch operators) | Changes when platform tooling is added | 1 (Gateway), 2 (more) |
| {env}-03-runtime | GCS buckets, Airflow Helm release, per-app IAP wiring, per-app HTTPRoute (Phase 1); DataHub Helm release + its IAP + HTTPRoute (Phase 2) | Changes when workloads are deployed or updated | 1 (Airflow), 2 (DataHub) |

The three-stack split reflects lifecycle differences and provider boundaries:

  • 01-base is pure GCP (no k8s providers) — survives months untouched; changes rarely.
  • 02-k8s-base is the first layer that pulls in kubernetes + helm providers; contains environment-singleton k8s platform infra shared across all apps (Gateway, future cert manager add-ons, observability).
  • 03-runtime is per-app workloads; churns with image bumps and DAG syncs.

Dev-03-runtime was renamed from dev-02-runtime in Story 4c when the k8s-base layer was introduced.

## Modules

### Module strategy

Local modules are the standard pattern for environment-scoped resources. The rationale is forward-looking: if a resource will be replicated to prod, it gets a module now rather than waiting for a second caller to exist; prod is planned, and that is justification enough.

Why modules from the start:

  1. Reusability — prod calls the same module with different parameters instead of copying hundreds of lines
  2. Encapsulation — naming prefixes (ume-data-{env}-{purpose}), label merges, and security defaults are written once inside the module, not repeated in every resource
  3. Conciseness — the environment stack becomes a short module call that shows only what matters for that environment, making differences between dev and prod visible at a glance

Design guidelines:

  1. Expose all configurable settings as variables with sensible defaults. Different environments have different needs. Auto-repair, auto-upgrade, machine types, node counts, maintenance windows — all exposed. Don't force callers into the module code to change a setting.
  2. Modules encapsulate repetition. The caller passes environment = "dev" and name_prefix = "ume-data" once; the module computes ume-data-dev-gke, applies labels to every resource, and enforces security baselines.
  3. Upstream terraform-google-modules/* when they add value. Evaluate per-module. If the upstream handles complexity we'd otherwise reimplement (VPC peering edge cases, NAT quirks), wrap it inside our local module. If our needs are straightforward (GKE cluster with a few pools), direct resources inside our module are cleaner. Either way, the caller's interface stays the same — switching from direct resources to upstream is an internal refactor.
  4. Layer-scoped one-offs stay direct. Bootstrap (state bucket, WIF, AR) is applied once and never replicated across environments. Direct resources are fine there.

Module sources are local paths in wave-1 (e.g., source = "../../modules/gke-standard"). Before prod rollout, evaluate migrating to tagged git refs for pinning.
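Concretely, a module call in an environment stack might look like the following sketch. Only environment and name_prefix come from this document; the remaining variable names are illustrative, not the module's actual interface.

```hcl
# Hedged sketch of a call in dev-01-base/gke.tf; prod would call the same
# module with larger values, making the dev/prod delta visible at a glance.
module "gke" {
  source = "../../modules/gke-standard"

  environment = "dev"
  name_prefix = "ume-data"        # module computes ume-data-dev-gke

  # Illustrative dev-sized knobs; exact names are assumptions.
  labels       = local.common_labels
  node_count   = 1
  machine_type = "e2-standard-4"
}
```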

### Module catalog

| Module | Implementation | Used by | Status |
|---|---|---|---|
| modules/gke-standard/ | Direct resources (google_container_cluster, google_container_node_pool, Gateway API enablement) | {env}-01-base/gke.tf | Created (Story 3d) |
| modules/vpc/ | Direct resources (google_compute_network, subnet, Cloud Router, Cloud NAT) | {env}-01-base/networking.tf | Created (Story 3d follow-up) |
| modules/cloud-sql-postgres/ | Direct resources (google_sql_database_instance, PSA, database, Secret Manager) | {env}-01-base/cloud-sql.tf | Created (Story 3d follow-up) |
| modules/gcs-bucket/ | Direct resources (google_storage_bucket) with lifecycle rules, labels, uniform access | {env}-03-runtime/buckets.tf | Created (Story 4a) |
| modules/airflow-helm/ | helm_release + values templating + HTTPRoute (WI, logging, Cloud SQL Auth Proxy sidecar, GCS FUSE) | {env}-03-runtime/airflow.tf | Created (Story 4b; HTTPRoute added Story 4c) |
| modules/iap-oauth/ | Per-service IAP wiring: google_iap_client, k8s Secret with OAuth creds, GCPBackendPolicy on the target Service, roles/iap.httpsResourceAccessor bindings | {env}-03-runtime/iap.tf (per app) | Created (Story 4c) |
| modules/datahub-helm/ | helm_release + values templating + HTTPRoute | {env}-03-runtime/datahub.tf | Phase 2 |

Inlined in {env}-02-k8s-base/gateway.tf (pending module extraction): the shared Gateway + redirect HTTPRoute are direct kubernetes_manifest resources rather than a module today — singleton per env and five resources total, so flat is clearer. Extract into modules/gke-gateway/ when prod adds a second Gateway.
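For orientation, one of those inlined resources might look roughly like the sketch below. The manifest field values are assumptions derived from the naming conventions in this document, not the repo's actual gateway.tf; the GatewayClass name is the standard GKE global external one.

```hcl
# Illustrative shape of the inlined shared Gateway in gateway.tf.
resource "kubernetes_manifest" "gateway" {
  manifest = {
    apiVersion = "gateway.networking.k8s.io/v1"
    kind       = "Gateway"
    metadata = {
      name      = "ume-data-dev-gateway"   # per the naming table
      namespace = "ume-data-dev-gateway"
    }
    spec = {
      gatewayClassName = "gke-l7-global-external-managed"
      listeners = [
        { name = "https", port = 443, protocol = "HTTPS" },
      ]
    }
  }
}
```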

Strimzi and OpenSearch are consumed as upstream Helm charts directly in {env}-02-k8s-base/ (Phase 2). They will be wrapped into local modules when environment replication begins.

## State Management

### Backend configuration

Every stack has a backend.hcl file:

```hcl
bucket = "ume-tf-state-poc-ume-data"
prefix = "environments/dev-01-base"
```

Applied via: terraform init -backend-config=backend.hcl

### State layout in GCS

```
gs://ume-tf-state-poc-ume-data/
  layers/00-bootstrap/default.tfstate
  layers/10-platform-shared/default.tfstate      # Phase 2
  environments/dev-01-base/default.tfstate
  environments/dev-02-k8s-base/default.tfstate
  environments/dev-03-runtime/default.tfstate
```

The environments/dev-02-runtime/ prefix from before the Story 4c rename is retained under GCS versioning until a cleanup window; it is no longer read or written.

Prod will use a separate state bucket in the shared-services project when it exists.

### Locking

GCS backend provides native state locking via object generation numbers. No external locking service (e.g., DynamoDB) is needed.

### State access

  • tf-plan-sa has roles/storage.objectViewer on the state bucket (read-only for plans).
  • tf-apply-sa has roles/storage.objectAdmin on the state bucket (read-write for applies).

## Inter-Stack Contracts

Stacks communicate exclusively via terraform_remote_state data sources. No output value is duplicated into another stack's tfvars.

Example: dev-02-k8s-base reads from dev-01-base:

```hcl
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "environments/dev-01-base"
  }
}

# Usage
locals {
  gke_cluster_name = data.terraform_remote_state.base.outputs.gke_cluster_name
  gke_endpoint     = data.terraform_remote_state.base.outputs.gke_endpoint
  gke_ca_cert      = data.terraform_remote_state.base.outputs.gke_ca_cert
  vpc_id           = data.terraform_remote_state.base.outputs.vpc_id
}

### Required outputs per stack

| Stack | Must export |
|---|---|
| 00-bootstrap | state_bucket_name, artifact_registry_url, wif_pool_name, wif_provider_name |
| 10-platform-shared | shared_kms_keyring, logging_sink_id, datahub_sa_email (Phase 2 — layer created then) |
| {env}-01-base | vpc_id, vpc_self_link, gke_cluster_name, gke_endpoint, gke_ca_cert, sql_connection_name, sql_private_ip, sql_instance_name, airflow_sa_email, airflow_kpo_sa_email, domain_name, dns_zone_name, dns_zone_nameservers, ingress_ip_name, ingress_ip_address, certificate_map_name |
| {env}-02-k8s-base | gateway_name, gateway_namespace |
| {env}-03-runtime | airflow_namespace, airflow_logs_bucket, airflow_dags_bucket, airflow_url, iap_client_id (Phase 1); add datahub_url, datahub_iap_client_id in Phase 2 |
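The producer side of this contract is a plain outputs.tf in each stack. A hedged fragment for dev-01-base, exporting two of the required outputs (the module output names on the right-hand side are assumptions about the local modules' interfaces):

```hcl
# Sketch of environments/dev-01-base/outputs.tf.
output "gke_cluster_name" {
  value = module.gke.cluster_name   # assumes the gke-standard module exposes this
}

output "vpc_id" {
  value = module.vpc.id             # assumes the vpc module exposes this
}
```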

## Label Conventions

Every Terraform-managed GCP resource must carry these labels:

| Label | Example value | Purpose |
|---|---|---|
| env | dev, prod | Environment |
| layer | bootstrap, base, k8s-base, runtime | Which stack manages this resource |
| service | gke, airflow, datahub, kafka, opensearch, cloudsql | Logical service |
| owner | platform-team | Responsible team |
| cost_center | data-platform | Cost attribution group |

Labels are defined in each stack's locals.tf:

```hcl
locals {
  common_labels = {
    env         = var.environment
    layer       = "base"
    owner       = "platform-team"
    cost_center = "data-platform"
  }
}
```

Individual resources merge local.common_labels with a service-specific label:

```hcl
labels = merge(local.common_labels, { service = "gke" })
```

CI will lint for missing labels via a pre-plan check (documented in CI/CD).

## Security

Security is embedded in the Terraform structure, not bolted on. This section covers the security posture enforced by the infrastructure code.

### Identity and access

#### Workload Identity Federation (WIF)

CI/CD authenticates to GCP via WIF. No service-account key files exist anywhere in the system.

  • WIF pool + provider are created in 00-bootstrap, trusting the ume-data-infra GitHub repository.
  • Attribute conditions restrict which repo, branch, and workflow can assume which service account.
  • Known gotcha: if the GitHub repo moves orgs or is renamed, the WIF provider's attribute_condition must be updated. Document this as a runbook in Operations.
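An illustrative shape for the provider with an attribute condition pinned to the repository. The org name, pool resource address, and attribute mapping details are assumptions, not the repo's actual configuration:

```hcl
# Sketch: GitHub OIDC provider restricted to a single repository. If the
# repo moves orgs or is renamed, attribute_condition must be updated.
resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.ci.workload_identity_pool_id
  workload_identity_pool_provider_id = "github"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # "example-org" is a placeholder for the actual GitHub org.
  attribute_condition = "assertion.repository == \"example-org/ume-data-infra\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}
```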

#### Service accounts

| Service Account | Created in | Purpose | Permissions |
|---|---|---|---|
| tf-plan-sa | 00-bootstrap | CI plan jobs | roles/viewer on project + custom tfStateLocker on state bucket |
| tf-apply-sa | 00-bootstrap | CI apply jobs | roles/editor on project + roles/storage.objectAdmin on state bucket |
| ume-airflow | {env}-01-base | Airflow scheduler + workers identity (via Workload Identity) | roles/bigquery.dataEditor, roles/cloudsql.client, roles/secretmanager.secretAccessor, roles/storage.objectAdmin |
| ume-airflow-kpo | {env}-01-base | KPO task identity (via Workload Identity) | roles/bigquery.dataEditor, roles/storage.objectViewer |
| datahub-sa | 10-platform-shared | DataHub GMS identity (Phase 2, via Workload Identity) | roles/cloudsql.client, roles/secretmanager.secretAccessor |

Least-privilege: each SA gets only the roles it needs. Broad roles like roles/owner or roles/editor are only on tf-apply-sa and scoped to the project level.

#### Cloud SQL IAM authentication

Service accounts (Airflow SA, DataHub SA) authenticate to Cloud SQL via IAM — no passwords for programmatic access. A break-glass admin password is stored in Secret Manager for manual intervention.
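Registering a service account as an IAM database user can be sketched as below. The module output and local names are assumptions; the trimmed suffix follows Google's convention that IAM SA database users omit .gserviceaccount.com:

```hcl
# Sketch: Airflow SA as a Cloud SQL IAM user (no password).
resource "google_sql_user" "airflow_iam" {
  instance = module.cloud_sql.instance_name   # assumed module output
  name     = trimsuffix(local.airflow_sa_email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```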

### Secrets management

  • Source of truth: Google Secret Manager.
  • Terraform's role: create the secret resource (name + IAM bindings). Never store secret values in Terraform state or code. Values are populated out-of-band via gcloud secrets versions add.
  • In-cluster injection: Secret Manager CSI driver mounts secrets as files into pods.
  • Rotation: documented runbook; not automated in wave-1.
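A sketch under the stated rule: Terraform creates the secret shell and IAM binding only, and the value is added out-of-band with gcloud secrets versions add. The secret name and local are illustrative:

```hcl
# Secret resource without any version — no value ever enters state.
resource "google_secret_manager_secret" "sql_admin" {
  secret_id = "ume-data-dev-sql-admin"   # illustrative name

  replication {
    auto {}
  }
}

# Grant the Airflow SA read access to secret payloads.
resource "google_secret_manager_secret_iam_member" "airflow_read" {
  secret_id = google_secret_manager_secret.sql_admin.id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${local.airflow_sa_email}"   # assumed local
}
```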

### Network security

  • Private GKE nodes: no public IPs on nodes; outbound via Cloud NAT.
  • Private Cloud SQL: accessible only via Private Service Access within the VPC.
  • GKE authorized networks: API server restricted to CI runner IPs and operator IPs.
  • No public Kafka or OpenSearch endpoints: these services are cluster-internal only.
  • GKE Gateway (GCLB): the only public-facing endpoint; protected by TLS (Certificate Manager) and optionally by IAP.

### Encryption

  • At rest: GCP default encryption (Google-managed keys). Optionally CMEK via the KMS keyring in 10-platform-shared for Cloud SQL and GCS buckets if compliance requires it.
  • In transit: TLS everywhere. GCLB terminates TLS at the load balancer; in-cluster traffic uses GKE's built-in mTLS via Workload Identity certificates where applicable.
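If CMEK is adopted, the opt-in is a single block per resource. A hedged GCS example, assuming a key from the Phase 2 keyring is passed in as var.kms_key_id (name illustrative):

```hcl
# Sketch: CMEK on a bucket. Assumes the GCS service agent has already been
# granted roles/cloudkms.cryptoKeyEncrypterDecrypter on the key.
resource "google_storage_bucket" "logs" {
  name     = "ume-data-dev-airflow-logs"   # illustrative name
  location = "US"

  uniform_bucket_level_access = true

  encryption {
    default_kms_key_name = var.kms_key_id
  }
}
```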

## Naming Conventions

Environment-scoped resources use the ume-data-{env}-{purpose} prefix to distinguish data platform infrastructure from other resources in shared GCP projects. Global resources (bootstrap layer) keep the shorter ume- prefix since they are org-wide.

| Resource type | Pattern | Example |
|---|---|---|
| GCS bucket | ume-{purpose}-{project} | ume-tf-state-poc-ume-data |
| VPC | ume-data-{env}-vpc | ume-data-dev-vpc |
| Subnet | ume-data-{env}-{purpose} | ume-data-dev-gke-nodes |
| GKE cluster | ume-data-{env}-gke | ume-data-dev-gke |
| Cloud SQL instance | ume-data-{env}-{service}-pg | ume-data-dev-airflow-pg |
| Cloud Router | ume-data-{env}-router | ume-data-dev-router |
| Cloud NAT | ume-data-{env}-nat | ume-data-dev-nat |
| DNS zone | ume-data-{env}-zone | ume-data-dev-zone |
| Global static IP (ingress) | ume-data-{env}-ingress-ip | ume-data-dev-ingress-ip |
| Certificate Manager cert | ume-data-{env}-wildcard | ume-data-dev-wildcard |
| Certificate Manager map | ume-data-{env}-certmap | ume-data-dev-certmap |
| GKE Gateway | ume-data-{env}-gateway | ume-data-dev-gateway |
| Gateway namespace | ume-data-{env}-gateway | ume-data-dev-gateway |
| Service account | ume-{purpose}@{project}.iam | ume-tf-plan@poc-ume-data.iam.gserviceaccount.com |
| Artifact Registry repo | ume-{purpose} | ume-composer-images |
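The prefix convention can be implemented once per stack in locals.tf; a hedged sketch (the local name and resource address are illustrative):

```hcl
# Sketch: compute the environment prefix once, interpolate everywhere.
locals {
  name_prefix = "ume-data-${var.environment}"   # e.g. ume-data-dev
}

resource "google_compute_network" "vpc" {
  name                    = "${local.name_prefix}-vpc"   # ume-data-dev-vpc
  auto_create_subnetworks = false
}
```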