# Terraform Structure
This section is the canonical reference for how the ume-data-infra repository is organized. Agents and engineers should read this before proposing any Terraform changes.
## Repository Layout
```
ume-data-infra/
├── .github/
│   └── workflows/
│       ├── terraform-plan.yml
│       ├── terraform-apply.yml
│       └── terraform-drift.yml
├── scripts/
│   └── detect-changed-stacks.sh     # Identifies which stacks changed in a PR
├── layers/                          # Global, applied once across the org
│   ├── 00-bootstrap/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── backend.hcl
│   └── 10-platform-shared/          # Phase 2 — created when cross-env resources appear
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── versions.tf
│       └── backend.hcl
├── environments/                    # Per environment
│   ├── dev-01-base/                 # Phase 1 — pure GCP, no k8s providers
│   │   ├── networking.tf
│   │   ├── gke.tf
│   │   ├── cloud-sql.tf
│   │   ├── iam.tf                   # Airflow SAs, WI bindings, SQL IAM user
│   │   ├── dns.tf                   # DNS zone, shared ingress IP, wildcard A (Story 4c / PR 3a)
│   │   ├── certificate.tf           # Certificate Manager wildcard cert + map (Story 4c / PR 3a)
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── locals.tf                # Standard labels, naming
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── dev-02-k8s-base/             # Phase 1 — Kubernetes platform layer (Story 4c / PR 3b.2)
│   │   ├── gateway.tf               # Shared GKE Gateway + redirect HTTPRoute
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf              # google + kubernetes + helm providers
│   │   ├── locals.tf
│   │   ├── data.tf                  # Reads dev-01-base remote state
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── dev-03-runtime/              # Phase 1 (Airflow), Phase 2 (+ DataHub)
│   │   ├── airflow.tf
│   │   ├── iap.tf                   # Per-app IAP module call (Story 4c / PR 3c)
│   │   ├── buckets.tf               # GCS log + DAG buckets (via modules/gcs-bucket)
│   │   ├── datahub.tf               # Added in Phase 2
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── locals.tf
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── prod-01-base/
│   ├── prod-02-k8s-base/
│   └── prod-03-runtime/
├── resources/                       # DAGs, dbt, Docker (temp; ports to own repo)
│   ├── docker/
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   ├── dags/
│   ├── dbt/
│   └── scripts/
│       └── build-image.sh
└── modules/
    ├── vpc/
    ├── gke-standard/
    ├── cloud-sql-postgres/
    ├── gcs-bucket/                  # Story 4a
    ├── airflow-helm/                # Story 4b (evaluate)
    ├── datahub-helm/                # Phase 2
    ├── artifact-registry/
    ├── wif-pool/
    └── iam-grants/
```
## Layers vs Environments

### Layers (`layers/`)
Layers are global resources applied once, independent of any specific environment. They set up the shared foundation that all environments depend on.
Key constraint: `00-bootstrap` is bootstrapped with a local backend (`terraform init` without `-backend-config`), then migrated to GCS after the state bucket exists. This is a one-time manual operation documented in Deployment Stories.
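The bootstrap-then-migrate flow can be sketched as follows (a sketch only — it assumes the backend block lives in `versions.tf`; the exact file placement may differ):

```hcl
# Step 1 — no backend block yet: terraform init / terraform apply run with
# local state, creating the GCS state bucket among other bootstrap resources.

# Step 2 — add the (initially empty) backend block:
terraform {
  backend "gcs" {}
}

# Step 3 — migrate the local state into the new bucket:
#   terraform init -backend-config=backend.hcl -migrate-state
```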
### Environments (`environments/`)
Environments contain per-environment stacks. Each stack is an independent Terraform root module with its own state.
The three-stack split reflects lifecycle differences and provider boundaries:
- `01-base` is pure GCP (no k8s providers) — survives months untouched; changes rarely.
- `02-k8s-base` is the first layer that pulls in the `kubernetes` + `helm` providers; contains environment-singleton k8s platform infra shared across all apps (Gateway, future cert-manager add-ons, observability).
- `03-runtime` is per-app workloads; churns with image bumps and DAG syncs.

`dev-03-runtime` was renamed from `dev-02-runtime` in Story 4c when the k8s-base layer was introduced.
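The provider boundary between the first two stacks can be sketched as follows — `02-k8s-base` configures its `kubernetes` provider from `dev-01-base` outputs (output names taken from the remote-state example later in this document; the actual wiring lives in `data.tf` / `versions.tf` and may differ):

```hcl
# Sketch — reads the cluster connection details exported by dev-01-base.
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "environments/dev-01-base"
  }
}

# Short-lived access token from the ambient Google credentials (WIF in CI).
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
  cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_cert)
  token                  = data.google_client_config.default.access_token
}
```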
## Modules

### Module strategy
Local modules are the standard pattern for environment-scoped resources. The justification is forward-looking: if a resource will be replicated to prod, it gets a module now — don't wait for a second caller to exist, because prod is already planned.
Why modules from the start:
- Reusability — prod calls the same module with different parameters instead of copying hundreds of lines.
- Encapsulation — naming prefixes (`ume-data-{env}-{purpose}`), label merges, and security defaults are written once inside the module, not repeated in every resource.
- Conciseness — the environment stack becomes a short module call that shows only what matters for that environment, making differences between dev and prod visible at a glance.
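As an illustration of the conciseness point, a dev stack's module call might look like this (the variable names here are hypothetical, not the actual `gke-standard` interface):

```hcl
module "gke" {
  source      = "../../modules/gke-standard"
  environment = "dev"
  name_prefix = "ume-data"

  # Dev-only overrides; prod passes different values for the same variables.
  machine_type = "e2-standard-4"   # hypothetical variable
  node_count   = 2                 # hypothetical variable
}
```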
Design guidelines:
- Expose all configurable settings as variables with sensible defaults. Different environments have different needs. Auto-repair, auto-upgrade, machine types, node counts, maintenance windows — all exposed. Don't force callers into the module code to change a setting.
- Modules encapsulate repetition. The caller passes `environment = "dev"` and `name_prefix = "ume-data"` once; the module computes `ume-data-dev-gke`, applies labels to every resource, and enforces security baselines.
- Use upstream `terraform-google-modules/*` when they add value. Evaluate per module. If the upstream handles complexity we'd otherwise reimplement (VPC peering edge cases, NAT quirks), wrap it inside our local module. If our needs are straightforward (a GKE cluster with a few pools), direct resources inside our module are cleaner. Either way, the caller's interface stays the same — switching from direct resources to upstream is an internal refactor.
- Layer-scoped one-offs stay direct. Bootstrap (state bucket, WIF, AR) is applied once and never replicated across environments. Direct resources are fine there.
Module sources are local paths in wave-1 (e.g., `source = "../../modules/gke-standard"`). Before prod rollout, evaluate migrating to tagged git refs for pinning.
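If that migration happens, callers would switch to Terraform's git source syntax, along these lines (repository URL and tag are hypothetical):

```hcl
# Local path (wave-1):
# source = "../../modules/gke-standard"

# Tagged git ref (post-prod-rollout option) —
# note the double slash separating the repo from the module subdirectory:
# source = "git::https://github.com/example-org/ume-data-infra.git//modules/gke-standard?ref=v1.0.0"
```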
### Module catalog
Inlined in `{env}-02-k8s-base/gateway.tf` (pending module extraction): the shared Gateway + redirect HTTPRoute are direct `kubernetes_manifest` resources rather than a module today — singleton per env and five resources total, so flat is clearer. Extract into `modules/gke-gateway/` when prod adds a second Gateway.

Strimzi and OpenSearch are consumed as upstream Helm charts directly in `{env}-02-k8s-base/` (Phase 2). They will be wrapped into local modules when environment replication begins.
## State Management

### Backend configuration
Every stack has a `backend.hcl` file:

```hcl
bucket = "ume-tf-state-poc-ume-data"
prefix = "environments/dev-01-base"
```

Applied via: `terraform init -backend-config=backend.hcl`
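This works because each stack declares an empty `gcs` backend block that the `-backend-config` file completes (a sketch, assuming the block lives in `versions.tf`):

```hcl
# versions.tf — bucket and prefix are deliberately omitted here;
# terraform init -backend-config=backend.hcl supplies them.
terraform {
  backend "gcs" {}
}
```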
### State layout in GCS
```
gs://ume-tf-state-poc-ume-data/
  layers/00-bootstrap/default.tfstate
  layers/10-platform-shared/default.tfstate      # Phase 2
  environments/dev-01-base/default.tfstate
  environments/dev-02-k8s-base/default.tfstate
  environments/dev-03-runtime/default.tfstate
```
The environments/dev-02-runtime/ prefix from before the Story 4c rename is retained under GCS versioning until a cleanup window; it is no longer read or written.
Prod will use a separate state bucket in the shared-services project when it exists.
### Locking
GCS backend provides native state locking via object generation numbers. No external locking service (e.g., DynamoDB) is needed.
### State access
- `tf-plan-sa` has `roles/storage.objectViewer` on the state bucket (read-only for plans).
- `tf-apply-sa` has `roles/storage.objectAdmin` on the state bucket (read-write for applies).
## Inter-Stack Contracts
Stacks communicate exclusively via `terraform_remote_state` data sources. No output value is duplicated into another stack's tfvars.

Example: `dev-03-runtime` reads from `dev-01-base`:
```hcl
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "environments/dev-01-base"
  }
}

# Usage
locals {
  gke_cluster_name = data.terraform_remote_state.base.outputs.gke_cluster_name
  gke_endpoint     = data.terraform_remote_state.base.outputs.gke_endpoint
  gke_ca_cert      = data.terraform_remote_state.base.outputs.gke_ca_cert
  vpc_id           = data.terraform_remote_state.base.outputs.vpc_id
}
```
### Required outputs per stack
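For `dev-01-base`, the outputs consumed downstream (per the remote-state example above) would be declared along these lines — a sketch only; the resource addresses are hypothetical:

```hcl
# outputs.tf in dev-01-base — resource names are illustrative.
output "gke_cluster_name" {
  value = google_container_cluster.main.name
}

output "gke_endpoint" {
  value = google_container_cluster.main.endpoint
}

output "gke_ca_cert" {
  value = google_container_cluster.main.master_auth[0].cluster_ca_certificate
}

output "vpc_id" {
  value = google_compute_network.main.id
}
```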
## Label Conventions
Every Terraform-managed GCP resource must carry a standard set of labels, defined in each stack's `locals.tf`:
```hcl
locals {
  common_labels = {
    env         = var.environment
    layer       = "base"
    owner       = "platform-team"
    cost_center = "data-platform"
  }
}
```
Individual resources merge `local.common_labels` with a service-specific label:

```hcl
labels = merge(local.common_labels, { service = "gke" })
```
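In context, the merge looks like this on a full resource (bucket name, variables, and the `airflow` service label are hypothetical):

```hcl
resource "google_storage_bucket" "airflow_logs" {
  name     = "${var.name_prefix}-${var.environment}-airflow-logs"  # hypothetical naming
  location = var.region

  # Standard labels plus the per-service discriminator.
  labels = merge(local.common_labels, { service = "airflow" })
}
```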
CI will lint for missing labels via a pre-plan check (documented in CI/CD).
## Security
Security is embedded in the Terraform structure, not bolted on. This section covers the security posture enforced by the infrastructure code.
### Identity and access

#### Workload Identity Federation (WIF)
CI/CD authenticates to GCP via WIF. No service-account key files exist anywhere in the system.
- WIF pool + provider are created in `00-bootstrap`, trusting the `ume-data-infra` GitHub repository.
- Attribute conditions restrict which repo, branch, and workflow can assume which service account.
- Known gotcha: if the GitHub repo moves orgs or is renamed, the WIF provider's `attribute_condition` must be updated. Document this as a runbook in Operations.
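The repo-scoping condition sits on the provider resource, roughly like this (pool/provider IDs and the org name are hypothetical):

```hcl
resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = "ci-pool"       # hypothetical
  workload_identity_pool_provider_id = "github-oidc"   # hypothetical

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # This string must change if the repo is renamed or moves orgs.
  attribute_condition = "assertion.repository == \"example-org/ume-data-infra\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}
```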
#### Service accounts
Least-privilege: each SA gets only the roles it needs. Broad roles like `roles/owner` or `roles/editor` are granted only to `tf-apply-sa` and scoped to the project level.
#### Cloud SQL IAM authentication
Service accounts (Airflow SA, DataHub SA) authenticate to Cloud SQL via IAM — no passwords for programmatic access. A break-glass admin password is stored in Secret Manager for manual intervention.
#### Secrets management
- Source of truth: Google Secret Manager.
- Terraform's role: create the secret resource (name + IAM bindings). Never store secret values in Terraform state or code. Values are populated out-of-band via `gcloud secrets versions add`.
- In-cluster injection: the Secret Manager CSI driver mounts secrets as files into pods.
- Rotation: documented runbook; not automated in wave-1.
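The division of labor can be sketched as follows (secret name and the SA-email variable are hypothetical):

```hcl
# Terraform creates the container and access bindings — never the value.
resource "google_secret_manager_secret" "airflow_conn" {
  secret_id = "airflow-connections"   # hypothetical name

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_iam_member" "airflow_reader" {
  secret_id = google_secret_manager_secret.airflow_conn.id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${var.airflow_sa_email}"   # hypothetical variable
}

# The value is added out-of-band, e.g.:
#   gcloud secrets versions add airflow-connections --data-file=./conn.json
```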
### Network security
- Private GKE nodes: no public IPs on nodes; outbound via Cloud NAT.
- Private Cloud SQL: accessible only via Private Service Access within the VPC.
- GKE authorized networks: API server restricted to CI runner IPs and operator IPs.
- No public Kafka or OpenSearch endpoints: these services are cluster-internal only.
- GKE Ingress (GCLB): the only public-facing endpoint; protected by TLS (Certificate Manager) and optionally by IAP.
### Encryption
- At rest: GCP default encryption (Google-managed keys). Optionally CMEK via the KMS keyring in `10-platform-shared` for Cloud SQL and GCS buckets if compliance requires it.
- In transit: TLS everywhere. GCLB terminates TLS at the load balancer; in-cluster traffic uses GKE's built-in mTLS via Workload Identity certificates where applicable.
## Naming Conventions
Environment-scoped resources use the `ume-data-{env}-{purpose}` prefix to distinguish data platform infrastructure from other resources in shared GCP projects. Global resources (bootstrap layer) keep the shorter `ume-` prefix since they are org-wide.
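A minimal sketch of how a stack's `locals.tf` might centralize that prefix (illustrative, not the actual file contents):

```hcl
locals {
  # ume-data-dev, ume-data-prod, …
  name_prefix = "ume-data-${var.environment}"
}

# Callers then derive names such as:
#   "${local.name_prefix}-gke"      → ume-data-dev-gke
#   "${local.name_prefix}-airflow"  → ume-data-dev-airflow
```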