# Infrastructure
This section documents the infrastructure layer of the UME data platform. It serves as both a reference for humans and the authoritative guide for automated agent sessions that provision and maintain the platform.
The Architecture and Tools section defines what we want to build and why. This section defines how we build and operate it. Every component described here supports the analytical workloads, governance, and developer experience goals outlined in the executive summary.
# Design Principles
- Infrastructure as Code - All resources are managed via Terraform. No manual provisioning in the GCP console for anything that lives beyond a spike. Reproducibility and auditability come from version-controlled definitions.
- Multi-project readiness - Wave-1 collapses to a single GCP project (`poc-ume-data`), but every Terraform module accepts project IDs as variables. When production brings dedicated projects, only `tfvars` change.
- Layered separation - Global layers (bootstrap, platform-shared) are distinct from per-environment stacks (base, k8s-base, runtime). Each stack has its own state file and blast radius.
- Prefer managed services, but own what matters - We use GCP-managed offerings (Cloud SQL, GKE, Certificate Manager, Managed Prometheus) wherever the cost-to-ops tradeoff is favorable. Airflow is self-hosted on GKE (Cloud Composer's 4-5x cost premium is not justified for dev). We self-host Kafka (Strimzi) and OpenSearch on GKE where managed pricing is prohibitive or vendor lock-in is unacceptable.
- Cost awareness from day one - Mandatory resource labels (`env`, `layer`, `service`, `owner`, `cost_center`) on every Terraform-managed resource. Budget alerts on every project. Per-service cost attribution is a future enhancement, but the groundwork is laid now.
- No long-lived credentials - Workload Identity Federation for CI/CD. Cloud SQL IAM authentication for service accounts. Secret Manager for anything that must be a secret. No service-account key files.
- Zero-downtime operations - GKE surge upgrades, PodDisruptionBudgets, topology spreading, and maintenance windows ensure that platform upgrades do not interrupt workloads.
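
The multi-project and labeling principles above could be sketched in Terraform roughly as follows. This is a minimal illustration, not the platform's actual module: the variable names (`project_id`, `labels`), the bucket resource, its name, and the `EU` location are all assumptions chosen for the example. Only the project ID `poc-ume-data` and the five label keys come from this section.

```hcl
# Hypothetical module inputs; names are illustrative, not taken from the platform repo.
variable "project_id" {
  type        = string
  description = "Target GCP project, supplied per environment via tfvars."
}

variable "labels" {
  type        = map(string)
  description = "Resource labels; must include the five mandatory keys."

  # Reject any plan whose label map is missing a mandatory key.
  validation {
    condition = alltrue([
      for k in ["env", "layer", "service", "owner", "cost_center"] :
      contains(keys(var.labels), k)
    ])
    error_message = "labels must include env, layer, service, owner, and cost_center."
  }
}

# Example resource (assumed): the project comes from the variable and the labels are attached.
resource "google_storage_bucket" "example" {
  project  = var.project_id
  name     = "${var.project_id}-example"
  location = "EU"
  labels   = var.labels
}
```

Under these assumptions, a Wave-1 `tfvars` file would carry the only environment-specific values (the label values here are placeholders):

```hcl
project_id = "poc-ume-data"

labels = {
  env         = "dev"
  layer       = "base"
  service     = "example"
  owner       = "platform-team"
  cost_center = "data-platform"
}
```

Moving to dedicated production projects would then mean supplying a different `tfvars` file, with the module code unchanged.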