# Infrastructure

This section documents the infrastructure layer of the UME data platform. It is designed to serve as both a human reference and an agent bible for automated sessions that provision and maintain the platform.

The Architecture and Tools section defines what we want to build and why. This section defines how we build and operate it. Every component described here supports the analytical workloads, governance, and developer experience goals outlined in the executive summary.

# Design Principles

- **Infrastructure as Code**: All resources are managed via Terraform. No manual provisioning in the GCP console for anything that lives beyond a spike. Reproducibility and auditability come from version-controlled definitions.
- **Multi-project readiness**: Wave-1 collapses to a single GCP project (`poc-ume-data`), but every Terraform module accepts project IDs as variables. When production brings dedicated projects, only tfvars change.
- **Layered separation**: Global layers (bootstrap, platform-shared) are distinct from per-environment stacks (base, k8s-base, runtime). Each stack has its own state file and blast radius.
- **Prefer managed services, but own what matters**: We use GCP-managed offerings (Cloud SQL, GKE, Certificate Manager, Managed Prometheus) wherever the cost-to-ops tradeoff is favorable. Airflow is self-hosted on GKE (Cloud Composer's 4-5x cost premium is not justified for dev). We self-host Kafka (Strimzi) and OpenSearch on GKE where managed pricing is prohibitive or vendor lock-in is unacceptable.
- **Cost awareness from day one**: Mandatory resource labels (`env`, `layer`, `service`, `owner`, `cost_center`) on every Terraform-managed resource. Budget alerts on every project. Per-service cost attribution is a future enhancement, but the groundwork is laid now.
- **No long-lived credentials**: Workload Identity Federation for CI/CD. Cloud SQL IAM authentication for service accounts. Secret Manager for anything that must be a secret. No service-account key files.
- **Zero-downtime operations**: GKE surge upgrades, PodDisruptionBudgets, topology spreading, and maintenance windows ensure that platform upgrades do not interrupt workloads.
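
The mandatory-label rule from the cost-awareness principle lends itself to an automated check. The following is a minimal sketch, not the platform's actual tooling: it assumes the plan has been exported with `terraform show -json <planfile>` and parsed into a dict, and it only inspects resources whose planned values expose a `labels` attribute. The function name `missing_labels` and the resource addresses in the sample are hypothetical.

```python
# Mandatory label keys, as listed in the cost-awareness principle.
REQUIRED_LABELS = ("env", "layer", "service", "owner", "cost_center")


def missing_labels(plan: dict) -> dict:
    """Map resource address -> sorted list of missing mandatory label keys.

    `plan` is the parsed JSON from `terraform show -json <planfile>`.
    Only resources being created or updated are checked, and resource
    types without a `labels` attribute are skipped (an assumption of
    this sketch, not a rule stated by the platform).
    """
    problems = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue
        after = change.get("after") or {}
        if "labels" not in after:
            continue  # this resource type has no labels attribute
        labels = after.get("labels") or {}  # unset labels appear as null
        missing = sorted(k for k in REQUIRED_LABELS if k not in labels)
        if missing:
            problems[rc["address"]] = missing
    return problems


# Hypothetical plan fragment, for illustration only.
sample_plan = {
    "resource_changes": [
        {
            "address": "google_storage_bucket.raw",
            "change": {
                "actions": ["create"],
                "after": {
                    "labels": {
                        "env": "dev",
                        "layer": "base",
                        "service": "airflow",
                        "owner": "data-platform",
                        "cost_center": "cc-001",
                    }
                },
            },
        },
        {
            "address": "google_storage_bucket.tmp",
            "change": {"actions": ["create"], "after": {"labels": {"env": "dev"}}},
        },
        {
            "address": "google_project_iam_member.viewer",
            "change": {"actions": ["create"], "after": {"role": "roles/viewer"}},
        },
    ]
}

if __name__ == "__main__":
    for address, missing in missing_labels(sample_plan).items():
        print(f"{address}: missing {', '.join(missing)}")
```

A check like this could run as a step after the plan job in CI and fail the workflow when the returned mapping is non-empty; whether that gate belongs in CI or in a policy tool is a design choice this sketch does not settle.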

# Stack at a Glance

| Component | Tool | Environment | Notes |
|---|---|---|---|
| IaC | Terraform + GitHub Actions | all | WIF-based auth, no SA keys |
| Orchestration | Airflow on GKE Standard (Helm) | per env | CeleryExecutor, custom image via Artifact Registry |
| Data Catalog | DataHub (self-hosted) | per env | GKE + Cloud SQL + Strimzi Kafka + OpenSearch |
| Compute | GKE Standard (zonal dev, regional prod) | per env | Multi-pool, autoscaling, surge upgrades |
| Database | Cloud SQL for PostgreSQL | per env | IAM auth, private IP |
| Event Bus | Kafka via Strimzi on GKE | per env | Self-hosted; managed Kafka as future upgrade |
| Search | OpenSearch on GKE | per env | Operator-managed, GCS snapshots |
| Ingress | GKE Ingress (GCLB) | per env | Certificate Manager for GCP-issued certs |
| Observability | Google Managed Prometheus + Cloud Ops | per env | No self-hosted Grafana in wave-1 |
| Secrets | Secret Manager + CSI driver | per env | Values populated out-of-band |

# Table of Contents

| # | Section | Description |
|---|---|---|
| 01 | Introduction | Scope, assumptions, what is in and out of wave-1 |
| 02 | Requirements | Functional, non-functional, cost targets |
| 03 | Architecture | Architecture diagram and per-block narrative |
| 04 | Terraform Structure | Layers, environments, modules, state, security |
| 05 | CI/CD | GitHub Actions, WIF, plan/apply/drift workflows |
| 06 | Airflow on GKE | Helm chart, CeleryExecutor, Cosmos, git-sync, custom image |
| 07 | GKE Platform | GKE Standard, ingress, certificates, zero-downtime |
| 08 | DataHub | DataHub on GKE: DB, Kafka, OpenSearch, OAuth |
| 09 | Observability and Cost | Monitoring, alerts, cost attribution groundwork |
| 10 | Operations | Runbooks for upgrades, restores, rotations |
| 11 | Deployment Stories | Step-by-step implementation sequence |
| 12 | Locomotive | Silver layer + BQ governance: the data-layer initiative on top of wave-1 |

# Glossary

| Term | Meaning |
|---|---|
| Cosmos | `astronomer-cosmos`: Airflow provider that renders dbt projects as Airflow DAGs |
| CSI driver | Container Storage Interface driver; here, the driver that mounts Secret Manager secrets into pods |
| GCLB | Google Cloud Load Balancer: L7 load balancer backing GKE Ingress |
| GMP | Google Managed Prometheus: managed metrics collection and alerting |
| IAP | Identity-Aware Proxy: GCP service for zero-trust access to web apps |
| NAP | Node Auto-Provisioning: GKE feature to create node pools on demand |
| PDB | PodDisruptionBudget: Kubernetes object protecting pods during voluntary disruptions |
| PSA | Private Service Access: GCP feature enabling private IP for managed services |
| Strimzi | Kubernetes operator for running Apache Kafka; deployed here on GKE |
| WIF | Workload Identity Federation: GCP mechanism for keyless authentication from external identity providers |

# Diagram Reference

| Diagram | Location | Description |
|---|---|---|
| Infrastructure Architecture | `diagrams/infra-architecture.drawio` | Full architecture view: projects, networking, GKE, Airflow, DataHub |

# Repositories

| Repository | Purpose |
|---|---|
| `github.com/1edata/ume-data-infra` | Terraform: layers, modules, environments, DataHub Helm values |
| (existing DAGs repo) | Airflow DAGs, custom Airflow image, dbt project, DataHub ingestion recipes |