# Infrastructure

The Architecture and Tools section defines what we want to build and why. This section defines how we build and operate it. Every component described here supports the analytical workloads, governance, and developer experience goals outlined in the executive summary.

# Design Principles

  1. Infrastructure as Code - All resources are managed via Terraform. No manual provisioning in the GCP console for anything that lives beyond a spike. Reproducibility and auditability come from version-controlled definitions.
  2. Multi-project readiness - Wave-1 collapses to a single GCP project (poc-ume-data), but every Terraform module accepts project IDs as variables. When production brings dedicated projects, only tfvars change.
  3. Layered separation - Global layers (bootstrap, platform-shared) are distinct from per-environment stacks (base, k8s-base, runtime). Each stack has its own state file and blast radius.
  4. Prefer managed services, but own what matters - We use GCP-managed offerings (Cloud SQL, GKE, Certificate Manager, Managed Prometheus) wherever the cost-to-ops tradeoff is favorable. Airflow is self-hosted on GKE because Cloud Composer's 4-5x cost premium is not justified for dev. Kafka (Strimzi) and OpenSearch are also self-hosted on GKE, where managed pricing is prohibitive or vendor lock-in is unacceptable.
  5. Cost awareness from day one - Mandatory resource labels (env, layer, service, owner, cost_center) on every Terraform-managed resource. Budget alerts on every project. Per-service cost attribution is a future enhancement, but the groundwork is laid now.
  6. No long-lived credentials - Workload Identity Federation for CI/CD. Cloud SQL IAM authentication for service accounts. Secret Manager for anything that must be a secret. No service-account key files.
  7. Zero-downtime operations - GKE surge upgrades, PodDisruptionBudgets, topology spreading, and maintenance windows ensure that platform upgrades do not interrupt workloads.
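
The mandatory-label rule in principle 5 lends itself to an automated check. The following is a minimal sketch, not the project's actual tooling: it assumes resource labels have already been extracted into plain dictionaries (for example from a Terraform plan), and the function names, resource addresses, and label values are hypothetical. Only the five label keys come from this document.

```python
# Mandatory label keys, taken from design principle 5.
REQUIRED_LABELS = ("env", "layer", "service", "owner", "cost_center")


def missing_labels(labels: dict[str, str]) -> list[str]:
    """Return the mandatory label keys that are absent or blank."""
    return [key for key in REQUIRED_LABELS if not labels.get(key, "").strip()]


def check_resources(resources: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each non-compliant resource address to its missing labels.

    Fully labelled resources are omitted from the result.
    """
    report: dict[str, list[str]] = {}
    for address, labels in resources.items():
        missing = missing_labels(labels)
        if missing:
            report[address] = missing
    return report


if __name__ == "__main__":
    # Hypothetical resource addresses and label values, for illustration only.
    example = {
        "google_storage_bucket.snapshots": {
            "env": "dev",
            "layer": "runtime",
            "service": "opensearch",
            "owner": "data-platform",
            "cost_center": "cc-0000",
        },
        "google_sql_database_instance.datahub": {
            "env": "dev",
            "layer": "base",
            "service": "datahub",
        },
    }
    for address, missing in check_resources(example).items():
        print(f"{address}: missing labels {', '.join(missing)}")
```

A check of this kind could run as a CI step after the plan stage, failing the workflow whenever the report is non-empty.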

# Stack at a Glance

| Component | Tool | Environment | Notes |
| --- | --- | --- | --- |
| IaC | Terraform + GitHub Actions | all | WIF-based auth, no SA keys |
| Orchestration | Airflow on GKE Standard (Helm) | per env | CeleryExecutor, custom image via Artifact Registry |
| Data Catalog | DataHub (self-hosted) | per env | GKE + Cloud SQL + Strimzi Kafka + OpenSearch |
| Compute | GKE Standard (zonal dev, regional prod) | per env | Multi-pool, autoscaling, surge upgrades |
| Database | Cloud SQL for PostgreSQL | per env | IAM auth, private IP |
| Event Bus | Kafka via Strimzi on GKE | per env | Self-hosted; managed Kafka as future upgrade |
| Search | OpenSearch on GKE | per env | Operator-managed, GCS snapshots |
| Ingress | GKE Ingress (GCLB) | per env | Certificate Manager for GCP-issued certs |
| Observability | Google Managed Prometheus + Cloud Ops | per env | No self-hosted Grafana in wave-1 |
| Secrets | Secret Manager + CSI driver | per env | Values populated out-of-band |

# Table of Contents

| # | Section | Description |
| --- | --- | --- |
| 01 | Introduction | Scope, assumptions, what is in and out of wave-1 |
| 02 | Requirements | Functional, non-functional, cost targets |
| 03 | Architecture | Architecture diagram and per-block narrative |
| 04 | Terraform Structure | Layers, environments, modules, state, security |
| 05 | CI/CD | GitHub Actions, WIF, plan/apply/drift workflows |
| 06 | Airflow on GKE | Helm chart, CeleryExecutor, Cosmos, git-sync, custom image |
| 07 | GKE Platform | GKE Standard, ingress, certificates, zero-downtime |
| 08 | DataHub | DataHub on GKE: DB, Kafka, OpenSearch, OAuth |
| 09 | Observability and Cost | Monitoring, alerts, cost attribution groundwork |
| 10 | Operations | Runbooks for upgrades, restores, rotations |
| 11 | Deployment Stories | Step-by-step implementation sequence |
| 12 | Locomotive | Silver layer + BQ governance: the data-layer initiative on top of wave-1 |

# Glossary

| Term | Meaning |
| --- | --- |
| Cosmos | astronomer-cosmos - Airflow provider that renders dbt projects as Airflow DAGs |
| CSI driver | Container Storage Interface driver - mounts Secret Manager secrets into pods |
| GCLB | Google Cloud Load Balancer - L7 load balancer backing GKE Ingress |
| GMP | Google Managed Prometheus - managed metrics collection and alerting |
| IAP | Identity-Aware Proxy - GCP service for zero-trust access to web apps |
| NAP | Node Auto-Provisioning - GKE feature to create node pools on demand |
| PDB | PodDisruptionBudget - Kubernetes object protecting pods during voluntary disruptions |
| PSA | Private Service Access - GCP feature enabling private IP for managed services |
| Strimzi | Kubernetes operator for running Apache Kafka on GKE |
| WIF | Workload Identity Federation - GCP mechanism for keyless authentication from external identity providers |

# Diagram Reference

| Diagram | Location | Description |
| --- | --- | --- |
| Infrastructure Architecture | diagrams/infra-architecture.drawio | Full architecture view: projects, networking, GKE, Airflow, DataHub |

# Repositories

| Repository | Purpose |
| --- | --- |
| github.com/1edata/ume-data-infra | Terraform: layers, modules, environments, DataHub Helm values |
| (existing DAGs repo) | Airflow DAGs, custom Airflow image, dbt project, DataHub ingestion recipes |