# GKE Platform
The GKE Standard cluster hosts Airflow (Phase 1) and DataHub with its dependencies (Phase 2) as well as platform add-ons (ingress, observability, secrets injection). This section covers the cluster design, node pool strategy, zero-downtime recipe, and ingress/TLS configuration.
Key decisions:
- GKE Standard (not Autopilot) for cost control and stateful workload flexibility
- Zonal cluster for dev PoC; regional for prod
- Three node pools: `default-pool` (Airflow + system), `kpo-pool` (spot, scale-to-zero batch), `workload-pool` (DataHub, Phase 2)
- Dataplane V2 (Cilium/eBPF) for built-in network policy enforcement
- GKE Ingress (GCLB) + Certificate Manager for GCP-native TLS
- Zero-downtime operations via surge upgrades, PDBs, and topology spreading
## Cluster Design
### Why GKE Standard over Autopilot
Autopilot simplifies operations but imposes constraints and cost premiums that work against our needs.
For Kafka and OpenSearch on GKE (Phase 2), Standard provides the necessary flexibility. The ops burden is mitigated by the zero-downtime recipe documented below.
### Cluster configuration
Why zonal for dev PoC: halves node count compared to regional (1 node per pool instead of 3). Regional deferred to prod or when HA is required. The GKE free tier ($74.40/mo credit) covers one zonal cluster — regional clusters pay the full $0.10/hr management fee (~$74/mo).
### Node resource reservations
GKE reserves CPU and memory on every worker node for kubelet, kube-proxy, containerd, and eviction thresholds. You pay for the full VM but can only schedule pods into the allocatable portion.
- CPU reservation (dedicated-core machines): 6% of the first core, 1% of the second core, 0.5% of each of the next two cores, and 0.25% of every core beyond four.
- CPU reservation (shared-core E2): a flat 1060 millicores — these machines lose over half their nominal vCPU. Avoid shared-core for GKE.
- Memory reservation: 255 MiB for machines under 1 GiB; otherwise 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, and 2% of anything above 128 GiB, plus a 100 MiB hard-eviction threshold.
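As a sanity check, here is the rough arithmetic behind the e2-standard-2 allocatable figures used in the default-pool budget below. This is illustrative only and assumes GKE's published reservation formula; the authoritative values come from the node's reported allocatable:

```hcl
# Rough allocatable math for e2-standard-2 (2 vCPU, 8 GiB), assuming the
# reservation formula above. Real values are reported by the node itself.
locals {
  # CPU: 6% of the first core + 1% of the second core = 70m reserved
  e2s2_cpu_allocatable_m = 2000 - (60 + 10) # ~1930m

  # Memory: 25% of the first 4 GiB + 20% of the next 4 GiB + 100 MiB eviction threshold
  e2s2_mem_reserved_mib    = (0.25 * 4096) + (0.20 * 4096) + 100         # ~1943 MiB
  e2s2_mem_allocatable_gib = (8192 - local.e2s2_mem_reserved_mib) / 1024 # ~6.1 GiB
}
```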
### Machine type selection
## Node Pools
### default-pool — Airflow + system services
Hosts Airflow (scheduler, Celery worker, webserver, triggerer, Redis) and lightweight system components (ingress controller, CSI driver, metrics agent). Single node in Phase 1.
Phase 1 resource budget: 1x e2-standard-2 provides ~1930m CPU / ~6.1 GiB allocatable, shared by the Airflow components and system pods.
This budget is snug but workable because dbt-bigquery is I/O-bound (it submits SQL to BigQuery and waits). CPU spikes during dbt compile are brief and can burst above requests, up to the pods' limits.
Scaling signals and upgrade path — see Airflow on GKE — Monitoring and Alerting for thresholds. Upgrade path: e2-standard-2 ($49/mo) → e2-standard-4 ($98/mo, 3920m / 13.3 GiB), or keep e2-standard-2 with min_nodes=2 ($98/mo, 3860m aggregate, better fault tolerance).
### kpo-pool — KubernetesPodOperator tasks
Ephemeral nodes for on-demand batch work: dbt runs via KPO, data quality checks, ingestion jobs. Scales from 0 to 10 nodes. Uses spot VMs for ~69% savings.
Scale-to-zero flow:
- Airflow triggers a KPO task.
- KPO creates a pod with a toleration for `workload=kpo:NoSchedule` and a `nodeSelector` of `pool: kpo`.
- Pod is `Pending` — no nodes exist in the pool.
- Cluster Autoscaler detects the pending pod (~30 seconds).
- Spot VM provisioned (~60-90 seconds).
- Pod runs, completes, is cleaned up.
- After ~10 minutes idle (`scaleDownUnneededTime`), the autoscaler removes the empty node.
Spot VM savings: ~$15/mo vs ~$49/mo for e2-standard-2 (on-demand). dbt tasks are idempotent — if a spot node is preempted mid-task, Airflow retries.
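A minimal sketch of what the kpo-pool definition looks like in Terraform. Names, counts, and the machine type are assumptions for illustration, not copied from the actual module:

```hcl
resource "google_container_node_pool" "kpo" {
  name     = "kpo-pool"
  cluster  = "dev-gke"        # hypothetical cluster name
  location = "us-central1-a"  # hypothetical zone

  autoscaling {
    min_node_count = 0        # scale-to-zero when no KPO pods are pending
    max_node_count = 10
  }

  node_config {
    machine_type = "e2-standard-2"
    spot         = true       # preemptible capacity; tasks must be retry-safe

    labels = { pool = "kpo" } # matched by the pods' nodeSelector

    # Only pods carrying the matching toleration are scheduled here.
    taint {
      key    = "workload"
      value  = "kpo"
      effect = "NO_SCHEDULE"
    }
  }
}
```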
### workload-pool — DataHub stack (Phase 2)
Hosts Kafka brokers, OpenSearch data nodes, and DataHub services. Not created in Phase 1. Added when DataHub work begins.
## Zero-Downtime Operations
The combination of the following mechanisms ensures that node replacements, upgrades, and scaling events do not interrupt running workloads.
### Surge upgrades
```hcl
max_surge       = 1  # Add 1 new node before draining the old one
max_unavailable = 0  # Never remove a node without adding a replacement first
```
During a node pool upgrade:
- GKE creates a new node with the updated version.
- The old node is cordoned (no new pods scheduled).
- The old node is drained (existing pods are evicted, respecting PDBs).
- Once all pods are safely rescheduled, the old node is deleted.
### PodDisruptionBudgets (PDBs)
PDBs tell Kubernetes how many pods of a set must remain available during voluntary disruptions (upgrades, scaling down).
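For example, a PDB expressed through the kubernetes Terraform provider. The namespace, labels, and replica count are illustrative assumptions, not the actual manifests:

```hcl
# Assumes a 2-replica Deployment labeled app=datahub-gms; with min_available = 1,
# a node drain may evict at most one of its pods at a time.
resource "kubernetes_pod_disruption_budget_v1" "gms" {
  metadata {
    name      = "datahub-gms"
    namespace = "datahub"
  }

  spec {
    min_available = 1

    selector {
      match_labels = {
        app = "datahub-gms"
      }
    }
  }
}
```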
### Topology spreading (Phase 2, regional cluster)
Stateful workloads spread across zones to survive zone-level failures:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
```
Applied to: Kafka brokers, OpenSearch data nodes, DataHub GMS replicas (in prod). Not applicable in Phase 1 (zonal cluster).
### Maintenance window
Node auto-upgrades respect the maintenance window. Combined with surge upgrades and PDBs, upgrades happen during off-hours with zero service interruption.
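A sketch of how the window can be declared on the cluster resource; the actual schedule here is an assumption:

```hcl
# Declared inside the google_container_cluster resource.
# Weekly weekend window, expressed as an RFC 5545 recurrence (times in UTC).
maintenance_policy {
  recurring_window {
    start_time = "2025-01-04T03:00:00Z"
    end_time   = "2025-01-04T07:00:00Z"
    recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
  }
}
```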
### Autoscaling behavior
- Cluster Autoscaler monitors pending pods. When pods can't schedule due to insufficient resources, it adds nodes (up to `max_nodes`). When nodes are underutilized for 10+ minutes, it scales down (respecting PDBs).
- HPA (Horizontal Pod Autoscaler) scales DataHub GMS and Frontend pods based on CPU/memory utilization. Only in prod (dev uses fixed replicas); see the sketch after this list.
- Strimzi Cruise Control rebalances Kafka partitions across brokers after scaling events (Phase 2).
- OpenSearch automatically redistributes shards when nodes join or leave the cluster (Phase 2).
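A sketch of the prod-only HPA for DataHub GMS via the kubernetes provider. The replica bounds, namespace, and 70% CPU target are assumptions:

```hcl
resource "kubernetes_horizontal_pod_autoscaler_v2" "gms" {
  metadata {
    name      = "datahub-gms"
    namespace = "datahub"
  }

  spec {
    min_replicas = 2
    max_replicas = 5

    # Targets the GMS Deployment (name assumed).
    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "datahub-gms"
    }

    # Scale on average CPU utilization across replicas.
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }
  }
}
```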
### Alert on scaling limits
A Cloud Monitoring alert fires when a node pool reaches its `max_nodes` count. At that point the autoscaler cannot add more capacity and workloads may queue. See Observability.
## Ingress and TLS
### GKE Ingress (GCLB)
GKE's built-in Ingress controller provisions a Google Cloud L7 Load Balancer (GCLB) for each Ingress resource. No nginx-ingress, no Istio, no Traefik.
Advantages:
- Zero ops: GKE manages the load balancer lifecycle.
- Native integration with Certificate Manager for GCP-issued TLS.
- Native integration with Cloud Armor for WAF (future).
- Native integration with IAP for zero-trust access (future).
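A sketch of an Ingress routed through GCLB, written with the kubernetes provider. The hostname, service name, static IP name, and certificate map name are assumptions; only the annotation keys are standard GKE Ingress integration points:

```hcl
resource "kubernetes_ingress_v1" "airflow" {
  metadata {
    name      = "airflow"
    namespace = "airflow"
    annotations = {
      "kubernetes.io/ingress.class"                 = "gce"                    # external GCLB
      "kubernetes.io/ingress.global-static-ip-name" = "airflow-ip"             # hypothetical reserved IP
      "networking.gke.io/certmap"                   = "data-platform-certmap"  # Certificate Manager map
    }
  }

  spec {
    rule {
      host = "airflow.umedev.marpont.es" # assumed subdomain of var.domain_name
      http {
        path {
          path      = "/"
          path_type = "Prefix"
          backend {
            service {
              name = "airflow-webserver"
              port {
                number = 8080
              }
            }
          }
        }
      }
    }
  }
}
```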
### Certificate Manager
TLS certificates are issued by GCP Certificate Manager. Wildcard certificate via DNS authorization:
*.data.ume.com.br → Certificate Manager → GCLB → GKE Ingress
No cert-manager pods. No Let's Encrypt ACME challenges. No certificate rotation runbooks. GCP handles everything.
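A sketch of the Certificate Manager chain in Terraform. Resource names and the domain are assumptions; the certificate map is what the Ingress's `networking.gke.io/certmap` annotation would point at:

```hcl
# DNS authorization for the parent domain proves ownership for the wildcard cert.
resource "google_certificate_manager_dns_authorization" "data" {
  name   = "data-dns-authz"
  domain = "umedev.marpont.es"
}

resource "google_certificate_manager_certificate" "wildcard" {
  name = "data-wildcard"
  managed {
    domains            = ["*.umedev.marpont.es"]
    dns_authorizations = [google_certificate_manager_dns_authorization.data.id]
  }
}

resource "google_certificate_manager_certificate_map" "data" {
  name = "data-platform-certmap"
}

resource "google_certificate_manager_certificate_map_entry" "wildcard" {
  name         = "wildcard-entry"
  map          = google_certificate_manager_certificate_map.data.name
  certificates = [google_certificate_manager_certificate.wildcard.id]
  hostname     = "*.umedev.marpont.es"
}
```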
### Ingress routing
The domain is parameterized via `var.domain_name` (default `umedev.marpont.es`, subject to change). The DNS zone is delegated from GoDaddy to Google Cloud DNS.
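A sketch of the Cloud DNS side; the zone name, reserved-IP name, and record are assumptions:

```hcl
# Global static IP reserved for the GCLB; referenced by the Ingress's
# global-static-ip-name annotation (name assumed).
resource "google_compute_global_address" "ingress" {
  name = "airflow-ip"
}

resource "google_dns_managed_zone" "data" {
  name     = "data-platform"
  dns_name = "umedev.marpont.es." # delegated here via NS records set at GoDaddy
}

# Point the wildcard at the reserved GCLB address.
resource "google_dns_record_set" "wildcard" {
  managed_zone = google_dns_managed_zone.data.name
  name         = "*.umedev.marpont.es."
  type         = "A"
  ttl          = 300
  rrdatas      = [google_compute_global_address.ingress.address]
}
```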
## Terraform Configuration
GKE is provisioned via the `modules/gke-standard/` local module, called from `environments/{env}-01-base/gke.tf`. The module uses direct Terraform resources (`google_container_cluster`, `google_container_node_pool`) and encapsulates naming, labels, and security defaults. Each environment calls the module with different parameters (machine types, node counts, location).
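A hypothetical call site; the variable names and values below are assumptions, not the module's real interface:

```hcl
# environments/dev-01-base/gke.tf (illustrative)
module "gke" {
  source = "../../modules/gke-standard"

  cluster_name = "dev-gke"
  location     = "us-central1-a" # a zone makes the cluster zonal; a region string makes it regional

  node_pools = {
    "default-pool" = { machine_type = "e2-standard-2", min_nodes = 1, max_nodes = 2 }
    "kpo-pool"     = { machine_type = "e2-standard-2", min_nodes = 0, max_nodes = 10, spot = true }
  }

  labels = {
    env   = "dev"
    phase = "1"
  }
}
```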
The module enforces:
- Dataplane V2 (`ADVANCED_DATAPATH`) for built-in network policy.
- Workload Identity on all nodes.
- GCS FUSE CSI driver add-on (for mounting GCS buckets as volumes — used by DAG sync).
- Shielded instances (secure boot + integrity monitoring).
- Private nodes with public endpoint (restricted via authorized networks).
- Legacy metadata endpoints disabled.
- Surge upgrade defaults (`max_surge=1`, `max_unavailable=0`).
- Mandatory labels on cluster and node pools.
All settings are exposed as variables with sensible defaults so environments can override without editing the module.
In-cluster resources (Helm releases, ingress config, operators) are provisioned in `environments/{env}-02-runtime/` using the `helm` and `kubernetes` Terraform providers, authenticated via the GKE cluster credentials from `{env}-01-base` remote state.
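The provider wiring follows the usual remote-state pattern. The state bucket, prefix, and output names below are assumptions:

```hcl
data "terraform_remote_state" "base" {
  backend = "gcs"
  config = {
    bucket = "ume-tf-state"  # hypothetical state bucket
    prefix = "dev-01-base"
  }
}

# Short-lived OAuth token from the identity running Terraform.
data "google_client_config" "current" {}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
  token                  = data.google_client_config.current.access_token
  cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_certificate)
}

provider "helm" {
  kubernetes {
    host                   = "https://${data.terraform_remote_state.base.outputs.gke_endpoint}"
    token                  = data.google_client_config.current.access_token
    cluster_ca_certificate = base64decode(data.terraform_remote_state.base.outputs.gke_ca_certificate)
  }
}
```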