# Airflow on GKE

Airflow runs on the shared GKE Standard cluster via the official Apache Airflow Helm chart with CeleryExecutor. It handles orchestration for ETL pipelines, dbt runs (via Cosmos), and future DataHub ingestion recipes.

# Why GKE Standard over Cloud Composer

| Factor | Cloud Composer 3 | Airflow on GKE Standard |
| --- | --- | --- |
| Monthly cost floor (dev) | ~$300-400 (always-on managed infra) | ~$81 (1 node + Cloud SQL) |
| Executor choice | Managed (no LocalExecutor/CeleryExecutor) | Full control |
| KubernetesPodOperator | Billed as DCUs inside Composer's hidden cluster | Runs on a dedicated spot node pool you control |
| Cluster reuse | Separate Google-managed cluster | Same cluster hosts Airflow + DataHub |
| Custom image control | Must extend Composer base; constrained by Google's release cycle | Standard Docker image; any base, any packages |
| Scale-to-zero | Not possible | Possible — delete Helm release or scale to 0 |
| Operational burden | Near-zero | Moderate — you own upgrades, monitoring, DAG sync |

The 4-5x cost difference is the primary driver. The operational burden is acceptable because the GKE cluster is already planned for DataHub.

# Executor: CeleryExecutor

CeleryExecutor uses Redis as a task broker. The scheduler enqueues tasks into Redis; dedicated worker pods pick them up and execute them. This keeps the scheduler lean and decouples task execution from scheduling.

Scheduler → Redis queue → Celery Worker(s) → execute task

# Why CeleryExecutor over LocalExecutor

  • Scheduler isolation: with LocalExecutor, dbt subprocesses compete for CPU/memory on the scheduler pod. With CeleryExecutor, workers handle execution independently.
  • Scalable workers: workers can scale from 1 to N. Start with 1 worker; add more if task queuing grows.
  • No KEDA needed: min=1 worker is always on. No cold-start delay. Workers on the default-pool share the node with the scheduler at no additional VM cost.

# Why not KubernetesExecutor

KubernetesExecutor creates a pod per task — true scale-to-zero, no Redis needed. But every task incurs 10-30s pod startup overhead. At 3-10 DAGs with 10-50 models each, that's significant latency. CeleryExecutor with pre-started workers is faster for steady workloads.

Escape hatch: if you outgrow CeleryExecutor (need per-task isolation, hundreds of concurrent tasks), KubernetesExecutor is the next step. The Helm chart supports switching executors with a single value change.

# Cosmos dbt Execution

# Execution mode: Local (on Celery workers)

Cosmos local execution mode runs dbt as subprocesses directly on the Celery worker. Each dbt model becomes an Airflow task, dispatched to a worker via Redis.

This is the fastest stable mode — no container overhead, no pod startup. The worker handles multiple dbt tasks concurrently (controlled by worker_concurrency).

# Hybrid pattern: Local + KubernetesPodOperator

For specific heavy or isolated jobs (large dbt full-refresh, data quality checks, ingestion jobs), use KubernetesPodOperator to dispatch to the kpo-pool (spot VMs, scale-to-zero):

# Default: Cosmos local mode on Celery workers
dbt_dag = DbtDag(
    execution_config=ExecutionConfig(execution_mode=ExecutionMode.LOCAL),
    ...
)

# Heavy jobs: KPO on kpo-pool
heavy_task = KubernetesPodOperator(
    namespace="airflow-kpo",
    node_selector={"pool": "kpo"},
    tolerations=[{"key": "workload", "value": "kpo", "effect": "NoSchedule"}],
    service_account_name="airflow-kpo",
    ...
)

This gives you the speed of local execution for most work, with full pod isolation available when needed.

# Cosmos execution modes considered but deferred

| Mode | Status | Why deferred |
| --- | --- | --- |
| Watcher | Experimental | Known bugs: broken retries (#2193), missing compiled SQL (#2233), template rendering failures. Revisit when it graduates to stable. |
| Watcher Kubernetes | Experimental | Inherits all watcher bugs + K8s complexity. Same — revisit when stable. |
| Airflow Async | Stable, BQ-only | Strong for BigQuery-only projects (non-blocking async SQL submission). Deferred because future dbt targets may include Postgres or other engines. |
| Kubernetes | Stable | 10-30s pod startup per model. At 50+ models per DAG, overhead is 8-25 min. Use KPO for individual heavy tasks instead. |
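
The 8-25 min overhead figure for Kubernetes mode is simple arithmetic over the per-pod startup cost:

```python
models = 50                       # models in one DAG
startup_low_s, startup_high_s = 10, 30  # pod startup overhead per task

low_min = models * startup_low_s / 60
high_min = models * startup_high_s / 60
print(f"{low_min:.0f}-{high_min:.0f} min of pure startup overhead")  # 8-25 min
```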

# Terraform Configuration

Airflow is deployed in environments/{env}-02-runtime/airflow.tf using a helm_release resource with the official Apache Airflow Helm chart 1.20.0 (Airflow 3.2.0).

Key variables:

| Variable | Dev value | Prod value | Notes |
| --- | --- | --- | --- |
| project_id | poc-ume-data | ume-platform-prod | From tfvars |
| airflow_image_tag | 3.2.0 | same tag as dev | Immutable; prod promotes dev-validated tag |
| airflow_chart_version | 1.20.0 | same | Chart 1.20.0 supports Airflow 3 components |
| airflow_namespace | airflow | airflow | Dedicated namespace |

# Airflow 3 component architecture

Chart 1.20.0 uses semver gates in its templates. With Airflow >= 3.0.0:

| Component | Status | Notes |
| --- | --- | --- |
| apiServer | Enabled | Serves UI and REST API (replaces webserver) |
| scheduler | Enabled | Scheduling only; DAG parsing moved to dagProcessor |
| dagProcessor | Enabled | Mandatory standalone DAG parser in Airflow 3 |
| triggerer | Enabled | Deferrable operator polling |
| workers | Enabled | CeleryExecutor task execution |
| webserver | Skipped by chart | Template only renders for Airflow < 3. Block kept for defaultUser config consumed by createUserJob |

# Bootstrap sequence

Before the Helm release, a Terraform-managed kubernetes_job_v1 (db_bootstrap) runs:

  1. Cloud SQL Auth Proxy native sidecar (init container with restartPolicy: Always)
  2. grants init container -- connects as postgres admin, GRANTs privileges to the IAM user
  3. migrate init container -- runs airflow db migrate as the IAM user

This is needed because Cloud SQL IAM users start with zero DB privileges, and the chart's built-in migration hook runs too late. The chart's migrateDatabaseJob is disabled.

# Service account

Chart 1.20.0 creates per-component KSAs by default (airflow-scheduler, airflow-api-server, etc.), none of which carry the Workload Identity annotation. A single kubernetes_service_account_v1 is created in Terraform with the WI annotation, and all components reference it with serviceAccount = { create = false, name = "airflow" }.

# Helm values (dev PoC)

executor: CeleryExecutor
defaultAirflowRepository: apache/airflow
defaultAirflowTag: "3.2.0"

# ---------- scheduler ----------
scheduler:
  replicas: 1
  serviceAccount: { create: false, name: airflow }
  waitForMigrations: { enabled: false }
  resources:
    requests: { cpu: 200m, memory: 512Mi }
    limits: { cpu: "1", memory: 1Gi }
  startupProbe: { timeoutSeconds: 60, failureThreshold: 20 }
  livenessProbe: { timeoutSeconds: 60 }
  extraContainers:
    - <cloud-sql-proxy --auto-iam-authn --private-ip>
  # + GCS FUSE volume/mount/annotation

# ---------- API server (Airflow 3+) ----------
apiServer:
  enabled: true
  replicas: 1
  serviceAccount: { create: false, name: airflow }
  waitForMigrations: { enabled: false }
  resources:
    requests: { cpu: 250m, memory: 512Mi }
    limits: { cpu: 500m, memory: 1Gi }
  startupProbe: { failureThreshold: 20 }
  extraContainers:
    - <cloud-sql-proxy --auto-iam-authn --private-ip>

# ---------- DAG processor (Airflow 3+) ----------
dagProcessor:
  enabled: true
  replicas: 1
  serviceAccount: { create: false, name: airflow }
  waitForMigrations: { enabled: false }
  resources:
    requests: { cpu: 150m, memory: 384Mi }
    limits: { cpu: 500m, memory: 1Gi }
  livenessProbe: { timeoutSeconds: 60 }
  extraContainers:
    - <cloud-sql-proxy --auto-iam-authn --private-ip>
  # + GCS FUSE volume/mount/annotation

# ---------- webserver (Airflow < 3 only) ----------
# Chart skips this template for Airflow 3+.
# Kept for defaultUser consumed by createUserJob.
webserver:
  serviceAccount: { create: false, name: airflow }
  defaultUser:
    enabled: true

# ---------- triggerer ----------
triggerer:
  enabled: true
  replicas: 1
  serviceAccount: { create: false, name: airflow }
  waitForMigrations: { enabled: false }
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 250m, memory: 512Mi }
  livenessProbe: { timeoutSeconds: 60 }
  extraContainers:
    - <cloud-sql-proxy --auto-iam-authn --private-ip>
  # + GCS FUSE volume/mount/annotation

# ---------- celery workers ----------
workers:
  replicas: 1
  serviceAccount: { create: false, name: airflow }
  waitForMigrations: { enabled: false }
  resources:
    requests: { cpu: 500m, memory: 1536Mi }
    limits: { cpu: "1.5", memory: 3Gi }
  livenessProbe: { timeoutSeconds: 60 }
  terminationGracePeriodSeconds: 600
  extraContainers:
    - <cloud-sql-proxy --auto-iam-authn --private-ip>
  # + GCS FUSE volume/mount/annotation

# ---------- redis ----------
redis:
  enabled: true
  serviceAccount: { create: false, name: airflow }
  resources:
    requests: { cpu: 50m, memory: 64Mi }
    limits: { cpu: 100m, memory: 128Mi }

# ---------- metadata database (external Cloud SQL) ----------
postgresql:
  enabled: false

data:
  metadataSecretName: airflow-metadata-connection
  resultBackendSecretName: airflow-result-backend-connection

# ---------- DAG sync (GCS FUSE, not git-sync) ----------
dags:
  persistence: { enabled: false }
  gitSync: { enabled: false }

# ---------- remote logging ----------
env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: "gs://ume-airflow-logs-poc-ume-data/logs"
  - name: AIRFLOW__LOGGING__DELETE_LOCAL_LOGS
    value: "True"

# ---------- airflow.cfg overrides ----------
config:
  core:
    parallelism: 16
    max_active_tasks_per_dag: 8
    max_active_runs_per_dag: 2
  celery:
    worker_concurrency: 8
  scheduler:
    min_file_process_interval: 60

# ---------- chart migration job (disabled -- handled by Terraform bootstrap) ----------
migrateDatabaseJob:
  enabled: false

# ---------- cleanup (disabled in chart -- standalone Terraform CronJob) ----------
cleanup:
  enabled: false
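
The concurrency knobs in the config block above interact: a task instance runs only if it clears the global `core.parallelism` cap, the per-DAG caps, and the workers' aggregate Celery slots. A rough sketch of the effective ceiling (an illustrative simplification that ignores pools and per-queue routing):

```python
parallelism = 16          # core.parallelism: global cap on running tasks
worker_replicas = 1       # workers.replicas in the Helm values
worker_concurrency = 8    # celery.worker_concurrency: slots per worker

celery_slots = worker_replicas * worker_concurrency
effective_ceiling = min(parallelism, celery_slots)
print(effective_ceiling)  # 8 -- the single worker is the binding limit
```

With one worker, adding a second replica doubles the Celery slots to 16 and `core.parallelism` becomes the binding limit.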

# Outputs

The runtime stack exports:

  • airflow_namespace -- Kubernetes namespace.
  • airflow_logs_bucket -- GCS bucket for task execution logs.
  • airflow_dags_bucket -- GCS bucket for DAG sync via GCS FUSE.

# Custom Image

# What goes in the image

The custom image extends the official Apache Airflow base image. It adds the Python packages needed for Cosmos + dbt:

| Package | Purpose |
| --- | --- |
| astronomer-cosmos | Renders dbt projects as Airflow task groups |
| dbt-core | dbt runtime |
| dbt-bigquery | BigQuery adapter for dbt |

System packages (git, build tools) are installed if needed by Python wheels.

# What does NOT go in the image

  • The dbt project itself — synced to GCS bucket via CI (see DAG Sync below).
  • DAG files — synced to GCS bucket via CI.
  • Secrets or credentials — injected at runtime via Workload Identity or Secret Manager.

# Image lifecycle

Image ownership lives in ume-data-dags:

ume-data-dags CI (on push to main touching docker/)
    │
    ├── image.yml
    │   ├── docker build + push <AR_URL>/airflow:3.2.0-<sha>
    │   └── tag is immutable (AR docker_config.immutable_tags = true)
    │
    └── bot-pr.yml (workflow_run on image.yml success)
        └── Uses INFRA_PR_TOKEN (fine-grained PAT) to open a PR on
            ume-data-infra bumping airflow_image_tag in
            environments/dev-03-runtime/terraform.tfvars.
            Merging that PR triggers terraform-apply's
            wait-for-image gate → Helm rolls the pods.

Tag format: <airflow-version>-<commit-sha> (e.g., 3.2.0-a1b2c3d).
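
The tag can be derived deterministically in CI from the Airflow version and the commit (a hypothetical helper; the real logic lives in image.yml):

```python
def image_tag(airflow_version: str, commit_sha: str) -> str:
    """Build the immutable tag <airflow-version>-<short-sha>."""
    return f"{airflow_version}-{commit_sha[:7]}"

print(image_tag("3.2.0", "a1b2c3d4e5f6"))  # 3.2.0-a1b2c3d
```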

Immutability: once pushed, a tag is never overwritten. Prod promotion means changing prod-02-runtime/terraform.tfvars to reference the exact same tag validated in dev.

Rollback: revert the airflow_image_tag in tfvars to the previous value and apply.

# DAG Sync

# Mechanism: GCS FUSE CSI

DAGs are synced from a GCS bucket to the Airflow pods via the GCS FUSE CSI driver — a native GKE add-on that mounts a GCS bucket as a local filesystem.

ume-data-dags CI (push to main, paths: dags/ or dbt/)
    │
    └── dag-sync.yml
        ├── gcloud storage rsync dags/ gs://ume-airflow-dags-poc-ume-data/dags/
        └── gcloud storage rsync dbt/ gs://ume-airflow-dags-poc-ume-data/dbt/
            │
            └── GCS FUSE CSI (mountOptions: implicit-dirs) mounts bucket
                at /opt/airflow/dags/ on scheduler, worker, triggerer,
                dag-processor. Changes visible near-instantly; dag-processor
                refreshes the bundle every 300s.

# Why GCS FUSE over git-sync

| Factor | GCS FUSE CSI | git-sync sidecar |
| --- | --- | --- |
| Auth | Workload Identity (already configured) | GitHub token or SSH key in Secret Manager |
| Failure modes | GCS is highly reliable | Network issues to GitHub, token expiration, rate limiting |
| Config overhead | Enable GKE add-on + volume mount | Token management, callback URLs, Helm chart config |
| Iteration speed | ~1-2 min (CI sync + scheduler scan) | ~2 min (poll interval + scheduler scan) |

Workload Identity handles all GCS auth — no additional credentials to manage, rotate, or store.

# GCS FUSE CSI configuration

The GCS FUSE CSI driver is enabled as a GKE cluster add-on in modules/gke-standard/. Pods opt in via annotation and volume spec:

# Pod annotation (enables the FUSE sidecar injector)
gke-gcsfuse/volumes: "true"

# Volume spec
volumes:
  - name: dags
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: ume-airflow-dags-poc-ume-data

volumeMounts:
  - name: dags
    mountPath: /opt/airflow/dags/
    readOnly: true

# DAG + dbt project location at runtime

GCS FUSE mounts the bucket root at /opt/airflow/dags/. The dag-sync.yml workflow in ume-data-dags rsyncs its dags/ and dbt/ directories into the bucket, so the filesystem looks like:

/opt/airflow/dags/
├── dags/
│   └── cosmos_dbt_dag.py
└── dbt/
    ├── dbt_project.yml
    ├── profiles.yml
    └── models/
        └── example/

Cosmos references the dbt project at /opt/airflow/dags/dbt.
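
Because the bucket root is mounted at /opt/airflow/dags/, a repo-relative path maps to its runtime location by simple prefixing (a sketch of the convention, not code that ships):

```python
from pathlib import PurePosixPath

MOUNT_ROOT = PurePosixPath("/opt/airflow/dags")

def runtime_path(repo_relative: str) -> str:
    """Map a path inside ume-data-dags (dags/... or dbt/...) to the
    location Airflow pods see through the GCS FUSE mount."""
    return str(MOUNT_ROOT / repo_relative)

print(runtime_path("dbt/dbt_project.yml"))    # /opt/airflow/dags/dbt/dbt_project.yml
print(runtime_path("dags/cosmos_dbt_dag.py"))
```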

# Iteration speed

  1. Engineer pushes to main.
  2. CI pipeline runs gcloud storage rsync (~30 seconds).
  3. GCS FUSE reflects new files (near-instant — bucket is mounted live).
  4. Scheduler detects updated files (~30-60 seconds).

Total time from push to runnable: ~1-2 minutes.

# dbt + Cosmos Integration

# How Cosmos works

Cosmos is an Airflow provider that renders a dbt project as an Airflow task group. Each dbt model becomes an Airflow task, with dependencies preserved.

from cosmos import DbtDag, ProjectConfig, ProfileConfig, ExecutionConfig
from cosmos.constants import ExecutionMode

dbt_dag = DbtDag(
    project_config=ProjectConfig(
        dbt_project_path="/opt/airflow/dags/dbt",
    ),
    profile_config=ProfileConfig(
        profile_name="ume",
        target_name="dev",
        profiles_yml_filepath="/opt/airflow/dags/dbt/profiles.yml",
    ),
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.LOCAL,
        dbt_executable_path="/home/airflow/dbt-venv/bin/dbt",
    ),
    schedule="@daily",
    dag_id="dbt_ume",
)

# dbt profile and credentials

dbt connects to BigQuery using the Airflow service account's identity (Workload Identity). The profiles.yml uses the oauth method:

ume:
  target: "{{ ERROR }}"
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: poc-ume-data
      dataset: "{{ ERROR }}"
      threads: 4
    prod:
      type: bigquery
      method: oauth
      project: ume-data-prod
      dataset: "{{ ERROR }}"
      threads: 8

No service-account keys. Workload Identity provides the OAuth token.

# KubernetesPodOperator (KPO)

# How it works

KPO creates Kubernetes pods directly via the API. KPO tasks run on the dedicated kpo-pool (spot VMs, scale-to-zero). Use for heavy batch jobs, data quality checks, or any task needing full isolation from the Airflow worker.

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

dbt_full_refresh = KubernetesPodOperator(
    task_id="dbt_full_refresh",
    namespace="airflow-kpo",
    image="{{ ERROR }}",
    cmds=["dbt", "run", "--full-refresh", "--project-dir", "/dbt"],
    service_account_name="airflow-kpo",
    node_selector={"pool": "kpo"},
    tolerations=[{
        "key": "workload",
        "operator": "Equal",
        "value": "kpo",
        "effect": "NoSchedule",
    }],
    on_finish_action="delete_pod",  # replaces the deprecated is_delete_operator_pod
)

# kpo-pool scale-to-zero

  1. Scheduler dispatches KPO task to worker (via CeleryExecutor).
  2. Worker creates a pod with toleration for workload=kpo:NoSchedule + nodeSelector: pool: kpo.
  3. Pod is Pending — no kpo-pool nodes exist.
  4. Cluster Autoscaler detects pending pod (~30s).
  5. Spot VM provisioned (~60-90s).
  6. Pod runs, completes, is cleaned up.
  7. After ~10 minutes idle, autoscaler removes the empty node.
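
Every KPO task repeats the same pool-targeting boilerplate; a small helper can centralize it (a hypothetical convenience, with names mirroring the example above):

```python
def kpo_pool_kwargs(namespace: str = "airflow-kpo") -> dict:
    """Shared KubernetesPodOperator kwargs that steer a task onto kpo-pool:
    the nodeSelector picks the pool, and the toleration lets the pod past
    the workload=kpo:NoSchedule taint."""
    return {
        "namespace": namespace,
        "service_account_name": "airflow-kpo",
        "node_selector": {"pool": "kpo"},
        "tolerations": [{
            "key": "workload",
            "operator": "Equal",
            "value": "kpo",
            "effect": "NoSchedule",
        }],
    }

# usage: KubernetesPodOperator(task_id="...", image="...", **kpo_pool_kwargs())
print(kpo_pool_kwargs()["node_selector"])  # {'pool': 'kpo'}
```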

# Logging (Hybrid: Cloud Logging + GCS)

Airflow uses a hybrid logging approach — operational logs and task execution logs go to different destinations, each optimized for its use case.

# Container logs → Cloud Logging (automatic)

GKE automatically ships all container stdout/stderr to Cloud Logging (GCP's equivalent of CloudWatch). This is zero-config — enabled by default on every GKE cluster. These logs cover:

  • Scheduler heartbeat and parsing output
  • Worker task pickup and execution events
  • API server access logs (UI and REST)
  • Pod crashes, OOM events, restarts

Cloud Logging provides searchable, indexed logs with alerting — ideal for operational observability.

# Task execution logs → GCS (configured)

Airflow's built-in remote_logging feature ships task execution logs (the structured output from each DAG task run) to a GCS bucket. The Airflow UI reads task logs directly from GCS.

env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: "True"
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: "gs://ume-airflow-logs-poc-ume-data/logs"
  - name: AIRFLOW__LOGGING__DELETE_LOCAL_LOGS
    value: "True"

GCS is cheaper than Cloud Logging for long-term retention, and Airflow reads task logs from it natively — no custom log handler needed.
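
Assuming Airflow's default log_filename_template (the dag_id=/run_id=/task_id=/attempt=N.log layout used since Airflow 2.3), each task attempt maps to a predictable object under the remote base folder. A sketch:

```python
BASE = "gs://ume-airflow-logs-poc-ume-data/logs"

def task_log_uri(dag_id: str, run_id: str, task_id: str, try_number: int) -> str:
    """GCS URI for one task attempt's log, following Airflow's default
    log_filename_template (assumed unchanged in this deployment)."""
    return (f"{BASE}/dag_id={dag_id}/run_id={run_id}"
            f"/task_id={task_id}/attempt={try_number}.log")

print(task_log_uri("dbt_ume", "scheduled__2026-01-01T00:00:00+00:00",
                   "run_model_a", 1))
```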

# GCS log bucket

Created via modules/gcs-bucket/ in dev-02-runtime/buckets.tf:

  • Bucket: ume-airflow-logs-poc-ume-data
  • ume-airflow has roles/storage.objectAdmin (project-wide for PoC; scope to this bucket as a hardening task)
  • Lifecycle rule: delete objects older than 90 days (configurable)

# Log cleanup sidecar

The Helm chart includes a log cleanup sidecar on the scheduler pod. With remote logging + DELETE_LOCAL_LOGS=True, this is a safety net:

scheduler:
  logCleanup:
    enabled: true
    retentionMinutes: 1440  # 1 day (local copies only; GCS has its own lifecycle)

# Metadata Database Maintenance

# Growth drivers

The Airflow metadata database grows continuously. Fastest-growing tables:

| Table | Growth driver |
| --- | --- |
| task_instance | One row per task execution |
| log | Task execution events |
| xcom | Inter-task data passing |
| dag_run | One row per DAG run |

Without cleanup, the database grows indefinitely and scheduler performance degrades.

# Automated cleanup

A standalone kubernetes_cron_job_v1 Terraform resource runs airflow db clean weekly. The Helm chart's built-in cleanup section does not support sidecar injection (additionalProperties: false in its JSON schema), so the Cloud SQL Auth Proxy cannot be added there. The standalone CronJob uses a K8s 1.28+ native sidecar (init container with restartPolicy: Always) to provide database connectivity.

# In the environment's airflow.tf or terraform.tfvars:
cleanup_enabled = true           # default: false
cleanup_schedule = "0 3 * * 0"   # Sunday 3 AM UTC
cleanup_retention_days = 90

This retains 90 days of metadata. Adjust based on backfill needs — if you use depends_on_past=True, ensure retention covers the lookback window.

# Manual cleanup

For one-off cleanups or targeted table cleanup:

kubectl exec -it deploy/airflow-scheduler -n airflow -- \
  airflow db clean --clean-before-timestamp "2026-01-01" --only-tables task_instance,log,xcom

Always back up the database before manual cleanup.

# Monitoring and Alerting

# Scaling signals: when to upgrade from e2-standard-2

The default-pool starts with a single e2-standard-2 (1930m allocatable CPU, ~6.1 GiB RAM). Monitor these signals to know when to scale:

| Signal | Threshold | Action |
| --- | --- | --- |
| Scheduler heartbeat gap | > 30 seconds sustained | Scheduler is CPU-starved. Upgrade node or add second node. |
| Task queue depth (Redis) | Growing unbounded | Worker can't keep up. Add workers or increase worker_concurrency. |
| Worker CPU usage | > 80% sustained (15 min) | dbt tasks are CPU-bound. Upgrade to e2-standard-4 or add a second node. |
| Pod evictions | Any on default-pool | Memory pressure. Upgrade to e2-standard-4 (13.3 GiB allocatable). |
| Cluster Autoscaler adds 2nd node | Frequently | Sustained demand exceeds 1 node. Set min_nodes=2 or switch to e2-standard-4. |

Upgrade path: e2-standard-2 ($49/mo) → e2-standard-4 ($98/mo, 3920m CPU / 13.3 GiB). Alternatively, keep e2-standard-2 and set min_nodes=2 ($98/mo, 3860m aggregate CPU but better fault tolerance).
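
The two $98/mo options trade capacity shape, not price (rough figures from the table above, using this doc's cost estimates):

```python
# (monthly $, allocatable mCPU) per node, per the figures above
e2_standard_2 = (49, 1930)
e2_standard_4 = (98, 3920)

two_small = (2 * e2_standard_2[0], 2 * e2_standard_2[1])
print(two_small)       # (98, 3860): same cost, slightly less CPU, two failure domains
print(e2_standard_4)   # (98, 3920): more headroom per pod, single point of failure
```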

# Cloud SQL alerts

| Metric | Threshold | Severity |
| --- | --- | --- |
| database/disk/quota_utilization | > 80% | Warning |
| database/cpu/utilization | > 85% sustained (10 min) | Warning |
| database/memory/utilization | > 85% sustained (10 min) | Warning |
| Database size | > 5 GiB (on 10 GiB allocation) | Info |

# Airflow alerts

| Metric | Threshold | Severity |
| --- | --- | --- |
| Scheduler heartbeat gap | > 60 seconds | Critical |
| Task failure rate | > 10% over 15 minutes | Warning |
| DAG import errors | > 0 | Warning |
| Orphaned tasks | > 0 sustained | Warning |

# GKE / infrastructure alerts

| Metric | Threshold | Severity |
| --- | --- | --- |
| Node CPU pressure | > 85% sustained (15 min) | Warning |
| Node memory pressure | > 90% | Critical |
| Pod restart count | > 3 in 1 hour (per pod) | Warning |
| kpo-pool at max nodes | node_count = max_nodes | Warning |

# Recommended dashboard

Create a Cloud Monitoring dashboard with:

  1. Scheduler health: heartbeat interval, task queue depth, DAG parsing time
  2. Worker load: CPU/memory utilization, task concurrency, queue wait time
  3. Database: disk usage, CPU, memory, active connections
  4. Cluster: node count per pool, pod status, autoscaler events

# Workload Identity

| Kubernetes SA | Namespace | Google SA | Purpose |
| --- | --- | --- | --- |
| airflow | airflow | ume-airflow | Cloud SQL IAM auth, Secret Manager, BigQuery, GCS (logs + DAG bucket) |
| airflow-kpo | airflow-kpo | ume-airflow-kpo | Scoped identity for KPO tasks (BigQuery, GCS only) |

Separate namespaces and SAs enforce least privilege — a compromised KPO container cannot access Airflow metadata or Cloud SQL.

# API Server Authentication (IAP at the GCLB)

External access to the Airflow API server is gated by Identity-Aware Proxy at the Google Cloud Load Balancer, one layer in front of Airflow. Airflow 3 keeps its default SimpleAuthManager with an admin user created by the Helm chart's createUserJob; users reach the app only after IAP validates their Google identity. Implemented in Story 4c.

# Topology

Browser
   │  TLS handshake with *.umedev.marpont.es
   ▼
GCLB  (shared static IP ume-data-dev-ingress-ip, wildcard cert from Certificate Manager)
   │  Gateway listens on :80 (redirect → :443) and :443 (HTTPS)
   ▼
HTTPRoute (airflow namespace) — hostname airflow.umedev.marpont.es → airflow-api-server:8080
   │
   ▼
IAP gate  (attached to the backend service via GCPBackendPolicy)
   │  Google OIDC sign-in + check against roles/iap.httpsResourceAccessor
   ▼
Service airflow-api-server → api-server pod → Airflow SimpleAuthManager

# Why IAP over Airflow-native OIDC

| Factor | IAP at GCLB | Airflow-native OIDC (Flask AppBuilder) |
| --- | --- | --- |
| Airflow 3 support | Works out of the box — Airflow stays on SimpleAuthManager | Requires installing apache-airflow-providers-fab in a custom image |
| Image changes | None | Pulls Story 4d's custom-image work forward |
| Identity | Google OIDC, enforced before traffic hits Airflow | Google OIDC, enforced inside Airflow |
| Port-forward break-glass | Still works (bypasses LB, lands on SimpleAuthManager login) | Same — port-forward bypass is orthogonal |
| DataHub consistency | Same pattern reused for DataHub in Phase 2 | Different auth stack per app |

# Gateway API (not classic Ingress)

The shared Gateway is a gateway.networking.k8s.io/v1 Gateway with gatewayClassName: gke-l7-global-external-managed. One Gateway fronts every service in the environment — Airflow today, DataHub in Phase 2 — on a single static IP and a single wildcard TLS cert. Each app attaches its own HTTPRoute for its hostname.

Gateway ownership:

  • Shared Gateway + redirect HTTPRoute live in environments/{env}-02-k8s-base/gateway.tf.
  • Per-app HTTPRoute lives inside the app's module (for Airflow, modules/airflow-helm/httproute.tf).

Cross-namespace attachment is allowed without ReferenceGrant by setting allowedRoutes.namespaces.from = All on each listener. Backend Service references stay intra-namespace (HTTPRoute and Service both in airflow).

# IAP wiring

Per-service IAP is provisioned by modules/iap-oauth/:

  • google_iap_client creates an OAuth 2.0 client under the project-level IAP brand (passed in via var.iap_brand_name).
  • kubernetes_secret_v1 stores client_id + client_secret in the Service namespace — keys must match exactly because GCPBackendPolicy.spec.default.iap.oauth2ClientSecret.name expects that shape.
  • kubernetes_manifest GCPBackendPolicy (networking.gke.io/v1) attaches IAP to the Service via targetRef. The GKE Gateway controller reads this and enables IAP on the generated backend service.
  • google_project_iam_member bindings grant roles/iap.httpsResourceAccessor to the allow-listed principals unconditionally. IAM conditions on the project-level grant do not propagate to IAP's authorization path for Gateway-API backends (IAP reads the IAP-resource-level policy on the backend, not project IAM with conditions). Scoping is done via the allow-list — pick users/groups tightly — not via IAM conditions.

The module accepts three allow-list variables — iap_allowed_domains, iap_allowed_groups, iap_allowed_users — and takes the UNION. Use individual users for tight scoping during the PoC. When a second IAP-protected backend exists with different access requirements, switch to google_iap_web_backend_service_iam_member scoped per service.
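
The union semantics amount to mapping the three variables onto IAM member strings (GCP's standard user:/group:/domain: prefixes; the function name is illustrative, not part of the module):

```python
def iap_members(domains=(), groups=(), users=()) -> set[str]:
    """Union of the three allow-lists, rendered as IAM member strings
    for the roles/iap.httpsResourceAccessor binding."""
    return ({f"domain:{d}" for d in domains}
            | {f"group:{g}" for g in groups}
            | {f"user:{u}" for u in users})

print(sorted(iap_members(users=["a@example.com"], groups=["data@example.com"])))
# ['group:data@example.com', 'user:a@example.com']
```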

# Prerequisite — OAuth consent screen is manual

google_iap_brand cannot create the OAuth consent screen via API for projects outside a Workspace org, and even for in-org projects the IAP OAuth Admin API is being phased out. The brand must be created once in the GCP Console. See the header comment in environments/{env}-03-runtime/iap.tf for the step-by-step runbook. After creation:

gcloud iap oauth-brands list --project=<project-id> --format='value(name)'

Paste the resulting projects/<project_number>/brands/<brand_id> into iap_brand_name in the runtime stack's tfvars.

# Known provider quirks

  • IAM conditions on IAP bindings are inert for Gateway-API backends. A project-level google_project_iam_member on roles/iap.httpsResourceAccessor with condition resource.type == "iap.googleapis.com/WebBackendService" applies cleanly and shows in gcloud projects get-iam-policy, but IAP rejects sign-in with "You don't have access". IAP for Gateway API reads the IAP-resource-level policy (gcloud iap web get-iam-policy --resource-type=backend-services --service=…), not project IAM with conditions. Use unconditional bindings or google_iap_web_backend_service_iam_member per backend.
  • Conditional IAM member + domain: members crash on create. With an IAM condition attached, google_project_iam_member creations for domain: members hit a google-provider rollback bug ("Provider produced inconsistent result after apply: Root object was present, but now absent"). user: members don't hit this. Combined with the previous point, the cleanest path is unconditional + explicit per-user allow-list.
  • google_iap_brand can't be created for non-Workspace projects (HTTP 400) and the IAP OAuth Admin API is being phased out. Create the OAuth consent screen manually in Console and pass the brand name in.
  • The IAP brand is a one-way door — cannot be deleted via API; terraform destroy requires terraform state rm google_iap_brand.project first.

# Airflow-side auth (post-IAP)

Behind IAP, Airflow 3 runs SimpleAuthManager with [core] simple_auth_manager_all_admins = true — every request is treated as admin with no login prompt. IAP already authenticated the user; a second password would add no security and confuse users. The chart's createUserJob is auto-disabled in that mode because airflow users create uses FAB's security manager (AirflowSecurityManagerV2.find_role) which isn't configured under SimpleAuthManager and crashes the Helm hook.

Two Airflow configs must be set as a pair:

| Config | Value | Effect |
| --- | --- | --- |
| [core] auth_manager | airflow.api_fastapi.auth.managers.simple.simple_auth_manager.SimpleAuthManager | Pins SimpleAuthManager. The default apache/airflow:3.2.0 image ships the FAB provider, which would otherwise take over get_auth_manager(); the combination (SimpleAuthManager middleware + FAB manager) throws AttributeError: 'SimpleAuthManagerUser' has no attribute 'id'. |
| [core] simple_auth_manager_all_admins | true | Skips the login screen; every request is admin. |

The airflow-helm module wires both together — airflow_config.simple_auth_manager_all_admins = true on the module call flips both internally, and also disables createUserJob.
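
Airflow reads any config key from the environment as AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt;, so the pair above also has equivalent env-var spellings:

```python
def airflow_env_var(section: str, key: str) -> str:
    """Environment-variable name Airflow maps to [section] key."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"

print(airflow_env_var("core", "simple_auth_manager_all_admins"))
# AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_ALL_ADMINS
```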

Port-forward remains a break-glass path:

kubectl port-forward svc/airflow-api-server 8080:8080 -n airflow
# lands straight on the UI — SimpleAuthManager trusts every request

Port-forward is already gated upstream by GKE IAM (you need container.clusters.get + pod exec/port-forward perms to run it), so skipping the Airflow-side login doesn't widen the blast radius.

# Cloud SQL (Airflow Metadata Database)

A Cloud SQL PostgreSQL instance serves as the Airflow metadata store.

| Setting | Dev (PoC) | Prod |
| --- | --- | --- |
| Instance name | ume-data-dev-airflow-pg | ume-data-prod-platform-pg |
| Tier | db-g1-small | db-custom-2-7680 |
| HA | Single zone | Regional (auto failover) |
| Storage | 10 GB SSD, auto-increase | 50 GB SSD |
| Backups | Daily, 7-day retention | Daily, 30-day + PITR |
| Network | Private IP via PSA | Same |
| Auth | IAM authentication | Same |

Phase 2 shared instance strategy: When DataHub arrives, evaluate whether to create a second logical database (datahub) on this instance (cheaper) or a separate instance (better isolation).