# CI/CD

All infrastructure changes flow through GitHub Actions. No manual terraform apply from laptops, except for the one-time bootstrap and documented break-glass emergencies (see Conventions). This section documents the workflows, authentication, and operational conventions.

# Authentication

# Workload Identity Federation

GitHub Actions authenticates to GCP using Workload Identity Federation (WIF). No service-account key files.

GitHub Actions OIDC token
    → GCP WIF pool (trusts github.com)
    → WIF provider (attribute condition: repo = 1edata/ume-data-infra)
    → Short-lived access token for tf-plan-sa or tf-apply-sa
    → Terraform uses this token for all GCP API calls

The WIF pool and provider are created in layers/00-bootstrap/. The attribute condition restricts access to the specific repository, so if the repo is renamed or moves to a different GitHub org, the attribute condition must be updated (see Operations - WIF repo rename).
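To make the repo-move case concrete, here is a minimal sketch of the condition string that would have to change. The helper name `wif_attribute_condition` is hypothetical, and the sketch assumes the condition is a CEL expression over the GitHub OIDC token's `repository` claim. The actual expression in layers/00-bootstrap/ may be written differently, for example against a mapped attribute.

```shell
# Hypothetical helper: prints a CEL attribute condition for one GitHub repo.
# Assumption: the provider checks the OIDC token's `repository` claim.
wif_attribute_condition() {
  local repo="$1"   # "owner/name", e.g. 1edata/ume-data-infra
  case "$repo" in
    */*) ;;
    *) echo "expected owner/name, got: $repo" >&2; return 1 ;;
  esac
  printf 'assertion.repository == "%s"\n' "$repo"
}
```

Calling `wif_attribute_condition 1edata/ume-data-infra` prints `assertion.repository == "1edata/ume-data-infra"`. After an org move, only the argument changes.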

# Service accounts

| SA | Used by | Permissions |
|---|---|---|
| tf-plan-sa | Plan workflow (PR) | roles/viewer on project, roles/storage.objectViewer on state bucket |
| tf-apply-sa | Apply workflow (merge), drift workflow | roles/editor on project, roles/storage.objectAdmin on state bucket |

# Workflows

# terraform-plan.yml (on PR)

Trigger: pull request opened or updated against main.

Steps:

  1. Detect changed stacks: scripts/detect-changed-stacks.sh compares the PR diff against main and outputs a list of affected stack paths (e.g., environments/dev-01-base, layers/00-bootstrap).
  2. Matrix plan: for each changed stack, a parallel job runs:
    • terraform init -backend-config=backend.hcl
    • terraform validate
    • terraform plan -out=plan.tfplan
  3. Post plan as PR comment: the plan output is posted as a collapsible comment on the PR, one section per stack. If a stack's plan has no changes, the comment says so.
  4. Label check: a lint step verifies that all resources in the plan carry the 5 mandatory labels. Fails the check if any are missing.
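The per-stack matrix job in step 2 could look roughly like the sketch below. It is not the actual workflow code: the function name `plan_stack` and the `TF_BIN` override (useful for dry runs) are assumptions, as is the use of `-input=false` to keep CI non-interactive.

```shell
# Sketch of one matrix job: init, validate, and plan a single stack.
# TF_BIN is a hypothetical override so the sequence can be dry-run.
plan_stack() {
  local stack="$1" tf="${TF_BIN:-terraform}"
  (
    cd "$stack" || exit 1
    "$tf" init -input=false -backend-config=backend.hcl &&
      "$tf" validate &&
      "$tf" plan -input=false -out=plan.tfplan
  )
}
```

Running `TF_BIN=echo plan_stack environments/dev-01-base` prints the three Terraform invocations without executing them, provided that directory exists.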

# terraform-apply.yml (on merge to main)

Trigger: push to main (i.e., PR merged).

Steps:

  1. Detect changed stacks: same script as plan.
  2. Apply order: stacks are applied in dependency order:
    • layers/00-bootstrap → layers/10-platform-shared → {env}-01-base → {env}-02-k8s-base → {env}-03-runtime
    • If only a single stack changed, only that stack is applied.
  3. Dev stacks: auto-apply on merge. No manual gate.
  4. Prod stacks: gated by a GitHub Environment protection rule (production environment with required reviewers). The workflow pauses and waits for approval before applying.
  5. Post-apply: run terraform output -json and store as a workflow artifact for audit.
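The dependency ordering in step 2 can be approximated by a small sort over the detected stack paths. This is a sketch under assumptions: `order_stacks` is a hypothetical name, layers are assumed to precede every environment stack, and the zero-padded numeric prefixes are assumed to sort correctly as text. Ordering across different environments is simply alphabetical here, which the source does not specify.

```shell
# Reads stack paths on stdin and prints them in apply order:
# layers/* first, then environments/*, each group sorted by name.
order_stacks() {
  sed -e 's|^layers/|0 &|' -e 's|^environments/|1 &|' \
    | LC_ALL=C sort \
    | cut -d' ' -f2-
}
```

For example, piping `environments/dev-03-runtime`, `layers/00-bootstrap`, and `environments/dev-01-base` through `order_stacks` yields the bootstrap layer first, then dev-01-base, then dev-03-runtime.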

# terraform-drift.yml (scheduled)

Trigger: cron schedule, daily at 06:00 UTC (configurable).

Steps:

  1. For every stack (all layers + all environments): terraform plan -detailed-exitcode.
  2. If exit code = 2 (drift detected): open a GitHub issue tagged drift with the plan diff, or post to a Slack channel.
  3. If exit code = 0 (no drift): no action.

Drift detection catches out-of-band changes (console clicks, gcloud commands, other automation) that diverge from the Terraform-managed state.
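The exit-code handling above can be sketched as follows. `terraform plan -detailed-exitcode` returns 0 for no changes, 1 for an error, and 2 for a non-empty diff. The function names, the `TF_BIN` override, and the choice to treat exit code 1 as a failed check are assumptions, since the steps above only describe codes 0 and 2. The stack is assumed to be initialized already.

```shell
# Maps a `terraform plan -detailed-exitcode` status to a verdict.
classify_plan_exit() {
  case "$1" in
    0) echo "no-drift" ;;
    2) echo "drift" ;;
    *) echo "error"; return 1 ;;   # assumption: plan errors fail the job
  esac
}

# Plans one stack and prints its verdict; plan output goes to an optional log.
drift_check_stack() {
  local stack="$1" log="${2:-/dev/null}" tf="${TF_BIN:-terraform}" rc=0
  ( cd "$stack" && "$tf" plan -detailed-exitcode -input=false -no-color ) \
    >"$log" 2>&1 || rc=$?
  classify_plan_exit "$rc"
}
```

A scheduled job could loop over all stacks, call `drift_check_stack "$stack" plan.txt`, and attach `plan.txt` to the issue or Slack message whenever the verdict is `drift`.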

# Changed-Stack Detection

The script scripts/detect-changed-stacks.sh determines which stacks need plan/apply:

    # Pseudocode (bash)
    changed_files=$(git diff --name-only origin/main...HEAD)

    {
      # Stacks with files changed directly under them
      for stack_dir in layers/*/ environments/*/; do
        if grep -q "^${stack_dir}" <<<"$changed_files"; then
          echo "${stack_dir%/}"
        fi
      done

      # Also detect module changes and map them to dependent stacks
      for module_dir in modules/*/; do
        if grep -q "^${module_dir}" <<<"$changed_files"; then
          module_name=$(basename "$module_dir")
          # Find all stacks that reference this module
          grep -rl --include='*.tf' "modules/${module_name}" environments/ layers/ \
            | xargs -r -n1 dirname
        fi
      done
    } | sort -u   # de-duplicate stacks found by both loops

This ensures that a change to modules/gke-standard/ triggers plans for all stacks that use that module.
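To feed the detected stacks into a GitHub Actions matrix, the newline-separated list has to become a JSON array. The sketch below is one way to do that without extra tooling. `to_json_array` is a hypothetical helper, and it assumes stack paths contain no quotes or backslashes.

```shell
# Converts newline-separated paths on stdin into a compact JSON array.
# Assumption: paths contain no characters that need JSON escaping.
to_json_array() {
  awk 'BEGIN { printf "[" }
       NF    { printf "%s\"%s\"", (n++ ? "," : ""), $0 }
       END   { print "]" }'
}

# Possible use inside a workflow step (GITHUB_OUTPUT is set by the runner):
#   echo "stacks=$(scripts/detect-changed-stacks.sh | to_json_array)" >> "$GITHUB_OUTPUT"
```

A downstream job can then read the output with `fromJSON(...)` in its `strategy.matrix` definition.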

# Branching Model

Trunk-based development. No long-lived environment branches.

feature branch → PR → review → merge to main → auto-apply (dev) / gated apply (prod)
  • All work happens on short-lived feature branches.
  • PRs target main.
  • main is always deployable.
  • Prod is not gated by a branch; it is gated by a GitHub Environment approval.

# Prod Gating

# Wave-1: GitHub Environment protection

A GitHub Environment named production with required reviewers. When the apply workflow reaches a prod stack, it pauses and notifies reviewers. After approval, apply proceeds.
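One way the apply workflow could decide which jobs hit the gate is to map each stack to a GitHub Environment name. The sketch below is illustrative only: `github_environment_for` is a hypothetical helper, and it assumes that only stacks under environments/prod-* count as prod and that dev stacks run without an environment.

```shell
# Prints the GitHub Environment a stack's apply job should run in.
# Assumption: only environments/prod-* stacks are gated.
github_environment_for() {
  case "$1" in
    environments/prod-*) echo "production" ;;
    *) echo "" ;;   # no environment, so no approval gate
  esac
}
```

The result could be emitted alongside each stack in the job matrix and referenced from the job's `environment:` key.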

# Target state: semantic-release (documented, not built in wave-1)

The target workflow uses conventional commits and semantic-release:

  1. PRs merged to main auto-apply to dev (as today).
  2. When ready for prod, a release is cut: semantic-release creates a git tag (e.g., v1.2.0) based on commit messages.
  3. The tag creation event triggers the prod apply workflow.
  4. Prod apply uses the exact commit that was tagged - no separate branch, no cherry-picking.

This model gives a clear audit trail (tag = prod release) and decouples dev velocity from prod stability. Implementation deferred to after wave-1 is stable.
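Since this flow is not built yet, the following is only a sketch of the check a tag-triggered prod workflow might run before applying. `is_release_tag` is a hypothetical name, and the `vMAJOR.MINOR.PATCH` pattern is inferred from the `v1.2.0` example above.

```shell
# Succeeds only for tags shaped like v1.2.0.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Resolving the exact commit behind a tag (requires a git checkout):
#   commit=$(git rev-list -n 1 "v1.2.0")
```

Applying from the resolved commit, rather than from the tip of main, is what keeps the tag and the prod release identical.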

# CI for the DAGs Repository

The ume-data-dags repo owns the Airflow image + content pipeline. It federates via the same WIF provider as ume-data-infra, using a dedicated ume-datainfra-content-push SA with bucket-scoped roles/storage.objectAdmin and repo-scoped roles/artifactregistry.writer.

# Workflows (in ume-data-dags)

  • image.yml — on push to main touching docker/**: builds and pushes <airflow-version>-<commit-sha> to Artifact Registry (ume-composer-images, immutable tags).

  • dag-sync.yml — on push to main touching dags/** or dbt/**: gcloud storage rsyncs to gs://ume-airflow-dags-<project>/{dags,dbt}/. GCS FUSE on the Airflow pods reflects changes live.

  • bot-pr.yml — on workflow_run after image.yml succeeds on main. Uses INFRA_PR_TOKEN (fine-grained PAT scoped to ume-data-infra only) to open a PR against ume-data-infra bumping airflow_image_tag in environments/dev-03-runtime/terraform.tfvars.

  • pr-ci.yml — on PR: hadolint + python -m py_compile + dbt parse. No GCP auth needed.
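The core commands behind image.yml and dag-sync.yml can be sketched as below. This is not the workflows' actual code: the function names and the `GCLOUD_BIN` override are assumptions, and `--recursive` is the only rsync flag shown because the source does not say whether deletions are propagated.

```shell
# Builds the immutable image tag <airflow-version>-<commit-sha>.
image_tag() {
  printf '%s-%s\n' "$1" "$2"
}

# Syncs dags/ and dbt/ to the content bucket for the given project.
# GCLOUD_BIN is a hypothetical override for dry runs.
sync_content() {
  local project="$1" gcloud_bin="${GCLOUD_BIN:-gcloud}" dir
  for dir in dags dbt; do
    "$gcloud_bin" storage rsync --recursive "$dir" \
      "gs://ume-airflow-dags-${project}/${dir}" || return 1
  done
}
```

Running `GCLOUD_BIN=echo sync_content <project>` prints the two rsync invocations without touching the bucket.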

# Rollout flow (end-to-end)

ume-data-dags merge to main
    ↓
image.yml pushes 3.2.0-<sha> to AR
    ↓
bot-pr.yml opens PR on ume-data-infra bumping airflow_image_tag
    ↓
human reviews + merges the bot-PR
    ↓
ume-data-infra terraform-apply.yml:
    ├── wait-for-image gate confirms the tag exists in AR
    └── terraform apply → Helm rolls Airflow pods onto the new image
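The wait-for-image gate could be implemented as a bounded polling loop like the sketch below. The function name, retry defaults, and `GCLOUD_BIN` override are assumptions. The full image reference is taken as a parameter because the registry location and image name are not given in this section.

```shell
# Polls Artifact Registry until the image reference resolves or attempts run out.
# Usage: wait_for_image <full-image-ref> [attempts] [delay-seconds]
wait_for_image() {
  local image_ref="$1" attempts="${2:-10}" delay="${3:-30}"
  local gcloud_bin="${GCLOUD_BIN:-gcloud}" i=1
  while [ "$i" -le "$attempts" ]; do
    if "$gcloud_bin" artifacts docker images describe "$image_ref" >/dev/null 2>&1; then
      echo "found $image_ref after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  echo "image $image_ref not found after $attempts attempts" >&2
  return 1
}
```

Failing the job when the image never appears keeps terraform apply from rolling pods onto a tag that does not exist.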

# Prod image promotion (when prod lands)

  • Image: no rebuild. Same immutable image tag validated in dev is referenced in prod-03-runtime/terraform.tfvars (stack path subject to whatever prod layout lands).

  • Bot-PR: extend bot-pr.yml to either open two PRs (dev + prod) or target a single per-env tfvars. Prod gets the GH-Environment approval gate described earlier.
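Promotion without a rebuild amounts to copying the tag value from one tfvars file to another. The sketch below shows that idea. The function names are hypothetical, and it assumes `airflow_image_tag` is set on a single line as a double-quoted string.

```shell
# Prints the current airflow_image_tag value from a tfvars file.
read_image_tag() {
  sed -n 's|^airflow_image_tag[[:space:]]*=[[:space:]]*"\(.*\)".*|\1|p' "$1"
}

# Rewrites airflow_image_tag in place, leaving other lines untouched.
set_image_tag() {
  local file="$1" tag="$2" tmp
  tmp=$(mktemp) || return 1
  sed "s|^\(airflow_image_tag[[:space:]]*=[[:space:]]*\)\".*\"|\1\"${tag}\"|" \
    "$file" >"$tmp" && cat "$tmp" >"$file"
  rm -f "$tmp"
}

# Copies the tag validated in one environment's tfvars to another's.
promote_image_tag() {
  local tag
  tag=$(read_image_tag "$1")
  [ -n "$tag" ] || { echo "no airflow_image_tag in $1" >&2; return 1; }
  set_image_tag "$2" "$tag"
}
```

A bot-PR step could call `promote_image_tag` with the dev and prod tfvars paths and commit the resulting one-line diff.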

# Conventions

  • Never skip hooks: --no-verify is prohibited. If a pre-commit hook fails, fix the issue.
  • Never force-push to main: protected branch rules enforce this.
  • Commit messages: follow conventional commits (feat:, fix:, chore:, docs:) to prepare for semantic-release.
  • PR size: prefer small, single-stack PRs. A PR touching both dev-01-base and dev-03-runtime should be split unless the changes are tightly coupled.
  • Manual applies: only for 00-bootstrap (one-time) and emergency break-glass (documented in Operations).
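The commit-message convention lends itself to an automated check, for example in a commit-msg hook. The sketch below is an assumption about how such a check might look, not an existing hook in the repo. It accepts only the four types listed above, with an optional scope and breaking-change marker.

```shell
# Succeeds when the first line of a commit message follows the convention.
# Assumption: only feat, fix, chore, and docs are accepted types.
is_conventional_commit() {
  printf '%s\n' "$1" | head -n 1 \
    | grep -Eq '^(feat|fix|chore|docs)(\([a-z0-9-]+\))?!?: .+'
}
```

For instance, `is_conventional_commit "fix(gke): pin node pool version"` succeeds, while `is_conventional_commit "update stuff"` fails.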