#
CI/CD
All infrastructure changes flow through GitHub Actions. No manual terraform apply from laptops (except for the one-time bootstrap). This section documents the workflows, authentication, and operational conventions.
#
Authentication
#
Workload Identity Federation
GitHub Actions authenticates to GCP using Workload Identity Federation (WIF). No service-account key files.
GitHub Actions OIDC token
→ GCP WIF pool (trusts github.com)
→ WIF provider (attribute condition: repo = 1edata/ume-data-infra)
→ Short-lived access token for tf-plan-sa or tf-apply-sa
→ Terraform uses this token for all GCP API calls
The WIF pool and provider are created in layers/00-bootstrap/. The attribute condition restricts access to the specific repository. When the repo moves to a different GitHub org, the attribute condition must be updated (see Operations - WIF repo rename).
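The attribute condition is the one value that has to change on a repo move. The following is a minimal sketch of how that expression could be generated, assuming the condition is a plain equality check on the GitHub OIDC repository claim. The function name is hypothetical, and the real expression lives in layers/00-bootstrap/ and may be stricter.

```shell
# Sketch only: prints a CEL expression of the kind a WIF provider accepts as
# its attribute condition. Assumption: the condition is a simple equality on
# the GitHub OIDC "repository" claim.
wif_attribute_condition() {
  # $1: repository in "<org>/<repo>" form
  case "$1" in
    */*/*|/*|*/|"")
      echo "expected <org>/<repo>, got: $1" >&2
      return 1
      ;;
    */*)
      printf "assertion.repository == '%s'\n" "$1"
      ;;
    *)
      echo "expected <org>/<repo>, got: $1" >&2
      return 1
      ;;
  esac
}
```

Calling it with 1edata/ume-data-infra prints the expression for the current repository. After a move, the same call with the new org name gives the value to put into the bootstrap layer.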
#
Service accounts
#
Workflows
#
terraform-plan.yml (on PR)
Trigger: pull request opened or updated against main.
Steps:
- Detect changed stacks: scripts/detect-changed-stacks.sh compares the PR diff against main and outputs a list of affected stack paths (e.g., environments/dev-01-base, layers/00-bootstrap).
- Matrix plan: for each changed stack, a parallel job runs:
  - terraform init -backend-config=backend.hcl
  - terraform validate
  - terraform plan -out=plan.tfplan
- Post plan as PR comment: the plan output is posted as a collapsible comment on the PR, one section per stack. If the plan has no changes, it says so.
- Label check: a lint step verifies that all resources in the plan carry the 5 mandatory labels. Fails the check if any are missing.
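The PR comment step can be illustrated with a small rendering helper. This is a sketch, not the workflow's actual implementation: the function name is hypothetical, and it assumes the plan step also passes -detailed-exitcode so that "no changes" can be told apart from "changes" by exit code. The plan text is expected on stdin, for example from terraform show -no-color plan.tfplan.

```shell
# Sketch: render one stack's section of the PR comment.
# $1 = stack path
# $2 = exit code of "terraform plan -detailed-exitcode" (assumed flag)
# stdin = human-readable plan text
render_plan_section() {
  rps_stack="$1"
  rps_code="$2"
  rps_plan="$(cat)"
  case "$rps_code" in
    0)
      printf '**%s**: no changes.\n' "$rps_stack"
      ;;
    2)
      # \140 is a backtick in octal, so the output contains a fenced block
      # without this snippet needing literal backtick fences.
      printf '<details>\n<summary>%s: changes detected</summary>\n\n' "$rps_stack"
      printf '\140\140\140text\n%s\n\140\140\140\n\n</details>\n' "$rps_plan"
      ;;
    *)
      printf '**%s**: plan failed (exit code %s).\n' "$rps_stack" "$rps_code"
      ;;
  esac
}
```

Concatenating the output for every stack in the matrix gives one comment body with a collapsible section per stack.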
#
terraform-apply.yml (on merge to main)
Trigger: push to main (i.e., PR merged).
Steps:
- Detect changed stacks: same script as plan.
- Apply order: stacks are applied in dependency order:
  - layers/00-bootstrap → layers/10-platform-shared → {env}-01-base → {env}-02-k8s-base → {env}-03-runtime
  - If only a single stack changed, only that stack is applied.
- Dev stacks: auto-apply on merge. No manual gate.
- Prod stacks: gated by a GitHub Environment protection rule (production environment with required reviewers). The workflow pauses and waits for approval before applying.
- Post-apply: run terraform output -json and store as a workflow artifact for audit.
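The dependency ordering can be sketched as a ranking over stack paths. This is an illustrative helper, not the workflow's code: the function name is hypothetical, unknown stacks are placed last as an assumption, and all environments are ordered together by layer number, which may be stricter than the real workflow needs.

```shell
# Sketch: sort changed stack paths (one per line on stdin) into the
# documented dependency order.
order_stacks() {
  os_tab="$(printf '\t')"
  while IFS= read -r os_stack; do
    [ -n "$os_stack" ] || continue
    case "$os_stack" in
      layers/00-bootstrap)       os_rank=0 ;;
      layers/10-platform-shared) os_rank=1 ;;
      *-01-base)                 os_rank=2 ;;
      *-02-k8s-base)             os_rank=3 ;;
      *-03-runtime)              os_rank=4 ;;
      *)                         os_rank=9 ;;  # assumption: unknown stacks go last
    esac
    printf '%s\t%s\n' "$os_rank" "$os_stack"
  done | sort -t "$os_tab" -k1,1n -k2,2 | cut -f2-
}
```

Piping the output of the changed-stack detection through this function would yield the stacks in the order they should be applied.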
#
terraform-drift.yml (scheduled)
Trigger: cron schedule, daily at 06:00 UTC (configurable).
Steps:
- For every stack (all layers + all environments): terraform plan -detailed-exitcode.
- If exit code = 2 (drift detected): open a GitHub issue tagged drift with the plan diff, or post to a Slack channel.
- If exit code = 0 (no drift): no action.
Drift detection catches out-of-band changes (console clicks, gcloud commands, other automation) that diverge from the Terraform-managed state.
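The exit-code handling can be sketched as follows. With -detailed-exitcode, terraform plan returns 0 for an empty diff, 2 for a non-empty diff, and 1 when the plan itself fails. Treating the failure case as its own outcome is a suggestion here, not part of the documented workflow, and both function names are hypothetical.

```shell
# Sketch: map a "terraform plan -detailed-exitcode" status to an outcome.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;  # 1 = the plan failed; surface it instead of ignoring it
  esac
}

# Sketch: check one stack. $1 = stack directory, $2 = file for the plan output.
# Assumes "terraform init" has already been run for the stack.
check_stack_drift() {
  csd_code=0
  terraform -chdir="$1" plan -detailed-exitcode -input=false -no-color \
    >"$2" 2>&1 || csd_code=$?
  classify_plan_exit "$csd_code"
}
```

A scheduled job could loop over all stacks, call check_stack_drift for each, and open an issue only for the ones that report drift.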
#
Changed-Stack Detection
The script scripts/detect-changed-stacks.sh determines which stacks need plan/apply:
# Simplified sketch of the detection logic
changed_files=$(git diff --name-only origin/main...HEAD)
{
  # Stacks with files changed directly under them
  for stack_dir in layers/*/ environments/*/; do
    if echo "$changed_files" | grep -q "^${stack_dir}"; then
      echo "${stack_dir%/}"
    fi
  done
  # Also detect module changes and map to dependent stacks
  for module_dir in modules/*/; do
    if echo "$changed_files" | grep -q "^${module_dir}"; then
      module_name=$(basename "$module_dir")
      # Find all stacks that reference this module
      grep -rl "modules/${module_name}" environments/ layers/ \
        | xargs -I{} dirname {}
    fi
  done
} | sort -u
This ensures that a change to modules/gke-standard/ triggers plans for all stacks that use that module.
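To feed the detected stacks into a GitHub Actions matrix, the list has to become JSON. The following is a minimal sketch of that conversion; the function name and the matrix key "stack" are hypothetical, and it assumes stack paths contain no quotes or backslashes.

```shell
# Sketch: turn newline-separated stack paths on stdin into a JSON object
# usable as a job matrix.
stacks_to_matrix() {
  stm_json=""
  while IFS= read -r stm_stack; do
    [ -n "$stm_stack" ] || continue
    # Append a quoted entry, comma-separated after the first one.
    stm_json="${stm_json:+$stm_json,}\"$stm_stack\""
  done
  printf '{"stack":[%s]}\n' "$stm_json"
}
```

In a workflow step, the result could be written to the step outputs with something like echo "matrix=$(scripts/detect-changed-stacks.sh | stacks_to_matrix)" >> "$GITHUB_OUTPUT" and read back with fromJSON in the plan job's strategy.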
#
Branching Model
Trunk-based development. No long-lived environment branches.
feature branch → PR → review → merge to main → auto-apply (dev) / gated apply (prod)
- All work happens on short-lived feature branches.
- PRs target main.
- main is always deployable.
- Prod is not gated by a branch; it is gated by a GitHub Environment approval.
#
Prod Gating
#
Wave-1: GitHub Environment protection
A GitHub Environment named production with required reviewers. When the apply workflow reaches a prod stack, it pauses and notifies reviewers. After approval, apply proceeds.
#
Target state: semantic-release (documented, not built in wave-1)
The target workflow uses conventional commits and semantic-release:
- PRs merged to main auto-apply to dev (as today).
- When ready for prod, a release is cut: semantic-release creates a git tag (e.g., v1.2.0) based on commit messages.
- The tag creation event triggers the prod apply workflow.
- Prod apply uses the exact commit that was tagged - no separate branch, no cherry-picking.
This model gives a clear audit trail (tag = prod release) and decouples dev velocity from prod stability. Implementation deferred to after wave-1 is stable.
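Since this workflow is not built yet, the following is only a sketch of one piece a tag-triggered apply would probably need: checking that the triggering ref is a release tag of the vX.Y.Z form before proceeding. The function name is hypothetical, and the tag pattern is an assumption based on the v1.2.0 example.

```shell
# Sketch: extract a release tag from a Git ref such as refs/tags/v1.2.0.
# Prints the tag and returns 0 for a vX.Y.Z tag, returns 1 otherwise.
release_tag_from_ref() {
  rtr_tag="${1#refs/tags/}"
  # If stripping the prefix changed nothing, this was not a tag ref.
  [ "$rtr_tag" != "$1" ] || return 1
  printf '%s\n' "$rtr_tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$' || return 1
  printf '%s\n' "$rtr_tag"
}
```

In GitHub Actions the ref of a tag push is available as GITHUB_REF, and git rev-parse "<tag>^{commit}" resolves the tag to the commit that prod apply would check out.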
#
CI for the DAGs Repository
The ume-data-dags repo owns
the Airflow image + content pipeline. It federates via the same WIF
provider as ume-data-infra, using a dedicated ume-datainfra-content-push
SA with bucket-scoped roles/storage.objectAdmin and repo-scoped
roles/artifactregistry.writer.
#
Workflows (in ume-data-dags)
- image.yml: on push to main touching docker/**. Builds and pushes <airflow-version>-<commit-sha> to Artifact Registry (ume-composer-images, immutable tags).
- dag-sync.yml: on push to main touching dags/** or dbt/**. Runs gcloud storage rsync to sync them to gs://ume-airflow-dags-<project>/{dags,dbt}/. GCS FUSE on the Airflow pods reflects changes live.
- bot-pr.yml: workflow_run after image.yml succeeds on main. Uses INFRA_PR_TOKEN (fine-grained PAT scoped to ume-data-infra only) to open a PR against ume-data-infra bumping airflow_image_tag in environments/dev-03-runtime/terraform.tfvars.
- pr-ci.yml: on PR. Runs hadolint + python -m py_compile + dbt parse. No GCP auth needed.
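The tag format and the tfvars bump performed by the bot PR can be sketched like this. Both function names are hypothetical, and the sketch assumes airflow_image_tag is assigned on a single line as a quoted string and that the tag contains no |, & or backslash characters.

```shell
# Sketch: compose the image tag from the Airflow version and the commit SHA.
build_image_tag() {
  printf '%s-%s\n' "$1" "$2"
}

# Sketch: rewrite the airflow_image_tag assignment in a tfvars file.
# $1 = path to terraform.tfvars, $2 = new tag
bump_airflow_image_tag() {
  bait_tmp="$(mktemp)" || return 1
  # Replace only the quoted value, keeping the key and spacing as they are.
  sed -E "s|^([[:space:]]*airflow_image_tag[[:space:]]*=[[:space:]]*)\"[^\"]*\"|\\1\"$2\"|" \
    "$1" >"$bait_tmp" && cat "$bait_tmp" >"$1"
  bait_status=$?
  rm -f "$bait_tmp"
  return "$bait_status"
}
```

The bot workflow would run the bump on a fresh branch of ume-data-infra, commit the one-line change, and open the PR with INFRA_PR_TOKEN.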
#
Rollout flow (end-to-end)
ume-data-dags merge to main
↓
image.yml pushes 3.2.0-<sha> to AR
↓
bot-pr.yml opens PR on ume-data-infra bumping airflow_image_tag
↓
human reviews + merges the bot-PR
↓
ume-data-infra terraform-apply.yml:
├── wait-for-image gate confirms the tag exists in AR
└── terraform apply → Helm rolls Airflow pods onto the new image
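The wait-for-image gate can be approximated with a generic retry loop around an existence check. This is a sketch under assumptions: the function names, attempt count, and delay are hypothetical, and the check uses gcloud artifacts docker images describe, which may not be what the real gate does.

```shell
# Sketch: run a command until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  ru_attempts="$1"
  ru_delay="$2"
  shift 2
  ru_i=1
  while [ "$ru_i" -le "$ru_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$ru_i" -lt "$ru_attempts" ]; then
      sleep "$ru_delay"
    fi
    ru_i=$((ru_i + 1))
  done
  return 1
}

# Sketch: succeeds if the image reference in $1 exists in Artifact Registry.
image_exists() {
  gcloud artifacts docker images describe "$1"
}
```

A gate step could then call retry_until 10 30 image_exists with the full image reference, failing the job before terraform apply if the tag never appears.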
#
Prod image promotion (when prod lands)
- Image: no rebuild. Same immutable image tag validated in dev is referenced in prod-03-runtime/terraform.tfvars (stack path subject to whatever prod layout lands).
- Bot-PR: extend bot-pr.yml to either open two PRs (dev + prod) or target a single per-env tfvars. Prod gets the GH-Environment approval gate described earlier.
#
Conventions
- Never skip hooks: --no-verify is prohibited. If a pre-commit hook fails, fix the issue.
- Never force-push to main: protected branch rules enforce this.
- Commit messages: follow conventional commits (feat:, fix:, chore:, docs:) to prepare for semantic-release.
- PR size: prefer small, single-stack PRs. A PR touching both dev-01-base and dev-03-runtime should be split unless the changes are tightly coupled.
- Manual applies: only for 00-bootstrap (one-time) and emergency break-glass (documented in Operations).
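The commit message convention could be enforced locally with a check like the one below. This is a sketch: the function name is hypothetical, and it only accepts the four types listed above plus an optional scope, which is narrower than the full conventional commits specification.

```shell
# Sketch: succeed if a commit subject line follows the convention, e.g.
# "feat: add gke module" or "fix(dev-01-base): correct labels".
is_conventional_subject() {
  printf '%s\n' "$1" \
    | grep -Eq '^(feat|fix|chore|docs)(\([a-z0-9._/-]+\))?!?: .+'
}
```

In a commit-msg hook, the first line of the message file would be passed to this function and the commit rejected when it returns non-zero.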