# Airflow-DAGs Agent

This document is the human-readable companion to the agent that owns Airflow DAG authoring, the custom Airflow image, dbt-Cosmos integration, and DAG delivery. It works in the ume-data-dags repo, not in ume-data-infra. The only touchpoint back here is the bot-PR that bumps environments/dev-03-runtime/terraform.tfvars.

## Role

The airflow-dags agent handles everything above the Terraform layer: Docker image, DAGs, dbt project, and the CI glue that builds + ships them. Terraform / infrastructure changes belong to the infra-terraform agent in ume-data-infra.

## Scope

### Can edit (in ume-data-dags)

  • dags/ — Airflow DAG files
  • dbt/ — dbt project (models, tests, macros, profiles, packages.yml)
  • docker/ — Dockerfile, deps
  • scripts/ — build-image.sh, utility scripts
  • .github/workflows/ in ume-data-dags — the four workflows (image, dag-sync, pr-ci, bot-pr)

### Can edit in ume-data-infra (bot-PR only)

  • environments/dev-03-runtime/terraform.tfvars — only the airflow_image_tag line, and only via the bot-pr.yml workflow. Human-authored edits to that line are OK but unusual.
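
The bump itself is a one-line change. Below is a minimal sketch of what it amounts to, assuming the tfvars file uses a plain airflow_image_tag = "..." assignment; this is an illustration only, not the actual bot-pr.yml implementation.

```python
# Hypothetical illustration of the tfvars bump; the real change is made by bot-pr.yml.
import pathlib
import re
import sys


def bump_image_tag(tfvars: pathlib.Path, new_tag: str) -> None:
    """Rewrite the single airflow_image_tag assignment and nothing else."""
    text = tfvars.read_text()
    updated, count = re.subn(
        r'^airflow_image_tag\s*=\s*".*"$',
        f'airflow_image_tag = "{new_tag}"',
        text,
        flags=re.MULTILINE,
    )
    if count != 1:
        sys.exit(f"expected exactly one airflow_image_tag line, found {count}")
    tfvars.write_text(updated)


if __name__ == "__main__":
    # Tag format per the invariants below: <airflow-version>-<commit-sha>
    bump_image_tag(
        pathlib.Path("environments/dev-03-runtime/terraform.tfvars"),
        new_tag=sys.argv[1],
    )
```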

### Must not edit

  • Terraform modules, layers, or environment stacks beyond the tfvars bump
  • Kubernetes manifests or Helm charts directly
  • Secret values

## Required Reading

Before proposing changes, the agent must read:

  1. Airflow on GKE — Helm chart design, Cosmos, GCS FUSE, logging, IAP auth
  2. CI/CD — image build + DAG sync pipeline contracts
  3. DataHub — for DataHub ingestion recipe context (recipes are DAGs)

## Invariants

  1. dbt project path on workers: /opt/airflow/dags/dbt. All Cosmos DAGs must reference this path. GCS FUSE mounts the bucket root at /opt/airflow/dags/; the bucket layout is dags/ + dbt/.

  2. Custom image extends the official Apache Airflow base — never use a Composer base image.

  3. Image tags are immutable (AR enforces via docker_config.immutable_tags). Format: <airflow-version>-<commit-sha>.

  4. Cosmos is the only dbt runner — no BashOperator or PythonOperator to invoke dbt.

  5. DAG delivery via the GCS FUSE CSI driver; ume-data-dags's CI does gcloud storage rsync on merge to main. No git-sync, no baking DAGs into images, no tokens or SSH keys. Workload Identity handles auth.

  6. No secrets in code — use Airflow connections or Secret Manager backend for credentials.

  7. dbt connects to BigQuery via OAuth (Airflow SA workload identity) — no service-account keys.

  8. Cosmos local execution mode — dbt runs as subprocesses on Celery workers; Cosmos copies the project to a per-task tmp dir, so the read-only FUSE mount is fine. Use KPO (KubernetesPodOperator) for heavy or isolated jobs.

  9. dbt_executable_path = /home/airflow/dbt-venv/bin/dbt — dbt lives in an isolated venv (Airflow 3.2's constraints file clashes with dbt-core on pathspec/protobuf). Cosmos LOCAL invokes dbt as a subprocess, so Python-level isolation is fine; see the sketch after this list. The Dockerfile fails the build if /home/airflow/dbt-venv/bin/dbt --version doesn't succeed.
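
Taken together, invariants 1, 4, 7, 8, and 9 pin down what a Cosmos DAG in this repo looks like. A minimal sketch, assuming the astronomer-cosmos DbtDag API; the dag_id, schedule, and profile/target names are illustrative rather than taken from the repo:

```python
# Minimal sketch only; dag_id, schedule, and profile/target names are illustrative.
from datetime import datetime
from pathlib import Path

from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.constants import ExecutionMode

DBT_PROJECT_DIR = Path("/opt/airflow/dags/dbt")     # invariant 1: FUSE-mounted project path
DBT_EXECUTABLE = "/home/airflow/dbt-venv/bin/dbt"   # invariant 9: dbt in its isolated venv

example_dbt_dag = DbtDag(
    dag_id="example_dbt_models",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    project_config=ProjectConfig(DBT_PROJECT_DIR),
    # Invariant 7: the project's profiles.yml uses OAuth, so no service-account keys here.
    profile_config=ProfileConfig(
        profile_name="ume_dbt",
        target_name="dev",
        profiles_yml_filepath=DBT_PROJECT_DIR / "profiles.yml",
    ),
    # Invariants 4 and 8: Cosmos is the only dbt runner, in LOCAL execution mode.
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.LOCAL,
        dbt_executable_path=DBT_EXECUTABLE,
    ),
)
```

In LOCAL mode Cosmos copies the project into a per-task temp dir before invoking dbt, which is why the read-only FUSE mount at /opt/airflow/dags/ is sufficient.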

## Cross-repo contract

  • AR repo ume-composer-images — owned by ume-data-infra bootstrap. Content-push SA (ume-datainfra-content-push@poc-ume-data.iam.gserviceaccount.com) has roles/artifactregistry.writer scoped to this repo only.

  • DAGs bucket ume-airflow-dags-poc-ume-data — same SA has bucket-level roles/storage.objectAdmin.

  • WIF provider federates both repos via a combined attribute_condition; per-SA bindings gate what each repo can impersonate.

## Verification

```bash
# Image builds (local)
scripts/build-image.sh --no-push

# DAG syntax (no metadata DB needed)
python -m py_compile dags/*.py

# dbt project parses (no BQ auth needed)
cd dbt/
DBT_TARGET=dev GCP_PROJECT=dummy DBT_DATASET=dummy \
  dbt parse --profiles-dir . --project-dir .
```

Post-merge verification uses gcloud artifacts docker images list, gsutil ls, bq show, and kubectl logs (read-only). Never kubectl exec — that's blocked by the restricted execution profile.