# Operations
This section contains runbooks for common operational tasks. Each runbook follows a consistent structure: context, prerequisites, steps, verification, and rollback.
All runbooks assume you have the gcloud CLI configured, kubectl access to the GKE cluster, and Terraform installed locally (for break-glass scenarios). For routine operations, changes should flow through CI/CD.
## DataHub Version Upgrade
### Context
DataHub releases frequently, and new versions ship database migration scripts. Upgrades must be tested on dev before promoting to prod.
### Prerequisites

- Current DataHub Helm chart version noted (from `terraform.tfvars`)
- Target version's release notes reviewed for breaking changes
- Dev environment available and healthy
### Steps

- Update dev: change `datahub_helm_chart_version` in `environments/dev-02-runtime/terraform.tfvars`.
- PR and plan: open a PR, verify `terraform plan` shows only the Helm release changing.
- Apply to dev: merge the PR. CI auto-applies. The Helm upgrade runs DataHub's migration hooks automatically.
- Verify on dev (the non-UI checks are sketched after this list):
  - DataHub UI loads and login works.
  - Run a test ingestion recipe (BigQuery or dbt).
  - Check Kafka consumer lag is draining.
  - Check OpenSearch cluster health is green.
  - Verify lineage and search return expected results.
- Promote to prod: update `environments/prod-02-runtime/terraform.tfvars` with the same version. PR → approval → apply.
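
A minimal sketch of the "verify on dev" checks above. The `kafka` namespace and broker pod name are assumptions (Strimzi names broker pods `<cluster>-kafka-<n>`); the OpenSearch pod name matches the runbooks later in this document — adjust to the actual deployment:

```bash
# Kafka consumer lag — run from inside a Strimzi broker pod; LAG should trend to 0
kubectl -n kafka exec ume-data-dev-kafka-0 -- \
  bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups

# OpenSearch cluster health — expect "status":"green"
kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
  curl -s "localhost:9200/_cluster/health?pretty"
```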
### Rollback
If the upgrade fails:
- Revert `datahub_helm_chart_version` in `tfvars` to the previous value.
- Apply. The Helm rollback will restore the previous release (a quick check follows this list).
- If the database migration is not backward-compatible (rare), restore Cloud SQL from PITR (see below).
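
To confirm which chart revision is actually live after the rollback — a sketch, assuming the Helm release is named `datahub` in the `datahub` namespace:

```bash
# The last row shows the currently deployed revision and chart version
helm history datahub -n datahub --max 5
```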
## Cloud SQL Point-in-Time Recovery (PITR)
### Context
Cloud SQL supports PITR, allowing restoration to any second within the backup retention window.
### Prerequisites
- PITR enabled on the Cloud SQL instance (configured in Terraform).
- Target recovery timestamp known.
### Steps
- Identify recovery point: determine the timestamp just before the corruption/issue.
- Create a new instance from PITR:

```bash
gcloud sql instances clone ume-data-dev-datahub-pg ume-data-dev-datahub-pg-restored \
  --point-in-time="2026-04-13T10:00:00Z"
```

- Verify the restored instance: connect and validate data integrity (see the sketch after this list).
- Swap traffic: update the Cloud SQL connection name in `terraform.tfvars` to point to the restored instance. Apply via Terraform.
- Clean up: once verified, delete the old instance (or keep it for forensics).
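
A sketch of the verification step — the instance name follows the clone example above; the `postgres` user is an assumption:

```bash
# Confirm the clone finished and is serving
gcloud sql instances describe ume-data-dev-datahub-pg-restored \
  --format='value(state)'   # expect RUNNABLE

# Spot-check the data (prompts for the user's password)
gcloud sql connect ume-data-dev-datahub-pg-restored --user=postgres
```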
### Rollback

Revert `terraform.tfvars` to the original instance name and apply.
## Kafka Restore (Strimzi)
### Context
Kafka topics are the event bus, not the durable store. If topics are corrupted or lost, metadata is recoverable by re-ingesting from source systems. Cloud SQL holds the durable metadata.
### Prerequisites
- Strimzi operator running and healthy.
- DataHub's Cloud SQL database intact.
### Steps
- Assess damage: check if the Kafka cluster is recoverable (broker pods running, ZooKeeper/KRaft healthy).
- If cluster is recoverable: Strimzi operator will self-heal broker pods. Wait for ISR to stabilize.
- If cluster is unrecoverable:
  - Delete the `Kafka` CR (Strimzi will clean up).
  - Re-apply Terraform for `{env}-02-k8s-base` to recreate the Kafka cluster.
  - DataHub's `kafka-setup` init job will recreate topics on the next Helm release cycle.
- Re-ingest metadata: trigger DataHub ingestion DAGs in Airflow to repopulate events. Cloud SQL is the source of truth; Kafka receives new change events.
### Verification

- `kafka-topics.sh --list` shows expected topics.
- DataHub consumer lag is draining.
- DataHub UI shows current metadata.
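
A sketch of the topic check from inside a Strimzi broker pod — the pod and namespace names are assumptions, as in the upgrade runbook above (which also shows the consumer-lag check):

```bash
# List topics; expect DataHub's topics (MetadataChangeProposal, MetadataAuditEvent, etc.)
kubectl -n kafka exec ume-data-dev-kafka-0 -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```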
## OpenSearch — Current State, Failure Modes, and Recovery
### Context
OpenSearch stores DataHub's entity/graph indices and time-series events. Indices are rebuildable from Cloud SQL (entity source of truth) and Kafka MAE replay (recent change events) — so in the worst case OpenSearch downtime degrades the UI but does not lose metadata, as long as Kafka retention outlasts the recovery window (currently 72h per Story 9's log.retention).
### Current shape (dev)

- Three data nodes (`ume-data-dev-opensearch-nodes-0..2`), each running `cluster_manager + data + ingest` roles on `premium-rwo` (pd-ssd), 5 GiB PVC per node. The operator's native bootstrap flow forms the cluster via its standard bootstrap pod.
- Security plugin disabled — internal ClusterIP only; DataHub is the sole client.
- Snapshot scaffolding exists (`ume-opensearch-snapshots-poc-ume-data` bucket, `ume-opensearch-snapshot` GSA, WI-annotated KSA), but the snapshot repository is NOT registered and the CronJob is NOT wired — `repository-gcs` does not support Workload Identity upstream, so the credential path is a deferred design decision. See backlog.
### Failure modes and expected RTO/RPO
DataHub's durability guarantee comes from Cloud SQL (entity source of truth) + Kafka MAE topic (recent events). OpenSearch is a rebuildable search/graph cache. Plan for downtime, not data loss — as long as Kafka retention (currently 72h) covers the recovery window. If a restore is expected to take longer than Kafka retention, the MAE topic's older events are gone and the DataHub reindex will need to replay from the source ingestion recipes themselves (Airflow DAGs).
### Runbook — Pod crash / stuck StatefulSet

If `kubectl -n opensearch get pods` shows `ume-data-dev-opensearch-nodes-0` in `CrashLoopBackOff`:
- Read the pod logs: `kubectl -n opensearch logs ume-data-dev-opensearch-nodes-0 -c opensearch --tail=80`.
- If logs show `setting [cluster.initial_master_nodes] is not allowed when [discovery.type] is set to [single-node]` or similar config conflicts, the mounted ConfigMap is stale (kubelet sync lag). Delete the pod to force a fresh mount: `kubectl -n opensearch delete pod ume-data-dev-opensearch-nodes-0`.
- If logs show `ClusterManagerNotDiscoveredException` and the pod has been up > 2 minutes, escalate to the cluster-wedge runbook below.
- If the startup probe fails with `Heap size` or `MaxDirectMemorySize` errors, the JVM heap is larger than the container request. Check `jvm_heap` vs `data_memory_request` in `modules/opensearch-cluster` — heap should be ≤ 50% of the memory request.
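
A quick way to compare the live values — a sketch assuming the JVM flags are passed via the `OPENSEARCH_JAVA_OPTS` env var, which is the common operator convention:

```bash
# Container memory request
kubectl -n opensearch get pod ume-data-dev-opensearch-nodes-0 \
  -o jsonpath='{.spec.containers[?(@.name=="opensearch")].resources.requests.memory}'; echo

# JVM heap flags (assumes OPENSEARCH_JAVA_OPTS carries -Xms/-Xmx)
kubectl -n opensearch get pod ume-data-dev-opensearch-nodes-0 \
  -o jsonpath='{.spec.containers[?(@.name=="opensearch")].env[?(@.name=="OPENSEARCH_JAVA_OPTS")].value}'; echo
```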
### Runbook — Cluster wedge (cluster_manager_not_discovered)

Symptom: pod `Running 1/1` but `/_cluster/health` returns 503 with `cluster_manager_not_discovered_exception`. Operator logs show repeated "Failed to get OpenSearch health status" but nothing progresses.
Cause: the operator's bootstrap pod was killed before registering as cluster-manager (known single-node race), and the operator now considers `status.phase = RUNNING, initialized = true` — it will not re-run bootstrap.
Recovery (requires one k8s mutation):

- Delete the cluster CR so the operator tears everything down:

```bash
kubectl -n opensearch delete opensearchcluster.opensearch.org ume-data-dev-opensearch
```

- Wait ~30 s for the operator to reclaim the StatefulSet and PVCs.
- Trigger a Terraform apply on `environments/dev-03-runtime` — any merge to `main` that touches the stack (or a one-line no-op change to a referenced module) works. The next apply detects drift and recreates the CR with the correct env overrides.
- Watch `kubectl -n opensearch get opensearchcluster.opensearch.org -w` until `HEALTH=green`.
- If the ISM policy fails to reconcile after the cluster is up (see next runbook), delete it and let the next apply recreate it.
### Runbook — ISM policy status.managedCluster drift

Symptom: `kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.status}'` shows `state: ERROR`, reason: `cannot change the cluster a resource refers to`.
Cause: the cluster CR was destroyed and recreated (e.g. by the wedge runbook above), but the ISM CR's `status.managedCluster` UID still points at the old cluster. Terraform manages `.spec`, not `.status`, so there is no drift to detect until the CR is deleted.
Recovery:

- `kubectl -n opensearch delete opensearchismpolicy.opensearch.org ume-retention`
- Trigger a Terraform apply — the next run detects the missing CR and recreates it bound to the current cluster's UID.
- Verify: `kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.status.state}'` returns `CREATED`.
### Runbook — PVC loss / full data rebuild
If the PVC is lost (zonal disk failure, accidental deletion, corruption):
- Confirm the PVC is gone or unrecoverable. If it is still bound, snapshot it first via `gcloud compute disks snapshot` for forensics.
- Delete the StatefulSet, the PVC, and the cluster CR:

```bash
kubectl -n opensearch delete sts ume-data-dev-opensearch-nodes
kubectl -n opensearch delete pvc data-ume-data-dev-opensearch-nodes-0
kubectl -n opensearch delete opensearchcluster.opensearch.org ume-data-dev-opensearch
```

- Trigger a Terraform apply. The operator provisions a fresh PVC, bootstraps a fresh cluster, and reconciles the ISM policy.
- Rebuild DataHub's indices. Two paths, in order of preference:
  - Restore from snapshot (see below) — fastest, preserves history. Requires the snapshot credential path to be in place.
  - Reindex from Cloud SQL + Kafka MAE — DataHub's GMS service exposes a restore-indices job that reads entities from the `datahub` Cloud SQL database and replays the MAE Kafka topic. Trigger via the DataHub UI (Config → Restore indices) or the MCE CLI. Runtime scales with metadata volume; budget hours for a large catalog. A kubectl-based variant is sketched after this list.
  - If Kafka retention (currently 72 h) is shorter than the recovery window, the older MAE events are gone. Finish by re-running the ingestion DAGs in Airflow so the source systems re-emit their metadata.
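
A sketch of kicking off the reindex from Kubernetes rather than the UI. This assumes the DataHub Helm chart's suspended restore-indices CronJob template exists under its default naming — verify the actual name with the first command before creating the job:

```bash
# Find the chart's restore-indices job template (name varies with the release name)
kubectl -n datahub get cronjobs

# Create a one-off job from it (the template name below is an assumption)
JOB="restore-indices-$(date +%Y%m%d-%H%M)"
kubectl -n datahub create job --from=cronjob/datahub-datahub-restore-indices-job-template "$JOB"

# Follow progress; runtime scales with metadata volume
kubectl -n datahub logs -f "job/$JOB"
```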
### Runbook — Restore from GCS snapshot

> **Warning:** As of Story 10, the snapshot repository is **NOT registered** and the CronJob **is not running**. The bucket, GSA, and WI binding exist as scaffolding. Do not attempt this runbook until the backlog's "OpenSearch snapshot credential path" item lands. It is documented here so the procedure is ready on the day the infra is.
### Prerequisites (future state)

- Snapshot CronJob has been running and producing snapshots to `gs://ume-opensearch-snapshots-poc-ume-data`.
- Snapshot repository `gcs_backup` is registered in the cluster via `_snapshot/gcs_backup` (pointing at the bucket, using whichever credential path ends up chosen).
- `kubectl` access to the target OpenSearch pod.
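
For reference, registering the repository would look roughly like this (future state; the credential mechanism is the undecided part — `repository-gcs` normally reads a service-account key from the OpenSearch keystore):

```bash
# Register a GCS snapshot repository named gcs_backup
kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
  curl -XPUT "localhost:9200/_snapshot/gcs_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "gcs", "settings": {"bucket": "ume-opensearch-snapshots-poc-ume-data"}}'
```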
### Steps

- List available snapshots:

```bash
kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
  curl -s "localhost:9200/_snapshot/gcs_backup/_all"
```

- Pick the target snapshot ID — normally the most recent successful one.
- Close any conflicting indices (only needed if they exist and you want to overwrite):

```bash
kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
  curl -XPOST "localhost:9200/<index-name>/_close"
```

- Restore:

```bash
kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
  curl -XPOST "localhost:9200/_snapshot/gcs_backup/<snapshot-id>/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "*", "ignore_unavailable": true, "include_global_state": false}'
```

  `include_global_state: false` avoids overwriting cluster-level settings (ISM policies, templates). Those are Terraform-managed.

- Verify:

```bash
kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
  curl -s "localhost:9200/_cluster/health"
```

  Expect `status: green`, `active_shards_percent_as_number: 100.0`. The DataHub UI should show metadata after a page refresh.
### Falling back if the snapshot is corrupt or missing
Go to the PVC loss runbook above — reindex from Cloud SQL + Kafka MAE is always available.
### SLA dependency — Kafka retention

OpenSearch recovery without a snapshot depends on Kafka having the relevant MAE events. Kafka's `log.retention.hours` (72 h today) is therefore the de facto RPO floor for reindex-based recovery. If the recovery window exceeds retention, any events that fell off Kafka must be replayed from their source (Airflow ingestion DAGs). Monitor Kafka disk utilisation and consumer lag — either a burst of events or a lag in the GMS consumer can shrink the retention cushion. Bumping `log.retention.hours` in `modules/strimzi-kafka` buys more cushion but also more broker PVC pressure.
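
To confirm what the brokers are actually running with — a sketch, with the pod name and namespace assumed as in the Kafka runbook above:

```bash
# Effective config for broker 0; log.retention.ms, if set, overrides log.retention.hours
kubectl -n kafka exec ume-data-dev-kafka-0 -- \
  bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 0 --describe --all | grep 'log.retention'
```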
### Known issues
- Snapshot credential path deferred — backlog item to choose between key-via-CSI, external-dump CronJob, or upstream WIF support.
## Secret Rotation
### Context
Secrets stored in Secret Manager should be rotated periodically. Wave-1 does not automate rotation; this is a manual runbook.
### Secrets to rotate
### Steps (example: DataHub OAuth client secret)
- Generate new secret in GCP Console (OAuth consent screen → client → reset secret).
- Add new version to Secret Manager:

```bash
echo -n "NEW_SECRET_VALUE" | gcloud secrets versions add datahub-oidc-client-secret --data-file=-
```

- Restart DataHub Frontend pods to pick up the new secret version:

```bash
kubectl rollout restart deployment datahub-frontend -n datahub
```

- Verify: log out and log in to DataHub to confirm OAuth works with the new secret.
- Disable the old version in Secret Manager (do not delete immediately; keep it for a 24 h rollback window). See the sketch below.
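
The disable step, sketched with `gcloud` — the secret name follows the example above:

```bash
# Find the previous version number
gcloud secrets versions list datahub-oidc-client-secret

# Disable it (reversible; 'enable' rolls back within the 24 h window)
gcloud secrets versions disable <previous-version> --secret=datahub-oidc-client-secret
```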
## WIF Repository Rename / Org Migration
### Context

The ume-data-infra repo currently lives at `github.com/1edata/ume-data-infra`. When it moves to a different GitHub org, the WIF provider's attribute condition must be updated or CI will lose GCP access.
### Steps

- Before the move: document the new org/repo name.
- Update Terraform: in `layers/00-bootstrap/`, update the WIF provider attribute condition:

```hcl
attribute_condition = "assertion.repository == 'NEW_ORG/ume-data-infra'"
```

- Apply from a local machine (since CI is about to break):

```bash
cd layers/00-bootstrap
terraform apply
```

- Move the repo on GitHub.
- Verify: trigger a GitHub Actions workflow manually to confirm WIF authentication succeeds (see the sketch below).

> There is a brief window between the apply and the repo move where the old repo name no longer matches the attribute condition. Plan the repo move immediately after the Terraform apply. Do not leave this window open overnight.
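
To confirm the new condition landed before moving the repo — a sketch; the pool and provider names here are assumptions, so read the real ones from the Terraform outputs or the console first:

```bash
gcloud iam workload-identity-pools providers describe github-provider \
  --workload-identity-pool=github-pool --location=global \
  --format='value(attributeCondition)'
```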
## GKE Node Pool Upgrade
### Context
GKE auto-upgrades nodes within the maintenance window, but sometimes a manual upgrade is needed (e.g., CVE patch).
### Steps

- Check current versions:

```bash
gcloud container node-pools list --cluster=ume-data-dev-gke --region=us-east1
```

- Trigger the upgrade (if not auto-upgrading). Note that node-pool version upgrades go through `gcloud container clusters upgrade`, not `node-pools update`:

```bash
gcloud container clusters upgrade ume-data-dev-gke \
  --node-pool=workload \
  --region=us-east1 \
  --cluster-version=<target-version>
```

- Monitor: watch node replacement via `kubectl get nodes -w`. Surge upgrade adds a new node before draining the old one. PDBs protect running workloads.
- Verify: all nodes at the target version; no pods in `CrashLoopBackOff`; Kafka ISR healthy; OpenSearch green.
### Emergency: node stuck draining
If a node is stuck draining (pod cannot be evicted due to PDB conflict):
- Identify the stuck pod: `kubectl get pods --field-selector spec.nodeName=<node-name>`.
- Check if a PDB is blocking: `kubectl get pdb -A`.
- If safe, temporarily adjust the PDB's `minAvailable` down by 1, allow the drain, then restore it (sketched after this list).
- Document why and what was adjusted.
## Break-Glass Manual Terraform Apply
### Context
When CI/CD is broken (e.g., WIF misconfigured, GitHub Actions outage), apply from a local machine.
### Prerequisites

- `gcloud` authenticated as a user with Owner on the target project.
- Terraform installed locally.
- Repo cloned and on the correct branch.
### Steps

- Navigate to the target stack directory.
- `terraform init -backend-config=backend.hcl`
- `terraform plan` — review carefully.
- `terraform apply` — only if the plan matches intent.
- Immediately document what was applied and why in a GitHub issue or commit message.
- Fix CI as the first priority after the emergency.

> Manual applies bypass code review and CI checks. Use only for genuine emergencies. Every manual apply must be documented.