# Operations

This section contains runbooks for common operational tasks. Each runbook follows a consistent structure: context, prerequisites, steps, verification, and rollback.

## DataHub Version Upgrade

### Context

DataHub releases frequently, and releases often include database migration scripts. Upgrades must be tested on dev before being promoted to prod.

### Prerequisites

  - Current DataHub Helm chart version noted (from terraform.tfvars)
  - Target version's release notes reviewed for breaking changes
  - Dev environment available and healthy

### Steps

  1. Update dev: change datahub_helm_chart_version in environments/dev-02-runtime/terraform.tfvars.
  2. PR and plan: open a PR and verify that terraform plan shows only the Helm release changing.
  3. Apply to dev: merge the PR. CI auto-applies. The Helm upgrade runs DataHub's migration hooks automatically.
  4. Verify on dev (example commands follow this list):
     - DataHub UI loads and login works.
     - Run a test ingestion recipe (BigQuery or dbt).
     - Check that Kafka consumer lag is draining.
     - Check that OpenSearch cluster health is green.
     - Verify that lineage and search return expected results.
  5. Promote to prod: update environments/prod-02-runtime/terraform.tfvars with the same version. PR → approval → apply.
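A minimal verification sketch for step 4. The namespaces (kafka, opensearch), the broker pod name, and the consumer group (DataHub's usual MAE default) are assumptions; adjust to the actual deployment:

    # DataHub's MAE consumer group lag; confirm the group name with --list first.
    kubectl -n kafka exec ume-data-dev-kafka-kafka-0 -- \
      /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --describe --group generic-mae-consumer-job-client

    # OpenSearch cluster health (security plugin is disabled, so plain HTTP works).
    kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -- \
      curl -s 'http://localhost:9200/_cluster/health?pretty'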

### Rollback

If the upgrade fails:

  1. Revert the datahub_helm_chart_version in tfvars to the previous value.
  2. Apply. The Helm rollback will restore the previous release (verification sketch below).
  3. If the database migration is not backward-compatible (rare), restore Cloud SQL from PITR (see below).
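To confirm the rollback landed, assuming the Helm release and namespace are both named datahub (an assumption; check with helm list -A):

    # Revision history should show the upgrade followed by the rollback.
    helm -n datahub history datahub

    # Confirm which chart version is live now.
    helm -n datahub list --filter '^datahub$'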

## Cloud SQL Point-in-Time Recovery (PITR)

### Context

Cloud SQL supports PITR, allowing restoration to any second within the backup retention window.

### Prerequisites

  - PITR enabled on the Cloud SQL instance (configured in Terraform).
  - Target recovery timestamp known.

### Steps

  1. Identify recovery point: determine the timestamp just before the corruption/issue.
  2. Create a new instance from PITR:

    gcloud sql instances clone ume-data-dev-datahub-pg ume-data-dev-datahub-pg-restored \
      --point-in-time="2026-04-13T10:00:00Z"
  3. Verify the restored instance: connect and validate data integrity (see the sketch after this list).
  4. Swap traffic: update the Cloud SQL connection name in terraform.tfvars to point to the restored instance. Apply via Terraform.
  5. Clean up: once verified, delete the old instance (or keep for forensics).
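A quick state check on the clone for step 3, before the cutover in step 4, for example:

    # Clone should be RUNNABLE before cutover; note its connectionName for tfvars.
    gcloud sql instances describe ume-data-dev-datahub-pg-restored \
      --format='value(state,connectionName)'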

### Rollback

Revert terraform.tfvars to the original instance name and apply.


## Kafka Restore (Strimzi)

### Context

Kafka topics are the event bus, not the durable store. If topics are corrupted or lost, metadata is recoverable by re-ingesting from source systems. Cloud SQL holds the durable metadata.

### Prerequisites

  - Strimzi operator running and healthy.
  - DataHub's Cloud SQL database intact.

### Steps

  1. Assess damage: check whether the Kafka cluster is recoverable (broker pods running, ZooKeeper/KRaft healthy).
  2. If the cluster is recoverable: the Strimzi operator will self-heal the broker pods. Wait for ISR to stabilize.
  3. If the cluster is unrecoverable:
     1. Delete the Kafka custom resource (Strimzi will clean up).
     2. Re-apply Terraform for {env}-02-k8s-base to recreate the Kafka cluster.
     3. DataHub's kafka-setup init job will recreate topics on the next Helm release cycle.
  4. Re-ingest metadata: trigger DataHub ingestion DAGs in Airflow to repopulate events. Cloud SQL is the source of truth; Kafka receives new change events.

### Verification

  - kafka-topics.sh --list shows the expected topics (example commands follow this list).
  - DataHub consumer lag is draining.
  - DataHub UI shows current metadata.
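For example, against a Strimzi broker pod (the kafka namespace and the ume-data-dev-kafka-kafka-0 pod name are assumptions based on Strimzi's <cluster>-kafka-N naming):

    # Topic listing; DataHub's metadata topics should be present.
    kubectl -n kafka exec ume-data-dev-kafka-kafka-0 -- \
      /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list

    # Empty output here means ISR has stabilized.
    kubectl -n kafka exec ume-data-dev-kafka-kafka-0 -- \
      /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
      --describe --under-replicated-partitions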

## OpenSearch — Current State, Failure Modes, and Recovery

### Context

OpenSearch stores DataHub's entity/graph indices and time-series events. Indices are rebuildable from Cloud SQL (entity source of truth) and Kafka MAE replay (recent change events), so in the worst case OpenSearch downtime degrades the UI but does not lose metadata, as long as Kafka retention outlasts the recovery window (currently 72 h per Story 9's log.retention).

### Current shape (dev)

  - Three data nodes (ume-data-dev-opensearch-nodes-0..2), each running cluster_manager + data + ingest roles on premium-rwo (pd-ssd) with a 5 GiB PVC per node. The operator's native bootstrap flow forms the cluster via its standard bootstrap pod. (Inspection commands follow this list.)
  - Security plugin disabled — internal ClusterIP only; DataHub is the sole client.
  - Snapshot scaffolding exists (ume-opensearch-snapshots-poc-ume-data bucket, ume-opensearch-snapshot GSA, WI-annotated KSA), but the snapshot repository is NOT registered and the CronJob is NOT wired: repository-gcs does not support Workload Identity upstream, so the credential path is a deferred design decision. See backlog.
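A one-line inspection of the pieces above:

    # Cluster CR (phase/health), pods, and PVCs in one view.
    kubectl -n opensearch get opensearchcluster.opensearch.org,pods,pvc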

### Failure modes and expected RTO/RPO

| Failure | DataHub impact | RTO | RPO | Mitigation / recovery |
|---|---|---|---|---|
| Pod crash (OOM, SIGKILL) | UI stale | 30 s–2 min | 0 | StatefulSet heals automatically |
| Node upgrade / eviction | UI stale | 1–3 min | 0 | PVC re-attaches on the new node |
| Node-local disk corruption | Full outage | hours | minutes (Kafka MAE lag) | Delete PVC → wait for new pod → DataHub reindex from Kafka + SQL |
| Zonal PD loss (rare) | Full outage | hours | minutes | Same as above; prod should use regional PD |
| cluster_manager_not_discovered wedge | Full outage | minutes (manual CR delete) | 0 | See runbook below |
| Stuck ISM / managedCluster drift | ISM policy not applied | minutes | 0 | See runbook below |
| Full cluster loss | Full outage | hours–day | minutes (Kafka MAE lag) | Reindex from Kafka + SQL; snapshot restore when available |

DataHub's durability guarantee comes from Cloud SQL (entity source of truth) + Kafka MAE topic (recent events). OpenSearch is a rebuildable search/graph cache. Plan for downtime, not data loss — as long as Kafka retention (currently 72 h) covers the recovery window. If a restore is expected to take longer than Kafka retention, the MAE topic's older events are gone and the DataHub reindex will need to replay from the source ingestion recipes themselves (Airflow DAGs).

### Runbook — Pod crash / stuck StatefulSet

If kubectl -n opensearch get pods shows ume-data-dev-opensearch-nodes-0 in CrashLoopBackOff:

  1. Read the pod logs: kubectl -n opensearch logs ume-data-dev-opensearch-nodes-0 -c opensearch --tail=80.
  2. If the logs show "setting [cluster.initial_master_nodes] is not allowed when [discovery.type] is set to [single-node]" or a similar config conflict, the mounted ConfigMap is stale (kubelet sync lag). Delete the pod to force a fresh mount: kubectl -n opensearch delete pod ume-data-dev-opensearch-nodes-0.
  3. If the logs show ClusterManagerNotDiscoveredException and the pod has been up > 2 minutes, escalate to the cluster-wedge runbook below.
  4. If the startup probe fails with Heap size or MaxDirectMemorySize errors, the JVM heap is larger than the container request. Check modules/opensearch-cluster jvm_heap vs data_memory_request — heap should be ≤ 50% of the memory request (see the sketch after this list).
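A sketch for checking item 4 against the running pod (container name opensearch per the log command above; the heap appearing in a JAVA_OPTS-style env var on stock OpenSearch images is an assumption about this deployment):

    # Container memory request for the opensearch container.
    kubectl -n opensearch get pod ume-data-dev-opensearch-nodes-0 \
      -o jsonpath='{.spec.containers[?(@.name=="opensearch")].resources.requests.memory}'

    # JVM heap actually configured (look for -Xms/-Xmx).
    kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
      env | grep -i java_opts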

### Runbook — Cluster wedge (cluster_manager_not_discovered)

Symptom: the pod is Running 1/1, but /_cluster/health returns 503 with cluster_manager_not_discovered_exception. Operator logs show repeated "Failed to get OpenSearch health status" but nothing progresses.

Cause: the operator's bootstrap pod was killed before registering as cluster manager (a known single-node race), and the operator now considers status.phase = RUNNING, initialized = true — it will not re-run bootstrap.

Recovery (requires one k8s mutation):

  1. Delete the cluster CR so the operator tears everything down:

    kubectl -n opensearch delete opensearchcluster.opensearch.org ume-data-dev-opensearch
  2. Wait ~30 s for the operator to reclaim the StatefulSet and PVCs.
  3. Trigger a Terraform apply on environments/dev-03-runtime — any merge to main that touches the stack (or a one-line no-op change to a referenced module) works. The next apply detects drift and recreates the CR with the correct env overrides.
  4. Watch kubectl -n opensearch get opensearchcluster.opensearch.org -w until HEALTH=green (or query cluster health directly; see the sketch after this list).
  5. If the ISM policy fails to reconcile after the cluster is up (see next runbook), delete and let the next apply recreate it.
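To query cluster health directly instead of watching the CR (assumes curl is available in the OpenSearch image, as in the earlier sketch):

    # 503 with cluster_manager_not_discovered_exception means still wedged;
    # a JSON body with "status" : "green" means recovered.
    kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -- \
      curl -s 'http://localhost:9200/_cluster/health?pretty'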

### Runbook — ISM policy status.managedCluster drift

Symptom: kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.status}' shows state: ERROR, reason: cannot change the cluster a resource refers to.

Cause: the cluster CR was destroyed and recreated (e.g. by the wedge runbook above), but the ISM CR's status.managedCluster UID still points at the old cluster. Terraform manages .spec, not .status, so there is no drift to detect until the CR is deleted.

Recovery:

  1. kubectl -n opensearch delete opensearchismpolicy.opensearch.org ume-retention
  2. Trigger a Terraform apply — the next run detects the missing CR and recreates it bound to the current cluster's UID.
  3. Verify: kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.status.state}' returns CREATED.

### Runbook — PVC loss / full data rebuild

If the PVC is lost (zonal disk failure, accidental deletion, corruption):

  1. Confirm the PVC is gone or unrecoverable. If it is still bound, snapshot it first via gcloud compute disks snapshot for forensics.
  2. Delete the StatefulSet and the PVC:

    kubectl -n opensearch delete sts ume-data-dev-opensearch-nodes
    kubectl -n opensearch delete pvc data-ume-data-dev-opensearch-nodes-0
    kubectl -n opensearch delete opensearchcluster.opensearch.org ume-data-dev-opensearch
  3. Trigger a Terraform apply. The operator provisions a fresh PVC, bootstraps a fresh cluster, and reconciles the ISM policy.
  4. Rebuild DataHub's indices. Two paths, in order of preference:
     - Restore from snapshot (see below) — fastest, preserves history. Requires the snapshot credential path to be in place.
     - Reindex from Cloud SQL + Kafka MAE — DataHub's GMS service exposes a restore-indices job that reads entities from the datahub Cloud SQL database and replays the MAE Kafka topic. Trigger via the DataHub UI (Config → Restore indices) or the MCE CLI (a hedged trigger sketch follows this list). Runtime scales with metadata volume; budget hours for a large catalog.
  5. If Kafka retention (currently 72 h) is shorter than the recovery window, the older MAE events are gone. Finish by re-running the ingestion DAGs in Airflow so the source systems re-emit their metadata.
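A hedged sketch of the reindex trigger in step 4. Recent DataHub Helm charts ship a suspended restore-indices CronJob template; the exact name depends on the release name, so the one below is an assumption:

    # Find the restore-indices CronJob installed by the DataHub chart.
    kubectl -n datahub get cronjobs

    # Run it once as a one-off Job (the CronJob name here is an assumption;
    # substitute the real one from the listing above).
    kubectl -n datahub create job restore-indices-manual \
      --from=cronjob/datahub-datahub-restore-indices-job-template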

### Runbook — Restore from GCS snapshot

Deferred. The snapshot repository is not registered and the backup CronJob is not wired (see Current shape above); this runbook will be filled in once the repository-gcs credential path is decided.

## GKE Node Pool Upgrade

### Context

GKE auto-upgrades nodes within the maintenance window, but sometimes a manual upgrade is needed (e.g., a CVE patch).

### Steps

  1. Check current versions:

    gcloud container node-pools list --cluster=ume-data-dev-gke --region=us-east1
  2. Trigger upgrade (if not auto-upgrading):

    gcloud container node-pools update workload \
      --cluster=ume-data-dev-gke \
      --region=us-east1 \
      --node-version=<target-version>
  3. Monitor: watch node replacement via kubectl get nodes -w. Surge upgrade adds a new node before draining the old one. PDBs protect running workloads.
  4. Verify: all nodes at the target version (example below); no pods in CrashLoopBackOff; Kafka ISR healthy; OpenSearch green.
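A quick check for step 4:

    # All nodes should report the target kubelet version.
    kubectl get nodes \
      -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion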

### Emergency: node stuck draining

If a node is stuck draining (a pod cannot be evicted because of a PDB conflict):

  1. Identify the stuck pod: kubectl get pods --field-selector spec.nodeName=<node-name>.
  2. Check if the PDB is blocking: kubectl get pdb -A.
  3. If safe, temporarily adjust the PDB minAvailable down by 1, allow the drain, then restore it (example below).
  4. Document why and what was adjusted.
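A hedged example for step 3 (the PDB name and namespace are placeholders):

    # Record the current value so it can be restored afterwards.
    kubectl -n <namespace> get pdb <pdb-name> -o jsonpath='{.spec.minAvailable}'

    # Lower minAvailable to unblock the eviction; restore it once the drain completes.
    kubectl -n <namespace> patch pdb <pdb-name> --type merge \
      -p '{"spec":{"minAvailable":1}}'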

## Break-Glass Manual Terraform Apply

### Context

When CI/CD is broken (e.g., WIF misconfigured, GitHub Actions outage), apply from a local machine.

### Prerequisites

  - gcloud authenticated as a user with Owner on the target project.
  - Terraform installed locally.
  - Repo cloned and on the correct branch.

### Steps

  1. Navigate to the target stack directory.
  2. terraform init -backend-config=backend.hcl
  3. terraform plan — review carefully.
  4. terraform apply — only if plan matches intent (consolidated example below).
  5. Immediately document what was applied and why in a GitHub issue or commit message.
  6. Fix CI as the first priority after the emergency.
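The whole sequence, for example (the stack path is illustrative; application-default credentials are needed only if the Google provider has none locally):

    # Credentials for Terraform's Google provider, if not already set.
    gcloud auth application-default login

    cd environments/dev-02-runtime   # target stack (illustrative)
    terraform init -backend-config=backend.hcl
    terraform plan -out=break-glass.tfplan    # review carefully
    terraform apply break-glass.tfplan        # only if the plan matches intent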