# Operations

This section contains runbooks for common operational tasks. Each runbook follows a consistent structure: context, prerequisites, steps, verification, and rollback.

## DataHub Version Upgrade

### Context

DataHub releases frequently, and releases often include database migration scripts. Upgrades must be tested on dev before being promoted to prod.

### Prerequisites

  - Current DataHub Helm chart version noted (from terraform.tfvars)
  - Target version's release notes reviewed for breaking changes
  - Dev environment available and healthy

### Steps

  1. Update dev: change datahub_helm_chart_version in environments/dev-02-runtime/terraform.tfvars.
  2. PR and plan: open a PR and verify that terraform plan shows only the Helm release changing.
  3. Apply to dev: merge the PR. CI auto-applies. The Helm upgrade runs DataHub's migration hooks automatically.
  4. Verify on dev (example commands follow this list):
     - DataHub UI loads and login works.
     - Run a test ingestion recipe (BigQuery or dbt).
     - Check that Kafka consumer lag is draining.
     - Check that OpenSearch cluster health is green.
     - Verify that lineage and search return expected results.
  5. Promote to prod: update environments/prod-02-runtime/terraform.tfvars with the same version. PR → approval → apply.
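A minimal verification sketch for step 4. The namespaces (kafka, opensearch), the broker pod name, and the consumer group (DataHub's usual MAE default) are assumptions; adjust to the actual deployment:

    # DataHub's MAE consumer group lag; confirm the group name with --list first.
    kubectl -n kafka exec ume-data-dev-kafka-kafka-0 -- \
      /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
      --describe --group generic-mae-consumer-job-client

    # OpenSearch cluster health (security plugin is disabled, so plain HTTP works).
    kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -- \
      curl -s 'http://localhost:9200/_cluster/health?pretty'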

### Rollback

If the upgrade fails:

  1. Revert the datahub_helm_chart_version in tfvars to the previous value.
  2. Apply. The Helm rollback will restore the previous release (verification sketch below).
  3. If the database migration is not backward-compatible (rare), restore Cloud SQL from PITR (see below).
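To confirm the rollback landed, assuming the Helm release and namespace are both named datahub (an assumption; check with helm list -A):

    # Revision history should show the upgrade followed by the rollback.
    helm -n datahub history datahub

    # Confirm which chart version is live now.
    helm -n datahub list --filter '^datahub$'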

## Cloud SQL Point-in-Time Recovery (PITR)

### Context

Cloud SQL supports PITR, allowing restoration to any second within the backup retention window.

### Prerequisites

  - PITR enabled on the Cloud SQL instance (configured in Terraform).
  - Target recovery timestamp known.

### Steps

  1. Identify recovery point: determine the timestamp just before the corruption/issue.
  2. Create a new instance from PITR:

    gcloud sql instances clone ume-data-dev-datahub-pg ume-data-dev-datahub-pg-restored \
      --point-in-time="2026-04-13T10:00:00Z"
  3. Verify the restored instance: connect and validate data integrity (see the sketch after this list).
  4. Swap traffic: update the Cloud SQL connection name in terraform.tfvars to point to the restored instance. Apply via Terraform.
  5. Clean up: once verified, delete the old instance (or keep for forensics).
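A quick state check on the clone for step 3, before the cutover in step 4, for example:

    # Clone should be RUNNABLE before cutover; note its connectionName for tfvars.
    gcloud sql instances describe ume-data-dev-datahub-pg-restored \
      --format='value(state,connectionName)'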

### Rollback

Revert terraform.tfvars to the original instance name and apply.


## Kafka Restore (Strimzi)

### Context

Kafka topics are the event bus, not the durable store. If topics are corrupted or lost, metadata is recoverable by re-ingesting from source systems. Cloud SQL holds the durable metadata.

### Prerequisites

  - Strimzi operator running and healthy.
  - DataHub's Cloud SQL database intact.

### Steps

  1. Assess damage: check whether the Kafka cluster is recoverable (broker pods running, ZooKeeper/KRaft healthy).
  2. If the cluster is recoverable: the Strimzi operator will self-heal the broker pods. Wait for ISR to stabilize.
  3. If the cluster is unrecoverable:
     1. Delete the Kafka custom resource (Strimzi will clean up).
     2. Re-apply Terraform for {env}-02-k8s-base to recreate the Kafka cluster.
     3. DataHub's kafka-setup init job will recreate topics on the next Helm release cycle.
  4. Re-ingest metadata: trigger DataHub ingestion DAGs in Airflow to repopulate events. Cloud SQL is the source of truth; Kafka receives new change events.

### Verification

  - kafka-topics.sh --list shows the expected topics (example commands follow this list).
  - DataHub consumer lag is draining.
  - DataHub UI shows current metadata.
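For example, against a Strimzi broker pod (the kafka namespace and the ume-data-dev-kafka-kafka-0 pod name are assumptions based on Strimzi's <cluster>-kafka-N naming):

    # Topic listing; DataHub's metadata topics should be present.
    kubectl -n kafka exec ume-data-dev-kafka-kafka-0 -- \
      /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 --list

    # Empty output here means ISR has stabilized.
    kubectl -n kafka exec ume-data-dev-kafka-kafka-0 -- \
      /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
      --describe --under-replicated-partitions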

## OpenSearch — Current State, Failure Modes, and Recovery

### Context

OpenSearch stores DataHub's entity/graph indices and time-series events. Indices are rebuildable from Cloud SQL (entity source of truth) and Kafka MAE replay (recent change events), so in the worst case OpenSearch downtime degrades the UI but does not lose metadata, as long as Kafka retention outlasts the recovery window (currently 72 h per Story 9's log.retention).

### Current shape (dev)

  - Three data nodes (ume-data-dev-opensearch-nodes-0..2), each running cluster_manager + data + ingest roles on premium-rwo (pd-ssd) with a 5 GiB PVC per node. The operator's native bootstrap flow forms the cluster via its standard bootstrap pod. (Inspection commands follow this list.)
  - Security plugin disabled — internal ClusterIP only; DataHub is the sole client.
  - Snapshot scaffolding exists (ume-opensearch-snapshots-poc-ume-data bucket, ume-opensearch-snapshot GSA, WI-annotated KSA), but the snapshot repository is NOT registered and the CronJob is NOT wired: repository-gcs does not support Workload Identity upstream, so the credential path is a deferred design decision. See backlog.
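A one-line inspection of the pieces above:

    # Cluster CR (phase/health), pods, and PVCs in one view.
    kubectl -n opensearch get opensearchcluster.opensearch.org,pods,pvc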

### Failure modes and expected RTO/RPO

| Failure | DataHub impact | RTO | RPO | Mitigation / recovery |
|---|---|---|---|---|
| Pod crash (OOM, SIGKILL) | UI stale | 30 s–2 min | 0 | StatefulSet heals automatically |
| Node upgrade / eviction | UI stale | 1–3 min | 0 | PVC re-attaches on the new node |
| Node-local disk corruption | Full outage | hours | minutes (Kafka MAE lag) | Delete PVC → wait for new pod → DataHub reindex from Kafka + SQL |
| Zonal PD loss (rare) | Full outage | hours | minutes | Same as above; prod should use regional PD |
| cluster_manager_not_discovered wedge | Full outage | minutes (manual CR delete) | 0 | See runbook below |
| Stuck ISM / managedCluster drift | ISM policy not applied | minutes | 0 | See runbook below |
| Full cluster loss | Full outage | hours–day | minutes (Kafka MAE lag) | Reindex from Kafka + SQL; snapshot restore when available |

DataHub's durability guarantee comes from Cloud SQL (entity source of truth) + Kafka MAE topic (recent events). OpenSearch is a rebuildable search/graph cache. Plan for downtime, not data loss — as long as Kafka retention (currently 72 h) covers the recovery window. If a restore is expected to take longer than Kafka retention, the MAE topic's older events are gone and the DataHub reindex will need to replay from the source ingestion recipes themselves (Airflow DAGs).

### Runbook — Pod crash / stuck StatefulSet

If kubectl -n opensearch get pods shows ume-data-dev-opensearch-nodes-0 in CrashLoopBackOff:

  1. Read the pod logs: kubectl -n opensearch logs ume-data-dev-opensearch-nodes-0 -c opensearch --tail=80.
  2. If the logs show "setting [cluster.initial_master_nodes] is not allowed when [discovery.type] is set to [single-node]" or a similar config conflict, the mounted ConfigMap is stale (kubelet sync lag). Delete the pod to force a fresh mount: kubectl -n opensearch delete pod ume-data-dev-opensearch-nodes-0.
  3. If the logs show ClusterManagerNotDiscoveredException and the pod has been up > 2 minutes, escalate to the cluster-wedge runbook below.
  4. If the startup probe fails with Heap size or MaxDirectMemorySize errors, the JVM heap is larger than the container request. Check modules/opensearch-cluster jvm_heap vs data_memory_request — heap should be ≤ 50% of the memory request (see the sketch after this list).
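A sketch for checking item 4 against the running pod (container name opensearch per the log command above; the heap appearing in a JAVA_OPTS-style env var on stock OpenSearch images is an assumption about this deployment):

    # Container memory request for the opensearch container.
    kubectl -n opensearch get pod ume-data-dev-opensearch-nodes-0 \
      -o jsonpath='{.spec.containers[?(@.name=="opensearch")].resources.requests.memory}'

    # JVM heap actually configured (look for -Xms/-Xmx).
    kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -c opensearch -- \
      env | grep -i java_opts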

### Runbook — Cluster wedge (cluster_manager_not_discovered)

Symptom: the pod is Running 1/1, but /_cluster/health returns 503 with cluster_manager_not_discovered_exception. Operator logs show repeated "Failed to get OpenSearch health status" but nothing progresses.

Cause: the operator's bootstrap pod was killed before registering as cluster manager (a known single-node race), and the operator now considers status.phase = RUNNING, initialized = true — it will not re-run bootstrap.

Recovery (requires one k8s mutation):

  1. Delete the cluster CR so the operator tears everything down:

    kubectl -n opensearch delete opensearchcluster.opensearch.org ume-data-dev-opensearch
  2. Wait ~30 s for the operator to reclaim the StatefulSet and PVCs.
  3. Trigger a Terraform apply on environments/dev-03-runtime — any merge to main that touches the stack (or a one-line no-op change to a referenced module) works. The next apply detects drift and recreates the CR with the correct env overrides.
  4. Watch kubectl -n opensearch get opensearchcluster.opensearch.org -w until HEALTH=green (or query cluster health directly; see the sketch after this list).
  5. If the ISM policy fails to reconcile after the cluster is up (see next runbook), delete and let the next apply recreate it.
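To query cluster health directly instead of watching the CR (assumes curl is available in the OpenSearch image, as in the earlier sketch):

    # 503 with cluster_manager_not_discovered_exception means still wedged;
    # a JSON body with "status" : "green" means recovered.
    kubectl -n opensearch exec ume-data-dev-opensearch-nodes-0 -- \
      curl -s 'http://localhost:9200/_cluster/health?pretty'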

### Runbook — ISM policy status.managedCluster drift

Symptom: kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.status}' shows state: ERROR, reason: cannot change the cluster a resource refers to.

Cause: the cluster CR was destroyed and recreated (e.g. by the wedge runbook above), but the ISM CR's status.managedCluster UID still points at the old cluster. Terraform manages .spec, not .status, so there is no drift to detect until the CR is deleted.

Recovery:

  1. kubectl -n opensearch delete opensearchismpolicy.opensearch.org ume-retention
  2. Trigger a Terraform apply — the next run detects the missing CR and recreates it bound to the current cluster's UID.
  3. Verify: kubectl -n opensearch get opensearchismpolicy.opensearch.org ume-retention -o jsonpath='{.status.state}' returns CREATED.

### Runbook — PVC loss / full data rebuild

If the PVC is lost (zonal disk failure, accidental deletion, corruption):

  1. Confirm the PVC is gone or unrecoverable. If it is still bound, snapshot it first via gcloud compute disks snapshot for forensics.
  2. Delete the StatefulSet and the PVC:

    kubectl -n opensearch delete sts ume-data-dev-opensearch-nodes
    kubectl -n opensearch delete pvc data-ume-data-dev-opensearch-nodes-0
    kubectl -n opensearch delete opensearchcluster.opensearch.org ume-data-dev-opensearch
  3. Trigger a Terraform apply. The operator provisions a fresh PVC, bootstraps a fresh cluster, and reconciles the ISM policy.
  4. Rebuild DataHub's indices. Two paths, in order of preference:
     - Restore from snapshot (see below) — fastest, preserves history. Requires the snapshot credential path to be in place.
     - Reindex from Cloud SQL + Kafka MAE — DataHub's GMS service exposes a restore-indices job that reads entities from the datahub Cloud SQL database and replays the MAE Kafka topic. Trigger via the DataHub UI (Config → Restore indices) or the MCE CLI (a hedged trigger sketch follows this list). Runtime scales with metadata volume; budget hours for a large catalog.
  5. If Kafka retention (currently 72 h) is shorter than the recovery window, the older MAE events are gone. Finish by re-running the ingestion DAGs in Airflow so the source systems re-emit their metadata.
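A hedged sketch of the reindex trigger in step 4. Recent DataHub Helm charts ship a suspended restore-indices CronJob template; the exact name depends on the release name, so the one below is an assumption:

    # Find the restore-indices CronJob installed by the DataHub chart.
    kubectl -n datahub get cronjobs

    # Run it once as a one-off Job (the CronJob name here is an assumption;
    # substitute the real one from the listing above).
    kubectl -n datahub create job restore-indices-manual \
      --from=cronjob/datahub-datahub-restore-indices-job-template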

### Runbook — Restore from GCS snapshot

Deferred. The snapshot repository is not registered and the backup CronJob is not wired (see Current shape above); this runbook will be filled in once the repository-gcs credential path is decided.

## GKE Node Pool Upgrade

### Context

GKE auto-upgrades nodes within the maintenance window, but sometimes a manual upgrade is needed (e.g., a CVE patch).

### Steps

  1. Check current versions:

    gcloud container node-pools list --cluster=ume-data-dev-gke --region=us-east1
  2. Trigger upgrade (if not auto-upgrading):

    gcloud container node-pools update workload \
      --cluster=ume-data-dev-gke \
      --region=us-east1 \
      --node-version=<target-version>
  3. Monitor: watch node replacement via kubectl get nodes -w. Surge upgrade adds a new node before draining the old one. PDBs protect running workloads.
  4. Verify: all nodes at the target version (example below); no pods in CrashLoopBackOff; Kafka ISR healthy; OpenSearch green.
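A quick check for step 4:

    # All nodes should report the target kubelet version.
    kubectl get nodes \
      -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion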

### Emergency: node stuck draining

If a node is stuck draining (a pod cannot be evicted because of a PDB conflict):

  1. Identify the stuck pod: kubectl get pods --field-selector spec.nodeName=<node-name>.
  2. Check if the PDB is blocking: kubectl get pdb -A.
  3. If safe, temporarily adjust the PDB minAvailable down by 1, allow the drain, then restore it (example below).
  4. Document why and what was adjusted.
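A hedged example for step 3 (the PDB name and namespace are placeholders):

    # Record the current value so it can be restored afterwards.
    kubectl -n <namespace> get pdb <pdb-name> -o jsonpath='{.spec.minAvailable}'

    # Lower minAvailable to unblock the eviction; restore it once the drain completes.
    kubectl -n <namespace> patch pdb <pdb-name> --type merge \
      -p '{"spec":{"minAvailable":1}}'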

## Break-Glass Manual Terraform Apply

### Context

When CI/CD is broken (e.g., WIF misconfigured, GitHub Actions outage), apply from a local machine.

### Prerequisites

  - gcloud authenticated as a user with Owner on the target project.
  - Terraform installed locally.
  - Repo cloned and on the correct branch.

### Steps

  1. Navigate to the target stack directory.
  2. terraform init -backend-config=backend.hcl
  3. terraform plan — review carefully.
  4. terraform apply — only if plan matches intent (consolidated example below).
  5. Immediately document what was applied and why in a GitHub issue or commit message.
  6. Fix CI as the first priority after the emergency.
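The whole sequence, for example (the stack path is illustrative; application-default credentials are needed only if the Google provider has none locally):

    # Credentials for Terraform's Google provider, if not already set.
    gcloud auth application-default login

    cd environments/dev-02-runtime   # target stack (illustrative)
    terraform init -backend-config=backend.hcl
    terraform plan -out=break-glass.tfplan    # review carefully
    terraform apply break-glass.tfplan        # only if the plan matches intent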