# DataHub

DataHub is the data catalog and governance platform for UME. It is deployed on GKE via Helm, backed by Cloud SQL (PostgreSQL), Strimzi-managed Kafka, and self-hosted OpenSearch. This section covers the deployment architecture, component configuration, identity, and known risks.

For the business rationale, feature set, and governance workflows, see Data Catalog.

# Component Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  GKE Cluster                                                        │
│                                                                     │
│  ┌──────────────────┐   ┌──────────────────┐   ┌────────────────┐  │
│  │  DataHub Frontend │   │   DataHub GMS    │   │  MAE Consumer  │  │
│  │  (React UI)       │   │  (Metadata Svc)  │   │                │  │
│  └────────┬─────────┘   └──┬────┬────┬─────┘   └───────┬────────┘  │
│           │                 │    │    │                  │           │
│           │     ┌───────────┘    │    └────────┐        │           │
│           │     │                │             │        │           │
│  ┌────────▼─────▼──┐  ┌────────▼────────┐  ┌─▼────────▼────────┐  │
│  │   Kafka (Strimzi)│  │   OpenSearch    │  │  MCE Consumer     │  │
│  │   3 brokers      │  │   3 data nodes  │  │                   │  │
│  └──────────────────┘  └────────────────┘  └───────────────────┘  │
│                                                                     │
└──────────────────────────────┬──────────────────────────────────────┘
                               │ Private Service Access
                    ┌──────────▼──────────┐
                    │  Cloud SQL Postgres  │
                    │  (DataHub metadata)  │
                    └─────────────────────┘

# DataHub services

| Service | Role | Scaling |
|---|---|---|
| GMS (Generalized Metadata Service) | Core API. Reads/writes metadata to SQL, publishes events to Kafka, queries OpenSearch. | Dev: 1 replica. Prod: HPA, 2-4 replicas |
| Frontend | React web UI. Serves search, browse, and lineage visualization. | Dev: 1 replica. Prod: HPA, 2-3 replicas |
| MAE Consumer | Reads Metadata Audit Events from Kafka, indexes them into OpenSearch. | Dev: 1 replica. Prod: 2 replicas |
| MCE Consumer | Reads Metadata Change Events from Kafka, writes them to SQL. | Dev: 1 replica. Prod: 2 replicas |
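
In prod, GMS and Frontend scale via HorizontalPodAutoscalers. A minimal sketch of what the GMS autoscaler looks like as a raw manifest (the Deployment name, namespace, and CPU target are assumptions; the actual settings come from the Helm chart values):

```yaml
# Illustrative HPA for DataHub GMS in prod; names and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: datahub-gms
  namespace: datahub
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: datahub-datahub-gms   # assumed name for a Helm release called "datahub"
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative CPU target
```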

# Cloud SQL (PostgreSQL)

DataHub's metadata store. Holds entities, relationships, aspects, and system metadata.

# Configuration

| Setting | Dev | Prod |
|---|---|---|
| Instance tier | db-g1-small | db-custom-2-7680 |
| HA | Single zone | Regional (automatic failover) |
| Storage | 10 GB SSD | 50 GB SSD, auto-increase |
| Backups | Daily, 7-day retention | Daily, 30-day retention + PITR |
| Network | Private IP via PSA | Private IP via PSA |
| Auth | IAM authentication for service accounts | Same |
| Admin password | In Secret Manager (break-glass only) | Same |
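
A hedged Terraform sketch of the prod instance shape implied by this table (resource names, the PostgreSQL major version, and the network reference are illustrative; the real definition lives in the environment's Terraform):

```hcl
# Illustrative Cloud SQL instance matching the prod column above.
resource "google_sql_database_instance" "datahub" {
  name             = "datahub-metadata-prod"   # hypothetical name
  database_version = "POSTGRES_15"             # assumed major version
  region           = var.region

  settings {
    tier              = "db-custom-2-7680"
    availability_type = "REGIONAL"             # regional HA with automatic failover
    disk_size         = 50
    disk_autoresize   = true

    ip_configuration {
      ipv4_enabled    = false                  # private IP only, via PSA
      private_network = var.private_network_self_link
    }

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      backup_retention_settings {
        retained_backups = 30
      }
    }

    database_flags {
      name  = "cloudsql.iam_authentication"    # required for IAM database auth
      value = "on"
    }
  }
}
```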

# IAM authentication

DataHub GMS authenticates to Cloud SQL using the datahub-sa Google service account via IAM database authentication. No passwords for programmatic access:

datahub-gms pod (k8s SA: datahub-gms)
    → Workload Identity → Google SA: datahub-sa
    → IAM database auth → Cloud SQL Postgres

The datahub-sa service account is granted roles/cloudsql.client and roles/cloudsql.instanceUser, and the corresponding IAM database user is created with GRANT ALL on the DataHub database.
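
A minimal Terraform sketch of those grants (the service-account resource is assumed to be defined elsewhere; names are illustrative):

```hcl
# Cloud SQL roles for the DataHub Google service account.
resource "google_project_iam_member" "datahub_sql_client" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.datahub_sa.email}"
}

resource "google_project_iam_member" "datahub_sql_instance_user" {
  project = var.project_id
  role    = "roles/cloudsql.instanceUser"
  member  = "serviceAccount:${google_service_account.datahub_sa.email}"
}

# IAM database user: Cloud SQL expects the SA email without the
# ".gserviceaccount.com" suffix for CLOUD_IAM_SERVICE_ACCOUNT users.
resource "google_sql_user" "datahub_iam" {
  instance = google_sql_database_instance.datahub.name
  name     = trimsuffix(google_service_account.datahub_sa.email, ".gserviceaccount.com")
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}
```

The GRANT ALL on the DataHub database itself is applied inside PostgreSQL; database-level privileges are not expressible as Cloud SQL IAM roles.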

# Kafka (Strimzi)

Kafka serves as DataHub's event bus for metadata change propagation.

# Why self-hosted

Pricing for GCP's Managed Service for Apache Kafka is prohibitive at our workload size: the smallest cluster costs several hundred dollars per month. Strimzi on GKE gives us a full Kafka cluster with an operator-managed lifecycle at node cost only.

Documented as an upgrade path: when Managed Kafka pricing improves, migrating from Strimzi to the managed service is a config-level change (update broker endpoints in the DataHub Helm values, decommission the Strimzi resources).

# Strimzi deployment

| Setting | Dev | Prod |
|---|---|---|
| Brokers | 3 | 5-7 |
| Replication factor | 2 | 3 |
| Storage per broker | 20 GB PD-SSD | 100 GB PD-SSD |
| Strimzi operator version | Latest stable | Same |
| Cruise Control | Enabled | Enabled |

Strimzi is deployed in two Terraform resources within dev-02-k8s-base/kafka.tf:

  1. Strimzi operator - Helm release of the Strimzi operator chart.
  2. Kafka cluster - Kafka CRD applied via the kubernetes_manifest resource, referencing the operator.
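
A condensed sketch of those two resources, assuming the upstream Strimzi chart and a dev-sized cluster (namespace, cluster name, and the exact Kafka spec fields are illustrative and depend on the operator version in use):

```hcl
# 1. Strimzi operator via Helm.
resource "helm_release" "strimzi_operator" {
  name       = "strimzi"
  namespace  = "kafka"
  repository = "https://strimzi.io/charts/"
  chart      = "strimzi-kafka-operator"
}

# 2. Kafka cluster CRD reconciled by the operator.
resource "kubernetes_manifest" "kafka_cluster" {
  manifest = {
    apiVersion = "kafka.strimzi.io/v1beta2"
    kind       = "Kafka"
    metadata = {
      name      = "datahub-kafka"   # hypothetical cluster name
      namespace = "kafka"
    }
    spec = {
      kafka = {
        replicas = 3
        config = {
          "default.replication.factor" = 2
          "min.insync.replicas"        = 1
        }
        storage = {
          type = "persistent-claim"
          size = "20Gi"
        }
        listeners = [
          { name = "plain", port = 9092, type = "internal", tls = false }
        ]
      }
      cruiseControl = {}   # enables Cruise Control
      # Depending on the Strimzi version, a zookeeper block or KRaft node pools
      # are also required; omitted here for brevity.
    }
  }

  depends_on = [helm_release.strimzi_operator]
}
```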

# Topics

DataHub requires several Kafka topics. The DataHub Helm chart creates them automatically via its kafka-setup init job. Topic names follow DataHub's defaults:

  • MetadataChangeLog_Versioned_v1
  • MetadataChangeLog_Timeseries_v1
  • MetadataAuditEvent_v1
  • PlatformEvent_v1
  • And several others (DataHub manages these; we do not create them manually).

# Autoscaling

Strimzi does not natively auto-scale brokers. However:

  • Kafka broker scaling: Strimzi supports declarative scaling (change replicas in the Kafka CRD and the operator handles rolling addition/removal). This is a manual operation triggered by monitoring.
  • Node-level scaling: the GKE Cluster Autoscaler adds nodes when Kafka pods are pending. This handles the compute side.
  • Partition rebalancing: Cruise Control rebalances partitions after broker additions, distributing load evenly.

Alerts to watch: broker CPU > 80%, consumer lag > threshold, broker count at minimum. See Observability.
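
As an example of the manual scaling flow described above: after bumping spec.kafka.replicas in the Kafka CRD, a Cruise Control rebalance can be requested declaratively with a KafkaRebalance resource (the cluster name is illustrative and must match the Kafka CRD):

```yaml
# Request a full rebalance after adding brokers; approve it with:
#   kubectl annotate kafkarebalance post-scale-up strimzi.io/rebalance=approve -n kafka
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: post-scale-up
  namespace: kafka
  labels:
    strimzi.io/cluster: datahub-kafka   # must match the Kafka CRD name
spec: {}                                 # empty spec = full rebalance with defaults
```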

# OpenSearch

OpenSearch provides full-text search and graph query capabilities for DataHub's discovery UI.

# Deployment

| Setting | Dev | Prod |
|---|---|---|
| Data nodes | 3 | 5-7 |
| Storage per node | 20 GB PD-SSD | 50 GB PD-SSD |
| JVM heap | 512 MB | 2 GB |
| Master-eligible nodes | All data nodes | Dedicated master nodes |

OpenSearch is deployed via the OpenSearch operator Helm chart in dev-02-k8s-base/opensearch.tf. The operator manages rolling upgrades, shard rebalancing, and node replacement.
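
A sketch of the dev cluster shape as an operator CRD, assuming the upstream OpenSearch operator (opensearch.opster.io); field names are illustrative and should be checked against the operator chart version pinned in opensearch.tf:

```yaml
# Illustrative OpenSearchCluster resource for dev (3 data nodes, all master-eligible).
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: datahub-opensearch        # hypothetical name
  namespace: opensearch
spec:
  general:
    serviceName: datahub-opensearch
    version: 2.11.0               # illustrative version
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: "20Gi"
      roles: ["master", "data"]   # dev: data nodes are also master-eligible
      jvm: "-Xms512M -Xmx512M"    # dev heap size
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
```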

# Backups

OpenSearch snapshots are stored in GCS:

  • Schedule: daily snapshot via Kubernetes CronJob that calls the OpenSearch Snapshot API.
  • Bucket: ume-opensearch-snapshots-{env} (created in {env}-01-base).
  • Retention: 7 days dev, 30 days prod.
  • Restore procedure: documented in Operations.
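
A minimal sketch of the snapshot CronJob from the first bullet above, assuming a GCS snapshot repository (here called gcs-snapshots) has already been registered with the cluster; image, schedule, and service names are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: opensearch-daily-snapshot
  namespace: opensearch
spec:
  schedule: "0 2 * * *"                    # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: curlimages/curl:8.8.0  # illustrative image
              command: ["/bin/sh", "-c"]
              args:
                # PUT a date-stamped snapshot into the pre-registered GCS repository.
                - >
                  curl -sf -X PUT
                  "http://datahub-opensearch:9200/_snapshot/gcs-snapshots/daily-$(date +%Y%m%d)?wait_for_completion=true"
```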

# Autoscaling

OpenSearch does not natively auto-scale. Scaling is handled by:

  • Horizontal: add data nodes by updating the operator CRD replica count. The operator handles rolling addition and shard redistribution.
  • Vertical: increase JVM heap and node resources in the CRD.
  • Node-level: GKE Cluster Autoscaler adds nodes when OpenSearch pods are pending.

Alerts to watch: JVM heap > 75%, unassigned shards > 0, disk usage > 80%.

# Identity and OAuth

# Google OIDC configuration

DataHub Frontend authenticates users via Google OIDC (OpenID Connect):

  1. An OAuth client is created in the GCP console (or via Terraform google_iap_client if IAP is used, or google_project_service_identity + manual OAuth consent screen setup).
  2. Client ID and client secret are stored in Secret Manager.
  3. Mounted into the DataHub Frontend pod via the Secret Manager CSI driver.
  4. DataHub's application.yml is configured via Helm values:
```yaml
datahub-frontend:
  extraEnvs:
    - name: AUTH_OIDC_ENABLED
      value: "true"
    - name: AUTH_OIDC_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: datahub-oidc-secret
          key: client_id
    - name: AUTH_OIDC_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: datahub-oidc-secret
          key: client_secret
    - name: AUTH_OIDC_DISCOVERY_URI
      value: "https://accounts.google.com/.well-known/openid-configuration"
    - name: AUTH_OIDC_BASE_URL
      value: "https://datahub.data.ume.com.br"
```

# Access restriction

  • Org-domain restriction: the OAuth consent screen is configured as "Internal" (GCP Workspace), which restricts login to users in the UME organization domain. This applies to both dev and prod.
  • DataHub groups: within DataHub, groups are created to map access levels:

| Group | DataHub role | Members |
|---|---|---|
| datahub-admins | Admin | Platform team |
| datahub-editors | Editor | Data engineers, analytics engineers |
| datahub-viewers | Viewer | All other org members |

Group membership is managed within DataHub's UI. Integration with Google Workspace groups is a future enhancement.

# Helm Values (Dev vs Prod)

The modules/datahub-helm/ module templates Helm values from per-environment variables. Key differences:

| Value path | Dev | Prod |
|---|---|---|
| datahub-gms.replicaCount | 1 | 2 |
| datahub-frontend.replicaCount | 1 | 2 |
| datahub-mae-consumer.replicaCount | 1 | 2 |
| datahub-mce-consumer.replicaCount | 1 | 2 |
| datahub-gms.resources.requests.memory | 1Gi | 4Gi |
| datahub-frontend.resources.requests.memory | 512Mi | 2Gi |
| global.sql.datasource.host | (from remote state) | (from remote state) |
| global.kafka.bootstrap.server | (from remote state) | (from remote state) |
| global.elasticsearch.host | (from remote state) | (from remote state) |

All backing-service endpoints are read from terraform_remote_state and passed into the Helm values dynamically.
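
A condensed sketch of that wiring (state layout, output names, and variables are illustrative; see modules/datahub-helm/ for the real template):

```hcl
# Read backing-service endpoints from an earlier layer's state.
data "terraform_remote_state" "k8s_base" {
  backend = "gcs"
  config = {
    bucket = var.state_bucket
    prefix = "${var.env}-02-k8s-base"   # hypothetical prefix
  }
}

resource "helm_release" "datahub" {
  name       = "datahub"
  repository = "https://helm.datahubproject.io"
  chart      = "datahub"
  version    = var.datahub_chart_version   # pinned in terraform.tfvars

  values = [yamlencode({
    global = {
      sql = {
        datasource = { host = data.terraform_remote_state.k8s_base.outputs.sql_host }
      }
      kafka = {
        bootstrap = { server = data.terraform_remote_state.k8s_base.outputs.kafka_bootstrap }
      }
      elasticsearch = {
        host = data.terraform_remote_state.k8s_base.outputs.opensearch_host
      }
    }
  })]
}
```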

# Known Risks and Mitigations

# The "metadata spike" risk

Analysts running large metadata crawls (e.g., a full BigQuery ingestion) can cause CPU/RAM spikes on GMS and a backlog on Kafka.

Mitigations:

  • GKE Cluster Autoscaler adds workload nodes when pods need more resources.
  • Kafka consumer lag alerts fire before the backlog becomes critical.
  • Documented guidance: run large crawls during off-hours.
  • DataHub supports staged ingestion (process N tables per run); configure crawls accordingly.

# Data persistence and corruption

If OpenSearch indices or Kafka topics are corrupted during a node upgrade, metadata discovery breaks.

Mitigations:

  • OpenSearch: daily GCS snapshots. Restore procedure in Operations.
  • Kafka: topics are recoverable by re-ingesting from DataHub's metadata sources (Cloud SQL is the durable store; Kafka is the event bus). Re-ingest procedure in Operations.
  • Cloud SQL: automated backups + PITR.
  • PDBs prevent multiple replicas from being evicted simultaneously during upgrades.
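
A representative PDB shape for GMS (the namespace and label selector are assumptions; the chart's actual labels should be used):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: datahub-gms
  namespace: datahub
spec:
  minAvailable: 1                          # keep at least one GMS replica up during node drains
  selector:
    matchLabels:
      app.kubernetes.io/name: datahub-gms  # assumed chart label
```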

# Version drift

DataHub releases frequently. Upgrading versions often involves metadata migrations that can take hours or fail.

Mitigations:

  • Never upgrade prod directly. Always upgrade dev first.
  • Exercise DataHub's migration path before prod: the Helm chart's upgrade hooks run metadata migrations automatically (and datahub docker quickstart includes a migration check), so test the full upgrade on dev first.
  • Pin the DataHub Helm chart version in terraform.tfvars. Version bumps are explicit PRs.
  • Upgrade procedure documented in Operations.

# The "connector trap"

Users will request fixes to Looker, dbt, or BigQuery ingestion connectors. This is data engineering work, not infrastructure.

Mitigations:

  • Clear ownership boundary: infrastructure team owns the DataHub platform (deployment, scaling, upgrades). Data engineering team owns ingestion recipes (connector config, scheduling, troubleshooting).
  • Ingestion recipes live in the DAGs repo, not in ume-data-infra.
  • The datahub-platform agent is scoped to Helm values and platform config; it does not touch ingestion recipes.

# Ingestion Recipes (Reference)

Ingestion recipes run as Airflow DAGs (in the DAGs repo, not in ume-data-infra). Wave-1 delivers three recipes:

| Recipe | Source | Frequency |
|---|---|---|
| BigQuery metadata | BigQuery datasets, tables, and column descriptions | Daily |
| Airflow DAGs | Airflow REST API | Daily |
| dbt manifests | dbt manifest.json + run_results.json | After each dbt run |

Each recipe is an Airflow DAG that invokes datahub ingest (DataHub's CLI) with a YAML recipe file. The DataHub CLI is included in the custom Airflow image.
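
For reference, a minimal shape of such a recipe for the BigQuery source (the project ID, source config fields, and GMS endpoint are illustrative; the real recipes live in the DAGs repo):

```yaml
# bigquery_recipe.yml -- run with: datahub ingest -c bigquery_recipe.yml
source:
  type: bigquery
  config:
    project_ids: ["ume-analytics-prod"]   # hypothetical project
    include_table_lineage: true
sink:
  type: datahub-rest
  config:
    # In-cluster GMS endpoint; service name assumed from the Helm release.
    server: "http://datahub-datahub-gms.datahub.svc.cluster.local:8080"
```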