# Data Catalog
Recommendation:
- Deploy DataHub as a self-hosted instance on GCP, secured with Identity-Aware Proxy
- Establish quarterly upgrade cadence with documented rollback procedures
- Configure dbt manifest ingestion as the first integration priority
- Implement DLP scanning with results pushed to DataHub for PII classification
- Define and document certification workflow (Experimental > Validated > Certified)
- Build onboarding guide for users: how to search, how to interpret lineage, how to request access
The data catalog is the governance backbone of the data platform. It provides a single place where users discover what data exists, understand its meaning, trace its origins, and assess its trustworthiness. More than a technical inventory, the catalog is where the organization builds shared understanding of its data assets.
For UME, where trust in numbers has been identified as a core challenge, the catalog is not optional tooling - it is foundational infrastructure. Every recommendation in the preceding sections - Data Sources, Object Storage, Lake Engine, ETL, Reporting, and Data Science - depends on a functioning catalog to deliver visibility, accountability, and confidence.
## Current Challenges
Before selecting tooling, it helps to articulate the problems we need to solve:
- Lack of trust in data - Multiple versions of the same KPI exist across dashboards, leading to conflicting numbers and eroded confidence. Users do not know which tables are reliable.
- No visibility into lineage - When numbers look wrong, there is no way to trace back through transformations to the source. When a source changes, there is no way to know what downstream assets are affected.
- Scattered documentation - Business definitions, ownership, and usage context live in spreadsheets, wikis, and tribal knowledge. New team members have no starting point.
- PII exposure risk - Sensitive data exists across the lake without systematic classification. Compliance with LGPD requires knowing where PII lives.
- No quality signals - Users cannot tell whether data has passed tests, how fresh it is, or whether it is certified for production use.
The catalog must address all of these.
## Tooling: DataHub
We recommend DataHub as the data catalog platform, deployed as a self-hosted instance on GCP.
### Why DataHub over GCP-native tools
GCP offers cataloging capabilities through Dataplex and Vertex AI Metadata, but these tools have significant limitations for our use case:
- User experience - Native GCP tools are designed for data engineers navigating the cloud console, not for analysts or business users seeking to discover and understand data. They lack the "home-feeling" of a purpose-built catalog where users can search, browse, and explore without GCP expertise.
- Discoverability - Dataplex is buried within the GCP console, requiring navigation through multiple layers. A standalone catalog provides a single URL that becomes the starting point for data discovery.
- Social features - DataHub offers built-in support for comments, discussions, and certification workflows that GCP tools do not provide at the same depth.
- Cross-platform lineage - DataHub ingests lineage from diverse sources (dbt, Airflow, reporting tools, data science platforms) into a unified graph. GCP-native tools are more siloed.
### Self-hosting rationale
Data governance tools in this category - Atlan, Alation, data.world, Collibra - either have opaque enterprise pricing or costs that scale unpredictably. DataHub is open-source with active development and a large community.
Self-hosting provides:
- Cost predictability - Infrastructure costs are transparent and controllable
- Flexibility - Customization and integration without vendor constraints
- Data residency - Metadata stays within UME's GCP environment
Self-hosting requires operational commitment. Plan for:
- Regular upgrades (establish a quarterly upgrade cadence)
- Monitoring and alerting for the DataHub infrastructure
- Backup and disaster recovery for the metadata database
### Deployment security: Identity-Aware Proxy
Securing access to a self-hosted DataHub does not require VPN tunnels. Google Identity-Aware Proxy (IAP) provides a simpler, GCP-native approach:
- No VPN client required - Users access DataHub through their browser; IAP authenticates via Google Workspace identity
- Context-aware policies - Restrict access based on user groups, device security posture, or IP ranges
- Audit trail - Every access attempt is logged with user identity
- Zero infrastructure overhead - IAP is a managed service; no VPN concentrators to maintain
This aligns with UME's preference for native GCP tooling while avoiding the complexity and user friction of traditional VPN solutions.
## Core Capabilities
The catalog must deliver four capabilities that address the challenges outlined above.
### Data Discovery
Users should find data without needing to know where it lives:
- Search - Full-text search across table names, column names, descriptions, and documentation
- Browse - Navigate by domain, data tier (bronze/silver/gold), or business concept
- Metadata - Every asset displays owner, description, tags, and usage statistics
- Examples - Where appropriate, include sample queries or documentation on how to consume the data
The catalog is the first stop when answering "does this data exist?" or "where do I find data about X?"
### Lineage Visibility
End-to-end lineage is essential for trust and impact analysis:
Sources > Bronze > Silver > Gold > Reports/Models/Predictions
- Upstream lineage - Given a dashboard, trace back through dbt models to source systems
- Downstream lineage - Given a source table, see all dashboards, models, and features that depend on it
- Impact analysis - Before changing a transformation, understand what will be affected
- Root cause tracing - When numbers look wrong, follow the lineage to isolate where issues were introduced
Lineage is captured automatically from dbt manifests, orchestration metadata, and reporting tool connectors. Manual enrichment is available for cases where automated capture is incomplete.
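The upstream, downstream, and impact-analysis use cases above all amount to traversing a lineage graph. A minimal sketch over a hypothetical edge map (in practice this graph is populated by DataHub's ingestion connectors):

```python
from collections import deque

# Hypothetical downstream edge map; real edges come from dbt manifests,
# orchestration metadata, and reporting-tool connectors.
DOWNSTREAM = {
    "source.crm.accounts": ["bronze.accounts"],
    "bronze.accounts": ["silver.accounts"],
    "silver.accounts": ["gold.revenue_daily", "gold.churn_features"],
    "gold.revenue_daily": ["dashboard.exec_revenue"],
    "gold.churn_features": ["model.churn_predictor"],
}

def impacted(asset: str) -> set[str]:
    """Return every asset reachable downstream of the given asset (BFS)."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Changing silver.accounts affects two gold tables, a dashboard, and a model.
print(sorted(impacted("silver.accounts")))
```

Upstream tracing is the same traversal over the reversed edge map, which is why a single unified graph serves both root-cause and impact analysis.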
### Data Quality and Certification
Quality signals build trust. The catalog surfaces test results, freshness indicators, and certification status for every dataset.

The certification workflow follows the model from Reporting - KPI Lifecycle:
- Experimental - Default state; not for production use
- Validated - Passes automated tests, has documentation and owner assigned
- Certified - Reviewed by data steward, stakeholder sign-off, approved for business-critical use
Users viewing a dataset see its certification status prominently displayed.
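The three-level workflow can be made precise as a guarded state machine. The criteria names below are illustrative stand-ins for whatever checks UME formalizes, not DataHub configuration:

```python
# Sketch of the Experimental > Validated > Certified workflow as a
# guarded state machine (requirement names are illustrative).

LEVELS = ["Experimental", "Validated", "Certified"]

REQUIREMENTS = {
    "Validated": {"tests_pass", "has_owner", "has_documentation"},
    "Certified": {"steward_review", "stakeholder_signoff"},
}

def promote(current: str, evidence: set[str]) -> str:
    """Promote one level if the next level's requirements are all met."""
    idx = LEVELS.index(current)
    if idx + 1 == len(LEVELS):
        return current  # already at the top level
    target = LEVELS[idx + 1]
    missing = REQUIREMENTS[target] - evidence
    if missing:
        raise ValueError(f"Cannot reach {target}; missing: {sorted(missing)}")
    return target

status = promote("Experimental", {"tests_pass", "has_owner", "has_documentation"})
print(status)  # → Validated
```

Encoding the gates this way keeps promotions auditable: a dataset can never skip a level, and every refusal names the missing evidence.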
### PII Classification and Governance
The catalog is the central registry for sensitive data:
- Automated scanning - Google Cloud DLP scans buckets and BigQuery tables on a schedule, pushing classification results to DataHub
- Manual tagging - Data stewards can add or correct classifications
- Classification propagation - When a source column is tagged as PII, the catalog can suggest (or auto-apply) tags to downstream columns in lineage
- Access policy linkage - Classifications inform who should have access; the catalog documents which policies apply
See Object Storage - PII detection and ETL - PII and Sensitive Data Handling for related guidance on how PII is handled in those layers.
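Classification propagation boils down to walking column-level lineage outward from already-tagged columns. A minimal sketch with a hypothetical lineage map; DataHub's own propagation features would replace this in practice:

```python
# Sketch of PII tag propagation along column-level lineage.
# The lineage map is a hypothetical stand-in for catalog metadata.

COLUMN_LINEAGE = {
    "bronze.customers.email": ["silver.customers.email"],
    "silver.customers.email": ["gold.marketing_contacts.email"],
}

def suggest_pii_tags(tagged: set[str]) -> set[str]:
    """Suggest PII tags for all columns downstream of tagged columns."""
    suggestions = set()
    frontier = list(tagged)
    while frontier:
        col = frontier.pop()
        for child in COLUMN_LINEAGE.get(col, []):
            if child not in tagged and child not in suggestions:
                suggestions.add(child)
                frontier.append(child)
    return suggestions

# One DLP finding on the bronze column surfaces two downstream candidates.
print(sorted(suggest_pii_tags({"bronze.customers.email"})))
```

Whether suggestions are auto-applied or routed to a data steward for review is a policy decision; the traversal is the same either way.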
## Platform Integration
The catalog is only valuable if it reflects reality. This requires integration with every layer of the platform.
### ETL Integration (dbt + Airflow)
- dbt manifests - Ingest model definitions, column descriptions, tests, and lineage after each dbt run. This is the primary source of truth for transformation metadata.
- Airflow/Composer metadata - Capture DAG definitions, run history, and task lineage, linking each task to the datasets it produces.
- Schema contracts - The catalog is the source of truth for schema definitions (see ETL - Schema enforcement). Changes are versioned and documented.
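To make the dbt integration concrete, the sketch below pulls model-to-model lineage out of an abbreviated `manifest.json` excerpt. Real manifests carry far more (columns, descriptions, tests), and DataHub's dbt connector handles this ingestion; the model and source names here are illustrative:

```python
import json

# Abbreviated dbt manifest excerpt (hypothetical project "ume"); real
# manifests include many more fields that the catalog also ingests.
manifest = json.loads("""
{
  "nodes": {
    "model.ume.stg_orders": {
      "depends_on": {"nodes": ["source.ume.erp.orders"]}
    },
    "model.ume.fct_revenue": {
      "depends_on": {"nodes": ["model.ume.stg_orders"]}
    }
  }
}
""")

def model_lineage(manifest: dict) -> dict[str, list[str]]:
    """Map each dbt node to its upstream dependencies."""
    return {
        name: node["depends_on"]["nodes"]
        for name, node in manifest["nodes"].items()
    }

print(model_lineage(manifest)["model.ume.fct_revenue"])  # → ['model.ume.stg_orders']
```

Because dbt regenerates the manifest on every run, ingesting it after each run keeps catalog lineage in lockstep with the transformations that actually executed.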
### Reporting Tool Integration
- Looker Studio / Metabase connectors - DataHub provides connectors that extract dashboard metadata, including which tables each report queries
- Dashboard ownership - Ownership metadata from Reporting - Ownership and Lineage lives here
- Downstream lineage - Reports appear as downstream consumers of Gold-layer tables, completing the lineage picture
### Data Science Integration
Data science artifacts must be traceable:
- Vertex AI integration - Configure metadata export from Vertex AI to DataHub. Models registered in Vertex should appear in the catalog.
- Lineage for complex pipelines - When data scientists build feature engineering pipelines that cannot be expressed in SQL, those pipelines must emit lineage metadata compatible with the catalog (see Data Science - Complex Pipelines).
### Cost Attribution
Link cost data to catalog entries:
- BigQuery cost reports - Lake Engine generates cost data per table and query. Surface high-cost tables in the catalog.
- Dashboard cost attribution - From Reporting - Cost Attribution, link top-cost dashboards to their catalog entries. Owners see cost alongside other metadata.
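Linking cost to ownership is a simple join once both live in the catalog. A sketch with illustrative figures and hypothetical table names:

```python
# Sketch of joining per-table cost data to catalog ownership so owners
# see cost alongside other metadata (names and figures are illustrative).

catalog = {
    "gold.revenue_daily": {"owner": "finance-data"},
    "gold.churn_features": {"owner": "data-science"},
}
monthly_cost_usd = {
    "gold.revenue_daily": 42.10,
    "gold.churn_features": 731.55,
}

def flag_high_cost(threshold: float) -> list[tuple[str, str, float]]:
    """Return (table, owner, cost) for tables above the cost threshold."""
    return [
        (table, catalog[table]["owner"], cost)
        for table, cost in sorted(monthly_cost_usd.items())
        if cost > threshold
    ]

print(flag_high_cost(100.0))  # only gold.churn_features exceeds the threshold
```

The join key is the table identifier, which is why consistent naming between BigQuery cost exports and catalog entries matters.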
## Governance Recommendations
- Make the catalog the starting point - Train users to begin data discovery in the catalog, not in BigQuery or reporting tools. Link to it from documentation, onboarding materials, and team channels.
- Require ownership - Every dataset in Gold and above must have an owner. Ownership is visible and contact information is accessible.
- Surface test results - Integrate dbt test outcomes so users see whether data passed quality gates before consuming it.
- Establish certification workflow - Define criteria for each certification level. Make certification status visible in search results and asset pages.
- Automate PII scanning - Schedule regular DLP scans; push results to the catalog. Alert on new PII discoveries.
- Capture full lineage - Configure ingestion from dbt, orchestration, reporting tools, and data science platforms. Manual enrichment fills gaps.
- Document schema evolution - Maintain history of schema changes. Alert downstream consumers when breaking changes occur.
- Link cost data - High-cost assets should be flagged. Owners should be aware of the cost profile of their data.
- Use IAP for access - Secure DataHub with Identity-Aware Proxy rather than VPN tunnels.
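The schema-evolution recommendation above hinges on detecting breaking changes between schema versions. A minimal sketch, treating removed columns and type changes as breaking and additions as safe (schema dicts are illustrative):

```python
# Sketch of breaking-change detection between two schema versions, as a
# basis for alerting downstream consumers.

def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Flag removed columns and type changes; additions are non-breaking."""
    issues = []
    for column, col_type in old.items():
        if column not in new:
            issues.append(f"removed column: {column}")
        elif new[column] != col_type:
            issues.append(f"type change: {column} {col_type} -> {new[column]}")
    return sorted(issues)

old = {"order_id": "INT64", "amount": "NUMERIC", "region": "STRING"}
new = {"order_id": "INT64", "amount": "FLOAT64"}
print(breaking_changes(old, new))
```

Run against each versioned schema pair, this yields the list of findings to route to the downstream owners identified by the lineage graph.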
## Tasks
- Deploy DataHub on GKE with Identity-Aware Proxy for secure access
- Configure dbt manifest ingestion; validate lineage and test result visibility
- Configure Airflow/Composer metadata ingestion for orchestration lineage
- Evaluate and configure reporting tool connectors (Looker Studio, Metabase)
- Configure Vertex AI metadata export for data science artifacts
- Set up Cloud DLP scheduled scans with results pushed to DataHub
- Define PII classification taxonomy and tagging workflow
- Document certification criteria for Experimental, Validated, and Certified levels
- Build cost attribution integration: surface BigQuery cost data alongside catalog entries
- Create user onboarding guide: discovery, lineage interpretation, access requests
- Establish quarterly upgrade and maintenance schedule for DataHub