# Data Catalog
Recommendation:
- Deploy DataHub as a self-hosted instance on GCP, secured with Identity-Aware Proxy
- Establish quarterly upgrade cadence with documented rollback procedures
- Configure dbt manifest ingestion as the first integration priority
- Implement DLP scanning with results pushed to DataHub for PII classification
- Define and document certification workflow (Experimental > Validated > Certified)
- Build onboarding guide for users: how to search, how to interpret lineage, how to request access
The data catalog is the governance backbone of the data platform. It provides a single place where users discover what data exists, understand its meaning, trace its origins, and assess its trustworthiness. More than a technical inventory, the catalog is where the organization builds shared understanding of its data assets.
For UME, where trust in numbers has been identified as a core challenge, the catalog is not optional tooling - it is foundational infrastructure. Every recommendation in the preceding sections - Data Sources, Object Storage, Lake Engine, ETL, Reporting, and Data Science - depends on a functioning catalog to deliver visibility, accountability, and confidence.
## Current Challenges
Before selecting tooling, it helps to articulate the problems we need to solve:
- Lack of trust in data - Multiple versions of the same KPI exist across dashboards, leading to conflicting numbers and eroded confidence. Users do not know which tables are reliable.
- No visibility into lineage - When numbers look wrong, there is no way to trace back through transformations to the source. When a source changes, there is no way to know what downstream assets are affected.
- Scattered documentation - Business definitions, ownership, and usage context live in spreadsheets, wikis, and tribal knowledge. New team members have no starting point.
- PII exposure risk - Sensitive data exists across the lake without systematic classification. Compliance with LGPD requires knowing where PII lives.
- No quality signals - Users cannot tell whether data has passed tests, how fresh it is, or whether it is certified for production use.
The catalog must address all of these.
## Tooling: DataHub
We recommend DataHub as the data catalog platform, deployed as a self-hosted instance on GCP.
### Why DataHub over GCP-native tools
GCP offers cataloging capabilities through Dataplex and Vertex AI Metadata, but these tools have significant limitations for our use case:
- User experience - Native GCP tools are designed for data engineers navigating the cloud console, not for analysts or business users seeking to discover and understand data. They lack the "home-feeling" of a purpose-built catalog where users can search, browse, and explore without GCP expertise.
- Discoverability - Dataplex is buried within the GCP console, requiring navigation through multiple layers. A standalone catalog provides a single URL that becomes the starting point for data discovery.
- Social features - DataHub offers built-in support for comments, discussions, and certification workflows that GCP tools do not provide at the same depth.
- Cross-platform lineage - DataHub ingests lineage from diverse sources (dbt, Airflow, reporting tools, data science platforms) into a unified graph. GCP-native tools are more siloed.
### Self-hosting rationale
Data governance tools in this category - Atlan, Alation, data.world, Collibra - either have opaque enterprise pricing or costs that scale unpredictably. DataHub is open-source with active development and a large community.
Self-hosting provides:
- Cost predictability - Infrastructure costs are transparent and controllable
- Flexibility - Customization and integration without vendor constraints
- Data residency - Metadata stays within UME's GCP environment
Self-hosting requires operational commitment. Plan for:
- Regular upgrades (establish a quarterly upgrade cadence)
- Monitoring and alerting for the DataHub infrastructure
- Backup and disaster recovery for the metadata database
### Deployment security: Identity-Aware Proxy
Securing access to a self-hosted DataHub does not require VPN tunnels. Google Identity-Aware Proxy (IAP) provides a simpler, GCP-native approach:
- No VPN client required - Users access DataHub through their browser; IAP authenticates via Google Workspace identity
- Context-aware policies - Restrict access based on user groups, device security posture, or IP ranges
- Audit trail - Every access attempt is logged with user identity
- Zero infrastructure overhead - IAP is a managed service; no VPN concentrators to maintain
This aligns with UME's preference for native GCP tooling while avoiding the complexity and user friction of traditional VPN solutions.
## Core Capabilities
The catalog must deliver four capabilities that address the challenges outlined above.
### Data Discovery
Users should find data without needing to know where it lives:
- Search - Full-text search across table names, column names, descriptions, and documentation
- Browse - Navigate by domain, data tier (bronze/silver/gold), or business concept
- Metadata - Every asset displays owner, description, tags, and usage statistics
- Examples - Where appropriate, include sample queries or documentation on how to consume the data
The catalog is the first stop when answering "does this data exist?" or "where do I find data about X?"
### Lineage Visibility
End-to-end lineage is essential for trust and impact analysis:
Sources > Bronze > Silver > Gold > Reports/Models/Predictions
- Upstream lineage - Given a dashboard, trace back through dbt models to source systems
- Downstream lineage - Given a source table, see all dashboards, models, and features that depend on it
- Impact analysis - Before changing a transformation, understand what will be affected
- Root cause tracing - When numbers look wrong, follow the lineage to isolate where issues were introduced
Lineage is captured automatically from dbt manifests, orchestration metadata, and reporting tool connectors. Manual enrichment is available for cases where automated capture is incomplete.
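The upstream, downstream, and impact-analysis use cases above all amount to traversing a lineage graph. A minimal sketch over a hypothetical edge map (in practice this graph is populated by DataHub's ingestion connectors):

```python
from collections import deque

# Hypothetical downstream edge map; real edges come from dbt manifests,
# orchestration metadata, and reporting-tool connectors.
DOWNSTREAM = {
    "source.crm.accounts": ["bronze.accounts"],
    "bronze.accounts": ["silver.accounts"],
    "silver.accounts": ["gold.revenue_daily", "gold.churn_features"],
    "gold.revenue_daily": ["dashboard.exec_revenue"],
    "gold.churn_features": ["model.churn_predictor"],
}

def impacted(asset: str) -> set[str]:
    """Return every asset reachable downstream of the given asset (BFS)."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Changing silver.accounts affects two gold tables, a dashboard, and a model.
print(sorted(impacted("silver.accounts")))
```

Upstream tracing is the same traversal over the reversed edge map, which is why a single unified graph serves both root-cause and impact analysis.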
### Data Quality and Certification
Quality signals build trust. The catalog surfaces test results, freshness indicators, and certification status for every dataset.

The certification workflow follows the model from Reporting - KPI Lifecycle:
- Experimental - Default state; not for production use
- Validated - Passes automated tests, has documentation and owner assigned
- Certified - Reviewed by data steward, stakeholder sign-off, approved for business-critical use
Users viewing a dataset see its certification status prominently displayed.
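The three-level workflow can be made precise as a guarded state machine. The criteria names below are illustrative stand-ins for whatever checks UME formalizes, not DataHub configuration:

```python
# Sketch of the Experimental > Validated > Certified workflow as a
# guarded state machine (requirement names are illustrative).

LEVELS = ["Experimental", "Validated", "Certified"]

REQUIREMENTS = {
    "Validated": {"tests_pass", "has_owner", "has_documentation"},
    "Certified": {"steward_review", "stakeholder_signoff"},
}

def promote(current: str, evidence: set[str]) -> str:
    """Promote one level if the next level's requirements are all met."""
    idx = LEVELS.index(current)
    if idx + 1 == len(LEVELS):
        return current  # already at the top level
    target = LEVELS[idx + 1]
    missing = REQUIREMENTS[target] - evidence
    if missing:
        raise ValueError(f"Cannot reach {target}; missing: {sorted(missing)}")
    return target

status = promote("Experimental", {"tests_pass", "has_owner", "has_documentation"})
print(status)  # → Validated
```

Encoding the gates this way keeps promotions auditable: a dataset can never skip a level, and every refusal names the missing evidence.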
### PII Classification and Governance
The catalog is the central registry for sensitive data:
- Automated scanning - Google Cloud DLP scans buckets and BigQuery tables on a schedule, pushing classification results to DataHub
- Manual tagging - Data stewards can add or correct classifications
- Classification propagation - When a source column is tagged as PII, the catalog can suggest (or auto-apply) tags to downstream columns in lineage
- Access policy linkage - Classifications inform who should have access; the catalog documents which policies apply
See Object Storage - PII detection and ETL - PII and Sensitive Data Handling for related guidance on how PII is handled in those layers.
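Classification propagation boils down to walking column-level lineage outward from already-tagged columns. A minimal sketch with a hypothetical lineage map; DataHub's own propagation features would replace this in practice:

```python
# Sketch of PII tag propagation along column-level lineage.
# The lineage map is a hypothetical stand-in for catalog metadata.

COLUMN_LINEAGE = {
    "bronze.customers.email": ["silver.customers.email"],
    "silver.customers.email": ["gold.marketing_contacts.email"],
}

def suggest_pii_tags(tagged: set[str]) -> set[str]:
    """Suggest PII tags for all columns downstream of tagged columns."""
    suggestions = set()
    frontier = list(tagged)
    while frontier:
        col = frontier.pop()
        for child in COLUMN_LINEAGE.get(col, []):
            if child not in tagged and child not in suggestions:
                suggestions.add(child)
                frontier.append(child)
    return suggestions

# One DLP finding on the bronze column surfaces two downstream candidates.
print(sorted(suggest_pii_tags({"bronze.customers.email"})))
```

Whether suggestions are auto-applied or routed to a data steward for review is a policy decision; the traversal is the same either way.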
## Platform Integration
The catalog is only valuable if it reflects reality. This requires integration with every layer of the platform.
### ETL Integration (dbt + Airflow)
- dbt manifests - Ingest model definitions, column descriptions, tests, and lineage after each dbt run. This is the primary source of truth for transformation metadata.
- Airflow/Composer metadata - Capture DAG definitions, run history, and task lineage, linking each task to the datasets it produces.
- Schema contracts - The catalog is the source of truth for schema definitions (see ETL - Schema enforcement). Changes are versioned and documented.
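To make the dbt integration concrete, the sketch below pulls model-to-model lineage out of an abbreviated `manifest.json` excerpt. Real manifests carry far more (columns, descriptions, tests), and DataHub's dbt connector handles this ingestion; the model and source names here are illustrative:

```python
import json

# Abbreviated dbt manifest excerpt (hypothetical project "ume"); real
# manifests include many more fields that the catalog also ingests.
manifest = json.loads("""
{
  "nodes": {
    "model.ume.stg_orders": {
      "depends_on": {"nodes": ["source.ume.erp.orders"]}
    },
    "model.ume.fct_revenue": {
      "depends_on": {"nodes": ["model.ume.stg_orders"]}
    }
  }
}
""")

def model_lineage(manifest: dict) -> dict[str, list[str]]:
    """Map each dbt node to its upstream dependencies."""
    return {
        name: node["depends_on"]["nodes"]
        for name, node in manifest["nodes"].items()
    }

print(model_lineage(manifest)["model.ume.fct_revenue"])  # → ['model.ume.stg_orders']
```

Because dbt regenerates the manifest on every run, ingesting it after each run keeps catalog lineage in lockstep with the transformations that actually executed.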
### Reporting Tool Integration
- Looker Studio / Metabase connectors - DataHub provides connectors that extract dashboard metadata, including which tables each report queries
- Dashboard ownership - Ownership metadata from Reporting - Ownership and Lineage lives here
- Downstream lineage - Reports appear as downstream consumers of Gold-layer tables, completing the lineage picture
### Data Science Integration
Data science artifacts must be traceable:
- Vertex AI integration - Configure metadata export from Vertex AI to DataHub. Models registered in Vertex should appear in the catalog.
- Lineage for complex pipelines - When data scientists build feature engineering pipelines that cannot be expressed in SQL, those pipelines must emit lineage metadata compatible with the catalog (see Data Science - Complex Pipelines).
### Cost Attribution
Link cost data to catalog entries:
- BigQuery cost reports - Lake Engine generates cost data per table and query. Surface high-cost tables in the catalog.
- Dashboard cost attribution - From Reporting - Cost Attribution, link top-cost dashboards to their catalog entries. Owners see cost alongside other metadata.
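Linking cost to ownership is a simple join once both live in the catalog. A sketch with illustrative figures and hypothetical table names:

```python
# Sketch of joining per-table cost data to catalog ownership so owners
# see cost alongside other metadata (names and figures are illustrative).

catalog = {
    "gold.revenue_daily": {"owner": "finance-data"},
    "gold.churn_features": {"owner": "data-science"},
}
monthly_cost_usd = {
    "gold.revenue_daily": 42.10,
    "gold.churn_features": 731.55,
}

def flag_high_cost(threshold: float) -> list[tuple[str, str, float]]:
    """Return (table, owner, cost) for tables above the cost threshold."""
    return [
        (table, catalog[table]["owner"], cost)
        for table, cost in sorted(monthly_cost_usd.items())
        if cost > threshold
    ]

print(flag_high_cost(100.0))  # only gold.churn_features exceeds the threshold
```

The join key is the table identifier, which is why consistent naming between BigQuery cost exports and catalog entries matters.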
## Governance Recommendations
- Make the catalog the starting point - Train users to begin data discovery in the catalog, not in BigQuery or reporting tools. Link to it from documentation, onboarding materials, and team channels.
- Require ownership - Every dataset in Gold and above must have an owner. Ownership is visible and contact information is accessible.
- Surface test results - Integrate dbt test outcomes so users see whether data passed quality gates before consuming it.
- Establish certification workflow - Define criteria for each certification level. Make certification status visible in search results and asset pages.
- Automate PII scanning - Schedule regular DLP scans; push results to the catalog. Alert on new PII discoveries.
- Capture full lineage - Configure ingestion from dbt, orchestration, reporting tools, and data science platforms. Manual enrichment fills gaps.
- Document schema evolution - Maintain history of schema changes. Alert downstream consumers when breaking changes occur.
- Link cost data - High-cost assets should be flagged. Owners should be aware of the cost profile of their data.
- Use IAP for access - Secure DataHub with Identity-Aware Proxy rather than VPN tunnels.
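The schema-evolution recommendation above hinges on detecting breaking changes between schema versions. A minimal sketch, treating removed columns and type changes as breaking and additions as safe (schema dicts are illustrative):

```python
# Sketch of breaking-change detection between two schema versions, as a
# basis for alerting downstream consumers.

def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Flag removed columns and type changes; additions are non-breaking."""
    issues = []
    for column, col_type in old.items():
        if column not in new:
            issues.append(f"removed column: {column}")
        elif new[column] != col_type:
            issues.append(f"type change: {column} {col_type} -> {new[column]}")
    return sorted(issues)

old = {"order_id": "INT64", "amount": "NUMERIC", "region": "STRING"}
new = {"order_id": "INT64", "amount": "FLOAT64"}
print(breaking_changes(old, new))
```

Run against each versioned schema pair, this yields the list of findings to route to the downstream owners identified by the lineage graph.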
## Tasks
- Deploy DataHub on GKE with Identity-Aware Proxy for secure access
- Configure dbt manifest ingestion; validate lineage and test result visibility
- Configure Airflow/Composer metadata ingestion for orchestration lineage
- Evaluate and configure reporting tool connectors (Looker Studio, Metabase)
- Configure Vertex AI metadata export for data science artifacts
- Set up Cloud DLP scheduled scans with results pushed to DataHub
- Define PII classification taxonomy and tagging workflow
- Document certification criteria for Experimental, Validated, and Certified levels
- Build cost attribution integration: surface BigQuery cost data alongside catalog entries
- Create user onboarding guide: discovery, lineage interpretation, access requests
- Establish quarterly upgrade and maintenance schedule for DataHub