# Data Catalog
The Data Catalog serves as the central governance hub for the data platform. It provides visibility into all data assets, enabling discovery, trust, and compliance.
## Purpose
The catalog addresses key organizational challenges:
- Discovery: "What data do we have and where is it?"
- Trust: "Can I rely on this data?"
- Ownership: "Who should I ask about this data?"
- Compliance: "Is this data being handled appropriately?"
A well-maintained catalog acts as a distributed DBA: it points users to data that already exists instead of letting them create duplicates.
## Core Capabilities
### Data Source Discovery
The catalog indexes all data assets across the platform:
Indexed assets:
- Tables and views
- Columns and data types
- Data dictionaries and descriptions
- Sample data and statistics
Search capabilities:
- Full-text search across metadata
- Filter by domain, owner, or tier
- Semantic search for related concepts
Example: Searching for "RFM" (Recency, Frequency, Monetary) should surface all tables containing customer value metrics, even if column names vary.
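The sketch below illustrates this kind of concept-aware matching: a query for "RFM" is expanded with related terms so tables are found even when column names differ. The `CatalogEntry` structure and the synonym map are illustrative assumptions, not the platform's actual metadata model.

```python
from dataclasses import dataclass, field

# Illustrative catalog entry; the real catalog stores much richer metadata.
@dataclass
class CatalogEntry:
    table: str
    description: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> description

# Hypothetical synonym map so a search for "RFM" also matches related wording.
SYNONYMS = {"rfm": ["recency", "frequency", "monetary", "customer value"]}

def search(entries: list[CatalogEntry], query: str) -> list[CatalogEntry]:
    """Match the query (plus known synonyms) against table, column, and description text."""
    terms = [query.lower(), *SYNONYMS.get(query.lower(), [])]
    hits = []
    for entry in entries:
        haystack = " ".join(
            [entry.table, entry.description, *entry.columns, *entry.columns.values()]
        ).lower()
        if any(term in haystack for term in terms):
            hits.append(entry)
    return hits

entries = [
    CatalogEntry(
        table="gold.customer_value_scores",
        description="Customer value metrics, refreshed daily",
        columns={
            "days_since_last_order": "Recency in days",
            "order_count_90d": "Purchase frequency over 90 days",
            "total_spend_90d": "Monetary value over 90 days",
        },
    ),
]
print([e.table for e in search(entries, "RFM")])  # -> ['gold.customer_value_scores']
```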
### Schema Validation
Automated schema management:
- Schema registration: Document expected structure
- Validation: Compare actual vs. expected schemas (see the sketch after this list)
- Evolution tracking: History of schema changes
- Alerting: Notify on unexpected changes
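A minimal sketch of the validation step, assuming registered schemas are stored as simple column-to-type mappings; the `diff_schemas` helper and the alerting hook are illustrative, not the platform's actual implementation.

```python
def diff_schemas(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list]:
    """Compare a registered schema (column -> type) against the observed schema."""
    return {
        "missing_columns": sorted(set(expected) - set(actual)),
        "unexpected_columns": sorted(set(actual) - set(expected)),
        "type_changes": sorted(
            (col, expected[col], actual[col])
            for col in set(expected) & set(actual)
            if expected[col] != actual[col]
        ),
    }

expected = {"customer_id": "STRING", "order_total": "NUMERIC", "order_ts": "TIMESTAMP"}
actual = {"customer_id": "STRING", "order_total": "FLOAT64", "order_ts": "TIMESTAMP",
          "channel": "STRING"}

drift = diff_schemas(expected, actual)
if any(drift.values()):
    # In production this would raise an alert and append to the evolution history.
    print("Schema drift detected:", drift)
```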
### Lineage Tracking
End-to-end visibility of data flow:
Scope:
- All storage tiers (Bronze → Silver → Gold)
- Downstream artifacts (reports, models, dashboards)
- Cross-system dependencies
Visualization:
- Graph-based lineage explorer
- Impact analysis for proposed changes (see the traversal sketch below)
- Root cause investigation for data issues
Benefits:
- Understand data origins
- Assess change impact
- Debug data quality issues
- Satisfy audit requirements
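Impact analysis boils down to a walk of the lineage graph from the asset being changed to everything downstream of it. A minimal sketch, assuming lineage is available as an adjacency list (the edges here are made up):

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: upstream asset -> downstream assets.
EDGES = {
    "bronze.orders_raw": ["silver.orders"],
    "silver.orders": ["gold.customer_value_scores", "gold.daily_revenue"],
    "gold.daily_revenue": ["dashboard.exec_revenue"],
}

def downstream_impact(asset: str) -> set[str]:
    """Breadth-first traversal listing every asset affected by a change to `asset`."""
    graph = defaultdict(list, EDGES)
    seen, queue = set(), deque([asset])
    while queue:
        for child in graph[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# What would a change to silver.orders touch?
print(sorted(downstream_impact("silver.orders")))
# -> ['dashboard.exec_revenue', 'gold.customer_value_scores', 'gold.daily_revenue']
```

Root cause investigation is the same traversal run over reversed edges, walking upstream instead of downstream.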
### Data Stewardship
Assign clear ownership and accountability:
Ownership model:
- Data Owner: Business role accountable for the data
- Data Steward: Day-to-day quality and governance
- Technical Owner: Pipeline and infrastructure
Steward responsibilities:
- Define and document data definitions
- Validate data quality
- Approve access requests
- Respond to data questions
Governance workflows:
- Certification process for Gold data
- Change approval for critical datasets
- Regular review cycles
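One way to make the ownership model concrete is to store it as structured catalog metadata per dataset. The record below is a hedged sketch; the field names and certification flag follow the roles above but are not the platform's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OwnershipRecord:
    dataset: str
    data_owner: str          # business role accountable for the data
    data_steward: str        # day-to-day quality and governance
    technical_owner: str     # pipeline and infrastructure
    certified: bool = False  # set through the Gold certification workflow

# Hypothetical example entry.
record = OwnershipRecord(
    dataset="gold.customer_value_scores",
    data_owner="head-of-crm@example.com",
    data_steward="crm-analytics@example.com",
    technical_owner="data-platform@example.com",
    certified=True,
)
```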
### Compliance
Support regulatory and policy requirements:
PII Identification:
- Automated scanning for personal data (sketched below)
- Classification and tagging
- Policy enforcement based on classification
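A rule-based sketch of how columns might be scanned and tagged, assuming column names and sample values are available from the catalog; the patterns and tag names are illustrative, and a managed service such as Cloud DLP would normally do the heavy lifting.

```python
import re

# Illustrative detection rules; production scanning would use a managed inspection service.
PII_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def classify_column(name: str, samples: list[str]) -> list[str]:
    """Tag a column with PII classifications based on its name and sample values."""
    tags = []
    for tag, pattern in PII_RULES.items():
        if tag in name.lower() or any(pattern.search(s) for s in samples):
            tags.append(f"pii:{tag}")
    return tags

print(classify_column("contact_email", ["jane.doe@example.com"]))  # ['pii:email']
print(classify_column("order_total", ["42.50"]))                   # []
```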
Compliance Reporting:
- Data inventory reports
- Access audit reports
- Retention compliance status
- PII location inventory
## Building Trust

The catalog is central to building trust in data.
### Trust Indicators

Communicate data reliability through the following signals:
#### Quality Metrics
Surface data quality information:
- Freshness (last update time)
- Completeness (null rates)
- Consistency (cross-reference checks)
- Test results (pass/fail history)
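A minimal sketch of how the first two indicators could be computed for display in the catalog, assuming rows and a last-update timestamp are available; the row format is illustrative.

```python
from datetime import datetime, timezone

def quality_snapshot(rows: list[dict], updated_at: datetime) -> dict:
    """Compute simple freshness and completeness signals for catalog display."""
    columns = rows[0].keys() if rows else []
    total = max(len(rows), 1)
    null_rates = {col: sum(r.get(col) is None for r in rows) / total for col in columns}
    age_hours = (datetime.now(timezone.utc) - updated_at).total_seconds() / 3600
    return {"freshness_hours": round(age_hours, 1), "null_rates": null_rates}

rows = [
    {"customer_id": "c1", "total_spend_90d": 120.0},
    {"customer_id": "c2", "total_spend_90d": None},
]
print(quality_snapshot(rows, datetime(2024, 1, 1, tzinfo=timezone.utc)))
# e.g. {'freshness_hours': ..., 'null_rates': {'customer_id': 0.0, 'total_spend_90d': 0.5}}
```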
#### Usage Signals
Help users find reliable data:
- Query frequency
- Number of downstream dependencies
- User ratings and feedback
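These signals can be blended into a single ranking hint when ordering search results. The weights and log-dampening below are arbitrary illustrative choices, not the platform's ranking formula.

```python
import math

def trust_score(query_count_30d: int, downstream_deps: int, avg_rating: float) -> float:
    """Blend usage signals into one ranking hint (weights are arbitrary examples)."""
    return round(
        0.5 * math.log1p(query_count_30d)    # heavy use, dampened so hot tables don't dominate
        + 0.3 * math.log1p(downstream_deps)  # many dependents imply the data is relied upon
        + 0.2 * avg_rating,                  # direct user feedback on a 1-5 scale
        2,
    )

print(trust_score(query_count_30d=1200, downstream_deps=14, avg_rating=4.5))
```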
## Integration with Dataplex

The platform leverages Google Dataplex for catalog capabilities.
### Current State
- Automatic discovery of BigQuery assets
- LLM-assisted metadata generation (column descriptions)
- Search across all cataloged data
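Cataloged assets can also be searched programmatically. The sketch below uses the google-cloud-datacatalog Python client, assuming the BigQuery assets discovered by Dataplex are indexed there; the project ID is a placeholder, and the platform may instead expose the newer Dataplex Catalog search API.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Limit the search to the platform's project (placeholder ID).
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-gcp-project")

# Full-text search across the cataloged metadata.
results = client.search_catalog(request={"scope": scope, "query": "customer value"})
for result in results:
    print(result.search_result_type, result.linked_resource)
```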
### Roadmap
- Enhanced lineage visualization
- Integration with ETL for automated lineage
- Custom classification policies
- Access request workflows
## User Experience
### Self-Service Discovery
Enable users to find data without asking the data team:
- Search for concepts or keywords
- Browse by domain or use case
- View documentation and examples
- Understand quality and ownership
### Feedback Mechanisms
Allow users to contribute:
- Comments and questions on datasets
- Quality issue reports
- Suggestions for improvements
- Rating data usefulness
### Support Integration
Direct channel to data stewards:
- "Ask a question" functionality
- Escalation for unresolved issues
- Request for new data onboarding
## Best Practices
### For Data Producers
- Document before publishing: Add descriptions before data goes live
- Assign ownership: Every dataset needs an accountable owner
- Include examples: Show how data should be queried
- Maintain freshness: Keep documentation current
### For Data Consumers
- Search first: Check the catalog before requesting new data
- Verify certification: Prefer certified datasets for critical work
- Provide feedback: Report issues and suggestions
- Respect policies: Follow access and usage guidelines
## Metrics

Track catalog health and adoption over time, for example documentation and ownership coverage, search activity, and the share of certified datasets.
## Related Sections

- ETL: How lineage is captured
- Object Storage: Data organization and tiers
- Cross-Cutting: Compliance and access controls