# Object Storage
Google Cloud Storage (GCS) is the foundational storage layer for the data platform. All data flows through GCS before being made available for querying and analysis.
## Medallion Architecture
Data is organized into three tiers according to its maturity and quality:
### Bronze Layer
Purpose: Raw data preservation
The Bronze layer stores data exactly as received from source systems, with minimal or no transformation. This ensures:
- Auditability: Original data is always available for reference
- Reprocessing: Ability to rebuild downstream layers if logic changes
- Debugging: Investigate issues by comparing raw vs. transformed data
Characteristics:
- Schema-on-read approach
- Partitioned by ingestion time (see the sketch after this list)
- Retained according to compliance requirements
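A minimal sketch of ingestion-time partitioning using the google-cloud-storage client; the bucket name, source label, and Hive-style path layout are illustrative assumptions:

```python
from datetime import datetime, timezone

from google.cloud import storage  # pip install google-cloud-storage


def land_raw_payload(payload: bytes, source: str, filename: str) -> str:
    """Write a payload to the bronze bucket exactly as received."""
    client = storage.Client()
    bucket = client.bucket("ume-prod-credit-bronze")  # hypothetical bucket
    # Partition by ingestion date so downstream layers can be rebuilt
    # for any window without touching the source system.
    ingest_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    blob = bucket.blob(f"{source}/ingest_date={ingest_date}/{filename}")
    blob.upload_from_string(payload)  # stored verbatim: raw preservation
    return blob.name
```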
### Silver Layer
Purpose: Cleaned and validated data
The Silver layer contains data that has been:
- Cleaned: Null handling, deduplication, type corrections
- Validated: Passed quality checks and assertions
- Conformed: Standardized naming, consistent formats
Characteristics:
- Defined schemas with documentation
- Business key deduplication (sketched after this list)
- Cross-source entity resolution where applicable
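A minimal sketch of business-key deduplication: keep only the latest version of each record. The field names account_id and updated_at are hypothetical defaults, not platform conventions:

```python
from typing import Iterable


def deduplicate(records: Iterable[dict],
                key_field: str = "account_id",     # hypothetical business key
                version_field: str = "updated_at"  # hypothetical version column
                ) -> list[dict]:
    """Keep only the most recent record per business key."""
    latest: dict = {}
    for record in records:
        key = record[key_field]
        if key not in latest or record[version_field] > latest[key][version_field]:
            latest[key] = record
    return list(latest.values())
```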
### Gold Layer
Purpose: Business-ready, curated datasets
The Gold layer provides consumption-ready data:
- Aggregated: Pre-computed metrics and summaries (see the sketch below)
- Curated: Business logic applied, ready for reporting
- Governed: Assigned data stewards, validated for accuracy
Characteristics:
- Optimized for query performance
- Documented business definitions
- Subject to change management processes
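As one way to materialize a pre-computed Gold aggregate, the sketch below runs a CREATE OR REPLACE TABLE statement through the BigQuery Python client; the project, dataset, and column names are illustrative assumptions:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="ume-prod-data")  # hypothetical project

# Hypothetical daily summary built from a Silver table.
query = """
CREATE OR REPLACE TABLE gold_collections.daily_recovery_summary AS
SELECT portfolio,
       DATE(paid_at) AS day,
       SUM(amount)   AS recovered_amount
FROM silver_collections.payments
GROUP BY portfolio, day
"""
client.query(query).result()  # block until the table is (re)built
```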
## Project Organization
Data is separated into GCP projects based on domain, team, or access requirements.
### Benefits of Project Separation
- Cost allocation: Clear billing attribution per team or domain
- Access boundaries: IAM policies scoped to project level
- Blast radius: Issues in one project don't affect others
- Quota management: Resource limits per project
### Project Structure Guidelines
- Central data project: Core data assets shared across the organization
- Domain projects: Team-specific datasets and transformations
- Sandbox projects: Experimentation with controlled access
- Production vs. non-production: Separate environments for safety
### Cross-Project Access
When data needs to be shared across projects:
- Use dataset-level permissions rather than project-level roles (see the sketch after this list)
- Document cross-project dependencies in the data catalog
- Monitor cross-project query costs
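A sketch of a dataset-level grant using the BigQuery Python client, which avoids handing out project-wide roles; the dataset ID and group address are hypothetical:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
# Dataset in the central data project; hypothetical ID.
dataset = client.get_dataset("ume-prod-data.silver_credit")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="collections-analysts@example.com",  # hypothetical group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # dataset-level, not project-level
```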
## Storage Security
### Customer-Managed KMS Keys
Customer-managed encryption keys (CMEK) provide:
- Control: Organization owns the encryption keys
- Audit: Key usage is logged and monitored
- Compliance: Meet regulatory requirements for data protection
Implementation (see the sketch after this list):
- Work with Infrastructure team on key lifecycle management
- Rotate keys according to security policy
- Document key assignments per bucket
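A sketch of assigning a default CMEK to a bucket so new objects are encrypted with it; the key resource name and bucket are hypothetical, and key creation and rotation stay with the Infrastructure team:

```python
from google.cloud import storage  # pip install google-cloud-storage

# Hypothetical key managed by the Infrastructure team in Cloud KMS.
KMS_KEY = (
    "projects/ume-security/locations/us/keyRings/data-platform/"
    "cryptoKeys/bronze-default"
)

client = storage.Client()
bucket = client.get_bucket("ume-prod-credit-bronze")  # hypothetical bucket
bucket.default_kms_key_name = KMS_KEY  # new objects use this CMEK by default
bucket.patch()
```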
### Access Policies
Fine-grained IAM controls ensure least-privilege access:
- Bucket-level: Who can read/write to storage
- Object-level: Granular controls when needed
- Conditional access: Time-based, IP-based restrictions
Best Practices:
- Avoid broad project-level permissions
- Use groups rather than individual users (as in the sketch below)
- Conduct regular access reviews
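A sketch of a bucket-level, group-based grant with the google-cloud-storage client; the bucket and group are hypothetical:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("ume-prod-credit-silver")  # hypothetical bucket

# Version 3 policies are required if conditional bindings are in play.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.version = 3
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:credit-readers@example.com"},  # group, not individuals
    }
)
bucket.set_iam_policy(policy)
```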
### Audit Trail
Comprehensive logging for compliance and security:
- Access logs: Who accessed what and when
- Admin logs: Configuration and permission changes
- Retention: Logs retained according to company policy
## Lifecycle Policies
### Data Retention
Define retention periods based on:
- Business value: How long is data useful?
- Compliance: Regulatory minimum retention
- Cost: Storage costs vs. value
### Storage Classes
Optimize costs with appropriate storage tiers:
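| Class | Typical access pattern | Minimum storage duration |
| --- | --- | --- |
| Standard | Hot data, frequent access | None |
| Nearline | About once a month or less | 30 days |
| Coldline | About once a quarter or less | 90 days |
| Archive | Less than once a year | 365 days |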
### Automatic Transitions
Configure lifecycle rules to automatically:
- Move data to colder storage after defined period
- Delete temporary or expired data
- Clean up incomplete uploads
Example policy (implemented in the sketch after this list):
- Bronze data → Nearline after 90 days
- Staging data → Delete after 30 days
- Archive data → Coldline after 1 year
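A sketch of the example policy using the google-cloud-storage lifecycle helpers; the bucket names are hypothetical, and each rule appends to whatever rules already exist on the bucket:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()

# Bronze data -> Nearline after 90 days.
bronze = client.get_bucket("ume-prod-credit-bronze")    # hypothetical
bronze.add_lifecycle_set_storage_class_rule("NEARLINE", age=90)
bronze.patch()

# Staging data -> delete after 30 days.
staging = client.get_bucket("ume-prod-credit-staging")  # hypothetical
staging.add_lifecycle_delete_rule(age=30)
staging.patch()

# Archive data -> Coldline after 1 year.
archive = client.get_bucket("ume-prod-credit-archive")  # hypothetical
archive.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
archive.patch()
```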
## File Format Recommendations
### Preferred Formats
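- Parquet: columnar and compressed; the default for Silver and Gold data read analytically
- Avro: row-oriented with an embedded schema; well suited to streaming and Bronze ingestion
- Compressed text (e.g., gzipped NDJSON): acceptable for Bronze landing zones when sources emit JSON
- Prefer fewer, larger files over many small ones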
### Anti-Patterns
Avoid patterns that increase costs and degrade performance (a compaction sketch follows this list):
- Raw JSON in BigQuery: Expensive scans, poor performance
- Many small files: Overhead in listing and reading
- Uncompressed text: Wasted storage and transfer costs
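A sketch that remediates all three anti-patterns at once: compact many small newline-delimited JSON files into a single Snappy-compressed Parquet file with pyarrow. The paths and the NDJSON assumption are illustrative:

```python
import json
from pathlib import Path

import pyarrow as pa          # pip install pyarrow
import pyarrow.parquet as pq


def compact_ndjson_to_parquet(json_dir: str, out_path: str) -> None:
    """Merge small newline-delimited JSON files into one compressed Parquet file."""
    rows: list[dict] = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open() as f:
            rows.extend(json.loads(line) for line in f if line.strip())
    table = pa.Table.from_pylist(rows)  # infers a columnar schema
    pq.write_table(table, out_path, compression="snappy")
```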
## Bucket Naming Convention
Establish consistent naming for discoverability and automation:
`{project}-{environment}-{domain}-{tier}`

Examples:
- `ume-prod-credit-bronze`
- `ume-prod-collections-silver`
- `ume-dev-analytics-gold`
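A small helper can assemble and sanity-check names against this convention; the regex below is an illustrative approximation of GCS naming rules, not the full specification:

```python
import re

# Lowercase alphanumeric segments separated by hyphens (approximation).
_NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+){3,}$")


def bucket_name(project: str, environment: str, domain: str, tier: str) -> str:
    """Assemble a bucket name as {project}-{environment}-{domain}-{tier}."""
    name = f"{project}-{environment}-{domain}-{tier}".lower()
    if not _NAME_RE.match(name):
        raise ValueError(f"invalid bucket name: {name}")
    return name


assert bucket_name("ume", "prod", "credit", "bronze") == "ume-prod-credit-bronze"
```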
## Related Sections
- Data Sources - Where data originates
- Lake Engine - How data is queried
- ETL - How data moves between tiers