# Object Storage

Google Cloud Storage (GCS) serves as the foundational storage layer for the data platform. All data flows through GCS before being made available for querying and analysis.

## Medallion Architecture

Data is organized into three tiers based on its maturity and quality level:

### Bronze Layer

Purpose: Raw data preservation

The Bronze layer stores data exactly as received from source systems, with minimal or no transformation. This ensures:

  • Auditability: Original data is always available for reference
  • Reprocessing: Ability to rebuild downstream layers if logic changes
  • Debugging: Investigate issues by comparing raw vs. transformed data

Characteristics:

  • Schema-on-read approach
  • Partitioned by ingestion time
  • Retained according to compliance requirements
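
For illustration, a minimal sketch of landing a raw payload in a Bronze bucket under an ingestion-date partition, using the `google-cloud-storage` client. The source name, path layout, and bucket name are assumptions, not a prescribed implementation:

```python
from datetime import datetime, timezone

from google.cloud import storage


def land_raw_payload(payload: bytes, source: str, bucket_name: str) -> str:
    """Store a payload exactly as received, partitioned by ingestion time."""
    ingested_at = datetime.now(timezone.utc)

    # Hive-style partition key on ingestion date; the path layout is illustrative.
    blob_path = (
        f"{source}/ingestion_date={ingested_at:%Y-%m-%d}/"
        f"{ingested_at:%H%M%S}_{source}.json"
    )

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # No parsing or transformation -- Bronze preserves the original bytes.
    bucket.blob(blob_path).upload_from_string(payload, content_type="application/json")
    return blob_path


# Example call (bucket name follows the naming convention described later):
# land_raw_payload(response_bytes, "credit_bureau", "ume-prod-credit-bronze")
```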

### Silver Layer

Purpose: Cleaned and validated data

The Silver layer contains data that has been:

  • Cleaned: Null handling, deduplication, type corrections
  • Validated: Passed quality checks and assertions
  • Conformed: Standardized naming, consistent formats

Characteristics:

  • Defined schemas with documentation
  • Business key deduplication
  • Cross-source entity resolution where applicable
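
As an illustration of these steps, a sketch of a Silver transformation in pandas; the column names (`customer_id`, `updated_at`) are hypothetical business keys, not a fixed schema:

```python
import pandas as pd


def to_silver(bronze_df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and deduplicate Bronze records (columns are illustrative)."""
    df = bronze_df.copy()

    # Conform: standardized types and formats.
    df["customer_id"] = df["customer_id"].astype("string").str.strip()
    df["updated_at"] = pd.to_datetime(df["updated_at"], utc=True, errors="coerce")

    # Clean: drop records that fail the basic null check on the business key.
    df = df.dropna(subset=["customer_id", "updated_at"])

    # Deduplicate on the business key, keeping the most recent record.
    return df.sort_values("updated_at").drop_duplicates("customer_id", keep="last")
```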

### Gold Layer

Purpose: Business-ready, curated datasets

The Gold layer provides consumption-ready data:

  • Aggregated: Pre-computed metrics and summaries
  • Curated: Business logic applied, ready for reporting
  • Governed: Assigned data stewards, validated for accuracy

Characteristics:

  • Optimized for query performance
  • Documented business definitions
  • Subject to change management processes
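
For example, a Gold-style pre-computed summary built from Silver data; the metric and column names (`product`, `balance`) are placeholders:

```python
import pandas as pd


def build_monthly_metrics(silver_df: pd.DataFrame) -> pd.DataFrame:
    """Pre-compute monthly metrics for reporting (columns are placeholders)."""
    return (
        silver_df
        .assign(month=silver_df["updated_at"].dt.strftime("%Y-%m"))
        .groupby(["month", "product"], as_index=False)
        .agg(
            active_customers=("customer_id", "nunique"),
            total_balance=("balance", "sum"),
        )
    )
```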

## Project Organization

Data is separated into GCP projects based on domain, team, or access requirements.

### Benefits of Project Separation

  1. Cost allocation: Clear billing attribution per team or domain
  2. Access boundaries: IAM policies scoped to project level
  3. Blast radius: Issues in one project don't affect others
  4. Quota management: Resource limits per project

### Project Structure Guidelines

  • Central data project: Core data assets shared across the organization
  • Domain projects: Team-specific datasets and transformations
  • Sandbox projects: Experimentation with controlled access
  • Production vs. non-production: Separate environments for safety

### Cross-Project Access

When data needs to be shared across projects:

  1. Use dataset-level permissions rather than project-level
  2. Document cross-project dependencies in the data catalog
  3. Monitor cross-project query costs
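
As a sketch of point 1, granting dataset-level read access to a consuming team's group with the `google-cloud-bigquery` client; the project, dataset, and group names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="central-data-project")  # placeholder project ID

# Grant access on one dataset only, rather than at the project level.
dataset = client.get_dataset("central-data-project.credit_silver")  # placeholder dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="collections-analysts@example.com",  # a group, not individual users
    )
)
dataset.access_entries = entries

client.update_dataset(dataset, ["access_entries"])
```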

## Storage Security

### Customer-Managed Encryption Keys (CMEK)

Customer-managed encryption keys (CMEK) provide:

  • Control: Organization owns the encryption keys
  • Audit: Key usage is logged and monitored
  • Compliance: Meet regulatory requirements for data protection

Implementation:

  • Work with Infrastructure team on key lifecycle management
  • Rotate keys according to security policy
  • Document key assignments per bucket
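
A minimal sketch of assigning a default CMEK to a bucket with the `google-cloud-storage` client; the key resource name and bucket are placeholders, and the Cloud Storage service agent must already hold the Cloud KMS CryptoKey Encrypter/Decrypter role on the key:

```python
from google.cloud import storage

# Placeholder key resource name -- the real key is provisioned by Infrastructure.
KMS_KEY = (
    "projects/ume-security/locations/europe-west1/"
    "keyRings/data-platform/cryptoKeys/gcs-bronze"
)

client = storage.Client()
bucket = client.get_bucket("ume-prod-credit-bronze")  # placeholder bucket

# New objects written to this bucket are encrypted with the key by default.
bucket.default_kms_key_name = KMS_KEY
bucket.patch()
```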

### Access Policies

Fine-grained IAM controls ensure least-privilege access:

  • Bucket-level: Who can read/write to storage
  • Object-level: Granular controls when needed
  • Conditional access: Time-based, IP-based restrictions

Best Practices:

  • Avoid broad project-level permissions
  • Use groups rather than individual users
  • Regular access reviews
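
For example, a sketch of a bucket-level, time-bound grant to a group using IAM Conditions (requires uniform bucket-level access and IAM policy version 3); the bucket, group, and expiry are placeholders:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("ume-prod-credit-silver")  # placeholder bucket

# IAM Conditions require policy version 3.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.version = 3

# Bucket-level read access for a group (not individuals) that expires automatically.
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:credit-analysts@example.com"},
        "condition": {
            "title": "temporary-read-access",
            "description": "Expires automatically; renew via access review.",
            "expression": 'request.time < timestamp("2025-07-01T00:00:00Z")',
        },
    }
)

bucket.set_iam_policy(policy)
```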

### Audit Trail

Comprehensive logging for compliance and security:

  • Access logs: Who accessed what and when
  • Admin logs: Configuration and permission changes
  • Retention: Logs retained according to company policy

## Lifecycle Policies

### Data Retention

Define retention periods based on:

  • Business value: How long is data useful?
  • Compliance: Regulatory minimum retention
  • Cost: Storage costs vs. value

### Storage Classes

Optimize costs with appropriate storage tiers:

| Class | Use Case | Access |
| --- | --- | --- |
| Standard | Frequently accessed | Immediate |
| Nearline | Roughly monthly access | Immediate (retrieval cost applies) |
| Coldline | Roughly quarterly access | Immediate (retrieval cost applies) |
| Archive | Rarely accessed (less than yearly) | Immediate (highest retrieval cost) |

### Automatic Transitions

Configure lifecycle rules to automatically:

  1. Move data to colder storage after defined period
  2. Delete temporary or expired data
  3. Clean up incomplete uploads

Example policy:

  • Bronze data → Nearline after 90 days
  • Staging data → Delete after 30 days
  • Archive data → Coldline after 1 year
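
A sketch of the first two rules in this example policy using the `google-cloud-storage` client (bucket names are placeholders); note that `add_lifecycle_*` appends to any rules already configured on the bucket:

```python
from google.cloud import storage

client = storage.Client()

# Bronze data -> Nearline after 90 days.
bronze = client.get_bucket("ume-prod-credit-bronze")  # placeholder bucket
bronze.add_lifecycle_set_storage_class_rule("NEARLINE", age=90)
bronze.patch()

# Staging data -> delete after 30 days.
staging = client.get_bucket("ume-prod-credit-staging")  # placeholder bucket
staging.add_lifecycle_delete_rule(age=30)
staging.patch()
```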

## File Format Recommendations

### Preferred Formats

| Format | Use Case | Benefits |
| --- | --- | --- |
| Parquet | Analytical workloads | Columnar, compressed, embedded schema |
| ORC | Heavy aggregations | Optimized for large scans |
| Avro | Schema evolution | Row-based, schema registry support |
| JSON (compressed) | Semi-structured data | Flexible, widely supported |

### Anti-Patterns

Avoid patterns that increase costs:

  • Raw JSON in BigQuery: Expensive scans, poor performance
  • Many small files: Overhead in listing and reading
  • Uncompressed text: Wasted storage and transfer costs
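
For illustration, a sketch of consolidating many small newline-delimited JSON files into a single compressed Parquet file; paths are illustrative, and pandas with the pyarrow Parquet engine is assumed:

```python
import glob

import pandas as pd

# Read many small newline-delimited JSON exports (paths are illustrative).
frames = [pd.read_json(path, lines=True) for path in sorted(glob.glob("exports/*.json"))]
events = pd.concat(frames, ignore_index=True)

# One columnar, compressed file avoids small-file overhead and raw-JSON scan costs.
events.to_parquet("events.parquet", compression="snappy", index=False)
```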

## Bucket Naming Convention

Establish consistent naming for discoverability and automation:

`{project}-{environment}-{domain}-{tier}`

Examples:

  • ume-prod-credit-bronze
  • ume-prod-collections-silver
  • ume-dev-analytics-gold
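
A small helper that builds and validates names against this convention; the allowed environment and tier values are assumptions drawn from the examples above:

```python
import re

# {project}-{environment}-{domain}-{tier}; allowed values are assumptions.
BUCKET_NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z][a-z0-9]*)-(?P<environment>prod|dev)"
    r"-(?P<domain>[a-z][a-z0-9]*)-(?P<tier>bronze|silver|gold)$"
)


def bucket_name(project: str, environment: str, domain: str, tier: str) -> str:
    """Build a bucket name following the convention; raise if it does not conform."""
    name = f"{project}-{environment}-{domain}-{tier}".lower()
    if not BUCKET_NAME_PATTERN.match(name):
        raise ValueError(f"bucket name does not follow the convention: {name}")
    return name


assert bucket_name("ume", "prod", "credit", "bronze") == "ume-prod-credit-bronze"
```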

## Related Sections