# Object Storage

Google Cloud Storage (GCS) serves as the foundational storage layer for the data platform. All data flows through GCS before being made available for querying and analysis.

## Medallion Architecture

Data is organized into three tiers based on its maturity and quality level:

### Bronze Layer

Purpose: Raw data preservation

The Bronze layer stores data exactly as received from source systems, with minimal or no transformation. This ensures:

  • Auditability: Original data is always available for reference
  • Reprocessing: Ability to rebuild downstream layers if logic changes
  • Debugging: Investigate issues by comparing raw vs. transformed data

Characteristics:

  • Schema-on-read approach
  • Partitioned by ingestion time
  • Retained according to compliance requirements
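
For illustration, a minimal sketch of landing a raw payload in a Bronze bucket under an ingestion-date partition, using the `google-cloud-storage` client. The source name, path layout, and bucket name are assumptions, not a prescribed implementation:

```python
from datetime import datetime, timezone

from google.cloud import storage


def land_raw_payload(payload: bytes, source: str, bucket_name: str) -> str:
    """Store a payload exactly as received, partitioned by ingestion time."""
    ingested_at = datetime.now(timezone.utc)

    # Hive-style partition key on ingestion date; the path layout is illustrative.
    blob_path = (
        f"{source}/ingestion_date={ingested_at:%Y-%m-%d}/"
        f"{ingested_at:%H%M%S}_{source}.json"
    )

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # No parsing or transformation -- Bronze preserves the original bytes.
    bucket.blob(blob_path).upload_from_string(payload, content_type="application/json")
    return blob_path


# Example call (bucket name follows the naming convention described later):
# land_raw_payload(response_bytes, "credit_bureau", "ume-prod-credit-bronze")
```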

### Silver Layer

Purpose: Cleaned and validated data

The Silver layer contains data that has been:

  • Cleaned: Null handling, deduplication, type corrections
  • Validated: Passed quality checks and assertions
  • Conformed: Standardized naming, consistent formats

Characteristics:

  • Defined schemas with documentation
  • Business key deduplication
  • Cross-source entity resolution where applicable
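
As an illustration of these steps, a sketch of a Silver transformation in pandas; the column names (`customer_id`, `updated_at`) are hypothetical business keys, not a fixed schema:

```python
import pandas as pd


def to_silver(bronze_df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and deduplicate Bronze records (columns are illustrative)."""
    df = bronze_df.copy()

    # Conform: standardized types and formats.
    df["customer_id"] = df["customer_id"].astype("string").str.strip()
    df["updated_at"] = pd.to_datetime(df["updated_at"], utc=True, errors="coerce")

    # Clean: drop records that fail the basic null check on the business key.
    df = df.dropna(subset=["customer_id", "updated_at"])

    # Deduplicate on the business key, keeping the most recent record.
    return df.sort_values("updated_at").drop_duplicates("customer_id", keep="last")
```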

### Gold Layer

Purpose: Business-ready, curated datasets

The Gold layer provides consumption-ready data:

  • Aggregated: Pre-computed metrics and summaries
  • Curated: Business logic applied, ready for reporting
  • Governed: Assigned data stewards, validated for accuracy

Characteristics:

  • Optimized for query performance
  • Documented business definitions
  • Subject to change management processes
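
For example, a Gold-style pre-computed summary built from Silver data; the metric and column names (`product`, `balance`) are placeholders:

```python
import pandas as pd


def build_monthly_metrics(silver_df: pd.DataFrame) -> pd.DataFrame:
    """Pre-compute monthly metrics for reporting (columns are placeholders)."""
    return (
        silver_df
        .assign(month=silver_df["updated_at"].dt.strftime("%Y-%m"))
        .groupby(["month", "product"], as_index=False)
        .agg(
            active_customers=("customer_id", "nunique"),
            total_balance=("balance", "sum"),
        )
    )
```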

## Project Organization

Data is separated into GCP projects based on domain, team, or access requirements.

### Benefits of Project Separation

  1. Cost allocation: Clear billing attribution per team or domain
  2. Access boundaries: IAM policies scoped to project level
  3. Blast radius: Issues in one project don't affect others
  4. Quota management: Resource limits per project

### Project Structure Guidelines

  • Central data project: Core data assets shared across the organization
  • Domain projects: Team-specific datasets and transformations
  • Sandbox projects: Experimentation with controlled access
  • Production vs. non-production: Separate environments for safety

### Cross-Project Access

When data needs to be shared across projects:

  1. Use dataset-level permissions rather than project-level
  2. Document cross-project dependencies in the data catalog
  3. Monitor cross-project query costs
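
As a sketch of point 1, granting dataset-level read access to a consuming team's group with the `google-cloud-bigquery` client; the project, dataset, and group names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="central-data-project")  # placeholder project ID

# Grant access on one dataset only, rather than at the project level.
dataset = client.get_dataset("central-data-project.credit_silver")  # placeholder dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="collections-analysts@example.com",  # a group, not individual users
    )
)
dataset.access_entries = entries

client.update_dataset(dataset, ["access_entries"])
```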

## Storage Security

### Customer-Managed Encryption Keys (CMEK)

Customer-managed encryption keys (CMEK) provide:

  • Control: Organization owns the encryption keys
  • Audit: Key usage is logged and monitored
  • Compliance: Meet regulatory requirements for data protection

Implementation:

  • Work with Infrastructure team on key lifecycle management
  • Rotate keys according to security policy
  • Document key assignments per bucket
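
A minimal sketch of assigning a default CMEK to a bucket with the `google-cloud-storage` client; the key resource name and bucket are placeholders, and the Cloud Storage service agent must already hold the Cloud KMS CryptoKey Encrypter/Decrypter role on the key:

```python
from google.cloud import storage

# Placeholder key resource name -- the real key is provisioned by Infrastructure.
KMS_KEY = (
    "projects/ume-security/locations/europe-west1/"
    "keyRings/data-platform/cryptoKeys/gcs-bronze"
)

client = storage.Client()
bucket = client.get_bucket("ume-prod-credit-bronze")  # placeholder bucket

# New objects written to this bucket are encrypted with the key by default.
bucket.default_kms_key_name = KMS_KEY
bucket.patch()
```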

### Access Policies

Fine-grained IAM controls ensure least-privilege access:

  • Bucket-level: Who can read/write to storage
  • Object-level: Granular controls when needed
  • Conditional access: Time-based, IP-based restrictions

Best Practices:

  • Avoid broad project-level permissions
  • Use groups rather than individual users
  • Regular access reviews
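
For example, a sketch of a bucket-level, time-bound grant to a group using IAM Conditions (requires uniform bucket-level access and IAM policy version 3); the bucket, group, and expiry are placeholders:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("ume-prod-credit-silver")  # placeholder bucket

# IAM Conditions require policy version 3.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.version = 3

# Bucket-level read access for a group (not individuals) that expires automatically.
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:credit-analysts@example.com"},
        "condition": {
            "title": "temporary-read-access",
            "description": "Expires automatically; renew via access review.",
            "expression": 'request.time < timestamp("2025-07-01T00:00:00Z")',
        },
    }
)

bucket.set_iam_policy(policy)
```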

### Audit Trail

Comprehensive logging for compliance and security:

  • Access logs: Who accessed what and when
  • Admin logs: Configuration and permission changes
  • Retention: Logs retained according to company policy

## Lifecycle Policies

### Data Retention

Define retention periods based on:

  • Business value: How long is data useful?
  • Compliance: Regulatory minimum retention
  • Cost: Storage costs vs. value

### Storage Classes

Optimize costs with appropriate storage tiers:

| Class | Use Case | Access |
| --- | --- | --- |
| Standard | Frequently accessed | Immediate |
| Nearline | Roughly monthly access | Immediate (retrieval cost applies) |
| Coldline | Roughly quarterly access | Immediate (retrieval cost applies) |
| Archive | Rarely accessed (less than yearly) | Immediate (highest retrieval cost) |

### Automatic Transitions

Configure lifecycle rules to automatically:

  1. Move data to colder storage after defined period
  2. Delete temporary or expired data
  3. Clean up incomplete uploads

Example policy:

  • Bronze data → Nearline after 90 days
  • Staging data → Delete after 30 days
  • Archive data → Coldline after 1 year
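
A sketch of the first two rules in this example policy using the `google-cloud-storage` client (bucket names are placeholders); note that `add_lifecycle_*` appends to any rules already configured on the bucket:

```python
from google.cloud import storage

client = storage.Client()

# Bronze data -> Nearline after 90 days.
bronze = client.get_bucket("ume-prod-credit-bronze")  # placeholder bucket
bronze.add_lifecycle_set_storage_class_rule("NEARLINE", age=90)
bronze.patch()

# Staging data -> delete after 30 days.
staging = client.get_bucket("ume-prod-credit-staging")  # placeholder bucket
staging.add_lifecycle_delete_rule(age=30)
staging.patch()
```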

## File Format Recommendations

### Preferred Formats

| Format | Use Case | Benefits |
| --- | --- | --- |
| Parquet | Analytical workloads | Columnar, compressed, embedded schema |
| ORC | Heavy aggregations | Optimized for large scans |
| Avro | Schema evolution | Row-based, schema registry support |
| JSON (compressed) | Semi-structured data | Flexible, widely supported |

### Anti-Patterns

Avoid patterns that increase costs:

  • Raw JSON in BigQuery: Expensive scans, poor performance
  • Many small files: Overhead in listing and reading
  • Uncompressed text: Wasted storage and transfer costs
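
For illustration, a sketch of consolidating many small newline-delimited JSON files into a single compressed Parquet file; paths are illustrative, and pandas with the pyarrow Parquet engine is assumed:

```python
import glob

import pandas as pd

# Read many small newline-delimited JSON exports (paths are illustrative).
frames = [pd.read_json(path, lines=True) for path in sorted(glob.glob("exports/*.json"))]
events = pd.concat(frames, ignore_index=True)

# One columnar, compressed file avoids small-file overhead and raw-JSON scan costs.
events.to_parquet("events.parquet", compression="snappy", index=False)
```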

## Bucket Naming Convention

Establish consistent naming for discoverability and automation:

`{project}-{environment}-{domain}-{tier}`

Examples:

  • ume-prod-credit-bronze
  • ume-prod-collections-silver
  • ume-dev-analytics-gold
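
A small helper that builds and validates names against this convention; the allowed environment and tier values are assumptions drawn from the examples above:

```python
import re

# {project}-{environment}-{domain}-{tier}; allowed values are assumptions.
BUCKET_NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z][a-z0-9]*)-(?P<environment>prod|dev)"
    r"-(?P<domain>[a-z][a-z0-9]*)-(?P<tier>bronze|silver|gold)$"
)


def bucket_name(project: str, environment: str, domain: str, tier: str) -> str:
    """Build a bucket name following the convention; raise if it does not conform."""
    name = f"{project}-{environment}-{domain}-{tier}".lower()
    if not BUCKET_NAME_PATTERN.match(name):
        raise ValueError(f"bucket name does not follow the convention: {name}")
    return name


assert bucket_name("ume", "prod", "credit", "bronze") == "ume-prod-credit-bronze"
```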

## Related Sections