# Lake Engine

The Lake Engine provides the query and federation layer that sits between storage and consumption. It enables unified access to data across different storage tiers and sources while enforcing access policies and optimizing costs.

# Data Federation

Data federation allows queries to access data from multiple sources as if they were a single database:

# Federated Sources

  • Object Storage (GCS): Query Parquet, ORC, and other formats directly
  • BigQuery datasets: Native BigQuery tables and views
  • External tables: Data in other systems accessible via connectors

# Benefits

  1. Unified view: Single query language across sources
  2. Reduced data movement: Query in place without copying
  3. Centralized governance: Access policies applied consistently
  4. Cost optimization: Choose the right storage for each use case

# BigQuery Data Federation

BigQuery serves as the primary federation layer, providing:

  • External tables over GCS buckets
  • Cross-project dataset access
  • Federated queries to Cloud SQL and other sources

# Access Policies

Access control is enforced at query time, ensuring users only see data they're authorized to access.

# Policy Enforcement

  1. Dataset-level: Control access to entire datasets
  2. Table-level: Restrict access to specific tables
  3. Column-level: Mask or hide sensitive columns
  4. Row-level: Filter rows based on user attributes (RLS)

# Row-Level Security (RLS)

Row-Level Security filters query results based on user identity:

Use cases:

  • Tenant data isolation (each tenant sees only their data)
  • Regional access (users see data for their region)
  • PII protection (limit access to sensitive records)

# Column-Level Security

Protect sensitive columns without restricting table access:

  • Masking: Show partial data (e.g., last 4 digits of phone)
  • Null replacement: Hide values for unauthorized users
  • Policy tags: Tag columns with sensitivity levels

# Capacity Models

The platform supports two capacity models to balance cost and performance:

# On-Demand Capacity

How it works: Pay per query based on bytes scanned

Best for:

  • Ad-hoc analysis and exploration
  • Variable or unpredictable workloads
  • Development and testing

Cost considerations:

  • Costs scale with data volume scanned
  • Inefficient queries can be expensive
  • No upfront commitment

Optimization strategies:

  • Partition tables by commonly filtered columns
  • Use clustering for frequently joined columns
  • Materialize common aggregations

# Provisioned Capacity

How it works: Reserve dedicated compute capacity (slots)

Best for:

  • Predictable, recurring workloads
  • Dashboards with many concurrent users
  • Cost predictability requirements

Options:

  • BigQuery Slots: Reserved compute capacity
  • BI Engine: In-memory acceleration for dashboards

Cost considerations:

  • Fixed cost regardless of query volume
  • More economical for high-volume workloads
  • Requires capacity planning

# Cost Optimization

BigQuery costs have been identified as a significant expense. Key strategies include:

# Query Optimization

  1. **Avoid SELECT ***: Query only needed columns
  2. Use partitioned tables: Filter on partition columns
  3. Limit view nesting: Deep view hierarchies multiply scans
  4. Preview before running: Estimate costs for large queries

# Data Organization

  1. Partition tables: By date or other high-cardinality columns
  2. Cluster tables: By frequently filtered columns
  3. Materialize views: Pre-compute common aggregations
  4. Archive unused data: Move to cold storage

# Governance Controls

  1. Cost quotas: Set per-user or per-project limits
  2. Query review: Flag expensive queries before execution
  3. Usage monitoring: Track costs by user, project, dashboard
  4. Automated cleanup: Remove unused views and tables

# Query Governance

# Query Audit

All queries are logged for:

  • Cost attribution
  • Performance analysis
  • Security auditing
  • Usage patterns

# Cost Attribution

Track query costs by:

  • User or service account
  • Project or team
  • Dashboard or report
  • Time period

# Alerting

Set up alerts for:

  • Individual expensive queries
  • Unusual cost patterns
  • Quota threshold breaches
  • Failed or long-running queries

# Performance Considerations

# When to Use BI Engine

BigQuery BI Engine provides sub-second query response for dashboards:

Criteria:

  • Dashboard with high concurrent users
  • Data already in Gold layer (curated)
  • Refresh frequency can tolerate caching
  • Dataset fits within BI Engine capacity

# Materialized Views

Pre-compute aggregations for frequently accessed queries:

Good candidates:

  • Daily/weekly summaries
  • Standard KPI calculations
  • Common filter combinations

# Related Sections