# Lake Engine

The Lake Engine provides the query and federation layer that sits between storage and consumption. It enables unified access to data across different storage tiers and sources while enforcing access policies and optimizing costs.

# Data Federation

Data federation allows queries to access data from multiple sources as if they were a single database:

# Federated Sources

Object Storage (GCS): Query Parquet, ORC, and other formats directly
BigQuery datasets: Native BigQuery tables and views
External tables: Data in other systems accessible via connectors

# Benefits

Unified view: Single query language across sources
Reduced data movement: Query in place without copying
Centralized governance: Access policies applied consistently
Cost optimization: Choose the right storage for each use case

# BigQuery Data Federation

BigQuery serves as the primary federation layer, providing:

External tables over GCS buckets
Cross-project dataset access
Federated queries to Cloud SQL and other sources

# Access Policies

Access control is enforced at query time, ensuring users only see data they're authorized to access.

# Policy Enforcement

Dataset-level: Control access to entire datasets
Table-level: Restrict access to specific tables
Column-level: Mask or hide sensitive columns
Row-level: Filter rows based on user attributes (RLS)

# Row-Level Security (RLS)

Row-Level Security filters query results based on user identity:

Use cases:

Tenant data isolation (each tenant sees only their data)
Regional access (users see data for their region)
PII protection (limit access to sensitive records)

# Column-Level Security

Protect sensitive columns without restricting table access:

Masking: Show partial data (e.g., last 4 digits of phone)
Null replacement: Hide values for unauthorized users
Policy tags: Tag columns with sensitivity levels

# Capacity Models

The platform supports two capacity models to balance cost and performance:

# On-Demand Capacity

How it works: Pay per query based on bytes scanned

Best for:

Ad-hoc analysis and exploration
Variable or unpredictable workloads
Development and testing

Cost considerations:

Costs scale with data volume scanned
Inefficient queries can be expensive
No upfront commitment

Optimization strategies:

Partition tables by commonly filtered columns
Use clustering for frequently joined columns
Materialize common aggregations

# Provisioned Capacity

How it works: Reserve dedicated compute capacity (slots)

Best for:

Predictable, recurring workloads
Dashboards with many concurrent users
Cost predictability requirements

Options:

BigQuery Slots: Reserved compute capacity
BI Engine: In-memory acceleration for dashboards

Cost considerations:

Fixed cost regardless of query volume
More economical for high-volume workloads
Requires capacity planning

# Cost Optimization

BigQuery costs have been identified as a significant expense. Key strategies include:

# Query Optimization

**Avoid SELECT ***: Query only needed columns
Use partitioned tables: Filter on partition columns
Limit view nesting: Deep view hierarchies multiply scans
Preview before running: Estimate costs for large queries

# Data Organization

Partition tables: By date or other high-cardinality columns
Cluster tables: By frequently filtered columns
Materialize views: Pre-compute common aggregations
Archive unused data: Move to cold storage

# Governance Controls

Cost quotas: Set per-user or per-project limits
Query review: Flag expensive queries before execution
Usage monitoring: Track costs by user, project, dashboard
Automated cleanup: Remove unused views and tables

# Query Governance

# Query Audit

All queries are logged for:

Cost attribution
Performance analysis
Security auditing
Usage patterns

# Cost Attribution

Track query costs by:

User or service account
Project or team
Dashboard or report
Time period

# Alerting

Set up alerts for:

Individual expensive queries
Unusual cost patterns
Quota threshold breaches
Failed or long-running queries

# Performance Considerations

# When to Use BI Engine

BigQuery BI Engine provides sub-second query response for dashboards:

Criteria:

Dashboard with high concurrent users
Data already in Gold layer (curated)
Refresh frequency can tolerate caching
Dataset fits within BI Engine capacity

# Materialized Views

Pre-compute aggregations for frequently accessed queries:

Good candidates:

Daily/weekly summaries
Standard KPI calculations
Common filter combinations

Object Storage - Where data is stored
Reporting - How data is consumed
Cross-Cutting - Multi-tenancy and access controls