# Lake Engine

The lake engine, in our architecture view, can be understood as the engine capable of:

  • Running Data Definition Language (DDL) to create external tables that map Google Cloud Storage objects, and processing SQL on top of them.
  • Defining tables, schemas, and other hierarchical organization levels with its internal storage mechanisms.
  • Behaving much like a database management system, but with its own peculiarities.
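As a minimal sketch of the first capability, an external table over Parquet objects in GCS could look like the following (the bucket, dataset, and column names are hypothetical):

```sql
-- Hypothetical names; assumes Parquet objects already landed in GCS.
CREATE EXTERNAL TABLE lake.events_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://analytics-landing/events/*.parquet']
);

-- Regular SQL can then be processed on top of the external table.
SELECT event_type, COUNT(*) AS events
FROM lake.events_ext
GROUP BY event_type;
```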

BigQuery generally serves as a great tool that meets these criteria, but there are different types of workloads that need these same capabilities:

  • Transformations
  • Reporting
    • Fine-grained and detailed
    • Aggregated, KPI-level.

A tool like BigQuery cannot be used the same way for all of these workload types, and the tool itself offers several mechanisms to achieve the best performance/cost ratio for each scenario.

In our architecture view we have two main logical blocks:

  • On-demand capacity - understand "general usage". Good for transformations and for infrequent and/or very well modelled/aggregated datasets. Also good for new projects under development that don't involve a huge amount of data. Cost here (BigQuery) is on-demand: either per byte scanned or per slot used (BigQuery Editions).
  • Provisioned capacity - cost/capacity - not necessarily BigQuery - is provisioned and stable regardless of the number of queries executed. Not charged by bytes scanned per query or per slots used, but for provisioned compute, for example: 1) BigQuery BI Engine, 2) Clickhouse Cloud.

Having this separation helps us provision for and debounce the costs of highly demanding jobs that would otherwise drive BigQuery's on-demand bill to its limits.

Going beyond that, however, requires a deeper look into best practices for the on-demand capacity.

# On-demand capacity

Some of the biggest cost drivers in any data warehouse, in any company, are:

  • Bad data modeling - visible as overly complex queries behind reports
  • Bad storage optimization - on BigQuery this usually means:
    • too many small data files behind external tables
    • full scans over flat files, with unstructured data being processed (no data types, nested values, de-serialization) at query time
    • not leveraging design patterns for partitioning or clustering.

For tables used in reporting, clustering and partitioning are cost-saving superpowers: they can dramatically reduce scanned bytes with minimal effort, just by understanding query patterns.
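As an illustration, assuming a hypothetical `reporting.orders` table queried mostly by date range and customer, partitioning and clustering could be declared as:

```sql
-- Hypothetical tables; partition by day, cluster by the most filtered columns.
CREATE TABLE reporting.orders
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id, region
AS
SELECT * FROM staging.orders_raw;

-- Queries that filter on the partition column only scan the matching partitions.
SELECT customer_id, SUM(amount) AS total
FROM reporting.orders
WHERE DATE(order_ts) BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY customer_id;
```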

Furthermore, the transition towards sustainable usage should include a push towards dimensional modeling best practices. Namely:

  • Tackle the first few projects with the target data model
  • Document good practices and make them actionable and accessible
  • Incrementally bring data marts into this new model
  • Reward good examples and discourage bad ones.

# Governance recommendations

  • Split data into domains using projects, datasets, and schemas to enable:
    • Granular access policies separation between departments and people
    • Least-privilege access at dataset/table level
    • Cost attribution by project, team, and dashboard
  • Consider BigQuery Data Sharing (formerly Analytics Hub) to formalize cross-team data distribution
  • Enforce individual user identities for all queries (no shared service accounts) to enable audit trails - whenever possible, also for consuming apps
  • Implement cost anomaly detection (spikes per user, project, or dashboard grain) that triggers a review pipeline involving data stewards
  • Document which regions data will reside in; avoid cross-region data transfer for cost and latency
  • Always use data encryption at rest
  • Understand which extra-critical datasets warrant replication into different regions (BigQuery offers 99.99% uptime with automatic zonal redundancy; which datasets can we not afford to have unavailable for 10 or 40 minutes?)
  • Define naming conventions for projects, datasets, and tables; establish clear tier organization (bronze/silver/gold)
  • Use partitioning and clustering - patterns will emerge from the first project implementations and be documented as blueprints
  • Model data and transformations supporting KPI versioning alongside data (time travel or point-in-time snapshots) to enable reproducibility across reports
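For the cost attribution and anomaly detection points, BigQuery's INFORMATION_SCHEMA.JOBS views expose per-query billing data. A sketch of a per-user daily spend query (the region qualifier and the threshold are assumptions to be tuned):

```sql
-- Daily TiB billed per user over the last 30 days; feed rows exceeding an
-- agreed threshold into the review pipeline involving data stewards.
SELECT
  user_email,
  DATE(creation_time) AS day,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email, day
HAVING tib_billed > 1  -- assumed threshold, in TiB per user per day
ORDER BY tib_billed DESC;
```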

# Provisioned Capacity

The core idea behind provisioned capacity is cost predictability and debouncing. Instead of paying per query or per byte scanned - where a single poorly written query or a spike in report usage can blow up costs - you reserve fixed compute capacity. This changes the cost model from variable to stable, which is critical for high-traffic, detailed dashboards and recurring workloads where on-demand pricing can bring surprises.

Provisioned capacity is best suited for:

  • Dashboards with many concurrent users - where query volume is high and unpredictable
  • Gold layer - curated, well-modeled datasets that are critical and justify the investment
  • Cost predictability requirements - finance and planning need stable forecasts
  • Sub-second response times - user experience matters for executive dashboards

# Options

| Option | Description | Best for |
| --- | --- | --- |
| BigQuery BI Engine | In-memory acceleration capacity on top of BigQuery | Dashboards needing sub-second response; data fits in memory |
| BigQuery Slots (Editions) | Reserved compute capacity for BigQuery queries (also suitable for some on-demand capacity workloads) | High-volume ETL or reporting with predictable patterns |
| Clickhouse Cloud | Columnar OLAP database, very fast; we pay for provisioned compute | Experimental; very high query volume, real-time analytics |
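For the slots option, BigQuery's reservation DDL can express the provisioned capacity as code; a sketch, where the admin project, region, assignee project, and slot count are all assumptions:

```sql
-- Hypothetical admin project and region; 100 baseline slots.
CREATE RESERVATION `admin-project.region-us.reporting`
OPTIONS (slot_capacity = 100);

-- Route a project's query workloads to the reservation.
CREATE ASSIGNMENT `admin-project.region-us.reporting.reporting-assignment`
OPTIONS (assignee = 'projects/analytics-prod', job_type = 'QUERY');
```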

# Considerations

  • Capacity planning is required - we must first understand usage patterns to right-size
  • BI Engine has memory limits - datasets must fit; queries exceeding capacity fall back to on-demand BigQuery (with associated costs)
  • Clickhouse requires evaluation - identity federation, workload identity, and integration with data catalog need validation before adoption

# Governance recommendations

The governance recommendations from the On-demand capacity section also apply here - domain organization, access policies, audit trails, naming conventions, and encryption remain relevant.

Additional considerations for provisioned capacity:

  • Establish eligibility criteria for datasets to be promoted to provisioned capacity (e.g., certified gold, high query volume, stable schema)
  • Monitor fallback to on-demand (BI Engine) and adjust capacity accordingly
  • Track usage against provisioned capacity to detect under/over-utilization
  • For Clickhouse or other non-BigQuery engines, ensure governance tooling (catalog, lineage, access policies) integrates properly before adoption
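For the BI Engine fallback point, the JOBS views expose `bi_engine_statistics` per query; a monitoring sketch (the region qualifier is an assumption):

```sql
-- Share of queries fully accelerated by BI Engine vs falling back to
-- on-demand slots over the last week.
SELECT
  bi_engine_statistics.bi_engine_mode AS mode,  -- FULL, PARTIAL or DISABLED
  COUNT(*) AS queries
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY mode;
```

A rising PARTIAL/DISABLED share is the signal to resize the BI Engine reservation or revisit dataset eligibility.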