# Lake Engine

The lake engine, in our architecture view, can be understood as the engine capable of:

  • Running Data Definition Language (DDL) to create external tables that map Google Cloud Storage objects, and processing SQL on top of them.
  • Defining tables, schemas, and other hierarchical organization levels with its internal storage mechanisms.
  • Behaving much like a database management system, but with its own peculiarities.
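As a minimal sketch of the first capability, an external table over Parquet objects in GCS could look like the following (the bucket, dataset, and column names are hypothetical):

```sql
-- Hypothetical names; assumes Parquet objects already landed in GCS.
CREATE EXTERNAL TABLE lake.events_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://analytics-landing/events/*.parquet']
);

-- Regular SQL can then be processed on top of the external table.
SELECT event_type, COUNT(*) AS events
FROM lake.events_ext
GROUP BY event_type;
```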

BigQuery generally serves as a great tool that meets these criteria, but there are different types of workloads that need these same capabilities:

  • Transformations
  • Reporting
    • Fine-grained and detailed
    • Aggregated, KPI-level.

A tool like BigQuery cannot be used the same way for all of these workload types, and the tool itself offers several mechanisms to achieve the best performance/cost ratio for each scenario.

In our architecture view we have two main logical blocks:

  • On-demand capacity - understand "general usage". Good for transformations and for infrequent and/or very well modelled/aggregated datasets. Also good for new projects under development that don't involve a huge amount of data. Cost here (BigQuery) is on-demand: either per byte scanned or per slot used (BigQuery Editions).
  • Provisioned capacity - cost/capacity - not necessarily BigQuery - is provisioned and stable regardless of the number of queries executed. Not charged by bytes scanned per query or per slots used, but for provisioned compute, for example: 1) BigQuery BI Engine, 2) Clickhouse Cloud.

Having this separation helps us provision for and debounce the costs of highly demanding jobs that would otherwise drive BigQuery's on-demand bill to its limits.

Going beyond that, however, requires a deeper look into best practices for the on-demand capacity.

# On-demand capacity

Some of the biggest cost drivers in any data warehouse, in any company, are:

  • Bad data modeling - visible as overly complex queries behind reports
  • Bad storage optimization - on BigQuery this usually means:
    • too many small data files behind external tables
    • full scans over flat files, with unstructured data being processed (no data types, nested values, de-serialization) at query time
    • not leveraging design patterns for partitioning or clustering.

For tables used in reporting, clustering and partitioning are cost-saving superpowers: they can dramatically reduce scanned bytes with minimal effort, just by understanding query patterns.
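As an illustration, assuming a hypothetical `reporting.orders` table queried mostly by date range and customer, partitioning and clustering could be declared as:

```sql
-- Hypothetical tables; partition by day, cluster by the most filtered columns.
CREATE TABLE reporting.orders
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id, region
AS
SELECT * FROM staging.orders_raw;

-- Queries that filter on the partition column only scan the matching partitions.
SELECT customer_id, SUM(amount) AS total
FROM reporting.orders
WHERE DATE(order_ts) BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY customer_id;
```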

Furthermore, the transition towards sustainable usage should include a push towards dimensional modeling best practices. Namely:

  • Tackle the first few projects with the target data model
  • Document good practices and make them actionable and accessible
  • Incrementally bring data marts into this new model
  • Reward good examples and discourage bad ones.

# Governance recommendations

  • Split data into domains using projects, datasets, and schemas to enable:
    • Granular access policies separation between departments and people
    • Least-privilege access at dataset/table level
    • Cost attribution by project, team, and dashboard
  • Consider BigQuery Data Sharing (formerly Analytics Hub) to formalize cross-team data distribution
  • Enforce individual user identities for all queries (no shared service accounts) to enable audit trails - whenever possible, also for consuming apps
  • Implement cost anomaly detection (spikes per user, project, or dashboard grain) that triggers a review pipeline involving data stewards
  • Document which regions data will reside in; avoid cross-region data transfer for cost and latency
  • Always use data encryption at rest
  • Understand which extra-critical datasets warrant replication into different regions (BigQuery offers 99.99% uptime with automatic zonal redundancy; which datasets can we not afford to have unavailable for 10 or 40 minutes?)
  • Define naming conventions for projects, datasets, and tables; establish clear tier organization (bronze/silver/gold)
  • Use partitioning and clustering - patterns will emerge from the first project implementations and be documented as blueprints
  • Model data and transformations supporting KPI versioning alongside data (time travel or point-in-time snapshots) to enable reproducibility across reports
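For the cost attribution and anomaly detection points, BigQuery's INFORMATION_SCHEMA.JOBS views expose per-query billing data. A sketch of a per-user daily spend query (the region qualifier and the threshold are assumptions to be tuned):

```sql
-- Daily TiB billed per user over the last 30 days; feed rows exceeding an
-- agreed threshold into the review pipeline involving data stewards.
SELECT
  user_email,
  DATE(creation_time) AS day,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email, day
HAVING tib_billed > 1  -- assumed threshold, in TiB per user per day
ORDER BY tib_billed DESC;
```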

# Provisioned Capacity

The core idea behind provisioned capacity is cost predictability and debouncing. Instead of paying per query or per byte scanned - where a single poorly written query or a spike in report usage can blow up costs - you reserve fixed compute capacity. This changes the cost model from variable to stable, which is critical for high-traffic, detailed dashboards and recurring workloads where on-demand pricing can bring surprises.

Provisioned capacity is best suited for:

  • Dashboards with many concurrent users - where query volume is high and unpredictable
  • Gold layer - curated, well-modeled datasets that are critical and justify the investment
  • Cost predictability requirements - finance and planning need stable forecasts
  • Sub-second response times - user experience matters for executive dashboards

# Options

| Option | Description | Best for |
| --- | --- | --- |
| BigQuery BI Engine | In-memory acceleration capacity on top of BigQuery | Dashboards needing sub-second response; data fits in memory |
| BigQuery Slots (Editions) | Reserved compute capacity for BigQuery queries (also suitable for some on-demand capacity workloads) | High-volume ETL or reporting with predictable patterns |
| Clickhouse Cloud | Columnar OLAP database, very fast; we pay for provisioned compute | Experimental; very high query volume, real-time analytics |
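For the slots option, BigQuery's reservation DDL can express the provisioned capacity as code; a sketch, where the admin project, region, assignee project, and slot count are all assumptions:

```sql
-- Hypothetical admin project and region; 100 baseline slots.
CREATE RESERVATION `admin-project.region-us.reporting`
OPTIONS (slot_capacity = 100);

-- Route a project's query workloads to the reservation.
CREATE ASSIGNMENT `admin-project.region-us.reporting.reporting-assignment`
OPTIONS (assignee = 'projects/analytics-prod', job_type = 'QUERY');
```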

# Considerations

  • Capacity planning is required - we must first understand usage patterns to right-size
  • BI Engine has memory limits - datasets must fit; queries exceeding capacity fall back to on-demand BigQuery (with associated costs)
  • Clickhouse requires evaluation - identity federation, workload identity, and integration with data catalog need validation before adoption

# Governance recommendations

The governance recommendations from the On-demand capacity section also apply here - domain organization, access policies, audit trails, naming conventions, and encryption remain relevant.

Additional considerations for provisioned capacity:

  • Establish eligibility criteria for datasets to be promoted to provisioned capacity (e.g., certified gold, high query volume, stable schema)
  • Monitor fallback to on-demand (BI Engine) and adjust capacity accordingly
  • Track usage against provisioned capacity to detect under/over-utilization
  • For Clickhouse or other non-BigQuery engines, ensure governance tooling (catalog, lineage, access policies) integrates properly before adoption
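For the BI Engine fallback point, the JOBS views expose `bi_engine_statistics` per query; a monitoring sketch (the region qualifier is an assumption):

```sql
-- Share of queries fully accelerated by BI Engine vs falling back to
-- on-demand slots over the last week.
SELECT
  bi_engine_statistics.bi_engine_mode AS mode,  -- FULL, PARTIAL or DISABLED
  COUNT(*) AS queries
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY mode;
```

A rising PARTIAL/DISABLED share is the signal to resize the BI Engine reservation or revisit dataset eligibility.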