# Lake Engine
Recommendation:
- Experiment with BigQuery BI Engine on the three most cost-consuming datamarts
- Do a small PoC with Clickhouse for demanding datasets
- Do a t-shirt-sized cost comparison: on-demand vs BI Engine vs Clickhouse for shortlisted workloads
- Validate BI Engine cost predictability - can we ensure no fallback to on-demand for critical dashboards?
- Evaluate Clickhouse: identity federation, workload identity, integration with the data catalog
The lake engine, in our architecture view, can be understood as the engine capable of:
- Running Data Definition Language (DDL) on top of external tables that map Google Cloud Storage objects, and processing SQL on top of these.
- Defining tables, schemas, and other hierarchical organization levels with its internal storage mechanisms.
- Behaving much like a database management system, but with its own peculiarities.
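As a sketch, the first two capabilities map to familiar BigQuery DDL. Project, dataset, bucket, and table names below are illustrative, not real assets:

```sql
-- 1) External table mapping Google Cloud Storage objects (Parquet assumed):
CREATE EXTERNAL TABLE `my_project.lake.events_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']
);

-- 2) Table defined on BigQuery's internal storage mechanisms,
--    materialized from the external one:
CREATE TABLE `my_project.lake.events` AS
SELECT * FROM `my_project.lake.events_ext`;
```

Once either table exists, standard SQL runs on top of it just as it would in a conventional database management system.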
BigQuery generally serves as a great tool that meets these criteria, but there are different types of workloads that need these same capabilities:
- Transformations
- Reporting
  - Fine-grained and detailed
  - Aggregated, KPI-level
A tool like BigQuery cannot be used the same way for all of these workload types, and the tool itself has a few mechanisms to allow for the best performance/cost ratio in each scenario.
In our architecture view we have two main logical blocks:
- On-demand capacity - understand "general usage". Good for transformations and for infrequent and/or very well modelled/aggregated datasets. Also good for new projects under development that don't involve a huge amount of data. Cost here (BigQuery) is on-demand, either per data scanned or per slots used (BigQuery Editions).
- Provisioned capacity - cost/capacity - not necessarily BigQuery - is provisioned and stable regardless of the number of queries executed. Not charged by bytes scanned per query or per slots used, but for provisioned compute, for example: 1) BigQuery BI Engine, 2) Clickhouse Cloud.
Having this separation helps us provision for, and debounce, the costs of highly demanding jobs that would otherwise push BigQuery's on-demand costs to their limit.
Going beyond that, however, requires a deeper look into best practices for the on-demand capacity.
# On-demand capacity
Some of the greatest cost drivers in any company's data warehouse are:
- Bad data modeling - visible as overly complex queries behind reports
- Bad storage optimization - on BigQuery this is usually:
  - too many data files on external tables
  - full data scans on flat files, coming from unstructured data being processed (no data types, nested values, de-serialization) during query time
- Not leveraging design patterns for partitioning or clustering.
For tables used in reporting, clustering and partitioning provide cost-saving superpowers and can dramatically reduce costs with minimal effort, just by understanding query patterns.
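A minimal sketch of what this looks like in practice - table and column names here are hypothetical, chosen to illustrate a fact table queried mostly by date range and customer:

```sql
-- Partition pruning plus clustering cuts scanned bytes for the common
-- dashboard access pattern (date filter + customer/country filter).
CREATE TABLE `my_project.marts.fact_orders`
PARTITION BY DATE(order_ts)       -- prunes partitions on date-bounded queries
CLUSTER BY customer_id, country   -- co-locates rows for frequent filter columns
AS
SELECT * FROM `my_project.staging.orders`;

-- A typical dashboard query then scans only the touched partitions:
SELECT country, SUM(amount) AS revenue
FROM `my_project.marts.fact_orders`
WHERE order_ts >= TIMESTAMP('2024-01-01')
GROUP BY country;
```

The design choice - which column to partition by, which columns to cluster on - falls directly out of the observed query patterns, which is why understanding those patterns comes first.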
Furthermore, the transition towards sustainable usage should include motivating dimensional modeling best practices. Namely:
- Tackle the first few projects with the best possible data model
- Document good practices and make them actionable and accessible
- Incrementally bring datamarts into this new model
- Reward good examples and discourage bad ones.
Recommendation:
- Keep BigQuery
- Create costs and good practices reporting
- Make them appear on data catalog/governance tool
- Associate ownership with a person by system design, not by individual initiative - e.g.: checking FPD data on the data catalog, I can see lineage and a link to reports for that table.
- Calculate baseline and adopt BigQuery Editions for reporting workloads
- Create onboarding blueprints associated with agentic recommendations for users.
- Incrementally bring datamarts to better modelling and add aggregations
- Hunt for complex queries on reports/dashboards
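The query hunt can start from BigQuery's own job metadata. A sketch, assuming the US multi-region (adjust the region qualifier) and the current on-demand list price per TiB (verify against current pricing before using the dollar figure):

```sql
-- Top 20 most expensive queries over the last 7 days, with a rough USD estimate.
SELECT
  user_email,
  LEFT(query, 120) AS query_preview,
  total_bytes_billed / POW(1024, 4) AS tib_billed,
  total_bytes_billed / POW(1024, 4) * 6.25 AS approx_usd  -- price is an assumption
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 20;
```

Feeding this into the cost reporting mentioned above gives the "hunt" a concrete, recurring starting point.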
# Governance recommendations
- Split data into domains using projects, datasets, and schemas to enable:
- Granular access policies separation between departments and people
- Least-privilege access at dataset/table level
- Cost attribution by project, team, and dashboard
- Consider BigQuery Data Sharing (formerly Analytics Hub) to formalize cross-team data distribution
- Enforce individual user identities for all queries (no shared service accounts) to enable audit trails - whenever possible also for consuming apps.
- Implement cost anomaly detection (spikes per user, project, or dashboard grain) that triggers a review pipeline involving data stewards
- Document which regions data will reside in; avoid cross-region data transfer for cost and latency
- Always use data encryption at rest
- Identify extra-critical datasets for replication into different regions (BigQuery offers 99.99% uptime with automatic zonal redundancy; which datasets can we not afford to have unavailable for 10 or 40 minutes?)
- Define naming conventions for projects, datasets, and tables; establish clear tier organization (bronze/silver/gold)
- Use partitioning and clustering - patterns will emerge from first project implementations and be documented as blueprints
- Model data and transformations supporting KPI versioning alongside data (time travel or point-in-time snapshots) to enable reproducibility across reports
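One way to seed the cost anomaly detection described above is a query over job metadata, comparing each user's daily billed bytes against their own trailing average. The region qualifier and the 3x spike threshold are illustrative assumptions:

```sql
-- Flag user/day pairs whose billed bytes exceed 3x the user's trailing 7-day average.
WITH daily AS (
  SELECT
    user_email,
    DATE(creation_time) AS day,
    SUM(total_bytes_billed) AS bytes_billed
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY user_email, day
)
SELECT *
FROM (
  SELECT
    user_email, day, bytes_billed,
    AVG(bytes_billed) OVER (
      PARTITION BY user_email ORDER BY day
      ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS trailing_avg
  FROM daily
)
WHERE bytes_billed > 3 * trailing_avg;  -- spikes to route to data stewards
```

Scheduling this (e.g. as a daily job) and routing hits to the steward review pipeline turns the bullet into something actionable.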
# Provisioned Capacity
The core idea behind provisioned capacity is cost predictability and debouncing. Instead of paying per query or per bytes scanned - where a single poorly written query or a spike in report usage can blow up costs - you reserve fixed compute capacity. This changes the cost model from variable to stable, which is critical for high-traffic detailed dashboards and recurring workloads where on-demand pricing can bring surprises.
Provisioned capacity is best suited for:
- Dashboards with many concurrent users - where query volume is high and unpredictable
- Gold layer - curated, well-modeled datasets that are critical and justify the investment
- Cost predictability requirements - finance and planning need stable forecasts
- Sub-second response times - user experience matters for executive dashboards
# Options
# Considerations
- Capacity planning is required - we must first understand usage patterns to right-size
- BI Engine has memory limits - datasets must fit; queries exceeding capacity fall back to on-demand BigQuery (with associated costs)
- Clickhouse requires evaluation - identity federation, workload identity, and integration with data catalog need validation before adoption
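BI Engine fallback can be observed from job metadata. A sketch, assuming the `bi_engine_statistics` column exposed in the JOBS view and the US multi-region (both worth verifying in our environment):

```sql
-- Share of queries fully accelerated by BI Engine over the last 7 days,
-- and bytes billed when queries fell back to on-demand.
SELECT
  bi_engine_statistics.bi_engine_mode AS mode,  -- e.g. FULL, PARTIAL, DISABLED
  COUNT(*) AS queries,
  SUM(total_bytes_billed) AS bytes_billed_on_fallback
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND bi_engine_statistics IS NOT NULL
GROUP BY mode
ORDER BY queries DESC;
```

A growing PARTIAL/DISABLED share for critical dashboards is the signal to resize the BI Engine reservation or trim the underlying datasets.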
# Governance recommendations
The governance recommendations from the on-demand capacity section apply here as well.
Additional considerations for provisioned capacity:
- Establish eligibility criteria for datasets to be promoted to provisioned capacity (e.g., certified gold, high query volume, stable schema)
- Monitor fallback to on-demand (BI Engine) and adjust capacity accordingly
- Track usage against provisioned capacity to detect under/over-utilization
- For Clickhouse or other non-BigQuery engines, ensure governance tooling (catalog, lineage, access policies) integrates properly before adoption
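For slot-based reservations (BigQuery Editions), under/over-utilization can be tracked from the jobs timeline view. A sketch, assuming the `JOBS_TIMELINE_BY_PROJECT` view and the US multi-region qualifier; verify both, plus the partition-filter requirement on `job_creation_time`, in our environment:

```sql
-- Average slots consumed per hour, to compare against provisioned capacity.
SELECT
  TIMESTAMP_TRUNC(period_start, HOUR) AS hour,
  SUM(period_slot_ms) / (1000 * 3600) AS avg_slots_used  -- slot-ms over one hour = avg slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE job_creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY hour
ORDER BY hour;
```

Plotting this against the reserved baseline shows whether capacity should be resized, and feeds the eligibility criteria for promoting or demoting datasets.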