# Proposed Solution - MVP
# Object/File Storage
Google Cloud Storage
- GCS offers the highest interoperability of the options considered
- It decouples storage from compute
- It makes it easier to run volatile (ephemeral) processing on top of stored data
- Highly available
- Easier to impose access policies / RBAC, audit trail, immutability
- Easier to apply tier transition policies
- Best use:
  - Raw data / Landing area
  - Restricted access to data integration people only
  - Offer per-user sandboxes to keep track of ownership/responsibility for ad-hoc data files
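The tier transition policies mentioned above could be expressed as a GCS lifecycle configuration. The following is a minimal sketch that only builds the JSON document. The `landing/` prefix and the age thresholds are hypothetical placeholders, not values taken from this proposal.

```python
import json


def build_lifecycle_config(prefix, transitions):
    """Build a GCS lifecycle configuration as a plain dict.

    transitions: list of (age_in_days, storage_class) tuples.
    """
    rules = []
    for age_days, storage_class in sorted(transitions):
        rules.append(
            {
                "action": {"type": "SetStorageClass", "storageClass": storage_class},
                # Only objects under the given prefix and older than age_days match.
                "condition": {"age": age_days, "matchesPrefix": [prefix]},
            }
        )
    return {"rule": rules}


# Hypothetical thresholds: colder classes as raw files age.
config = build_lifecycle_config(
    "landing/",
    [(30, "NEARLINE"), (90, "COLDLINE"), (365, "ARCHIVE")],
)
print(json.dumps(config, indent=2))
```

Saved to a file, a document of this shape is what `gsutil lifecycle set` expects. The actual thresholds would depend on how often raw data is re-read.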
Alternatives
- Block Storage
  - Suited to local workloads, e.g. instances that process data while attached to the disk
  - Can be much faster
  - Difficult to scale and operate
  - Can be used, depending on the processing framework
- Google AlloyDB/Cloud SQL
  - Not a 1:1 comparison with object storage, but instead of using GCS for CSVs, this data could be stored directly in a DBMS. So it covers only a subset of the current GCS use cases.
  - Maintenance won't scale as easily as with GCS. If GCS is coupled with good use of a DW, maintenance and costs tend to scale better. Subject to use-case analysis.
# Relational database
Google AlloyDB
- Used for transactional workloads when a fixed, limited cluster cost is justified over BigQuery usage
- Easier to maintain and replicate
- Predictable cost
- Response times/latency appropriate for transactional workloads
- Still need to maintain indexes and keys
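The trade-off between a fixed cluster cost and usage-based billing can be sketched as a break-even calculation. This is a rough illustration only. The prices below are hypothetical placeholders, and the model ignores storage, replication, and latency requirements.

```python
def break_even_tb_scanned(monthly_cluster_cost, price_per_tb_scanned):
    """Monthly TB scanned at which a fixed cluster costs the same as usage-based billing."""
    if price_per_tb_scanned <= 0:
        raise ValueError("price_per_tb_scanned must be positive")
    return monthly_cluster_cost / price_per_tb_scanned


def cheaper_option(monthly_cluster_cost, price_per_tb_scanned, expected_tb_scanned):
    """Return which pricing model is cheaper for the expected monthly volume."""
    usage_cost = expected_tb_scanned * price_per_tb_scanned
    return "fixed cluster" if monthly_cluster_cost < usage_cost else "usage-based"


# Hypothetical placeholder prices, not quotes from any provider.
CLUSTER_COST = 1500.0  # per month
PRICE_PER_TB = 6.25    # per TB scanned

print(break_even_tb_scanned(CLUSTER_COST, PRICE_PER_TB))
print(cheaper_option(CLUSTER_COST, PRICE_PER_TB, expected_tb_scanned=400))
```

Under these placeholder numbers, workloads scanning more than the break-even volume favor the fixed cluster. Real inputs would come from the use-case analysis.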
Alternatives
- BigQuery
  - Higher latency
  - Prohibitive cost under high usage
# DW / Acceleration engine
BigQuery
- Appropriate for OLAP workloads
- Low maintenance, e.g. no indexes to manage
- Offers great performance when paired with good design (partitioning, clustering)
- Storage decoupled from compute, e.g. it doesn't charge for idle time
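As an illustration of the "good design" point, the sketch below renders a BigQuery `CREATE TABLE` statement with a partitioning and a clustering clause. The table and column names are hypothetical. The helper only builds the SQL string and does not talk to BigQuery.

```python
def partitioned_table_ddl(table, columns, partition_column, cluster_columns):
    """Render BigQuery DDL for a table partitioned by day and clustered by columns.

    columns: list of (name, type) tuples.
    """
    known = {name for name, _ in columns}
    missing = ({partition_column} | set(cluster_columns)) - known
    if missing:
        raise ValueError(f"unknown columns: {sorted(missing)}")
    # BigQuery accepts at most four clustering columns.
    if not 1 <= len(cluster_columns) <= 4:
        raise ValueError("expected between 1 and 4 clustering columns")
    column_defs = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {column_defs}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical events table.
ddl = partitioned_table_ddl(
    "analytics.events",
    [("event_ts", "TIMESTAMP"), ("customer_id", "STRING"), ("event_type", "STRING")],
    partition_column="event_ts",
    cluster_columns=["customer_id", "event_type"],
)
print(ddl)
```

Queries that filter on the partition column and the leading clustering columns scan less data, which is where most of the performance and cost benefit comes from.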
ClickHouse - experimental
- Storage (DBMS) offering
- Very fast; a widely adopted standard for DW outside of the cloud providers' offerings
Alternatives
- Dremio / Starburst (Trino)
  - Similar offerings for a distributed SQL engine
  - Pay for compute instances reading from object storage or other engines
  - Enterprise pricing $$
- BI Engine
  - Pairs well with BigQuery for accelerating queries
  - Pay for the memory made available
  - Can still fall back to BigQuery workloads in a chaotic environment (many dashboards and users)
  - Low cost predictability
- Databricks / Snowflake
  - Not only a DW; they also offer many other features on top of a lake
  - Enterprise pricing
# Reporting / Exploration
Add vertical cost control
Metabase Cloud
- Managed instance
- Can offer cac Looker Studio
Hex
- Worth trying during PoC
- Pay per editor only
- Added value for ML
Alternatives
- ThoughtSpot
  - Enterprise pricing
- Superset
- Looker
  - Enterprise pricing
- Tableau
  - Enterprise pricing
- Power BI
  - Complex modeling overhead
# Orchestration
Cloud Composer
- Best framework around for building dependency-aware data workloads
- Unmatched community support
- Publish DAGs via CI/CD integrated with GCS
- Can use DAG Factory (https://astronomer.github.io/dag-factory) to declare DAGs as YAML
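To make "dependency-aware" concrete, the sketch below resolves a run order from declared task dependencies using Python's standard-library `graphlib`. It is not Airflow code, and the task names are hypothetical. It only illustrates the idea an orchestrator applies when scheduling a DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "load_raw_to_gcs": set(),
    "load_gcs_to_bigquery": {"load_raw_to_gcs"},
    "run_dbt_models": {"load_gcs_to_bigquery"},
    "refresh_dashboards": {"run_dbt_models"},
    "export_audit_log": {"load_raw_to_gcs"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Tasks with no dependency between them, such as `export_audit_log` and `load_gcs_to_bigquery` here, can appear in either order. An orchestrator would typically run them in parallel.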
n8n
- Keep an eye on it; n8n could perhaps solve some problems
Alternatives
- Astronomer
  - Managed, cloud-hosted Airflow
  - Better than Cloud Composer, but keeping Cloud Composer for its integration with the GCP account
- Prefect
  - Simpler concept
  - Like Dagster, more dynamic
- Kestra
  - YAML-defined transformations
- Dagster
  - Like Airflow, but more data-model aware
  - Great for fine-grained data lineage
# Transformations
dbt Cloud
- Same concept as now
- Added interface for non-devs
- Auto lineage docs
- Supports many DWs / SQL-speaking databases
Semantic layer - how to use it without dbt Cloud - pull from clients?
Alternatives
- dbt Core
  - More dev-friendly
  - Not so analyst-friendly
- Dataform
  - Exclusive to BigQuery
  - Comparable to dbt
  - SQL + JavaScript for templating
  - Pay only for BigQuery
- Hex
  - Also works as a transformations engine
  - Not first-class, though
# Governance
DataHub
- OSS
- Pricing not public
Alternatives
- OpenMetadata
  - OSS
  - SaaS still not available
- Atlan
  - Bad reputation
- Alation
- data.world