# Status overview
UME's people reached out to me to put together a set of findings and propose a data transformation plan.
They currently have a systematic problem in how their teams of around 200 people govern and consume data.
# Toolset
They primarily use GCP as their cloud provider, and for data the tools most used are:
- GCS - serves as the data ingestion layer.
- BigQuery - serves as the data storage and query layer, sometimes backing applications that need detailed records.
- Google Looker Studio - part of the team (mostly management) consumes dashboards produced here.
- Metabase - an old, outdated deployment; most of the team uses this tool.
# Governance
Metabase seems to be the place where people go for data: everyone has access to all datasets produced (potentially with some exceptions), and it is where many users write ad-hoc queries, build KPIs, and share dashboards.
One of the major pains, and perhaps the most strategic one to tackle, is the number of different versions of the same indicator. For example, searching Metabase for the FPD indicator returns more than a hundred occurrences: many versions of the same indicator exist because data sources (Questions, in Metabase terms) are saved as copies with a couple of filters changed to represent different business slices, such as customers and time.
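The duplication above suggests the shape of a fix: one canonical, owned definition per indicator, parameterized by business slice instead of copied per slice. The sketch below is a hypothetical illustration (the `KpiDefinition` class, the owner address, and the expansion of FPD as "First Payment Default" are all assumptions, not UME artifacts):

```python
from dataclasses import dataclass

# Hypothetical sketch: one canonical KPI definition, parameterized by
# business slice, instead of hundreds of copied Metabase Questions.
@dataclass(frozen=True)
class KpiDefinition:
    name: str           # e.g. "FPD" (assumed here to mean First Payment Default)
    version: int
    owner: str          # responsible person, for governance
    sql_template: str   # single source of truth for the calculation

    def render(self, **slice_filters: str) -> str:
        """Render the canonical SQL with slice filters (segment, period, ...)."""
        where = " AND ".join(f"{k} = '{v}'" for k, v in sorted(slice_filters.items()))
        return self.sql_template.format(filters=where or "TRUE")

fpd = KpiDefinition(
    name="FPD",
    version=1,
    owner="data-eng@ume.example",  # hypothetical owner address
    sql_template="SELECT COUNT(*) FROM loans WHERE defaulted_first_payment AND {filters}",
)

# Each business slice is a parameter, not a new copied Question.
print(fpd.render(segment="retail", month="2024-01"))
```

With a registry like this, the "hundred occurrences" collapse into one definition plus a discrete set of slices, and the owner field gives governance a handle.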
# Behaviour
With the way the tools are made available, users experience a lot of freedom, but at the same time, chaos.
As the company is scaling right now, the very small data engineering team will not be sufficient to sustain growth and strategic implementations.
Instead of handling the upcoming implementation of multi-tenancy and enabling central KPIs, for example, this team is chasing bad user behaviour.
# An example
In one such case, caught recently, a developer unfamiliar with the intricacies of cloud storage costs created a data source on BigQuery that consumes live log data an app dumps as a stream to GCS.
The log stream consists of a set of JSONL files that are ingested as a single column in BigQuery, making it sub-optimal for querying. This resulted in tens of thousands of Brazilian Reais in costs over a given period.
So the DE team constantly has to do ad-hoc work: fixing issues like this, and producing on-demand reports that should instead come from combining existing, well-governed data sources - i.e., working more on strategic and less on transactional activations.
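The ingestion problem in the example can be made concrete. When JSONL lands as one raw string column, every query must re-parse every row (in BigQuery, JSON functions over a full scan), whereas parsing once at ingestion yields typed columns that queries can prune. A minimal local sketch, with entirely made-up log lines:

```python
import json

# Hypothetical log lines as they would land in GCS (JSONL: one JSON object per line).
raw_jsonl = """\
{"ts": "2024-05-01T12:00:00Z", "user_id": 42, "event": "login"}
{"ts": "2024-05-01T12:00:05Z", "user_id": 42, "event": "purchase"}
"""

# Anti-pattern: kept as a single STRING column, every query re-parses every row.
single_column_rows = raw_jsonl.strip().splitlines()

# Better: parsed once at ingestion into typed columns; queries then touch
# only the columns they need and partition pruning becomes possible.
structured_rows = [json.loads(line) for line in single_column_rows]

events = [r["event"] for r in structured_rows]
print(events)  # ['login', 'purchase']
```

In BigQuery terms, the structured path corresponds to loading into a table with an explicit schema (ideally partitioned on the timestamp) instead of querying raw files through a one-column external table.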
# Data (Warehousing) strategy
Even though the small Data Engineering team is currently making efforts to tame data usage and build a central Data Warehouse, what we see on Metabase today are examples of:
- Shift-right - reports that carry into their queries all the complexity of building a KPI. In a well-designed enterprise data platform, every complexity that can be solved early in ETL is solved early.
- Duplicates - historical duplications of known indicators, and the lack of KPI lifecycle management.
There is no unified command and control of tiered data lake areas such as bronze-silver-gold. Data is not centrally classified into maturity levels, and there is no central definition for teams to adopt.
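A central maturity classification does not need heavy tooling to start. One lightweight convention, sketched below purely as an assumption (the tier-prefix naming scheme and dataset names are illustrative, not UME's), is to encode the tier in the dataset name so it can be enforced mechanically:

```python
# Hypothetical convention: dataset names carry their maturity tier as a prefix,
# e.g. "bronze_raw_logs" -> raw, "silver_payments" -> cleaned, "gold_kpi_fpd" -> curated.
TIERS = ("bronze", "silver", "gold")

def classify(dataset_name: str) -> str:
    """Return the maturity tier encoded in a dataset name, or 'unclassified'."""
    prefix = dataset_name.split("_", 1)[0]
    return prefix if prefix in TIERS else "unclassified"

datasets = ["bronze_raw_logs", "silver_payments", "gold_kpi_fpd", "adhoc_scratch"]
print({d: classify(d) for d in datasets})
```

A check like this can run in CI or as a scheduled audit, flagging `unclassified` datasets so the central definition is adopted rather than merely documented.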
# Same KPIs, different areas and values
Some KPIs have different interpretations in different areas. A KPI can make sense calculated one way for a given area while its value differs slightly for another, due to inherent differences in how each area views its figures. Some variation due to interpretation is expected, but there should be a discrete number of indicators, all well-governed (with responsible people assigned), curated, and validated at lifecycle checkpoints.
Leaders at UME are extremely detail-oriented: if an indicator seems off by even a cent, they will notice, and that deviation can undermine trust in the data and in the team; worse, tenants can lose trust in the business.
Since UME's business deals with large sums of money, trust in the way money, credit, taxes, and fees are handled is extremely sensitive and can mean life or death for the business.
Another pain reported was the need for reproducibility across reports produced at different points in time. The proposed solution should account for the versioning of KPIs, and I understand that, given a central KPI definition, this versioning covers both the KPI definition and the underlying data (through point-in-time snapshots or time travel mechanisms).
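The reproducibility requirement can be sketched with point-in-time snapshots: if each KPI value is materialized on a schedule, re-running a report "as of" a past date means looking up the latest snapshot at or before that date. The figures and dates below are invented for illustration (in BigQuery, short-horizon cases could also use table time travel, which by default covers the last 7 days):

```python
import bisect
from datetime import date

# Hypothetical point-in-time snapshots of a KPI: (snapshot date, value).
# In practice these could be scheduled materializations of a gold-tier table.
fpd_snapshots = [
    (date(2024, 1, 31), 0.042),
    (date(2024, 2, 29), 0.045),
    (date(2024, 3, 31), 0.041),
]

def value_as_of(snapshots, as_of: date) -> float:
    """Return the latest snapshot value taken on or before `as_of`."""
    dates = [d for d, _ in snapshots]
    i = bisect.bisect_right(dates, as_of)
    if i == 0:
        raise ValueError("no snapshot on or before that date")
    return snapshots[i - 1][1]

# A report re-run today with as_of=2024-02-15 reproduces the January figure,
# exactly as it appeared when that report was first produced.
print(value_as_of(fpd_snapshots, date(2024, 2, 15)))  # 0.042
```

Pairing this with the versioned KPI definition (definition version + snapshot date) fully pins down what a historical report showed and why.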
# Data Science
The data science team currently has tremendous difficulty consolidating values, since there is no single source of truth. Because of that, some DS members have to expand their reach and assist with data engineering to make sure data arrives in the right shape, correct, and on the right schedule.
Data scientists currently produce data in their own Jupyter Notebooks, and the team is working on modernizing the engineering that goes from conceptualizing a model, through building it, to deploying it in production and observing it. Traditionally, the DS practice has lacked SDLC aspects such as code reuse, deployment lifecycle, repeatability, and monitoring.
# Source systems
UME has a number of (AXAXA) source systems where data comes from. ...
# Data protection
Currently there is little to no governance of access policies, giving users wide access to information they probably should not be accessing. In the same vein, the company lacks controls that proactively identify Personally Identifiable Information across its data lake.
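Proactive PII identification can start with something as simple as pattern probes over sampled column values. The sketch below is illustrative only: the patterns (email, and the Brazilian CPF tax-ID format) are assumptions about what matters for UME, and a production setup would more likely lean on a managed service such as Google Cloud DLP:

```python
import re

# Hypothetical regex-based PII probes; patterns here are illustrative, not exhaustive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),  # Brazilian tax ID format
}

def scan_column(values):
    """Return the set of PII kinds detected in a sample of column values."""
    found = set()
    for v in values:
        for kind, pattern in PII_PATTERNS.items():
            if pattern.search(v):
                found.add(kind)
    return found

sample = ["alice@example.com", "order 1234", "123.456.789-09"]
print(sorted(scan_column(sample)))  # ['cpf', 'email']
```

Run over a sample of each table, the output becomes an inventory of columns that need restricted access policies, feeding directly back into the access-governance gap above.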