Two hundred years ago almost 90% of jobs were on the farm. Today less than 2% of people work on farms. During that steady decline in labor, the quality, reliability and variety of food available continued to grow to what we enjoy today. One of the major contributors to the improvements in quality (beyond the productivity gains in farming itself) was that some of the freed-up farm workforce shifted to improving other areas of the food value chain, including storage, distribution, retail, and preparation.
The same opportunity exists to improve how companies manage data. It seems to me that too much effort is still spent on “farming” the data. That is inefficient, and it takes away from the opportunity to redeploy that labor to more valuable areas such as analytics, ML, and AI.
I believe ETL is the root of this problem.
Since the early days of computers and databases, Extract, Transform, and Load (ETL) code has been the way developers moved data. Look at the day-to-day workings of any large enterprise and you will find tens or hundreds of thousands of ETL jobs behind the scenes that represent the glue of how information flows.
One of the first steps of any technology project is to identify the upstream and downstream data needs and write the new ETL jobs needed to support the effort.
Writing ETL code is relatively simple. Plus, it’s darn useful in answering almost any business question (are my sales up or down, what are the trends in service calls, how does each manager’s spend compare to budget, etc.). Lastly, there are many popular database tool vendors with their own spin on ETL, and IT service vendors LOVE to provide ETL staff. On the surface, ETL appears really easy and useful, and lots of vendors are eager to help.
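To see why ETL feels so approachable, here is a minimal sketch of such a job in Python. The raw export, column names, and target table are all hypothetical; the point is only how little code an individual job takes:

```python
import csv
import io
import sqlite3

# Extract: a hypothetical raw export from a source system.
RAW = "order_id,region,amount\n1,east,100.50\n2,West,200.00\n3,east,50.25\n"
rows = list(csv.DictReader(io.StringIO(RAW)))

# Transform: normalize region codes and convert dollar amounts to
# integer cents so downstream jobs avoid floating-point currency.
clean = [
    (int(r["order_id"]), r["region"].upper(), round(float(r["amount"]) * 100))
    for r in rows
]

# Load: write the cleaned rows into a target table (in-memory here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount_cents INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)

total = db.execute("SELECT SUM(amount_cents) FROM sales").fetchone()[0]
print(total)  # 35075
```

Each such job is easy in isolation; the trouble comes from thousands of them accumulating with no shared record of what they do or why.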
But while ETL may seem appealing, over time it leads to serious problems:
- Lack of trust in the results: business people are not sure where the data came from, what calculations each job performs, or why (what the industry calls lineage). This makes it virtually impossible to create an ‘enterprise view’ of the data.
- Data definition confusion (metadata): people across the organization use the same term to mean different things, or different terms to mean the same thing.
- Daisy-chained jobs: ETL job 1 feeds job 2, which feeds job 3, and so on. People lose sight of where the data originated. Everyone ends up with their own source of truth: “my sales data doesn’t match your sales data”.
- High maintenance costs: All those jobs need to be maintained and upgraded. They break when someone changes an underlying source of data (which happens often).
- Security risks: In these days of mounting security breaches and growing penalties, all those ETL jobs shooting data all over the place are a nightmare.
All too often we hear companies tell us “we have a legacy data warehouse with thousands of ETL jobs, no documentation, frustrated business users, no trust in the data, and high costs to maintain. We’d like to port, clean up, and retire it, but it’s a risky and costly effort to untangle all the spaghetti ETL code”.
Every company today would like to throw a hand grenade into their legacy data and start over again with an enterprise data lake (EDL). When implemented correctly, an EDL will address the issues listed above and make it easier to pull accurate insights across the breadth of the company.
However, if you choose to ETL data into the lake, the same problems will arise again. Lakes become swamps very quickly.
At Next Pathway we have adopted the mantra “no ETL” and have developed tools and processes that eliminate the army of ETL developers.
Simply put, we do not ETL data.
Instead, we rely on metadata (literally, data about the data) and business-defined transformation rules to automatically migrate data and code.
The operational benefits in cost and time are immediate. And because we avoid manual ETL, data lineage and metadata are accurately captured. The long-term benefit is that the “data swamps” that accumulate from years of manual ETL are avoided: the descriptions of how and why the data was moved are necessarily created and guaranteed to be accurate. Anyone can look back over time to understand where the data came from and what rules were applied. True enterprise data is achieved.
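The core idea can be sketched in a few lines of Python. This is a simplified illustration, not Next Pathway’s actual tooling; every table, column, and rule name below is hypothetical. Because the transformation is declared as data rather than buried in code, lineage falls out automatically:

```python
# Hypothetical business-defined transformation rules, declared as metadata.
RULES = {
    "target": "customer_dim",
    "source": "crm.customers",
    "columns": {
        "customer_id": {"from": "cust_no", "rule": "copy"},
        "full_name":   {"from": "name",    "rule": "upper"},
    },
}

# The small, fixed vocabulary of transformations the rules may reference.
TRANSFORMS = {"copy": lambda v: v, "upper": lambda v: v.upper()}

def migrate(rows, rules):
    """Apply the declared rules and emit a lineage record alongside the data."""
    out = [
        {col: TRANSFORMS[spec["rule"]](row[spec["from"]])
         for col, spec in rules["columns"].items()}
        for row in rows
    ]
    lineage = [
        f'{rules["target"]}.{col} <- {rules["source"]}.{spec["from"]} ({spec["rule"]})'
        for col, spec in rules["columns"].items()
    ]
    return out, lineage

data, lineage = migrate([{"cust_no": 42, "name": "Ada Lovelace"}], RULES)
print(data)        # [{'customer_id': 42, 'full_name': 'ADA LOVELACE'}]
print(lineage[0])  # customer_dim.customer_id <- crm.customers.cust_no (copy)
```

The lineage strings are generated from the same metadata that drove the migration, so they cannot drift out of sync with what actually happened, which is the property hand-written ETL jobs lose over time.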