I recently wrote that ETL is the Root of all Data Problems, and I meant it. Traditionally, ETL code is written using an editor to create SQL statements, often embedded within a procedural language like Java. Historically, ETL has been the only way to get data from A to B. But writing ETL code leads to a lot of problems: most notably, lack of trust in data due to missing lineage and metadata, high maintenance costs, and security risks. When the norm was simple departmental data warehouses, manual ETL was tolerable. But as data lakes become enterprise-wide in size and complexity, manual coding breaks down, as discussed in the earlier article.
Companies running into these ETL problems sometimes make the mistake of adopting GUI-based ETL tools to replace a standard editor. Several vendors offer these GUI tools, but at their core they all basically use drag and drop instead of writing lines of code. So instead of writing something like SELECT NAME, ADDRESS, PHONE FROM CUSTOMER_FILE WHERE ZIPCODE = '90210', the user would drag the desired fields from a visual representation of the source table and drag ZIPCODE to the condition area, indicating it must equal '90210'.
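To show just how little the drag-and-drop step saves, here is the same query written and run by hand. This is a minimal sketch using Python's built-in sqlite3 with an in-memory stand-in for CUSTOMER_FILE; the table contents are made up for illustration:

```python
import sqlite3

# Illustrative only: a tiny in-memory stand-in for the CUSTOMER_FILE table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMER_FILE (NAME TEXT, ADDRESS TEXT, PHONE TEXT, ZIPCODE TEXT)"
)
conn.executemany(
    "INSERT INTO CUSTOMER_FILE VALUES (?, ?, ?, ?)",
    [
        ("Ada Lovelace", "1 Analytical Way", "555-0100", "90210"),
        ("Alan Turing", "2 Enigma Rd", "555-0101", "10001"),
    ],
)

# The same query a GUI tool builds with drag and drop, written by hand.
rows = conn.execute(
    "SELECT NAME, ADDRESS, PHONE FROM CUSTOMER_FILE WHERE ZIPCODE = '90210'"
).fetchall()
print(rows)  # one matching customer row
```

The hand-written statement is a single readable line; the GUI saves you typing it, and nothing more.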
There is one, and only one, benefit to a GUI approach: it ensures the correct syntax for the command. That’s it. Full stop! I don’t even think these GUI approaches are any faster or easier to learn, and if they are, the difference is marginal. SQL code is relatively easy, which is why there is so much of it (again, see the previous article for why ETL is the root problem).
For the limited value of GUI approaches, there are serious downsides:
- Vendor lock-in: Any company adopting a GUI tool for its ETL has just locked itself into a long-term commitment. This is because the “rules” of how you want all that data modified are now locked into a proprietary model inside the GUI tool. If the vendor goes under or discontinues the tool, or if a better option comes along, you are in for an expensive migration. ETL tends to hang around long-term and generate lots of jobs, so you will inevitably hit a very expensive wall.
- Data governance dependency: If data quality, metadata, lineage, etc. are requirements (and they should be!), your ability to capture them is only as good as the GUI tool. And since your Data Stewards will want a consistent repository for data governance, you must either force your entire enterprise into the tool or abandon automated data governance and do it manually (which always ends in disaster).
- Vendor-driven upgrade distractions: As with all vendor tools, they are constantly being updated, and support for older versions is dropped. The growing pile of GUI-generated code must be rerun and retested with every new release. You know the drill: unnecessary costs and testing cycles.
- Optimization limitations: What happens if the GUI-generated code isn’t performing well in your environment? Can you roll up your sleeves and tune the underlying SQL to improve it? Maybe. Most GUI tools give you some access to the generated code, but if you change it by hand, you break the link back to the GUI.
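To make the optimization point concrete, here is a minimal sketch of the kind of hand tuning generated code often needs. It uses Python's built-in sqlite3 (table and index names are made up): the generated query full-scans the table until an engineer adds an index by hand, the sort of environment-specific fix a GUI tool doesn't make for you:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMER_FILE (NAME TEXT, ADDRESS TEXT, PHONE TEXT, ZIPCODE TEXT)"
)

QUERY = "SELECT NAME, ADDRESS, PHONE FROM CUSTOMER_FILE WHERE ZIPCODE = '90210'"

# As generated: SQLite's plan reports a full table scan.
before = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
print(before[0][3])  # e.g. a SCAN of CUSTOMER_FILE

# Hand tuning outside the GUI: add an index on the filter column.
conn.execute("CREATE INDEX idx_customer_zip ON CUSTOMER_FILE (ZIPCODE)")
after = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
print(after[0][3])  # now a search using the index
```

The query text never changed; the fix lives outside it. In a GUI tool, that kind of change either isn't possible or severs the model from the code.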
Any one of these four downsides should be enough to scare any IT leader away from these tools. Honestly, you are better off coding straight ETL than using a GUI tool. They are just lipstick on a pig! Which brings us back to square one: ETL is the root of all data problems. GUI tools are a terrible choice, but manual coding doesn’t end well either. So what should you do?
At Next Pathway, we pioneered the “no ETL” approach to moving data and building data lakes that last.
Our tools use standard governance artifacts as input, which enforces data quality by design. We automatically generate the necessary source code (Spark, Hive, Scala, etc.) from these artifacts, which your company’s engineers own, manage in their own code repository, and maintain over time. This keeps the IT employees (and data stewards) in your company in control, with no vendor lock-in.
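To illustrate the general idea of metadata-driven generation (this is a hypothetical sketch, not Next Pathway's actual implementation or artifact format), a small governance artifact describing a mapping can be turned into plain SQL text that engineers then check into their own repository:

```python
# Hypothetical governance artifact: field mappings plus a filter rule.
# All names and the structure here are illustrative.
mapping = {
    "source_table": "CUSTOMER_FILE",
    "fields": ["NAME", "ADDRESS", "PHONE"],
    "filter": ("ZIPCODE", "90210"),
}

def generate_select(artifact: dict) -> str:
    """Generate plain SQL text from a mapping artifact."""
    cols = ", ".join(artifact["fields"])
    col, value = artifact["filter"]
    return f"SELECT {cols} FROM {artifact['source_table']} WHERE {col} = '{value}'"

sql = generate_select(mapping)
print(sql)
```

Because the output is ordinary source code rather than a proprietary model, lineage lives in the artifact and the generated code sits in your own version control, which is the opposite of the lock-in described above.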