Two-Speed Data

This past week I attended the Google Cloud conference “Next ’18” in San Francisco along with over 25,000 others. Last year attendance was 8,000; the growth reflects how the cloud is hitting the mainstream. The progress Google made in the past twelve months was impressive: not just many new technical advances, but also a growing partner ecosystem and a seriously beefed-up organization with many more sales, engineering, and support people.

As you’d expect, much of the focus was on data capabilities as well as the Analytics, Machine Learning, and AI tools that rely on data. In the middle of all the impressive advances, I was struck by a thought: there are two fundamental use cases for ingesting data into the cloud. The first is where quality and lineage are less important than cost and speed. This is very common with unstructured data, ad-hoc analytics, and ML/AI needs. The second is where quality and lineage are core requirements, such as when the data supports regulatory reporting. The difference is subtle, since of course people always strive for quality. But think of it as the difference between 99% being sufficient and 99.999% being required.

In the excitement of a conference where the focus is on showing how fast and easy everything is, virtually every example was of the former kind. And Google certainly wowed the crowd with strong capabilities for quickly ingesting data in support of their leading analytics, ML, and AI tools. But the reality for many businesses is that many or most of their cases are the latter, where quality and lineage are critical needs. A big “gotcha” to watch out for is when “both” is the answer. Take, for example, a bank loading credit card transactions into its data lake. The data analysts want the data fast and cheap so they can start mining it for insights. But other groups will also need that same data for regulatory, business, or financial reporting. So the right thing to do is take care up front to make sure you’re addressing both. Ignoring the other needs in order to help the data analysts get a quick win only results in throw-away work and rework.

Historically, the burden of tracking lineage has been quite high. Across a large enterprise where lots of developers manually write ETL jobs (including many contractors who come and go for various projects), lineage and metadata are tracked manually, as an afterthought. This approach rarely works. It takes a significant number of skilled staff (and funding) to train developers, track changes, and accurately update metadata and lineage records. Even in companies committed to doing this well, after-the-fact manual updating breaks down under the volume of changes and developers in a large enterprise.

If quick data ingestion takes one unit of effort, high-quality ingestion has historically taken three to five units. And as pointed out above, it rarely holds up over time. The result is that a lot of companies have given up, and those still trying have frustrated both their developers and their data governance teams. Unfortunately, the problem hasn’t gone away. Enterprise data lakes require accurate metadata and lineage, and that only works if the incremental effort is small and the method holds up over time.

Fortunately, at Next Pathway, we have solved both these problems.

At Next Pathway, we accelerate the ingestion of high-quality data and automatically capture lineage. Our Cornerstone product uses a patented process to eliminate the manual work of writing ETL jobs, which is where developers introduce quality and lineage errors. And the ETL code we automatically generate includes the capture of lineage information. The result is ingestion speed approaching that of the cases where quality matters less.
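Cornerstone’s generators are proprietary, so the details aren’t public, but the core idea (lineage records emitted by the same generated code that moves the data, rather than maintained by hand afterwards) can be sketched in a few lines of Python. All names and fields below are hypothetical illustrations, not our actual implementation:

```python
from datetime import datetime, timezone

def run_with_lineage(transform, rows, source_name, target_name, lineage_log):
    """Apply a transformation and append a lineage record as a side effect.

    Because the record is written by the same code that moves the data,
    it cannot drift out of sync the way hand-maintained metadata does.
    """
    result = transform(rows)
    lineage_log.append({
        "source": source_name,
        "target": target_name,
        "transform": transform.__name__,
        "row_count": len(result),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    })
    return result

def settled_only(rows):
    # Hypothetical transformation: keep only settled card transactions.
    return [r for r in rows if r["status"] == "settled"]

transactions = [
    {"id": 1, "status": "settled"},
    {"id": 2, "status": "pending"},
]

lineage = []
landed = run_with_lineage(settled_only, transactions,
                          "raw.card_txns", "lake.card_txns_settled", lineage)
# lineage now holds one record linking raw.card_txns to lake.card_txns_settled
```

The point of the sketch is the structure, not the code: when lineage capture is a by-product of generated ingestion code, no developer ever has to remember to update a metadata record.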

Our Fuse product takes the next step. Once quality, traceable data has landed in the lake via Cornerstone, Fuse performs the complex transformations necessary for an enterprise data lake or for regulatory reporting needs. Fuse also eliminates the need to manually write ETL jobs. Instead, experts define the target data schema, and Fuse generates the code necessary to physicalize the target EDL from the landed source data.
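Fuse’s internals aren’t public either, but the general pattern of schema-driven generation, where experts declare the target schema and generic code derives the transformation, can be shown in a minimal sketch. The field names and the `cast` convention here are my own assumptions for illustration:

```python
# A declarative target schema maps landed source fields to target columns.
# Generic code applies it, so no hand-written ETL job is needed per table.
TARGET_SCHEMA = {
    "customer_id": {"source": "cust_no", "cast": int},
    "amount_usd":  {"source": "amt", "cast": float},
}

def apply_schema(row, schema):
    """Build a target row from a landed source row using the declared mapping."""
    return {col: spec["cast"](row[spec["source"]])
            for col, spec in schema.items()}

landed_row = {"cust_no": "1042", "amt": "19.99", "extra": "ignored"}
target_row = apply_schema(landed_row, TARGET_SCHEMA)
# target_row == {"customer_id": 1042, "amount_usd": 19.99}
```

The design point is that the expert’s knowledge lives in the declarative schema, which is easy to review and change, while the mechanical work of transforming rows is generated once and reused everywhere.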

So please think about all your needs when ingesting data into a lake, not just getting some quick wins. If lineage capture is required, be sure to address it through automation; manual approaches tend not to work and are costly. We’d be happy to discuss your needs further or arrange a demo.