How Second Wave Data Lake Technologies Can Help Your Big Data Project

Getting a big data project off the ground is tougher than a lot of us might think. As of 2017, more than half of big data projects failed while still in the pilot phase, and fewer than a quarter of Hadoop deployments actually made it to production.

But why? Generally speaking, it is not that companies don’t value big data or don’t want to make use of big data software; it is that the biggest challenges of implementing big data projects are often misunderstood or underestimated. Big data is complicated, and many of the organizations and providers deploying it tend to undersell just how complicated it really is. That is why most projects end up being unsuccessful.

One mechanism a lot of people have turned to in order to store their big data cheaply and independently is the data lake. Data lakes have been increasingly popular since around 2010, the start of what is referred to as the “first wave” of data lake technologies. If built and managed properly, a data lake gives highly skilled analysts room to explore data analysis techniques and refine raw data without the compromises or costs of more traditional data stores, like a data mart or data warehouse. The problem is that without those costs and rules, data in a data lake can end up stagnant, unusable or poorly managed.

That’s where automation comes in. By automating as much of your data lake as possible in place of traditional hand coding, you can manage it without over-governing it.

Newer, more automated data lake technologies are being referred to as the “second wave” of data lake mechanisms. The technology available today is helping a lot of people who had given up on their data lake give it another shot, this time successfully.

If you previously thought you didn’t have the skills or resources to build and manage a data lake, automation could come to your rescue. Here are a few of the biggest issues people face when building and using a data lake, and how automation can help.

Loading Data into a Data Lake

The Challenge

Just getting started with a data lake is more complicated than a lot of people think. Raw data needs to be ingested or moved into the new data environment – which can take a long time, depending on the size of the data set.

With simpler data environments, the loading time of bigger data sets isn’t generally a problem. In a development sandbox or proof-of-concept platform, analysts just need data to be loaded once, and usually not renewed or updated. With a data lake, though, big data is stored for continuous exploration and analysis, so the long loading times of larger data sets can present a huge challenge, one that many people are not prepared for at the start. As data sets change, they need to be re-ingested over and over again, and it becomes an issue of wasted time and energy.

How Automation Can Help

Change data capture (CDC) is a set of software design patterns for identifying and tracking changes to data, allowing you to work only with the data that has changed. Applied to data ingestion in your data lake, it can minimize the loading time of huge data sets.

Using CDC, you can track small changes and then load only those changes into the environment, rather than ingesting the entire data set over and over again. This is a strategy some data engineers have used to make their data lake more efficient: they perform one full initial load, and from then on load and integrate only the incremental changes to the source data. In theory, CDC could really revolutionize data ingestion in your data lake.
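To make the idea concrete, here is a minimal sketch of watermark-based CDC in Python, using an in-memory SQLite table as a stand-in for the source system. The table and column names (orders, updated_at) are invented for the example; a real pipeline would persist the watermark somewhere durable and write the changed rows into the lake.

```python
# Minimal sketch of watermark-based change data capture: track the highest
# "updated_at" value already ingested and fetch only newer rows on each run.
import sqlite3

def load_changes(conn, last_watermark):
    """Return rows modified since the previous load, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo against an in-memory source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 25.5, "2024-01-02"), (3, 7.0, "2024-01-03")],
)

changes, watermark = load_changes(conn, "2024-01-01")
print(changes)    # [(2, 25.5, '2024-01-02'), (3, 7.0, '2024-01-03')]
print(watermark)  # '2024-01-03'
```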

Unfortunately, it is not a perfect system. For CDC to be effective, the changes it captures have to be merged back into the original data set, and that is easier said than done. As of now, most data stores do not let you merge or update data automatically, which means you have to take the time and energy to merge the changes yourself. So using CDC for your data ingestion may or may not make your data lake more efficient in the long run.

That said, this is an issue that further automation of data lakes may be able to counter. With the second wave, more and more automation can be expected in data lake technologies, and tools that merge changes into the base data of a data lake would let engineers expedite data ingestion using CDC.
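As a rough illustration of what that merge step involves, here is a hedged sketch of upserting CDC change records into a base data set. The records are plain Python dictionaries keyed by a hypothetical primary key, id; second-wave tools perform the same logic against columnar files at far larger scale.

```python
# Sketch of merging CDC change records into base data: insert new rows,
# update existing ones, and drop rows marked as deleted.
def merge_changes(base, changes):
    """Upsert (and optionally delete) change rows into the base data set."""
    merged = {row["id"]: row for row in base}
    for change in changes:
        if change.get("deleted"):
            merged.pop(change["id"], None)   # tombstone: remove the row
        else:
            merged[change["id"]] = change    # insert or update
    return sorted(merged.values(), key=lambda row: row["id"])

base = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]
changes = [{"id": 2, "amount": 30.0}, {"id": 3, "amount": 7.0}]
print(merge_changes(base, changes))
# [{'id': 1, 'amount': 10.0}, {'id': 2, 'amount': 30.0}, {'id': 3, 'amount': 7.0}]
```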

Prepping Data in a Data Lake

The Challenge

The purpose of a data lake is to allow analysts access to raw data to work with and refine. That means data in the lake is not cleaned or prepped in any way when it is loaded; it has to be prepared before it can be combined with other data sets in the lake.

This can be challenging for an individual developer writing code by hand. Data pipelines and analytics need to be created quickly to keep pace, and they need to function efficiently. They also have to adapt to the data set in question: to operate according to the workflow and to work within data environments of varying sizes without needing to be changed or rewritten. It is possible to meet this challenge, but it can become another issue of time and energy, especially for smaller companies and organizations with limited programming and analytical skills.

How Automation Can Help

As of now, one of the most common solutions to this challenge is to use data wrangling tools to work with data inside the lake. Working through Python, R or Spark, data scientists can use wrangling tools to search through data already stored in the lake and prep data sets for further analysis. This is generally faster and easier than writing prep code from scratch; even a team of researchers with limited programming skills can prep and cleanse large data sets efficiently.
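As a rough picture of what such tools automate, here is a short pandas sketch of a typical cleansing step: deduplication, type coercion and handling of missing values. The column names (customer_id, signup_date, spend) are invented for the example; a wrangling tool would apply the same kinds of transformations through a visual interface.

```python
# Sketch of basic data prep with pandas: drop duplicates, coerce types,
# fill missing values, and discard rows that cannot be repaired.
import pandas as pd

def prep(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset=["customer_id"])
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df["spend"] = pd.to_numeric(df["spend"], errors="coerce").fillna(0.0)
    return df.dropna(subset=["signup_date"])

raw = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "signup_date": ["2024-01-05", "2024-01-05", "not a date"],
    "spend": ["19.99", "19.99", None],
})
print(prep(raw))  # one clean row for customer 1; customer 2 dropped for a bad date
```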

Of course, using wrangling tools to prep your data is not perfect. While intuitive visual interfaces make the work easier, they also offer more of a blanket solution than individually coded pipelines and analytics, so more customized tools will still be necessary. For example, most data wrangling software does not support shared learning, meaning the wrangling work done by one team is not reusable across an organization. Prepping data in the data lake with wrangling tools will still require a lot of time and energy on the part of analysts.

Again, the hope is that further automation in second wave data lake technologies will help counter this problem, adapting data wrangling tools so they are more efficient to use on the large data sets stored in a data lake.

Data Visualization Tools and Queries

The Challenge

After data has been loaded into the data lake and prepped for analysis, the results need to be visualized so they can be presented to the rest of the company or organization, most of whom do not have the analytical skills to understand the data as it is.

Generally, with smaller data sets, this can be done fairly efficiently using business intelligence and data visualization tools. There is plenty of software out there designed to make complex data appear neat and presentable. The problem is that most of those tools are not equipped to handle the huge volumes of data typically stored in a data lake. This is one of the reasons big data projects are often not met with success.

Big data is more complex, and so queries against big data are going to be more complex. Engines like Hive or NoSQL stores are not built to compute highly complex queries; they may succeed with big data from a data lake, but they will take longer and will generally not perform as well.

There are a few ways to manage this issue, and they all mean taking up a lot of time and resources, rendering your data lake less cost-effective. One solution is to generate in-memory OLAP cubes or data models; again, this requires a lot of time and programming talent to ensure they can handle larger data volumes and more complex queries. Otherwise, some data scientists move data sets back into a warehouse or data store, which in a way defeats the purpose of the data lake in the first place.
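To illustrate the idea behind those cubes and models, here is a toy pandas sketch that pre-aggregates raw, event-level data into a small summary table a visualization tool can query cheaply. The dimensions and measures (region, month, revenue) are invented, and real cubes are built from far larger volumes, but the principle is the same: do the heavy aggregation once, up front.

```python
# Sketch of pre-aggregation: roll raw events up into a summary table once,
# so dashboards query the small aggregate instead of scanning the lake.
import pandas as pd

events = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "month": ["2024-01", "2024-02", "2024-01", "2024-01"],
    "revenue": [120.0, 80.0, 200.0, 50.0],
})

cube = (
    events.groupby(["region", "month"])["revenue"]
    .agg(["sum", "count"])
    .reset_index()
)
print(cube)
```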

How Automation Can Help

Just like with the two challenges outlined above, the root of this issue is that data management tools and software are not designed to handle big data or to work within data lakes, which means programmers are left to create solutions on their own.

Engineering data models or OLAP cubes by hand so that data from the data lake can be visualized might be more trouble than it is worth, depending on the resources of your organization. Improved automation in second wave data lake technologies should reform data visualization tools so they can handle larger, more complex data.

Reliability and Portability of Data Pipelines

The Challenge

Even with a successful data lake up and running, hand-coded solutions are going to require continuous work and upkeep.

For one thing, the pipelines and analytics used in your data lake need to remain operational and functional for continued use. A data lake is meant to support continuous analysis and use of data to drive decision-making within your organization; it is not a matter of gaining one insight and then forgetting about the data. In other words, programmers working within a data lake are not creating a solution to work just once; they are creating solutions that need to be reliable over and over again without errors.

Portability is the other side of this issue. There may be multiple big data environments running within one organization, deployed by different teams. A hand-coded pipeline will need to be compatible with different platforms, like Google Cloud, Azure and others. That is a difficult challenge to meet, even for the most advanced programmers.

How Automation Can Help

The solution to this challenge? Moving away from hand-coded tools and solutions and making use of increasingly automated technology.

When it comes to reliability, automated analytical tools are better at reducing the chance of error and functioning time after time, continuing to provide new insights. Automation also makes it easier for pipelines to be compatible across multiple platforms, connecting big data environments across an organization to make everything more streamlined and efficient.
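One pattern that supports that kind of portability, sketched below as a Python example, is to hide the storage backend behind a small interface so the same pipeline code runs unchanged against local files, Google Cloud Storage or Azure Blob Storage. Only an illustrative local-file implementation is shown; cloud versions would wrap the respective vendor SDKs, and all names here are assumptions for the example.

```python
# Sketch of a portable pipeline: the transformation depends only on a narrow
# storage interface, so swapping clouds means swapping the ObjectStore.
from typing import Protocol

class ObjectStore(Protocol):
    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...

class LocalStore:
    """Filesystem-backed implementation, useful for development and tests."""
    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

    def write(self, path: str, data: bytes) -> None:
        with open(path, "wb") as f:
            f.write(data)

def run_pipeline(store: ObjectStore, src: str, dst: str) -> None:
    # The pipeline logic never touches a vendor SDK directly.
    raw = store.read(src)
    store.write(dst, raw.upper())
```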

Final Thoughts

The concept of a data lake to manage your organization’s big data is certainly appealing. It is a cheaper way to manage big data and opens unique opportunities for the analysis and refinement of data. That appeal is what led to the first wave surge of data lake technologies in the first place.

However, without a team of highly skilled programmers dedicated to the project, initiatives to build and manage a data lake often fail. The technology to fully automate the functions of a data lake simply is not there yet, so data lakes require a lot of fine-tuning, upkeep and coding to really be useful.

Fortunately, the second wave of data lake technologies has arrived, and with any luck, as the concept of data lakes rises in popularity again, more and more automated technologies will be adapted and made available. Automation will make data lakes more time-efficient, more reliable and easier to use. The more an organization can move away from hand-coded solutions, the more effective its data lake is going to be.

In turn, improved data lake technologies can make big data projects more achievable for all companies. With this second wave, we can hope to see an increase in successful big data projects.