Getting a big data project off the ground is tougher than a lot of us might think. As of 2017, over half of big data projects failed while still in the pilot phase – and fewer than a quarter of Hadoop deployments actually made it to production.
But why? Generally speaking, it is not
because companies don’t value big data or don’t want to make use of big data
software, but because the biggest challenges of implementing big data projects
are often misunderstood or underestimated. Big data is complicated. Many of the organizations and providers deploying it tend to undersell how complicated it really is – that's why most projects end up being unsuccessful.
One mechanism a lot of people have turned to in order to store their big data cheaply and independently is the data lake. Data lakes have grown increasingly popular since around 2010, the start of what is referred to as the “first wave” of data lake technologies. If built and managed properly, a data lake is a great way for highly skilled analysts to explore, analyze and refine data, without any of the compromises or costs of more traditional data stores, like a data mart or data warehouse. The problem is that without those costs and rules, data in a data lake can end up stagnant, unusable or poorly managed.
That’s where automation comes in. By
automating your data lake as much as possible to replace traditional hand
coding, you can manage your data lake without over-governing it.
Newer, more automated data lake technologies are being referred to as the “second wave” of data lake mechanisms. The technology available today is helping a lot of people who may have abandoned or given up on their data lake give it another shot – this time with a better chance of success.
If you are someone who previously thought you didn't have the skills or resources to build and manage a data lake, automation could come to your rescue. Here are a few of the biggest issues people face when building and using a data lake, and how automation can help address them.
Just getting started with a data lake is
more complicated than a lot of people think. Raw data needs to be ingested or
moved into the new data environment – which can take a long time, depending on
the size of the data set.
With simpler data environments, the loading time of bigger data sets isn't generally a problem. In a development sandbox or proof-of-concept platform, analysts just need data to be loaded once, and it usually does not need to be refreshed or updated. With a data lake, though, big data is stored
for continuous exploration and analysis, so the long loading times of larger
data sets can actually present a huge challenge, which many people are not
prepared for at the start. As data sets change, they need to be re-ingested
over and over again. It becomes an issue of wasted time and energy.
Automation Can Help
Change data capture (or CDC) is a process
that relies on a set of software design patterns to target and track changing
data, allowing you to work only with the data that has changed. This approach can be applied to data ingestion in your data lake to minimize the loading time of huge data sets.
Using CDC, you can track small changes and
then load only those changes into the environment, rather than ingesting the
entire data set over and over again. This is a strategy some data engineers
have used to make their data lake more efficient – it allows them to load all
the data at once, and then load and integrate minimal changes to the source
data. In theory, CDC could really revolutionize data ingestion in your data lake.
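To make the idea concrete, here is a minimal sketch of watermark-based incremental loading in Python. The source database, the orders table, its updated_at column and the watermark file are all hypothetical stand-ins; a production CDC setup would more likely read changes from the source database's transaction log rather than query it.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical local source and watermark file; in practice the source would be
# an operational database and the target would be object storage in the lake.
SOURCE_DB = "source.db"
WATERMARK_FILE = "last_loaded_at.txt"

def read_watermark() -> str:
    """Return the timestamp of the last successful load (epoch start if none)."""
    try:
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01T00:00:00+00:00"

def incremental_load() -> None:
    """Pull only rows changed since the last load, instead of the full table."""
    watermark = read_watermark()
    conn = sqlite3.connect(SOURCE_DB)
    changed_rows = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    conn.close()

    # Land the changed rows in the lake (stubbed out here as a print).
    for row in changed_rows:
        print("landing changed row:", row)

    # Advance the watermark so the next run skips everything already loaded.
    with open(WATERMARK_FILE, "w") as f:
        f.write(datetime.now(timezone.utc).isoformat())

if __name__ == "__main__":
    incremental_load()
```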
Unfortunately, it is not a perfect system.
The changes tracked by CDC have to be merged back into the original data set for the approach to be effective. That is easier said than done. As of now, most data
stores do not allow you to automatically merge or update data, which means you
have to take the time and energy to merge the changes yourself. So using CDC
for your data ingestion may or may not make your data lake more efficient, in
the long run.
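To show what that manual merge step looks like, here is a minimal pandas sketch that applies a batch of change records to a base data set keyed on a hypothetical id column; stores that support a SQL MERGE or upsert statement can express the same operation declaratively.

```python
import pandas as pd

# Base data already sitting in the lake (hypothetical example records).
base = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "shipped", "new"]})

# A batch of change records captured by CDC: one update (id 2) and one insert (id 4).
changes = pd.DataFrame({"id": [2, 4], "status": ["delivered", "new"]})

def apply_changes(base: pd.DataFrame, changes: pd.DataFrame) -> pd.DataFrame:
    """Upsert change records into the base data set, keyed on 'id'."""
    combined = pd.concat([base, changes], ignore_index=True)
    # Keep the last version of each key, so the change record wins over the base row.
    return combined.drop_duplicates(subset="id", keep="last").sort_values("id")

if __name__ == "__main__":
    print(apply_changes(base, changes))
```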
That said, this is an issue that further automation of data lakes may be able to counter. With the second wave movement, more and more automation can be expected in future data lake technologies – tools that merge changes with base data in a data lake could allow engineers to expedite data ingestion using CDC.
The purpose of a data lake is to allow
analysts access to raw data to work with and refine. That means data in the
lake is not cleaned or prepped in any way when it is loaded; it has to be prepared before it can be combined with other data sets in the lake.
This can be challenging for an individual
developer handwriting code. Data pipelines and analytics need to be created
quickly to keep pace, and they need to function efficiently. They also have to
be able to adapt to the data set in question – to operate according to the
workflow and to work within data environments of varying sizes without needing
to be changed or rewritten. It is possible to meet this challenge, but it can
become another issue of time and energy efficiency, especially for smaller
companies and organizations with limited programming and analytical skills.
As of now, one of the most common solutions to this challenge is to use data wrangling tools to work with data within the lake. Through intuitive interfaces built on tools like Python, R or Spark, data scientists can use data wrangling tools to search through data already stored in the lake and prep data sets for further analysis. This is generally faster than writing prep code by hand, and easier – a team of researchers with limited programming skills will be able to prep and cleanse large data sets on their own.
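As a rough illustration of the prep work involved, here is a minimal pandas sketch that cleans a small raw extract; the records and columns are hypothetical, and a wrangling tool would typically generate equivalent steps from a visual interface rather than requiring them to be written by hand.

```python
import pandas as pd

# Hypothetical raw extract as it might land in the lake: messy keys, a bad
# date, and inconsistent country codes.
raw = pd.DataFrame(
    {
        "customer_id": [" 101", "102", None, "103"],
        "signup_date": ["2020-01-05", "not a date", "2020-02-10", "2020-03-01"],
        "country": ["CA", "ca", "Canada", "US"],
    }
)

# Typical wrangling steps: drop rows with missing keys, trim whitespace,
# coerce dates, and normalize country codes.
clean = (
    raw.dropna(subset=["customer_id"])
       .assign(
           customer_id=lambda df: df["customer_id"].str.strip(),
           signup_date=lambda df: pd.to_datetime(df["signup_date"], errors="coerce"),
           country=lambda df: df["country"].str.upper().replace({"CANADA": "CA"}),
       )
       .dropna(subset=["signup_date"])
)

print(clean)
```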
Of course, using wrangling tools to prep
your data is not perfect. While intuitive visual interfaces make things easier, they also offer more of a blanket solution than individually coded pipelines
and analytics. More customized tools will still be necessary. For example, most
data wrangling software does not support shared learning – meaning wrangling
tools will not be reusable across an organization. Prepping data in the data
lake with the use of data wrangling tools will still require a lot of time and
energy on the part of analysts.
Again, the hope is further automation with
second wave data lake technologies will help to counter this problem, further
adapting data wrangling tools to make them more efficient to use on large data
sets within a data lake.
After data has been loaded into the data lake and prepped for analysis, visual results need to be created so the data can be presented to the rest of the company or organization – people who may not have the analytical skills to understand it in its raw form.
Generally, with smaller data, this can be
done fairly efficiently using business intelligence and data visualization
tools. There is plenty of software out there designed to make complex data
appear neat and presentable. The problem is, most of those tools are not equipped to handle the huge volumes of data typically stored in a data lake. This is one of the reasons big data projects are often not met with success.
Big data is more complex, and so queries
regarding big data are going to be more complex. Engines like Hive or NoSQL stores are not designed to handle highly complex queries – they may succeed against big data from a data lake, but they will take longer, and they will generally not perform as well.
There are a few ways to manage this issue,
and they all mean taking up a lot of time and resources, rendering your data
lake less cost-effective. One solution is to generate in-memory OLAP cubes or
data models – again, this requires a lot of time and programming talent to ensure they will be able to handle larger data volumes and more complex queries. Otherwise, some data scientists move data sets back into a warehouse or data store, which in a way defeats the purpose of the data lake in the first place.
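A lightweight cousin of the OLAP-cube approach is to pre-aggregate lake data into a small summary table that standard BI tools can query comfortably. The pandas sketch below uses hypothetical sales columns and stands in for what a cube or a dedicated modeling layer would do at much larger scale.

```python
import pandas as pd

# Hypothetical detailed records as they might sit in the lake.
sales = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West", "West"],
        "month": ["2020-01", "2020-02", "2020-01", "2020-01", "2020-02"],
        "amount": [120.0, 80.0, 200.0, 50.0, 90.0],
    }
)

# Roll the data up to the grain the dashboards actually need (region x month),
# so the visualization tool never has to scan the full detail.
summary = (
    sales.groupby(["region", "month"], as_index=False)
         .agg(total_amount=("amount", "sum"), order_count=("amount", "count"))
)

print(summary)
```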
Just like with the two challenges outlined
above, the root of this issue is that data management tools and software are not
designed to handle big data or work within data lakes, which means programmers
are left to create solutions on their own.
Engineering data models or OLAP cubes to visualize data from the data lake might be more trouble than it is worth, depending on the resources of your organization. Improved automation in second wave data lake technologies should help data visualization tools handle larger, more complex data.
Even with a successful data lake up and
running, hand-coded solutions are going to require continuous work and upkeep.
For one thing, the pipelines and analytics
used in your data lake need to remain operational and functional for continued
use. A data lake is meant to support continuous analysis and use of data to drive
decision-making within your organization – it is not a matter of gaining one
insight and then forgetting about the data. In other words, programmers working
within a data lake are not creating a solution to work just once; they are
creating solutions that need to be reliable over and over again without errors.
Portability is the other side of this
issue. There may be multiple big data environments running within one
organization, deployed by different teams. A hand-coded pipeline will need to be compatible with different platforms, like Google Cloud, Azure and others. That is a
difficult challenge to meet, even for the most advanced programmers.
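One common way hand-written pipelines soften this problem is to isolate platform-specific storage behind a small interface. The Python sketch below is a hypothetical illustration with two stub backends; it does not depict any particular vendor SDK, whose real client calls would replace the print statements.

```python
from abc import ABC, abstractmethod

class LakeStorage(ABC):
    """Minimal storage interface the pipeline codes against."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

class GcsStorage(LakeStorage):
    """Stub standing in for a Google Cloud Storage backend."""

    def write(self, path: str, data: bytes) -> None:
        print(f"writing {len(data)} bytes to gs://{path}")

    def read(self, path: str) -> bytes:
        print(f"reading gs://{path}")
        return b""

class AzureBlobStorage(LakeStorage):
    """Stub standing in for an Azure Blob Storage backend."""

    def write(self, path: str, data: bytes) -> None:
        print(f"writing {len(data)} bytes to an Azure blob at {path}")

    def read(self, path: str) -> bytes:
        print(f"reading an Azure blob at {path}")
        return b""

def run_pipeline(storage: LakeStorage) -> None:
    """The pipeline only talks to the interface, so it ports across clouds."""
    storage.write("landing/orders.csv", b"id,status\n1,new\n")
    storage.read("landing/orders.csv")

if __name__ == "__main__":
    run_pipeline(GcsStorage())
    run_pipeline(AzureBlobStorage())
```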
The solution to this challenge? Moving away
from hand-coded tools and solutions and making use of increasingly automated data lake technologies.
When it comes to reliability, automated
analytical tools will be better at reducing the chances of error and
functioning time after time, to continue providing new insights. Automation
will also make it easier for pipelines to be compatible across multiple
platforms, connecting big data environments across an organization to help
everything become more streamlined and efficient.
The concept of a data lake to manage your
organization’s big data is certainly appealing. It is a cheaper way to manage
big data and allows unique opportunities for the analysis and refinement of
data. That appeal is what led to the first wave surge of data lake
technologies in the first place.
However, without a team of highly-skilled
programmers dedicated to the project, initiatives to build and manage a data
lake often fail. The technology is simply not there yet to fully automate the
functions of a data lake, and so it requires a lot of fine-tuning, upkeep and
coding to really be useful.
Fortunately, the second wave of data lake
technologies has arrived, and with any luck, as the concept of data lakes rises
in popularity again, more and more automated technologies will be adapted and
made available. Automation will make data lakes more time efficient, reliable
and easier to use. The more an organization can move away from hand-coded
solutions, the more effective their data lake is going to be.
In turn, improved data lake technologies
can make big data projects more achievable for all companies. With this second
wave, we can hope to see an increase in successful big data projects.