Data lakes have become a cornerstone of many big data initiatives, as they offer easier and more flexible options for scaling when working with high volumes of data generated at high velocity, such as web, sensor, or app activity data. As these types of data sources have become increasingly prevalent, interest in data lakes has grown at a rapid pace, as can be seen from this Google Trends chart:
However, as with any emerging technology, there is no
one-size-fits-all: a data lake might be an excellent fit for some scenarios,
but in other cases, sticking to tried-and-tested database architectures will be
the better solution. In this article we’ll look at four indicators that should
help you understand whether it’s time to join the data lake bandwagon or
whether you should stick to traditional data warehousing. But first, let’s set
the parameters of the discussion by defining the term ‘data lake’.
Data Lakes: A Functional Definition
A data lake is a big data architecture that focuses on storing unstructured or semi-structured data in its original form, in a single repository that serves multiple analytic use cases or services. Storage and compute resources are decoupled: data at rest resides on inexpensive object storage, such as on-premises Hadoop (HDFS) or Amazon S3, while various tools and services such as Presto, Elasticsearch, and Amazon Athena can be used to query that data.
This differs from traditional database or data warehouse architectures, where compute and storage are coupled, and the data is structured upon ingestion in order to enforce a set schema. Data lakes make it easier to adopt a ‘store now, analyze later’ approach, as there is very little effort involved in ingesting data into the lake; however, when it comes to analyzing the data, some of the traditional data preparation challenges can appear.
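The ‘store now, analyze later’ idea can be made concrete with a small sketch. Here a local directory stands in for an object store bucket: events of any shape are appended as raw JSON lines under date partitions, and the schema is only applied at read time by whatever query runs later. The directory layout and function names are illustrative assumptions, not any particular product’s API.

```python
import json
import os
from datetime import datetime, timezone

LAKE_ROOT = "lake/events"  # stand-in for an object store bucket/prefix

def ingest(event: dict) -> str:
    """Append a raw event, as-is, under a date partition -- no schema enforced."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    part_dir = os.path.join(LAKE_ROOT, f"dt={day}")
    os.makedirs(part_dir, exist_ok=True)
    file_path = os.path.join(part_dir, "events.jsonl")
    with open(file_path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return file_path

def query(predicate) -> list:
    """'Analyze later': scan the stored files and apply structure at read time."""
    results = []
    for dirpath, _, filenames in os.walk(LAKE_ROOT):
        for name in filenames:
            with open(os.path.join(dirpath, name)) as f:
                for line in f:
                    record = json.loads(line)
                    if predicate(record):
                        results.append(record)
    return results

# Events with different shapes land side by side, with no DDL and no rejects.
ingest({"user": "a", "action": "click", "target": "#buy"})
ingest({"user": "b", "action": "pageview"})  # no 'target' field -- still fine
print(query(lambda r: r["action"] == "click"))
# [{'user': 'a', 'action': 'click', 'target': '#buy'}]
```

In a real lake the `query` side would be a service such as Athena or Presto scanning S3, but the division of labor is the same: cheap, schemaless writes up front, and structure imposed only when the data is read.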
Now that we have a definition, let’s go on to ask:
does your organization need a data lake? Start by looking at these four key
indicators.
1. How Structured Is Your Data?
Data lakes are excellent for storing large volumes of
unstructured and semi-structured data. Storing this type of data in a database
requires extensive data preparation, as databases are built around structured
tables rather than raw events, which typically arrive in formats such as JSON or XML.
If most of your data is composed of structured tables
– e.g. preprocessed CRM records or financial balance sheets – it could be
easier to stick to a database. However, if you’re working with a large volume
of event-based data such as server logs or clickstream, it might be easier to
store that data in its raw form and build specific ETL flows based on your use
case.
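The cost of forcing event data into structured tables is easy to see in miniature. Below, a raw clickstream event (a hypothetical example) is projected onto a fixed set of columns decided up front; anything the schema didn’t anticipate, such as nested or list-valued fields, is simply dropped, which is exactly the preparation burden the lake approach defers.

```python
# A raw clickstream event: nested, sparse, and prone to schema drift.
raw_event = {
    "user_id": 42,
    "ts": "2020-06-01T12:00:00Z",
    "page": {"url": "/pricing", "referrer": "/home"},
    "experiments": ["new_nav", "dark_mode"],  # list-valued -- no natural column
}

# Fitting it into a relational row means choosing the schema in advance:
COLUMNS = ("user_id", "ts", "url")

def to_row(event: dict) -> tuple:
    """Lossy projection of a raw event onto the predefined columns."""
    return (
        event.get("user_id"),
        event.get("ts"),
        event.get("page", {}).get("url"),
    )

print(to_row(raw_event))  # 'referrer' and 'experiments' are silently lost
```

Storing the raw event instead keeps those fields available for whichever ETL flow or use case needs them later.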
2. How Complex Is Your ETL Process?
ETL (extract-transform-load) is typically a
prerequisite to actually putting your data to use; however, when working with
big or streaming data it can become a major roadblock due to the complexity of
writing ETL jobs using code-intensive frameworks such as Spark/Hadoop.
To minimize the amount of resources you are spending on ETL, try to identify where the main bottleneck occurs. If you’re mostly struggling with trying to ‘fit’ semi-structured and unstructured data into your relational database, it might be time to think of making the transition to a data lake. However, you might still run into a lot of challenges in creating ETL flows from the lake to the various target services you’ll use for analytics, machine learning, etc. – in which case you might want to use a data lake ETL tool in order to automate some of these processes.
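A large share of lake-to-target ETL work boils down to transforms like flattening nested records into column-like fields before loading them into an analytics service. The single-machine sketch below shows the shape of such a transform; in practice the same logic would run inside a Spark job or a managed ETL tool, and the dot-separated naming convention is just an assumption for illustration.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested JSON into flat, column-like keys (e.g. 'page.url')."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

event = {"user": "a", "page": {"url": "/docs", "meta": {"lang": "en"}}}
print(flatten(event))
# {'user': 'a', 'page.url': '/docs', 'page.meta.lang': 'en'}
```

Even this simple case hints at why hand-coding ETL for big or streaming data gets expensive: real jobs must also handle schema drift, bad records, late data, and partitioning, which is where automated data lake ETL tools earn their keep.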
3. Is Data Retention an Issue?
Since databases couple storage with compute, storing
very large volumes of data in a database becomes expensive. This leads to a lot
of fiddling with data retention: either pruning certain fields from the data,
or limiting the period for which historical data is retained, in order to
control costs.
If your organization is constantly struggling to
strike the right balance between holding on to data for analytical purposes
versus getting rid of data to control costs, a data lake solution might be in
order – as data lake architectures built around inexpensive object storage
allow you to hold on to terabytes or even petabytes of historical data without
paying through the nose.
4. Is Your Use Case Predictable or Experimental?
The final question you should ask is what you intend
to do with the data. If you’re building a report, set of reports, or dashboards
by running a predetermined set of queries against regularly updated tables, a
data warehouse will probably serve you well, as you can set up such a solution
using SQL and off-the-shelf data warehouse and business intelligence tools.
However, for more experimental use cases, such as
machine learning and predictive analytics, it’s more difficult to know in
advance what data you’ll need and how you would like to query it. Here, a data
warehouse can be highly inefficient, since its predefined schema limits your
ability to explore the data; a data lake could be a better fit.
Conclusion: Is a Data Lake Right for You?
Ending an article with “it depends” always feels like a cop-out, but the reality of the matter is that most tech questions don’t have a single answer. When your data reaches a certain level of size and complexity, data lakes are definitely the way to go. Is your organization there yet? You can use the four questions detailed above to try and reach an answer to that question.
About the Author
Eran Levy is Director of Marketing at Upsolver. Upsolver is a cloud-native platform that you configure using a simple, visual UI and SQL. The world’s most innovative companies use Upsolver to automate data lake operations: ingestion, storage management, schema management, and ETL flows (including aggregations and joins).