Do You Actually Need a Data Lake?

Data lakes have become the cornerstone of many big data initiatives, as they offer an easier and more flexible way to scale when working with high volumes of data generated at high velocity – such as web, sensor, or app activity data. As these types of data sources have become increasingly prevalent, interest in data lakes has grown at a rapid pace, as Google Trends data for the term shows.

However, as with any emerging technology, there is no one-size-fits-all: a data lake might be an excellent fit for some scenarios, while in other cases sticking to tried-and-tested database architectures is the better choice. In this article we’ll look at four indicators that should help you decide whether it’s time to join the data lake bandwagon or stick with traditional data warehousing. But first, let’s set the parameters of the discussion by defining the term ‘data lake’.

Data Lakes: A Functional Definition

A data lake is a big data architecture that stores unstructured or semi-structured data in its original form, in a single repository that serves multiple analytic use cases or services. Storage and compute resources are decoupled: data at rest resides on inexpensive storage, such as on-premises Hadoop (HDFS) or Amazon S3, while various tools and services such as Presto, Elasticsearch, and Amazon Athena can be used to query that data.
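To make the decoupling concrete, here is a minimal sketch of querying data that already sits in S3 using Amazon Athena via boto3. The bucket, database, and table names are hypothetical placeholders; the point is that compute is provisioned only for the duration of the query, independently of where the data lives.

```python
# Decoupled storage and compute: the data stays in S3, and Athena
# runs compute only for the duration of the query.
# Bucket, database, and table names are hypothetical placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT user_id, event_type FROM clickstream_events LIMIT 10",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
print(response["QueryExecutionId"])  # poll this ID for results
```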

This differs from traditional database and data warehouse architectures, where compute and storage are coupled and the data is structured on ingestion to enforce a set schema (schema-on-write). Data lakes make it easier to adopt a ‘store now, analyze later’ approach, as there is very little effort involved in ingesting data into the lake; however, when it comes to analyzing the data, some of the traditional data preparation challenges can appear.
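As an illustration of how little work ingestion requires, the sketch below writes a raw JSON event straight to object storage, with no schema enforced at write time. The bucket name and event fields are hypothetical.

```python
# 'Store now, analyze later': raw events land in object storage as-is,
# with no schema enforced at write time. Names are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

event = {"user_id": 42, "event_type": "page_view", "url": "/pricing"}
now = datetime.now(timezone.utc)
key = f"events/dt={now:%Y-%m-%d}/{now.timestamp()}.json"

s3.put_object(
    Bucket="my-data-lake",
    Key=key,
    Body=json.dumps(event).encode("utf-8"),
)
```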

Now that we have a definition, let’s go on to ask: does your organization need a data lake? Start by looking at these four key indicators.

1. How Structured Is Your Data?

Data lakes are excellent for storing large volumes of unstructured and semi-structured data. Storing this type of data in a database requires extensive data preparation, as databases are built around structured tables rather than raw events, which typically arrive in formats such as JSON or XML.

If most of your data consists of structured tables – e.g., preprocessed CRM records or financial balance sheets – it could be easier to stick with a database. However, if you’re working with a large volume of event-based data such as server logs or clickstream data, it might be easier to store that data in its raw form and build specific ETL flows per use case.
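To see the kind of preparation a relational database forces on event data, here is a small sketch that flattens a nested JSON event into a single tabular row using pandas. The event structure is hypothetical.

```python
# The preparation a relational database demands: flattening a nested
# JSON event into one flat row. The event structure is hypothetical.
import pandas as pd

raw_event = {
    "event_type": "click",
    "timestamp": "2020-03-01T12:00:00Z",
    "user": {"id": 42, "country": "US"},
    "context": {"page": "/pricing", "referrer": "google"},
}

# json_normalize turns nested keys into dotted column names
flat = pd.json_normalize(raw_event)
print(flat.columns.tolist())
# ['event_type', 'timestamp', 'user.id', 'user.country',
#  'context.page', 'context.referrer']
```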

2. How Complex Is Your ETL Process?

ETL (extract-transform-load) is typically a prerequisite to actually putting your data to use; however, when working with big or streaming data, it can become a major roadblock due to the complexity of writing ETL jobs in code-intensive frameworks such as Spark or Hadoop.

To minimize the resources you spend on ETL, try to identify where the main bottleneck occurs. If you’re mostly struggling to ‘fit’ semi-structured and unstructured data into your relational database, it might be time to consider transitioning to a data lake. You might still run into plenty of challenges creating ETL flows from the lake to the various target services you’ll use for analytics, machine learning, etc., in which case a data lake ETL tool can automate some of these processes.
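For a sense of what such a job looks like in code, below is a sketch of a minimal Spark ETL flow: read raw JSON events from the lake, clean them, and write partitioned Parquet back. Paths and column names are hypothetical.

```python
# A minimal Spark ETL sketch: raw JSON in, partitioned Parquet out.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("clickstream-etl").getOrCreate()

# Read raw, schema-less JSON events from the lake
raw = spark.read.json("s3://my-data-lake/events/")

# Light cleaning: derive a partition column, drop malformed rows
cleaned = (
    raw.withColumn("event_date", to_date(col("timestamp")))
       .filter(col("user_id").isNotNull())
)

# Write analytics-ready, partitioned Parquet back to the lake
cleaned.write.mode("append").partitionBy("event_date").parquet(
    "s3://my-data-lake/curated/clickstream/"
)
```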

3. Is Data Retention an Issue?

Since databases couple storage with compute, storing very large volumes of data in a database becomes expensive. This leads to constant compromises around data retention – pruning certain fields from the data, or limiting how long historical data is kept – in order to control costs.

If your organization is constantly struggling to strike the right balance between holding on to data for analytical purposes and getting rid of it to control costs, a data lake solution might be in order: data lake architectures built around inexpensive object storage let you hold on to terabytes or even petabytes of historical data without paying through the nose.
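One reason object storage keeps long retention affordable is lifecycle tiering: instead of deleting old data, you move it to progressively cheaper storage classes. Below is a sketch using boto3 and S3, with a hypothetical bucket and prefix.

```python
# Keeping retention cheap on object storage: a lifecycle rule that
# tiers older objects to colder storage instead of deleting them.
# Bucket name and prefix are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-events",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```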

4. Is Your Use Case Predictable or Experimental?

The final question you should ask is what you intend to do with the data. If you’re building a report, a set of reports, or dashboards that essentially run a predetermined set of queries against regularly updated tables, a data warehouse will probably serve you very well, as you can set up such a solution using SQL and readily available data warehouse and business intelligence tools.
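As an example of the predictable case, the sketch below runs a fixed daily-revenue query against a warehouse speaking the Postgres protocol (which Amazon Redshift also does, hence psycopg2). Connection details and table/column names are hypothetical.

```python
# The predictable-reporting case: one fixed SQL query, run on a schedule
# against regularly updated warehouse tables. Connection details and
# table/column names are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="my-warehouse.example.com",
    dbname="analytics",
    user="reporter",
    password="...",  # placeholder credential
)

DAILY_REVENUE_SQL = """
    SELECT order_date, SUM(amount) AS revenue
    FROM fact_orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY order_date
    ORDER BY order_date;
"""

with conn, conn.cursor() as cur:
    cur.execute(DAILY_REVENUE_SQL)
    for order_date, revenue in cur.fetchall():
        print(order_date, revenue)
```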

However, for more experimental use cases – such as machine learning and predictive analytics – it’s harder to know in advance what data you’ll need and how you’ll want to query it. Here, data warehouses can be highly inefficient, as the predefined schema limits your ability to explore the data, and a data lake could be a better fit.

Conclusion: Is a Data Lake Right for You?

Ending an article with “it depends” always feels like a cop-out, but the reality is that most tech questions don’t have a single answer. When your data reaches a certain level of size and complexity, data lakes are definitely the way to go. Is your organization there yet? Use the four questions detailed above to reach an answer.

About the Author

Eran Levy is Director of Marketing at Upsolver. Upsolver is a cloud-native platform that you configure using a simple, visual UI and SQL. The world’s most innovative companies use Upsolver to automate all data lake operations: ingestion, storage management, schema management, and ETL flows (including aggregations and joins).
