Protecting Your Data Lake Requires a New Mindset

If you work in corporate IT, you can’t help but be aware that 2020 was a terrible year for data security, as it was for most of humanity. Many reasons have been put forward for this, most of which focus on short-term factors, chiefly the pandemic leading to an increase in phishing attacks.

Look a little deeper into these hacks, however, and you’ll spot a more fundamental, longer-term pattern at play. Though phishing remains as effective as ever, the quantity of data breached in 2020 was also partly a consequence of just how much data is now stored in large data lakes that lack internal security measures. This means that if an attacker gains access to a system, they can move around within it very easily.

This is not a new observation. It has led some, in fact, to question whether they actually need a data lake. While the answer is often positive, this doesn’t mean you’re doomed to the certainty of successful cyber attacks – as long as you are not relying on hope as a strategy, that is.

  1. The Security Onion

Here’s the way that a lot of organizations still think of data security. The need to centralize data pulls multiple discrete databases together into a data lake, either to improve access to data or as part of a mergers and acquisitions process. This leads to multiple types of data, each with its own access and security typologies, being housed in a large, undifferentiated mass on a server. The solution? Throw up a security cordon around the whole lot and then write an API to pull out the data that you actually need into a data lake.

Over time, multiple iterations of this process can lead to a security “architecture” (though maybe it’s a stretch to use that term here!) which looks a little like an onion: multiple systems nested inside each other, each protected with its own “security layer.”

The problem with this approach is two-fold. The first is that it is simply a waste of resources, both in terms of money spent on security solutions and in terms of the time that you will spend working with such a complex system. The second is that it is not secure.

  1. Data Lakes as Hacking Shortcuts

In order to see why, it helps to go back to basics, and to think about how data lakes were supposed to work. Back when they first emerged as a practical concept, the idea was that data would be pulled into a temporary data lake, processed, and would then disappear. The problem is that, for many organizations, data lakes are no longer a temporary invocation. They are, for all intents and purposes, permanent data structures.

Like its namesake, a data lake is a single repository of a large amount of something in an unmanaged state. It encourages the habit of throwing large amounts of data into one place and then pulling and analyzing it as needed. Data analysis and hacking are both so convenient for the same reason: all the data is in one place.

This means that data lakes are very “useful” tools for hackers. Some of the most massive data breaches in history occurred after hackers were able to gain access to a large repository of unprotected data online, then use its privileges to move into the databases it pulls from. Because this data was presumed to be protected by a perimeter security “wall,” a hacker who gets inside is granted a large degree of lateral freedom. In other words, data lakes act as a shortcut into the heart of the onion.
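To make that lateral freedom concrete, here is a toy Python model. It is only a sketch under stated assumptions: the dataset names, the account name, and both functions are invented for illustration and do not describe any real breach or product. It compares what a single phished account can reach when the only check is at the perimeter with what it can reach when each dataset also has its own check.

```python
# Toy model: every name below is hypothetical and exists only for illustration.
DATASETS = {"hr_records", "payment_cards", "web_logs"}

# Per-dataset grants: each account lists only the datasets it needs.
GRANTS = {"analyst": {"web_logs"}}


def reachable_perimeter_only(account, authenticated_accounts):
    """One check at the wall: anyone inside can read every dataset."""
    return set(DATASETS) if account in authenticated_accounts else set()


def reachable_with_internal_checks(account, authenticated_accounts):
    """Same wall, plus a separate check on each dataset inside it."""
    if account not in authenticated_accounts:
        return set()
    return DATASETS & GRANTS.get(account, set())


# An attacker phishes the analyst's credentials and gets past the perimeter.
inside = {"analyst"}
print(sorted(reachable_perimeter_only("analyst", inside)))
# ['hr_records', 'payment_cards', 'web_logs']
print(sorted(reachable_with_internal_checks("analyst", inside)))
# ['web_logs']
```

With only the outer wall, one stolen credential exposes everything in the lake; with internal checks, the same credential exposes only what that account was granted.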

So what’s the solution? Well, many would continue with the same flawed logic we’ve seen above, and say: wrap your data lake in another layer of security. Encase your entire system, data lake and all, in yet another layer of perimeter security. 

  1. Keep it Temporary

There are a few reasons why you shouldn’t do that. One is that adding yet another layer of security around a lake undermines the ease of access that is the whole reason you are using a data lake. The second is that you are adding another layer of complexity that you will have to manage, and unless you are one of the very few who are receiving real-time analytics from your data lake, you are unlikely to have the time, skill, or other resources necessary to manage intrusion attempts.

At a more fundamental level, however, building a perimeter security system for a data lake should strike you as an inherently absurd suggestion, because data lakes are supposed to be temporary. Also, as you likely point out when you help your clients with cybersecurity, simple security is generally better security.

Therefore, you should recognize that the value of data lakes lies in providing quick, effective access to the heterogeneous data you manage, though you should also recognize that they are an inherent security risk. In other words, use them carefully. Above all, this means using them in the way they were intended: invoking them quickly, performing the necessary analysis, and then closing them down just as quickly.
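The invoke, analyze, close pattern can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: it assumes the "lake" is simply a scratch directory of CSV files, and the names `temporary_lake`, `total_order_value`, and the sample `orders` rows are all hypothetical. The point is only that the staged copy is removed as soon as the analysis finishes, even if the analysis fails.

```python
import csv
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_lake(sources):
    """Stage only the requested sources in a scratch directory, then delete it."""
    with tempfile.TemporaryDirectory(prefix="lake_") as root:
        lake = Path(root)
        for name, rows in sources.items():
            with open(lake / f"{name}.csv", "w", newline="") as handle:
                csv.writer(handle).writerows(rows)
        yield lake
    # Leaving the inner "with" removes the directory and everything in it.


def total_order_value(lake):
    """Example analysis: sum the second column of the staged orders file."""
    with open(lake / "orders.csv", newline="") as handle:
        return sum(float(row[1]) for row in csv.reader(handle))


# Hypothetical rows standing in for data pulled from real source databases.
sources = {"orders": [["a-1", "19.50"], ["a-2", "5.25"]]}

with temporary_lake(sources) as lake:
    lake_path = lake
    total = total_order_value(lake)

print(total)               # 24.75
print(lake_path.exists())  # False: the lake is gone once the analysis is done
```

Only the result of the analysis outlives the block; there is no standing pool of data left behind for an intruder to find.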

That might represent a change of mindset for many administrators, and particularly those of the younger generation. Data lakes have represented a temptation for many years now, and it’s a temptation that many of us have fallen for. As a result, our lakes are less like the temporary lakes of summer showers, and more like the ancient, eternal lakes of alpine valleys.

  1. The Future

The importance of this lesson will only grow in the coming years. Despite some of the challenges that they present, not least when it comes to security, no one seriously doubts the future of the concept. Data lakes are the future of data warehousing, whether we like it or not.

That’s all the more reason to learn to use them properly. Make sure your data lakes are temporary and disappear as quickly as they are invoked. This will keep your systems much, MUCH more secure. 
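If you are not sure whether your own lakes have quietly become permanent, a simple age check is one place to start. The sketch below assumes you can list each lake with its creation time; the `stale_lakes` function, the inventory entries, and the 24-hour limit are hypothetical choices for illustration, not a recommended threshold.

```python
from datetime import datetime, timedelta, timezone


def stale_lakes(created_at, max_age, now=None):
    """Return the names of lakes that have existed longer than max_age, sorted."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, created in created_at.items() if now - created > max_age
    )


# Hypothetical inventory: lake name -> creation time (UTC).
now = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
inventory = {
    "quarterly_report": now - timedelta(hours=2),
    "merger_staging": now - timedelta(days=400),
}

overdue = stale_lakes(inventory, max_age=timedelta(hours=24), now=now)
print(overdue)  # ['merger_staging']
```

Anything the check flags is a candidate for being closed down, or at least for a conversation about why it still exists.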

About the Author

Bernard Brode has spent a lifetime delving into the inner workings of cryptography and now explores the confluence of nanotechnology, AI/ML, and cybersecurity.
