In this special guest feature, Stan Christiaens, Co-founder and CTO at Collibra, shares strategies to ensure that data moves beyond raw material to take its rightful place as a valuable business asset. The article outlines common problems with data lakes, strategies businesses can use to avoid those problems, and how governance enables a data lake to become more than just a data repository. Stan leads a global product organization with a focus on driving innovation in data governance and catalog software. Prior to co-founding Collibra, Stan was a senior researcher at the Vrije Universiteit of Brussels, a leading semantic research center in Europe, where he focused on application-oriented research in semantics. He holds a Master of Science degree in Information Technology and a Master’s degree in AI from Katholieke Universiteit Leuven and a Postgraduate in Industrial Corporate Governance from Europese Hogeschool Brussel.
Big data. Although the term is ubiquitous today, it wasn’t so long ago that “big data” wasn’t part of the everyday lexicon. The growth of data, in all its forms, during the last five years has been dizzying, and has caught many organizations worldwide flat-footed.
In light of this, finding a way to deal with all of this data has become big business. IDC estimates that worldwide revenue for big data and business analytics will grow to more than $260 billion by 2022. Organizations have made significant investments in hardware, software and services to deal with the onslaught of data.
Data lakes quickly emerged as a technology front-runner in the race to make data more digestible and to finally get it in one place. Data lakes are flexible and scalable, and they offer an easy way to store data. They serve as central repositories for all types of “raw” data, including structured, semi-structured and unstructured. The data structure and requirements aren’t defined until the data is needed. Ideally, a data lake is the go-to location for data scientists and business users alike, fueling all analytics activities across the business.
The reality is that getting insight and value from so much data is challenging. Forrester finds that between 60% and 73% of all enterprise data goes unused for analytics. It’s all too common for the majority of users to find only a small percentage of truly valuable data in this wide array of assets. In the rush to aggregate our data somewhere, lakes have become swamps of undefined data from a variety of sources. Data scientists and everyday business users struggle to find and understand data. Even worse, once they find a source, can they trust it?
It raises the question: Are we fooling ourselves? Can a single, centralized repository for the business really exist?
The answer is yes. The problem is not the lake, but rather how we organize and govern it.
The first issue to address when it comes to data lakes is how we organize them. It’s easy to misalign on the purpose and content of the lake. Therefore, it’s imperative to establish a comprehensive set of processes and controls before a single byte of data finds its way into the lake. Key questions about the data should include:
- What is it?
- Who owns it?
- Why should it be put in?
- Does it really belong in the lake?
- Is it the right source?
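One way to turn the questions above into an actual control is to require an answer to each of them before a data set is admitted. The following is a minimal sketch under stated assumptions, not a prescribed implementation: the `IntakeRecord` class, its field names, and the `admit` rule are all hypothetical and simply mirror the five questions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class IntakeRecord:
    """Hypothetical answers to the five intake questions for one data set."""
    description: str            # What is it?
    owner: str                  # Who owns it?
    purpose: str                # Why should it be put in?
    belongs_in_lake: bool       # Does it really belong in the lake?
    authoritative_source: bool  # Is it the right source?


def unanswered_questions(record: IntakeRecord) -> List[str]:
    """Return the names of fields that are blank or answered 'no'."""
    problems = []
    for field in fields(record):
        value = getattr(record, field.name)
        if isinstance(value, str):
            # A text answer counts only if it is not empty or whitespace.
            if not value.strip():
                problems.append(field.name)
        elif not value:
            # A yes/no answer must be 'yes' for the data set to qualify.
            problems.append(field.name)
    return problems


def admit(record: IntakeRecord) -> bool:
    """Assumed rule: admit a data set only when every question is satisfied."""
    return not unanswered_questions(record)
```

A record with no named owner, for example, would be held back by `admit` until someone takes responsibility for it, which is the point of asking the questions before the data lands rather than after.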
The next issue to address is how the lake is governed. Gartner has long warned that data lakes without the right level of governance will devolve into disconnected data pools. A common misconception is that governance builds walls between business users and data. The opposite is true. Governance creates transparency across the organization. So much data, generated so quickly, makes it difficult to understand the data’s origin, format and lineage, as well as how it is organized, classified and connected. Knowing these features is critical to using the data, and not knowing them results in poorer-quality outcomes. Data governance provides the structure and management the lake desperately needs, making data more accessible and meaningful, resulting in greater trust and quality. Without such a framework, it’s impossible to know what’s in the lake, who owns it, or its overall value.
Every data governance effort should include a data catalog to serve as a single source of intelligence for data users to discover and consume data. A data catalog should contain data for all of the categories comprising the lake, and the catalog should identify the most valuable data sets. For example, if the majority of users only use 10% of the data in the lake, the catalog must detect and label those assets as the most valuable. This allows data scientists and business users to spend less time concerned about the quality of the data. Instead, they can focus their energy on analyzing that trusted data to gain new insights and better meet customer needs.
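To make the 10% example concrete, here is a minimal sketch of how a catalog might pick out its most-used assets. It assumes the catalog already records an access count per data set and treats usage as a stand-in for value; the function name, the `top_percent` parameter, and the sample counts are hypothetical and not taken from any particular catalog product.

```python
from typing import Dict, Set


def most_used_assets(access_counts: Dict[str, int], top_percent: int = 10) -> Set[str]:
    """Return the names of the most-accessed data sets.

    Assumption: access count is used as a proxy for value. The cut-off is
    the top `top_percent` of data sets, rounded up, with at least one kept.
    """
    if not access_counts:
        return set()
    total = len(access_counts)
    # Integer ceiling of total * top_percent / 100, avoiding float rounding.
    keep = max(1, (total * top_percent + 99) // 100)
    # Rank by descending access count; break ties by name for stable results.
    ranked = sorted(access_counts, key=lambda name: (-access_counts[name], name))
    return set(ranked[:keep])


# Hypothetical access counts for ten data sets in the lake.
counts = {
    "orders": 950, "customers": 720, "returns": 41, "web_logs_raw": 3,
    "legacy_dump": 0, "sensor_feed": 12, "hr_export": 5, "pricing": 88,
    "inventory": 64, "survey_2016": 1,
}
featured = most_used_assets(counts)  # one data set, since 10% of ten is one
```

In practice a catalog would likely combine usage with other signals, such as ownership, certification, or quality scores, before labeling an asset as trusted; the count-only ranking here is just the simplest version of the idea.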
The promise of big data is to enable organizations to analyze their data to gain better insight and make more informed decisions than ever before. To realize this promise, we must look at how we collect that data and the processes we put in place to ensure we give data the appropriate meaning and context. Otherwise, we will only create more data dumping grounds. It’s not too late to implement the right level of governance to ensure your data lake becomes a dynamic tool that enables users to improve decision-making and drive innovation.