Data lakes are enterprise-wide data management platforms designed for storing and analyzing vast amounts of information from disparate sources in their native format. The idea is to place data into a data lake in their native structure rather than in a repository built for a specific purpose, such as a data warehouse or data mart. This elastic data store eliminates the upfront extract-transform-load (ETL) process; once the data reside in the lake, they become accessible to users across the organization for analysis and reporting. The primary driver for data lakes is to increase accessibility and agility for analytics.
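As a minimal sketch of this "load first, model later" idea, the snippet below lands raw files in a lake in their native format, deferring any parsing or schema enforcement to read time. The lake path and the `land_raw_file` helper are hypothetical illustrations, not a standard API.

```python
# A minimal sketch of landing data in a lake with no upfront ETL.
# LAKE_ROOT and land_raw_file are hypothetical names for illustration.
import shutil
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("/data/lake/raw")  # assumed landing zone for raw data

def land_raw_file(source: Path, dataset: str) -> Path:
    """Copy a source file into the lake as-is, partitioned by arrival date.

    No parsing, schema enforcement, or transformation happens here;
    structure is applied later, at read time (schema-on-read).
    """
    target_dir = LAKE_ROOT / dataset / f"ingest_date={date.today().isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target

# Usage: land a CSV export and a JSON clickstream dump unchanged.
# land_raw_file(Path("orders_2024.csv"), dataset="orders")
# land_raw_file(Path("clicks.json"), dataset="clickstream")
```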
The concept of a data lake is closely tied to Apache Hadoop, the distributed computing architecture, and its growing ecosystem of open source projects. Discussions of the data lake quickly turn to how to build one using the power of Hadoop, which is popular here because it can store a variety of data formats in a cost-effective and technologically feasible way. Organizations are discovering the data lake as an evolution of their existing data architectures, most notably enterprise data warehouses and data silos. Also of note is Apache Spark, which is coming on strong in the data lake realm because it is agnostic to both storage and deployment.
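To illustrate the schema-on-read style that Hadoop and Spark enable, here is a small PySpark sketch that queries raw JSON directly from the lake with no prior ETL step. The HDFS path and the `event_time` field are assumptions made for the example, not details from a specific deployment.

```python
# A sketch of schema-on-read with Apache Spark: raw JSON is queried
# directly from the lake, and the schema is inferred at read time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Read raw events exactly as they were landed (path is hypothetical);
# spark.read.json infers a schema from the files themselves.
events = spark.read.json("hdfs:///data/lake/raw/clickstream/")

# Expose the raw data to analysts through SQL, with no data movement.
# Assumes the JSON records carry an event_time field.
events.createOrReplaceTempView("clickstream")
daily_counts = spark.sql("""
    SELECT to_date(event_time) AS day, count(*) AS events
    FROM clickstream
    GROUP BY to_date(event_time)
    ORDER BY day
""")
daily_counts.show()
```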
Business users require immediate access to data, with no silos and no movement of data, yet today's approaches typically require a middleman, usually IT, to accomplish this, said Steve Wooledge, Vice President of Product Marketing at MapR Technologies. “In order to achieve data agility within the data lake and make self-service data exploration possible for the business user, the tools being delivered today must reduce the distance to data for the business user. We’re seeing data agility coupled with innovations that provide secure, long-term storage so that pattern recognition can be performed across large volumes and long time frames of data. With these protections in place, the data lake becomes a mission-critical system of record. This market is also about time-to-value. Reducing time-to-value from the weeks or months of traditional data approaches down to minutes is where the business user will see the most value.”
A report by CITO Research identified four important stages in implementing a data lake:
- Getting the plumbing in place and learning to acquire and transform data at scale.
- Improving the ability to transform and analyze data.
- Providing broad operational impact by getting data and analytics into the hands of as many people as possible.
- Adding enterprise capabilities to the data lake.
The rise of the data lake is a well-conceived response to the need to embrace the three Vs of big data: volume, velocity, and variety. The next step is demonstrating the means for facilitating the Industrial Internet; to learn more about this area of technological evolution, stay tuned for Part 2 of this article.
Daniel D. Gutierrez, Managing Editor – insideAI News