The data lake concept generates a lot of excitement and discussion at organizations seeking to get more value from big data. The architectural approach uses the Hadoop ecosystem to capture and store large volumes of data quickly and inexpensively. This environment becomes a large “lake” of raw data assets where data scientists can fish for new correlations and insights. The promise has been compelling: a recent TechTarget/Eckerson Group survey found that half of companies with big data programs have implemented or are implementing a data lake.
But disillusionment has already set in. “Data swamp” jokes abound. Others report “dry lakes”: empty environments that are technically cool but too difficult to load and use. Analyst firm Gartner has published notes with positive yet cautionary advice for would-be lake swimmers. Data lakes can be as valuable as promised if they are implemented correctly as part of an overall analytics strategy.
Here are five key considerations before beginning a data lake project:
What is the specific use case? “Data lake” is a catchall term for an environment that different organizations use in different ways. One company plans to use its data lake for sensor data from consumer devices, accessed by R&D professionals using visualization tools. Another uses the data lake for staging and batch processing of external data sets. There are countless variations, with important nuances, and each use case introduces different requirements. Before starting a data lake project, catalog the planned use cases and get specific. What types of data will you store now, and in the future? Where will the data come from? How much data will you store, and how often will you load new data? Which tools, analytic functions or processes will the lake need to support? Also consider how the organization will measure success and how you will operationalize discoveries from the lake into the business.
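To make this concrete, here is a minimal sketch of what one entry in such a use-case catalog might look like; the fields and example values are illustrative assumptions, not a standard schema:

```python
# Hypothetical use-case catalog entry; field names and values are illustrative.
from dataclasses import dataclass

@dataclass
class DataLakeUseCase:
    name: str
    data_types: list[str]      # e.g., sensor telemetry, clickstream, partner feeds
    sources: list[str]         # where the data will come from
    est_volume_tb: float       # how much data you expect to store
    load_frequency: str        # how often new data will be loaded
    tools: list[str]           # tools, analytic functions or processes to support
    success_metric: str        # how the organization will measure success

r_and_d_case = DataLakeUseCase(
    name="Consumer device sensor data for R&D",
    data_types=["sensor telemetry"],
    sources=["streaming ingest from the device fleet"],
    est_volume_tb=50.0,
    load_frequency="hourly",
    tools=["visualization tools", "R"],
    success_metric="time from raw data to a usable R&D dashboard",
)
```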
Who will access the data lake? Carefully evaluate who will use the data lake, now and in the future. Where are the users located? How will they access the lake: through SQL? R? Visualization tools? ETL tools? What are their skill levels? User adoption plays a huge role in the success of a data lake, so ease of use and features must fit the target user groups. The answers can have a big impact on how an organization deploys a data lake (on premises vs. in the cloud) and which technology options it chooses, such as adding a SQL interface or using a commercial solution.
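A SQL interface is one of the most common ways to open a lake to a broader audience. As a minimal sketch, and assuming a Spark-based lake with raw Parquet files at a hypothetical path, exposing lake data to SQL users might look like this:

```python
# Sketch: exposing raw lake files to SQL users via Spark SQL.
# The S3 path and view name are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql-access").getOrCreate()

# Register raw Parquet files in the lake as a queryable view.
events = spark.read.parquet("s3a://example-lake/raw/events/")
events.createOrReplaceTempView("events")

# Analysts who already know SQL can query the lake without learning new tools.
daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
daily_counts.show()
```

Commercial SQL-on-Hadoop engines and cloud query services play a similar role; the point is that the interface should match the skills of the people you expect to use the lake.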
What are the relevant compliance and regulatory requirements? Some data must stay inside the firewall, while other data types can be stored outside it with real advantages. This is why many organizations have embraced hybrid data architectures, which include both cloud and on-premises systems. Data pipelines often use cloud data lakes to stage and pre-process data, then export smaller subsets to production systems behind the firewall. Some organizations put data marts in the cloud with approved subsets of data for customers or partners, which keeps access easy and compliant.
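That staging pattern can be sketched in a few lines. Assuming a Spark-based cloud lake, hypothetical paths, and a simplified notion of what counts as an “approved” subset, the stage-then-export step might look like this:

```python
# Sketch: pre-process raw external data in a cloud lake, then export only an
# approved subset toward production systems behind the firewall.
# Paths, column names and the approval rules are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("staging-export").getOrCreate()

# Stage: read raw partner data landed in the cloud lake.
raw = spark.read.json("s3a://example-lake/staging/partner_feed/")

# Pre-process: keep only approved, non-sensitive columns and recent records.
approved = (
    raw.select("customer_region", "product_id", "order_total", "order_date")
       .filter(F.col("order_date") >= "2017-01-01")
)

# Export: write a compact subset for transfer to on-premises systems or to a
# cloud data mart shared with customers or partners.
approved.write.mode("overwrite").parquet("s3a://example-lake/export/approved_orders/")
```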
How will we architect and manage the data lake? Different use cases drive different approaches to building data lakes. Will it be a hub-and-spoke architecture or part of a pipeline? How will storage and compute be configured to serve the needs of the users? And like any other data environment, data lakes require management and governance. From regular tasks like data loading and ingestion of new sources to ongoing upgrades, patching and management, lifeguarding the data lake is more than a summer job. Technology choices made up front have a significant, long-term impact: pure open-source software may save on license costs, but it typically requires more resources for deployment and administration.
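One small example of that ongoing management work is making ingestion itself governed. The sketch below, which assumes a Spark-based lake and hypothetical paths, tags each new record with its source and load time so the lake stays navigable rather than turning into a swamp:

```python
# Sketch: a routine ingest job that attaches basic governance metadata
# (source system and load timestamp). Paths and names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("governed-ingest").getOrCreate()

# Load a newly onboarded source from the landing zone.
new_source = spark.read.csv(
    "s3a://example-lake/landing/new_source/", header=True, inferSchema=True
)

# Record where each row came from and when it was loaded.
governed = (
    new_source
    .withColumn("_source_system", F.lit("new_source"))
    .withColumn("_ingested_at", F.current_timestamp())
)

governed.write.mode("append").parquet("s3a://example-lake/curated/new_source/")
```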
What level of agility can we deliver? Data lakes often come up as a solution for a specific pain point the business is feeling right now. Delivering a solution five or six months later isn’t very “agile.” That’s how long it took one financial services company, which underestimated the time needed to configure open-source software on commodity hardware for a new data lake. The cloud can help deliver agility through vendor options that automate and speed up data lake deployment, from Hadoop-as-a-Service to broader Big-Data-as-a-Service offerings for enterprises whose needs go beyond Hadoop. There is incredible value in deploying a data lake quickly and seeing how the organization really uses it.
Of all these considerations, agility comes up over and over again in my meetings. Many organizations are still learning how to make use of new big data sources and democratize data access while maintaining governance and security. There’s no one-size-fits-all answer, and there’s no substitute for the experience gained through agile deployment and iteration.
Contributed by industry veteran Prat Moghe of Cazena, who discusses key considerations for data lake projects and shares tips on where the cloud can help with agility. Prat is the founder and CEO of Cazena and a successful big data entrepreneur with more than 18 years of experience inventing next-generation technology products and building strong teams in the technology sector. As senior vice president of strategy, products and marketing at IBM Netezza, he led a worldwide 400-person team that launched the latest Netezza data warehouse appliance, which became a market leader in price and performance, as well as IBM’s first big data appliance. Following Netezza’s sale to IBM for $1.7 billion in 2010, Prat drove the company’s growth strategy and was the force behind its thought leadership in appliances and analytics.