In this special guest feature, Ben Werther, Founder and Executive Chairman of Platfora, shares his thoughts about how many view the data lake as a big data Holy Grail and what the future may hold for this segment of the big data technology stack. Ben Werther launched Platfora to transform the way businesses use big data analytics. Under Ben’s leadership, Platfora is now one of the hottest companies in Silicon Valley and a leader in the big data analytics space. Before founding Platfora, Ben was vice president of products for DataStax, where he shaped the company’s enterprise and Hadoop strategy, and was also head of product at big data analytics company Greenplum. Ben has a B.S. in Computer Science from Monash University and an M.S. in Computer Science from Stanford.
There’s a common misconception that building a “data lake” is just industry jargon for loading your data into Hadoop. But these two ideas — the “data lake” as the flexible store and mixing bowl for disparate data of all kinds, and Hadoop as the technology for processing that data — aren’t as intertwined as you may think. This has some very interesting implications for where things are headed.
It is true that the emergence of Hadoop started this transformation. The inspiration for this open-source technology was two landmark papers out of Google. One, describing GFS (the Google File System), laid out a massively scalable and efficient way to store files, and it inspired HDFS, the typical storage layer for today’s data lake. The second, about MapReduce, described a way to do work across those files in parallel to process or analyze data at any scale. Hadoop emulated this and made MapReduce its primary interface to all of the data in HDFS.
As time has marched on, the value of this file system (HDFS) has become clearer and clearer. Unlike traditional relational databases, where data must be loaded in a regimented way with the end use in mind, this kind of storage lets companies bring together data of all formats and sizes in an agile fashion. Whether it is transactional records, web clickstream data, IoT events, security logs, or anything else, it can be written into HDFS as-is and becomes part of the pool of data available to be combined and analyzed at any time, now or later. HDFS becomes the universal common ground for whatever follows.
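To make that concrete, here is a minimal PySpark sketch of what working against such a store can look like: raw files of different shapes are read straight out of HDFS and combined on demand, with structure applied only at read time. The paths and column names (customer_id, campaign) are hypothetical and purely illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

    # Raw files land in HDFS in whatever shape they arrive; schema is applied at read time.
    clicks = spark.read.json("hdfs:///lake/raw/clickstream/")                      # web clickstream events
    orders = spark.read.option("header", "true").csv("hdfs:///lake/raw/orders/")   # transactional records

    # Disparate sources are combined on demand, long after they were written.
    clicks.join(orders, on="customer_id").groupBy("campaign").count().show()

Nothing about these files had to be decided up front; the join happens whenever someone has a question to ask.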
At the same time, we’ve seen MapReduce age and increasingly be surpassed by other technologies. This happened inside Google (with the emergence of internal technologies such as FlumeJava and MillWheel), and likewise in the Hadoop ecosystem with the emergence of Spark and a flurry of other technologies for processing data directly in HDFS. This is a good thing, and just as it should be: modern Hadoop is built around a resource management layer called YARN that explicitly makes it possible to plug new engines and processing frameworks into a Hadoop cluster.
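As a rough illustration of the shift, the classic word count, which once required a hand-written Java MapReduce job, collapses to a few lines of Spark submitted to the same YARN-managed cluster. This is only a sketch; the input and output paths are hypothetical.

    from operator import add
    from pyspark.sql import SparkSession

    # Running on YARN lets Spark share the cluster with MapReduce and any other pluggable engine.
    spark = SparkSession.builder.master("yarn").appName("wordcount-sketch").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///lake/raw/logs/")   # hypothetical log directory
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    counts.saveAsTextFile("hdfs:///lake/derived/wordcounts")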
This is exciting stuff — by putting data in HDFS (aka the data lake), you are bringing together all relevant and disparate data sets from across your organization into one mixing bowl. And there is no lock-in to the technologies that can be used against that data — the tools and technologies will keep getting better.
But it still isn’t as easy as it should be. Setting up a data lake on your own server infrastructure involves real complexity, and up to this point it has appealed only to companies with the resources and technical capability to pull it off.
Newer delivery options target a broader audience
The cloud is the new norm for wide swaths of what was IT’s purview only a decade ago. From SaaS apps like Salesforce.com and Workday, to PaaS platforms like Google’s App Engine and Salesforce’s Heroku, to IaaS infrastructure like Amazon’s EC2 server platform, the cloud is here. Traditional BI in the cloud has come a little slower, since most company data assets remain on-premises, but it is growing as more data is generated by cloud-based SaaS apps. But traditional on-premises data assets themselves (centralized databases, as well as shared infrastructure such as Hadoop data lakes) have been held back from the cloud by IT due to concerns about security, privacy, governance, and SLA requirements.
We’ve seen the tone of this discussion change over the last six months or so. Where before there was skepticism about running data lakes in the cloud, companies have been watching the steady drumbeat of new services and capabilities from the major cloud vendors. Amazon’s recent re:Invent conference drove home to many that these cloud platforms (Amazon, as well as Google and Microsoft) are innovating at surprising speed, and companies that would otherwise be intimidated by building a data lake now have an easier on-ramp to this new world of data.
Interestingly, every one of these platforms has a notion of the data lake, but in some cases with no actual Hadoop code involved. Each implements a massive-scale ‘data lake’ storage engine (Amazon S3, Google Cloud Storage, Microsoft’s Azure Data Lake Store) that supports the same Hadoop-compatible interfaces but shares no actual code with Hadoop. Each supports standard Hadoop processing (and options for running the major Hadoop distros), Spark, and a range of proprietary services that unlock unique capabilities for transformation pipelines, stream processing, machine learning, and more. And new capabilities are being unveiled every month.
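A small sketch shows what “same interfaces” means in practice: the application code stays the same, and only the URI scheme tells the Hadoop-compatible connector which store to talk to. The bucket and account names below are hypothetical, and each connector has to be available on the cluster.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cloud-lake-sketch").getOrCreate()

    # The same read works against any store exposing a Hadoop-compatible file system
    # interface; only the path scheme changes (bucket/account names are made up).
    on_prem = spark.read.json("hdfs:///lake/raw/clickstream/")                               # on-premises HDFS
    on_s3   = spark.read.json("s3a://acme-lake/raw/clickstream/")                            # Amazon S3
    on_gcs  = spark.read.json("gs://acme-lake/raw/clickstream/")                             # Google Cloud Storage
    on_adls = spark.read.json("adl://acmelake.azuredatalakestore.net/raw/clickstream/")      # Azure Data Lake Store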
Today we see about 30% of our customers running Platfora and their data lake on these cloud platforms. But the chorus is increasing, and I’d bet on this being in excess of 50% within 24 months. On-premises deployments will still be significant over the next five years — this kind of shift to the cloud will take time to really take hold for enterprises — but we’ll see pockets that move much faster.
So, my prediction is that we’ll see half of all new enterprise big data projects (and associated data lakes) being implemented in the cloud within 24 months.
Empowering your new data superstars
So, you build your data lake, whether on-premises or in the cloud. Then big data discovery is ready to unfold, right? Not exactly. Even in the cloud, you still need to figure out how to use the data lake to get to the answers and insights you seek, and that involves much more than pointing cloud BI at Hadoop.
Big data discovery is a journey. It’s a more complete approach to big data analysis, and is different from any process that came before it. Hadoop enables that journey, and without question, it transforms the way people work with data. More than that, it fundamentally changes the roles of the people who work with your data.
When your organization embraces the Hadoop data lake (regardless of when or whether you move it to the cloud), you need to find the right people to explore it: people who can apply advanced data analytics, ask the right questions of the data, and help the business start building real-world practices around big data.
You may find that the individuals who can help the business get the most value from Hadoop are actually not data scientists. In fact, more than likely, you will find that a curious business analyst in your organization is poised to become your next data superstar.
Why? Because business analysts are often subject-matter experts and know how to ask business questions—skills that are vital for using the Hadoop mixing bowl to its full advantage.
You will need to identify and empower these “full-stack personas” in your organization. Whether it’s a business analyst with the skills of a data scientist, or a data scientist who can think like a business analyst, these are the people who will understand how to find, combine, structure, and use the data that’s flowing into the Hadoop data lake from many disparate sources.
Adopting a data lake can be challenging, but it’s not as difficult as you might think. There are plenty of resources available to help get you there, and they are only increasing in number and effectiveness. So really, there’s no reason for any business to delay its big data discovery journey, unless it wants to miss the train altogether.