In this special guest feature, Andrew Warfield, Chief Technology Officer and Co-Founder of Coho Data, examines how big data creates unique challenges within enterprise environments and what it really means for enterprises, particularly from a storage perspective. Andy Warfield is an established researcher in computer systems, specializing in storage, virtualization and security. At Coho Data, Andy leads the technology vision and directs the engineering team in building elegant and functional systems that enable customers to focus on the data and applications instead of the underlying infrastructure that drives them.
Big data continues to be a hot topic these days. People are talking not just about the large amounts of data that organizations now have to contend with, but also about the different formats and sources of that data. For many, finding a way to leverage it offers a key competitive advantage, one that helps them stand out in today’s increasingly dynamic marketplace. But that’s only part of the story.
The truth is that big data is nothing new. It’s not even the biggest driver in terms of market opportunity for many companies these days. It does, however, provide insight into where the industry is headed in terms of the extreme data quantities enterprises have to deal with day to day. And that’s where things get interesting.
Big data creates unique challenges within enterprise environments, and that fact is driving the need for a radically different way of doing big data analytics. To better understand these challenges, we have to look at what big data really means for enterprises, particularly from a storage perspective.
Most enterprises today don’t have large-scale big data deployments with thousands of compute nodes, sophisticated analytic applications and teams taking huge advantage of data at scale. What they do have is existing IT infrastructure, which they’d like to use for big data. There’s just one problem—the requirements are different.
That’s where shadow IT enters the picture. It does what the existing infrastructure can’t: handle big data analytics via small-scale clusters of around 8 to 12 nodes. The analytics tools deployed in these clusters often spring up organically, and while CIOs and enterprise IT teams would like them to follow the same norms they enforce for their storage (e.g., the same reliability and availability, and similar performance), that just isn’t happening.
Instead, these big data installations are non-standard, even when built on standard distributions. Developers wanting to try new things often extend the distributions by hand-installing additional tools. This keeps enterprises agile, but it also makes deploying and maintaining a single central cluster for an entire organization difficult.
Since these analytics environments are typically deployed as separate silos alongside traditional enterprise IT, data must be bulk-copied out of enterprise storage into HDFS, where jobs are run, and then copied back. The process is inefficient and costly. The bigger challenge, however, is that data has gravity, which makes it hard to keep moving it between systems. As the industry moves toward the Internet of Things (IoT), that movement will only get more difficult.
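To make that round trip concrete, here is a minimal sketch of the copy-in, process, copy-out cycle described above. It assumes a workstation with the Hadoop and Spark client tools installed; the paths, endpoint, and job jar are hypothetical and stand in for whatever an organization's own silo looks like.

```python
import subprocess

# Hypothetical paths and job artifacts -- illustrative only.
LOCAL_EXPORT = "/mnt/enterprise-nas/exports/sales_2016.csv"   # data exported from enterprise storage
HDFS_INPUT   = "/analytics/input/sales_2016.csv"              # staging location inside the HDFS silo
HDFS_OUTPUT  = "/analytics/output/sales_2016"                 # job output inside the HDFS silo
LOCAL_RESULT = "/mnt/enterprise-nas/results/sales_2016"       # where results are copied back

def run(cmd):
    """Run a shell command and fail loudly if it errors."""
    subprocess.run(cmd, check=True)

# 1. Bulk-copy the data out of enterprise storage into HDFS.
run(["hdfs", "dfs", "-put", "-f", LOCAL_EXPORT, HDFS_INPUT])

# 2. Run the analytics job against the copy inside the cluster.
run(["spark-submit", "--class", "com.example.SalesReport",    # hypothetical job
     "sales-report.jar", HDFS_INPUT, HDFS_OUTPUT])

# 3. Copy the results back out so the rest of the business can use them.
run(["hdfs", "dfs", "-get", HDFS_OUTPUT, LOCAL_RESULT])
```

Every analysis pays the copy cost twice, and that cost grows with the size of the data, which is exactly the gravity problem described above.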
With traditional storage infrastructure, an enterprise wanting to run a database or file services simply goes out and buys storage to meet those requirements. As a result, it ends up with many different storage platforms in the datacenter. But meeting the big data challenges facing enterprises demands a fundamentally new approach to storage.
One significant change taking place on that front is the acknowledgement by big data distributions and enterprise storage vendors that HDFS is more of a protocol than a file system. To date, several traditional storage vendors have announced support for direct HDFS protocol-based access to data, even though that data isn’t stored in HDFS (the file system). And analytics distributions are acknowledging that data may be stored on external HDFS-interfaced storage. All of this means that enterprises can now point analytics data services at the large volumes of incumbent data already stored in those legacy systems, without first migrating it.
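To illustrate the "HDFS as a protocol" idea, here is a minimal sketch using pyarrow's Hadoop filesystem client (it assumes pyarrow plus the libhdfs/Hadoop client libraries are installed). The hostnames and paths are hypothetical; the point is that the client code is identical whether the endpoint is a native HDFS NameNode or an external storage system that simply speaks the HDFS wire protocol.

```python
from pyarrow import fs

def row_count(host: str, path: str, port: int = 8020) -> int:
    """Read a file over the HDFS protocol and count its lines."""
    hdfs = fs.HadoopFileSystem(host=host, port=port)
    with hdfs.open_input_stream(path) as stream:
        return stream.read().decode("utf-8").count("\n")

# Native HDFS cluster (hypothetical endpoint).
print(row_count("namenode.example.com", "/analytics/input/sales_2016.csv"))

# External enterprise storage exposing an HDFS-compatible interface
# (hypothetical endpoint): same client code, no data migration.
print(row_count("scale-out-storage.example.com", "/shares/sales/sales_2016.csv"))
```

Because the analytics stack only sees the protocol, the question of where the bytes actually live becomes an infrastructure decision rather than a constraint on the tools.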
While that’s great news for the enterprise, what organizations really want is the ability to support multiple workloads, both existing IT and big data analytics, on a single scalable platform. That platform has to be flash-tuned and ultra-high-performance, and it has to serve file, block and HDFS natively, with secure multi-tenancy, QoS, and container support. All of it hinges on the ability to treat HDFS as a protocol rather than a file system.
With such a platform, enterprises could set up a private cloud that behaves much like a public one, with its elastic scale and economies. They could request X storage resources for Y type of workload and carve them out easily and cost-effectively. CIOs and others would gain a way to create an infrastructure that makes shadow IT mainstream, while converting data into an intelligent source of value for the company.
It’s an approach that takes full advantage of the emerging technologies reshaping IT, particularly the massive growth around big data. And that foretells a bright new future for big data analytics, one that can withstand the onslaught of things like the IoT and provide the infrastructure enterprises need to meet their business goals while adding to their bottom line.