Hadoop 3.0 Perspectives by Hortonworks’ Hadoop YARN & MapReduce Development Lead, Vinod Kumar Vavilapalli

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, recently announced Apache® Hadoop® v3.0.0, the latest version of the Open Source software framework for reliable, scalable, distributed computing. Over the past decade, Apache Hadoop has become ubiquitous within the greater Big Data ecosystem by enabling firms to run and manage data applications on large hardware clusters in a distributed computing environment.

In the brief Q&A below, Hortonworks’ Hadoop YARN & MapReduce Development Lead, Vinod Kumar Vavilapalli, offers his perspectives on the new release.

“It’s great to see Apache Hadoop 3.0 finally come to market. The community of committers has been working on this for a long time,” said Vinod Kumar Vavilapalli, committer & PMC member on the Apache Hadoop project and director of engineering at Hortonworks. “Many of us in the community realize we can’t operate with updates spread far apart, and we are excited that 2018 will bring more capabilities to users at an increased rate to better meet the demands the big data community is facing.”

Question: What are the key elements that distinguish this release from the 2.x releases?

Vinod:

One key element centers on storage.

Erasure-Coding

Apache Hadoop 3.0.0 brings erasure coding as an optional storage mechanism, in addition to the classic replication-based system. It is primarily a storage-efficiency feature that helps organizations avoid paying for large amounts of capacity to hold rarely accessed ‘colder’ data. Instead of replicating that data three times as usual, HDFS moves it to an erasure-coded format, which saves space but trades off slower recovery when faults occur in the form of lost disks, nodes, or racks.
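
As a concrete illustration, the minimal sketch below marks a directory of cold data as erasure-coded through the HDFS Java API. It assumes a reachable Hadoop 3.x cluster, a hypothetical /data/cold directory, and that the built-in RS-6-3-1024k policy (six data blocks plus three parity blocks) is enabled on the cluster.

```java
// Minimal sketch: switch a 'cold' directory from 3x replication to erasure coding.
// Assumptions: a Hadoop 3.x client configuration on the classpath, a reachable
// HDFS cluster, and a hypothetical /data/cold directory; the RS-6-3-1024k policy
// must be enabled on the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ColdDataErasureCoding {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      Path coldDir = new Path("/data/cold");

      // New files under this directory are stored as Reed-Solomon (6 data + 3
      // parity) stripes: roughly 1.5x raw storage instead of 3x with replication.
      dfs.setErasureCodingPolicy(coldDir, "RS-6-3-1024k");

      System.out.println("Effective policy: " + dfs.getErasureCodingPolicy(coldDir));
    }
  }
}
```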

Interop with cloud

Cloud is one of the major industry movements going on right now. Hadoop 3.0.0 also brings a host of improvements for cloud storage systems such as Amazon S3, Microsoft Azure Data Lake, and Aliyun Object Storage Service. Many of these improvements focus on performance, so big-data workloads run against cloud storage systems can be much faster than they are today.
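
These connectors sit behind the same Hadoop FileSystem API used for HDFS, so applications can point at cloud storage without code changes. The sketch below lists objects in a hypothetical S3 bucket through the S3A connector; the bucket name and path are assumptions, and the hadoop-aws module plus AWS credentials must be available.

```java
// Minimal sketch: use the ordinary Hadoop FileSystem API against the S3A connector.
// Assumptions: the hadoop-aws module is on the classpath, credentials are supplied
// via fs.s3a.* settings or the default provider chain, and "my-bucket" is hypothetical.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    for (FileStatus status : s3.listStatus(new Path("s3a://my-bucket/logs/"))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}
```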

Another key element focuses on compute.

Extensible resource-types

Hadoop 3.0.0 extends YARN, the compute-platform piece, with an extensible framework for managing additional resource types beyond the memory and CPU that YARN supports today.

One use case for this extensible framework is bringing machine learning and deep learning workloads to your Hadoop cluster by pooling GPU and FPGA resources and elastically sharing them between different business units and users from different parts of the organization. The extensible framework itself is implemented as part of 3.0.0, but the underlying support for GPUs and FPGAs is going to come in one of the future 3.x releases.
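
To make the idea concrete, the sketch below shows how an application master might ask for containers carrying an extra resource type alongside memory and CPU. It is only an illustration: the resource name yarn.io/gpu is assumed to be registered in the cluster’s resource-types configuration, and, as noted above, the actual GPU plumbing arrives in later 3.x releases.

```java
// Minimal sketch: request a container with an additional resource type.
// Assumptions: a running application master holding an AMRMClient, and a cluster
// whose resource-types configuration registers "yarn.io/gpu" (GPU support itself
// lands in later 3.x releases).
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class GpuContainerRequest {
  public static void requestGpuContainer(AMRMClient<ContainerRequest> amRmClient) {
    // Standard ask: 4 GB of memory and 2 vcores.
    Resource capability = Resource.newInstance(4096, 2);
    // Extra resource beyond memory/CPU; the name must match a registered type.
    capability.setResourceValue("yarn.io/gpu", 2);

    ContainerRequest request =
        new ContainerRequest(capability, null, null, Priority.newInstance(1));
    amRmClient.addContainerRequest(request);
  }
}
```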

YARN Federation

In 3.0.0, Hadoop now supports federation of YARN clusters. Until now, users set up independent YARN clusters and ran workloads on them, with each cluster completely oblivious to the others. YARN federation adds an overarching layer on top of multiple clusters to solve one primary use case: scale. Federation enables YARN to scale to hundreds of thousands of nodes, far beyond the original design goal of 10,000 machines.
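
For a rough sense of what enabling this looks like, the sketch below sets two core federation properties on a member ResourceManager’s configuration. It is only illustrative: in practice these settings live in yarn-site.xml, and a federation state store and Router must also be configured, which is omitted here.

```java
// Minimal sketch: federation-related settings for one member (sub-cluster) RM.
// Assumptions: these properties normally live in yarn-site.xml; the federation
// state store and Router configuration are also required but omitted here.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederationConfigSketch {
  public static YarnConfiguration subClusterConf(String subClusterId) {
    YarnConfiguration conf = new YarnConfiguration();
    // Turn on YARN federation for this sub-cluster's ResourceManager.
    conf.setBoolean("yarn.federation.enabled", true);
    // Each member cluster is identified by a unique cluster id.
    conf.set("yarn.resourcemanager.cluster-id", subClusterId);
    return conf;
  }
}
```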

Question: How do the updates fit into the larger picture of distributed data processing and machine learning today?

Vinod: Apache Hadoop is the de facto distributed data-processing system. A few of the major industry trends today are next-level scale and proliferation of big data, cloud, and machine learning / deep learning. The key features of Hadoop 3.0.0 cater to several of these trends.

As organizations pile on more and more data, storage efficiency through erasure coding helps them put more data into their clusters without a proportional increase in hardware investment. YARN federation makes sure organizations can trust the Hadoop architecture to scale to whatever size they eventually grow to. The cloud storage enhancements help organizations run their workloads on-premises and in the cloud without a major loss of performance. GPU/FPGA enablement on YARN means organizations don’t have to set up separate hardware and resort to static cluster management for machine learning / deep learning workloads. These workloads need access to the data in the data lake, so being able to tap into the same cluster resources for compute is a huge win.

Question: Certainly, we’ve seen a lot of changes in the original “classic” definition of Hadoop. Some people are using the term “post-Hadoop” to describe the present era – How do you respond?

Vinod: If you have followed the history of Hadoop, you know the community is constantly reinventing itself.

We started as a single-purpose batch system and, with YARN, grew into a more general-purpose big-data processing and resource-management system. That way we could start supporting batch workloads alongside ad-hoc, stream-processing, and online data-serving workloads. With Hadoop 3, we are now also moving into larger scale, better storage efficiency, deep learning / AI workloads, and interoperability with the cloud. In the immediate updates to Hadoop, there are further roadmap items such as GA support for containerized workloads running on the same cluster, as well as potential APIs for object storage.

Given all this, we will definitely move to a place where Hadoop becomes more powerful, enabling newer and newer use cases as they spring up, but at the same time also becomes easy and boring, as it should be. Irrespective of that, this constant reinvention tells me that Hadoop is always going to be relevant, there in the background of all the critical infrastructure that powers an increasingly data-driven world.

 
