Businesses are discovering the huge potential of big data analytics across all dimensions of the business, from defining corporate strategy to managing customer relationships, and from improving operations to gaining a competitive edge. The open source Apache Hadoop project, a software framework that enables high-performance analytics on unstructured data sets, is the centerpiece of big data solutions. Hadoop is designed to process data-intensive computational tasks in parallel and at a scale that previously was possible only in high-performance computing (HPC) environments.
The Hadoop ecosystem consists of many open source projects. One of its central components is the Hadoop Distributed File System (HDFS), a distributed file system designed to run on commodity hardware. Other related projects facilitate workflow and job coordination, support data movement between Hadoop and other systems, and implement scalable machine learning and data mining algorithms. However, HDFS lacks the enterprise-class functions necessary for reliability, data management and data governance.
IBM General Parallel File System (GPFS) is a POSIX-compliant file system that offers an enterprise-class alternative to HDFS. GPFS has been used in many of the world's fastest supercomputers, including IBM Blue Gene®, IBM Watson™ (the supercomputer featured on Jeopardy!) and the Mira system at Argonne National Laboratory. Beyond supercomputing, GPFS is found in thousands of mission-critical commercial installations worldwide, from biomedical research to financial analytics.
In 2009, GPFS was extended to work seamlessly in the Hadoop ecosystem; this support is available through a feature called GPFS File Placement Optimizer (GPFS-FPO). Storing your Hadoop data with GPFS-FPO gives you advanced data management functions along with the high I/O performance that many big data operations require. GPFS-FPO provides Hadoop compatibility extensions that replace HDFS in a Hadoop ecosystem, with no changes required to Hadoop applications.
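One reason no application changes are needed is that Hadoop applications program against the abstract org.apache.hadoop.fs.FileSystem API rather than against HDFS directly; the concrete file system implementation is selected by cluster configuration, such as the default file system URI in core-site.xml. The following sketch is illustrative only: it uses the standard Hadoop FileSystem API and a hypothetical file path, and the GPFS connector configuration itself is deployment-specific and not shown.

// Illustrative sketch: a client written against the abstract Hadoop
// FileSystem API. The backing storage (HDFS, GPFS-FPO via its connector,
// or the local file system) is chosen by cluster configuration, so this
// code runs unchanged regardless of which file system is deployed.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemAgnosticExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from core-site.xml; the application itself
        // never names HDFS or GPFS explicitly.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for this example.
        Path path = new Path("/bigdata/demo/sample.txt");

        // Write a small file through the abstract API.
        try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
            out.write("hello from a file-system-agnostic client\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read it back the same way.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}

In a GPFS-FPO deployment, only the cluster configuration points at the GPFS storage layer; application source code such as the example above is untouched.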
The best practices presented in this paper show how to deploy GPFS-FPO as a file system platform for big data analytics. The paper covers a variety of Hadoop deployment architectures that can work with GPFS, including InfoSphere BigInsights, Platform Symphony, open source Apache Hadoop deployed directly (also referred to as Do-It-Yourself), and Hadoop distributions from other vendors. Architecture-specific notes are included where applicable.