Databricks Sets New World Record for CloudSort Benchmark Using Apache Spark at $1.44 Per Terabyte

Databricks®, the company founded by the team that created the popular Apache® Spark™ project, announced that, in collaboration with industry partners, it has broken the world record in the CloudSort Benchmark, a third-party industry benchmarking competition for processing large datasets.

Working in close collaboration with Nanjing University and Alibaba Group as team NADSort, Databricks used Apache Spark to architect an efficient cloud platform for data processing. The platform sorted 100 terabytes (TB) of data using a total of USD $144 worth of cloud computing resources, or $1.44 per TB, in both the Daytona and Indy CloudSort categories. This beat the previous record of $4.51 per TB, held by the University of California, San Diego, for a savings of 68 percent.

The objective of a CloudSort Benchmark entry is to achieve the lowest public cloud cost per terabyte sorted, reducing the total cost of ownership of the cloud architecture (a combination of software stack, hardware stack, and tuning) and encouraging organizations to adopt and deploy big data applications on the public cloud. In 2014, Databricks set the record for the Gray Sort Benchmark, sorting 100TB of data (1 trillion records) in 23 minutes, which was 30 times more efficient per node than the previous record held by Apache Hadoop. The sorting program, based on Databricks' 2014 record-setting code and updated for better efficiency in the cloud, ran on 394 ECS.n1.large nodes on Alibaba Cloud, each equipped with an Intel Haswell E5-2680 v3 processor, 8 GB of memory, and 4×135 GB of SSD Cloud Disk.
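For readers unfamiliar with how such a sort is expressed in Spark, the minimal sketch below shows the core primitive the benchmark exercises, a globally ordered sortByKey over key/value records. It is not the NADSort benchmark code; the input/output paths, key parsing, and partition count are illustrative assumptions.

```scala
// A minimal sketch of a Spark sort job, NOT the NADSort benchmark code.
// Input/output paths and the partition count are hypothetical.
import org.apache.spark.sql.SparkSession

object SortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sort-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Treat each input line as a (key, value) record split on the first tab.
    val records = sc.textFile("hdfs:///input/records")        // hypothetical path
      .map { line =>
        val i = line.indexOf('\t')
        (line.substring(0, i), line.substring(i + 1))
      }

    // Range-partitioned, globally ordered sort across the cluster;
    // 1000 output partitions is an arbitrary illustrative choice.
    records
      .sortByKey(ascending = true, numPartitions = 1000)
      .map { case (k, v) => s"$k\t$v" }
      .saveAsTextFile("hdfs:///output/sorted")                // hypothetical path

    spark.stop()
  }
}
```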

“Databricks reduced the per-terabyte cost from $4.51, the previous world record set by the University of California, San Diego in 2014, to $1.44, meaning our optimizations and advances in cloud computing have roughly tripled the efficiency of data processing in the cloud,” said Databricks Chief Architect and leader of the CloudSort Benchmark project, Reynold Xin. “With these innovations, processing the same amount of data in the cloud in 2016 costs one third of what it did in 2014!”

Catalysts for Cost Efficiency Improvements

Three important factors made this CloudSort cost efficiency possible, according to Reynold Xin in his blog:

  1. Cost-effectiveness of cloud computing: Increased competition among major cloud providers has lowered the cost of resources, making deploying applications in the cloud economically feasible and scalable;
  2. Efficiency of software: Continued innovations in Apache Spark, such as Project Tungsten, Catalyst, and whole-stage code generation (see the sketch after this list), have enormously improved all aspects of the Spark stack;
  3. Optimization of Spark and cloud-native architecture: Combining in-house Spark expertise with deep experience operating and tuning cloud-native data architectures across tens of thousands of customer clusters yielded incremental efficiency gains, producing the most efficient cloud architecture for data processing.
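To make the second factor concrete: whole-stage code generation is applied automatically by Spark's Catalyst optimizer, and its effect is visible to users in the physical plan. The toy example below is an illustrative assumption, not taken from the benchmark; it simply prints a plan in which operators fused by whole-stage code generation are marked with an asterisk.

```scala
// Illustrative only: shows how whole-stage code generation appears in a physical plan.
import org.apache.spark.sql.SparkSession

object CodegenDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("codegen-demo")
      .master("local[*]")            // local run for illustration
      .getOrCreate()
    import spark.implicits._

    // A trivial pipeline: project and filter over a generated range of ids.
    val df = spark.range(0, 1000000)
      .withColumn("doubled", $"id" * 2)
      .filter($"doubled" % 3 === 0)

    // Operators fused by whole-stage code generation are prefixed with "*"
    // (e.g. "*(1) Filter ...") in the printed physical plan.
    df.explain()

    println(df.count())
    spark.stop()
  }
}
```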

“The achievements of two world records in two years leave us humbled, yet they validate the technology trends we’ve invested in heavily,” said Databricks CEO Ali Ghodsi. “First, we believe open source software is the future of software evolution, and Apache Spark is the most efficient engine for data processing. And second, cloud computing is becoming the most cost-efficient, effective, and scalable architecture for deploying big data applications.”
