Databricks, the company founded by the creators of the popular open-source Big Data processing engine Apache Spark, announced today that it has broken the world record for the GraySort, a third-party, industry benchmarking competition for sorting large on-disk datasets. Databricks completed the Daytona GraySort, a distributed sort of 100 terabytes (TB) of on-disk data, in 23 minutes using 206 machines with 6,592 cores during this year’s Sort Benchmark competition. That feat beat the previous record, held by Yahoo, of 70 minutes using a large, open-source Hadoop cluster of 2,100 machines. This means that Spark sorted the same data three times faster with one-tenth the number of machines.
Additionally, while no official petabyte sort competition exists, Databricks pushed Spark further to also sort one petabyte (PB) of data on 190 machines in under four hours (234 minutes). This PB time beats previously reported results based on Hadoop MapReduce.
“Spark is well known for its in-memory performance, but Databricks and the open source community have also invested a great deal in optimizing on-disk performance, scalability, and stability,” said Ion Stoica, CEO of Databricks. “Beating this data processing record previously set on a large Hadoop MapReduce cluster not only validates the work we’ve done, but also demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for all data processing needs. This competition asked us to sort a set of data far larger than what most companies will have to compute, meaning the value of Spark can be realized at organizations of any size with quantities of data large and small.”
With less than one-tenth of the computing power used by Yahoo’s cluster of 2,100 machines, Databricks’ 206 machines ran Apache Spark on top of the Hadoop Distributed File System (HDFS) and sorted through 100 TB of on-disk data in just 23 minutes. The cluster was hosted on Amazon EC2, using 206 i2.8xlarge instances with an aggregate of 49 TB of memory, 6,592 cores, and 1.2 PB of solid-state drive space.
“This sort benchmark is particularly challenging because sorting 100 TB of data generates 500 TB of disk I/O and 200 TB of network I/O. At the core of sorting is an operation called shuffle, which moves data across all the machines. Shuffle is also a critical operation underpinning almost all workloads, including running SQL queries on multiple data sources and doing predictive analytics,” said Reynold Xin, who leads this effort at Databricks. “A lot of development has gone into improving Spark since the 1.0 release for such demanding workloads. Some notable examples include a new sort-based shuffle module that was optimized for large workloads, and a new network transport module that was able to sustain 1.1 GB/s/node during shuffle. All these improvements will be available in Spark 1.2.”
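To make the shuffle described above concrete, here is a minimal sketch of a distributed sort in Spark’s RDD API, written in Scala. It is not the code Databricks ran for the benchmark; the input path, partition count, and the 10-byte key split are illustrative assumptions. The explicit `spark.shuffle.manager` setting shows how the sort-based shuffle can be requested on releases where it is not yet the default.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of a distributed sort with Spark's RDD API
// (hypothetical paths and settings, not the benchmark harness).
object SortSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sort-sketch")
      // Sort-based shuffle; it became the default in later releases,
      // but can be requested explicitly here (assumption: the setting
      // is recognized by the Spark version in use).
      .set("spark.shuffle.manager", "sort")
    val sc = new SparkContext(conf)

    // Read records from HDFS (hypothetical path) and split each line
    // into a key and a payload. The 10-byte key mirrors the GraySort
    // record layout but is only an assumption for this sketch.
    val records = sc.textFile("hdfs:///benchmark/input")
      .map(line => (line.take(10), line.drop(10)))

    // sortByKey triggers the shuffle: a range partitioner moves every
    // record to the machine responsible for its key range, then each
    // partition is sorted locally and written back out.
    records.sortByKey(numPartitions = 1000)
      .map { case (k, v) => k + v }
      .saveAsTextFile("hdfs:///benchmark/output")

    sc.stop()
  }
}
```

In this sketch, the `sortByKey` call is where the network I/O Xin describes is incurred: every record crosses the cluster exactly once so that each output partition holds a contiguous, sorted key range.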
In breaking the previous record, Databricks demonstrates that Spark has the power to scale for both in-memory and on-disk data processing, and can do so with one-tenth of the resources and in a fraction of the time once required.