TPC-DS 2.0: The First Industry Standard Benchmark for SQL-based Big Data Systems

Today the TPC is announcing the first industry standard benchmark for measuring the performance of SQL-based Big Data systems, TPC-DS 2.0. Building upon the well-studied TPC-DS benchmark, Version 2.0 was specifically designed for SQL-based Big Data while retaining all key characteristics of a decision support benchmark. Over the past two years, the Hadoop community has adopted portions of the original TPC-DS Version 1.0 workload for performance characterization. The richness and broad applicability of the schema, the ability to generate 100TB of realistic data on clustered systems, and the very large number of complex queries made TPC-DS 1.0 the top candidate for showcasing the performance of SQL-based Big Data solutions.

However, no Big Data vendor ran the entire TPC-DS 1.0 benchmark to completion, most likely because they could not fulfill all necessary requirements or, perhaps, because the performance of their Big Data solution was sub-optimal. Consequently, they cherry-picked portions of the benchmark, e.g. a subset of the schema and queries, and executed the benchmark in their own way, reporting a metric that put their system into the spotlight.

It was exactly this kind of “benchmarketing” that triggered Omri Serlin to found the TPC 25 years ago. Instead of questioning the credibility of these claims and fining the companies for violating the TPC’s fair use policies, the TPC invited companies developing SQL-based Big Data systems to join the TPC and to help design TPC-DS 2.0. The following paragraphs showcase the differences between versions 1.0 and 2.0. For a detailed description of the benchmark please visit http://www.tpc.org/tpcds/default.asp.

  • TPC-DS 2.0 increases the minimum raw database size to 1TB and allows benchmark publications of up to 100TB. Since Big Data solutions aim at operating on even larger data sets, the TPC is evaluating data set sizes of up to 1PB.
  • The update statements on dimension tables were eliminated from the benchmark specification because of limitations in current Big Data implementations. The TPC believes that the current insert and delete operations on fact tables are sensible and adequate given the state of Big Data systems.
  • Instead of requiring compliance with Atomicity, Consistency, Isolation and Durability (ACID), TPC-DS 2.0 requires the system under test to continue executing queries and data maintenance functions with full data access during and after a permanent irrecoverable failure of any single durable medium.
  • The metric in TPC-DS 2.0 has been changed from being an arithmetic mean of load, single user, multi user and data maintenance to a geometric mean of the same components. This addresses concerns that the original metric could, for some implementations, be dominated by the data maintenance time.
  • TPC-DS 1.0 required that all defined constraints (primary key, foreign key, and “not null”) also be enforced. In TPC-DS 2.0, both enforced and non-enforced constraints are allowed, so that query compilers can understand basic data relationships to generate reasonable query plans.
  • TPC-DS 2.0 separates the querying of data from data maintenance. Since the overlapping of queries and data maintenance requires ACID, TPC-DS 2.0 reverted to a simpler model in which queries and data maintenance are strictly distinct.
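
To illustrate why the switch from an arithmetic to a geometric mean matters, here is a minimal Python sketch. It is not the official TPC-DS metric formula; the component names and timings are hypothetical values chosen only to show how a single slow component, such as data maintenance, affects each kind of mean.

```python
import math


def arithmetic_mean(values):
    """Plain average: one very large value pulls the result up strongly."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logarithms for numeric stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical elapsed times in seconds (illustrative only, not measured results).
baseline = {
    "load": 100.0,
    "single_user": 100.0,
    "multi_user": 100.0,
    "data_maintenance": 100.0,
}
# Same system, but data maintenance is assumed to be 100x slower.
slow_dm = dict(baseline, data_maintenance=10_000.0)

for label, run in (("baseline", baseline), ("slow data maintenance", slow_dm)):
    times = list(run.values())
    print(
        f"{label}: arithmetic={arithmetic_mean(times):.1f}s, "
        f"geometric={geometric_mean(times):.1f}s"
    )
```

With these made-up numbers, a 100x slowdown in one component raises the arithmetic mean from 100 to 2575 (about 26x), while the geometric mean rises from 100 to roughly 316 (about 3.2x). The geometric mean weights relative changes in each component equally, so no single component dominates the result.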

TPC-DS 2.0 has now become the TPC’s second Big Data benchmark. In August 2014, the TPC introduced TPCx-HS – the industry’s first standard for benchmarking Big Data systems, designed to assess a broad range of system topologies and implementation methodologies. TPCx-HS was also the TPC’s first “Express” class benchmark, publicly available via downloadable kit.

Looking ahead, we are already working on a third Big Data benchmark, an “Express” class benchmark called TPCx-BB, which is open for public review. The TPC encourages interested parties to provide their reviews. Beyond that, the TPC encourages anyone interested in our efforts in Big Data or other areas to visit our membership page, or to submit original, unpublished papers for our upcoming Technology Conference (TPCTC).

Contributed by: Meikel Poess and Raghunath Nambiar. Meikel Poess is chairman of the TPC-DS 2.0 committee and principal developer at Oracle Corporation. He is a software developer with 14 years of experience in performance tuning in all phases of software development and sizing of database systems. Raghunath Nambiar is the chairman of the TPCTC, a distinguished engineer at Cisco, and chief architect of big data and analytics solution engineering. His current focus areas include emerging technologies, data center solutions, and big data and analytics strategy.
