SparkFuzz: searching correctness regressions in modern query engines

Bogdan Ghit, Nicolas Poggi, Josh Rosen, Reynold Xin, Peter Boncz

2020

With more than 1200 contributors, Apache Spark is one of the most actively developed open-source projects. At this scale and pace of development, mistakes are bound to happen. In this paper we present SparkFuzz, a toolkit we developed at Databricks for uncovering correctness errors in the Spark SQL engine. To guard the system against correctness errors, SparkFuzz takes a fuzzing approach to testing by generating random data and queries. SparkFuzz executes the generated queries on a reference database system such as PostgreSQL, which then serves as a test oracle to verify the results returned by Spark SQL. We explain our approach to data and query generation and analyze the coverage of SparkFuzz. We show that SparkFuzz reaches its current maximum coverage relatively quickly, after generating only a small number of queries.
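
The oracle-based fuzzing loop described above can be illustrated with a minimal sketch. This is not SparkFuzz's implementation: SQLite (through Python's standard `sqlite3` module) stands in for the reference database, a small pure-Python filter stands in for the engine under test, and all function names (`random_rows`, `random_query`, `run_oracle`, `run_system_under_test`, `fuzz`) are hypothetical. The sketch only generates single-predicate filters over integer columns without NULLs, far simpler than the queries a real fuzzer would produce.

```python
import random
import sqlite3
from collections import Counter

# Assumed toy schema: one table `t` with two integer columns.
COLUMNS = ("a", "b")
OPS = {"<": lambda x, y: x < y, "=": lambda x, y: x == y, ">": lambda x, y: x > y}


def random_rows(rng, n):
    """Generate n rows of small random integers (no NULLs, to keep the sketch simple)."""
    return [(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n)]


def random_query(rng):
    """Pick a random single-predicate filter as (column, operator, constant)."""
    return rng.choice(COLUMNS), rng.choice(sorted(OPS)), rng.randint(-5, 5)


def run_oracle(conn, query):
    """Run the query on the reference engine (SQLite in this sketch)."""
    col, op, const = query
    # col and op come from the fixed sets above, so interpolating them is safe here.
    sql = f"SELECT a, b FROM t WHERE {col} {op} ?"
    return conn.execute(sql, (const,)).fetchall()


def run_system_under_test(rows, query):
    """Stand-in for the engine being tested: a pure-Python filter over the same rows."""
    col, op, const = query
    idx = COLUMNS.index(col)
    return [row for row in rows if OPS[op](row[idx], const)]


def fuzz(seed, num_queries, num_rows=50):
    """Generate data and queries, and return the queries whose results disagree."""
    rng = random.Random(seed)
    rows = random_rows(rng, num_rows)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    mismatches = []
    for _ in range(num_queries):
        query = random_query(rng)
        # Compare as multisets: SQL does not guarantee row order without ORDER BY.
        expected = Counter(run_oracle(conn, query))
        actual = Counter(run_system_under_test(rows, query))
        if expected != actual:
            mismatches.append(query)
    conn.close()
    return mismatches
```

Calling `fuzz(seed=0, num_queries=100)` returns the list of queries on which the two engines disagree; an empty list means no discrepancy was found for that seed. Fixing the seed makes any reported mismatch reproducible, which matters when a failing query has to be handed to a developer for debugging.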

Keywords:

Computer science
Information retrieval
Correctness
Database
