Automatic Caching Decision for Scientific Dataflow Execution in Apache Spark

2018 
Demands for large-scale data analysis and processing have led to the development and widespread adoption of computing frameworks that leverage in-memory data processing, largely outperforming disk-based systems. One such framework is Apache Spark, which adopts a lazy-evaluation execution model. In this model, the execution of a transformation dataflow operation is delayed until its results are required by an action. Furthermore, a transformation's results are not kept in memory by default, so the same transformation must be re-executed whenever another action requires it. To spare unnecessary re-execution of entire pipelines of frequently referenced operations, Spark allows the programmer to explicitly define a cache operation that persists transformation results. However, many factors affect the efficiency of a cache in a dataflow, including the presence of other cache operations. Thus, even with a reasonably small number of transformations, choosing the optimal combination of cache operations is a nontrivial problem. The problem is compounded by the fact that intuitive strategies, especially when considered in isolation, may actually harm dataflow efficiency. In this work, we present an automatic procedure to compute a substantially optimal combination of cache operations given a dataflow definition and a simple cost model for the operations. Our results over an astronomy dataflow use case show that our algorithm is resilient to changes in the dataflow and the cost model, and that it outperforms intuitive strategies, consistently deciding on a substantially optimal combination of caches.
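
For context on the caching mechanism the abstract refers to, the sketch below shows how an explicit cache() call is placed in a Spark pipeline. The input path and filter condition are hypothetical; the snippet only illustrates lazy evaluation and manual persistence, not the automatic decision procedure proposed in the paper.

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A transformation pipeline: nothing executes yet (lazy evaluation).
    val raw      = sc.textFile("data/catalog.csv")   // hypothetical input path
    val parsed   = raw.map(_.split(","))
    val filtered = parsed.filter(_.length > 3)       // hypothetical condition

    // Without cache(), each action below would re-execute the whole pipeline.
    // Persisting the intermediate RDD avoids that re-execution, at the cost
    // of memory that other operations might need.
    filtered.cache()

    val total  = filtered.count()   // first action: materializes and caches the RDD
    val sample = filtered.take(10)  // second action: served from the cached data

    println(s"rows: $total, sample size: ${sample.length}")
    spark.stop()
  }
}
```

Whether such a cache() call pays off depends on how often the cached result is reused, its size, and which other results compete for memory, which is precisely the combinatorial decision the paper automates.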