GML: Efficiently Auto-tuning Flink's Configurations via Guided Machine Learning

2021 
Apache Flink, the increasingly popular fused batch-streaming big data framework, has many configuration parameters that are performance-critical yet untamed, and how to tune them for optimal performance has not yet been explored. Machine learning (ML) has been used to tune the configurations of other big data frameworks (e.g., Apache Spark), with significant performance improvements. However, ML inherently requires a long time to collect a large amount of training data. In this article, we propose a guided machine learning (GML) approach that tunes the configurations of Flink while spending significantly less time collecting training data than traditional ML approaches. GML introduces two techniques. First, it leverages generative adversarial networks (GANs) to generate part of the training data, reducing the time needed for training data collection. Second, GML guides an ML algorithm to select configurations whose performance is higher than the average performance of random configurations. We evaluate GML on a lab cluster with four servers and on a real production cluster in an internet company. The results show that GML significantly outperforms DAC (Datasize-Aware-Configuration) (Z. Yu et al. 2018), the state-of-the-art approach for tuning the configurations of Spark, reducing data collection time by 2.4× while also reducing 99th percentile latency by 30 percent. When GML is used in the internet company, it reduces latency by up to 57.8× compared to the configurations made by the company.
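The second technique, preferring configurations that beat the average of random configurations, can be illustrated with a minimal sketch. This is not the authors' implementation: the two parameters (parallelism and buffer size), the `toy_latency` function, and all other names here are hypothetical stand-ins, and a real setup would measure latency by running Flink jobs rather than evaluating a formula. Since the metric is latency, "higher performance" is taken to mean lower latency.

```python
import random

def toy_latency(config):
    # Hypothetical stand-in for measuring a job's latency (ms) under a config.
    parallelism, buffer_kb = config
    return 1000.0 / parallelism + abs(buffer_kb - 64) * 0.5

def random_config(rng):
    # Draw one configuration uniformly from an assumed, illustrative search space.
    return (rng.randint(1, 16), rng.choice([16, 32, 64, 128, 256]))

def guided_candidates(n_samples, seed=0):
    rng = random.Random(seed)
    samples = [random_config(rng) for _ in range(n_samples)]
    latencies = [toy_latency(c) for c in samples]
    # Baseline: average latency over the random configurations.
    baseline = sum(latencies) / len(latencies)
    # Keep only configurations that beat the baseline (lower latency is better).
    kept = [(c, lat) for c, lat in zip(samples, latencies) if lat < baseline]
    return baseline, kept

baseline, kept = guided_candidates(200)
print(f"baseline latency: {baseline:.1f} ms, kept {len(kept)} of 200 configs")
```

In a tuner along these lines, the retained configurations would serve as the region the ML model focuses on, so that search effort is not spent on configurations already known to be worse than a random pick.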