Tree2tree Structural Language Modeling for Compiler Fuzzing

2020 
Compiler fuzzing requires well-formed test cases. Only syntactically correct programs can pass the parsing stage of a compiler. Recently, advanced compiler fuzzers produce test cases by learning a generative language model of regular programs. They treat programs as natural language texts without leveraging any syntactic structure, making them hard to produce syntactically correct programs when programs get long. In this paper, we propose a novel tree-to-tree (tree2tree) model to leverage the structural information for a robust test case generation. We adopt an encoder-decoder architecture to map a partial abstract syntax tree (AST) to a complete AST. We introduce a C compiler fuzzing framework, named TSmith. Specifically, TSmith employs a tree-based encoder to encode the input partial AST to capture the hierarchical structure information. It then adopts a tree decoder to generate a complete AST by expanding the non-terminals in a top-down manner. Finally, the output ASTs are converted into corresponding test programs. Experimental results show that our generation strategies have a maximum parsing pass rate of 83%, which is about 21% higher than sequential models. Besides, TSmith significantly improves the code coverage of the compiler. Benefiting from the high pass rate and broad code coverage, TSmith has found 14 new bugs in GCC.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    18
    References
    3
    Citations
    NaN
    KQI
    []