Research Report: ICARUS: Understanding De Facto Formats by Way of Feathers and Wax

2020 
When $a$ data format achieves a significant level of adoption, the presence of multiple format implementations expands the original specification in often-unforeseen ways. This results in an implicitly defined, de facto format, which can create vulnerabilities in programs handling the associated data files. In this paper we present our initial work on ICARUS: a toolchain for dealing with the problem of understanding and hardening de facto file formats. We show the results of our work in progress in the following areas: labeling and categorizing a corpora of data format samples to understand accepted variations of a format; the detection of sublanguages within the de facto format using both entropy- and taint-tracking-based methods, as a means of breaking down the larger problem of learning how the grammar has evolved; grammar inference via reinforcement learning, as a means of tying together the learned sublanguages; and the defining of both safe subsets of the de facto grammar, as well as translations from unsafe regions of the de facto grammar into safe regions. Real-world data formats evolve as they find use in real-world applications, and a comprehensive ICARUS toolchain for understanding and hardening the resulting de facto formats can identify and address security risks arising from this evolution.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    18
    References
    0
    Citations
    NaN
    KQI
    []