Optimizing Multi-class Classification of Binaries Based on Static Features

2021 
Binaries are often classified with few resources spent on pre-processing the input, on the assumption that resource-intensive machine learning techniques will find the optimal results on their own. In this paper, we identify pre-processing methods that enable faster multi-class malware classification with high accuracy, and we apply the same techniques to identify the author (programmer) of an executable. One method applies eight different types of code simplification to the disassembled code to reduce storage and calculation time. Another is visual analysis of the results of TF-IDF N-gram analysis, using both Random Forest and SVM, over a large range of N-gram sizes. The results reveal interesting features in our classification of executables, which is based solely on analysis of the disassembled code. In addition, we examine different training data sizes, compiler-optimized code, and both ELF and PE files, and demonstrate methods for optimizing storage and computational complexity when classifying executable files. Our findings show that a larger N-gram size is preferable only for some code simplifications, and that some code simplifications can give very high accuracy (99.2%) based on only a fraction of the code. In addition, the amount of training data can be quite low and still yield an accuracy of over 95%.
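To make the feature-extraction idea concrete, the following is a minimal sketch of one possible code simplification followed by TF-IDF weighting of N-grams. It is not the paper's implementation: the opcode-only simplification, the function names (`simplify`, `ngrams`, `tfidf`), the toy assembly listings, and the unsmoothed `log(N / df)` form of IDF are all assumptions chosen for illustration. The resulting sparse vectors are the kind of input that could then be passed to a Random Forest or SVM classifier, which is omitted here.

```python
import math
from collections import Counter


def simplify(disassembly):
    """Hypothetical simplification: keep only the mnemonic of each instruction."""
    ops = []
    for line in disassembly.splitlines():
        parts = line.split()
        if parts:
            ops.append(parts[0].lower())
    return ops


def ngrams(tokens, n):
    """Return all contiguous N-grams of the token sequence as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n):
    """Compute one sparse TF-IDF vector (a dict) per disassembled sample."""
    counts = [Counter(ngrams(simplify(d), n)) for d in docs]

    # Document frequency: in how many samples does each N-gram occur?
    df = Counter()
    for c in counts:
        df.update(c.keys())

    total_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # guard against samples with no N-grams
        vectors.append(
            {g: (k / total) * math.log(total_docs / df[g]) for g, k in c.items()}
        )
    return vectors


# Toy listings (assumed, not taken from the paper's data set).
samples = [
    "mov eax, 1\nadd eax, ebx\nret",
    "mov ecx, 2\nadd ecx, edx\nret",
    "push ebp\nmov ebp, esp\nxor eax, eax\npop ebp\nret",
]
vectors = tfidf(samples, 2)
```

In this sketch the first two samples differ only in their operands, so after simplification they map to identical feature vectors. This shows how a simplification can shrink the feature space while discarding detail that may or may not matter for classification.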