Recognizing the Data Type of Firmware Data Segments with Deep Learning

2020 
Data segment analysis is of great value for firmware analysis. The data segment contains abundant information, such as pointers and strings, that helps accelerate code segment analysis. In this paper, we propose a novel approach that applies deep learning to data type identification in data segments, which is a fundamental problem in data segment analysis. We define 3 data types for the data segment, design several byte feature extraction methods to construct feature sequences, and present a deep learning-based approach that takes the feature sequences as input and recognizes the data type byte by byte. The recognized types can then be corrected efficiently using prior knowledge. From the data segment of a firmware image, we built a dataset of 18,032,352 samples, where each sample is one byte of the data segment. We implement a prototype system, evaluate it on our dataset, determine reasonable models and hyperparameters through several experiments, and confirm that deep learning techniques are suitable for identifying data types in data segments. The Kappa coefficient of our data type recognition reached 0.96, and the models can be retrained quickly. Training on 131,072 samples for 32 seconds yields 90% accuracy; training on 950,272 samples for 273 seconds yields 97% accuracy. Furthermore, our approach is more accurate than IDA in string recognition. In experiments, the recall and precision of our approach reached 96.5% and 90% respectively, whereas the corresponding results for IDA are 92.9% and 85.7%. In addition, we compiled and tested 8 open-source software projects and compared the detection results with TypeMiner. Experiments show that our method generalizes to some extent across platforms and operating systems, and performs better than TypeMiner on some of the software.
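The evaluation metrics named in the abstract (Kappa coefficient, precision, recall) can be illustrated on per-byte labels. The following is a minimal sketch, not the authors' evaluation code: it assumes Cohen's kappa is the Kappa coefficient meant, and the label names `"pointer"`, `"string"`, and `"other"` are hypothetical placeholders, since the abstract does not name the 3 data types.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between reference and predicted per-byte labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(y_true)
    # Observed agreement: fraction of bytes where both labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of the product of marginal frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: a single class in both sequences
    return (observed - expected) / (1.0 - expected)


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted = sum(p == positive for p in y_pred)
    actual = sum(t == positive for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical labels for four bytes of a data segment (illustration only).
reference = ["string", "string", "pointer", "other"]
predicted = ["string", "pointer", "pointer", "other"]

kappa = cohen_kappa(reference, predicted)
string_precision, string_recall = precision_recall(reference, predicted, "string")
```

Kappa corrects raw accuracy for agreement expected by chance, which matters when one data type dominates the bytes of a data segment.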