An Optimized Positive-Unlabeled Learning Method for Detecting a Large Scale of Malware Variants

2019 
Malicious software (malware) can quickly evolve into many different variants and evade existing detection mechanisms, rendering traditional signature-based malware detection systems ineffective. Many researchers have proposed advanced malware detection techniques based on machine learning. Although these techniques perform well in detecting a wide range of malware variants, problems remain when they are applied in real-world industrial settings. Because the volume of new malware variants grows quickly and labeling data is expensive and labor-intensive, companies cannot label every sample. They tend to label a small portion of the malware samples and treat the remaining unlabeled samples as benign, so that any malware among the unlabeled samples is effectively mislabeled. This biases the decision boundary and severely limits accuracy. To address this problem, we propose a cost-sensitive boosting method that trains an unbiased detection model on malicious and unlabeled executables to improve accuracy. In addition, to detect malware variants efficiently, we propose a byte co-occurrence matrix as a representation of the byte streams of executables, so that malware variants can be detected directly from their bytes. Experimental results show that machine learning methods optimized by our approach achieve 80% to 90% accuracy, whereas the original machine learning methods achieve only 50% to 85% accuracy, when the unlabeled data contain different rates of mislabeled positive data.
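The abstract does not define the byte co-occurrence matrix precisely, so the following is only a minimal sketch under a stated assumption: the matrix is taken to be a 256 x 256 table counting how often byte value a is followed by byte value b at a fixed distance in the executable's byte stream. The function names `byte_cooccurrence` and `normalize` and the `offset` parameter are hypothetical and are not taken from the paper.

```python
from typing import List


def byte_cooccurrence(data: bytes, offset: int = 1) -> List[List[int]]:
    """Count pairs (data[i], data[i + offset]) in a 256 x 256 matrix.

    Assumption: co-occurrence means two byte values appearing `offset`
    positions apart; the paper may use a different definition.
    """
    if offset < 1:
        raise ValueError("offset must be at least 1")
    matrix = [[0] * 256 for _ in range(256)]
    # Iterating over bytes yields integers in 0..255, usable as indices.
    for first, second in zip(data, data[offset:]):
        matrix[first][second] += 1
    return matrix


def normalize(matrix: List[List[int]]) -> List[List[float]]:
    """Scale counts to relative frequencies so files of different sizes
    yield comparable feature matrices."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return [[0.0] * 256 for _ in range(256)]
    return [[count / total for count in row] for row in matrix]


if __name__ == "__main__":
    sample = b"\x00\x01\x00\x01"  # toy byte stream, not a real executable
    counts = byte_cooccurrence(sample)
    print(counts[0][1], counts[1][0])
```

A fixed-size matrix of this kind can be flattened into a 65,536-dimensional feature vector regardless of file length, which is one plausible reason such a representation suits classifiers that expect fixed-length input.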