Finding similarities in source code through factorization

Michel Chilowicz, Étienne Duris, Gilles Roussel

2008

The high availability of a huge number of documents on the Web makes plagiarism attractive and easy. Plagiarism affects every kind of document, from natural-language texts to more structured information such as programs. To cope with this problem, many tools and algorithms have been proposed to find similarities. In this paper, we present a new algorithm designed to detect similarities in source code. Unlike existing methods, this algorithm relies on the notion of function and focuses on obfuscation through inlining and outlining of functions. The method is also effective against insertions, deletions and permutations of instruction blocks. It is based on code factorization and uses adapted pattern-matching algorithms and data structures such as suffix arrays.
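To make the role of suffix arrays more concrete, here is a minimal sketch in Python. It is not the paper's factorization algorithm. It only illustrates the underlying idea of finding the longest run of tokens shared by two token sequences through a suffix array. The function names, the token names (`ID`, `NUM`, `CALL`) and the separator token are assumptions made for this illustration, and the construction is deliberately naive rather than one of the adapted structures the abstract refers to.

```python
def build_suffix_array(tokens):
    """Naive construction: sort suffix start positions by suffix content.

    Simple but slow (roughly O(n^2 log n)); used here only for illustration.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_len(tokens, i, j):
    """Length of the common prefix of the suffixes starting at i and j."""
    n = len(tokens)
    k = 0
    while i + k < n and j + k < n and tokens[i + k] == tokens[j + k]:
        k += 1
    return k


def longest_shared_run(a, b):
    """Return the longest contiguous token run occurring in both a and b.

    Assumes tokens are strings and that the separator below (a hypothetical
    sentinel) never appears as a real token.
    """
    sep = "\x00"
    tokens = list(a) + [sep] + list(b)
    sa = build_suffix_array(tokens)
    best_start, best_len = 0, 0
    # Suffixes sharing a long prefix are neighbours in the sorted order, so
    # it is enough to compare adjacent entries that come from different inputs.
    for prev, cur in zip(sa, sa[1:]):
        if (prev < len(a)) != (cur < len(a)):
            length = common_prefix_len(tokens, prev, cur)
            if length > best_len:
                best_start, best_len = prev, length
    return tokens[best_start:best_start + best_len]


# Two hypothetical token streams: the same two statements in swapped order.
a = ["ID", "=", "ID", "+", "NUM", ";", "CALL", "(", "ID", ")", ";"]
b = ["CALL", "(", "ID", ")", ";", "ID", "=", "ID", "+", "NUM", ";"]
print(longest_shared_run(a, b))
```

Because the comparison works on sorted suffixes and not on aligned positions, the shared run is found even though the two statements appear in a different order, which gives some intuition for why suffix-based structures cope with permuted instruction blocks.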

Keywords:

Natural language
Factorization
Suffix
Obfuscation
Pattern matching
Theoretical computer science
Source code
High availability
Computer science
Permutation
