Text Mining in Program Code

Alexander Dreweke, Ingrid Fischer, Tobias Werth, Marc Wörlein

2009

Searching for frequent fragments in a database of text is a well-known problem. Program code, such as C++ source or machine code for embedded systems, is a special kind of text. Filtering out duplicates in large software projects leads to more understandable programs and helps avoid mistakes when reengineering them. On embedded systems, the size of the machine code is an important concern, so duplicates must be avoided to keep programs small. Fast program execution can be achieved by encoding frequently used duplicates in hardware. The most successful approaches for finding code duplicates combine graphs that represent the data and control flow of the program with graph mining algorithms. These approaches find more duplicates than applying suffix tries to the code or fingerprinting, in which some special form of program parts is computed.
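
As a rough illustration of the fingerprinting baseline mentioned above (not the graph-based method the authors favor), the following minimal sketch hashes every window of consecutive, whitespace-normalized lines and reports windows that occur more than once. The function names `normalize` and `find_duplicates`, the window size, and the choice of SHA-1 are assumptions made for this example only and do not come from the paper.

```python
import hashlib
from collections import defaultdict


def normalize(line: str) -> str:
    # Collapse runs of whitespace so formatting differences do not hide duplicates.
    return " ".join(line.split())


def find_duplicates(lines, window=3):
    """Return lists of start indices whose `window`-line fingerprints match.

    Hypothetical helper: a fingerprint is the SHA-1 digest of `window`
    consecutive non-empty, normalized lines.
    """
    # Keep the original line index next to each normalized, non-empty line.
    cleaned = [(i, normalize(l)) for i, l in enumerate(lines) if l.strip()]
    seen = defaultdict(list)
    for k in range(len(cleaned) - window + 1):
        chunk = cleaned[k:k + window]
        text = "\n".join(t for _, t in chunk)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        seen[digest].append(chunk[0][0])
    # Only fingerprints seen at more than one position are duplicates.
    return sorted(pos for pos in seen.values() if len(pos) > 1)


code = [
    "a = b + c;",
    "d = a * 2;",
    "print(d);",
    "x = 1;",
    "a = b  +  c;",
    "d = a * 2;",
    "print(d);",
]
print(find_duplicates(code, window=3))
```

A purely textual fingerprint like this only matches windows whose lines appear in the same order. Two fragments that differ only in the order of independent statements would be missed, which is one kind of duplicate that a representation based on data and control flow can still recognize.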

Keywords:

sort
Filter (signal processing)
Software
Suffix
Suffix tree
Database
Machine code
Control flow
Text mining
Computer science
Theoretical computer science
Business process reengineering
Programming language
