Applying probabilistic models to C++ code on an industrial scale

Abstract
Machine learning approaches are widely applied to research tasks in software engineering, but C/C++ code presents a challenge for these approaches because of its complex build system. This difficulty prevents recent advances in probabilistic modeling of source code from being applied to the C/C++ domain, even though C and C++ remain two of the most popular programming languages, especially in industrial software, where a large amount of legacy code is still in use. We demonstrate that these difficulties can be at least partially overcome with a simple token-based representation of C/C++ code that can serve as a replacement for more precise representations. The enriched token representation is verified at a large scale to ensure that its precision is good enough to learn rules from. We consider two tasks as applications of this representation: coding style detection and API usage anomaly detection. We apply simple probabilistic models to these tasks and demonstrate that even complex coding style rules and API usage patterns can be detected by means of this representation. This paper provides a vision of how ML-based research methods for software engineering could be applied to the C/C++ domain and shows how they can be applied to the source code of a large software company like Samsung.
