Applying probabilistic models to C++ code on an industrial scale

Abstract

Machine learning approaches are widely applied to software engineering research tasks, but C/C++ code is difficult for these approaches to handle because of its complex build systems. At the same time, C and C++ remain among the most popular programming languages, especially in industrial software, where a large amount of legacy code is still in use. The build complexity prevents recent advances in probabilistic modeling of source code from being applied to the C/C++ domain. We demonstrate that these difficulties can be at least partially overcome with a simple token-based representation of C/C++ code, which can stand in for more precise representations. We verify this enriched token representation at large scale to ensure that it is precise enough to learn rules from. We consider two applications of the representation: coding style detection and API usage anomaly detection. We apply simple probabilistic models to these tasks and demonstrate that even complex coding style rules and API usage patterns can be detected with this representation. This paper outlines how ML-based research methods for software engineering can be applied to C/C++ and shows how they can be applied to the source code of a large software company such as Samsung.
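
To make the idea of a token-based probabilistic model concrete, the following is a minimal sketch in Python. It is not the representation or the model used in the paper: the regex lexer, the BigramModel class, the add-one smoothing, and the lock/unlock snippets are all assumptions chosen for illustration. It tokenizes C/C++ snippets without invoking a build system, counts token bigrams, and reports the least likely transition in a new snippet as a candidate API usage anomaly.

```python
import re
from collections import Counter

# Hypothetical, deliberately crude lexer: identifiers, integers, then any
# single non-space character. Strings, comments, multi-character operators
# and the preprocessor are ignored in this sketch.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a C/C++ snippet into a flat list of tokens."""
    return TOKEN_RE.findall(code)


class BigramModel:
    """Token bigram model with add-one smoothing (an assumed, simple choice)."""

    def __init__(self):
        self.context_counts = Counter()  # how often a token appears as a context
        self.bigram_counts = Counter()   # how often (previous, current) appears
        self.vocab = set()

    def train(self, token_sequences):
        for tokens in token_sequences:
            padded = ["<s>"] + tokens  # "<s>" marks the start of a snippet
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev, cur):
        # Reserve one extra vocabulary slot for tokens never seen in training.
        size = len(self.vocab) + 1
        return (self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + size)

    def least_likely(self, tokens):
        """Return the (previous, current) pair with the lowest probability."""
        padded = ["<s>"] + tokens
        return min(zip(padded, padded[1:]), key=lambda pair: self.prob(*pair))


# Toy corpus: the usual pattern is lock ... unlock.
corpus = ["lock(m); work(); unlock(m);"] * 3
model = BigramModel()
model.train(tokenize(snippet) for snippet in corpus)

# A snippet that locks twice and never unlocks.
suspect = tokenize("lock(m); work(); lock(m);")
print(model.least_likely(suspect))  # prints (';', 'lock')
```

On this toy corpus the model has never seen lock follow a semicolon, so that transition receives the lowest probability and is the one reported. A practical system would need a real lexer, a much larger corpus, and a threshold for deciding when a low probability is worth reporting.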
