Applying probabilistic models to C++ code on an industrial scale
- 27 June 2020
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops
Abstract
Machine learning approaches are widely applied to software engineering research tasks, but C/C++ code presents a challenge for these approaches because of its complex build system. Yet C and C++ remain two of the most popular programming languages, especially in industrial software, where a large amount of legacy code is still in use. The build-system obstacle has so far prevented recent advances in probabilistic modeling of source code from being applied to the C/C++ domain. We demonstrate that these difficulties can be at least partially overcome by using a simple token-based representation of C/C++ code as a replacement for more precise representations. The enriched token representation is verified at large scale to ensure that it is precise enough to learn rules from. We consider two applications of this representation: coding style detection and API usage anomaly detection. We apply simple probabilistic models to these tasks and demonstrate that even complex coding style rules and API usage patterns can be detected by means of this representation. This paper provides a vision of how ML-based research methods for software engineering could be applied to the C/C++ domain and shows how they can be applied to the source code of a large software company such as Samsung.
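To make the idea of a token-based representation with a simple probabilistic model concrete, the following is a minimal sketch and not the paper's implementation. It assumes a crude regex tokenizer and an add-one-smoothed bigram model; the names `tokenize`, `BigramModel`, and the toy `lock`/`unlock` corpus are hypothetical and chosen only for illustration. Snippets whose token order is rare in the training corpus receive a lower average log-probability, which is the basic signal that n-gram approaches use to flag unusual style or API usage.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately crude C/C++ tokenizer (an assumption, not the
# paper's enriched tokens): identifiers, integers, a few multi-character
# operators, then any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|->|::|==|!=|<=|>=|&&|\|\||\S")


def tokenize(code):
    """Split a source snippet into a flat list of token strings."""
    return TOKEN_RE.findall(code)


class BigramModel:
    """Add-one-smoothed bigram language model over code tokens."""

    def __init__(self):
        self.context_counts = Counter()  # how often a token precedes another
        self.bigram_counts = Counter()   # how often (prev, cur) occurs
        self.vocab = set()

    def train(self, snippets):
        for code in snippets:
            tokens = ["<s>"] + tokenize(code)  # "<s>" marks snippet start
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev, cur):
        # The extra +1 in the vocabulary size reserves mass for unseen tokens.
        v = len(self.vocab) + 1
        return (self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + v)

    def score(self, code):
        """Average log-probability per token pair; lower means more unusual."""
        tokens = ["<s>"] + tokenize(code)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return 0.0
        return sum(math.log(self.prob(p, c)) for p, c in pairs) / len(pairs)


# Toy corpus (hypothetical API): the lock is always taken before it is released.
corpus = ["lock(m); update(x); unlock(m);", "lock(m); read(x); unlock(m);"] * 20
model = BigramModel()
model.train(corpus)

usual = "lock(m); update(x); unlock(m);"
odd = "unlock(m); update(x); lock(m);"  # reversed call order
print(model.score(usual) > model.score(odd))
```

A bigram window is far too short to capture the complex rules the abstract mentions; it is used here only because it keeps the sketch small. A practical system would need longer contexts and tokens enriched with additional information.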