Towards a universal code formatter through machine learning
- 20 October 2016
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the 2016 ACM SIGPLAN International Conference on Software Language Engineering
- p. 137-151
- https://doi.org/10.1145/2997364.2997383
Abstract
There are many declarative frameworks that allow us to implement code formatters relatively easily for any specific language, but constructing them is cumbersome. The first problem is that “everybody” wants to format their code differently, leading to either many formatter variants or a ridiculous number of configuration options. Second, the size of each implementation scales with a language’s grammar size, leading to hundreds of rules. In this paper, we solve the formatter construction problem using a novel approach, one that automatically derives formatters for any given language without intervention from a language expert. We introduce a code formatter called CodeBuff that uses machine learning to abstract formatting rules from a representative corpus, using a carefully designed feature set. Our experiments on Java, SQL, and ANTLR grammars show that CodeBuff is efficient, has excellent accuracy, and is grammar invariant for a given language. It also generalizes to a 4th language tested during manuscript preparation.
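To make the idea of learning formatting from a corpus concrete, here is a minimal toy sketch. It is not CodeBuff's method: the abstract does not specify the learner or the feature set, so the two-token context feature, the majority vote, and the names `tokenize`, `kind`, `train`, and `format_tokens` are all assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Return (whitespace_class, token) pairs; a token is a word or a single symbol."""
    pairs = []
    for m in re.finditer(r"(\s*)(\w+|[^\w\s])", text):
        ws, tok = m.group(1), m.group(2)
        ws_class = "newline" if "\n" in ws else "space" if ws else "none"
        pairs.append((ws_class, tok))
    return pairs

def kind(tok):
    """Hypothetical feature: collapse identifiers and numbers to 'word', keep symbols."""
    return "word" if tok[0].isalnum() or tok[0] == "_" else tok

def train(corpus):
    """Count which whitespace precedes each (previous kind, current kind) context."""
    votes = defaultdict(Counter)
    for doc in corpus:
        prev = "<start>"
        for ws_class, tok in tokenize(doc):
            votes[(prev, kind(tok))][ws_class] += 1
            prev = kind(tok)
    # Keep the most frequent whitespace decision per context.
    return {ctx: counts.most_common(1)[0][0] for ctx, counts in votes.items()}

def format_tokens(model, tokens, default="space"):
    """Re-emit tokens, inserting the whitespace the corpus used in the same context."""
    glue = {"newline": "\n", "space": " ", "none": ""}
    out, prev = [], "<start>"
    for tok in tokens:
        if prev != "<start>":
            out.append(glue[model.get((prev, kind(tok)), default)])
        out.append(tok)
        prev = kind(tok)
    return "".join(out)

if __name__ == "__main__":
    model = train(["x = 1;\ny = 2;\n", "foo(a, b);\nbar(c);\n"])
    print(format_tokens(model, ["z", "=", "3", ";", "w", "=", "4", ";"]))
```

The sketch contains no hand-written, language-specific rules: swapping the training corpus changes the formatting style. A realistic system would need far richer context than one preceding token, for example to decide indentation.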
This publication has 10 references indexed in Scilit:
- A Pretty Good Formatting Pipeline. Lecture Notes in Computer Science, 2013
- A language independent framework for context-sensitive formatting. Published by Institute of Electrical and Electronics Engineers (IEEE), 2006
- High-Fidelity C/C++ Code Transformation. Electronic Notes in Theoretical Computer Science, 2005
- Pretty printing with lazy dequeues. ACM Transactions on Programming Languages and Systems, 2005
- Pretty-printing for software reengineering. Published by Institute of Electrical and Electronics Engineers (IEEE), 2003
- A prettier printer. Published by Bloomsbury Academic, 2003
- The Asf+Sdf Meta-environment: A Component-Based Language Development Environment. Lecture Notes in Computer Science, 2001
- Generation of formatters for context-free languages. ACM Transactions on Software Engineering and Methodology, 1996
- Program indentation and comprehensibility. Communications of the ACM, 1983
- Prettyprinting. ACM Transactions on Programming Languages and Systems, 1980