Linguistic Economy Applied to Programming Language Identifiers

Abstract
Though many different readability metrics have been created, there still is no universal agreement defining readability of software source code. The lack of a clear agreement of source code readability has ramifications in many areas of the software development life-cycle, not least of which being software maintainability. We propose a measurement based on Linguistic Economy to bridge the gap between mathematical and behavioral aspects. Linguistic Economy describes efficiencies of speech and is generally applied to natural languages. In our study, we create a large corpus of words that are likely to be found in a programmer’s vocabulary, and a corpus of existing identifiers found in a collection of open-source projects. We perform a usage analysis to create a database from both of these corpora. Linguistic Economy suggests that words requiring less effort to speak are used more often than words requiring more effort. This concept is applied to measure how difficult program identifiers are to understand by extracting them from the program source and comparing their usage to the database. Through this process, we can identify source code that programmers find difficult to review. We validate our work using data from a survey where programmers identified unpleasant to review source files. The results indicate that source files identified as unpleasant to review source code have more linguistically complicated identifiers than pleasant programs.

This publication has 6 references indexed in Scilit: