Abstract
Summary

Objectives: To develop an automated, accurate and scalable method for identifying acronym-definition pairs within text. The primary advantage of such a method is that it enables information processing methods to resolve author-defined acronyms, but it also allows the automated creation of a reference work on acronym definitions. Besides saving time and effort, automated creation has several advantages over manual or semi-automated methods, such as enabling identification of the relative frequencies of alternate acronyms and definitions, as well as of spelling, phrasing and hyphenation variants for a unique acronym-definition pair. It also aids users in identifying acronym/definition variants present in the literature that may not necessarily be in biomedical databases.

Methods: A set of heuristics to accurately locate acronym-definition pairs and identify their boundaries was developed and refined in terms of precision and recall on subsets of MEDLINE records. These training sets were gradually increased in size and the heuristics re-evaluated to ensure scalability.

Results: Our final set of Acronym Resolving General Heuristics (ARGH) had a sample-based estimated precision of 96.5 ± 0.4% and recall of 93.0 ± 2.7% when tested on over 12 million MEDLINE records, identifying more than 174,000 unique acronyms and their 737,000 associated definitions.

Conclusions: We estimate that as much as 36% of the acronyms in MEDLINE are associated with more than one definition and, conversely, up to 10% of definitions are associated with more than one acronym. The number of unique acronyms in MEDLINE is increasing at a rate of approximately 11,000 per year, while the number of definitions associated with them is growing at approximately four times that rate. The ARGH database is accessible online at http://lethargy.swmed.edu/ARGH argh.asp. The heuristic module and database are available upon request.
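To make the task described under Methods concrete, the following is a minimal sketch of one simple heuristic: a parenthesized token is accepted as an acronym if the initial letters of the words immediately preceding it spell that token. This is an illustrative assumption only and is not the ARGH heuristic set, which is not specified in this abstract; the names `find_pairs`, `definition_counts`, `PAREN` and `WORD` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern: a parenthesized token of 2-10 characters starting with a letter.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")
# Words are runs of letters and digits, so hyphenated terms split into separate words.
WORD = re.compile(r"[A-Za-z0-9]+")


def find_pairs(text):
    """Return (acronym, definition) pairs found by a first-letter match."""
    pairs = []
    for m in PAREN.finditer(text):
        acronym = m.group(1)
        letters = [c.lower() for c in acronym if c.isalnum()]
        # Only words that occur before the opening parenthesis are candidates.
        words = list(WORD.finditer(text, 0, m.start()))
        if len(words) < len(letters):
            continue
        window = words[-len(letters):]
        # Accept the pair only if every word initial matches the acronym letter.
        if all(w.group()[0].lower() == c for w, c in zip(window, letters)):
            definition = text[window[0].start():window[-1].end()]
            pairs.append((acronym, definition))
    return pairs


def definition_counts(texts):
    """Count how often each (lowercased) definition occurs per acronym."""
    counts = defaultdict(Counter)
    for text in texts:
        for acronym, definition in find_pairs(text):
            counts[acronym][definition.lower()] += 1
    return counts
```

A sketch this simple would miss many real cases (acronyms that skip stop words, use internal letters, or precede their definition), which is why a refined heuristic set evaluated for precision and recall is needed. The per-acronym counts illustrate how relative frequencies of alternate definitions could be derived once pairs have been extracted.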