Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts

1 January 2013

journal article
research article
Published by Cambridge University Press (CUP) in Political Analysis

Vol. 21 (3), 267-297
https://doi.org/10.1093/pan/mps028

Abstract

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

Cited by 1737 articles