Large-scale machine learning at Twitter
- 20 May 2012
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 793-804
- https://doi.org/10.1145/2213836.2213958
Abstract
The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
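The abstract names two techniques, stochastic gradient descent for online learning and ensemble methods, as well suited to scaling out. The sketch below is not taken from the paper, whose implementation lives in Pig loaders, storage functions, and user-defined functions. It is a minimal plain-Python illustration, under the assumption that each ensemble member is an online logistic regression model trained on its own random partition of the data and that predictions are combined by averaging probabilities. All names (`OnlineLogisticRegression`, `train_ensemble`, `ensemble_predict`, `make_toy_data`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Binary classifier updated one example at a time (hypothetical name)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, y):
        # One SGD step on the log loss of a single (x, y) example.
        err = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi
        self.b -= self.lr * err


def train_ensemble(examples, n_models, n_features, seed=0):
    # Assumption: each example is routed to exactly one model at random,
    # so every model sees a disjoint partition in a single pass.
    rng = random.Random(seed)
    models = [OnlineLogisticRegression(n_features) for _ in range(n_models)]
    for x, y in examples:
        models[rng.randrange(n_models)].update(x, y)
    return models


def ensemble_predict(models, x):
    # Combine members by averaging their predicted probabilities.
    avg = sum(m.predict_proba(x) for m in models) / len(models)
    return 1 if avg >= 0.5 else 0


def make_toy_data(n, seed=1):
    # Synthetic, linearly separable data: label is 1 when x0 + x1 > 0.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1 if x[0] + x[1] > 0 else 0))
    return data
```

Calling `train_ensemble(make_toy_data(1500), n_models=3, n_features=2)` and then `ensemble_predict(models, x)` exercises the whole flow. Because each model needs only one sequential pass over its own partition, the partitions can in principle be trained independently, which is the property that makes this style of learning a natural fit for a scale-out platform.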
This publication has 20 references indexed in Scilit:
- Detecting adversarial advertisements in the wild. Association for Computing Machinery (ACM), 2011
- High-precision phrase-based document classification on a modern scale. Association for Computing Machinery (ACM), 2011
- An architecture for parallel topic models. Proceedings of the VLDB Endowment, 2010
- HadoopDB. Proceedings of the VLDB Endowment, 2009
- MAD skills. Proceedings of the VLDB Endowment, 2009
- Pig latin. Association for Computing Machinery (ACM), 2008
- Opinion Mining and Sentiment Analysis. Foundations and Trends® in Information Retrieval, 2008
- Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search. ACM Transactions on Information Systems, 2007
- Random Forests. Machine Learning, 2001
- Arcing classifier (with discussion and a rejoinder by the author). The Annals of Statistics, 1998