Large-scale machine learning at Twitter

20 May 2012

conference paper
Published by Association for Computing Machinery (ACM)

p. 793-804
https://doi.org/10.1145/2213836.2213958

Abstract

The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
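
As a rough illustration of the two techniques the abstract singles out, the sketch below pairs an online logistic regression classifier trained by stochastic gradient descent with a simple ensemble built by training one model per random partition of the data. It is plain Python rather than Pig, and it is not the paper's implementation: the class and function names (`OnlineLogisticRegression`, `train_ensemble`, `ensemble_predict`), the learning rate, and the probability-averaging rule are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineLogisticRegression:
    """Hypothetical binary classifier updated one example at a time with SGD."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value, not from the paper

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def update(self, features, label):
        # For log loss, the per-example gradient w.r.t. the weights is (p - y) * x.
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def train_ensemble(examples, num_partitions, num_features, seed=0):
    """Train one model per random partition, making a single pass over the data."""
    rng = random.Random(seed)
    models = [OnlineLogisticRegression(num_features) for _ in range(num_partitions)]
    for features, label in examples:
        # Each example is routed to exactly one ensemble member.
        models[rng.randrange(num_partitions)].update(features, label)
    return models


def ensemble_predict(models, features):
    """Average the members' probabilities and threshold at 0.5 (an assumed rule)."""
    mean = sum(m.predict_proba(features) for m in models) / len(models)
    return 1 if mean >= 0.5 else 0
```

Because each member sees only its own partition and processes examples in a single pass, the members can be trained independently, which is the property that makes this combination attractive for scaling out. In a MapReduce-style setting each partition would presumably correspond to one parallel task, though how the work is actually divided in the deployed system is not stated in the abstract.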

Cited by 108 articles