Input data organization for batch processing in time window based computations
- 18 March 2013
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 363-370
- https://doi.org/10.1145/2480362.2480437
Abstract
Applications based on event processing are often designed to continuously evaluate sets of events defined by sliding time windows. Solutions employing long-running continuous queries executed in-memory show their limits in applications characterized by a staggering growth of available sources that continuously produce new events at high rates (e.g. intrusion detection systems and algorithmic trading). Problems arise from the complexity of maintaining large amounts of events in memory for continuous processing, and from the difficulty of managing the network of processing nodes at run time. A batch approach to this kind of computation provides a viable solution for scenarios characterized by infrequent computations over very large time windows. In this paper we propose a model for batch processing in time window event computations that allows the definition of multiple metrics for performance optimization. These metrics specifically take into account the organization of input data to minimize its impact on computation latency. The model is then instantiated on Hadoop, a batch processing engine based on the MapReduce paradigm, and a set of strategies for efficiently arranging input data is described and evaluated. Copyright 2013 ACM.
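To illustrate the kind of computation the abstract describes, the sketch below groups timestamped events into sliding-window batches, replicating each event into every window it falls in, as a map-side step would in a MapReduce-style job. This is a minimal illustration of the general technique, not the paper's actual model or Hadoop strategies; the function names, window alignment (windows start at multiples of the slide, from time 0), and parameters are assumptions chosen for the example.

```python
from collections import defaultdict

def window_keys(ts, size, slide):
    """Start times of all sliding windows [start, start + size) containing ts.
    Windows are assumed aligned at multiples of `slide`, beginning at 0."""
    k_max = ts // slide                      # last window starting at or before ts
    k_min = max(0, (ts - size) // slide + 1) # first window still covering ts
    return [k * slide for k in range(k_min, k_max + 1)]

def batch_by_window(events, size, slide):
    """Group (timestamp, payload) events into per-window batches,
    duplicating each event into every overlapping window."""
    batches = defaultdict(list)
    for ts, payload in events:
        for start in window_keys(ts, size, slide):
            batches[start].append((ts, payload))
    return dict(batches)

events = [(1, "a"), (6, "b"), (11, "c")]
batches = batch_by_window(events, size=10, slide=5)
# window [0,10) holds a,b; [5,15) holds b,c; [10,20) holds c
```

Because overlapping windows force each event into multiple batches, how the input is laid out on disk (e.g. one file per window versus shared, sorted files) directly affects read amplification and job latency, which is the cost the paper's metrics aim to capture.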
Funding Information
- Ministero dell'Istruzione, dell'Università e della Ricerca