Input data organization for batch processing in time window based computations
- 18 March 2013
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 363-370
- https://doi.org/10.1145/2480362.2480437
Abstract
Applications based on event processing are often designed to continuously evaluate sets of events defined by sliding time windows. Solutions employing long-running continuous queries executed in-memory show their limits in applications characterized by a staggering growth of available sources that continuously produce new events at high rates (e.g. intrusion detection systems and algorithmic trading). Problems arise from the complexity of maintaining large amounts of events in memory for continuous processing, and from the difficulty of managing the network of processing nodes at run time. A batch approach to this kind of computation provides a viable solution for scenarios characterized by infrequent computations over very large time windows. In this paper we propose a model for batch processing in time window event computations that allows the definition of multiple metrics for performance optimization. These metrics specifically take into account the organization of input data to minimize its impact on computation latency. The model is then instantiated on Hadoop, a batch processing engine based on the MapReduce paradigm, and a set of strategies for efficiently arranging input data is described and evaluated. Copyright 2013 ACM.
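To illustrate the kind of computation the abstract describes, the sketch below groups timestamped events into sliding-window batches, replicating each event into every window it falls in, as a map-side step would in a MapReduce-style job. This is a minimal illustration of the general technique, not the paper's actual model or Hadoop strategies; the function names, window alignment (windows start at multiples of the slide, from time 0), and parameters are assumptions chosen for the example.

```python
from collections import defaultdict

def window_keys(ts, size, slide):
    """Start times of all sliding windows [start, start + size) containing ts.
    Windows are assumed aligned at multiples of `slide`, beginning at 0."""
    k_max = ts // slide                      # last window starting at or before ts
    k_min = max(0, (ts - size) // slide + 1) # first window still covering ts
    return [k * slide for k in range(k_min, k_max + 1)]

def batch_by_window(events, size, slide):
    """Group (timestamp, payload) events into per-window batches,
    duplicating each event into every overlapping window."""
    batches = defaultdict(list)
    for ts, payload in events:
        for start in window_keys(ts, size, slide):
            batches[start].append((ts, payload))
    return dict(batches)

events = [(1, "a"), (6, "b"), (11, "c")]
batches = batch_by_window(events, size=10, slide=5)
# window [0,10) holds a,b; [5,15) holds b,c; [10,20) holds c
```

Because overlapping windows force each event into multiple batches, how the input is laid out on disk (e.g. one file per window versus shared, sorted files) directly affects read amplification and job latency, which is the cost the paper's metrics aim to capture.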
Funding Information
- Ministero dell'Istruzione, dell'Università e della Ricerca