Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms
- 20 May 2015
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 249-263
- https://doi.org/10.1145/2745754.2745772
Abstract
Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one-dimensional data distributions by histograms. A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: given samples from a distribution p over {1,...,n}, compute a k-histogram that minimizes the ℓ2-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information-theoretically minimal sample size of m = O(1/ε^2), runs in sample-linear time O(m), and outputs an O(k)-histogram whose ℓ2-distance from p is at most O(opt_k) + ε, where opt_k is the minimum ℓ2-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.
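To make the problem statement concrete, the following Python sketch draws m samples, forms the empirical distribution, and fits a k-histogram to it by minimizing the ℓ2-distance. It is not the algorithm of the paper: it uses the classical exact dynamic program for piecewise constant fitting, which takes O(k n^2) time and so depends on the universe size n, whereas the paper's algorithm runs in O(m) time. The function names, the constant 1 in m = ceil(1/ε^2), and the example distribution are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def empirical_distribution(samples, n):
    """Empirical probability vector over {1, ..., n} from a list of samples."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(i, 0) / m for i in range(1, n + 1)]


def best_k_histogram(q, k):
    """Exact DP baseline (not the paper's method): the piecewise constant
    function with k pieces minimizing the l2-distance to q.
    Requires 1 <= k <= len(q). Returns (l2_error, fitted_values)."""
    n = len(q)
    s1 = [0.0] * (n + 1)  # prefix sums of q
    s2 = [0.0] * (n + 1)  # prefix sums of q squared
    for i, v in enumerate(q):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(a, b):
        # Squared error of replacing q[a:b] by its mean (clamped for rounding).
        total = s1[b] - s1[a]
        return max(0.0, (s2[b] - s2[a]) - total * total / (b - a))

    inf = float("inf")
    # dp[j][b]: least squared error covering q[0:b] with exactly j pieces.
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for j in range(1, k + 1):
        for b in range(j, n + 1):
            for a in range(j - 1, b):
                c = dp[j - 1][a] + cost(a, b)
                if c < dp[j][b]:
                    dp[j][b] = c
                    cut[j][b] = a

    # Walk the stored cut points backwards to recover the fitted values.
    h = [0.0] * n
    b = n
    for j in range(k, 0, -1):
        a = cut[j][b]
        mean = (s1[b] - s1[a]) / (b - a)
        for i in range(a, b):
            h[i] = mean
        b = a
    return math.sqrt(dp[k][n]), h


if __name__ == "__main__":
    # Hypothetical example: a true 2-histogram over {1, ..., 6}.
    p = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3]
    eps = 0.05
    m = math.ceil(1 / eps ** 2)  # assumed constant 1 in m = O(1/eps^2)
    rng = random.Random(0)
    samples = rng.choices(range(1, len(p) + 1), weights=p, k=m)
    q = empirical_distribution(samples, len(p))
    err, h = best_k_histogram(q, 2)
    print("l2 error of fit to empirical distribution:", err)
    print("fitted histogram:", h)
```

Because each piece is set to the mean of the values it covers, the fitted histogram keeps the total mass of its input, so fitting an empirical distribution again yields a vector that sums to 1.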
Funding Information
- Marie Curie CIG
- NSF (CCF-1217921)
- SICSA
- DOE (DE-SC0008923)
- EPSRC (EP/L021749/1)
- MIT-Shell Energy Initiative
This publication has 18 references indexed in Scilit:
- Approximating and testing k-histogram distributions in sub-linear time. Published by Association for Computing Machinery (ACM), 2012
- Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends® in Databases, 2011
- Approximation and streaming algorithms for histogram construction problems. ACM Transactions on Database Systems, 2006
- REHIST: Relative Error Histogram Construction Algorithms. Published by Elsevier BV, 2004
- Dynamic multidimensional histograms. Published by Association for Computing Machinery (ACM), 2002
- Fast, small-space algorithms for approximate histogram maintenance. Published by Association for Computing Machinery (ACM), 2002
- Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture. The Annals of Statistics, 1997
- On the method of bounded differences. Published by Cambridge University Press (CUP), 1989
- Splines in Statistics. Journal of the American Statistical Association, 1983