Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms

20 May 2015

conference paper
Published by Association for Computing Machinery (ACM)

p. 249-263
https://doi.org/10.1145/2745754.2745772

Abstract

Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one-dimensional data distributions by histograms. A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: given samples from a distribution p over {1,...,n}, compute a k-histogram that minimizes the l₂-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information-theoretically minimal sample size of m = O(1/ε²), runs in sample-linear time O(m), and outputs an O(k)-histogram whose l₂-distance from p is at most O(opt_k) + ε, where opt_k is the minimum l₂-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.
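
To make the quantities in the abstract concrete, the following is a minimal Python sketch of an empirical distribution built from samples, the "flattening" of a distribution over a fixed partition, and an exact computation of opt_k. It is not the paper's sample-linear algorithm: opt_k is computed here with a textbook dynamic program that takes O(k n²) time, used only as a reference baseline. All function names (empirical_distribution, flatten, l2_distance, opt_k) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def empirical_distribution(samples, n):
    """Empirical probabilities over {1, ..., n} from a list of samples."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(i, 0) / m for i in range(1, n + 1)]


def flatten(p, starts):
    """Average p on each interval; starts are 0-based piece starts, beginning with 0.

    On a fixed interval, the mean is the constant minimizing squared error.
    """
    edges = list(starts) + [len(p)]
    h = []
    for a, b in zip(edges, edges[1:]):
        mean = sum(p[a:b]) / (b - a)
        h.extend([mean] * (b - a))
    return h


def l2_distance(p, q):
    """Euclidean (l2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def opt_k(p, k):
    """Minimum l2 distance from p to any histogram with at most k pieces.

    Exact dynamic program over prefixes (assumption: used as a baseline only).
    """
    n = len(p)
    s1, s2 = [0.0], [0.0]  # prefix sums of p and of p squared
    for x in p:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        # Squared error of the best constant on p[a:b].
        total = s1[b] - s1[a]
        return max(0.0, (s2[b] - s2[a]) - total * total / (b - a))

    inf = float("inf")
    # best[j][b]: least squared error covering p[0:b] with exactly j pieces.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for b in range(1, n + 1):
            best[j][b] = min(best[j - 1][a] + cost(a, b) for a in range(b))
    return math.sqrt(min(best[j][n] for j in range(1, k + 1)))
```

For example, with p = [0.5, 0.5, 0, 0], a single piece must use the constant 0.25 everywhere, while two pieces split at the midpoint reproduce p exactly. The sample-based guarantee in the abstract bounds the output's distance from p by O(opt_k) + ε without ever running a computation of this cost.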

Funding Information

Marie Curie CIG
NSF (CCF-1217921)
SICSA
DOE (DE-SC0008923)
EPSRC (EP/L021749/1)
MIT-Shell Energy Initiative

This publication has 18 references indexed in Scilit:

Approximating and testing k-histogram distributions in sub-linear time
Published by Association for Computing Machinery (ACM), 2012
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends® in Databases, 2011
Approximation and streaming algorithms for histogram construction problems
ACM Transactions on Database Systems, 2006
REHIST: Relative Error Histogram Construction Algorithms
Published by Elsevier BV, 2004
Dynamic multidimensional histograms
Published by Association for Computing Machinery (ACM), 2002
Fast, small-space algorithms for approximate histogram maintenance
Published by Association for Computing Machinery (ACM), 2002
Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture
The Annals of Statistics, 1997
On the method of bounded differences
Published by Cambridge University Press (CUP), 1989
Splines in Statistics
Journal of the American Statistical Association, 1983

Cited by 17 articles