Efficiently supporting ad hoc queries in large datasets of time sequences

1 June 1997

journal article
Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record

Vol. 26 (2), 289-300
https://doi.org/10.1145/253262.253332

Abstract

Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access. In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in the sequence is a numerical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is uncompressed. Experiments on large, real world datasets ( AT&T customer calling patterns) show that the proposed method achieves an average of less than 5% error in any data value after compressing to a mere 2.5% of the original space ( i.e. , a 40:1 compression ratio), with these numbers not very sensitive to dataset size. Experiments on aggregate queries achieved a 0.5% reconstruction error with a space requirement under 2%.

Keywords

This publication has 6 references indexed in Scilit:

BIRCH
Published by Association for Computing Machinery (ACM) ,1996
Principal Component Analysis
Springer Series in Statistics, 1986
A Survey of Recent Advances in Hierarchical Clustering Algorithms
The Computer Journal, 1983
Extended Boolean information retrieval
Communications of the ACM, 1983
A universal algorithm for sequential data compression
IEEE Transactions on Information Theory, 1977
Differential files
ACM Transactions on Database Systems, 1976

Cited by 93 articles