The Smallest Grammar Problem
Top Cited Papers
- 27 June 2005
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Information Theory
- Vol. 51 (7), 2554-2576
- https://doi.org/10.1109/tit.2005.850116
Abstract
This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string /spl sigma/? This is a natural question about a fundamental object connected to many fields such as data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, the worst case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results are concern the hardness of approximating the smallest grammar problem. Most notably, we show that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P=NP. We then bound approximation ratios for several of the best known grammar-based compression algorithms, including LZ78, B ISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n/sup 1/2/). We finish by presenting two novel algorithms with exponentially better ratios of O(log/sup 3/n) and O(log(n/m/sup */)), where m/sup */ is the size of the smallest grammar for that input. The latter algorithm highlights a connection between grammar-based compression and LZ77.Keywords
This publication has 27 references indexed in Scilit:
- Universal lossless compression via multilevel pattern matchingIEEE Transactions on Information Theory, 2000
- Grammar-based codes: a new class of universal lossless source codesIEEE Transactions on Information Theory, 2000
- Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform. I. Without context modelsIEEE Transactions on Information Theory, 2000
- Compression and Explanation using Hierarchical GrammarsThe Computer Journal, 1997
- Sequential codes, lossless compression of individual sequences, and Kolmogorov complexityIEEE Transactions on Information Theory, 1996
- The testability-preserving concurrent decomposition and factorization of Boolean expressionsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1992
- Data compression via textual substitutionJournal of the ACM, 1982
- Compression of individual sequences via variable-rate codingIEEE Transactions on Information Theory, 1978
- A universal algorithm for sequential data compressionIEEE Transactions on Information Theory, 1977
- On the Complexity of Finite SequencesIEEE Transactions on Information Theory, 1976