A new text compression technique based on language structure

1 April 1995

journal article
research article
Published by SAGE Publications in Journal of Information Science

Vol. 21 (2), 87-94
https://doi.org/10.1177/016555159502100203

Abstract

This paper describes a new data compression technique which utilises some of the common structural characteristics of languages. The proposed algorithm is designed to partition a word into its root and suffix(es), which are then replaced by shorter bit representations. The method uses three dictionaries in the form of binary search trees and one character array. The first two dictionaries are for roots, whereas the third one is for suffixes. The character array is used both for searching compressible words and for coding incompressible words. The number of bits used to represent a substring depends on the number of entries in the dictionary in which the substring is found. The proposed algorithm is implemented for the Turkish language and tested using three different text groups with different lengths. The results indicate a compression of up to 47%.
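The root-and-suffix idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy dictionaries, the function names, the longest-match splitting rule and the fixed code width of ceil(log2(n)) bits for a dictionary of n entries are all assumptions made for illustration. A sorted list with binary search stands in for the binary search trees, a single root dictionary replaces the two described, and the character array and any framing bits a decoder would need are left out.

```python
from bisect import bisect_left
from math import ceil, log2

# Hypothetical toy dictionaries; the paper's actual entries are not given.
ROOTS = sorted(["ev", "kitap", "okul"])
SUFFIXES = sorted(["de", "im", "lar", "ler"])


def code_width(n_entries):
    """Bits needed to index one of n_entries dictionary slots (assumed rule)."""
    return max(1, ceil(log2(n_entries)))


def lookup(sorted_entries, s):
    """Binary search in a sorted list; returns the index of s, or -1."""
    i = bisect_left(sorted_entries, s)
    return i if i < len(sorted_entries) and sorted_entries[i] == s else -1


def split_word(word):
    """Return (root, [suffixes]) if the dictionaries cover the word, else None."""
    for end in range(len(word), 0, -1):  # try the longest root first
        if lookup(ROOTS, word[:end]) < 0:
            continue
        rest, suffixes = word[end:], []
        while rest:
            for k in range(len(rest), 0, -1):  # longest suffix first
                if lookup(SUFFIXES, rest[:k]) >= 0:
                    suffixes.append(rest[:k])
                    rest = rest[k:]
                    break
            else:
                break  # no suffix matches the remainder
        if not rest:
            return word[:end], suffixes
    return None


def encode(word):
    """Return the word as a bit string, or None if it is incompressible here."""
    parts = split_word(word)
    if parts is None:
        return None
    root, suffixes = parts
    bits = format(lookup(ROOTS, root), f"0{code_width(len(ROOTS))}b")
    for s in suffixes:
        bits += format(lookup(SUFFIXES, s), f"0{code_width(len(SUFFIXES))}b")
    return bits
```

With these toy dictionaries, `encode("evlerde")` splits the word into `ev` + `ler` + `de` and emits three 2-bit codes, while a word with no dictionary root, such as `masa`, returns `None` and would fall back to character-level coding in the scheme the abstract describes.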
