A new text compression technique based on language structure
- 1 April 1995
- journal article
- research article
- Published by SAGE Publications in Journal of Information Science
- Vol. 21 (2), 87-94
- https://doi.org/10.1177/016555159502100203
Abstract
This paper describes a new data compression technique which utilises some of the common structural characteristics of languages. The proposed algorithm is designed to partition a word into its root and suffix(es), which are then replaced by shorter bit representations. The method uses three dictionaries in the form of binary search trees and one character array. The first two dictionaries are for roots, whereas the third one is for suffixes. The character array is used both for searching for compressible words and for coding incompressible words. The number of bits used to represent a substring depends on the number of entries in the dictionary in which the substring is found. The proposed algorithm is implemented for the Turkish language and tested using three different text groups of different lengths. The results indicate a compression of up to 47%.
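The root-and-suffix idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single root dictionary rather than the two the paper uses, plain sorted lists instead of binary search trees, a greedy longest-match split, and 8 bits per character for incompressible words. It also ignores any flag or delimiter bits a real encoder would need. The word lists and the names `code_width`, `split_word` and `encoded_bits` are hypothetical.

```python
# Toy dictionaries (assumed for illustration; the paper's contents are not given).
ROOTS = sorted(["araba", "ev", "kitap", "okul"])
SUFFIXES = sorted(["de", "den", "im", "lar", "ler"])


def code_width(n_entries: int) -> int:
    """Bits needed to index one of n_entries dictionary entries."""
    return max(1, (n_entries - 1).bit_length())


def _split_suffixes(rest, suffixes):
    """Split rest into a list of known suffixes, or return None if impossible."""
    if not rest:
        return []
    for end in range(len(rest), 0, -1):  # try the longest suffix first
        if rest[:end] in suffixes:
            tail = _split_suffixes(rest[end:], suffixes)
            if tail is not None:
                return [rest[:end]] + tail
    return None


def split_word(word, roots=ROOTS, suffixes=SUFFIXES):
    """Return (root, [suffixes]) if the word is fully covered, else None."""
    for end in range(len(word), 0, -1):  # try the longest root first
        if word[:end] in roots:
            parts = _split_suffixes(word[end:], suffixes)
            if parts is not None:
                return word[:end], parts
    return None


def encoded_bits(word):
    """Estimated size in bits: dictionary indices if compressible, else raw characters."""
    parts = split_word(word)
    if parts is None:
        return 8 * len(word)  # assumed 8-bit character coding
    _, sufs = parts
    return code_width(len(ROOTS)) + len(sufs) * code_width(len(SUFFIXES))
```

Under these assumptions, `split_word("evlerde")` yields the root `ev` with suffixes `ler` and `de`. The root index takes 2 bits (4 roots) and each suffix index takes 3 bits (5 suffixes), so the word is estimated at 8 bits instead of 56. Because the code width is derived from the dictionary size, adding entries to a dictionary makes every code drawn from it longer.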