A Syllable-Based Technique for Uyghur Text Compression

Open Access

22 March 2020

journal article
research article
Published by MDPI AG in Information

Vol. 11 (3), 172
https://doi.org/10.3390/info11030172

Abstract

To improve the utilization of text storage resources and the efficiency of data transmission, we propose two syllable-based coding schemes for compressing Uyghur text. Based on syllable coverage statistics from a text corpus, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to them. To enable the schemes to process Uyghur text mixed with symbols from other languages, we introduced a flag code in the compression process that marks Unicode characters not present in the code table. In experiments, the 12-bit coding scheme achieved an average compression ratio of 0.3 on Uyghur texts smaller than 4 KB, and the 16-bit coding scheme achieved an average compression ratio of 0.5 on texts smaller than 2 KB. Both schemes outperformed GZip, BZip2, and the LZW algorithm on short text and can be applied effectively to the compression of short Uyghur text for storage and applications.
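The idea of a fixed-width code table with a flag code can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the table entries are toy Latin-script placeholders rather than entries from the paper's syllable table, the flag is taken to be the highest 12-bit value, an out-of-table character is written as the flag followed by its Unicode code point split across two 12-bit units, and text is matched greedily against the table instead of being segmented by a real Uyghur syllabifier. The names build_table, encode, decode, and pack are hypothetical.

```python
from typing import Dict, List

CODE_BITS = 12
# Assumption: the highest 12-bit value is reserved as the escape flag.
FLAG = (1 << CODE_BITS) - 1


def build_table(units: List[str]) -> Dict[str, int]:
    """Assign consecutive codes to table units (syllables, punctuation, ASCII)."""
    if len(units) > FLAG:
        raise ValueError("too many units for a 12-bit table with one flag code")
    return {unit: code for code, unit in enumerate(units)}


def encode(text: str, table: Dict[str, int]) -> List[int]:
    """Greedy longest-match encoding into a list of 12-bit values.

    Characters that match no table entry are escaped as
    FLAG, high 12 bits of the code point, low 12 bits of the code point.
    """
    max_len = max((len(unit) for unit in table), default=1)
    codes: List[int] = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            code = table.get(text[i:i + n])
            if code is not None:
                codes.append(code)
                i += n
                break
        else:
            # No table entry starts here: emit the flag and the raw code point.
            cp = ord(text[i])
            codes.extend([FLAG, cp >> CODE_BITS, cp & FLAG])
            i += 1
    return codes


def decode(codes: List[int], table: Dict[str, int]) -> str:
    """Invert encode(): look up table codes and rebuild escaped characters."""
    reverse = {code: unit for unit, code in table.items()}
    out: List[str] = []
    it = iter(codes)
    for code in it:
        if code == FLAG:
            high, low = next(it), next(it)
            out.append(chr((high << CODE_BITS) | low))
        else:
            out.append(reverse[code])
    return "".join(out)


def pack(codes: List[int]) -> bytes:
    """Pack 12-bit values into bytes, zero-padding the final byte if needed."""
    bits = 0
    for code in codes:
        bits = (bits << CODE_BITS) | code
    nbits = CODE_BITS * len(codes)
    pad = (-nbits) % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


if __name__ == "__main__":
    # Toy placeholder entries, not taken from the paper's code table.
    table = build_table(["yax", "shi", "mu", "siz", " ", "?"])
    text = "yaxshimusiz? 你"
    codes = encode(text, table)
    assert decode(codes, table) == text
    # Assumption: ratio = compressed size / original size, original in UTF-8.
    print(len(pack(codes)) / len(text.encode("utf-8")))
```

In this sketch each table hit costs 12 bits however many characters the unit spans, while each escaped character costs 36 bits, so the achievable ratio depends on how much of the input the table covers. The pack function is only meant for measuring output size; a complete format would also need to record the number of codes so the padding can be told apart from data.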

Funding Information

National Natural Science Foundation of China (61762084, 61662077, 61462083)
National Key Research and Development Project of China (2017YFB1002103)
