Query Expansion for Transliterated Text Retrieval

20 July 2021

journal article
research article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Asian and Low-Resource Language Information Processing

Vol. 20 (4), 1-34
https://doi.org/10.1145/3447649

Abstract

With Web 2.0, there has been exponential growth in the number of Web users and the volume of Web content. Most of these users are not only consumers of information but also generators of it. People express themselves on the Web in colloquial languages but write them in Roman script (transliteration). These texts are mostly informal and casual, and therefore seldom follow grammar rules. Moreover, transliterated text has no prescribed set of spelling rules. This freedom leads to large-scale spelling variations, which are a major challenge in mixed-script information processing. This article studies existing phonetic algorithms for handling spelling variation, points out their limitations, and proposes a novel phonetic encoding approach, in two flavors, designed around Hindi transliteration. Experiments on Hindi song lyrics retrieval in the mixed-script domain with three different retrieval models show that the proposed approaches outperform the existing techniques in a majority of the cases (sometimes statistically significantly) for a number of metrics, including measures at fixed rank cutoffs, MAP, MRR, and Recall.
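As a rough illustration of the general idea behind phonetic query expansion, the sketch below uses classic American Soundex. The abstract does not say which existing algorithms were studied, so the choice of Soundex is an assumption, and this is not the encoding the article proposes. The helper names (soundex, build_phonetic_index, expand_query) and the toy vocabulary are hypothetical.

```python
from collections import defaultdict

# Standard American Soundex consonant classes.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and y map to ""
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]


def build_phonetic_index(vocabulary):
    """Group vocabulary terms by their phonetic code."""
    index = defaultdict(set)
    for term in vocabulary:
        index[soundex(term)].add(term)
    return index


def expand_query(query, index):
    """Replace each query term with all indexed terms sharing its code."""
    expanded = []
    for term in query.split():
        variants = index.get(soundex(term), set()) | {term}
        expanded.append(sorted(variants))
    return expanded


if __name__ == "__main__":
    # Toy vocabulary of Romanized Hindi spellings (illustrative only).
    vocab = ["pyaar", "pyar", "piyar", "dil", "zindagi", "jindagi"]
    index = build_phonetic_index(vocab)
    print(expand_query("pyar zindagi", index))
```

In this toy run, "pyar" expands to "piyar", "pyaar", and "pyar", since all three share the code P600. "zindagi" and "jindagi" stay apart (Z532 versus J532) because Soundex always keeps the first letter. This is one example of how an encoding designed for English names can miss transliteration variants; whether the article discusses this particular case is not stated in the abstract.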
