Learning a dual-language vector space for domain-specific cross-lingual question retrieval

25 August 2016

conference paper
Published by Association for Computing Machinery (ACM)

p. 744-755
https://doi.org/10.1145/2970276.2970317

Abstract

The language barrier limits the ability of millions of non-English-speaking developers to make effective use of the vast knowledge in Stack Overflow, which is archived in English. For cross-lingual question retrieval, one may use translation-based methods, which first translate non-English queries into English and then perform monolingual question retrieval in English. However, translation-based methods suffer from two problems: semantic deviation caused by inappropriate translation, especially of domain-specific terms, and a lexical gap between queries and questions that share few words. To overcome these issues, we propose a novel cross-lingual question retrieval approach based on word embeddings and a convolutional neural network (CNN), state-of-the-art deep learning techniques for capturing word- and sentence-level semantics. The CNN is trained on a large number of examples drawn from Stack Overflow duplicate questions and their machine translations, which guides it to learn informative word and sentence features for recognizing and quantifying semantic similarity in the presence of semantic deviations and lexical gaps. A distinctive feature of our approach is that the trained CNN maps documents in two languages (e.g., Chinese queries and English questions) into a single dual-language vector space, reducing cross-lingual question retrieval to a simple k-nearest neighbors search in that space, with no query or question translation required. Our evaluation shows that our approach significantly outperforms the translation-based method and can be extended to dual-language document retrieval from different sources.
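
As a rough illustration of the retrieval step described in the abstract, the sketch below runs a k-nearest neighbors search over vectors that are assumed to come from an already trained model. Everything in it is hypothetical: the function names `cosine` and `knn_search`, the identifiers `q1` to `q3`, and the hand-made three-dimensional vectors are placeholders, and cosine similarity is an assumed choice, since the abstract does not state which similarity measure the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def knn_search(query_vec, question_vecs, k=3):
    """Return the k (question_id, similarity) pairs closest to query_vec."""
    scored = [(cosine(query_vec, vec), qid) for qid, vec in question_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(qid, score) for score, qid in scored[:k]]


# Hypothetical vectors standing in for the output of a trained model.
# In the setting the abstract describes, English questions and Chinese
# queries would both be mapped into the same space; here they are made up.
english_questions = {
    "q1": [0.9, 0.1, 0.0],  # e.g. a question about reversing a list
    "q2": [0.0, 0.2, 0.9],  # e.g. a question about a Java exception
    "q3": [0.6, 0.7, 0.1],  # e.g. a question about sorting a dictionary
}
chinese_query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded Chinese query

top = knn_search(chinese_query_vec, english_questions, k=2)
for qid, score in top:
    print(qid, round(score, 3))
```

No translation happens at query time in this sketch: the query vector is compared directly with the stored question vectors, which is the property the abstract highlights. A linear scan is used for simplicity; a large collection would likely call for an approximate nearest-neighbor index instead.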

Cited by 40 articles