Arabic Dialect Identification
- 1 March 2014
- journal article
- research article
- Published by MIT Press in Computational Linguistics
- Vol. 40 (1), 171-202
- https://doi.org/10.1162/coli_a_00169
Abstract
The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic, the true native languages of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic On-line Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and dialect itself) in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one's own dialect). Using this new annotated data set, we consider the task of Arabic dialect identification: Given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectal data from a large Web crawl consisting of 3.5 million pages mined from on-line Arabic newspapers.