The Saudi Novel Corpus: Design and Compilation
Open Access
- 30 June 2022
- journal article
- research article
- Published by MDPI AG in Applied Sciences
- Vol. 12 (13), 6648
- https://doi.org/10.3390/app12136648
Abstract
Arabic has recently received significant attention from corpus compilers, which has led to the creation of many Arabic corpora covering various genres, most notably newswire. Yet Arabic novels, and specifically those authored by Saudi writers, lack sufficient digital datasets to support corpus-linguistic and stylistic studies of these works. In this respect, Arabic lags behind English and other European languages. In this paper, we present the Saudi Novels Corpus, built to be a valuable resource for the linguistic and stylistic research communities. We present the procedures we followed and the decisions we made in creating the corpus, describing the design criteria, data collection methods, annotation process, and encoding. In addition, we present preliminary results from an analysis of the corpus content. We consider the work described in this paper an initial step toward bridging the existing gap between corpus linguistics and Arabic literary texts. Further work is planned to improve the quality of the corpus by adding advanced features.