An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India

Open Access

29 July 2021

journal article
research article
Published by MDPI AG in Information

Vol. 12 (8), 306
https://doi.org/10.3390/info12080306

Abstract

The pervasiveness of offensive content in social media has become a major concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English, owing to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and taking advantage of existing multilingual pretrained models to cope with data scarcity. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers on languages spoken in India. We investigate languages from different families, such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.

