Language independent gender classification on Twitter

25 August 2013

conference paper
Published by Association for Computing Machinery (ACM)

p. 739-743
https://doi.org/10.1145/2492517.2492632

Abstract

Online Social Networks (OSNs) generate a huge volume of user-originated text. Gender classification of OSN users can serve multiple purposes: commercial organizations can use it for advertising, law enforcement may use it as part of legal investigations, and others may use gender information for social reasons. Here we explore language-independent gender classification. Our approach predicts gender using five color-based features extracted from Twitter profiles (e.g., the background color in a user's profile page). Most other methods for gender prediction are language dependent: they use high-dimensional spaces consisting of unique words extracted from text fields such as postings, user names, and profile descriptions. Our approach is independent of the user's language, efficient, and scalable, while attaining a good level of accuracy. We validate our approach by evaluating different classifiers over a large dataset of Twitter profiles.
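
The abstract does not say which five color features are used or how they are encoded. As a minimal sketch only, assuming each feature is a six-digit hex color string and that the five fields carry the hypothetical names listed in COLOR_FIELDS below, the colors could be turned into a fixed-length numeric vector suitable for a classifier like this:

```python
from typing import Dict, List

# Hypothetical field names for five profile color settings; the abstract
# does not name the features, so these are assumptions for illustration.
COLOR_FIELDS = [
    "profile_background_color",
    "profile_text_color",
    "profile_link_color",
    "profile_sidebar_fill_color",
    "profile_sidebar_border_color",
]


def hex_to_rgb(hex_color: str) -> List[int]:
    """Convert a 6-digit hex color such as 'C0DEED' to [R, G, B]."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    # Each pair of hex digits is one 0-255 channel.
    return [int(value[i:i + 2], 16) for i in (0, 2, 4)]


def color_features(profile: Dict[str, str]) -> List[int]:
    """Flatten the five profile colors into a 15-dimensional vector."""
    features: List[int] = []
    for field in COLOR_FIELDS:
        features.extend(hex_to_rgb(profile[field]))
    return features
```

Because the vector depends only on color values, it has the same length for every profile regardless of the language the user writes in, which is the property the abstract emphasizes. The RGB encoding itself is an assumption; the paper may quantize or encode colors differently.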

This publication has 8 references indexed in Scilit:

Predicting age and gender in online social networks
Published by Association for Computing Machinery (ACM), 2011
Classifying latent user attributes in twitter
Published by Association for Computing Machinery (ACM), 2010
The WEKA data mining software
ACM SIGKDD Explorations Newsletter, 2009
KNIME - the Konstanz information miner
ACM SIGKDD Explorations Newsletter, 2009
Gender and genre variation in weblogs
Journal of Sociolinguistics, 2006
Chat Mining for Gender Prediction
Lecture Notes in Computer Science, 2006
Gender, genre, and writing style in formal written texts
Text & Talk - An Interdisciplinary Journal of Language, Discourse & Communication Studies, 2003
A pilot study on gender differences in conversational speech on lexical richness measures
Literary and Linguistic Computing, 2001

Cited by 45 articles