Language independent gender classification on Twitter
- 25 August 2013
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 739-743
- https://doi.org/10.1145/2492517.2492632
Abstract
Online Social Networks (OSNs) generate a huge volume of user-originated texts. Gender classification can serve multiple purposes. For example, commercial organizations can use gender classification for advertising. Law enforcement may use gender classification as part of legal investigations. Others may use gender information for social reasons. Here we explore language independent gender classification. Our approach predicts gender using five color-based features extracted from Twitter profiles (e.g., the background color in a user's profile page). Most other methods for gender prediction are typically language dependent. Those methods use high-dimensional spaces consisting of unique words extracted from such text fields as postings, user names, and profile descriptions. Our approach is independent of the user's language, efficient, and scalable, while attaining a good level of accuracy. We prove the validity of our approach by examining different classifiers over a large dataset of Twitter profiles.
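The idea of classifying from a handful of profile colors, rather than from words, can be illustrated with a minimal Python sketch. The abstract names only the background color and does not say which classifiers were used, so everything else here is an assumption made for illustration: the five field names, the RGB encoding, the nearest-centroid classifier, and the placeholder labels and colors in the toy data.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical names for the five profile colors. Only the background
# color is mentioned in the abstract; the other four are assumptions.
COLOR_FIELDS = ("background", "text", "link", "sidebar_fill", "sidebar_border")


def hex_to_rgb(color: str) -> Tuple[int, int, int]:
    """Convert a hex color such as 'FF0080' or '#FF0080' to an (R, G, B) tuple."""
    color = color.lstrip("#")
    if len(color) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {color!r}")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def profile_features(profile: Dict[str, str]) -> List[float]:
    """Encode five colors as 15 numbers in [0, 1]; no text fields are used."""
    features: List[float] = []
    for field in COLOR_FIELDS:
        features.extend(channel / 255.0 for channel in hex_to_rgb(profile[field]))
    return features


class NearestCentroid:
    """Stand-in classifier (an assumption, not the paper's choice)."""

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[str]) -> "NearestCentroid":
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = {}
        for row, label in zip(rows, labels):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # Each centroid is the per-feature mean of the rows with that label.
        self.centroids = {
            label: [value / counts[label] for value in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, row: Sequence[float]) -> str:
        def squared_distance(centroid: Sequence[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        return min(self.centroids, key=lambda label: squared_distance(self.centroids[label]))


def uniform_profile(color: str) -> Dict[str, str]:
    """Toy helper: a profile whose five colors are all the same."""
    return {field: color for field in COLOR_FIELDS}


# Made-up training data with placeholder labels, for illustration only.
train_rows = [profile_features(uniform_profile("000000")),
              profile_features(uniform_profile("FFFFFF"))]
train_labels = ["class_0", "class_1"]
model = NearestCentroid().fit(train_rows, train_labels)
prediction = model.predict(profile_features(uniform_profile("#101010")))
```

Because the feature vector has a fixed, small size regardless of what language the user writes in, training and prediction cost does not grow with vocabulary, which is the efficiency argument the abstract makes against word-based feature spaces.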