Automatic Detection of Gender Identity: the Phenomenon of Russian Women's Prose

Abstract

This article addresses the automatic detection of authors’ gender identity, using fiction prose of 1980–2000 as material. This period saw the emergence of a distinct construct known as “women’s prose”, which is characterized by its own genre and stylistic features. We set out to determine whether the concept of “women’s prose” refers only to extra-textual reality or is also clearly reflected at the level of language. We collected a corpus of texts from 1980–2000 and conducted an experiment that identified the most effective machine learning algorithms for classifying men’s and women’s prose.
The research focuses on methods for automatically determining the gender identity of authors on the material of prose from 1960 to 2000. Its purpose is to identify the optimal methods for this task. The objectives are: to highlight the grammatical and stylistic features of prose from 1960 to 2000, in particular women’s prose, and of texts of the 18th–19th centuries; to trace changes in the distribution of parts of speech and punctuation over the specified period; and to conduct an experiment identifying the most effective machine learning algorithm for classifying literary texts. The analysis revealed that the parts of speech women and men use most often in their texts are nouns, verbs, prepositions, pronominal nouns, conjunctions, and adjectives, which reflects the specifics of artistic style. We also analyzed the use of the most common punctuation marks: question mark, exclamation point, comma, colon, semicolon, and period. Women were observed to use punctuation more actively as a means of expression in modern literature: the share of exclamation marks, question marks, and commas in women writers’ texts is much higher than the value obtained from the analysis of men’s texts.
The work also includes an analysis of the distribution of parts of speech and punctuation in literary texts by men and women of the 18th–19th centuries. We performed an experiment to identify the most effective algorithm for determining the gender identity of the author. The most effective classifiers of literary texts proved to be the BayesNet and SMO algorithm implementations.
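
The punctuation analysis summarized above can be illustrated with a minimal sketch. It assumes that a mark's "share" means its count divided by the total number of counted marks in a text; the abstract does not specify the normalization, and the function name `punctuation_shares` and the mapping `MARKS` are hypothetical and used only for illustration.

```python
from collections import Counter

# Marks listed in the abstract. The mapping itself is an illustrative assumption.
MARKS = {
    "?": "question mark",
    "!": "exclamation point",
    ",": "comma",
    ":": "colon",
    ";": "semicolon",
    ".": "period",
}


def punctuation_shares(text):
    """Return each mark's share among all counted marks in `text`.

    Assumption: share = count of the mark / total count of all listed marks.
    """
    counts = Counter(ch for ch in text if ch in MARKS)
    total = sum(counts.values())
    if total == 0:
        # No punctuation at all: report zero shares rather than dividing by zero.
        return {name: 0.0 for name in MARKS.values()}
    return {name: counts[ch] / total for ch, name in MARKS.items()}


if __name__ == "__main__":
    sample = "Hi! How, are you? Fine."
    for name, share in punctuation_shares(sample).items():
        print(f"{name}: {share:.2f}")
```

Vectors of this kind, computed per text and combined with part-of-speech frequencies, are one plausible form of input for classifiers such as BayesNet or SMO; the actual feature set used in the study is not described in the abstract.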

Keywords