For gender, the system checks the profile for about 150 common male and 150 common female first names, as well as for gender-related words, such as father, mother, wife and husband.
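A profile check of this kind can be sketched roughly as follows. This is a minimal illustration and not the system's actual implementation: the name sets are hypothetical, heavily truncated stand-ins for the lists of about 150 names each, the assignment of each gender-related word to one gender is our assumption, and the function name `guess_gender` is invented for the example.

```python
import re

# Hypothetical stand-ins for the real lists (about 150 names each in the text).
MALE_NAMES = {"jan", "peter", "mark"}
FEMALE_NAMES = {"anna", "maria", "linda"}

# The four example words come from the text; mapping each word to the
# author's gender is an assumption made for this sketch.
MALE_WORDS = {"father", "husband"}
FEMALE_WORDS = {"mother", "wife"}


def guess_gender(profile):
    """Return 'male', 'female', or None when the profile gives no clear cue."""
    # Lowercase the profile and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", profile.lower())
    male_hits = sum(t in MALE_NAMES or t in MALE_WORDS for t in tokens)
    female_hits = sum(t in FEMALE_NAMES or t in FEMALE_WORDS for t in tokens)
    if male_hits > female_hits:
        return "male"
    if female_hits > male_hits:
        return "female"
    # No hits, or a tie: leave the profile unlabelled.
    return None


print(guess_gender("Peter, proud father of two"))
```

Returning `None` on ties keeps ambiguous profiles (for instance, one mentioning both "husband" and "wife") out of the automatically labelled set, so that they can be checked by other means.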
We achieved the best results, 95.5% correct assignment in a 5-fold cross-validation on our corpus, with Support Vector Regression on all token unigrams.
For each blogger, metadata is present, including the blogger's self-provided gender, age, industry and astrological sign. The creators themselves used it for various classification tasks, including gender recognition (Koppel et al. The men, on the other hand, seem to be more interested in computers, leading to important content words like software and game, and correspondingly more determiners and prepositions.
One gets the impression that gender recognition is more sociological than linguistic, showing what women and men were blogging about back in A later study (Goswami et al.
For all techniques and features, we ran the same 5-fold cross-validation experiments in order to determine how well they could be used to distinguish between male and female authors of tweets.
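The 5-fold set-up can be illustrated with a small sketch. The text does not specify how authors were assigned to folds, so the seeded shuffle and round-robin assignment below are assumptions, as is the function name `k_fold_splits`.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; every item is in exactly one test fold."""
    shuffled = list(items)
    # Seeded shuffle (an assumption) so that the folds are reproducible.
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment gives folds whose sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Example with 600 hypothetical author ids, matching the corpus size in the text.
for train, test in k_fold_splits(range(600)):
    print(len(train), len(test))
```

Each technique and feature type would then be trained on the `train` authors and scored on the `test` authors of every fold, so that all systems are compared on identical splits.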
In the following sections, we first present some previous work on gender recognition (Section 2). The field is currently receiving an impulse for further development now that vast sets of user-generated data are becoming available. (2012) show that authorship recognition is also possible (to some degree) if the number of candidate authors is as high as 100,000 (as compared to the usual fewer than ten in traditional studies).
For our experiment, we selected 600 authors for whom we were able to determine with a high degree of certainty a) that they were human individuals and b) what gender they were.