Exploring Gender through the Language of Twitter
With the age of communication upon us, there is a wide variety of forms of communication: text messaging, Twitter, blogs, Facebook, Reddit, YouTube, Skype, and the like. Online social media sites allow for informal means of communication, which offers language specialists and computational analysts a vast amount of data readily available for large-scale statistical modeling. Whatever the goal of one’s research, there is a tacit assumption that the language we use is associated with, and influences, our perceptions of the world around us and how we display our identity within that world.
This research studies the relationship between gender and linguistic style in social media text. The study is supported by a dataset of microblog posts from the social media service Twitter, which allows users to post messages (“tweets”) of 140 characters or less. We use a medium-sized corpus of 20,000 tweets (a total of 341,442 words: 177,584 from women, 163,858 from men) from 42 individuals (21 women, 21 men), aged 20 to 24, collected over the course of several weeks (March 15 through April 12). We perform a computational analysis, using a classifier, to determine the influence of gender on linguistic style. We run two tests on the data: one that looks at all of the words in the tweets, and one that looks only at words largely considered “gendered words.” Since gender is difficult to study without considering other social variables, we chose Twitter users whom we know personally, so that we also know their socioeconomic backgrounds. That said, this research does not by any means cover a multicultural aspect of the use of language by women and men; rather, it is based on the use of language by white, American, college-aged persons, most of whom are either in college or just out of college.
In Language and Woman’s Place (LWP), Robin Lakoff describes nine features of language that she attributes to women, six of which are relevant to this research: a large stock of words related to ‘women’s work’, ‘empty’ adjectives, question intonation (use of a question mark instead of a period), hedges, hypercorrect grammar, and super-polite forms (euphemisms in place of profanity). In addition, the use of “expressive” (contextual) language will also be analyzed. The foil to the construction of women’s language is that of men’s, which Deborah Cameron describes as “competitive, hierarchically organized, centering on impersonal topics and the exchange of information, and foregrounding speech genres such as joking, trading insults and sports statistics.” From Lakoff, we can infer that men use profanity rather than euphemisms, and that their topics may include women and the consumption of alcohol. In opposition to women’s hypercorrect grammar would be the use of a vernacular rather than the standard, for example “gunna,” “u,” “ur,” and other such abbreviations found in online social media. From these descriptions, we constructed a single list of words considered “gendered” words, varied with respect to nouns, verbs, and adjectives, and combining the words of both genders. It comprises about 130 words and is implemented much as a stop word list might be.
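The gendered word list can be applied much like a stop word list in reverse: rather than removing the listed words, we keep only them. The sketch below illustrates the idea; the example words are drawn from the feature descriptions above and are illustrative only, not the full ~130-word list used in the study.

```python
# Illustrative subset of the gendered word list; the real list is ~130 words.
GENDERED_WORDS = {
    # female-associated (after Lakoff): hedges, 'empty' adjectives, euphemisms
    "sorta", "kinda", "maybe", "adorable", "lovely", "darn",
    # male-associated (after Cameron/Lakoff): profanity, vernacular spellings
    "damn", "hell", "gunna", "u", "ur",
}

def keep_gendered(tweet):
    """Return only the tokens of a tweet that appear on the gendered list."""
    return [w for w in tweet.lower().split() if w in GENDERED_WORDS]

print(keep_gendered("ur kinda adorable but damn that was a lovely game"))
# → ['ur', 'kinda', 'adorable', 'damn', 'lovely']
```

This is the inverse of ordinary stop word filtering, which would discard rather than retain the listed words.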
With these two tests, we want to see whether a classifier can correctly sort the tweets into the appropriate gender file, first using all of the words in the tweets, and then using just the words labeled as gendered. If the results are statistically significant and the tweets are sorted into their correct gender files, this will show, first, whether the theoretical underpinnings provided have any relevance for language in use, and second, how gender can be enacted depending on the context, the speaker’s stylistic choices, and the interactions between gender and other aspects of personal identity.
For the tweets collected in the database, we used a probabilistic classification method that allowed us to look at the text in an unbiased way. As native speakers of English, we have ideas of whether something said or written is masculine or feminine, often without our conscious awareness. However, judging the data with our own eyes can lead to biased results. So, how can we teach a program to make the same judgment without bias? What are the essential parts of a text that mark something as feminine or masculine? Is there a way to do this automatically? If so, what are the benefits and pitfalls of doing so?
How exactly does a system tell whether unstructured text is masculine or feminine? As stated earlier, we created a list of words that drew on Lakoff’s description of female language and Cameron’s description of male language. From here, we can count the number of “female” words in a text, compare it to the number of “male” words, and classify the text according to which number is higher. Since we had such a large number of tweets, we split the data into ten bins in order to estimate accuracy reliably. We used one bin as the test set (wherein gender labels are hidden) and the remaining nine as the training set, and we did this ten times, using a different bin as the test set each time so that, eventually, we tested on all of the data. The overall accuracy is the average across all ten runs. Using the Naive Bayes method, we can get a better idea of whether a person used male or female language.
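The ten-bin procedure described above is standard ten-fold cross-validation. A minimal sketch follows, assuming `tweets` is a list of (text, gender) pairs and that `train` and `classify` are the classifier routines; all names are illustrative, not the code actually used in the study.

```python
def ten_fold_accuracy(tweets, train, classify, k=10):
    """Average accuracy over k folds, each fold serving once as the test set."""
    bins = [tweets[i::k] for i in range(k)]       # split the data into k bins
    accuracies = []
    for i in range(k):
        test_set = bins[i]                        # one bin, labels hidden from the model
        training = [t for j, b in enumerate(bins) if j != i for t in b]
        model = train(training)                   # build the model on the other nine bins
        correct = sum(classify(model, text) == gender
                      for text, gender in test_set)
        accuracies.append(correct / len(test_set))
    return sum(accuracies) / k                    # overall accuracy = mean across folds
```

Because every bin is used exactly once as the test set, every tweet contributes to the final accuracy figure, as described above.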
With Naive Bayes we can make not only classifications but probabilistic classifications, which is one reason the method is advantageous. Another benefit of the Bayesian method is that, when given a training set, it immediately analyzes the data and builds a model from it (such classifiers are called eager learners). When it needs to classify something, it uses this internal model to do so, making classification a faster and less biased process.
Say we have a large corpus of tweets from both males and females, and we want to figure out whether these documents reflect male or female language. Documents that reflect a masculine style are tagged as male, and ones that reflect a feminine style are tagged as female. If we use Naive Bayes without any refinement of the text beforehand, we go through each document and find the probability that the text is female given that the first word is “____”, and we do that for every word in the document, which is a lot of probabilities to calculate. With refinement of the text, however: for each hypothesis, we combine the documents tagged with that hypothesis (male or female) into one text file; then we determine the unique set of words within each text file (the vocabulary); for each word in the vocabulary, we count how many times the word occurs in the text; finally, we compute the word’s probability. We ran two different tests on the data: the first with all of the words in place, and the second with only the gendered words in place. These tests give us a percentage for how accurately the tweets were sorted. We then entered the women’s and men’s tweets into the Online Text Comparator (http://guidetodatamining.com/ngramAnalyzer/comparator.php), which is available to anyone wishing to compare two texts, and compared them. What is important is that from this comparator we can use the log likelihood to identify the significant differences between the male and female corpora.
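The refinement steps above (combine the documents per hypothesis, build the vocabulary, count occurrences, compute word probabilities) can be sketched as follows. This is an illustrative reconstruction, not the program used in the study; add-one (Laplace) smoothing is assumed so that unseen words do not zero out a whole document, and class priors are omitted since the two groups here are nearly balanced.

```python
import math
from collections import Counter

def train(docs_by_gender):
    """docs_by_gender: {'F': [tweet, ...], 'M': [tweet, ...]}"""
    vocab = set()
    counts = {}
    for gender, docs in docs_by_gender.items():
        words = " ".join(docs).lower().split()   # one combined text per hypothesis
        counts[gender] = Counter(words)          # occurrence count for each word
        vocab |= set(words)                      # union forms the vocabulary
    return counts, vocab

def classify(model, tweet):
    counts, vocab = model
    scores = {}
    for gender, c in counts.items():
        total = sum(c.values())
        # sum of log P(word | gender) with add-one smoothing over the tweet
        scores[gender] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in tweet.lower().split()
        )
    return max(scores, key=scores.get)           # hypothesis with the higher score
```

Working in log space avoids the numeric underflow that multiplying many small probabilities would otherwise cause.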
Results and Analysis
CLASSIFICATION USING ALL WORDS
Accuracy: 66.86%
CLASSIFICATION USING GENDER WORD LIST
Accuracy: 86.22% (= average recall)
Average precision: 85.66%
The accuracy of these predictive models indicates a strong relationship between language and gender, and that this relationship is visible at the level of single words, which suggests that our bag-of-words model is a strong predictor of gender. Both results are significant, which tells us they are reliable and that the theoretical underpinnings have relevance for how one indexes gender within language. But to what degree do these predictive results license descriptive statements about the linguistic styles preferred by women and men? What can we infer about the percentage that was classified incorrectly?
Online Comparator Results
| Category                         | Previous Article | In Our Data     |
| Family terms                     | Mixed            | Not significant |
| CMC words (lol, omg)             | F                | F (weak)        |
| Conjunctions                     | F (weak)         | M (weak)        |
| Quantifiers                      | Not significant  | F (weak)        |
| Hesitation markers               | F                | M (weak)        |
| Misc. terms (sexual terms, food) | N/A              | F               |
There are many words on this list that females use and males do not use at all. This may be because females produced about 500 more tweets (roughly 10,000 more words) than males. So, when analyzing these results for the chart above, we excluded terms that females use and males do not and that have no relevance to the aggregates above; hence the category ‘Miscellaneous terms,’ which comprises sexual terms, food terms, slang, alcohol-related terms, and other words of interest. We also see this imbalance within the gendered word test: we included a rule that says ‘if a tweet has zero probability, classify it as female,’ rather than male. Because females are more likely to produce tweets with zero probability, meaning none of the words in the tweet are in our gender vocabulary, it is possible that females have a larger vocabulary than males.
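The significance judgments in the chart above come from the comparator’s log likelihood statistic. A minimal sketch follows, assuming the standard Dunning formulation of log likelihood for comparing a word’s frequency across two corpora; values above about 3.84 are significant at p < 0.05. The example counts are hypothetical, though the corpus sizes are the word totals reported earlier.

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning log likelihood (G2) for a word occurring a times in a corpus
    of c words and b times in a corpus of d words."""
    e1 = c * (a + b) / (c + d)   # expected count in corpus 1
    e2 = d * (a + b) / (c + d)   # expected count in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# Hypothetical example: a word used 40 times by women (177,584 words total)
# and 10 times by men (163,858 words total).
print(round(log_likelihood(40, 10, 177584, 163858), 2))  # → 16.94
```

The larger the statistic, the more confidently we can say the word’s frequency differs between the two corpora; it does not by itself say which corpus favors the word, so the observed counts must be inspected to assign the F or M label.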
Lexical Markers of Gender
All of the emotion-related terms detected by the online text comparator are associated with female authors: lust, love, trust, Happy, exciting, excitement, emotional, emotions, worry, shame, hate, mood, etc., with the exception of the words feel and happy (with a lowercase h), which are associated with male authors. Female markers include a large number of computer-mediated communication (CMC) terms like OMG, lol, idk, btw, hr (abbreviation for hour), bf, idky, hw, af, and numerous others associated with Twitter and other social media services. Of the family terms that are gender markers, we found only two words, brother and mom, both used only by females; thus, we counted these as insignificant. Consistent with family terms is the category of politeness terms (sorry, thanks, thx, please), all of which were used only by females. Backchannel sounds also appear as female markers (uk, ugh, Oh), as does expressive lengthening (nooooooooo, PLEASEEEE, uhhhhh, comfyyy, withhhh, mee). All of the clitics, except for I’m, I’ve, and can’t, appear as female markers. Both negation and assent terms appear as female markers, excluding no, which appears as a male marker. Finally, quantifiers, not counting some, little, much, or favorite, appear as female markers (i.e., most, all, best, least, rather, quite).
Female authors have an interesting mixture of words that are not used by male authors. These words include sexual terms (DILDOS, LESBIAN, COCK), food-related terms (pizza, food, foods, lunch, espresso, hunger, chicken, feed, nuggets), alcohol-related terms (drunkest), slang terms (yo, lyfe, Homeboys, grub, wanna, dancin, def), and various other terms that do not fit into any one category (Boromir, Gandalf, Waldorf, Blair, kitty, snuggle, Netflix, yoga).
Several of the male-associated terms are swear words, such as shit, damn, and hell. There are instances where females use swear words that males do not: fuck, FUCK, fucked, shittt, and bitch are all used by females. We can also see expressive lengthening in shittt, which we noted as a female marker. Male markers also include numbers (2, 3) and words possibly related to sports (first, 3rd). Oddly enough, hesitation terms mark the male gender within this set of data, specifically the ellipses. Finally, with a few exceptions, most of the conjunctions (but, Or, so) and articles (a, A, an, the, The, this, This, that) are male markers. In each category marked male, there are several instances of words used by females that males did not use at all.
Pronouns, numbers, and prepositions do not have a strong gender association in our data; rather, both genders seem to use them fairly evenly. Once again, however, there are several words in each of these categories that females use and males do not. Within pronouns, female markers include: I, you, mine, me, she, her (weak), we (weak), our, they, their (weak), my, My, and myself (weak); male markers include: I, your, its, his (weak), and it. Of the prepositions, female markers include: about, after, as, before (weak), between, during, except, for, in, into, outside, past, save, since, to, until, up, without, like (weak); male markers include: at, but, down, from, of, off, on (weak), over, through, with. Finally, as for numbers, male markers have already been stated; female markers include: 1st, two, five, 5, 12, 80, 21, second, triple, 26, and various other numbers having to do with time, weight, and money.
From the percentage that was classified incorrectly, we can see how standard analyses fail to capture important nuances of the relationship between language and gender. On the surface, these findings are generally in line with previous findings, with the exception of a few irregularities. We can assume the irregularities are due to our smaller corpus; other research groups have built corpora over periods of months and have gathered over 9 million tweets from more than 14,000 authors (Bamman et al.). From these findings, we can argue that female language is expressive, which is supported by lengthenings like nooooooooo. We could argue that male language is equally expressive if we count swear words, but women also use swear words. However, the rejection of some swear words by males may indicate a greater tendency toward standard or prestige language, in accordance with men’s rejection of CMC terms like lol and omg, and of slang terms like yo and lyfe. These results point to the need for a more nuanced analysis, perhaps even a case-by-case distinction. With this being said, we can assume that there are multiple ways that people express their gender identity. These results might be more conclusive if we knew the meaning of each individual word and applied it in the analysis; for some of these words (excluding function words, which mostly have only one meaning), it would have been helpful to know the sense in which they were being used.
With large- or medium-scale quantitative analysis of gender, it is easy to forget that social categories are built out of individual interactions, and that the category of gender cannot be separated from other aspects of identity. There is a relationship between gender and the given words, but if a female is talking about sports, does the mere act of talking about sports make her masculine? By starting with the assumption that there is a gender binary, our analysis is incapable of revealing violations of this assumption. Since this research focused mostly on predicting gender based on text, this approach was adequate. However, if we were to take this research a step further and begin to characterize the interaction between language and gender, it becomes a house of mirrors, which by design can only find evidence to support the underlying assumption of a binary gender opposition. Social media services like Twitter make social networks explicit, and researchers have shown that it is possible to make disarmingly accurate predictions of a number of personal attributes, but by assuming a binary gender, we blind ourselves to the more nuanced ways in which notions of gender affect social and linguistic behavior. If gender were truly a binary category, then individuals who do not use a gender-typical linguistic style would be statistical anomalies, outliers that are unavoidable in any large population. However, it seems a multidimensional model of gender would not be able to view language and social behavior as conditionally independent. Additionally, having done this research on a group of people who are (mostly) socially accepted by one another, we must ask: do these users employ similar language patterns to match those of their interlocutors?
This research argues, in line with Bamman et al., for a theory of gender not as a twofold category that is revealed by language, but rather as something that can be indexed by diverse linguistic resources, where the decision about how to use language to index gender depends on the context of the situation and the speaker’s audience.
Works Cited & Resources
Bamman, David, Jacob Eisenstein, and Tyler Schnoebelen. “Gender Identity and Lexical Variation in Social Media.” <http://arxiv.org/pdf/1210.4567v1.pdf>.
Cameron, Deborah. “Performing Gender Identity: Young Men’s Talk and the Construction of Heterosexual Masculinity.” (2011): 250-262.
Lakoff, Robin. Language and Woman’s Place. (1975): 39-102.
Online Text Comparator: http://guidetodatamining.com/ngramAnalyzer/comparator.php
Special thanks to: Prof. Zacharski for coding the program that helped in the analysis of this data.