I found this article extremely interesting, both because I have taken a course in lexicography and because it is so relevant to theoretical lexicography. As its title says, the authors discuss analyzing culture using language data from millions of digitized books. Starting from the digitized books available through Google Books, largely books from universities, the authors separated the data into different categories to discover how the language has changed, both grammatically and within the lexicon. The complete data set comprised about 2 billion trajectories. With computational analysis of the English language, we can see how it has grown (immensely) and changed over time, something that is exceptionally difficult, in fact nearly impossible, to do without computers. Attempting this by hand with this much data would take years, and by the time those years had passed the language would have gone on changing. This is the beauty of computational linguistics: we can finally analyze language without it changing drastically before we finish the analysis, and we can analyze huge amounts of data, which gives us a more accurate description of the language in use. Perhaps, then, we can discover whether the Sapir-Whorf hypothesis is accurate, and whether language itself is sexist, racist, ageist, and so on, or whether it is the language users who are. There is so much we can learn from a quantitative analysis of language. However, there are drawbacks too, which I discuss below.
I took some time to explore the Google Books Ngram Viewer described in the aforementioned article. I tried groups of three related words. The first group was “feminism, The Feminist Movement, feminist”; all three followed a fairly similar pattern, with feminist being the most prominent. The second group was “Buddhism, Buddha, contemplate,” which proved extremely interesting: Buddha and Buddhism followed very similar trajectories, whereas contemplate moved in the opposite direction. It was rather fascinating. What’s more, Buddhism and Buddha have seen a steady rise in English over the past several decades, with their highest points in the 1960s, 1970s, and 2000s, whereas contemplate has seen its lowest points in recent years. I tried this group again with meditate in place of contemplate and got similar results: meditate started out even lower than contemplate but didn’t decline quite as much. The next group was “Buddhism, neurology, psychology,” with psychology the most prevalent and neurology the least, which makes sense because neurology is a fairly new science as far as sciences go.
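To make the idea of a trajectory concrete, here is a minimal sketch of how one could be computed: for each year, count how often a word or phrase occurs and divide by the total number of words for that year. The function names, the whitespace tokenization, and the two-sentence toy corpus are all assumptions made up for illustration. This is not the Ngram Viewer's actual code or data.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trajectory(corpus_by_year, phrase):
    """Map each year to the relative frequency of `phrase` in that year's text.

    Relative frequency here is (occurrences of the phrase) / (total words
    in that year), using naive lowercase whitespace tokenization.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, text in sorted(corpus_by_year.items()):
        tokens = text.lower().split()
        counts = Counter(ngrams(tokens, n))
        total = len(tokens)
        result[year] = counts[target] / total if total else 0.0
    return result

# Hypothetical toy corpus: one short "book" per year, invented for this sketch.
toy_corpus = {
    1960: "the buddha taught and the monks would contemplate",
    2000: "buddhism and the buddha and the buddha again",
}

print(trajectory(toy_corpus, "buddha"))      # unigram trajectory
print(trajectory(toy_corpus, "the buddha"))  # bigram trajectory
```

Dividing by the yearly total matters because far more books are published in some years than in others. Raw counts would mostly reflect the size of the corpus, not how common the word is.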
Unigrams, and bigrams, and trigrams, oh my by Matthew Jockers
In this article, Jockers points out the drawbacks of the Google Ngram Viewer. He lays them out very nicely and brings up things that one might not realize just by looking at the Ngram Viewer. The Ngram Viewer draws on millions of digitized books, but we don’t know very much about these books besides their distribution over time. We don’t know the gender of the authors or whether male and female authors are evenly represented. We don’t know which genres the corpus is thick with, or whether all genres are evenly represented. These factors are important for determining whether or not our statistical analysis is verifiable. Sure, I can enter several groups of words as I did above, but without the metadata it is hard for me to draw meaningful conclusions about the English language. There is a lot of talk about language describing our culture, but how can it actually describe our culture if the conclusions we draw have no meaning? Using a corpus without metadata is not the right way to go about describing our culture directly. Another thing we cannot pinpoint exactly by looking at the Ngram Viewer is change. How does the Ngram Viewer account for linguistic shift? Are some n-grams written off as random mutations, or are there reasons for these anomalies? And if so, what do those anomalies tell us about our linguistic history? Jockers also questions the accuracy of the technology used to scan in the data (OCR, optical character recognition): how reliable are these readers? Do they account for the variations in fonts over time? And if they do, do they account for the fonts that were in use only briefly (“blips”) and not just the long-term consistencies? I would also like to know whether the fonts in use during a given period suggest anything about the culture and the language of that time.
And finally, how many of these books include the title page and publishing information, which we do not want to use in a description of our language? These are all issues that may affect the Ngram Viewer, and thus we should not take its trajectories at face value. We must instead look deeper and see the drawbacks, however marvelous a tool it is. Maybe with a bit of refinement, the Ngram Viewer can someday reach the precision and accuracy necessary to give us an objective representation of our language.
Chapter 2 of Foundations of Statistical Natural Language Processing
This chapter essentially goes over basic probability. It talks briefly about statistics and how this branch of math fits in with NLP. I don’t want to go into too much detail here because, having taken both a discrete mathematics course and a statistics course, I have already learned most of what I need to know. The most important point is that statistics and probability are necessary for describing our language and help us determine how our language has changed over time. The videos that I will be watching next week will go into more detail.
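As a small illustration of how basic probability connects to language data, the sketch below estimates a conditional probability, P(second word | first word), from bigram counts. The example sentence and the function name are invented for this sketch, and the relative-frequency estimate is just one simple choice. It is not an example taken from the chapter.

```python
from collections import Counter

def bigram_probability(tokens, first, second):
    """Estimate P(second | first) as count(first second) / count(first).

    count(first) only includes positions where `first` is followed by
    another token, so the estimates for a given `first` sum to 1.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])
    if first_counts[first] == 0:
        return 0.0  # `first` never starts a bigram; no estimate is possible
    return bigram_counts[(first, second)] / first_counts[first]

# Hypothetical example sentence, made up for illustration.
tokens = "the cat sat on the mat and the cat slept".split()

print(bigram_probability(tokens, "the", "cat"))
print(bigram_probability(tokens, "the", "mat"))
```

Here "the" is followed by another word three times, twice by "cat" and once by "mat", so the two estimates come out to two thirds and one third.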