Classification (into Semantic Categories)
Say we have a set of phrases and we want to sort them into different categories (e.g. dictionaries, artists, plants); how could one go about doing this? One approach is to memorize common parts of the phrases (e.g. if the category is dictionaries, a common word might be “dictionary” or “webster”), or common placements of letters within the word if the phrase is a single word. A well-rounded classifier would use all of these features to get the best categorization of phrases. It turns out that the ideas of compression and learning are closely related through information theory and the notion of the entropy, or information content, of an expression.
Specific Type of Classifier (Gzip)
If we have files containing corpora of text in different languages and we want to sort these files by language, we can look at each language and try to find common three-letter sequences (e.g. “is_” in English, where “_” denotes a space). We can do this by using a compression algorithm (like Gzip), which exploits exactly these kinds of repeated patterns.
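The idea can be sketched as a small compression-based classifier: a new text is assigned to whichever reference corpus compresses it most cheaply. The miniature “corpora” below are made-up stand-ins for illustration; a real setup would use large text files per language.

```python
import gzip

# Tiny stand-in "corpora" (an assumption for illustration; in practice
# these would be large text files, one per language or category).
corpora = {
    "english": b"the cat is on the mat and the dog is in the house " * 20,
    "spanish": b"el gato esta en la casa y el perro esta en el jardin " * 20,
}

def gzip_len(data: bytes) -> int:
    """Length of the gzip-compressed data: a rough proxy for information content."""
    return len(gzip.compress(data))

def classify(text: bytes) -> str:
    """Pick the corpus whose patterns compress the new text best.

    If `text` shares many substrings with a corpus, compressing them
    together costs little extra beyond compressing the corpus alone.
    """
    def extra_bytes(corpus: bytes) -> int:
        return gzip_len(corpus + b" " + text) - gzip_len(corpus)
    return min(corpora, key=lambda lang: extra_bytes(corpora[lang]))

print(classify(b"the dog is on the mat"))      # should pick "english"
print(classify(b"el perro esta en la casa"))   # should pick "spanish"
```

The “extra bytes” quantity is a crude stand-in for conditional information content: text that matches a corpus's common letter sequences compresses almost for free.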
What do we do with a language that strings words together without spaces (e.g. Chinese)? The best segmentation is the one that maximizes the joint probability of the words in the segmentation (see the video on Segmentation (#19) for the formula he uses; it helps with the explanation). We can approximate this joint probability using the Markov Assumption or Naive Bayes (see last week’s write-up), taking the best segmentation to be the one that maximizes the approximation. With this simplification, learning the segmentation is still feasible.
How many ways can a string of n letters be broken up into words? 2^(n-1), which is a lot if the string is long, and we would not want to list and score them all. So we need some way of searching through them efficiently without having to consider the likelihood of every possible segmentation. If we use the Naive Bayes assumption (each word is independent of the previous one), the probability of a segmentation factors into a product over its words, so we can score each candidate first word separately and reuse the best segmentation of the rest.
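The 2^(n-1) count comes from choosing, independently, whether or not to cut at each of the n-1 gaps between letters. A quick enumeration confirms it for small n:

```python
from itertools import combinations

def segmentations(s: str):
    """All ways to break s into words: choose a subset of the n-1 gaps to cut."""
    n = len(s)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            pieces, prev = [], 0
            for c in cuts:
                pieces.append(s[prev:c])
                prev = c
            pieces.append(s[prev:])
            yield pieces

print(len(list(segmentations("abcd"))))  # 2^(4-1) = 8
```

For an 8-letter string like “yeswecan” that is already 2^7 = 128 segmentations, and it doubles with every added letter, which is why brute-force enumeration is off the table.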
We have a string of letters “yeswecan” and we need to figure out the best way to split these up.
- The first word could be: “y” “ye” “yes” “yesw” “yeswe” (or so on)
- The rest would be: “eswecan” “swecan” “wecan” “ecan” (etc)
- Then we compute the probability of each candidate first word (estimated from the corpus by counting and then smoothing), which tells us how likely each of these first words is.
- Multiply P(first) by the probability of the best segmentation of the rest, and we find that “yes” is the best first word, with “we” “can” as the best segmentation of the rest.
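The steps above can be sketched in a few lines, in the spirit of Norvig’s segmenter. The unigram counts here are toy numbers I made up for illustration; his version uses counts from a large web corpus, and the unknown-word penalty is one simple smoothing choice among many.

```python
from functools import lru_cache

# Toy unigram counts standing in for a real corpus (an assumption for
# illustration; a real segmenter would count words in a large corpus).
COUNTS = {"yes": 500, "we": 2000, "can": 1500, "ye": 5, "swe": 1, "y": 2}
TOTAL = sum(COUNTS.values())

def p_word(w: str) -> float:
    """Smoothed unigram probability; unseen words are penalized by length."""
    if w in COUNTS:
        return COUNTS[w] / TOTAL
    return 10.0 / (TOTAL * 10 ** len(w))

def prob(words) -> float:
    """Naive Bayes: a segmentation's probability is the product of its words'."""
    p = 1.0
    for w in words:
        p *= p_word(w)
    return p

@lru_cache(maxsize=None)
def segment(text: str):
    """Best segmentation = argmax over first words of P(first) * P(best rest)."""
    if not text:
        return ()
    splits = ((text[:i], text[i:]) for i in range(1, len(text) + 1))
    return max(((first,) + segment(rest) for first, rest in splits), key=prob)

print(segment("yeswecan"))  # ('yes', 'we', 'can')
```

Memoizing `segment` is what makes this efficient: each suffix of the string is solved once, so the search is quadratic-ish in the string length rather than exponential.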
It is interesting, too, that the code for this process is not much more complicated than the mathematical formula (to supplement this, I looked at Norvig’s actual Python code to see how simple it is). Other things to note: with a larger corpus it might be helpful to use the Markov Assumption rather than Naive Bayes; and if the string contains uncommon words that could be broken down into more common words, it might be necessary to obtain more data and change the smoothing technique.
If we are given a word that is possibly misspelled, how do we figure out the best correction? The analysis is similar to the segmentation analysis. We find the probability of the observed word given a candidate correction (we may be able to get these from a list of spelling corrections) times the probability of the correction itself (which we get from document counts). Spelling-correction data may be difficult to come by, but we might be able to gather it from sites that deal with spelling correction. Even then, we may not have enough data to cover all possible misspellings. It may be more beneficial to model letter-to-letter errors rather than word-to-word errors; that way, it is easier to build up a probability table from a smaller amount of data.
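A minimal sketch of this, again in the spirit of Norvig’s spell corrector: generate every string one letter-edit away, keep those that are real words, and pick the most probable. The vocabulary counts are made-up toy numbers, and the channel model here is the crudest possible one (every single-edit error equally likely).

```python
import string

# Toy vocabulary counts (an assumption for illustration; a real corrector
# would use word counts from a large corpus).
COUNTS = {"the": 80000, "they": 5000, "then": 4000, "hello": 700}

def edits1(word: str):
    """All strings one edit away: deletions, transpositions, replacements, insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word: str) -> str:
    """argmax P(correction) * P(word | correction), with a crude channel model
    that treats every single-edit error as equally likely."""
    candidates = {w for w in edits1(word) if w in COUNTS}
    if word in COUNTS:
        candidates.add(word)
    if not candidates:
        return word  # no known word within one edit; give up
    return max(candidates, key=lambda w: COUNTS.get(w, 0))

print(correct("thw"))  # 'the'
```

This is exactly the “letter-to-letter errors” point above: a handful of edit operations covers an enormous space of possible misspellings without needing word-to-word error data.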
Ch. 5 Summary of Foundations of Statistical NLP (M&S) “Collocations”
A collocation is an expression that consists of 2 or more words that correspond to some conventional way of saying things, such as: noun phrases, phrasal verbs, and other stock phrases. Collocations are characterized by limited compositionality, meaning the expression as a whole cannot be fully predicted from the meanings of its parts. Idioms are the most extreme case: the expression as a whole is given an entirely different meaning, and the parts do not retain their individual meanings; idioms have only an indirect historical relationship to the meanings of their parts. Most collocations, however, are a systematic composition of their parts with some added element of meaning. There are a number of ways we can find collocations: selection of collocations by frequency, selection based on the mean and variance of the distance between a focal word and a collocating word, hypothesis testing, and mutual information.
- Frequency: we can count within a corpus the number of times two (or three, or four) words occur together. If they occur together a lot, then perhaps that is evidence that they have a special function that is not simply explained by their combination. With bigrams, we might see many high-frequency pairs that are in fact just function words (e.g. “of the”, “in the”, “to be”). To improve this, there is a simple heuristic (proposed by Justeson and Katz 1995b): pass the candidates through a part-of-speech filter which only lets through those patterns that are likely to be phrases. Frequency techniques work well for fixed phrases.
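A toy version of frequency counting with a Justeson–Katz-style tag filter might look like this. The tagged text and the tag set are made-up assumptions for illustration; a real system would use a large corpus and an automatic part-of-speech tagger, and the full filter admits more patterns than the two shown here.

```python
from collections import Counter

# Toy part-of-speech-tagged text (an assumption for illustration).
tagged = [
    ("of", "P"), ("the", "DET"), ("new", "ADJ"), ("york", "N"),
    ("times", "N"), ("in", "P"), ("the", "DET"), ("new", "ADJ"),
    ("york", "N"), ("stock", "N"), ("exchange", "N"),
]

# Keep only tag patterns likely to be phrases (adjective-noun, noun-noun).
ALLOWED = {("ADJ", "N"), ("N", "N")}

bigrams = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in ALLOWED:
        bigrams[w1, w2] += 1

print(bigrams.most_common())
```

Note how “of the” and “in the” never even enter the counts: the filter removes the function-word pairs that would otherwise dominate any raw frequency list.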
- Mean and Variance: many collocations consist of two words that stand in a more flexible relationship to one another (e.g. “knock” and “door”), so the distance between them varies. However, there is enough regularity to let us determine the right word to use. The method looks at the pattern of offsets between the two words: if the pattern is relatively predictable, we have evidence of a collocation. The mean is the average offset; the variance measures how much the individual offsets deviate from the mean. Thus we can discover collocations by looking for word pairs with low deviation. This method runs into trouble when high frequency and low variance are accidental.
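Concretely, given the offsets observed between a focal word and a candidate collocate, the two statistics are just the sample mean and sample variance. The offset list below is toy data I invented for illustration:

```python
# Offsets (in words) between a focal word and a candidate collocate,
# e.g. positions of "door" relative to "knock" in several sentences
# (toy data, assumed for illustration).
offsets = [3, 3, 4, 3, 5, 3]

mean = sum(offsets) / len(offsets)
# Sample variance: average squared deviation from the mean.
variance = sum((d - mean) ** 2 for d in offsets) / (len(offsets) - 1)

print(round(mean, 2), round(variance, 2))  # 3.5 0.7
```

A pair like this, with a tight spread around a mean offset of about 3, is the kind of flexible-but-regular pattern the method flags as a collocation; a pair whose offsets are scattered widely would have a large variance and be discarded.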
- Hypothesis Testing: this method finds collocations by first formulating a null hypothesis which states what should be true if two words do not form a collocation: we assume the two words are generated completely independently of one another, so the probability of co-occurrence is just the product of the probabilities of the individual words. Then we apply one of two tests: the t test or the chi-squared test. Both do roughly the same thing, but the chi-squared test does not assume that probabilities are approximately normally distributed, making it more accurate when probabilities are large.
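For a bigram (w1, w2) the chi-squared test works on a 2x2 contingency table of how often each word does and does not appear next to the other. The counts below are toy numbers chosen for illustration; the closed-form statistic for a 2x2 table is the one M&S use.

```python
# 2x2 contingency table for a bigram (w1, w2), toy counts (an assumption):
#                 w2 follows    w2 absent
#   w1 present    o11 = 20      o12 = 180
#   w1 absent     o21 = 80      o22 = 9720
o11, o12, o21, o22 = 20, 180, 80, 9720
n = o11 + o12 + o21 + o22

# Chi-squared statistic for a 2x2 table (closed form):
chi2 = n * (o11 * o22 - o12 * o21) ** 2 / (
    (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
)

# Compare against the 5% critical value for 1 degree of freedom (3.841).
print(round(chi2, 2), chi2 > 3.841)  # 166.98 True
```

Here the statistic far exceeds the critical value, so we would reject the independence null hypothesis and take the pair as a collocation candidate.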
- Mutual Information: this is roughly a measure of how much one word tells us about the other. If we look at the position of a word (e.g. Albert), the mutual information measure could tell us that the amount of information we have about the occurrence of Albert at position i in the corpus increases by a certain number of bits if we are told that Einstein occurs at position i + 1.
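That “certain number of bits” is the pointwise mutual information, I(x, y) = log2( P(x, y) / (P(x) P(y)) ). A small worked computation with toy corpus statistics (all counts below are assumed for illustration):

```python
import math

# Toy corpus statistics (an assumption for illustration).
n = 1_000_000        # total tokens in the corpus
count_albert = 200   # occurrences of "albert"
count_einstein = 150 # occurrences of "einstein"
count_pair = 90      # occurrences of the bigram "albert einstein"

p_albert = count_albert / n
p_einstein = count_einstein / n
p_pair = count_pair / n

# Pointwise mutual information, in bits:
# I(x, y) = log2( P(x, y) / (P(x) * P(y)) )
pmi = math.log2(p_pair / (p_albert * p_einstein))

print(round(pmi, 2))  # 11.55
```

Seeing “einstein” at position i + 1 raises our information about “albert” at position i by about 11.5 bits, since the pair co-occurs vastly more often than independence would predict.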
The notion of a collocation can be a bit confusing, especially when considering all of these different means of finding collocations. In fact, many different authors have different definitions of the notion of a collocation. M&S say authors in the computational and statistical literature define a collocation as “two or more consecutive words with a special behavior” (pg 183, M&S). Others say that the words do not need to be consecutive and rely instead on non-compositionality, non-substitutability, and non-modifiability. A good way to test whether a combination of words is a collocation is to translate it into another language: if you can’t translate it, then that is evidence that we are dealing with a collocation.