A Glimpse at the Guide to Data Mining


Ch. 6 “Naive Bayes and Unstructured Text”

This chapter of Zacharski’s “A Programmer’s Guide to Data Mining” deals with classifying text as positive or negative: does the person commenting like or dislike the item they are commenting on? As native speakers of English, we can read a comment and tell whether the writer liked something or not, but how can we teach a program to do the same? What are the essential parts of a text that mark it as positive or negative feedback? Is there a way to detect them automatically? If so, what are the pros and cons of doing so, and why would we want to? Answers to these questions can be found in this chapter of Zacharski’s book.

If your company has just developed a new product and you want to know what people are saying about it, an automatic system that can read text and tell whether it is positive or negative could be quite informative and helpful. So how exactly does such a system tell whether unstructured text is positive or negative? We can create two lists: one of words that people use when they dislike something (gross, bad, terrible, hate, sucks, etc.) and one of words that people use when they like something (like, love, delicious, awesome, good, etc.). We can then count the like words in a text, compare that count to the number of dislike words, and classify the text according to whichever count is higher. This method has flaws, however: counting like and dislike words in a large text could take a long time, and some of the dislike words may be used in a colloquial sense rather than their literal one (e.g. “bad” used in slang as a compliment, meaning tough or impressive). Using the naive Bayes method, we can get a better idea of whether a person liked a product or not.
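The word-list approach described above can be sketched in a few lines of Python. This is only an illustration: the two sets reuse the example words from the paragraph, and the function name `count_based_label` and the "undecided" tie result are assumptions of this sketch, not something from the book.

```python
import re

# Example word lists taken from the paragraph above (far from complete).
LIKE_WORDS = {"like", "love", "delicious", "awesome", "good"}
DISLIKE_WORDS = {"gross", "bad", "terrible", "hate", "sucks"}


def count_based_label(text):
    """Label a text by comparing counts of like words and dislike words.

    Hypothetical helper for illustration; ties are reported as "undecided".
    """
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    likes = sum(1 for w in words if w in LIKE_WORDS)
    dislikes = sum(1 for w in words if w in DISLIKE_WORDS)
    if likes > dislikes:
        return "like"
    if dislikes > likes:
        return "dislike"
    return "undecided"


print(count_based_label("I love this, it is awesome"))
```

Note that this sketch has exactly the weakness mentioned above: a slang use of “bad” as praise would still be counted as a dislike word.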

With the naive Bayes method, we can not only make classifications but also make probabilistic classifications, which is one advantage of the method. Another benefit is that a Bayesian classifier is an eager learner: when given a training set, it immediately analyzes the data and builds a model from it. When it later needs to classify something, it uses this internal model, which makes classification faster.

Say we have a large corpus of documents, and we want to figure out whether these documents are positive or negative reviews of a product. Documents that reflect positive feedback are tagged as like, and ones that reflect negative feedback are tagged as dislike. If we used naive Bayes without any refinement of the text beforehand, we would have to go through each document and find the probability that the text is positive given that the first word is ____, and we would have to do that for every word in the document, which is a whole lot of probabilities to calculate. There is a solution, however. For each hypothesis (like or dislike), we can combine the documents tagged with that hypothesis into one text file. Then we can determine the unique set of words within each text file (the vocabulary). For each word in the vocabulary, we can count how many times the word occurs in the text. Finally, we can compute the word’s probability from those counts. This can still leave us with some issues, though. English, like any language, has a lot of function words (stop words), and removing such words can improve the performance of a system even more. There are dangers in throwing out the stop words, however: sometimes doing so can change the meaning of a phrase or, in extreme cases, delete it altogether. The removal of stop words should be carefully considered before it is implemented.
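The training steps above, together with the probabilistic classification mentioned earlier, can be sketched roughly as follows. This is a minimal sketch rather than the book's code, and it makes several assumptions: the function names (`tokenize`, `train`, `classify`) and the tiny `STOP_WORDS` set are made up for illustration, the vocabulary is taken as the union of the words seen under both hypotheses, add-one smoothing keeps unseen words from producing a zero probability, and log probabilities are summed to avoid underflow on long texts.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "i", "of", "to"}


def tokenize(text, remove_stop_words=False):
    """Split text into lowercase words, optionally dropping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words


def train(labeled_docs, remove_stop_words=False):
    """Build a model from (text, label) pairs such as ("great", "like")."""
    counts = {}             # label -> Counter of word occurrences
    doc_totals = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        # Pooling the counts per label is like merging the documents
        # tagged with one hypothesis into a single text file.
        counts.setdefault(label, Counter()).update(
            tokenize(text, remove_stop_words))
        doc_totals[label] += 1
    vocab = set()
    for word_counts in counts.values():
        vocab.update(word_counts)
    n_docs = sum(doc_totals.values())
    priors, word_probs = {}, {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        priors[label] = doc_totals[label] / n_docs
        # Add-one smoothing (assumption): P(word | label).
        word_probs[label] = {
            w: (word_counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return {"vocab": vocab, "priors": priors, "word_probs": word_probs,
            "remove_stop_words": remove_stop_words}


def classify(model, text):
    """Return (best_label, {label: probability}) for a new text."""
    words = [w for w in tokenize(text, model["remove_stop_words"])
             if w in model["vocab"]]  # words never seen in training are skipped
    log_scores = {
        label: math.log(prior)
        + sum(math.log(model["word_probs"][label][w]) for w in words)
        for label, prior in model["priors"].items()}
    # Turn the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    probs = {label: w / norm for label, w in weights.items()}
    return max(probs, key=probs.get), probs


# Made-up training data, for illustration only.
training = [
    ("I love this product it is awesome", "like"),
    ("delicious and good", "like"),
    ("I hate it it is terrible", "dislike"),
    ("gross and bad", "dislike"),
]
model = train(training)
print(classify(model, "awesome I love it"))
```

Because the model is built once in `train`, classifying a new text only requires looking up stored probabilities, which is the eager-learner behavior described earlier. Passing `remove_stop_words=True` to `train` lets you compare results with and without stop words before committing to removing them.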

So many things are possible with classification. We can personalize things with it (classifying news articles as ones we like to read or don’t care to read, songs as ones we like to listen to or don’t care to listen to, etc.). Classification is how most email systems categorize email as spam or not spam. Programs have been created that can grade essays so that teachers don’t have to. There are systems that identify the author of a document. All we need is a training set to train a naive Bayes classifier on, and then we can use that classifier for many of the mundane tasks that we hate to do ourselves.

 
