Text mining feature selection with TF-IDF

TF-IDF is pretty handy in the field of text mining even though it’s from the field of information retrieval (think search engines – Google or Bing).  At its core, a search engine just takes the phrase you entered and tried to figure out which document you want back.

TF-IDF is made up of two parts:

  • TF:  Term Frequency
  • IDF:  Inverse Document Frequency

The TF part is easy enough – if a word is in a document 50 times and there are 1000 terms in the entire document the TF = 50/1000 = 0.05.  If another document has the same term in it 3 times and there are 500 words TF = 3/500 = 0.006.  TF is a measure of how important a word is within a body of text.

Well who cares?  Let’s say you’re searching for a information about “agaves”.  And your search turns up two documents – one with a TF of 0.05 and one of 0.006.  You’d probably want the document with the higher TF at the top of the list since “agaves” is more important to the entire document so it’s more likely to be listed above “agaves” in your search results.

Then the next piece, IDF, is only slightly more difficult.  Before showing the math I’ll show what it’s for.

Let’s say you have 100 emails.  Of these emails 50 are spam and 50 are from friends.

It’d be nice if there was a metric to find words which might be useful to differentiate between the two types of email.

Such a metric would have a lower value on any terms that were in all the emails – like your name for example because a word that’s in all the emails will obviously be of no use in helping you decide which are valid emails and which are spam.  Likewise, this metric would also put a higher value on any terms that were in only some of the emails as these might be useful for figuring out which is which.

In case you haven’t guessed yet, IDF is that metric:

IDF = log(Total number of documents / number of documents containing term)

Don’t let the log function scare you – it has a couple of useful properties here.  Like that the log of 1 is zero.  So in the email example above, if your name was in every document its IDF score would be log(100/100) = 0.  Any terms which are only in one of the documents would have an IDF of log(100/1) = 4.60.  So a term with a high IDF might be useful here.  A term that was in half the documents would be log(100/50) = 0.69.

Another property of the log function is that the log of a number greater than one is itself always greater than one.   This means the IDF is always greater than one and you could add the IDFs for a bunch of words together to come up with the total IDF for a phrase.   Are you starting to see how this might be useful to a search engine?  A search engine wants to quickly figure out which words are the most useful for helping it return relevant results.

A side note on a related topic – stop words or stoplists.  Stop words are words like “the” or “and” which are generally not very good at differentiating between documents because they’re probably in most of them anyway – they also don’t have much meaning.  It’s kind of like a list of words where we automatically assume the IDF is always equal to 0 and we want to save time since calculating the IDF takes time.

So to borrow a point from Bilisoly (2008) any words with an IDF = 0 (because they’re in every document) are effectively put on a stoplist.  IDF is useful creating a problem specific stop list.

The final calculation for TF-IDF is to multiply TF * IDF (there are actually several ways to combine TF and IDF, but multiplying them is simple).

A few notes to remember:

  • TF is a measure of a term’s relevance within a document
  • IDF is a measure of a term’s relevance within a group of documents
  • TF-IDF combines TF and IDF and is a measure of a term’s relevance within each document when compared to all the other documents.

This process of deciding which words to use in a text mining problem is called feature selection.  Which features of the document are the most useful in helping me differentiate between the different classes of document?  In some fields, like genomics and text mining, feature selection is often the greatest challenge of creating a predictive model.  Once you’ve created a useful, efficient set of markers, the model is done.

Finally, in the references there’s a paper “Introduction to variable and feature selection” (Guyon and Elisseeff, 2003) which I thought was great and easy to understand.  One of my favorite things about this paper is the “feature selection check list”.


One Response to Text mining feature selection with TF-IDF

  1. sap says:

    Wonderful write- up.I was struggling to search to understand this concept of text mining and more specifically tf and idf ,feature selection but I truly understood.thanks so much.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: