Text mining feature selection with TF-IDF

TF-IDF is pretty handy in the field of text mining even though it comes from the field of information retrieval (think search engines – Google or Bing).  At its core, a search engine just takes the phrase you entered and tries to figure out which document you want back.

TF-IDF is made up of two parts:

  • TF:  Term Frequency
  • IDF:  Inverse Document Frequency

The TF part is easy enough – if a word appears in a document 50 times and there are 1000 terms in the entire document, then TF = 50/1000 = 0.05.  If another document has the same term in it 3 times and there are 500 words, TF = 3/500 = 0.006.  TF is a measure of how important a word is within a body of text.
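
Just to make the arithmetic concrete, here’s the same calculation in R – nothing fancy, just the numbers from the example above:

#-- Term frequency for the two example documents
tf_doc1 = 50 / 1000   # term appears 50 times in a 1000-term document
tf_doc2 = 3 / 500     # term appears 3 times in a 500-word document
tf_doc1               # 0.05
tf_doc2               # 0.006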

Well, who cares?  Let’s say you’re searching for information about “agaves” and your search turns up two documents – one with a TF of 0.05 and one with a TF of 0.006.  You’d probably want the document with the higher TF at the top of the list, since “agaves” is more important to that document and it’s more likely to be the one you’re after.

Then the next piece, IDF, is only slightly more difficult.  Before showing the math I’ll show what it’s for.

Let’s say you have 100 emails.  Of these emails 50 are spam and 50 are from friends.

It’d be nice if there was a metric to find words which might be useful to differentiate between the two types of email.

Such a metric would put a lower value on any term that’s in all the emails – your name, for example – because a word that’s in all the emails will obviously be of no use in helping you decide which are valid emails and which are spam.  Likewise, this metric would put a higher value on any terms that were in only some of the emails, as these might be useful for figuring out which is which.

In case you haven’t guessed yet, IDF is that metric:

IDF = log(Total number of documents / number of documents containing term)

Don’t let the log function scare you – it has a couple of useful properties here, like the fact that the log of 1 is zero.  So in the email example above, if your name was in every document its IDF score would be log(100/100) = 0.  Any term that’s in only one of the documents would have an IDF of log(100/1) = 4.60, so a term with a high IDF might be useful here.  A term that was in half the documents would have an IDF of log(100/50) = 0.69.
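
Here’s a quick check of those numbers in R (R’s log() is the natural log, which is what I’m using here):

#-- IDF for the email example: 100 emails total
idf = function(totalDocs, docsWithTerm) log(totalDocs / docsWithTerm)

idf(100, 100)   # term in every email (like your name):  0
idf(100, 1)     # term in only one email:                4.60517
idf(100, 50)    # term in half the emails:               0.6931472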

Another property of the log function is that the log of a number greater than or equal to one is never negative.  This means the IDF is never negative, so you could add the IDFs for a bunch of words together to come up with a total IDF for a phrase.  Are you starting to see how this might be useful to a search engine?  A search engine wants to quickly figure out which words are the most useful for helping it return relevant results.
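
For example, here’s a sketch of scoring a three-word query by adding up per-term IDFs – the document counts are completely made up, it’s just to show the idea:

#-- Score a query by summing per-term IDFs (made-up document counts)
totalDocs = 100
docsWithTerm = c(the = 95, agave = 3, nectar = 8)
idfPerTerm = log(totalDocs / docsWithTerm)
idfPerTerm        # "the" contributes almost nothing; "agave" contributes the most
sum(idfPerTerm)   # a rough measure of how informative the whole phrase is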

A side note on a related topic – stop words, or stoplists.  Stop words are words like “the” or “and” which are generally not very good at differentiating between documents because they’re probably in most of them anyway – they also don’t carry much meaning.  A stoplist is essentially a list of words we assume have an IDF of 0, so we can skip them up front and save the time of calculating it.

So to borrow a point from Bilisoly (2008), any word with an IDF of 0 (because it’s in every document) is effectively put on a stoplist.  This makes IDF useful for creating a problem-specific stop list.

The final calculation for TF-IDF is to multiply TF * IDF (there are actually several ways to combine TF and IDF, but multiplying them is simple).
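
To pull it all together, here’s a minimal sketch using a made-up three-document, three-term count matrix.  The counts are invented, but it shows how the pieces combine – and how a term that’s in every document (like “the”) ends up with a TF-IDF of 0, just like a stop word:

#-- TF-IDF on a tiny, made-up term-document count matrix (rows = documents)
counts = matrix(c(50, 3, 0,
                   2, 2, 2,
                   0, 5, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("agave", "the", "nectar")))

tf = counts / rowSums(counts)                   # term frequency within each document
idf = log(nrow(counts) / colSums(counts > 0))   # inverse document frequency per term
tfidf = sweep(tf, 2, idf, "*")                  # TF * IDF
round(tfidf, 3)                                 # the "the" column is all zeros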

A few notes to remember:

  • TF is a measure of a term’s relevance within a document
  • IDF is a measure of a term’s relevance within a group of documents
  • TF-IDF combines TF and IDF and is a measure of a term’s relevance within each document when compared to all the other documents.

This process of deciding which words to use in a text mining problem is called feature selection.  Which features of the document are the most useful in helping me differentiate between the different classes of document?  In some fields, like genomics and text mining, feature selection is often the greatest challenge in creating a predictive model.  Once you’ve found a useful, efficient set of markers, most of the hard work is done.

Finally, in the references there’s a paper, “An Introduction to Variable and Feature Selection” (Guyon and Elisseeff, 2003), which I thought was great and easy to understand.  One of my favorite things about this paper is the “feature selection check list”.

DM 101: Intro to Data Mining

There are lots of books and kits about doing science experiments with kids – things like making a stink bomb out of a ping pong ball or a papier-mâché volcano.  I also noticed my kids (2nd and 4th graders) are already bringing home math homework with probability and statistics, and I’ve, umm, been learning quite a bit.

So I’m starting this “DM 101” series (for “Data Mining 101” in case it’s not obvious) to be kind of like a “365 Chemistry Experiments” except it’s with data, spreadsheets, and computers.

Kids love computers – laptops, Wii’s, DSi’s, cell phones.  My son wants a cell phone not to call people but so he can check the radar.  And to love computers is to love data science.  Watch your kids the next time they’re playing Pokemon – they’re analyzing strength and weakness ratings, and running thousands of epochs on their Personal Neural Nets (PNN) analyzing attack probabilities.  Reminds me of the countless hours I spent playing Dungeons and Dragons [wiping away tear].

So my goals in DM 101 are:

  • Use real data – sports, astronomy, weather
  • Show how to get and prepare the data you’ll be mining – depending on who you talk to, data mining is anywhere from 50-90% data preparation
  • Make decisions and predictions with this data and see if they come true
  • Explain terminology – like “epoch” above: in data mining an epoch is one full pass through the training examples, and you usually need a lot of them to train a neural net (artificial or personal)

Through marriage I am blessed with many non-technical friends.  They are copywriters, book editors, marketing guys, and theater types.  They generally use Macs and kick my butt in games like Scrabble.  They will also be the target audience.  If they can understand it, then anyone can….

Which brings me to one final note – most of the software I use will be open source.  I say most because I’m probably going to use Microsoft Excel.  I’ve used OpenOffice’s spreadsheet and it’s great – use that if you don’t have or want Excel.    Other than that I’ll just be using programming languages like R, Perl, and Python.

Good luck…

R implementation of Stepwise Hybrid Feature Selection

From Dziuda’s Data Mining for Genomics and Proteomics

I had mentioned doing this while taking the Genomics class and several people expressed interest in seeing it.  I was going to do it in some other language, but since taking Linear Models I’m much more comfortable in R.  Also, I found someone’s implementation of the T2 function so I modified it (website is in the notes).

The code contains comments and call examples.  There’s also another function at the bottom, WriteT2Values(), which writes a file of the T2 value for each column.  Per the note at the bottom of p. 147, instead of selecting any random variable to start with, it’s probably a better idea to choose from, say, the top 1000.
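
For reference, running it end-to-end looks roughly like this.  The dataframe and file paths are just placeholders based on the call examples in the code comments (the output path matches the one hard-coded in selectRandomVar()):

#-- Rough end-to-end usage (placeholder data and paths)
poe = read.csv("C:\\poe-term-counts.csv")   # placeholder: your own term-frequency dataframe
WriteT2Values(poe, c(1,2), classCondition=quote(author=='Doyle'), "C:\\out-T2.csv")
markers = SHFS(TrainingData = poe, stop_p = 5, stop_T2 = 10000,
               excludeColumns=c(1,2), classCondition=quote(author=='Doyle'))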

I’ve been playing around with it for text mining – sometimes solve() errors out because the matrix is singular (I’m assuming that’s because the matrices in text mining are much sparser than in genomics).
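
One workaround I’ve been considering (it’s not in the code below, and it’s not from the book) is to fall back on the Moore-Penrose pseudoinverse when the pooled covariance matrix is singular – something like this hypothetical safeInverse() helper, which assumes the MASS package is installed:

#-- Hypothetical helper: invert C, falling back on MASS::ginv() if solve() fails
safeInverse = function(C) {
	tryCatch(solve(C), error = function(e) MASS::ginv(C))
}
#-- In tsquare(), "Cinv <- solve(C)" would then become "Cinv <- safeInverse(C)"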

I would appreciate any feedback or suggestions for improvement.

The code:


#Stepwise Hybrid Feature Selection
SHFS = function(TrainingData, stop_p, stop_T2, excludeColumns, classCondition, randomFlag=TRUE, J=2) {
#-- Input
#-- TrainingData
#-- J:  number of classes, default=2
#-- stop_p:  Stop when the number of selected variables exceeds stop_p
#-- stop_T2: Stop when T2 exceeds stop_T2
#-- randomFlag:  TRUE or FALSE  - Starting point flag; default=TRUE
#-- excludeColumns:  columns to exclude from consideration, like id or class indicator
#-- classCondition:  a condition which differentiates the classes, e.g. quote(Hyp_Ind=='true')
#-- Call:
#--  SHFS(TrainingData = poe, stop_p = 5, stop_T2 = 10000, excludeColumns=c(1,2), 
#--         classCondition=quote(author=='Doyle'))

#-- Possible speed gains (?):
#-- Make tsquare use global TrainingData
#-- get rid of printing



N = length(TrainingData[,1])

pool <<- seq(1, ncol(TrainingData))
pool <<- pool[-excludeColumns]
poolSize = length(pool)
p = length(pool)

currentSet <<- NULL


markerSize = 0
currentT2 = 0.0

step = 1

cat("Starting : N=", N, ", J=", J, ", p=", p, ", stop_T2 =", stop_T2, ", stop_p =", stop_p,  ", markerSize =", markerSize, "\n")

while ( markerSize < stop_p  && currentT2 < stop_T2 && (markerSize < (N-J-2) || (N-J-2 ==0) ) && markerSize < p ) {
maxGain = 0.0
markerSize = markerSize + 1
if (markerSize == 1 && randomFlag == TRUE) {
	# first time thru loop and caller wants to start with a random variable 
		#-- Comment/Uncomment these lines depending on 
		#-- which method of selecting a random var you want to use
		#selectedVar = sample(pool, 1)   # pick any random variable from the pool
		selectedVar = selectRandomVar(1000)
		cat("selectedVar = ", selectedVar, "\n")
		
	} else {
	# Forward selection:  add the variable that maximizes T^2 of this step
	for (i in pool) {
		#-- call tsquare with currentSet plus i appended onto the end
		deltaT2 = tsquare(c(currentSet, i), TrainingData, classCondition) - currentT2
		if (deltaT2 >= maxGain) {
			maxGain = deltaT2
			selectedVar = i 
		}
	
	} #for (i in pool)
} #if (markerSize...)
currentT2 = currentT2 + maxGain
poolToMarker(selectedVar)
poolSize = poolSize - 1


minLoss = currentT2
#-- Backward optimization:  if elimination of any variable results in T2 greater
#--  than that of previous step, eliminate one that minimally decreases T2
if (markerSize > 2) {
	for (i in currentSet) {
		#-- call tsquare with currentSet without i
		deltaT2 = currentT2 - tsquare( currentSet[currentSet != i], TrainingData, classCondition )
		if (deltaT2 <= minLoss ) {
			minLoss = deltaT2
			removedVar = i
		}
	
	}
	
	if (minLoss < maxGain) {
		markerToPool(removedVar)
		poolSize = poolSize + 1
		markerSize = markerSize - 1
		currentT2 = currentT2 - minLoss
	}
	
	#? saveSet(currentSet)

} # if (markerSize > 2)


cat("Step=", step, "poolSize=", poolSize, ", currentT2=", currentT2, ", markerSize =", markerSize, "\n")
step = step + 1
flush.console()
} #while

cat("currentT2=", currentT2, "\n")
return (currentSet)

} #function

selectRandomVar = function (n) {
# Select a random column number from the top n rows of the T2 file written by WriteT2Values()

cols = read.csv("C:\\out-T2.csv")

cols = head(cols, n)

rowIdx = sample(nrow(cols), 1)

return(cols[rowIdx, 2])


}

poolToMarker = function(theVar) {
	#-- Add to currentSet
	currentSet <<- c(currentSet, theVar)
	#append(currentSet, theVar)
	#-- Remove from pool
	pool <<- pool[!pool==theVar]
	
}

markerToPool = function(theVar) {
	#-- Add back to the pool
	#append(pool, theVar)
	pool <<- c(pool, theVar)
	#-- Remove from currentSet
	currentSet <<- currentSet[!currentSet==theVar]
	
}


tsquare<-function(list1, data, classCondition){
#-- Function calculates Hotelling's T-square
#-- Adapted from:  http://www.stat.sc.edu/~habing/courses/530rF03.html
#-- Changed from original to accept a list of columns - this way it can
#-- be called iteratively to try different sets of columns.  Also added 
#-- the classCondition

#-- Inputs:
#-- list1 - a list of columns which should be used - this is so we don't have to use all of the dataframe, data
#-- data: a dataframe
#-- classCondition:  the "where" clause which differentiates the two classes (e.g. "Control_Ind == 'Y' ")

#-- Call:
#-- > tsquare(c(1,2,3,4), test, quote(Hyp_Ind=='true'))
#-- [1] 249.4002


##
#1) Split the data into the two groups and make sure both are matrices
##

group1 <- as.matrix(subset(data, select=list1, subset=(eval(classCondition))))
group2 <- as.matrix(subset(data, select=list1, subset=(!eval(classCondition))))

#-- Add matrix of all 1's to each group
#group1 <<- group1 + matrix(1, nrow=length(group1[,1]), ncol=length(group1[1,]))
#group2 <<- group2 + matrix(1, nrow=length(group2[,1]), ncol=length(group2[1,]))


##
#2) Find the mean vector and covariance matrix for the first group#
##
n1 <- length(group1[,1])
xbar1 <- apply(group1,2,mean)
C1 <- cov(group1)
##
#3) find the mean vector and covariance matrix for the second group#
##
n2 <- length(group2[,1])
xbar2 <- apply(group2,2,mean)
C2 <- cov(group2)
##
#4) find the pooled covariance matrix and its inverse#
##
C <- ((n1-1)*C1+(n2-1)*C2)/(n1+n2-2)
Cinv <- solve(C)  


##
#5) multiply to find the T-square stat and convert it to F and p-value#
##
Tsquare <- (n1*n2)*t(xbar1-xbar2)%*%Cinv%*%(xbar1-xbar2)/(n1+n2)
#p <- length(group1[1,])
#F <- (n1+n2-p-1)*Tsquare/((n1+n2-2)*p)
#p.value <- 1 - pf(F,df1=p,df2=n1+n2-p-1)
##
#Finally, output all of the results#
##
#return(n1,xbar1,C1,xbar2,C2,n2,C,p,Tsquare,F,p.value)
return(as.numeric(Tsquare))

}


WriteT2Values = function(data, excludeColumns, classCondition, outFile) {
#-- Write the T2 value for each column to a file, sorted in decreasing order
#-- Inputs:
#--		data - the dataframe
#--		excludeColumns - any columns, like index and class indicator, we don't want to include
#--		classCondition - the condition which separates the classes
#--		outFile - name of output file
#-- Call:
# WriteT2Values(someData, c(1,2), classCondition=quote(author=='Doyle'), "C:\\data\\out-T2.csv")

pool <<- seq(1, ncol(data))
pool <<- pool[-excludeColumns]

DF = data.frame()

for (i in pool) {

		T2 = tsquare(i, data , classCondition)
		newRow = data.frame(col=i, T2Value = T2)
		DF <- rbind(DF, newRow)
		
} 

write.csv(DF[order(DF$T2Value, decreasing=TRUE),], outFile)

}