Password guessing

Interesting ideas for password guessing software.

I wonder whether any text analytics techniques, specifically stylometry, would be useful for reducing the number of passwords that need to be guessed.  Like, does the writer invariant apply to Tweets?

You could monitor someone’s Tweets, or blog posts for that matter, and get an idea of some of the author’s invariant text properties, like average word length.

Maybe their passwords would also have similar properties.
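As a rough sketch of the idea, here’s how you might extract one writer-invariant feature, average word length, from a batch of text samples (the sample tweets, and the notion that this would transfer to passwords, are pure speculation):

```python
# Sketch: one stylometric feature, average word length.
# The sample tweets below are invented for illustration.

def avg_word_length(text):
    """Average word length for a piece of text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

tweets = [
    "heading out for coffee back in ten",
    "deadline slipped again this week",
]

# The author's "profile" is just the mean feature value over their posts.
profile = sum(avg_word_length(t) for t in tweets) / len(tweets)
```

If the invariant held, candidate passwords whose average word length sat far from the author’s profile could be pushed to the bottom of the guessing list.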


Analytics in the workplace

The other day I had a demo from our research group on Attensity.  After some successful text analytics projects using Perl and R (and one that just didn’t work because of large volumes of claim notes), we’re looking for something that can handle higher volumes.

We were told that due to its complexity and cost of licenses, we would not be allowed to use Attensity ourselves but rather would have to work thru the research unit.  Obviously, this pained me, but if that is the most cost effective approach then I’m all for it.

But I don’t think this is cost effective at all and here’s why.

Analytics today is still a youngster in the corporate world and the way we do it now is comparable to the way IT was done years ago.  It used to be that if you wanted any kind of programming done you had to go to the IT group, your work was prioritized (i.e. you made and presented some kind of cost benefit analysis), and if there were enough resources your project slogged forward.  By the time it was done, the resulting software may or may not have met your needs (which might have changed since you started).  Hence the Agile craze, but I digress.

Compare IT of yesterday to today.  I can do 99% of my job with the tools on my desktop – as much as possible on the back-end with Oracle and SQL Server, and then whatever else needs to be done with Excel/VBA, SAS (technically not on my desktop, but whatevs), Perl, and R.

Imagine if I had to go to IT every time I needed a new Excel macro.  The time (cost) to get work done would be outrageous.  Sure, there end up being a lot of customer-created “applications” out there, some of which get turned over to IT (much to their horror).  But what’s often forgotten is that only the top, most useful, processes ever make it to “the show”.  IT may have to figure out how to turn Access SQL into Oracle SQL, but so what – think of all the savings – the business analyst’s job is practically done.  And IT never had to spend one second on the 80-90% of processes that either serve their purpose for a short time or are perhaps more experimental in nature.

So that brings us back to analytics today.  At some point some kind of analytics will be a tool included for free with Windows, just like VBScript is today.  And every reason you can give that this won’t happen has had a parallel reason in IT:

  • It’s too complicated (writing Windows apps used to be complicated; now it’s easy)
  • They’ll make mistakes – yup, they will.  But some of it will make it to The Show.
  • The software is too expensive – there are already free tools out there.

I’m not saying Enterprise Miner is going to be included in Windows 10.  But how about a simple decision tree?  While not the most powerful technique, I like to do a quick CART tree just to visualize some of the relationships (linear or not) in a dataset.  Really, you could say that Microsoft is already including data mining for very little additional cost in SQL Server.

The reason I know this to be true is innovation.  There’s no way you can innovate with analytics by having to go thru some research unit.  The nature of innovation is such that 90% of what you try is going to suck.  As you get better maybe you can bring that down to 85% (ok, maybe 88%).  Nobody is going to fund a project that has a 90% chance of sucking, so the whole practice of having to go to a research unit to innovate will never last – either the practice or the company will end.

Luckily, our company is also carefully opening up to the use of Free and Open Source Software (FOSS).  That’s why we’re looking at using GATE for our large project.

Text mining feature selection with TF-IDF

TF-IDF is pretty handy in the field of text mining even though it’s from the field of information retrieval (think search engines – Google or Bing).  At its core, a search engine just takes the phrase you entered and tries to figure out which document you want back.

TF-IDF is made up of two parts:

  • TF:  Term Frequency
  • IDF:  Inverse Document Frequency

The TF part is easy enough – if a word appears in a document 50 times and there are 1000 words in the entire document, then TF = 50/1000 = 0.05.  If another document contains the same term 3 times in 500 words, TF = 3/500 = 0.006.  TF is a measure of how important a word is within a body of text.
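That arithmetic is trivial to sketch in code (the counts are the ones from the example):

```python
# TF = (times the term appears in the document) / (total terms in the document)

def term_frequency(term_count, total_terms):
    """Fraction of the document made up of this term."""
    return term_count / total_terms

tf_doc1 = term_frequency(50, 1000)  # 0.05
tf_doc2 = term_frequency(3, 500)    # 0.006
```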

Well, who cares?  Let’s say you’re searching for information about “agaves”, and your search turns up two documents – one with a TF of 0.05 and one with 0.006.  You’d probably want the document with the higher TF at the top of the list, since “agaves” is more important to that document and it’s more likely to be the one you want.

Then the next piece, IDF, is only slightly more difficult.  Before showing the math I’ll show what it’s for.

Let’s say you have 100 emails.  Of these emails 50 are spam and 50 are from friends.

It’d be nice if there was a metric to find words which might be useful to differentiate between the two types of email.

Such a metric would put a lower value on any terms that were in all the emails – like your name, for example – because a word that’s in every email is obviously of no use in helping you decide which emails are valid and which are spam.  Likewise, this metric would put a higher value on any terms that were in only some of the emails, as these might be useful for figuring out which is which.

In case you haven’t guessed yet, IDF is that metric:

IDF = log(Total number of documents / number of documents containing term)

Don’t let the log function scare you – it has a couple of useful properties here, like the fact that the log of 1 is zero.  So in the email example above, if your name was in every document, its IDF score would be log(100/100) = 0.  Any term that’s in only one of the documents would have an IDF of log(100/1) = 4.60 (using the natural log).  So a term with a high IDF might be useful here.  A term that was in half the documents would score log(100/50) = 0.69.
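Here’s the same calculation in code, using the natural log (the counts are the ones from the email example):

```python
import math

def idf(total_docs, docs_with_term):
    """IDF = log(total documents / documents containing the term)."""
    return math.log(total_docs / docs_with_term)

idf_everywhere = idf(100, 100)  # 0.0   -- in every email, useless
idf_rare = idf(100, 1)          # ~4.61 -- in one email, potentially useful
idf_half = idf(100, 50)         # ~0.69 -- in half the emails
```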

Another property of the log function is that the log of a number greater than or equal to one is never negative.  This means the IDF is never negative, and you could add the IDFs for a bunch of words together to come up with a total IDF for a phrase.  Are you starting to see how this might be useful to a search engine?  A search engine wants to quickly figure out which words are the most useful for helping it return relevant results.

A side note on a related topic – stop words, or stoplists.  Stop words are words like “the” or “and” which are generally not very good at differentiating between documents because they’re probably in most of them anyway – they also don’t have much meaning.  A stoplist is kind of like a list of words where we automatically assume the IDF is 0, which saves the time of actually calculating it.

So, to borrow a point from Bilisoly (2008), any words with an IDF = 0 (because they’re in every document) are effectively put on a stoplist.  In other words, IDF is useful for creating a problem-specific stop list.
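That suggests a quick way to build a problem-specific stoplist: compute IDF over your own corpus and collect the zero-scoring terms.  A sketch on a made-up three-document corpus:

```python
import math

# Toy corpus, invented for illustration.
docs = [
    "the agave bloomed",
    "the spam offer",
    "the meeting moved",
]

tokenized = [d.split() for d in docs]
vocab = {w for doc in tokenized for w in doc}

def idf(term, tokenized_docs):
    containing = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / containing)

# Terms that appear in every document get IDF = log(1) = 0.
stoplist = {w for w in vocab if idf(w, tokenized) == 0.0}
# Here only "the" appears in all three documents.
```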

The final calculation for TF-IDF is to multiply TF * IDF (there are actually several ways to combine TF and IDF, but multiplying them is simple).
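Putting the two pieces together on a small invented corpus:

```python
import math

# Two toy documents, pre-tokenized.
docs = [
    "agaves bloom once then the agaves die".split(),
    "the market report is due".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "agaves" is frequent in the first document and absent from the second,
# so it scores high there; "the" is in both documents, so its IDF (and
# therefore its TF-IDF) is zero everywhere.
score_agaves = tf_idf("agaves", docs[0], docs)
score_the = tf_idf("the", docs[0], docs)
```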

A few notes to remember:

  • TF is a measure of a term’s relevance within a document
  • IDF is a measure of a term’s relevance within a group of documents
  • TF-IDF combines TF and IDF and is a measure of a term’s relevance within each document when compared to all the other documents.

This process of deciding which words to use in a text mining problem is called feature selection.  Which features of the document are the most useful in helping me differentiate between the different classes of document?  In some fields, like genomics and text mining, feature selection is often the greatest challenge of creating a predictive model.  Once you’ve found a useful, efficient set of features, most of the modeling work is done.

Finally, in the references there’s a paper “Introduction to variable and feature selection” (Guyon and Elisseeff, 2003) which I thought was great and easy to understand.  One of my favorite things about this paper is the “feature selection check list”.