Password guessing

Interesting ideas for password guessing software.

I wonder whether any text analytics techniques, specifically stylometry, would be useful for reducing the number of passwords that need to be guessed.  For example, does the writer invariant apply to Tweets?

You could monitor someone’s Tweets, or blog posts for that matter, and get an idea of some of the author’s invariant text properties, like average word length.

Maybe their passwords would also have similar properties.
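As a rough illustration of the kind of invariant mentioned above, here is a minimal Python sketch that computes average word length over a handful of posts. The function name `average_word_length` and the sample posts are made up for illustration; real stylometry uses many more features than this one.

```python
import re

def average_word_length(text):
    """Mean length of the alphabetic words in text (0.0 if there are none)."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

# Made-up sample posts, for illustration only (not real data).
sample_posts = [
    "Just landed in Boston, heading to the park",
    "Great game tonight",
]

# One number summarizing the author's writing across all sampled posts.
profile = average_word_length(" ".join(sample_posts))
print(round(profile, 2))
```

Whether a number like this says anything about someone's passwords is exactly the open question.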

Five reasons you probably have never heard for why you should learn SQL

Everyone knows SQL is the “lingua franca” of the database world.  But here are some reasons for learning it that you may never have heard before.

#1 – Push it toward the backend

Try to say that in a meeting and not crack a smile (no pun intended).  In the corporate world, any database of significance is running on one or more other servers.  So any data transformation you can do in SQL is essentially being pushed off to another server.  That’s like getting a free computer to do your bidding.  Yeah, I know everyone is sharing the other computer, but it’s probably a big one.  There’s an amazing amount you can do with SQL – pivot tables (aka crosstabs aka lots of other names), on-the-fly normalization, fixing missing values, summarization, and more.  And if you’re using Oracle there’s Oracle Analytics – which is not analytics in the traditional sense – it’s an extension of SQL that allows you to analyze and summarize transactional data.  It isn’t always useful, but when it works a little bit of SQL can replace a lot of post-processing.
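To make the pivot and missing-value points concrete, here is a small sketch using Python's built-in sqlite3 module. The `sales` table and its rows are hypothetical, and an in-memory SQLite database stands in for the big shared server. The point is only that the crosstab and the null handling happen inside the query, not in post-processing.

```python
import sqlite3

# In-memory database standing in for the remote corporate server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "Q1", 100.0),
        ("east", "Q2", 150.0),
        ("west", "Q1", 80.0),
        ("west", "Q1", 20.0),
        ("west", "Q2", None),  # a missing value to be fixed in SQL
    ],
)

# Crosstab via conditional aggregation; COALESCE replaces missing amounts with 0.
pivot_sql = """
SELECT region,
       SUM(CASE WHEN quarter = 'Q1' THEN COALESCE(amount, 0) ELSE 0 END) AS q1,
       SUM(CASE WHEN quarter = 'Q2' THEN COALESCE(amount, 0) ELSE 0 END) AS q2
FROM sales
GROUP BY region
ORDER BY region
"""
rows = conn.execute(pivot_sql).fetchall()
for region, q1, q2 in rows:
    print(region, q1, q2)
```

The client only ever sees one summarized row per region; everything else stayed on the database side.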

#2 – Let the nerds help you

Anyone doing serious corporate data munging is bringing together multiple data sources, any one of which could have anywhere from dozens of tables (data warehouse) to hundreds (operational database).  There’s no way you can be an expert in all of them.  SQL allows you to more easily make use of your friendly local data experts (FLDEs).  Have a query that isn’t returning the proper results?  Paste it into an email and send it to an FLDE.  If you’re downloading a table and doing a bunch of processing in VBA/Python/Perl/R you’re not going to be able to send that to many FLDEs for help or advice.

#3 – It’s closer to Production Ready

With any luck, someday some of your processes will become valuable to others.  If and when that happens the more of your processing that is being done in SQL the cheaper and quicker it will be to make your process available to lots of people (i.e. “productionize” it).  There’s a huge difference between sitting down with the IT folks and showing them your Perl scripts and showing them your SQL.  I’ve seen Perl scripts induce looks of horror and disdain.  Usually SQL will get you “meh, that’ll be two weeks” which is exactly what I like to hear.

#4 – How you build your house

This is more a side effect of #2 and #3.  There’s an old saying in woodworking:  “You can build a house with just a hammer and a screwdriver, but what’s it going to look like?”  The act of using SQL and discussing your plans with FLDEs invariably turns up alternate ways of munging the data that needs munging.  I’m always surprised at how generally smart people will opt to create their own solutions using only tools that they’re comfortable with (e.g. VBA – BTW, this stands for Visual Basic for Applications and is Excel’s scripting language) rather than asking around IT for better/quicker/easier solutions.  I’m all for being a Data MacGyver – heck, it’s what I do.  But very often the data you need is already available elsewhere, or the tables you’re joining in Microsoft Access can be brought together on the backend by coordinating the creation of a database link.  SQL facilitates discussions with FLDEs and their FLDE-ness will rub off on you.

#5 – FLDEs will love you!

Well, I ain’t going to lie – many of them will hate you for bothering them.  But the smart ones will love you.  They know that deep down inside you’re their customer and a lot of this munging is stuff that possibly should go into a datamart anyway.  You’re essentially on the front lines doing free Business Analysis for IT and this is the way you should present it to them (in case they don’t notice) – though you probably don’t want to literally say “I’m doing free business analysis for you, so help me” – be a tactful geek!  But eventually, they may start contacting YOU – “Will you find this new table useful?”, that kind of stuff.  FLDEs all know that in a pinch, when the business mandates some new functionality be put in place by next month (or whatevs), they’d much rather get handed a bunch of SQL.  Helping you helps them.

There you have them.  You wouldn’t go to Italy without learning a little Italian, “the beautiful language”.  Why on Earth would you do data munging without knowing everything that can be done in SQL?

SQL is truly “la bella lingua”.

Baseball HR and SO over the years

From my upcoming master’s thesis: in baseball over the years, Home Runs (HR) and Strikeouts (SO) have been highly correlated (0.85).  Graphically, we see this (values are scaled):

HR and SO by Year

The Bill James Historical Baseball Abstract mentions a jump in HR’s in the 80’s and a jump in SO’s in the 90’s.  We see both of those here.  Perhaps most interesting is that for the 80’s HR spike, SO’s didn’t really keep pace that decade.  As James points out, this was the first decade where players could finally make enough to play full time and work out year-round instead of selling cars in the off-season.  And of course there was probably drug use.
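For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of the two calculations involved: a Pearson correlation between two yearly series, and a rescaling so both can share one chart. The numbers are placeholders, not the thesis data, and min-max scaling is an assumption on my part since the exact scaling used for the chart isn't stated above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Placeholder yearly totals, NOT the thesis data.
hr_by_year = [10, 12, 15, 14, 20]
so_by_year = [50, 55, 70, 66, 90]

print(round(pearson(hr_by_year, so_by_year), 2))
print(min_max_scale(hr_by_year))
```

Scaling doesn't change the correlation; it only makes the two lines comparable by eye.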

So you would think that if the wind were to have a significant impact on HR’s, you’d see its influence on the HR but not a corresponding influence on SO:

HR by Wind Direction

When the wind is blowing out at Fenway (Southwest wind) it does in fact appear as if there are slightly more HR’s.  Still, the effect is hard to quantify.

“Factcheck.org” and Obama vs Romney

I was curious to see how often various news sites make use of Factcheck.org, and specifically how often “Factcheck.org” was used in the same article with either party’s leading candidate.  Since Factcheck.org seems to hammer both sides equally on their loose use of facts and what they mean, I’m assuming it is truly bipartisan.

Before I continue, you should know I’m firmly entrenched in the middle of the political spectrum.  And I think most of America is probably within one standard deviation of the middle – this isn’t some bold political statement – it’s just the bell curve.

With that in mind, I fully expected to find that the news sites were equally using Factcheck.org, with each side using it to support their representation of “the facts”.  So I expected Fox to use Factcheck to support Romney and MSNBC to use it to support Obama.  I wasn’t sure about CNN, but I was interested in the result because it seemed to me that they were more fair in their coverage.

I did four Google searches against Fox, MSNBC, and CNN.  The searches were:

Factcheck Only:  [site:www.foxnews.com “Factcheck.org” -obama -romney]

Obama:   [site:www.foxnews.com “Factcheck.org” obama -romney]

Romney:  [site:www.foxnews.com “Factcheck.org” -obama romney]

Both:  [site:www.foxnews.com “Factcheck.org” obama romney]

For MSNBC and CNN, just replace “www.foxnews.com” above with “www.msnbc.msn.com” and “www.cnn.com”, respectively.  It’s important to exclude terms (e.g., “-romney”) and also to put Factcheck.org in quotes, otherwise Google thinks you really meant “fact check”.
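The twelve query strings (four searches times three sites) can be generated instead of typed by hand. This is just a sketch: the `build_queries` helper is my own naming, and it only builds the strings; it doesn't call Google.

```python
# Site hostnames as given in the post.
SITES = {
    "fox": "www.foxnews.com",
    "msnbc": "www.msnbc.msn.com",
    "cnn": "www.cnn.com",
}

def build_queries(site):
    """Return the four search strings for one site, keyed by search name."""
    # The quotes around Factcheck.org force an exact match.
    base = f'site:{site} "Factcheck.org"'
    return {
        "Factcheck Only": f"{base} -obama -romney",
        "Obama": f"{base} obama -romney",
        "Romney": f"{base} -obama romney",
        "Both": f"{base} obama romney",
    }

for name, host in SITES.items():
    for label, query in build_queries(host).items():
        print(f"{name:5} {label:14} [{query}]")
```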

The results were:

Source    R     O     B     F
fox       2    47    29    22
msnbc     2    94   101     7
cnn       0     2    17     2

Here,

R = Factcheck.org and Romney and Not Obama

O = Factcheck.org and Obama and Not Romney

B = Factcheck.org and Obama and Romney (both)

F = Factcheck.org and Not Obama and Not Romney

To visualize, I did a quick stacked bar plot in R:

# data.news: one row per site; columns are Source, R, O, B, F (the table above)
barplot(t(as.matrix(data.news[, -1])),
        legend.text = c("Factcheck + Romney", "Factcheck + Obama", "Both", "Factcheck Only"),
        col = c("red", "blue", "purple", "brown"),
        names.arg = c("fox", "msnbc", "cnn"))

I see a few things.  Fox and MSNBC are each about as likely to mention Factcheck alongside both candidates as alongside Obama alone.  CNN makes fewer references to Factcheck.org, but when it does, its articles mostly mention both candidates.  I like that – it supports my perception of CNN.
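To put numbers on what the chart shows, the counts from the results table can be turned into per-site shares. A quick Python sketch (the `shares` helper is just illustrative naming; the counts are copied from the table above):

```python
# Hit counts from the results table: R = Romney only, O = Obama only,
# B = both candidates, F = Factcheck.org with neither name.
counts = {
    "fox":   {"R": 2, "O": 47, "B": 29,  "F": 22},
    "msnbc": {"R": 2, "O": 94, "B": 101, "F": 7},
    "cnn":   {"R": 0, "O": 2,  "B": 17,  "F": 2},
}

def shares(site_counts):
    """Fraction of a site's Factcheck.org hits falling in each category."""
    total = sum(site_counts.values())
    return {key: value / total for key, value in site_counts.items()}

for site, site_counts in counts.items():
    percentages = {k: round(100 * v) for k, v in shares(site_counts).items()}
    print(site, sum(site_counts.values()), percentages)
```

Shares make the sites comparable even though CNN has far fewer hits overall.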

But most stunning from this graphic is that “Romney” alone is rarely mentioned with “Factcheck.org” at all.  In fact, if you search the entire web:

[“factcheck.org” romney -obama] returns 27,300 results
[“factcheck.org” -romney obama] returns 356,000 results

Now, the easy answer is that Obama has been president for the past four years – there are more articles about him out there.  Still, Fox News has two articles mentioning only Romney and Factcheck.org.  In case you missed it: TWO.

And it’s indicative of a major problem I have with Romney.

He’s not really saying anything.  No facts.  No plans.  Just “Elect me because Obama sucks!”.

Rare Events at Fenway Park

My thesis dataset includes the 3,231 games played at Fenway Park from 1970 to 2009.  While doing exploratory analysis, I got to wondering what the rarest events at Fenway in those 40 years were.  Intuitively, I thought Catcher Interference or Balks, but once again intuition is wrong:

Play                      Count   Avg per Game
Triple Plays                  9          0.003
Catcher Interference         19          0.006
Balk                        245          0.076
Passed Balls                657          0.203
Triples                   1,251          0.387
Caught Stealing           1,675          0.518
Sac Hit                   1,725          0.534
Hit by Pitch              1,843          0.570
Wild Pitch                1,860          0.576
Sac Fly                   2,017          0.624
HR                        6,197          1.918
Hits                     62,130         19.229
Total Putouts           171,677         53.134

I’ve seen at least 2-3 triple plays that I can remember.  And I don’t think I remember any catcher interference (there was a play during the 1975 World Series between Carlton Fisk and Ed Armbrister, but that wasn’t catcher interference, or any interference officially for that matter).  But I think the key word is remember.  Catcher Interference just isn’t that memorable of a play I guess – or of course I just haven’t ever seen one.

Incidentally, I included Total Putouts as a sanity check – something we can calculate in our heads.  3,231 games * 27 putouts * 2 teams = 174,474.  Pretty close to the actual number of 171,677.  In fact, since the home team doesn’t always bat in the 9th inning, you would guesstimate something a bit less than 174,474, which is what we see.  Sanity checks like this are often left out of presented data, yet are critical in helping grok and feel comfortable with stats.
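The same back-of-the-envelope check is easy to script, so it can be rerun whenever the dataset changes. A minimal Python sketch using the numbers above (the variable names are mine):

```python
GAMES = 3231              # games at Fenway, 1970 to 2009
OUTS_PER_TEAM = 27        # 9 innings * 3 outs, ignoring extra innings
TEAMS = 2
ACTUAL_PUTOUTS = 171_677  # Total Putouts from the table

# Naive estimate: every game goes exactly nine full innings for both teams.
estimate = GAMES * OUTS_PER_TEAM * TEAMS
shortfall = estimate - ACTUAL_PUTOUTS
per_game = ACTUAL_PUTOUTS / GAMES

print(estimate)            # 174474
print(shortfall)           # 2797
print(round(per_game, 3))  # 53.134
```

A shortfall of a couple of thousand putouts over 3,231 games is less than one per game, which fits the home team often skipping the bottom of the 9th.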

Speak business

This Raconteur piece mentions learning to “speak business” as a critical skill for the data scientist.  For most I would recommend first just learning to speak.  Regardless of the domain, business is about being an informative, convincing, and entertaining speaker.  Yeah, I said entertaining – sue me.  The best model in the world is useless if you can’t convince the business line to use it, and if your presentation is boring as sh*t, you’re not going to convince anyone!

You don’t need to be a lifelong Toastmasters member – just work through the first ten speeches.  It’ll probably change your business life.

Wind impact on extreme home runs

Just reading through “The Physics of Baseball” by Adair.  He says “I expect the longest home runs hit in outdoor parks are always wind assisted”.  Yes!!!  As I said here, I think Williams’ HR at Fenway was definitely wind assisted.