NCAA 2013 Sleeper Report

I know surprisingly little about NCAA Men’s basketball, which is why my best bracket picks are randomly chosen (after weighting by seed).  But I’ve been wanting to do a more “sophisticated” analysis to see the relative strengths of the teams before filling out a bracket.  Also, it seems like you need to find upsets to differentiate yourself from every other bracket.  In the past my upsets were usually chosen by mascot strength (e.g. Badger > Horny Toad).

But this year is going to be different.  Fortunately, two metrics are available for comparing the relative strengths of each team:

  • SRS – Simple Rating System
  • SOS – Strength of Schedule

They are explained here.  Also, I posted Python code here to pull each region and dump them all into a csv file.

Right off the bat I want to plot these using R’s ggplot library.  Here the labels are Region and Seed:


The team with the highest SRS is E-1, the number one seed in the East, or Indiana.  Strangely, the team with the toughest schedule, MW-3, Michigan State, is only a three seed.  Actually, MW-1 (Louisville), MW-2 (Duke), and MW-3 make up a fairly close cluster from the Midwest Region.

Speaking of clusters, treating the data like this, as points separated by distances, is exactly what k-means clustering does.  Using the prediction.strength function from R’s fpc package, I see that two is the only value of k that results in a prediction strength > 0.80.  So if we create a two-cluster solution and use each team’s cluster for labels instead of Region and Seed we get:
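To make the clustering step concrete, here’s a minimal two-cluster k-means in plain Python – not the fpc code I actually ran, and the (SOS, SRS) pairs below are made up for illustration, not the real tournament data:

```python
def kmeans2(points, iters=20):
    """Minimal 2-means: seed with two far-apart points, then alternate
    between assigning points to the nearest center and re-averaging."""
    centers = [points[0], points[-1]]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   for c in clusters]
    return centers, clusters

# Illustrative (SOS, SRS) pairs: three strong teams, three weak ones
teams = [(9.0, 20.0), (8.5, 19.0), (9.5, 21.0), (2.0, 5.0), (1.5, 4.0), (2.5, 6.0)]
centers, clusters = kmeans2(teams)
print([len(c) for c in clusters])  # [3, 3]
```

With two well-separated groups like this, the algorithm settles immediately on the two group means – the same “strong vs. everyone else” split the fpc solution found.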


But back to our Region & Seeds.

The West seems to be weaker overall with W-2 (Ohio St) and W-5 (Wisconsin) looking like sleepers coming out of that Region.  I say they’re sleepers because they’re almost as high as W-1 (Gonzaga), but look like they’ve endured a tougher schedule.

In the South there are a number of teams in the upper right – S-3 (Florida), S-1 (Kansas), S-4 (Michigan), and S-11 (Minnesota).  It seems like when lower seeds like Minnesota rank respectably in SRS and SOS that might be a situation where you can differentiate your bracket by picking them as upsets.

So there you have a nice final four – Wisconsin, Florida, Michigan State, and Indiana.


The Impact of Arming Teachers

Since the tragedy in Newtown, there’s been talk of arming teachers.

Reading things like this got me wondering if we could estimate the impact, because surely it can’t be a free lunch.  Then I saw this great graphic about gun deaths vs. gun ownership.  And he made the data available, so I built a simple linear model in R:

library(ggplot2)

# Gun-ownership and gun-death tables (tab-separated); the guns file name is assumed
guns = read.table("guns.csv", sep="\t", header=T)
deaths = read.table("deaths.csv", sep="\t", header=T)
oecd = read.table("oecd.csv", sep="\t", header=T)

# Join on country and keep only OECD members
data = merge(guns, deaths, by="Country")
data$OECD = data$Country %in% oecd$Country
data.oecd = subset(data, OECD == TRUE)

# Scatter plot with a fitted line (the "+" must end the line so R keeps reading)
p <- ggplot(data = data.oecd, aes(x = Guns, y = Deaths)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x)

# Regress deaths (per 100k) on guns (per 100) -- Deaths is the response
mylm = lm(Deaths ~ Guns, data = data.oecd)


Deaths (per 100k people) = 0.599 + 0.089 * Guns (per 100 people).

Here again is a graph of the data with a plot of the linear model added:


This model has an R² = 0.384 and p = 0.00015.  The residuals of this model are:


Not terrible – something’s going on with Mexico (row #42).  And the model overestimates slightly for larger values of x (Guns), but that’s probably due to Mexico.

It was then a simple matter to look up the number of teachers according to the U.S. Census:  7.2 million.

So, if we arm every teacher in America, that works out to 7.2 mil / 250 mil * 100 = 2.88 additional guns per 100 people.  Plug that into the above linear model and we get:

2.88 * 0.089 = 0.2563 additional deaths per 100k people.  So in the U.S. that translates to:

250 million / 100k * 0.2563 ≈ 641 additional deaths
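The back-of-the-envelope arithmetic above fits in a few lines of Python (the slope comes from the fitted model; the teacher count and the rounded 250 million population figure are the ones used in this post):

```python
# Back-of-the-envelope estimate of additional gun deaths from arming teachers
teachers = 7.2e6       # U.S. teachers, per the Census figure cited above
population = 250e6     # rounded U.S. population used in this post
slope = 0.089          # deaths per 100k, per additional gun per 100 people

extra_guns_per_100 = teachers / population * 100        # 2.88
extra_deaths_per_100k = slope * extra_guns_per_100      # ~0.256
extra_deaths = extra_deaths_per_100k * population / 100e3

print(round(extra_guns_per_100, 2))  # 2.88
print(round(extra_deaths))           # 641
```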

Look, I would consider this a toy model.  The underlying data are from different points in time, there’s the Mexico thing, and 641 is only an average – it could be more; it could be less.

My point is that we should use data and assess the net impact of any actions we take to prevent school shootings.

Five reasons you’ve probably never heard for why you should learn SQL

Everyone knows SQL is the “lingua franca” of the database world.  But here are some reasons for learning it that you may have never heard before.

#1 – Push it toward the backend

Try to say that in a meeting and not crack a smile (no pun intended).  In the corporate world, any database of significance is running on another server (or several).  So any data transformation you can do in SQL is essentially being pushed off to another machine.  That’s like getting a free computer to do your bidding.  Yeah, I know everyone is sharing the other computer, but it’s probably a big one.  There’s an amazing amount you can do with SQL – pivot tables (aka crosstabs, aka lots of other names), on-the-fly normalization, fixing missing values, summarization, and more.  And if you’re using Oracle there are its analytic functions – not analytics in the traditional sense, but an extension of SQL that lets you analyze and summarize transactional data.  They aren’t always useful, but when they work, a little bit of SQL can replace a lot of post-processing.
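As a tiny illustration of pushing work to the backend, here’s a pivot table built with conditional aggregation – sketched in Python against an in-memory SQLite database (the table and column names are made up), though the SQL itself is the point and would run largely unchanged on a real server:

```python
import sqlite3

# In-memory database standing in for the "other server"
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", "Q1", 100), ("East", "Q2", 150),
    ("West", "Q1", 80),  ("West", "Q2", 120),
])

# Pivot quarters into columns with conditional aggregation --
# one round trip instead of downloading rows and reshaping locally
rows = con.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # [('East', 100.0, 150.0), ('West', 80.0, 120.0)]
```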

#2 – Let the nerds help you

Anyone doing serious corporate data munging is bringing together multiple data sources, any one of which could have dozens (a data warehouse) to hundreds (an operational database) of tables.  There’s no way you can be an expert in all of them.  SQL allows you to more easily make use of your friendly local data experts (FLDEs).  Have a query that isn’t returning the proper results?  Paste it into an email and send it to an FLDE.  If you’re downloading a table and doing a bunch of processing in VBA/Python/Perl/R, you’re not going to be able to send that to many FLDEs for help or advice.

#3 – It’s closer to Production Ready

With any luck, someday some of your processes will become valuable to others.  If and when that happens, the more of your processing that is done in SQL, the cheaper and quicker it will be to make your process available to lots of people (i.e. “productionize” it).  There’s a huge difference between sitting down with the IT folks and showing them your Perl scripts and showing them your SQL.  I’ve seen Perl scripts induce looks of horror and disdain.  Usually SQL will get you “meh, that’ll be two weeks”, which is exactly what I like to hear.

#4 – How you build your house

This is more a side-effect of #2 and #3.  There’s an old saying in woodworking: “You can build a house with just a hammer and a screwdriver, but what’s it going to look like?”  The act of using SQL and discussing your plans with FLDEs invariably turns up alternate ways of munging the data that needs munging.  I’m always surprised at how generally smart people will opt to create their own solutions using only the tools they’re comfortable with (e.g. VBA – BTW, that stands for Visual Basic for Applications and is Excel’s scripting language) rather than asking around IT for better/quicker/easier solutions.  I’m all for being a Data MacGyver – heck, it’s what I do.  But very often the data you need is already available elsewhere, or the tables you’re joining in Microsoft Access can be brought together on the backend by coordinating the creation of a database link.  SQL facilitates discussions with FLDEs, and their FLDE-ness will rub off on you.

#5 – FLDEs will love you!

Well, I ain’t going to lie – many of them will hate you for bothering them.  But the smart ones will love you.  They know that deep down inside you’re their customer, and a lot of this munging is stuff that probably should go into a datamart anyway.  You’re essentially on the front lines doing free business analysis for IT, and this is the way you should present it to them (in case they don’t notice) – though you probably don’t want to literally say “I’m doing free business analysis for you, so help me” – be tactful, geek!  But eventually, they may start contacting YOU – “Will you find this new table useful?”, that kind of stuff.  FLDEs all know that in a pinch, when the business mandates some new functionality be put in place by next month (or whatevs), they’d much rather get handed a bunch of SQL.  Helping you helps them.

There you have them.  You wouldn’t go to Italy without learning a little Italian, “the beautiful language”.  Why on Earth would you do data munging without knowing everything that can be done in SQL?

SQL is truly “la bella lingua”.

Speak business

This Raconteur piece mentions learning to “speak business” as a critical skill for the data scientist.  For most I would recommend first just learning to speak.  Regardless of the domain, business is about being an informative, convincing, and entertaining speaker.  Yeah, I said entertaining – sue me.  The best model in the world is useless if you can’t convince the business line to use it and if your presentation is boring as sh*t, you’re not going to convince anyone!

You don’t need to be a lifelong Toastmasters member – just work thru the first ten speeches.  It’ll probably change your business life.


Love this interview.  I didn’t realize Jeremy Howard, Chief Scientist at Kaggle, comes from an insurance background.  He gives out some nuggets of info on predictive modeling in insurance.  He makes a point I’ve been shouting from the rooftops to whatever actuaries will listen – predictive modeling of tough problems is so much more than GLM!  The best overall model is usually a combination of techniques.  As I talk to people in the actuarial world, the general feeling seems to be that predictive modeling starts and ends with GLM, but it’s just one piece of the puzzle.

Another piece I often see missing in the actuarial world is unsupervised techniques – clustering, principal component analysis (PCA), self-organizing maps (SOM) – used to create new variables which feed downstream models.  I’m often surprised how some crappy-seeming cluster or principal component gets identified as an important variable by a downstream algorithm.
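Here’s a minimal sketch of that pipeline in plain Python – not any actuarial toolkit, and the “rating variables” and the response are simulated purely for illustration: an unsupervised step derives a principal component, and a downstream least-squares fit treats it as a feature.

```python
import math, random

rng = random.Random(0)

# Two correlated rating variables (illustrative, e.g. vehicle age and mileage)
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2 = [0.8 * a + rng.gauss(0, 0.3) for a in x1]

def centered(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

x1c, x2c = centered(x1), centered(x2)
n = len(x1c)

# Unsupervised step: leading principal component of the 2x2 covariance matrix
a = sum(v * v for v in x1c) / n
c = sum(v * v for v in x2c) / n
b = sum(u * v for u, v in zip(x1c, x2c)) / n
theta = 0.5 * math.atan2(2 * b, a - c)   # direction of maximum variance
pc1 = [math.cos(theta) * u + math.sin(theta) * v for u, v in zip(x1c, x2c)]

# Downstream supervised step: the component becomes a feature in a simple
# least-squares fit; a slope near 2 shows the engineered variable carries signal
y = [2.0 * p + rng.gauss(0, 0.1) for p in pc1]
slope = sum(p * t for p, t in zip(pc1, y)) / sum(p * p for p in pc1)
print(round(slope, 1))
```

In real work the downstream model would be a GLM or an ensemble, and the point is the same: the unsupervised output is just another candidate variable.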

Finally, he makes a distinction between Big Data and Analytics.  So often it’s implied that they go hand-in-hand – just look at job postings. Show me a listing for Hadoop that doesn’t also say you must know statistics or machine learning (ok, maybe there are some, but there aren’t many). Big Data is about scattering your data across multiple machines and is an engineering problem.  It’s like saying I have to be a DBA (and know about backups, replication, security, data governance and blah blah blah) just to write some SQL.

DM 101: Intro to Data Mining

There are lots of books and kits about doing science experiments with kids – things like making a stink bomb out of a ping-pong ball or a papier-mâché volcano.  I also noticed my kids (2nd and 4th graders) are already bringing home math homework with probability and statistics, and I’ve, umm, been learning quite a bit.

So I’m starting this “DM 101” series (for “Data Mining 101” in case it’s not obvious) to be kind of like a “365 Chemistry Experiments” except it’s with data, spreadsheets, and computers.

Kids love computers – laptops, Wii’s, DSi’s, cell phones.  My son wants a cell phone not to call people but so he can check the radar.  And to love computers is to love data science.  Watch your kids the next time they’re playing Pokemon – they’re analyzing strength and weakness ratings, and running thousands of epochs on their Personal Neural Nets (PNN) analyzing attack probabilities.  Reminds me of the countless hours I spent playing Dungeons and Dragons [wiping away tear].

So my goals in DM 101 are:

  • Use real data – sports, astronomy, weather
  • Show how to get and prepare the data you’ll be mining – depending on who you talk to, data mining is anywhere from 50-90% data preparation
  • Make decisions and predictions with this data and see if they come true
  • Explain terminology – for example, an “epoch” in data mining is one full pass through the training examples, and you usually need a lot of them to train a neural net (artificial or personal)

Through marriage I am blessed with many non-technical friends.  They are copy writers, book editors, marketing guys, and theater types.  They generally use Macs and kick my butt in games like Scrabble.  They will also be the target audience.  If they can understand it, then anyone can….

Which brings me to one final note – most of the software I use will be open source.  I say most because I’m probably going to use Microsoft Excel.  I’ve used OpenOffice’s spreadsheet and it’s great – use that if you don’t have or want Excel.  Other than that I’ll just be using programming languages like R, Perl, and Python.

Good luck…