My Cafepress Shop – Steve’s Analytic Swag Shop

For months I’ve been wishing there were shirts and mugs and stuff with more “analytical” themes.

Please visit my shop. My only criterion for a design is that I have to think it’s funny.

Steve’s Analytic Swag Shop – buy t-shirts, mugs & gifts from my shop.


Netflix algorithm

The fact that the winning Netflix algorithm was never implemented doesn’t get enough publicity. It’s a good example of what modelers create when they aren’t shackled by engineering concerns. The other thing these competitions (like Kaggle) let entrants avoid is having to integrate their model into a workflow.

But I guess really these are only concerns for those of us who create models for other humans to use. If you are creating models that computers use to, say, trade stocks, knock yourself out!


NCAA 2013 Sleeper Report

I know surprisingly little about NCAA Men’s basketball, which is why my best bracket picks are randomly chosen (after weighting by seed).  But I’ve been wanting to do a more “sophisticated” analysis to see the relative strengths of the teams before filling out a bracket.  Also, it seems like you need to find upsets to differentiate yourself from every other bracket.  In the past my upsets were usually chosen by mascot strength (e.g. Badger > Horny Toad).
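The seed-weighted random pick isn’t spelled out above, but here is one way it might look in Python. The weighting scheme — the stronger seed wins with probability proportional to the opponent’s seed number — is my assumption, not something from the post:

```python
import random

def pick_winner(team_a, seed_a, team_b, seed_b, rng=random):
    # Assumed weighting: the lower-numbered (stronger) seed wins with
    # probability proportional to the opponent's seed, so a 1-seed
    # beats a 16-seed 16/17 of the time.
    p_a = seed_b / float(seed_a + seed_b)
    return team_a if rng.random() < p_a else team_b

# Example matchup: a 1-seed vs. a 16-seed
winner = pick_winner("Indiana", 1, "James Madison", 16)
```

Any monotone function of seed would work here; this proportional scheme is just the simplest one that never makes an upset impossible.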

But this year is going to be different.  Fortunately, two metrics are available for comparing the relative strengths of the teams:

  • SRS – Simple Rating System
  • SOS – Strength of Schedule

They are explained here.  Also, I posted Python code here to pull each region and dump them all into a csv file.

Right off the bat I want to plot these using R’s ggplot2 library.  Here the labels are Region and Seed:


The team with the highest SRS is E-1, the number one seed in the East, or Indiana.  Strangely, the team with the toughest schedule, MW-3, Michigan State, is only a three seed.  Actually, MW-1 (Louisville), MW-2 (Duke), and MW-3 make up a fairly close cluster from the Midwest Region.

Speaking of clusters, treating these two metrics as coordinates and measuring distances between teams is exactly what k-means clustering does.  Using the prediction.strength function from R’s fpc package, I see that k = 2 is the only solution which results in a prediction strength > 0.80.  So if we create a two-cluster solution and use each team’s cluster as the label instead of Region and Seed, we get:
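prediction.strength lives in R’s fpc package, but the clustering step itself is simple enough to sketch. Here is a minimal pure-Python k-means (Lloyd’s algorithm) run on invented (SRS, SOS) values — the numbers below are made up for illustration, not real team stats:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm on 2-D points (list of (x, y) tuples)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # Recompute each center as the mean of its assigned points.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    # Final labels: index of the nearest center for each point.
    labels = []
    for p in points:
        d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        labels.append(d.index(min(d)))
    return labels, centers

# Invented (SRS, SOS) values: a "strong" group and a "weak" group.
teams = [(20.1, 9.5), (18.7, 10.2), (19.5, 8.8),
         (5.2, 1.1), (4.8, 0.5), (6.0, 2.0)]
labels, centers = kmeans(teams, k=2)
```

With two well-separated groups like this, the algorithm recovers the split regardless of which points seed the initial centers.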


But back to our Region & Seeds.

The West seems to be weaker overall with W-2 (Ohio St) and W-5 (Wisconsin) looking like sleepers coming out of that Region.  I say they’re sleepers because they’re almost as high as W-1 (Gonzaga), but look like they’ve endured a tougher schedule.

In the South there are a number of teams in the upper right – S-3 (Florida), S-1 (Kansas), S-4 (Michigan), and S-11 (Minnesota).  It seems like when lower seeds like Minnesota rank respectably in SRS and SOS that might be a situation where you can differentiate your bracket by picking them as upsets.

So there you have a nice final four – Wisconsin, Florida, Michigan State, and Indiana.

Some Python to extract NCAA team stats

It’s that time of year.  In prep for an NCAA bracket clustering project I’ve had in the back of my head for the past year, here’s a snippet of Python to extract the team stats.  Thank you, BeautifulSoup!

import urllib2
import csv
from bs4 import BeautifulSoup

sites = {
    "BigTen": "",
    "BigEast": "",
    "ACC": "",
    "Pac-12": "",
    "Big-12": "",
    "MWC": "",
    "SEC": "",
    "Atlantic-10": "",
    "MVC": "",
    "WCC": "",
    "CUSA": "",
    "WAC": "",
    "Horizon": "",
    "BigWest": "",
    "MAC": "",
    "MAAC": "",
    "Sun-Belt": "",
    "Patriot": "",
    "Colonial": "",
    "Ivy": "",
    "OVC": "",
    "America-East": "",
    "Summit": "",
    "Northeast": "",
    "Southern": "",
    "Atlantic-Sun": "",
    "Southland": "",
    "Big-Sky": "",
    "Big-South": "",
    "MEAC": "",
    "Great-West": "",
    #"Independent": "",
    "SWAC": ""
}

f = open('ncaa_data.csv', 'w')

f.write("Conf, Rk, School, Conf_W, Conf_L, Conf_Pct, Over_W, Over_L, Over_Pct, PPG_Own, PPG_Opp, SRS, SOS\n")

for item in sites.keys():
    soup = BeautifulSoup(urllib2.urlopen(sites[item]).read())
    print "Processing", item, "..."
    for row in soup('table', {'class': 'sortable stats_table'})[0].tbody('tr'):
        tds = row('td')
        if len(tds) >= 12:  # some conferences like Sun-Belt have two tables and an extra header - this skips those rows
            f.write(item); f.write(",")
            f.write(tds[0].string); f.write(",")
            f.write(tds[1].find('a').string); f.write(",")  # need to extract anchor text
            for x in range(2, 12):
                f.write(tds[x].string); f.write(",")
            f.write("\n")  # end the record for this team

f.close()


Correlation gets a bad rap

Correlation has been getting a bad rap lately.  But the fact that correlation does not imply causation may not be all that important for the purposes of prediction.

My favorite Correlation/Causation Paradox (CCP) is that ZIP codes with lots of churches tend to have more crime.  If your purpose is to prevent crime, you could probably do worse than to use the number of churches to decide how to deploy your police resources.

If on the other hand, you decide that the best way to reduce crime is to destroy churches, you have CCP issues.

Just think of correlation as a black box, like a neural net.  Not so good for explanation.  Good for prediction!
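The churches-and-crime paradox is easy to simulate. In this sketch the data are entirely invented, with ZIP-code population as the lurking variable driving both counts; the point is just that the correlation is real and usable for prediction even though neither variable causes the other:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented ZIP-code data: population (thousands) drives both counts.
population = [10, 25, 40, 55, 80, 120]
churches   = [3, 7, 11, 14, 22, 30]      # grows roughly with population
crimes     = [40, 110, 150, 230, 310, 500]  # also grows with population

r = pearson(churches, crimes)  # strong positive correlation
```

Knowing churches didn’t cause the crimes doesn’t stop you from using the church count to rank ZIP codes by expected crime — it only stops you from expecting demolition to help.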

The Face of Big Data


Thanks to Mark Tisdale for the NYC Skyline

The Impact of Arming Teachers

Since the tragedy in Newtown, there’s been talk of arming teachers.

Reading things like this got me wondering if we could estimate the impact of this because surely it can’t be a free lunch.  Then I saw this great graphic about gun deaths vs gun ownership.  And he made the data available, so I built a simple linear model using R:

library(ggplot2)

guns = read.table("guns.csv", sep="\t", header=T)   # file name assumed; this read was missing
deaths = read.table("deaths.csv", sep="\t", header=T)
oecd = read.table("oecd.csv", sep="\t", header=T)
data = merge(guns, deaths, by="Country")
data$OECD = data$Country %in% oecd$Country
data.oecd = subset(data, data$OECD==T)

p <- ggplot(data = data.oecd, aes(x = Guns, y = Deaths)) +
  geom_smooth(method="lm", se=FALSE, color="blue", formula = y ~ x) +
  geom_point()

mylm = lm(Deaths ~ Guns, data=data.oecd)  # Deaths is the response, matching the fitted equation below


Deaths (per 100k people) = 0.599 + 0.089 * Guns (per 100 people).

Here again is a graph of the data with a plot of the linear model added:


This model has an R2 = 0.384 and p = 0.00015.  The residuals of this model are:


Not terrible – something’s going on with Mexico (row #42).  And the model overestimates slightly for larger values of x (Guns), but that’s probably due to Mexico.

It was then a simple matter to look up the number of teachers according to the U.S. Census:  7.2 million.

So, if we arm every teacher in America, that results in 7.2 million / 250 million * 100 = 2.88 additional guns per 100 people.  Plug that into the above linear model and we get:

2.88 * 0.089 ≈ 0.2563 additional deaths per 100k people.  So in the U.S. that translates to:

250 million / 100k * 0.2563 ≈ 641 additional deaths
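The back-of-the-envelope arithmetic is easy to reproduce. This sketch just chains the numbers from the post (the fitted slope, 7.2 million teachers, and a rounded U.S. population of 250 million):

```python
# Fitted model from the post: Deaths (per 100k) = 0.599 + 0.089 * Guns (per 100)
slope = 0.089            # extra deaths per 100k, per extra gun per 100 people
teachers = 7.2e6         # U.S. Census count of teachers
population = 250e6       # rounded U.S. population used in the post

extra_guns_per_100 = teachers / population * 100        # ~2.88
extra_deaths_per_100k = slope * extra_guns_per_100      # ~0.2563
total_extra_deaths = population / 100000 * extra_deaths_per_100k  # ~641
```

Only the slope matters here: the intercept of the model cancels out because we are estimating a marginal change, not an absolute level.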

Look, I would consider this to be a toy model.  The underlying data is from different points in time, there’s the Mexico thing, and also 641 is only an average – could be more; could be less.

My point is that we should use data and assess the net impact of any actions we take to prevent school shootings.