My Cafepress Shop – Steve’s Analytic Swag Shop

For months I’ve been wishing there were shirts, mugs, and other stuff with more “analytical” themes.

Please visit my shop. My only criterion for a design is that I have to think it’s funny.

Steve’s Analytic Swag Shop – buy t-shirts, mugs & gifts from my shop.


Netflix algorithm

The fact that the winning Netflix algorithm was never implemented doesn’t get enough publicity. It’s a good example of what modelers create when they aren’t shackled by engineering concerns. The other thing entrants in these competitions (like Kaggle) get to avoid is having to integrate their model into a workflow.

But I guess really these are only concerns for those of us who create models for other humans to use. If you are creating models that computers use to, say, trade stocks, knock yourself out!


Some Python to extract NCAA Team stats

It’s that time of year. In prep for an NCAA bracket clustering project I’ve had in the back of my head for the past year, here’s a snippet of Python to extract the team stats. Thank you, BeautifulSoup!

<pre>import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Conference name -> URL of its stats page (left blank here; fill in before running).
sites = {
    "BigTen": "",
    "BigEast": "",
    "ACC": "",
    "Pac-12": "",
    "Big-12": "",
    "MWC": "",
    "SEC": "",
    "Atlantic-10": "",
    "MVC": "",
    "WCC": "",
    "CUSA": "",
    "WAC": "",
    "Horizon": "",
    "BigWest": "",
    "MAC": "",
    "MAAC": "",
    "Sun-Belt": "",
    "Patriot": "",
    "Colonial": "",
    "Ivy": "",
    "OVC": "",
    "America-East": "",
    "Summit": "",
    "Northeast": "",
    "Southern": "",
    "Atlantic-Sun": "",
    "Southland": "",
    "Big-Sky": "",
    "Big-South": "",
    "MEAC": "",
    "Great-West": "",
    # "Independent": "",
    "SWAC": "",
}

header = ["Conf", "Rk", "School", "Conf_W", "Conf_L", "Conf_Pct",
          "Over_W", "Over_L", "Over_Pct", "PPG_Own", "PPG_Opp", "SRS", "SOS"]

with open("ncaa_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for conf, url in sites.items():
        if not url:  # skip conferences whose URL has not been filled in
            continue
        print("Processing", conf, "...")
        soup = BeautifulSoup(urlopen(url).read(), "html.parser")
        table = soup("table", {"class": "sortable stats_table"})[0]
        for row in table.tbody("tr"):
            tds = row("td")
            # Some conferences, like Sun-Belt, have two tables and an extra
            # header; this skips those rows.
            if len(tds) >= 12:
                rank = tds[0].string
                school = tds[1].find("a").string  # need to extract anchor text
                stats = [tds[x].string for x in range(2, 12)]
                writer.writerow([conf, rank, school] + stats)
</pre>


Correlation gets a bad rap

Correlation has been getting a bad rap lately. The fact that correlation does not imply causation may not be all that important for the purposes of prediction.

My favorite Correlation/Causation Paradox (CCP) is that ZIP codes with lots of churches tend to have more crime.  If your purpose is to prevent crime, you could probably do worse than to use the number of churches to decide how to deploy your police resources.

If, on the other hand, you decide that the best way to reduce crime is to destroy churches, you have CCP issues.

Just think of correlation as a black box, like neural nets. Not so good for explanation. Good for prediction!
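
Here is a toy simulation of the churches-and-crime idea. Everything in it is made up for illustration: the ZIP codes, the coefficients, and the assumption that population alone drives both the number of churches and the number of crimes. The helper names `pearson` and `simulate_zip_codes` are hypothetical too. It is only a sketch of how a non-causal variable can still carry predictive signal.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_zip_codes(n=500, seed=0):
    """Made-up ZIP codes where population alone drives churches and crimes."""
    rng = random.Random(seed)
    churches, crimes = [], []
    for _ in range(n):
        population = rng.uniform(1000, 50000)
        churches.append(population / 1000 + rng.gauss(0, 2))
        # Crime depends on population only; churches are never an input.
        crimes.append(population / 100 + rng.gauss(0, 30))
    return churches, crimes


churches, crimes = simulate_zip_codes()
r = pearson(churches, crimes)
print("correlation between churches and crimes: %.2f" % r)
```

In this toy world the church count is a fine guide to where the crime is, yet removing churches would change nothing, because the crime numbers are generated without ever looking at them.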

Password guessing

Interesting ideas for password guessing software.

I wonder whether any text analytic techniques, specifically stylometry, would be useful for reducing the number of passwords that need to be guessed. For example, does the writer invariant apply to Tweets?

You could monitor someone’s Tweets, or blog posts for that matter, and get an idea of some of the author’s invariant text properties, like average word length.

Maybe their passwords would also have similar properties.
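
A rough sketch of what that might look like, using average word length as the only invariant. The function names, the sample text, and the candidate list are all hypothetical, and whether passwords really share these properties with someone's writing is exactly the open question. This version reorders the guesses so the closest matches are tried first, rather than throwing any away.

```python
import re


def average_word_length(text):
    """Mean length of the alphabetic words in text (0.0 if there are none)."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def rank_candidates(candidates, target_length):
    """Sort candidate passwords so that those whose average word length is
    closest to target_length come first."""
    return sorted(
        candidates,
        key=lambda c: abs(average_word_length(c) - target_length),
    )


# Made-up sample text standing in for someone's Tweets or blog posts.
tweets = ["Thank you BeautifulSoup", "Good for prediction"]
profile = average_word_length(" ".join(tweets))

# Made-up candidate list standing in for a password dictionary.
guesses = rank_candidates(["cat", "monkey", "correlation"], profile)
print(guesses)
```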