NCAA 2013 Sleeper Report

I know surprisingly little about NCAA Men’s basketball, which is why my best bracket picks are randomly chosen (after weighting by seed). But I’ve been wanting to do a more “sophisticated” analysis of the relative strengths of the teams before filling out a bracket. Also, it seems like you need to find upsets to differentiate your bracket from everyone else’s. In the past my upsets were usually chosen by mascot strength (e.g. Badger > Horny Toad).

But this year is going to be different. Fortunately, two metrics are available for comparing the relative strength of each team:

  • SRS – Simple Rating System
  • SOS – Strength of Schedule

They are explained here. Also, I posted Python code here to pull each region and dump them all into a CSV file.

Right off the bat I want to plot these using R’s ggplot library.  Here the labels are Region and Seed:


The team with the highest SRS is E-1, the number one seed in the East, or Indiana. Strangely, the team with the toughest schedule, MW-3, Michigan State, is only a three seed. Actually, MW-1 (Louisville), MW-2 (Duke), and MW-3 make up a fairly close cluster from the Midwest Region.

Speaking of clusters, treating the data points as distances like this is exactly what k-means clustering does. Using prediction strength from R’s fpc package, I see that k = 2 is the only solution with a prediction strength > 0.80. So if we create a two-cluster solution and label each team by its cluster instead of Region and Seed, we get:


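For anyone who wants to see the clustering step spelled out, here is a rough Python 3 sketch of k-means with k = 2 on (SRS, SOS) pairs. It is plain Lloyd's algorithm, not the R code behind the plot, and it skips the prediction-strength check entirely. The labels and numbers in `teams` are invented placeholders, not real 2013 ratings, and the function name `kmeans` is just for illustration.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Invented (SRS, SOS) pairs keyed by a Region-Seed style label.
# These are placeholders for illustration only, NOT real ratings.
teams = {
    "A-1": (22.0, 9.5),
    "A-2": (20.5, 10.0),
    "A-3": (19.0, 11.0),
    "B-14": (5.0, 1.0),
    "B-15": (3.5, -0.5),
    "B-16": (1.0, -2.0),
}

names = list(teams)
points = [teams[name] for name in names]
labels, centroids = kmeans(points, k=2)

for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

The cluster numbers themselves are arbitrary; what matters is which teams end up sharing one.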
But back to our Region & Seeds.

The West seems to be weaker overall with W-2 (Ohio St) and W-5 (Wisconsin) looking like sleepers coming out of that Region.  I say they’re sleepers because they’re almost as high as W-1 (Gonzaga), but look like they’ve endured a tougher schedule.

In the South there are a number of teams in the upper right – S-3 (Florida), S-1 (Kansas), S-4 (Michigan), and S-11 (Minnesota).  It seems like when lower seeds like Minnesota rank respectably in SRS and SOS that might be a situation where you can differentiate your bracket by picking them as upsets.
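The sleeper rule of thumb I'm using here (SRS almost as high as the one seed, but a tougher schedule) can be written down as a small Python 3 sketch. The 2.0-point `srs_margin` is an arbitrary cutoff picked for illustration, the function name `find_sleepers` is made up, and the numbers in `example_region` are invented, not real ratings.

```python
def find_sleepers(region, srs_margin=2.0):
    """Return seeds whose SRS is within srs_margin of the one seed's SRS
    and whose SOS is higher than the one seed's.

    region maps seed -> (srs, sos) and must contain seed 1.
    """
    top_srs, top_sos = region[1]
    return sorted(
        seed
        for seed, (srs, sos) in region.items()
        if seed != 1 and srs >= top_srs - srs_margin and sos > top_sos
    )

# Invented (SRS, SOS) values for one region; placeholders only.
example_region = {
    1: (18.0, 4.0),
    2: (17.5, 9.0),
    3: (14.0, 6.0),
    5: (16.8, 9.5),
    12: (8.0, 2.0),
}

sleepers = find_sleepers(example_region)
print(sleepers)
```

Tightening or loosening `srs_margin` changes how many teams get flagged, so treat the output as a list of candidates to eyeball on the plot rather than as picks.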

So there you have a nice final four – Wisconsin, Florida, Michigan State, and Indiana.

Some Python to extract NCAA Team stats from

It’s that time of year. In prep for an NCAA bracket clustering project I’ve had in the back of my head for the past year, here’s a snippet of Python to extract the team stats. Thank you, BeautifulSoup!

<pre>import urllib2
import csv
from bs4 import BeautifulSoup

# Conference name -> URL of that conference's stats page
sites = {
    "BigTen": "",
    "BigEast": "",
    "ACC": "",
    "Pac-12": "",
    "Big-12": "",
    "MWC": "",
    "SEC": "",
    "Atlantic-10": "",
    "MVC": "",
    "WCC": "",
    "CUSA": "",
    "WAC": "",
    "Horizon": "",
    "BigWest": "",
    "MAC": "",
    "MAAC": "",
    "Sun-Belt": "",
    "Patriot": "",
    "Colonial": "",
    "Ivy": "",
    "OVC": "",
    "America-East": "",
    "Summit": "",
    "Northeast": "",
    "Southern": "",
    "Atlantic-Sun": "",
    "Southland": "",
    "Big-Sky": "",
    "Big-South": "",
    "MEAC": "",
    "Great-West": "",
    #"Independent": "",
    "SWAC": ""
}

f = open('ncaa_data.csv', 'w')

f.write("Conf, Rk, School, Conf_W, Conf_L, Conf_Pct, Over_W, Over_L, Over_Pct, PPG_Own, PPG_Opp, SRS, SOS\n")

for item in sites.keys():
    soup = BeautifulSoup(urllib2.urlopen(sites[item]).read())
    print "Processing ", item, "..."
    # only the first stats table on the page is used
    for row in soup('table', {'class' : 'sortable stats_table'})[0].tbody('tr'):
        tds = row('td')
        if len(tds) >= 12: #some conferences like Sun-Belt have two tables and an extra header - this skips those rows
            f.write(item); f.write(",")
            f.write(tds[0].string); f.write(",")
            f.write(tds[1].find('a').string) ; f.write(",")#need to extract anchor text
            for x in range(2,12):
                f.write(tds[x].string); f.write(",")