Some Python to extract NCAA Team stats from Sports-reference.com

It’s that time of year. In prep for an NCAA bracket clustering project I’ve had in the back of my head for the past year, here’s a snippet of Python to extract the team stats from sports-reference.com. Thank you BeautifulSoup!

<pre>import urllib2  # Python 2 only (urllib.request in Python 3)
import csv
from bs4 import BeautifulSoup

sites = {
 "BigTen" : "http://www.sports-reference.com/cbb/conferences/big-ten/2013.html#standings::none",
 "BigEast": "http://www.sports-reference.com/cbb/conferences/big-east/2013.html#standings::none",
 "ACC": "http://www.sports-reference.com/cbb/conferences/acc/2013.html#standings::none",
 "Pac-12": "http://www.sports-reference.com/cbb/conferences/pac-12/2013.html#standings::none",
 "Big-12": "http://www.sports-reference.com/cbb/conferences/big-12/2013.html#standings::none",
 "MWC": "http://www.sports-reference.com/cbb/conferences/mwc/2013.html#standings::none",
 "SEC": "http://www.sports-reference.com/cbb/conferences/sec/2013.html#standings::none",
 "Atlantic-10": "http://www.sports-reference.com/cbb/conferences/atlantic-10/2013.html#standings::none",
 "MVC": "http://www.sports-reference.com/cbb/conferences/mvc/2013.html#standings::none",
 "WCC": "http://www.sports-reference.com/cbb/conferences/wcc/2013.html#standings::none",
 "CUSA": "http://www.sports-reference.com/cbb/conferences/cusa/2013.html#standings::none",
 "WAC": "http://www.sports-reference.com/cbb/conferences/wac/2013.html#standings::none",
 "Horizon": "http://www.sports-reference.com/cbb/conferences/horizon/2013.html#standings::none",
 "BigWest": "http://www.sports-reference.com/cbb/conferences/big-west/2013.html#standings::none",
 "MAC": "http://www.sports-reference.com/cbb/conferences/mac/2013.html#standings::none",
 "MAAC": "http://www.sports-reference.com/cbb/conferences/maac/2013.html#standings::none",
 "Sun-Belt": "http://www.sports-reference.com/cbb/conferences/sun-belt/2013.html#standings::none",
 "Patriot": "http://www.sports-reference.com/cbb/conferences/patriot/2013.html#standings::none",
 "Colonial": "http://www.sports-reference.com/cbb/conferences/colonial/2013.html#standings::none",
 "Ivy": "http://www.sports-reference.com/cbb/conferences/ivy/2013.html#standings::none",
 "OVC": "http://www.sports-reference.com/cbb/conferences/ovc/2013.html#standings::none",
 "America-East": "http://www.sports-reference.com/cbb/conferences/america-east/2013.html#standings::none",
 "Summit": "http://www.sports-reference.com/cbb/conferences/summit/2013.html#standings::none",
 "Northeast": "http://www.sports-reference.com/cbb/conferences/northeast/2013.html#standings::none",
 "Southern": "http://www.sports-reference.com/cbb/conferences/southern/2013.html#standings::none",
 "Atlantic-Sun": "http://www.sports-reference.com/cbb/conferences/atlantic-sun/2013.html#standings::none",
 "Southland": "http://www.sports-reference.com/cbb/conferences/southland/2013.html#standings::none",
 "Big-Sky": "http://www.sports-reference.com/cbb/conferences/big-sky/2013.html#standings::none",
 "Big-South": "http://www.sports-reference.com/cbb/conferences/big-south/2013.html#standings::none",
 "MEAC": "http://www.sports-reference.com/cbb/conferences/meac/2013.html#standings::none",
 "Great-West": "http://www.sports-reference.com/cbb/conferences/great-west/2013.html#standings::none",
 #"Independent": "http://www.sports-reference.com/cbb/conferences/independent/2013.html#standings::none",
 "SWAC": "http://www.sports-reference.com/cbb/conferences/swac/2013.html#standings::none"

}

f = open('ncaa_data.csv', 'w')

# no spaces after the commas, so the header matches the data rows written below
f.write("Conf,Rk,School,Conf_W,Conf_L,Conf_Pct,Over_W,Over_L,Over_Pct,PPG_Own,PPG_Opp,SRS,SOS\n")

for item in sites.keys():
    # fetch and parse one conference page
    soup = BeautifulSoup(urllib2.urlopen(sites[item]).read())
    print "Processing ", item, "..."
    # the first sortable stats table on the page holds the standings
    for row in soup('table', {'class' : 'sortable stats_table'})[0].tbody('tr'):
        tds = row('td')
        if len(tds) >= 12:  # some conferences like Sun-Belt have two tables and an extra header - this skips those rows
            f.write(item); f.write(",")
            f.write(tds[0].string); f.write(",")
            f.write(tds[1].find('a').string); f.write(",")  # need to extract anchor text
            for x in range(2, 12):
                f.write(tds[x].string); f.write(",")
            f.write("\n")

f.close()  # the parentheses matter: a bare f.close never actually closes the file
</pre>
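The script above imports `csv` but then writes the commas by hand, which would break as soon as a school name contains a comma. Here's a rough sketch of how the URL building and the row writing could look with `csv.writer` instead. It's written for Python 3, and the names `conference_url`, `write_rows`, `BASE_URL` and `HEADER` are my own placeholders, not anything from the original script or from sports-reference.com. The sample row is made up purely to show the quoting.

```python
import csv
import io

# URL pattern copied from the entries in the `sites` dict; the helper names are hypothetical.
BASE_URL = "http://www.sports-reference.com/cbb/conferences/{slug}/{year}.html#standings::none"

HEADER = ["Conf", "Rk", "School", "Conf_W", "Conf_L", "Conf_Pct",
          "Over_W", "Over_L", "Over_Pct", "PPG_Own", "PPG_Opp", "SRS", "SOS"]


def conference_url(slug, year=2013):
    """Build a conference standings URL from its slug, e.g. 'big-ten'."""
    return BASE_URL.format(slug=slug, year=year)


def write_rows(out, conf, rows):
    """Write one CSV line per team; each item in `rows` is a list of cell texts."""
    writer = csv.writer(out, lineterminator="\n")
    for cells in rows:
        # csv.writer quotes fields that contain commas and writes None as an empty string
        writer.writerow([conf] + list(cells))


if __name__ == "__main__":
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(HEADER)
    # Made-up sample row, not real standings data.
    sample = [["1", "Example State, Main Campus", "14", "4", ".778",
               "26", "5", ".839", "70.1", "61.2", "20.00", "9.00"]]
    write_rows(buf, "BigTen", sample)
    print(conference_url("big-ten"))
    print(buf.getvalue())
```

With something like this, the big `sites` dict could shrink to a plain list of slugs, and the cell texts pulled out by BeautifulSoup could be passed to `write_rows` one conference at a time.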
