NCAA 2013 Sleeper Report

I know surprisingly little about NCAA Men’s basketball, which is why my best bracket picks are randomly chosen (after weighting by seed).  But I’ve been wanting to do a more “sophisticated” analysis to see the relative strengths of the teams before filling out a bracket.  Also, it seems like you need to find upsets to differentiate yourself from every other bracket.  In the past my upsets were usually chosen by mascot strength (e.g. Badger > Horny Toad).

But this year is going to be different.  Fortunately, there are two published metrics for comparing the relative strengths of each team:

  • SRS – Simple Rating System
  • SOS – Strength of Schedule

They are explained here.  Also, I posted Python code here to pull each region and dump them all into a csv file.

Right off the bat I want to plot these using R’s ggplot library.  Here the labels are Region and Seed:


The team with the highest SRS is E-1, the number one seed in the East, or Indiana.  Strangely, the team with the toughest schedule, MW-3, Michigan State, is only a three seed.  Actually, MW-1 (Louisville), MW-2 (Duke), and MW-3 make up a fairly close cluster from the Midwest Region.

Speaking of clusters, looking at the data like this, as if the points were distances, is exactly what k-means clustering does.  Using the prediction.strength() function from R’s fpc package, I find that two is the only value of k which results in a prediction strength > 0.80.  So if we create a two-cluster solution and use each team’s cluster for labels instead of Region and Seed we get:
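A minimal sketch of that clustering step, assuming a data frame with SRS and SOS columns like the scraped csv above (synthetic values stand in for the real data):

```r
# Sketch only: synthetic SRS/SOS values stand in for the scraped csv
library(fpc)

set.seed(42)
teams <- data.frame(SRS = c(rnorm(34, 18, 3), rnorm(34, 8, 3)),
                    SOS = c(rnorm(34, 8, 2),  rnorm(34, 2, 2)))

# prediction.strength() resamples the data and scores how reproducible
# each k-cluster solution is; values above ~0.8 count as "strong"
ps <- prediction.strength(teams, Gmin = 2, Gmax = 5)
ps$optimalk

# Two-cluster solution whose assignments replace the Region-Seed labels
km <- kmeans(scale(teams), centers = 2)
teams$cluster <- km$cluster
```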


But back to our Region & Seeds.

The West seems to be weaker overall with W-2 (Ohio St) and W-5 (Wisconsin) looking like sleepers coming out of that Region.  I say they’re sleepers because they’re almost as high as W-1 (Gonzaga), but look like they’ve endured a tougher schedule.

In the South there are a number of teams in the upper right – S-3 (Florida), S-1 (Kansas), S-4 (Michigan), and S-11 (Minnesota).  It seems like when lower seeds like Minnesota rank respectably in SRS and SOS that might be a situation where you can differentiate your bracket by picking them as upsets.

So there you have a nice final four – Wisconsin, Florida, Michigan State, and Indiana.


The Impact of Arming Teachers

Since the tragedy in Newtown, there’s been talk of arming teachers.

Reading things like this got me wondering whether we could estimate the impact, because surely it can’t be a free lunch.  Then I saw this great graphic about gun deaths vs gun ownership.  And he made the data available, so I built a simple linear model using R:

guns = read.table("guns.csv", sep="\t", header=T)   # filename assumed; this read was missing
deaths = read.table("deaths.csv", sep="\t", header=T)
oecd = read.table("oecd.csv", sep="\t", header=T)
data = merge(guns, deaths, by="Country")
data$OECD = data$Country %in% oecd$Country
data.oecd = subset(data, data$OECD==T)

library(ggplot2)
p <- ggplot(data = data.oecd, aes(x = Guns, y = Deaths)) +
  geom_smooth(method="lm", se=FALSE, color="blue", formula= y ~ x) +
  geom_point()

mylm = lm(Deaths ~ Guns, data=data.oecd)   # Deaths is the response, per the equation below


Deaths (per 100k people) = 0.599 + 0.089 * Guns (per 100 people).

Here again is a graph of the data with a plot of the linear model added:


This model has an R² = 0.384 and p = 0.00015.  The residuals of this model are:


Not terrible – something’s going on with Mexico (row #42).  And it’s overestimating slightly for larger values of x (Guns), but that’s probably due to Mexico.
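For reference, the quoted R² and p-value come straight out of the fitted lm object; here is the pattern with stand-in data (the values below are simulated, not the original dataset):

```r
# Stand-in data; the real model was fit on the OECD gun/death figures
set.seed(7)
guns   <- runif(30, 5, 90)
deaths <- 0.6 + 0.09 * guns + rnorm(30, sd = 1)
m <- lm(deaths ~ guns)

s <- summary(m)
s$r.squared                          # the R^2 statistic (0.384 on the real data)
s$coefficients["guns", "Pr(>|t|)"]   # p-value for the slope
plot(m, which = 1)                   # residuals vs fitted (where Mexico stood out)
```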

It was then a simple matter to look up the number of teachers according to the U.S. Census:  7.2 million.

So, if we arm each teacher in America, that results in 7.2 mil / 250 mil * 100 = 2.88 additional guns per 100 people.  Plug that into the above linear model and we get:

2.88 * 0.089 = 0.2563 additional deaths per 100k people.  So in the U.S. that translates to:

250 million / 100k * 0.2563 ≈ 641 additional deaths
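The back-of-the-envelope above, as code (the constants are the post’s round numbers, not precise figures):

```r
slope      <- 0.089   # fitted deaths per 100k, per +1 gun per 100 people
teachers   <- 7.2e6   # U.S. Census teacher count used above
population <- 250e6   # round U.S. population figure used above

extra_guns_per_100    <- teachers / population * 100          # 2.88
extra_deaths_per_100k <- slope * extra_guns_per_100           # ~0.256
extra_deaths          <- population / 1e5 * extra_deaths_per_100k
round(extra_deaths)                                           # 641
```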

Look, I would consider this to be a toy model.  The underlying data is from different points in time, there’s the Mexico thing, and 641 is only an average – it could be more; it could be less.

My point is that we should use data and assess the net impact of any actions we take to prevent school shootings.

Factcheck and Obama vs Romney

I was curious to see how often various news sites make use of Factcheck, and specifically how often “factcheck” was used in the same article with either party’s leading candidate.  Since Factcheck seems to hammer both sides equally on their loose use of facts and what they mean, I’m assuming it is truly bipartisan.

Before I continue, you should know I’m firmly entrenched in the middle of the political spectrum.  And I think most of America is probably within one standard deviation of the middle – this isn’t some bold political statement – it’s just the bell curve.

With that in mind, I fully expected to find that the news sites were using Factcheck equally, with each side using it to support their representation of “the facts”.  So I expected Fox to use Factcheck to support Romney and MSNBC to use it to support Obama.  I wasn’t sure about CNN, but I was interested in the result because it seemed to me that they were more fair in their coverage.

I did four Google searches against Fox, MSNBC, and CNN.  The searches were:

Factcheck Only:  [ “factcheck” -obama -romney]

Obama:  [ “factcheck” obama -romney]

Romney:  [ “factcheck” -obama romney]

Both:  [ “factcheck” obama romney]

For MSNBC and CNN, I ran the same searches against those sites.  It’s important to exclude terms (e.g., “-romney”) and also to put “factcheck” in quotes, otherwise Google thinks you really meant “fact check”.

The results were:

Source   R    O    B    F
fox      2   47   29   22
msnbc    2   94  101    7
cnn      0    2   17    2


R = Factcheck and Romney and Not Obama

O = Factcheck and Obama and Not Romney

B = Factcheck and Obama and Romney

F = Factcheck and Not Obama and Not Romney

To visualize, I did a quick barplot in R (the counts are the table above):

counts = matrix(c(2, 47, 29, 22,
                  2, 94, 101, 7,
                  0,  2, 17,  2), nrow=4)
barplot(counts, beside=TRUE,
        legend=c("Factcheck + Romney","Factcheck + Obama", "Both", "Factcheck Only"),
        col=c("red","blue", "purple","brown"), names.arg=c("fox", "msnbc", "cnn"))


I see a few things.  Both Fox and MSNBC are about equally likely to mention Factcheck with both candidates OR Factcheck with Obama alone.  CNN makes fewer references to Factcheck, but when it does, its articles mention both candidates.  I like that – it supports my perception of CNN.

But most stunning from this graphic is that “Romney” alone is rarely mentioned with “factcheck” at all.  In fact, if you search the entire web:

[“factcheck” romney -obama] returns 27,300 results
[“factcheck” -romney obama] returns 356,000 results

Now, the easy answer is that Obama has been president for the past four years – there are more articles about him out there.  Still, Fox News has two articles mentioning only Romney and Factcheck.  In case you missed it: TWO.

And it’s indicative of a major problem I have with Romney.

He’s not really saying anything.  No facts.  No plans.  Just “Elect me because Obama sucks!”.

Impact of weather at Fenway

These plots come from my thesis dataset, which combines 40 years of games at Fenway Park with the weather at the start of each game.  They were made using R’s radial.plot() from the plotrix package.

First is the number of games held “per Direction”. So we can see that the most common direction for winds at Fenway is from the Southwest at 200 degrees:
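A sketch of how such a wind rose can be drawn with radial.plot(); the direction data here is simulated, skewed toward the southwest like the real counts:

```r
library(plotrix)

# Simulated game-time wind directions, skewed toward the southwest (~200 deg)
set.seed(1)
bins <- seq(0, 350, by = 10)
counts <- table(factor(sample(bins, 3000, replace = TRUE,
                              prob = dnorm(bins, mean = 200, sd = 60)),
                       levels = bins))

radial.plot(as.numeric(counts),
            radial.pos = bins * pi / 180,    # degrees -> radians
            clockwise = TRUE, start = pi/2,  # compass-style orientation
            rp.type = "p", line.col = "blue",
            main = "Games per wind direction (simulated)")
```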

This actually has me a little worried.  Though it’s hard to tell from the graphic, the lowest count is 23 games at due north (360 degrees), but you can generally see that relatively few games are played with winds from the north-northeast.

If we then take some baseball statistics, like hits, and divide them by the per-direction game counts above, we get this:

This seems to make sense:

  • More HR’s get hit when the wind is blowing out – well, duh
  • Interestingly, I’d expected the number of singles to be roughly equal for all directions, but there are those two spikes at 20 and 40 degrees.  When the wind is blowing in, do they take a break from swinging for the fences?
  • Could the builders of Fenway have oriented the park to take advantage of prevailing winds? I’ll have to do some research.

Growing up in New England, I’ve always equated “nor’easter” with a winter storm – and they don’t play many games in the winter or during nor’easters.  Still, the dearth of games which actually have winds from the north bothers me.

UPDATE (3/27/2012):  In Glenn Stout’s book, “Fenway 1912”, he says the park was oriented to match the previously existing field at the site – Huntington Avenue Grounds.  Also, rule 1.04 of MLB’s official rules states “it is desirable that the line from home base through the pitcher’s plate to second base shall run East-Northeast.”  This is supposedly so the setting sun shines in right field, where fewer balls are hit.  So it makes sense that an older park like Fenway would be built to follow this rule.

UPDATE (3/30/2012):  Dr. Steve at CCSU Weather tells me that my wind rose looks exactly as it should, since the weather data I used came from the Logan Airport station, which, due to its position on the coast, tends to have only land or sea breezes.  Also, as I stated above, they don’t play many Sox games during nor’easters.  The weather at Fenway and at Logan is probably not exactly the same, but I think the two will be close enough AND consistently different.  Still, there is no weather station at Fenway, and the game-time reports I’ve seen are all over the place (literally, when compared to the weather at Logan), so if the conditions are not consistent between the two locations, I’ll probably find no effects on runs scored.


Analytics in the work place

The other day I had a demo from our research group on Attensity.  After some successful text analytics projects using Perl and R (and one that just didn’t work because of the large volume of claim notes), we’re looking for something that can handle higher volumes.

We were told that due to its complexity and cost of licenses, we would not be allowed to use Attensity ourselves but rather would have to work thru the research unit.  Obviously, this pained me, but if that is the most cost effective approach then I’m all for it.

But I don’t think this is cost effective at all and here’s why.

Analytics today is still a youngster in the corporate world and the way we do it now is comparable to the way IT was done years ago.  It used to be that if you wanted any kind of programming done you had to go to the IT group, your work was prioritized (i.e. you made and presented some kind of cost benefit analysis), and if there were enough resources your project slogged forward.  By the time it was done, the resulting software may or may not have met your needs (which might have changed since you started).  Hence the Agile craze, but I digress.

Compare IT of yesterday to today.  I can do 99% of my job with the tools on my desktop – as much as possible on the back-end with Oracle and SQL Server, and then whatever else needs to be done with Excel/VBA, SAS (technically not on my desktop, but whatevs), Perl, and R.

Imagine if I had to go to IT every time I needed a new Excel macro.  The time (cost) to get work done would be outrageous.  Sure, there end up being a lot of customer-created “applications” out there, some of which get turned over to IT (much to their horror).  But what’s often forgotten is that only the top, most useful processes ever make it to “the show”.  IT may have to figure out how to turn Access SQL into Oracle SQL, but so what – think of all the savings – the business analyst’s job is practically done.  And IT never had to spend one second on the 80-90% of processes that either serve their purpose for a short time or are perhaps more experimental in nature.

So that brings us back to analytics today.  At some point some kind of analytics will be a tool included for free with Windows, just like VB Script is today.  And every reason you can give that this won’t happen has had a parallel reason in IT:

  • It’s too complicated (writing Windows apps used to be complicated; now it’s easy)
  • They’ll make mistakes – yup, they will.  But some of it will make it to The Show.
  • The software is too expensive – there are already free tools out there.

I’m not saying Enterprise Miner is going to be included in Windows 10.  But how about a simple decision tree?  While not the most powerful technique, I like to do a quick CART tree just to visualize some of the relationships (linear or not) in a dataset.  Really, you could say that Microsoft is already including data mining for very little additional cost in SQL Server.
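The kind of quick CART tree I mean is already free in R via rpart (one of the recommended packages that ships with R); iris stands in here for a real business dataset:

```r
library(rpart)

# A quick CART tree to eyeball relationships (linear or not) in a dataset
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)             # complexity table: which splits earn their keep
plot(fit, margin = 0.1)  # draw the tree
text(fit, use.n = TRUE)  # label splits and leaf counts
```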

The reason I know this to be true is innovation.  There’s no way you can innovate with analytics by having to go thru some research unit.  The nature of innovation is such that 90% of what you try is going to suck.  As you get better, maybe you can bring that down to 85% (ok, maybe 88%).  Nobody is going to fund a project that has a 90% chance of sucking, thus the whole arrangement of having to go to a research unit to innovate will never last – either the practice or the company will end.

Luckily, our company is also carefully opening up to the use of Free and Open Source Software (FOSS), which is why we’re looking at using GATE for our large project.

R implementation of Stepwise Hybrid Feature Selection

From Dziuda’s Data Mining for Genomics and Proteomics

I had mentioned doing this while taking the Genomics class, and several people expressed interest in seeing it.  I was going to do it in some other language, but since taking Linear Models I’m much more comfortable in R.  Also, I found someone’s implementation of the T2 function, so I modified it (the website is in the notes).

The code contains comments and call examples.  Also, there’s another function at the bottom, WriteT2Values(), which writes a file of the T2 value for each column.  Per the note on the bottom of p. 147, instead of starting with any random variable, it’s probably a better idea to choose from, say, the top 1000.

I’ve been playing around with it for text mining – sometimes solve() errors out because the matrix is singular (I assume because the matrices in text mining are much sparser than in genomics).
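That solve() failure is easy to reproduce: in a sparse term matrix a column can be constant, which makes the pooled covariance singular.  One workaround (my assumption, not part of the algorithm in the book) is a Moore-Penrose pseudoinverse:

```r
# A term that never varies makes the covariance matrix singular
X <- matrix(c(1, 0,
              2, 0,
              3, 0), ncol = 2, byrow = TRUE)
C <- cov(X)              # second row/column is all zeros
# solve(C) stops with a "system is exactly singular" error:
failed <- inherits(try(solve(C), silent = TRUE), "try-error")

# Workaround (an assumption, not what Dziuda's algorithm specifies):
library(MASS)
Cinv <- ginv(C)          # pseudoinverse always exists
```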

I would appreciate any feedback or suggestions for improvement:

The code:

#Stepwise Hybrid Feature Selection
SHFS = function(TrainingData, stop_p, stop_T2, excludeColumns, classCondition, randomFlag=TRUE, J=2) {
#-- Input
#-- TrainingData
#-- J:  number of classes, default=2
#-- stop_p:  Stop when the number of parameters exceeds stop_p
#-- stop_T2: Stop when T2 exceeds stop_T2
#-- randomFlag:  TRUE or FALSE - starting point flag; default=TRUE
#-- excludeColumns:  columns to exclude from consideration, like id or class indicator
#-- classCondition:  a condition which differentiates the classes, e.g. quote(Hyp_Ind=='true')
#-- Call:
#--  SHFS(TrainingData = poe, stop_p = 5, stop_T2 = 10000, excludeColumns=c(1,2), 
#--         classCondition=quote(author=='Doyle'))

#-- Possible speed gains (?):
#-- Make tsquare use global TrainingData
#-- get rid of printing

N = length(TrainingData[,1])

pool <<- seq(1, ncol(TrainingData))
pool <<- pool[-excludeColumns]
poolSize = length(pool)
p = length(pool)

currentSet <<- NULL

markerSize = 0
currentT2 = 0.0

step = 1

cat("Starting : N=", N, ", J=", J, ", p=", p, ", stop_T2 =", stop_T2, ", stop_p =", stop_p,  ", markerSize =", markerSize, "\n")

while ( markerSize < stop_p  && currentT2 < stop_T2 && (markerSize < (N-J-2) || (N-J-2) == 0 ) && markerSize < p ) {
	maxGain = 0.0
	markerSize = markerSize + 1
	if (markerSize == 1 && randomFlag == TRUE) {
		# first time thru loop and caller wants to start with a random variable
		#-- Comment/Uncomment these lines depending on
		#-- which method of selecting a random var you want to use
		#selectedVar = as.integer(runif(1, 1, poolSize))
		selectedVar = selectRandomVar(1000)
		cat("selectedVar = ", selectedVar, "\n")
		# the T2 of the random starting variable is this step's gain
		maxGain = tsquare(c(currentSet, selectedVar), TrainingData, classCondition)
	} else {
		# Forward selection:  add the variable that maximizes T^2 of this step
		for (i in pool) {
			#-- call tsquare with currentSet plus i appended onto the end
			deltaT2 = tsquare(c(currentSet, i), TrainingData, classCondition) - currentT2
			if (deltaT2 >= maxGain) {
				maxGain = deltaT2
				selectedVar = i
			}
		} #for (i in pool)
	} #if (markerSize...)
	poolToMarker(selectedVar)	# move the winner from the pool into the marker set
	currentT2 = currentT2 + maxGain
	poolSize = poolSize - 1

	minLoss = currentT2
	#-- Backward optimization:  if elimination of any variable results in T2 greater
	#--  than that of previous step, eliminate one that minimally decreases T2
	if (markerSize > 2) {
		for (i in currentSet) {
			#-- call tsquare with currentSet without i
			deltaT2 = currentT2 - tsquare( currentSet[currentSet != i], TrainingData, classCondition )
			if (deltaT2 <= minLoss) {
				minLoss = deltaT2
				removedVar = i
			}
		}
		if (minLoss < maxGain) {
			markerToPool(removedVar)	# return the weakest variable to the pool
			poolSize = poolSize + 1
			markerSize = markerSize - 1
			currentT2 = currentT2 - minLoss
			#? saveSet(currentSet)
		}
	} # if (markerSize > 2)

	cat("Step=", step, "poolSize=", poolSize, ", currentT2=", currentT2, ", markerSize =", markerSize, "\n")
	step = step + 1
} #while

cat("currentT2=", currentT2, "\n")
return (currentSet)

} #function

selectRandomVar = function (n) {
# Select a random column from the top n (file written by WriteT2Values)

cols = read.csv("C:\\out-T2.csv")

cols = head(cols, n)

rowIdx = sample(nrow(cols), 1)

return(cols[rowIdx, 2])
} #function

poolToMarker = function(theVar) {
	#-- Add to currentSet
	currentSet <<- c(currentSet, theVar)
	#-- Remove from pool
	pool <<- pool[!pool==theVar]
} #function

markerToPool = function(theVar) {
	#-- Add back to the pool
	pool <<- c(pool, theVar)
	#-- Remove from currentSet
	currentSet <<- currentSet[!currentSet==theVar]
} #function

tsquare<-function(list1, data, classCondition){
#-- Function calculates Hotelling's T-square
#-- Adapted from:
#-- Changed from original to accept a list of columns - this way it can
#-- be called iteratively to try different sets of columns.  Also added
#-- the classCondition

#-- Inputs:
#-- list1 - a list of columns which should be used - this is so we don't have to use all of the dataframe, data
#-- data: a dataframe
#-- classCondition:  the "where" clause which differentiates the two classes (e.g. "Control_Ind == 'Y' ")

#-- Call:
#-- > tsquare(c(1,2,3,4), test, quote(Hyp_Ind=='true'))
#-- [1] 249.4002

#1) Make sure they are both matrices

group1 <- as.matrix(subset(data, select=list1, subset=(eval(classCondition))))
group2 <- as.matrix(subset(data, select=list1, subset=(!eval(classCondition))))

#-- Add matrix of all 1's to each group
#group1 <<- group1 + matrix(1, nrow=length(group1[,1]), ncol=length(group1[1,]))
#group2 <<- group2 + matrix(1, nrow=length(group2[,1]), ncol=length(group2[1,]))

#2) Find the mean vector and covariance matrix for the first group#
n1 <- length(group1[,1])
xbar1 <- apply(group1,2,mean)
C1 <- cov(group1)
#3) find the mean vector and covariance matrix for the second group#
n2 <- length(group2[,1])
xbar2 <- apply(group2,2,mean)
C2 <- cov(group2)
#4) find the pooled covariance matrix and its inverse#
C <- ((n1-1)*C1+(n2-1)*C2)/(n1+n2-2)
Cinv <- solve(C)  

#5) multiply to find the T-square stat#
Tsquare <- (n1*n2)*t(xbar1-xbar2)%*%Cinv%*%(xbar1-xbar2)/(n1+n2)
#-- Optionally convert T2 to an F statistic and p-value:
#p <- length(group1[1,])
#F <- (n1+n2-p-1)*Tsquare/((n1+n2-2)*p)
#p.value <- 1 - pf(F,df1=p,df2=n1+n2-p-1)
return(as.numeric(Tsquare))
} #function

WriteT2Values = function(data, excludeColumns, classCondition, outFile) {
#-- Write the top n T2 values to a file
#-- Inputs:
#--		excludeColumns - any columns like index and class Condition we don't want to include
#--		classCondition - the condition which separates the classes
#--		outFile - name of output file
#-- Call:
# WriteT2Values(someData, c(1,2), classCondition=quote(author=='Doyle'), "C:\\data\\out-T2.csv")

pool <<- seq(1, ncol(data))
pool <<- pool[-excludeColumns]

DF = data.frame()

for (i in pool) {
		T2 = tsquare(i, data, classCondition)
		newRow = data.frame(col=i, T2Value = T2)
		DF <- rbind(DF, newRow)
} #for

write.csv(DF[order(DF$T2Value, decreasing=TRUE),], outFile)
} #function