About

My name is Steve Cultrera and I’m a student in Central Connecticut State University’s Masters of Data Mining program.  But most of my time is spent working for the man in the actuarial department of a large insurance company.  In the past I have been a paperboy, bartender, Microsoft developer, and for eight years in the Air Force I maintained security systems around nuclear weapons facilities – you haven’t lived until you’ve accidentally set off all the alarms at an ALCM site.

I’m currently working on my thesis.  I took the data from 40 years of Boston Red Sox games at Fenway Park and combined it with the weather at game time to see what kind of impact weather has on baseball.  I’m also interested in answering a question that has vexed Sox fans for almost 20 years – what impact does weather have on Tim Wakefield’s knuckleball?

With my degree wrapping up, I’m starting a blog because I really love data mining.  I think if I won the lotto I would just search through datasets with Perl, R, and RapidMiner for the rest of my life.

5 Responses to About

  1. Shai says:

    Hi,

    My name is Shai. I am successfully using the function you wrote to analyze cancer expression patterns.
    However, the function gets stuck whenever it encounters a marker set for which the matrix inversion (in Tsquare) cannot be completed because the matrix is singular. I can repeat the analysis and usually get results, but this means that I omit several of the ensemble markers whose matrix is singular (and which may still be informative).

    I thought of using a pseudoinverse as an alternative (SAS does this in linear discriminant analysis when the matrices are singular). In R the function is “pseudoinverse”.

    What do you think of this?

    Thanks,

    Shai
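A minimal sketch of the pseudoinverse idea raised in this comment, written in Python with NumPy rather than R. It is not the function from the blog: `t_square` and the toy data are hypothetical, and it assumes a standard two-sample Hotelling T² with a pooled covariance matrix. `numpy.linalg.pinv` computes the Moore-Penrose pseudoinverse, which matches the ordinary inverse when the matrix is invertible and still returns a result when it is singular.

```python
import numpy as np


def t_square(group_a, group_b):
    """Two-sample Hotelling T^2 (hypothetical helper, rows = samples, columns = markers)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = a.shape[0], b.shape[0]

    # Difference between the two group mean vectors
    diff = a.mean(axis=0) - b.mean(axis=0)

    # Pooled within-group covariance; atleast_2d covers the single-marker case
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    pooled = ((n_a - 1) * cov_a + (n_b - 1) * cov_b) / (n_a + n_b - 2)

    # Pseudoinverse instead of inverse, so a singular pooled matrix does not stop the run
    pooled_pinv = np.linalg.pinv(pooled)
    return float((n_a * n_b) / (n_a + n_b) * (diff @ pooled_pinv @ diff))


# Toy data: the second marker is exactly twice the first, so the pooled
# covariance matrix is singular and an ordinary inverse does not exist.
group_a = [[1, 2], [2, 4], [3, 6]]
group_b = [[4, 8], [6, 12], [5, 10]]
t2_singular = t_square(group_a, group_b)
print(t2_singular)
```

In this toy case the redundant marker adds no information, so the score should agree with the one computed from the first marker alone. Whether a pseudoinverse is appropriate for a real marker set is a separate statistical question that this sketch does not settle.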

    • dmfunzone says:

      Hi Shai! Thanks for the feedback – to be clear, I simply implemented an algorithm from Dziuda’s “Data Mining for Genomics and Proteomics”.

      A similar thing was happening to me when I tried to use the algorithm for text mining – but in text mining you can have very sparse matrices (i.e. many fields have 0), and thus you can get non-invertible matrices. To be honest, I didn’t think it would happen in genomics b/c I didn’t think the matrices were sparse. BUT, I am not an expert in that field by any means. Still, for text mining I was just adding 1 to all the fields to ensure there were no zero values. Since you are just comparing two groups I don’t think this impacts the results. A similar thing is done with a Naive Bayes classifier (see Mitchell’s “Machine Learning” Chapter 6).

      I will post your question out to CCSU’s LinkedIn group and see if I can get a better response.

  2. dmfunzone says:

    BTW, the technique I reference is called Laplace smoothing (also known as add-one smoothing).
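For readers unfamiliar with the add-one trick mentioned in this thread, here is a small Python sketch of how it is commonly applied to Naive Bayes word probabilities. The function name `smoothed_probabilities`, the `alpha` parameter, and the toy vocabulary are illustrative assumptions, not code from the blog.

```python
from collections import Counter


def smoothed_probabilities(tokens, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing (alpha=1 is add-one)."""
    # Count only tokens that belong to the vocabulary
    counts = Counter(t for t in tokens if t in vocabulary)
    # Every vocabulary word gets alpha added, so the denominator grows by alpha * |V|
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}


# Toy example: "rain" never appears in this class, yet it keeps a non-zero probability
vocabulary = ["ball", "bat", "rain"]
tokens = ["ball", "ball", "bat"]
probs = smoothed_probabilities(tokens, vocabulary)
print(probs)
```

Without the added count, an unseen word would get probability 0 and wipe out the whole product in a Naive Bayes score. With it, the estimates still sum to 1 across the vocabulary.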

  3. Tarek says:

    Hi, Steve – my name is Tarek. I’m interested in using your code to analyze genomic data. I’m pretty new to R and learning “to sail while already underway”. Would you happen to have — say, 10 mins., max — for me to ask a few questions about your code? We could do it by phone or via whatever messaging system you’d like. If not, I understand — busy time of year and all!

    B/t/w, we are of like mind(s) re: correlation getting a bad rap. I’m guessing you’ve already seen/heard Edward Tufte on the subject, but just in case you haven’t…

    ‘Tufte suggests that the shortest true statement that can be made about causality and correlation is one of the following:

    “Empirically observed covariation is a necessary but not sufficient condition for causality.”

    “Correlation is not causation but it sure is a hint.” ‘
