Five reasons you've probably never heard for why you should learn SQL

Everyone knows SQL is the “lingua franca” of the database world.  But here are some reasons for learning it that you may never have heard before.

#1 – Push it toward the backend

Try to say that in a meeting and not crack a smile (no pun intended).  In the corporate world, any database of significance is running on another server (or several).  So any data transformation you can do in SQL is essentially being pushed off to another machine.  That’s like getting a free computer to do your bidding.  Yeah, I know everyone is sharing the other computer, but it’s probably a big one.  There’s an amazing amount you can do with SQL – pivot tables (aka crosstabs, aka lots of other names), on-the-fly normalization, fixing missing values, summarization, and more.  And if you’re using Oracle there are Oracle’s analytic functions – not analytics in the traditional sense, but an extension of SQL that lets you analyze and summarize transactional data.  They aren’t always useful, but when they work, a little bit of SQL can replace a lot of post-processing.
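To make that concrete, here’s a minimal sketch of pushing a crosstab to the backend from R via DBI.  The DSN, the claims table, and its columns are all made up – the point is that the pivot runs on the database server, and only the small summary comes back across the wire.

    library(DBI)

    # Hypothetical connection -- substitute your own driver and DSN.
    con <- dbConnect(odbc::odbc(), dsn = "claims_dw")

    # The pivot/crosstab runs on the database server, not your desktop.
    sql <- "
      SELECT region,
             SUM(CASE WHEN claim_year = 2013 THEN paid ELSE 0 END) AS paid_2013,
             SUM(CASE WHEN claim_year = 2014 THEN paid ELSE 0 END) AS paid_2014,
             COUNT(*)                                              AS n_claims
        FROM claims          -- hypothetical table
       GROUP BY region"

    by_region <- dbGetQuery(con, sql)  # only the summarized rows come back
    dbDisconnect(con)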

#2 – Let the nerds help you

Anyone doing serious corporate data munging is bringing together multiple data sources, any one of which could have dozens (a data warehouse) to hundreds (an operational database) of tables.  There’s no way you can be an expert in all of them.  SQL allows you to more easily make use of your friendly local data experts (FLDEs).  Have a query that isn’t returning the proper results?  Paste it into an email and send it to an FLDE.  If you’re downloading a table and doing a bunch of processing in VBA/Python/Perl/R, you’re not going to be able to send that to many FLDEs for help or advice.

#3 – It’s closer to Production Ready

With any luck, someday some of your processes will become valuable to others.  If and when that happens, the more of your processing that is done in SQL, the cheaper and quicker it will be to make your process available to lots of people (i.e. “productionize” it).  There’s a huge difference between sitting down with the IT folks to show them your Perl scripts and sitting down to show them your SQL.  I’ve seen Perl scripts induce looks of horror and disdain.  SQL will usually get you “meh, that’ll be two weeks” – which is exactly what I like to hear.

#4 – How you build your house

This is more a side effect of #2 and #3.  There’s an old saying in woodworking: “You can build a house with just a hammer and a screwdriver, but what’s it going to look like?”  The act of using SQL and discussing your plans with FLDEs invariably turns up alternate ways of munging the data that needs munging.  I’m always surprised at how generally smart people will opt to create their own solutions using only the tools they’re comfortable with (e.g. VBA – BTW, that stands for Visual Basic for Applications and is Excel’s scripting language) rather than asking around IT for better/quicker/easier solutions.  I’m all for being a Data MacGyver – heck, it’s what I do.  But very often the data you need is already available elsewhere, or the tables you’re joining in Microsoft Access can be brought together on the backend by coordinating the creation of a database link.  SQL facilitates discussions with FLDEs, and their FLDE-ness will rub off on you.

#5 – FLDEs will love you!

Well, I ain’t going to lie – many of them will hate you for bothering them.  But the smart ones will love you.  They know that deep down inside you’re their customer, and a lot of this munging is stuff that possibly should go into a data mart anyway.  You’re essentially on the front lines doing free business analysis for IT, and this is the way you should present it to them (in case they don’t notice) – though you probably don’t want to literally say “I’m doing free business analysis for you, so help me” – be tactful, geek!  But eventually they may start contacting YOU – “Will you find this new table useful?”, that kind of stuff.  FLDEs all know that in a pinch, when the business mandates some new functionality be put in place by next month (or whatevs), they’d much rather get handed a bunch of SQL.  Helping you helps them.

There you have them.  You wouldn’t go to Italy without learning a little Italian, “the beautiful language”.  Why on Earth would you do data munging without knowing everything that can be done in SQL?

SQL is truly “la bella lingua”.


Speak business

This Raconteur piece mentions learning to “speak business” as a critical skill for the data scientist.  For most, I would recommend first just learning to speak.  Regardless of the domain, business is about being an informative, convincing, and entertaining speaker.  Yeah, I said entertaining – sue me.  The best model in the world is useless if you can’t convince the business line to use it, and if your presentation is boring as sh*t, you’re not going to convince anyone!

You don’t need to be a lifelong Toastmasters member – just work thru the first ten speeches.  It’ll probably change your business life.

GLM, GLM, GLM

Love this interview.  I didn’t realize Jeremy Howard, Chief Scientist at Kaggle, comes from an insurance background.  He gives out some nuggets of info on predictive modeling in insurance.  He makes a point I’ve been shouting from the rooftops to whatever actuaries will listen: predictive modeling of tough problems is so much more than GLM!  The best overall model is usually a combination of techniques.  As I talk to people in the actuarial world, the general feeling seems to be that predictive modeling starts and ends with GLM, but it’s just one piece of the puzzle.
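To show what “more than GLM” can mean, here’s a minimal sketch in R of blending a GLM with a gradient boosted machine.  The train/test data frames and the 0/1 response y are placeholders, and the 50/50 weights are arbitrary – this isn’t anyone’s production recipe, just the simplest version of combining techniques.

    library(gbm)  # gradient boosted trees

    # Hypothetical data: train/test data frames with a 0/1 response y.
    glm_fit <- glm(y ~ ., data = train, family = binomial)
    gbm_fit <- gbm(y ~ ., data = train, distribution = "bernoulli",
                   n.trees = 500, interaction.depth = 3)

    p_glm <- predict(glm_fit, newdata = test, type = "response")
    p_gbm <- predict(gbm_fit, newdata = test, n.trees = 500, type = "response")

    # Naive 50/50 blend -- in practice, tune the weight on a holdout set.
    p_blend <- 0.5 * p_glm + 0.5 * p_gbm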

Another piece I often see missing in the actuarial world is unsupervised techniques – clustering, principal component analysis (PCA), self-organizing maps (SOM) – used to create new variables that feed downstream models.  I’m often surprised at how some crappy-seeming cluster or principal component gets identified as an important variable by a downstream algorithm.
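A minimal sketch of what I mean, assuming a made-up numeric predictor data frame X and response y: derive a cluster id and a principal component, then let the downstream supervised model judge whether they carry signal.

    # Hypothetical inputs: numeric predictor data frame X, 0/1 response y.
    km <- kmeans(scale(X), centers = 5)   # crude segmentation
    pc <- prcomp(X, scale. = TRUE)        # principal components

    aug <- X
    aug$cluster <- factor(km$cluster)     # cluster id as a new factor variable
    aug$pc1     <- pc$x[, 1]              # first principal component
    aug$y       <- y

    # The downstream model decides if the "crappy-seeming" unsupervised
    # variables actually matter.
    fit <- glm(y ~ ., data = aug, family = binomial)
    summary(fit)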

Finally, he makes a distinction between Big Data and Analytics.  So often it’s implied that they go hand-in-hand – just look at job postings. Show me a listing for Hadoop that doesn’t also say you must know statistics or machine learning (ok, maybe there are some, but there aren’t many). Big Data is about scattering your data across multiple machines and is an engineering problem.  It’s like saying I have to be a DBA (and know about backups, replication, security, data governance and blah blah blah) just to write some SQL.

Analytics in the workplace

The other day I had a demo from our research group on Attensity.  After some successful text analytics projects using Perl and R (and one that just didn’t work because of large volumes of claim notes), we’re looking for something that can handle higher volumes.

We were told that due to its complexity and cost of licenses, we would not be allowed to use Attensity ourselves but rather would have to work thru the research unit.  Obviously, this pained me, but if that is the most cost effective approach then I’m all for it.

But I don’t think this is cost effective at all and here’s why.

Analytics today is still a youngster in the corporate world and the way we do it now is comparable to the way IT was done years ago.  It used to be that if you wanted any kind of programming done you had to go to the IT group, your work was prioritized (i.e. you made and presented some kind of cost benefit analysis), and if there were enough resources your project slogged forward.  By the time it was done, the resulting software may or may not have met your needs (which might have changed since you started).  Hence the Agile craze, but I digress.

Compare IT of yesterday to today.  I can do 99% of my job with the tools on my desktop – as much as possible on the back-end with Oracle and SQL Server, and then whatever else needs to be done with Excel/VBA, SAS (technically not on my desktop, but whatevs), Perl, and R.

Imagine if I had to go to IT every time I needed a new Excel macro.  The time (cost) to get work done would be outrageous.  Sure, there end up being a lot of customer-created “applications” out there, some of which get turned over to IT (much to their horror).  But what’s often forgotten is that only the top, most useful processes ever make it to “the show”.  IT may have to figure out how to turn Access SQL into Oracle SQL, but so what – think of all the savings – the business analyst’s job is practically done.  And IT never had to spend one second on the 80-90% of processes that either serve their purpose for a short time or are perhaps more experimental in nature.

So that brings us back to analytics today.  At some point, some kind of analytics will be a tool included for free with Windows, just like VBScript is today.  And every reason you can give that this won’t happen has had a parallel reason in IT:

  • It’s too complicated (writing Windows apps used to be complicated; now it’s easy)
  • They’ll make mistakes – yup, they will.  But some of it will make it to The Show.
  • The software is too expensive – there are already free tools out there.

I’m not saying Enterprise Miner is going to be included in Windows 10.  But how about a simple decision tree?  While not the most powerful technique, I like to do a quick CART tree just to visualize some of the relationships (linear or not) in a dataset.  Really, you could say that Microsoft is already including data mining for very little additional cost in SQL Server.
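For what it’s worth, that quick-look tree is only a few lines in R with rpart (the data frame df and response y here are placeholders):

    library(rpart)

    # Hypothetical data frame df with response y -- a two-minute look at
    # which variables matter and where they split, linear or not.
    fit <- rpart(y ~ ., data = df)

    plot(fit, uniform = TRUE, margin = 0.1)  # draw the tree
    text(fit, use.n = TRUE, cex = 0.8)       # label splits and node counts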

The reason I know this to be true is innovation.  There’s no way you can innovate with analytics by having to go thru some research unit.  The nature of innovation is such that 90% of what you try is going to suck.  As you get better, maybe you can bring that down to 85% (ok, maybe 88%).  Nobody is going to fund a project that has a 90% chance of sucking, so the whole thing of having to go to a research unit to innovate will never last – either the practice or the company will end.

Luckily, our company is also carefully opening up to the use of Free and Open Source Software (FOSS), which is why we’re looking at using GATE for our large project.