Data Science, Moore’s Law, and Moneyball

I’m fond of navel gazing, meta discussions, and so forth. I’ve recently written about inferring navel gazing from link data, and about the meaning of the “Analytics” buzzword. This post will be my second on that other infectious buzzword, “Data Science”.

When I moved to Washington DC in July, I was struck by the fact that there was no Meetup for analytics/applied statistics/machine learning/data science. There’s a great DC Tech Meetup, a great Big Data Meetup, and a great R Meetup, but nothing like the NYC Predictive Analytics Meetup. So I, along with a couple of others I talked to about this (Marck Vaisman, who I first met through the NYC R Meetup a couple of years ago, and Matt Bryan, who I met just after moving to town), started a new Meetup, which we decided to call “Data Science DC”.

For our second meetup, we thought we should address some aspect of our name, and so I presented a little bit about the term and the controversies around its definition and its recent dramatic upsurge in popularity. Here are the slides (note that you should be able to click through the links on the slide to the source documents):

I mostly didn’t present a personal opinion about what I thought the term means, or what it should mean, but instead wanted to present a bunch of other people’s points of view to kick off an interesting discussion. And in that sense I succeeded. We had an exceedingly interesting conversation following my slides, and I think a couple of the most interesting ideas from the evening came out of that discussion.

Here are three theses I’d like to propose.

  1. “Data Science” is defined as what “Data Scientists” do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. Who Data Scientists are may be the more fundamental question.
  2. One reason Data Science is a big thing now is that advances in technology have made it easy for Data Scientists to develop wide-ranging expertise. Even 10 years ago, a single person who could integrate several databases, run a multilevel regression, and generate elegant visualizations would have been seen as incredibly rare.
  3. The other reason Data Science is a big thing now is that sabermetrics demonstrated that number-crunching brings results. There’s nothing business leaders love more than a sports analogy, and the analytic revolution in professional sports immediately drew attention to the ways that numbers beat intuition.

I tend to like the idea that Data Science is defined by its practitioners, that it’s a career path rather than a category of activities. In my conversations with people, it seems that those who consider themselves Data Scientists typically have eclectic career paths that might in some ways seem not to make much sense. A typical path might be someone who started out learning to program, then spent some time in a scientific field, then hopped around a variety of different roles, collecting a wide variety of different skills, all of which relate to using analytical techniques to make sense of data.

This sort of career path isn’t particularly new, but what is new is that it’s now possible to get started relatively quickly and cheaply in all of the processes involved in Data Science. (Thanks to Taylor Horton for suggesting this at the Meetup!) Fast computers, open source tools, and some programming skills allow someone to try a new data management approach or a new machine learning technology incredibly quickly, and to iterate on approaches until a solution to a particular problem is found. This has two consequences. First of all, the productivity of a modern Data Scientist is remarkable. Projects that a few decades ago would have taken teams of people literally years can now be done in a few days. Second of all, this amazing productivity allows people to spend their 10,000 hours developing expertise in the now vertically integrated process of Data Science, rather than having to spend all of that time developing skills in just a single aspect of the task. There is a huge number of things that need to be learned to be an effective Data Scientist, but it is now possible to learn those skills quickly enough to make a career out of being a Jack of all Trades and a near-master at many of them.

So now there’s a supply of people who could be Data Scientists. But what about the type of demand that drives an incessant stream of O’Reilly articles and job postings? Where does the demand come from? Justin Grimes had an intriguing idea that resonated with me: analytics in sports, which I propose as the other reason why analytics and data science have become buzzwords. Although business has used mathematical methods for 100 years (thousands if you include finance and insurance), the idea that you could hire a very small number of people to analyze data and beat gut instincts in many aspects of decision making is much newer. The idea that a statistician could turn around the Oakland A’s by radically overturning longstanding recruiting practice was a powerful analogy. Even now, business books about analytics almost always have sports examples in the first chapters. I made a point at work the other day by noting that most professional sports prognosticators predict NFL playoff outcomes wrong because they over-weight last year’s results. Sports analogies get attention.

Does this make sense? Data Science is a buzzword now because a group of people with eclectic talents matches a growing demand for, and recognition of the value of, those talents. I’d love feedback on these thoughts!


14 thoughts on “Data Science, Moore’s Law, and Moneyball”

  1. Taylor

    Great post, Harlan. I’d always been drawn to the idea that a data scientist is a career defined by an eclectic path or skill set, instead of one set of tools and procedures. However, I’d never seen that definition put forth until your talk. Point #2 was me, but it was a result of all the other great discussion from the DC meetup.

  2. Mic

    I liked your post! I’m a physicist by training, but I’ve been working in the “data science” and “analytics” area for going on 20 years now. It’s just never been called that before…

    It’s really coming into play now because lots of data is available, computing power is cheap, and smart people can get answers from the data quickly using these tools (and creating their own). The marketplace is just now ready to recognize “data science” on its own terms.

  3. Pingback: getstats » Getstats – Campaigning to make Britain better with numbers and statistics

  4. human mathematics

    Meh … I hope Hal Varian is right, but as of now it looks like NoSQL, Hadoop, and various other upgraded ETL/database stuff is where the money’s at. I think statisticians face the problem that they can’t communicate exactly what value they will add to the bottom line (“I will explain things that you don’t understand to you … which will accomplish … um … well I don’t know your business well enough to say”). Whereas DB’s are necessary to the basic functioning of a business.

    I see a parallel to Quants (finance). If you are creative and can come up with strategies, great once you’re in the door. But you need to be an excellent programmer to get in the door (so they know they can extract some value from you as a servant, before you become an advisor).

    1. Harlan Post author

      Hi Chris. I think you’re absolutely right. Statistical training alone is not adequate for addressing the value of many real-world questions. I think that substantial systems intuition, and the ability to simulate the outcome of business (or government, or whatever) policy changes that might be due to the results of your analysis, is critical. Probably the people who do this best are quantitative MBAs, who have the formal training to do this work and the ability to talk with non-technical business people, and OR people, who are likewise trained to simulate complex processes and estimate the real-world effects (in money, defects, whatever) of various changes. A stats PhD seems much less useful in application of data science than many other degrees.

      DBAs are valuable, of course, but they don’t do any of the things I talked about above.

  5. human mathematics

    I’d like to hear your comments on the Quora post “What does one have to learn to become a data scientist?”. I thought the post indicated a lack of definition to the field.

    I also read something in the NYT maybe a year or two ago which also claimed that stats PhD’s are/will be a valuable labour force. Again, once I see the evidence, I’ll believe it…

  6. Pingback: Thought this was cool: 什么是数据科学(Data Science) « CWYAlpha

  7. Pingback: A Very Short History of Data Science | What's The Big Data?

  8. Pingback: Press seeks contributions to the ‘Very Short History of Data Science’ | RSSeNews

  9. Pingback: Somethink to Chew On » Survey of Data Science / Analytics / Big Data / Applied Stats / Machine Learning etc. Practitioners

  10. Ross

    Like the sports-analogy observation. Rings true.

    In the language of B-school, “data science” can be cast in part as ‘going to existing clients with new technology.’

    If we can show new clients “past performance” (using DS w/Oakland A’s), especially in a culturally-relevant setting as ‘sports’, then we’ve done a good bit of the sales work.

  11. Pingback: A Very Short History of Data Science | Um blog sobre nada

  12. Pingback: A Very Short History Of Data Science by Gil Press | The Brussels Data Science Community

  13. Dina

    If only all of my professors could have used such a simplicity in elucidating their subjects.. The world could be much easier to understand! Thanks for your thoughts @harlanharris! Had a great pleasure to read them 🙂

