
Data Science, Moore’s Law, and Moneyball

I’m fond of navel gazing, meta discussions, and so forth. I’ve recently written about inferring navel gazing from link data, and about the meaning of the “Analytics” buzzword. This post will be my second on that other infectious buzzword, “Data Science”.

When I moved to Washington DC in July, I was struck by the fact that there was no Meetup for analytics/applied statistics/machine learning/data science. There’s a great DC Tech Meetup, a great Big Data Meetup, and a great R Meetup, but nothing like the NYC Predictive Analytics Meetup. So a couple of others I talked to about this (Marck Vaisman, whom I first met through the NYC R Meetup a couple of years ago, and Matt Bryan, whom I met just after moving to town) and I started a new Meetup, which we decided to call “Data Science DC”.

For our second meetup, we thought we should address some aspect of our name, and so I presented a little bit about the term and the controversies around its definition and its recent dramatic upsurge in popularity. Here are the slides (note that you should be able to click through the links on the slide to the source documents):

I mostly didn’t present a personal opinion about what I thought the term means, or what it should mean, but instead wanted to present a bunch of other people’s points of view to kick off an interesting discussion. And in that sense I succeeded. We had an exceedingly interesting conversation following my slides, and I think a couple of the most interesting ideas from the evening came out of that discussion.

Here are three theses I’d like to propose.

  1. “Data Science” is defined as what “Data Scientists” do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. Who Data Scientists are may be the more fundamental question.
  2. One reason Data Science is a big thing now is that advances in technology have made it easy for Data Scientists to develop wide-ranging expertise. Even 10 years ago, a single person who could integrate several databases, run a multilevel regression, and generate elegant visualizations would have been seen as incredibly rare.
  3. The other reason Data Science is a big thing now is that sabermetrics demonstrated that number-crunching brings results. There’s nothing business leaders love more than a sports analogy, and the analytic revolution in professional sports immediately drew attention to the ways that numbers beat intuition.

I tend to like the idea that Data Science is defined by its practitioners, that it’s a career path rather than a category of activities. In my conversations with people, it seems that people who consider themselves Data Scientists typically have eclectic career paths that might in some ways seem not to make much sense. A typical Data Scientist might be someone who started out learning to program, then spent some time in a scientific field, then hopped around a variety of different roles, collecting a wide variety of different skills, all of which related to using analytical techniques to make sense of data.

This sort of career path isn’t particularly new, but what is new is that it’s now possible to get started relatively quickly and cheaply in all of the processes involved in Data Science. (Thanks to Taylor Horton for suggesting this at the Meetup!) Fast computers, open source tools, and some programming skills allow someone to try a new data management approach or a new machine learning technology incredibly quickly, and to iterate on approaches until a solution to a particular problem is found. This has two consequences. First of all, the productivity of a modern Data Scientist is remarkable. Projects that a few decades ago would have taken teams of people literally years can now be done in a few days. Second of all, this amazing productivity allows people to spend their 10,000 hours developing expertise in the now vertically integrated process of Data Science, rather than having to spend all of that time focusing on developing skills on just a single aspect of the task. There is a huge number of things that need to be learned to be an effective Data Scientist, but it is now possible to learn those skills quickly enough to make a career out of being a Jack of all Trades and a near-master at many of them.

So now there’s a supply of people who could be Data Scientists. But what about the type of demand that drives an incessant stream of O’Reilly articles and job postings? Where does the demand come from? Justin Grimes had an intriguing idea that resonated with me: analytics in sports, which I propose as the other reason why analytics and data science have become buzzwords. Although business has used mathematical methods for 100 years (thousands if you include finance and insurance), the idea that you could hire a very small number of people to analyze data and beat gut instincts in many aspects of decision making is much newer. The idea that a statistician could turn around the Oakland A’s by radically overturning longstanding recruiting practice was a powerful analogy. Even now, business books about analytics almost always have sports examples in the first chapters. I made a point at work the other day by noting that most professional sports prognosticators predict NFL playoff outcomes wrong because they over-weight last year’s results. Sports analogies get attention.

Does this make sense? Data Science is a buzzword now because a group of people with eclectic talents match a growing demand for and recognition of the value of those talents. I’d love feedback on these thoughts!

 

On “Analytics” and related fields

I recently attended the INFORMS Conference on Business Analytics and Operations Research, aka “INFORMS Analytics 2011”, in Chicago. This deserves a little bit of an explanation. INFORMS is the professional organization for Operations Research (OR) and Management Science (MS), which are terms describing approaches to improving business efficiency by use of mathematical optimization and simulation tools. OR is perhaps best known for the technique of Linear Programming (read “Programming” as “Planning”), which is an extremely efficient method for optimizing a useful class of mathematical expressions (linear ones) under various constraints. You can, for example, solve scheduling, assignment, transportation, factory layout, and similar problems with millions of variables in seconds. These techniques came out of large-scale government and especially military logistics and decision-making needs of the mid-20th century, and have now been applied extensively in many industries. Have you seen the UPS “We (heart) Logistics” ad? That’s OR.
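To make “optimizing under constraints” a bit more concrete, here is a minimal sketch of a two-variable linear program in Python. Everything in it is my own illustration rather than anything from the conference: the constraint numbers are a made-up, textbook-style product-mix problem, and `solve_lp_2d` is a hypothetical helper. It relies on the fact that a bounded linear program attains its optimum at a corner of the feasible region, so it simply checks every corner. Real solvers use the simplex or interior-point methods, which is how they scale to millions of variables; this brute-force version does not scale at all.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c.
# The numbers are illustrative assumptions, not data from the post.
CONSTRAINTS = [
    (1, 0, 4),    # x <= 4
    (0, 2, 12),   # 2y <= 12
    (3, 2, 18),   # 3x + 2y <= 18
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]
OBJECTIVE = (3, 5)  # maximize 3x + 5y


def solve_lp_2d(constraints, objective, tol=1e-9):
    """Maximize a linear objective over a bounded 2-D feasible region.

    Intersects every pair of constraint boundaries, keeps the feasible
    intersection points (the corners), and returns the best one.
    """
    best_point, best_value = None, float("-inf")
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundaries never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + tol for a, b, c in constraints):
            value = objective[0] * x + objective[1] * y
            if value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


point, value = solve_lp_2d(CONSTRAINTS, OBJECTIVE)
print(point, value)  # (2.0, 6.0) 36.0
```

The interesting part is not the arithmetic but the framing: once the goal and the limits are written down as linear expressions, finding the best plan becomes a mechanical search.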

OR is useful, but it’s not sexy, despite UPS’ best efforts. Interest in OR programs in universities (often specialties of Industrial Engineering departments) has been down in recent years, as has been attendance at INFORMS conferences. On the other hand, if you ignore the part about “optimization” and just see OR as “improving business efficiency by use of mathematical processes,” this makes no sense at all! Hasn’t Analytics been a buzzword for the past few years? (“analytics buzzword” gets 2.4 million results on Google.) Haven’t there been bestselling business books about mathematical tools being used in all sorts of industries? (That last link is about baseball.) Hasn’t the use of statistical and mathematical techniques in business been called “sexy” by Google’s Chief Economist? How could a field and an industry that at some level seems to be the very definition of what’s cool in business and technology right now be seen as a relic of McNamara’s vision of the world?

To answer that rhetorical question, I think it’s worth considering the many ways that organizations can use data about their operations to improve their effectiveness. SAS has a really useful hierarchy, which it calls the Eight levels of analytics.

  1. Standard Reports – pre-processed, regular summaries of historical data
  2. Ad Hoc Reports – the ability for analysts to ask new questions and get new answers
  3. Query Drilldown – the ability for non-technical users to slice and dice data to see results interactively
  4. Alerts – systems that detect atypical conditions and notify people
  5. Statistical Analysis – use of regressions and similar to find trends and correlations in historical data
  6. Forecasting – ability to extrapolate from historical data to estimate future business
  7. Predictive Analytics – advanced forecasting, using statistical and machine-learning tools and large data sets
  8. Optimization – balance competing goals to maximize results
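As a rough illustration of how levels #4 through #6 differ, here is a small Python sketch. It is my own toy example, not anything SAS publishes: the `weekly_sales` numbers are invented, the function names are hypothetical, and the methods (a standard-deviation threshold, an ordinary least squares line, and straight-line extrapolation) are about the simplest stand-ins I could pick for each level.

```python
from statistics import mean, stdev

# Hypothetical weekly sales figures, invented for illustration.
weekly_sales = [100, 104, 109, 113, 118, 121, 127, 160]


def alert_indices(series, threshold=2.0):
    """Level 4 (Alerts): flag points more than `threshold` sample
    standard deviations away from the mean of the series."""
    m, s = mean(series), stdev(series)
    return [i for i, v in enumerate(series) if abs(v - m) > threshold * s]


def fit_trend(series):
    """Level 5 (Statistical Analysis): ordinary least squares fit of
    value against time index. Returns (slope, intercept)."""
    xs = range(len(series))
    x_bar, y_bar = mean(xs), mean(series)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def forecast(series, steps_ahead):
    """Level 6 (Forecasting): extrapolate the fitted line
    `steps_ahead` periods past the last observation."""
    slope, intercept = fit_trend(series)
    return intercept + slope * (len(series) - 1 + steps_ahead)


print(alert_indices(weekly_sales))  # [7]
print(fit_trend(weekly_sales))
print(forecast(weekly_sales, 1))
```

Each function answers a different question about the same data: is something unusual happening now, what has the trend been, and where is it heading. Levels #7 and #8 would add richer models and an explicit objective on top of this.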

I like this hierarchy because it distinguishes among a bunch of different disciplines and technologies that tend to run together. For example, what’s often called “Business Intelligence” is a set of tools for doing items #1-#4. No statistics per se are involved, just the ability to provide useful summaries of data, in various ways, to the people who need them. At its most statistically advanced, BI includes tools for data visualization that are informed by research, and at its most technologically advanced, BI includes sophisticated database and data management systems to keep everything running quickly and reliably. These are not small accomplishments, and this is a substantial and useful thing to be able to do.

But it’s not what “data scientists” in industry do, or at least, it’s not what makes them sexy and valuable. When you apply the tools of scientific inquiry, statistical analysis, and machine learning to data, you get the abilities in levels #5-#7. Real causality can be separated from random noise. Eclectic data sources, including unstructured documents, can be processed for valuable predictive features. Models can predict movie revenue or recommend movies you want to see or any number of other fascinating things. Great stuff. Not BI.

And not really OR either, unless you redefine OR. OR is definitely #8, the ability to build sophisticated mathematical models that can be used not just to predict the future, but to find a way to get to the future you want.

So why did I go to an INFORMS conference with the word Analytics in its title? This same conference used to be called “The INFORMS Conference on OR Practice”. Why the change? This has been the topic of constant conversation recently, among the leaders of the society, as well as among the attendees of the conference. There are a number of possible answers, from jumping on a bandwagon, to trying to protect academic turf, to trying to let “data geeks” know that there’s a whole world of “advanced” analytics beyond “just” predictive modeling.

I think all of those are right, and justifiable, despite the pejorative slant. SAS’ hierarchy does define a useful progression among useful analytic skills. INFORMS recently hired consultants to help it figure out how to position itself, and they identified a similar set of overlapping distinctions:

  • Descriptive Analytics — Analysis and reporting of patterns in historical data
  • Predictive Analytics — Predicts future trends, finds complex relationships in data
  • Prescriptive Analytics — Determines better procedures and strategies, balances constraints

They also have been using “Advanced Analytics” for the Predictive and Prescriptive categories.

I do like these definitions. But do I like the OR professional society trying to add Predictive Analytics to the scope of their domain, or at least of their Business-focused conference? I’m on the fence. It’s clearly valuable to link optimization to prediction, in business as well as other sorts of domains. (In fact, I have a recent Powerpoint slide that says “You can’t optimize what you can’t predict”!) And crosstalk among practitioners of these fields can be nothing but positive. I certainly have learned a lot about appropriate technologies from my membership in a variety of professional organizations.

But the whole scope of “analytics” is a lot of ground, and the underlying research and technology spans several very different fields. I’d be surprised if there were more than a dozen people at INFORMS with substantial expertise in text mining, for example. There almost needs to be a new business-focused advanced analytics conference, sponsored jointly by the professional societies of the machine learning, statistics, and OR fields, covering everything that businesses large and small do with data that is more mathematically sophisticated (though not necessarily more useful) than the material covered by the many business intelligence conferences and trade shows. Would that address the problem of advanced analytics better than trying to expand the definition of OR?

“Data Scientist” and other titles

Neil Saunders has an interesting (to me) blog post up this morning, with the title “Dumped on by data scientists.” He takes the appearance of “data scientist” in a Chronicle of Higher Ed article as an occasion to rant a little bit about the term. For Neil, it’s redundant, as the act of doing science necessarily requires data; it’s insulting, as if “scientist” wasn’t cool enough and you have to add “data”; and it’s misleading, as many people who call themselves “data scientists” are actually dealing with business data rather than scientific data.

Without disagreeing that there’s a terminological sprawl going on, I did want to address the use of the term, and partially disagree with Neil.

As someone with scientific training who uses those tools to solve business problems, I certainly struggle with a description of my role. “Data Scientist” or “Statistical Data Scientist” is actually pretty good, as it correctly indicates that I use scientific techniques (controlled experiments, sophisticated statistics) to understand our company’s data. I often describe myself as a “Statistician”, too, which gets across some of the same ideas without people having to do a double take and parse a new phrase. I also sometimes describe myself as doing “Operations Research” (aka “Management Science”, although I don’t use that term), since I use some of the tools of that field, as well as of Artificial Intelligence/Machine Learning, to optimize certain objective functions.

“Business Intelligence” actually is not that good a term for what I do, as most of what is usually called BI is about tools for better/more relevant/faster access to data for business people to use. This is not a bad thing to be doing, at all, but it’s different from the predictive and inferential statistical methods that I use in my job.

I don’t know what the right answer is. It might depend on the precise person and their precise role. My title, for instance, is the result of a back-and-forth with my boss, HR, and others, trying to find words that have both appropriate internal and external meanings. “Technical Lead” is a rank, indicating that I run technical projects without (formally) managing people. “Inventory Optimization and Research” covers a variety of areas. “Inventory” here means “sellable units”, like boxes on a shelf, or in this case, like scheduled airline flights. Probably baffling for an external audience without an explanation, but extremely clear inside the company. “Optimization” means what it sounds like, both in a technical and a non-technical sense, and for both internal and external audiences. “Research” indicates a focus on the development of long-term and cutting-edge systems. “Data Scientist” didn’t end up in there, but it could have.

For people using Big Data tools and scientific methods to study topics inside academia, the right answer seems to me to be to put the field of study first. You’re not a “Data Scientist”, you’re an astrophysicist, or a bioinformatician, or a neuroscientist, with a specialization in statistical methods. If you’re a generalist inside the academy, you’re probably a statistician. Perhaps “Data Scientist” should be restricted to people applying scientific tools and techniques to problems of non-academic interest? That might work, as long as it included people who do things like apply predictive analytic tools to hospital admissions data.