# intuitive visualizations of categorization for non-technical audiences

For a project I’m working on at work, I’m building a predictive model that categorizes something (I can’t tell you what) into two bins. There is a default bin that 95% of the things belong to and a bin that the business cares a lot about, containing 5% of the things. Some readers may be familiar with the use of predictive models to identify better sales leads, so that you can target the leads most likely to convert and minimize the amount of effort wasted on people who won’t purchase your product. Although my situation doesn’t have to do with sales leads, I’m going to pretend it does, as it’s a common domain.
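To see why that 95/5 split makes the problem tricky to evaluate and to explain, consider that a model which always predicts the default bin is already 95% "accurate" while being completely useless. The counts and error rates below are made up purely for illustration:

```python
# Why a 95/5 split makes plain "accuracy" misleading: a model that
# always predicts the default bin is 95% accurate but finds nothing.
# All numbers here are hypothetical, for illustration only.

total = 10_000
positives = int(total * 0.05)   # the 5% bin the business cares about
negatives = total - positives

# Baseline: always predict the default bin.
baseline_accuracy = negatives / total
print(baseline_accuracy)  # 0.95

# A (made-up) model that finds 60% of the positives at the cost of
# flagging 5% of the negatives:
true_pos = int(positives * 0.60)
false_pos = int(negatives * 0.05)
accuracy = (true_pos + (negatives - false_pos)) / total
precision = true_pos / (true_pos + false_pos)
print(round(accuracy, 3), round(precision, 3))  # 0.932 0.387
```

Note that this hypothetical model scores *lower* on accuracy than the do-nothing baseline, yet its precision (39%) is nearly eight times the 5% base rate, which is exactly the kind of thing a raw accuracy number hides.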

My data consists of many thousands of “leads,” each with hundreds of predictive features (mostly 1/0, a few numeric). I can plug this data into any number of common statistical and machine learning systems, which will crunch the numbers and provide a black box that can do a pretty good job of separating more-valuable leads from less-valuable ones. That’s great, but now I have to communicate what I’ve done, and how valuable it is, to an audience that struggles with relatively simple statistical concepts like correlation. What can I do?
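One intuitive way to communicate a model's value without invoking statistics is a gains (or lift) table: sort the leads by model score, cut them into ten equal buckets, and show what fraction of the good leads land in each bucket. Here is a minimal sketch with entirely synthetic scores and labels (the data, positive rate, and score distributions are invented for illustration, not from my actual project):

```python
import random

random.seed(0)

# Synthetic stand-in data: 1,000 "leads", each with a model score and
# a true label (about 5% positive). Higher scores are made to correlate
# with being positive; all numbers are invented for illustration.
leads = []
for _ in range(1000):
    positive = random.random() < 0.05
    score = random.gauss(0.7 if positive else 0.4, 0.2)
    leads.append((score, positive))

# Gains table: sort by score descending, split into deciles, and
# report the share of all positives captured in each decile.
leads.sort(key=lambda x: x[0], reverse=True)
total_pos = sum(p for _, p in leads)
decile_size = len(leads) // 10
for d in range(10):
    chunk = leads[d * decile_size:(d + 1) * decile_size]
    captured = sum(p for _, p in chunk)
    print(f"decile {d + 1}: {captured / total_pos:.0%} of positives")
```

The pitch to a non-technical audience then becomes a single sentence of the form "call only the top 10% of the list and you reach X% of the buyers," which requires no understanding of the model at all.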

# On “Analytics” and related fields

I recently attended the INFORMS Conference on Business Analytics and Operations Research, aka “INFORMS Analytics 2011,” in Chicago. This deserves a little bit of explanation. INFORMS is the professional organization for Operations Research (OR) and Management Science (MS), terms describing approaches to improving business efficiency through mathematical optimization and simulation tools. OR is perhaps best known for the technique of Linear Programming (read “Programming” as “Planning”), a method for optimizing a useful class of mathematical expressions under various constraints with remarkable efficiency. You can, for example, solve scheduling, assignment, transportation, factory layout, and similar problems with millions of variables in seconds. These techniques came out of the large-scale government and especially military logistics and decision-making needs of the mid-20th century, and have since been applied extensively in many industries. Have you seen the UPS “We (heart) Logistics” ad? That’s OR.
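To make the Linear Programming idea concrete, here is a deliberately tiny product-mix LP, solved in pure Python by brute force. It leans on a classic LP fact: an optimum always lies at a vertex of the feasible region, so for two variables we can just enumerate the vertices. The profit and constraint numbers are invented; real solvers (simplex and friends) handle the million-variable versions this brute force never could.

```python
from itertools import combinations

# Toy product-mix linear program:
#   maximize profit 3x + 5y
#   subject to  x + y  <= 4   (machine hours)
#               x + 3y <= 6   (labor hours)
#               x >= 0, y >= 0
# Each constraint is stored as (a, b, c), meaning a*x + b*y <= c.
constraints = [
    (1, 1, 4),
    (1, 3, 6),
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]

def intersect(c1, c2):
    """Point where two constraint boundary lines cross (Cramer's rule)."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None  # parallel lines
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(pt):
    return all(a * pt[0] + b * pt[1] <= c + 1e-9 for a, b, c in constraints)

# An LP optimum sits at a vertex, i.e. an intersection of constraint
# boundaries that satisfies every constraint. Enumerate and pick the best.
vertices = [p for c1, c2 in combinations(constraints, 2)
            if (p := intersect(c1, c2)) and feasible(p)]
best = max(vertices, key=lambda p: 3 * p[0] + 5 * p[1])
print(best, 3 * best[0] + 5 * best[1])  # (3.0, 1.0) 14.0
```

Here both constraints bind at the optimum (make 3 units of x and 1 of y for a profit of 14), which is the typical LP situation: the answer sits at a corner where scarce resources are exactly used up.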

OR is useful, but it’s not sexy, despite UPS’ best efforts. Interest in OR programs in universities (often specialties of Industrial Engineering departments) has been down in recent years, as has attendance at INFORMS conferences. On the other hand, if you ignore the part about “optimization” and just see OR as “improving business efficiency by use of mathematical processes,” this decline makes no sense at all! Hasn’t Analytics been a buzzword for the past few years? (“analytics buzzword” gets 2.4 million results on Google.) Haven’t there been bestselling business books about mathematical tools being used in all sorts of industries? (That last link is about baseball.) Hasn’t the use of statistical and mathematical techniques in business been called “sexy” by Google’s Chief Economist? How could a field and an industry that at some level seems to be the very definition of what’s cool in business and technology right now be seen as a relic of McNamara’s vision of the world?

To answer that rhetorical question, I think it’s worth considering the many ways that organizations can use data about their operations to improve their effectiveness. SAS has a really useful hierarchy, which it calls the Eight Levels of Analytics:

1. Standard Reports – pre-processed, regular summaries of historical data
2. Ad Hoc Reports – the ability for analysts to ask new questions and get new answers
3. Query Drilldown – the ability for non-technical users to slice and dice data to see results interactively
4. Alerts – systems that detect atypical conditions and notify people
5. Statistical Analysis – use of regressions and similar to find trends and correlations in historical data
6. Forecasting – ability to extrapolate from historical data to estimate future business
7. Predictive Analytics – advanced forecasting, using statistical and machine-learning tools and large data sets
8. Optimization – balance competing goals to maximize results

I like this hierarchy because it distinguishes among a bunch of different disciplines and technologies that tend to run together. For example, what’s often called “Business Intelligence” is a set of tools for doing items #1-#4. No statistics per se are involved, just the ability to provide useful summaries of data, in various ways, to the people who need them. At its most statistically advanced, BI includes tools for data visualization that are informed by research, and at its most technologically advanced, BI includes sophisticated database and data management systems to keep everything running quickly and reliably. These are not small accomplishments, and this is a substantial and useful thing to be able to do.

But it’s not what “data scientists” in industry do, or at least, it’s not what makes them sexy and valuable. When you apply the tools of scientific inquiry, statistical analysis, and machine learning to data, you get the abilities in levels #5-#7. Real causality can be separated from random noise. Eclectic data sources, including unstructured documents, can be processed for valuable predictive features. Models can predict movie revenue or recommend movies you want to see or any number of other fascinating things. Great stuff. Not BI.

And not really OR either, unless you redefine OR. OR is definitely #8, the ability to build sophisticated mathematical models that can be used not just to predict the future, but to find a way to get to the future you want.

So why did I go to an INFORMS conference with the word Analytics in its title? This same conference used to be called “The INFORMS Conference on OR Practice.” Why the change? This has been a topic of constant conversation recently, among the leaders of the society as well as among the attendees of the conference. There are a number of possible answers, from jumping on a bandwagon, to trying to protect academic turf, to trying to let “data geeks” know that there’s a whole world of “advanced” analytics beyond “just” predictive modeling.

I think all of those are right, and justifiable, despite the pejorative slant. SAS’ hierarchy does define a useful progression among useful analytic skills. INFORMS recently hired consultants to help them figure out how to place themselves, and identified a similar set of overlapping distinctions:

• Descriptive Analytics — Analysis and reporting of patterns in historical data
• Predictive Analytics — Predicts future trends, finds complex relationships in data
• Prescriptive Analytics — Determines better procedures and strategies, balances constraints

They also have been using “Advanced Analytics” for the Predictive and Prescriptive categories.

I do like these definitions. But do I like the OR professional society trying to add Predictive Analytics to the scope of their domain, or at least of their business-focused conference? I’m on the fence. It’s clearly valuable to link optimization to prediction, in business as well as other sorts of domains. (In fact, I have a recent PowerPoint slide that says “You can’t optimize what you can’t predict”!) And crosstalk among practitioners of these fields can be nothing but positive. I certainly have learned a lot about appropriate technologies from my membership in a variety of professional organizations.

But the whole scope of “analytics” is a lot of ground, and the underlying research and technology spans several very different fields. I’d be surprised if there were more than a dozen people at INFORMS with substantial expertise in text mining, for example. There almost needs to be a new business-focused advanced analytics conference, sponsored jointly by the professional societies of the machine learning, statistics, and OR fields, covering everything that businesses large and small do with data that is more mathematically sophisticated (though not necessarily more useful) than the material covered by the many business intelligence conferences and trade shows. Would that address the problem of advanced analytics better than trying to expand the definition of OR?

# “Data Scientist” and other titles

Neil Saunders has an interesting (to me) blog post up this morning, with the title “Dumped on by data scientists.” He takes the use of “data scientist” in a Chronicle of Higher Ed article as an occasion to rant a little about the term. For Neil, it’s redundant, as the act of doing science necessarily requires data; it’s insulting, as if “scientist” weren’t cool enough and you had to add “data”; and it’s misleading, as many people who call themselves “data scientists” are actually dealing with business data rather than scientific data.

Without disagreeing that there’s a terminological sprawl going on, I did want to address the use of the term, and partially disagree with Neil.

As someone with scientific training who uses those tools to solve business problems, I certainly struggle with a description of my role. “Data Scientist” or “Statistical Data Scientist” is actually pretty good, as it correctly indicates that I use scientific techniques (controlled experiments, sophisticated statistics) to understand our company’s data. I often describe myself as a “Statistician”, too, which gets across some of the same ideas without people having to do a double take and parse a new phrase. I also sometimes describe myself as doing “Operations Research” (aka “Management Science”, although I don’t use that term), since I use some of the tools of that field, as well as of Artificial Intelligence/Machine Learning, to optimize certain objective functions.

“Business Intelligence” actually is not that good a term for what I do, as most of what is usually called BI is about tools for better/more relevant/faster access to data for business people to use. This is not a bad thing to be doing, at all, but it’s different from the predictive and inferential statistical methods that I use in my job.

I don’t know what the right answer is. It might depend on the precise person and their precise role. My title, for instance, is the result of a back-and-forth with my boss, HR, and others, trying to find words that have both appropriate internal and external meanings. “Technical Lead” is a rank, indicating that I run technical projects without (formally) managing people. “Inventory Optimization and Research” covers a variety of areas. “Inventory” here means “sellable units”, like boxes on a shelf, or in this case, like scheduled airline flights. Probably baffling for an external audience without an explanation, but extremely clear inside the company. “Optimization” means what it sounds like, both in a technical and a non-technical sense, and for both internal and external audiences. “Research” indicates a focus on the development of long-term and cutting-edge systems. “Data Scientist” didn’t end up in there, but it could have.

For people using Big Data tools and scientific methods to study topics inside academia, the right answer seems to me to put the field of study first. You’re not a “Data Scientist”, you’re an astrophysicist, or a bioinformatician, or a neuroscientist, with a specialization in statistical methods. If you’re a generalist inside the academy, you’re probably a statistician. Perhaps “Data Scientist” should be restricted to people applying scientific tools and techniques to problems of non-academic interest? That might work, as long as it included people who do things like apply predictive analytic tools to hospital admissions data.

# Prediction with Multilevel Regression Models, and Pizza

The Meetup phenomenon, which is now substantial and longstanding enough to be more of a cultural change than a flash in the pan, continues to impress me. Even more so than tools like LinkedIn, Meetups have changed the nature of professional networking, making it more informal, diverse, and decentralized. Last night, statistics consultant (and cheap eats guru) Jared Lander and I presented a talk on a statistical technique tangentially related to my professional work (more closely associated with Jared’s). The origin of this presentation is worth noting. On Meetup’s web site, members of a group can suggest topics for meetings. Before even attending a single NYC Predictive Analytics event, I posted several topics that I thought might be interesting for the group. A bit later, the organizers (Bruno and Alex) contacted me to see if I’d be willing to present on prediction with multilevel models. I said that I would, but only if I could co-present with someone who actually knew something about the topic and brought a complementary set of skills and experiences. Knowing Jared from the NYC R Meetup group, and knowing that he learned about multilevel models from the professor who wrote the best book on the topic, and knowing that he’s pretty good in front of an audience, I suggested we collaborate.

Despite requiring a lot of work, and a lot of learning of details on my part, we managed to throw together a pretty decent talk. (As of this morning, there are four ratings of the event on Meetup, and we got 5/5 stars! Yay us! Not statistically conclusive, though…) We used as an example topic for data analysis the difficult and critically important problem of predicting reviews of pizza restaurants in downtown NYC. Jared is actually an expert on this topic, having written his Master’s thesis on ratings from Menupages.com. For the talk, Jared would present a few slides, then I’d present a few. In a few cases we’d both try to explain topics from slightly different points of view. I’d repeatedly try to use the keyboard instead of the remote-control gadget to control PowerPoint, causing the computer to melt down into a pile of slag and refuse to change the slide. Jared would send me withering glares when I started to move toward the keyboard. It ended up OK, though: we got through everything, and even answered about half of the (excellent) questions! Oh, and a shout-out to the AV guy at AOL HQ. I don’t know how they pay his salary, but he rocked.
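For readers who weren’t there, the core intuition of a multilevel (varying-intercept) model is partial pooling: estimate each group’s mean, but shrink it toward the overall mean, with small groups shrunk the most. Here is a minimal pure-Python sketch of that idea in the pizza-ratings setting. The neighborhoods, ratings, and variance values are all invented for illustration, and the within- and between-group variances are simply assumed, whereas a real multilevel fit (e.g. lmer in R) would estimate them from the data:

```python
import random
from statistics import mean

random.seed(1)

# Made-up pizza ratings by neighborhood; the raw means of tiny
# groups are noisy, which is what partial pooling corrects for.
ratings = {
    "SoHo":      [random.gauss(4.2, 0.5) for _ in range(40)],
    "Chinatown": [random.gauss(3.8, 0.5) for _ in range(25)],
    "Tribeca":   [random.gauss(4.0, 0.5) for _ in range(3)],  # tiny sample
}

grand_mean = mean(r for rs in ratings.values() for r in rs)
sigma2_y = 0.25  # assumed within-neighborhood variance (not estimated)
sigma2_a = 0.04  # assumed between-neighborhood variance (not estimated)

for hood, rs in ratings.items():
    n = len(rs)
    # Precision-weighted compromise between the group mean and the
    # grand mean: w -> 1 as n grows, so big groups keep their own mean.
    w = (n / sigma2_y) / (n / sigma2_y + 1 / sigma2_a)
    pooled = w * mean(rs) + (1 - w) * grand_mean
    print(f"{hood}: raw {mean(rs):.2f} -> pooled {pooled:.2f} (w={w:.2f})")
```

With these assumed variances, the 40-rating neighborhood keeps about 87% of its own mean while the 3-rating neighborhood keeps only about a third, which is the whole point: the model borrows strength across groups instead of trusting three data points.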

Jared has posted the slides from the talk here (ppt), and I’ve put the data we made up (for pedagogical purposes) and the code we used to analyze it and generate graphs for the talk here on Github. Alex video-recorded the presentation, and I’ll update this sentence to link to the video once it’s posted somewhere. Hope folks find it valuable!