For a project I’m working on at work, I’m building a predictive model that categorizes something (I can’t tell you what) into two bins. There is a default bin that 95% of the things belong to and a bin that the business cares a lot about, containing 5% of the things. Some readers may be familiar with the use of predictive models to identify better sales leads, so that you can target the leads most likely to convert and minimize the amount of effort wasted on people who won’t purchase your product. Although my situation doesn’t have to do with sales leads, I’m going to pretend it does, as it’s a common domain.
My data is many thousands of “leads”, for which I’ve constructed hundreds of predictive features (mostly 1/0, a few numeric) each. I can plug this data into any number of common statistical and machine learning systems which will crunch the numbers and provide a black box that can do a pretty good job of separating more-valuable leads from less valuable leads. That’s great, but now I have to communicate what I’ve done, and how valuable it is, to an audience that struggles with relatively simple statistical concepts like correlation. What can I do?
I was recently given the opportunity to re-present my ggplot2 talk, which I originally gave to the NYC R Meetup, to the DC R Meetup group. The Meetup was held co-located with the Predictive Analytics World conference in Alexandria, VA. (More on my thoughts on PAW below…) Contentwise, I made only small changes, changing a bit of patter and adding more examples at the end. I still love ggplot, with some frustration at the way it is typically introduced. Some of the audience had no R experience at all, while others were experts. One person, a grad student at U. of Maryland, had had very similar difficulty as I had when originally learning ggplot2, and his enthusiastic nods during my presentation were very validating! For reference, the Meetup page is here, and I stuck the current version of the slides in a public Dropbox, located here.
And a few thoughts about PAW. The conference was well-run (although I have my gripes with the hotel and its location!) and there were an interesting and eclectic lineup of speakers, from a variety of industries. Compared to academic conferences I’ve attended, I missed having all the grad students around. At PAW, I felt rather young, which had not been true at academic conferences in quite a long time! The content of the conference focused on people using predictive methods (statistics, data mining, machine learning) at the individual-customer level, for marketing or retainment or other purposes. That’s not my primary interest right now — my work is focused at a slightly higher operations-research-y level, trying to make sure that customers in the aggregate have good options. But I enjoyed learning about what other people are doing using somewhat similar methods. Next year, though, I think I’ll try to go to a different conference, perhaps UseR! in the UK, or INFORMS’ applied conference…
A few months back I gave a presentation to the NYC R Meetup. (R is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on ggplot2, a popular package for generating graphs of data and statistics. In the talk (which you can see here, including both my slides and my patter!) I presented both the really great things about ggplot2 and some of its downsides. In this blog post, I wanted to expand a bit on my thinking on ggplot, the Grammar of Graphics, and how peoples’ conceptual representations of graphs, data, ggplot, and R all interact. ggplot is both incredibly elegant and unfortunately difficult to learn to use well, I think as a consequence of the variety of representations. Continue reading