Tag Archives: meetup

integrating R with other systems

I just returned from the useR! 2012 conference for developers and users of R. A theme common to many of the presentations was the integration of R-based statistical systems with other systems, be they other programming languages, web systems, or enterprise data systems. Some highlights for me were an update to Rserve that includes 1-stop web services, and a presentation on ESB integration. Although I didn’t see it discussed, the new httr package for easier access to web services is another outstanding development in integrating R into large-scale systems.

Coincidentally, about a week earlier I had given a short presentation to the local R Meetup entitled “Annotating Enterprise Data from an R Server.” The topic for the evening was “R in the Enterprise,” and others talked about generating large, automated reports with knitr, and using RPy2 to integrate R into a Python-based web system. I talked about my experiences building and deploying a predictive system, using the corporate database as the common link. Here are the slides:

 

Data Science, Moore’s Law, and Moneyball

I’m fond of navel gazing, meta discussions, and so forth. I’ve recently written about inferring navel gazing from link data, and about the meaning of the “Analytics” buzzword. This post will be my second on that other infectious buzzword, “Data Science”.

When I moved to Washington DC in July, I was struck by the fact that there was no Meetup for analytics/applied statistics/machine learning/data science. There’s a great DC Tech Meetup, a great Big Data Meetup, and a great R Meetup, but nothing like the NYC Predictive Analytics Meetup. So a couple of others I talked to about this (Marck Vaisman, who I first met through the NYC R Meetup a couple years ago, and Matt Bryan, who I met just after moving to town) and I started a new Meetup, which we decided to call “Data Science DC”.

For our second meetup, we thought we should address some aspect of our name, and so I presented a little bit about the term and the controversies around its definition and its recent dramatic upsurge in popularity. Here are the slides (note that you should be able to click through the links on the slide to the source documents):

I mostly didn’t present a personal opinion about what I thought the term means, or what it should mean, but instead wanted to present a bunch of other people’s points of view to kick off an interesting discussion. And in that sense I succeeded. We had an exceedingly interesting conversation following my slides, and I think a couple of the most interesting ideas from the evening came out of that discussion.

Here are three theses I’d like to propose.

  1. “Data Science” is defined as what “Data Scientists” do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. Who Data Scientists are may be the more fundamental question.
  2. One reason Data Science is a big thing now is because advances in technology have made it easy for Data Scientists to develop wide-ranging expertise. Even 10 years ago, the idea that the same person could integrate several databases, run a multilevel regression, and generate elegant visualizations would be seen as incredibly rare.
  3. The other reason Data Science is a big thing now is because sabermetrics demonstrated that number-crunching brings results. There’s nothing business leaders love more than a sports analogy, and the analytic revolution in professional sports immediately drew attention to the ways that numbers beat intuition.

I tend to like the idea that Data Science is defined by its practitioners, that it’s a career path rather than a category of activities. In my conversations with people, it seems that those who consider themselves Data Scientists typically have eclectic career paths that might in some ways seem not to make much sense. A typical path might be someone who started out learning to program, then spent some time in a scientific field, then hopped around a variety of different roles, collecting a wide variety of different skills, all of which relate to using analytical techniques to make sense of data.

This sort of career path isn’t particularly new, but what is new is that it’s now possible to get started relatively quickly and cheaply in all of the processes involved in Data Science. (Thanks to Taylor Horton for suggesting this at the Meetup!) Fast computers, open source tools, and some programming skills allow someone to try a new data management approach or a new machine learning technology incredibly quickly, and to iterate on approaches until a solution to a particular problem is found. This has two consequences. First of all, the productivity of a modern Data Scientist is remarkable. Projects that a few decades ago would have taken teams of people literally years can now be done in a few days. Second of all, this amazing productivity allows people to spend their 10,000 hours developing expertise in the now vertically integrated process of Data Science, rather than having to spend all of that time focusing on developing skills on just a single aspect of the task. There are a huge number of things that need to be learned to be an effective Data Scientist, but it is now possible to learn those skills quickly enough to make a career out of being a Jack of all Trades and a near-master at many of them.

So now there’s a supply of people who could be Data Scientists. But what about the type of demand that drives an incessant stream of O’Reilly articles and job postings? Where does the demand come from? Justin Grimes had an intriguing idea that resonated with me: analytics in sports, which I propose as the other reason why analytics and data science have become buzzwords. Although business has used mathematical methods for 100 years (thousands if you include finance and insurance), the idea that you could hire a very small number of people to analyze data and beat gut instincts in many aspects of decision making is much newer. The idea that a statistician could turn around the Oakland A’s by radically overturning longstanding recruiting practice was a powerful analogy. Even now, business books about analytics almost always have sports examples in the first chapters. I made a point at work the other day by noting that most professional sports prognosticators predict NFL playoff outcomes wrong because they over-weight last year’s results. Sports analogies get attention.

Does this make sense? Data Science is a buzzword now because a group of people with eclectic talents matches a growing demand for, and recognition of the value of, those talents. I’d love feedback on these thoughts!

 

how to speak ggplot2 like a native, and Predictive Analytics World

I was recently given the opportunity to re-present my ggplot2 talk, which I originally gave to the NYC R Meetup, to the DC R Meetup group. The Meetup was co-located with the Predictive Analytics World conference in Alexandria, VA. (More on my thoughts on PAW below…) Contentwise, I made only small changes, changing a bit of patter and adding more examples at the end. I still love ggplot, with some frustration at the way it is typically introduced. Some of the audience had no R experience at all, while others were experts. One person, a grad student at U. of Maryland, had run into much the same difficulty I did when first learning ggplot2, and his enthusiastic nods during my presentation were very validating! For reference, the Meetup page is here, and I stuck the current version of the slides in a public Dropbox, located here.

And a few thoughts about PAW. The conference was well-run (although I have my gripes with the hotel and its location!) and there was an interesting and eclectic lineup of speakers, from a variety of industries. Compared to academic conferences I’ve attended, I missed having all the grad students around. At PAW, I felt rather young, which had not been true at academic conferences in quite a long time! The content of the conference focused on people using predictive methods (statistics, data mining, machine learning) at the individual-customer level, for marketing or retention or other purposes. That’s not my primary interest right now; my work is focused at a slightly higher operations-research-y level, trying to make sure that customers in the aggregate have good options. But I enjoyed learning about what other people are doing using somewhat similar methods. Next year, though, I think I’ll try to go to a different conference, perhaps UseR! in the UK, or INFORMS’ applied conference.

Prediction with Multilevel Regression Models, and Pizza

The Meetup phenomenon, which is now substantial and longstanding enough to be more of a cultural change than a flash in the pan, continues to impress me. Even more so than tools like LinkedIn, Meetups have changed the nature of professional networking, making it more informal, diverse, and decentralized. Last night, statistics consultant (and cheap eats guru) Jared Lander and I presented a talk on a statistical technique tangentially related to my professional work (more closely associated with Jared’s). The origin of this presentation is worth noting. On Meetup’s web site, members of a group can suggest topics for meetings. Before even attending a single NYC Predictive Analytics event, I posted several topics that I thought might be interesting for the group. A bit later, the organizers (Bruno and Alex) contacted me to see if I’d be willing to present on prediction with multilevel models. I said that I would, but only if I could co-present with someone who actually knew something about the topic, or rather, who had a complementary set of skills and experiences. Knowing Jared from the NYC R Meetup group, and knowing that he learned about multilevel models from the professor who wrote the best book on the topic, and knowing that he’s pretty good in front of an audience, I suggested we collaborate.

Despite the talk requiring a lot of work, and a lot of learning of details on my part, we managed to throw together a pretty decent one. (As of this morning, there are four ratings of the event on Meetup, and we got 5/5 stars! Yay us! Not statistically conclusive, though…) We used as an example topic for data analysis the difficult and critically important problem of predicting reviews of pizza restaurants in downtown NYC. Jared is actually an expert on this topic, having written his Master’s thesis on ratings from Menupages.com. For the talk, Jared would present a few slides, then I’d present a few. In a few cases we’d both try to explain topics from slightly different points of view. I’d repeatedly try to use the keyboard instead of the remote-control gadget to control Powerpoint, causing the computer to melt down into a pile of slag and refuse to change the slide. Jared would send me withering glares when I started to move towards the keyboard. It ended up OK, though: we got through everything, and even answered about half of the (excellent) questions! Oh, and shout-out to the AV guy at AOL HQ. I don’t know how they pay his salary, but he rocked.

Jared has posted the slides from the talk here (ppt), and I’ve put the data we made up (for pedagogical purposes) and the code we used to analyze it and generate graphs for the talk here on Github. Alex video-recorded the presentation, and I’ll update this sentence to link to the video once it’s posted somewhere. Hope folks find it valuable!