# Thoughts on Managing Data Science Team Workstreams (and a Shiny app)

This is an updated version of a post originally published on Medium on Jan. 28, 2016. I may have more to say about this sort of thing in the near future.

There are different types of data scientists, with different backgrounds and career paths. With Sean Murphy and Marck Vaisman, I wrote an article about this for O’Reilly a few years back, based on survey research we’d done. Download a copy, if you haven’t read it. This idea is now pretty well established, but I want to talk about a related issue, which is that the type of work that Data Science teams do varies a lot, and that managing those types of work can be an interesting challenge.

As Josh Wills said, data scientists aren’t software developers, but they sometimes do that sort of work, and they aren’t statisticians, but they sometimes do that sort of work too. At EAB, where I lead a Data Science team of people with very diverse backgrounds and skill sets, this issue leads to a lot of complexity and experimentation as we (and the upper management I report to) try to ensure that everyone is working on the right tasks, at the right time, efficiently.

In this post, I’d like to share some thoughts about how we currently think about and manage different types of Data Science work. I also wrote a little Shiny web tool to help us manage our time, and I’ll show that off as well.

# Parameterizable Reproducible Research

The below is a public version of a post originally posted on an internal blog at the Education Advisory Board (EAB), my current employer. We don’t yet have a public tech blog, but I got permission to edit and post it here, along with the referenced code.

Data Science teams get asked to do a lot of different sorts of things. Some of what the team that I’m part of builds is enterprise-scale predictive analytics, such as the Student Risk Model that’s part of the Student Success Collaborative. That’s basically software development with a statistical twist and machine-learning core. Sometimes we get asked to do quick-and-dirty, one-off sorts of things, to answer a research question. We have a variety of tools and processes for that task. But there’s a third category that I want to focus on – frequently requested but slightly-different reports.

## what is it

There’s a relatively new theme in the scientific research community called reproducible research. Briefly, the idea is that it should be possible to re-do all steps after data collection automatically, including data cleaning and reformatting, statistical analyses, and even the actual generation of a camera-ready report with charts, graphs, and tables. This means that if you realized that, say, one data point in your analysis was bogus and needed to be removed, you could remove that data point, press a button, and in a minute or two have a shiny new PDF with all of the results automatically updated.

This type of reproducible research has been around for a while, although it’s having a recent resurgence in part due to the so-called “statistical crisis“. The R (and S) statistical programming languages have supported LaTeX, the scientific document creation/typesetting tool, for many years. Using a tool called Sweave, a researcher “weaves” chunks of text and chunks of R code together. The document is then “executed”, where the R code chunks are executed and the results are converted into a single LaTeX document, which is then compiled into a PDF or similar. The code can generate charts and tables, so no manual effort is needed to rebuild a camera-ready document.

This is great, a huge step forward towards validation of often tricky and complex statistical analyses. If you’re writing a conference paper on, say, a biomedical experiment, a reproducible process can drastically improve your ability to be confident in your work. But data scientists often have to generate this sort of thing repeatedly, from different sources of data or with different parameters. And they have to do so efficiently.

Parameterizable reproducible research, then, is a variant of reproducible research tools and workflows where it is easy to specify data sources, options, and parameters to a standardized analytical report, even one that includes statistical or predictive analyses, data manipulation, and graph generation. The report can be emailed or otherwise sent to people, and doesn’t seem as public as, say, a web-based app developed in Shiny or another technology. This isn’t a huge breakthrough or anything, but it’s a useful pattern that seems worth sharing.

# integrating R with other systems

I just returned from the useR! 2012 conference for developers and users of R. One of the common themes to many of the presentations was integration of R-based statistical systems with other systems, be they other programming languages, web systems, or enterprise data systems. Some highlights for me were an update to Rserve that includes 1-stop web services, and a presentation on ESB integration. Although I didn’t see it discussed, the new httr package for easier access to web services is also another outstanding development in integrating R into large-scale systems.

Coincidentally, I just a week or so ago had given a short presentation to the local R Meetup entitled “Annotating Enterprise Data from an R Server.” The topic for the evening was “R in the Enterprise,” and others talked about generating large, automated reports with knitr, and using RPy2 to integrate R into a Python-based web system. I talked about my experiences building and deploying a predictive system, using the corporate database as the common link. Here are the slides:

# Survey of Data Science / Analytics / Big Data / Applied Stats / Machine Learning etc. Practitioners

As I’ve discussed here before, there is a debate raging (ok, maybe not raging) about terms such as “data science”, “analytics”, “data mining”, and “big data”. What do they mean, how do they overlap, and perhaps most importantly, who are the people who work in these fields?

Along with two other DC-area Data Scientists, Marck Vaisman and Sean Murphy, I’ve put together a survey to explore some of these issues. Help us quantitatively understand the space of data-related skills and careers by participating!

It should take 10 minutes or less, data will be kept confidential, and we look forward to sharing our results and insights in a variety of venues, including this blog! Thanks!

This past Friday, the web portal to the US Federal government, USA.gov, organized hackathons across the US for programmers and data scientists to work with and analyze the data from their link-shortening service. It turns out that if you shorten a web link with bit.ly, the shortened link looks like 1.usa.gov/V6NpL (that one goes to a NASA page). And because this service was paid for by taxpayer money, the data about each clickthrough is freely available.

Shortened-link click-through data is interesting. It tells you the time and approximate geographic location of each click-through, and the web page or service that the link was on (assuming someone didn’t type the URL in by hand). You also know when the shortened link was created, which tells you a little bit about the way links are shared. Bit.ly themselves have several full time data scientists on staff whose job is to learn about what shortened-link data can say about web traffic patterns and link sharing, potentially very lucrative information.

For my part, I just wanted to do some fun visualizations. Along with friends in NYC, I joined the hackathon remotely, following along on twitter and listening to dance music in their turntable.fm room. I managed to get rough drafts of two somewhat non-trivial graphs done during the official hackathon, and I re-built them with larger and more random data later.

This first graph looks at the difference in time between when a link was created (the first time someone tried to shorten the target URL) and when the clickthroughs happened. For each of the 25 most frequently visited target domains (mostly US government agencies), I built a density plot, or smoothed histogram, of the timings.

(click for a larger image) There are some interesting differences. Links from senate.gov are mostly clicked through within a few hours of their creation, and links from the NY Courts are clicked through in less than an hour. There appear to be links to NOAA and the State of California pages that are frequently clicked through hundreds of days after their creation. It would be interesting to dive into the content of the target pages, categorize them, and learn what causes these differences.

Speaking of diving into the content, I did a very simple version of that next. When clicking a link to a government web page, are people looking for information about their hometown? Fortunately, clickthrough data includes geocode information for the clicker’s IP address, which includes the nearest city. I decided to find out by scraping the text content of the 100 most frequently accessed web pages, and detected whether or not each city was in each web page.

(again, click for larger image) This “navel-gazers” plot shows the summarized results. For each city in the data set with more than 5 clickthroughs, I plotted the raw number of clickthroughs from that city (the X axis) against the proportion of clickthroughs that ended up on a web page with the name of the city in it (the Y axis). Many cities are clustered in the lower-left, with few clicks and no instances of their city on the target page. Large cities like New York and London are far to the right, as expected from their population, and they show up in target web pages occasionally. Washington (DC) is both a frequent clicker of shortened links, as well as a city that tends to show up on web pages, unsurprising given that it is the seat of the Federal government. The exceptions are the most interesting. People in Bangalore clicked through more than 15 times in this sample, and about 12% of their clicks were to pages with the name of their city. In Boulder, a quarter of the 12 or so clicks mentioned their town!

Deeper analysis would be needed to explain these results, but they were fun to put together! For those interested in checking out my work, including R code to pull a sample of 1.usa.gov data from the archives, please check out my repository on GitHub: https://github.com/HarlanH/hackathon-1usagov

# making meat shares more efficient with R and Symphony

In my previous post, I motivated a web application that would allow small-scale sustainable meat producers to sell directly to consumers using a meat share approach, using constrained optimization techniques to maximize utility for everyone involved. In this post, I’ll walk through some R code that I wrote to demonstrate the technique on a small scale.

Although the problem is set up in R, the actual mathematical optimization is done by Symphony, an open-source mixed-integer solver that’s part of the COIN-OR project. (The problem of optimizing assignments, in this case of cuts of meat to people, is an integer planning problem, because the solution involves assigning either 0 or 1 of each cut to each person. More generally, linear programming and related optimization frameworks allow solving for real-numbered variables.) The RSymphony package allows problems set up in R to be solved by the C/C++ Symphony code with little hassle.

My code is in a public github repository called groupmeat-demo, and the demo code discussed here is in the subset_test.R file. (The other stuff in the repo is an unfinished version of a larger-scale demo with slightly more realistic data.)

For this toy problem, we want to optimally assign 6 items to 3 people, each of whom have a different utility (value) for each item. In this case, I’m ignoring any fixed utility, such as cost in dollars, but that could be added into the formulation. Additionally, assume that items #1 and #2 cannot both be assigned, as with pork loin and pork chops.

This sort of problem is fairly simple to define mathematically. To set up the problem in code, I’ll need to create some matrices that are used in the computation. Briefly, the goal is to maximize an objective expression, $\mathbf{c}^T\mathbf{x}$, where the $\mathbf{x}$ are variables that will be 0 or 1, indicating an assignment or non-assignment, and the $\mathbf{c}$ is a coefficient vector representing the utilities of assigning each item to each person. Here, there are 6 items for 3 people, so I’ll have a 6×3 matrix, flattened to an 18-vector. The goal will be to find 0’s and 1’s for $\mathbf{x}$ that maximize the whole expression.

Here’s what the $\mathbf{c}$ matrix looks like:

      pers1 pers2  pers3
item1 0.467 0.221 0.2151
item2 0.030 0.252 0.4979
item3 0.019 0.033 0.0304
item4 0.043 0.348 0.0158
item5 0.414 0.050 0.0096
item6 0.029 0.095 0.2311

It appears as if everyone like item1, but only person1 likes item5.

Additionally, I need to define some constraints. For starters, it makes no sense to assign an item to more than one person. So, for each row of that matrix, the sum of the variables (not the utilities) must be 1, or maybe 0 (if that item is not assigned). I’ll create a constraint matrix, where each row contains 18 columns, and the pattern of 0’s and 1’s defines a row of the assignment matrix. Since there are 6 items, there are 6 rows (for now). Each row needs to be less than or equal to one (I’ll tell the solver to use integers only later), so I also define vectors of inequality symbols and right-hand-sides.

?View Code RSLANG
 # for each item/row, enforce that the sum of indicators for its assignment are <= 1 mat <- laply(1:num.items, function(ii) { x <- mat.0; x[ii, ] <- 1; as.double(x) }) dir <- rep('<=', num.items) rhs <- rep(1, num.items)

To add the loin/chops constraint, I need to add another row, specifying that the sum of the indicators for both rows now must be 1 or less as well.

?View Code RSLANG
 # for rows 1 and 2, enforce that the sum of indicators for their assignments are <= 1 mat <- rbind(mat, matrix(matrix(c(1, 1, rep(0, num.items-2)), nrow=num.items, ncol=num.pers), nrow=1)) dir <- c(dir, '<=') rhs <- c(rhs, 1)

Here’s what those matrices and vectors look like:

> mat
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[1,] 1 0 0 0 0 0 1 0 0  0  0  0  1  0  0  0  0  0
[2,] 0 1 0 0 0 0 0 1 0  0  0  0  0  1  0  0  0  0
[3,] 0 0 1 0 0 0 0 0 1  0  0  0  0  0  1  0  0  0
[4,] 0 0 0 1 0 0 0 0 0  1  0  0  0  0  0  1  0  0
[5,] 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0  0  1  0
[6,] 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0  0  1
[7,] 1 1 0 0 0 0 1 1 0  0  0  0  1  1  0  0  0  0
> dir
[1] "<=" "<=" "<=" "<=" "<=" "<=" "<="
> rhs
[1] 1 1 1 1 1 1 1


Finally, specify that the variables must be binary (0 or 1), and call SYMPHONY to solve the problem:

?View Code RSLANG
 # this is an IP problem, for now types <- rep('B', num.items * num.pers) max <- TRUE # maximizing utility   soln <- Rsymphony_solve_LP(obj, mat, dir, rhs, types=types, max=max)

And, with a bit of post-processing to recover matrices from vectors, here’s the result:

$solution [1] 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1$objval
[1] 1.52

$status TM_OPTIMAL_SOLUTION_FOUND 0 Person #1 got Items 5 worth 0.41 Person #2 got Items 3, 4 worth 0.38 Person #3 got Items 2, 6 worth 0.73 So that’s great. It found an optimal solution worth more than 50% more than the expected value of a random assignment. But there’s a problem. There’s no guarantee that everyone gets anything, and in this case, person #3 gets almost twice as much utility as person #2. Unfair! We need to enforce an additional constraint, that the difference between the maximum utility that any one person gets and the minimum utility that any one person gets is not too high. This is sometimes called a parity constraint. Adding parity constraints is a little tricky, but the basic idea here is to add two more variables to the 18 I’ve already defined. These variables are positive real numbers, and they are forced by constraints to be the maximum and minimum total utilities per person. In the objective function, then, they are weighted so that their difference is not to big. So, that expression becomes: $\mathbf{c}^T\mathbf{x} - \lambda x_{19} - - \lambda x^{20}$. The first variable (the maximum utility of any person) is minimized, while the second variable is maximized. The $\lambda$ free parameter defines how much to trade off parity with total utility, and I’ll set it to 1 for now. For the existing rows of the constraint matrix, these new variables get 0’s. But two more rows need to be added, per person, to force their values to be no bigger/smaller (and thus the same as) the maximum/minimum of any person’s assigned utility. ?View Code RSLANG  # now for those upper and lower variables # \forall p, \sum_i u_i x_{i,p} - d.upper \le 0 # \forall p, \sum_i u_i x_{i,p} - d.lower \ge 0 # so, two more rows per person d.constraint <- function(iperson, ul) { # ul = 1 for upper, 0 for lower x <- mat.utility.0 x[, iperson ] <- 1 x <- x * obj.utility c(as.double(x), (if (ul) c(-1,0) else c(0,-1))) } mat <- rbind(mat, maply(expand.grid(iperson=1:num.pers, ul=c(1,0)), d.constraint, .expand=FALSE)) dir <- c(dir, c(rep('<=', num.pers), rep('>=', num.pers))) rhs <- c(rhs, rep(0, num.pers*2)) The constraint inequalities then becomes as follows:  > print(mat, digits=2) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1.00 0.000 0.000 0.000 0.00 0.000 1.00 0.00 0.000 0.00 0.00 0.000 1.00 0.0 0.000 0.000 0.0000 0.00 0 0 0.00 1.000 0.000 0.000 0.00 0.000 0.00 1.00 0.000 0.00 0.00 0.000 0.00 1.0 0.000 0.000 0.0000 0.00 0 0 0.00 0.000 1.000 0.000 0.00 0.000 0.00 0.00 1.000 0.00 0.00 0.000 0.00 0.0 1.000 0.000 0.0000 0.00 0 0 0.00 0.000 0.000 1.000 0.00 0.000 0.00 0.00 0.000 1.00 0.00 0.000 0.00 0.0 0.000 1.000 0.0000 0.00 0 0 0.00 0.000 0.000 0.000 1.00 0.000 0.00 0.00 0.000 0.00 1.00 0.000 0.00 0.0 0.000 0.000 1.0000 0.00 0 0 0.00 0.000 0.000 0.000 0.00 1.000 0.00 0.00 0.000 0.00 0.00 1.000 0.00 0.0 0.000 0.000 0.0000 1.00 0 0 1.00 1.000 0.000 0.000 0.00 0.000 1.00 1.00 0.000 0.00 0.00 0.000 1.00 1.0 0.000 0.000 0.0000 0.00 0 0 0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00 -1 0 0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00 -1 0 0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23 -1 0 0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00 0 -1 0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00 0 -1 0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23 0 -1 > dir [1] "<=" "<=" "<=" "<=" "<=" "<=" "<=" "<=" "<=" "<=" ">=" "<=" "<=" > rhs [1] 1 1 1 1 1 1 1 0 0 0 0 0 0 Looking at just the last row, this constraint says that the sum of the utilities of any assigned items for person #3, minus the lower limit, must be at least 0. That is essentially the definition of the lower limit, that that constraint holds true for all three people in this problem. Similar logic applies for the upper limit. Running the solver with this set of inputs gives the following:  $solution [1] 0.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000 [18] 0.000 0.498 0.433   $objval [1] 1.31$status TM_OPTIMAL_SOLUTION_FOUND 0   Person #1 got Items 3, 5 worth 0.43 Person #2 got Items 4, 6 worth 0.44 Person #3 got Items 2 worth 0.50

The last two numbers in the solution are the values of the upper and lower bounds. Note that the objective value is only 41% higher than a random assignment, but the utilities assigned to each person are much closer. Dropping the $\lambda$ value to something closer to 0 causes the weights of the parity bounds to be less important, and the solution tends to be closer to the initial result.

Scaling this up to include constraints in pricing, farm preferences, price vs. preference meta-preferences, etc., is not conceptually difficult, but would just entail careful programming. It is left as an exercise for the well-motivated reader!

If you’ve made it this far, I’d definitely appreciate any feedback about this idea, corrections to my formulation or code or terminology, etc!

(Thanks to Paul Ruben and others on OR-Exchange, who helped me figure out how to think about the parity problem, and to the authors of WP-codebox and WP LaTeX for giving me tools to put nice scrollable R code and math in this post!)

# intuitive visualizations of categorization for non-technical audiences

For a project I’m working on at work, I’m building a predictive model that categorizes something (I can’t tell you what) into two bins. There is a default bin that 95% of the things belong to and a bin that the business cares a lot about, containing 5% of the things. Some readers may be familiar with the use of predictive models to identify better sales leads, so that you can target the leads most likely to convert and minimize the amount of effort wasted on people who won’t purchase your product. Although my situation doesn’t have to do with sales leads, I’m going to pretend it does, as it’s a common domain.

My data is many thousands of “leads”, for which I’ve constructed hundreds of predictive features (mostly 1/0, a few numeric) each. I can plug this data into any number of common statistical and machine learning systems which will crunch the numbers and provide a black box that can do a pretty good job of separating more-valuable leads from less valuable leads. That’s great, but now I have to communicate what I’ve done, and how valuable it is, to an audience that struggles with relatively simple statistical concepts like correlation. What can I do?

# how to speak ggplot2 like a native, and Predictive Analytics World

I was recently given the opportunity to re-present my ggplot2 talk, which I originally gave to the NYC R Meetup, to the DC R Meetup group. The Meetup was held co-located with the Predictive Analytics World conference in Alexandria, VA. (More on my thoughts on PAW below…) Contentwise, I made only small changes, changing a bit of patter and adding more examples at the end. I still love ggplot, with some frustration at the way it is typically introduced. Some of the audience had no R experience at all, while others were experts. One person, a grad student at U. of Maryland, had had very similar difficulty as I had when originally learning ggplot2, and his enthusiastic nods during my presentation were very validating! For reference, the Meetup page is here, and I stuck the current version of the slides in a public Dropbox, located here.

And a few thoughts about PAW. The conference was well-run (although I have my gripes with the hotel and its location!) and there were an interesting and eclectic lineup of speakers, from a variety of industries. Compared to academic conferences I’ve attended, I missed having all the grad students around. At PAW, I felt rather young, which had not been true at academic conferences in quite a long time! The content of the conference focused on people using predictive methods (statistics, data mining, machine learning) at the individual-customer level, for marketing or retainment or other purposes. That’s not my primary interest right now — my work is focused at a slightly higher operations-research-y level, trying to make sure that customers in the aggregate have good options. But I enjoyed learning about what other people are doing using somewhat similar methods. Next year, though, I think I’ll try to go to a different conference, perhaps UseR! in the UK, or INFORMS’ applied conference

# Prediction with Multilevel Regression Models, and Pizza

The Meetup phenomenon, which is now substantial and longstanding enough to be more of a cultural change than a flash in the pan, continues to impress me. Even more so than tools like LinkedIn, Meetups have changed the nature of professional networking, making it more informal, diverse, and decentralized. Last night, statistics consultant (and cheap eats guru) Jared Lander and I presented a talk on a statistical technique tangentially related to my professional work (more closely associated with Jared’s). The origin of this presentation is worth noting. On Meetup’s web site, members of a group can suggest topics for meetings. Before even attending a single NYC Predictive Analytics event, I posted several topics that I thought might be interesting for the group. A bit later, the organizers (Bruno and Alex) contacted me to see if I’d be willing to present on prediction with Multilevel models. I said that I would, but only if I could co-present with someone who actually knew something about the topic a complementary set of skills and experiences. Knowing Jared from the NYC R Meetup group, and knowing that he learned about multilevel models from the professor who wrote the best book on the topic, and knowing that he’s pretty good in front of an audience, I suggested we collaborate.

Despite requiring a lot of work, and a lot of learning of details on my part, we managed to throw together a pretty decent talk. (As of this morning, there’s four ratings of the event on Meetup, and we got 5/5 stars! Yay us! Not statistically conclusive, though…) We used as an example topic for data analysis the difficult and critically important problem of predicting reviews of pizza restaurants in downtown NYC. Jared is actually an expert on this topic, having written his Masters thesis on ratings from Menupages.com. For the talk, Jared would present a few slides, then I’d present a few. In a few cases we’d both try to explain topics from slightly different points of view. I’d repeatedly try to use the keyboard instead of the remote-control gadget to control Powerpoint, causing the computer to melt down into a pile of slag and refuse to change the slide. Jared would send me withering glares when I started to move towards the keyboard. It ended up OK, though, we got through everything, and even answered about half of the (excellent) questions! Oh, and shout-out to the AV guy at AOL HQ. I don’t know how they pay his salary, but he rocked.

Jared has posted the slides from the talk here (ppt), and I’ve put the data we made up (for pedagogical purposes) and the code we used to analyze it and generate graphs for the talk here on Github. Alex video-recorded the presentation, and I’ll update this sentence to link to the video once it’s posted somewhere. Hope folks find it valuable!

# ggplot and concepts — what’s right, and what’s wrong

A few months back I gave a presentation to the NYC R Meetup. (R is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on ggplot2, a popular package for generating graphs of data and statistics. In the talk (which you can see here, including both my slides and my patter!) I presented both the really great things about ggplot2 and some of its downsides. In this blog post, I wanted to expand a bit on my thinking on ggplot, the Grammar of Graphics, and how peoples’ conceptual representations of graphs, data, ggplot, and R all interact. ggplot is both incredibly elegant and unfortunately difficult to learn to use well, I think as a consequence of the variety of representations. Continue reading