# Patterns for Connecting Predictive Models to Software Products

This was originally published on Medium on June 21st, 2016.

You’re a data scientist, and you’ve got a predictive model — great work! Now what? In many cases, you need to hook it up to some sort of large, complex software product so that users can get access to the predictions. Think of LinkedIn’s People You May Know, which mines your professional graph for unconnected connections, or Hopper’s flight price predictions. Those started out as prototypes on someone’s laptop, and are now running at scale, with many millions of users.

Even if you’re building an internal tool to make a business run better, if you didn’t build the whole app, you’ve got to get the scoring/prediction (as distinct from the fitting/estimation) part of the model connected to a system someone else wrote. In this blog post, I’m going to summarize two methods for doing this that I think are particularly good practices — database mediation and web services.

# Parameterizable Reproducible Research

The below is a public version of a post originally posted on an internal blog at the Education Advisory Board (EAB), my current employer. We don’t yet have a public tech blog, but I got permission to edit and post it here, along with the referenced code.

Data Science teams get asked to do a lot of different sorts of things. Some of what the team that I’m part of builds is enterprise-scale predictive analytics, such as the Student Risk Model that’s part of the Student Success Collaborative. That’s basically software development with a statistical twist and machine-learning core. Sometimes we get asked to do quick-and-dirty, one-off sorts of things, to answer a research question. We have a variety of tools and processes for that task. But there’s a third category that I want to focus on – frequently requested but slightly-different reports.

## what is it

There’s a relatively new theme in the scientific research community called reproducible research. Briefly, the idea is that it should be possible to re-do all steps after data collection automatically, including data cleaning and reformatting, statistical analyses, and even the actual generation of a camera-ready report with charts, graphs, and tables. This means that if you realized that, say, one data point in your analysis was bogus and needed to be removed, you could remove that data point, press a button, and in a minute or two have a shiny new PDF with all of the results automatically updated.

This type of reproducible research has been around for a while, although it’s having a recent resurgence in part due to the so-called “statistical crisis”. The R (and S) statistical programming languages have supported LaTeX, the scientific document creation/typesetting tool, for many years. Using a tool called Sweave, a researcher “weaves” chunks of text and chunks of R code together. The document is then “executed”, where the R code chunks are executed and the results are converted into a single LaTeX document, which is then compiled into a PDF or similar. The code can generate charts and tables, so no manual effort is needed to rebuild a camera-ready document.

This is great, a huge step forward towards validation of often tricky and complex statistical analyses. If you’re writing a conference paper on, say, a biomedical experiment, a reproducible process can drastically improve your ability to be confident in your work. But data scientists often have to generate this sort of thing repeatedly, from different sources of data or with different parameters. And they have to do so efficiently.

Parameterizable reproducible research, then, is a variant of reproducible research tools and workflows where it is easy to specify data sources, options, and parameters to a standardized analytical report, even one that includes statistical or predictive analyses, data manipulation, and graph generation. The report can be emailed or otherwise sent to people, and doesn’t seem as public as, say, a web-based app developed in Shiny or another technology. This isn’t a huge breakthrough or anything, but it’s a useful pattern that seems worth sharing.
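As a rough illustration of the pattern, here is a minimal sketch that uses Sweave (mentioned above) to weave the same template for different parameter values. The function `render_report`, the variables `report_data` and `report_group`, and the toy data are all hypothetical names invented for this example; they are not the code referenced in the post.

```r
# Hypothetical helper: weave one small report for one parameter value.
# `report_data` and `report_group` are illustrative names, not from the post.
render_report <- function(data, group, out_tex) {
  # Sweave evaluates code chunks and \Sexpr{} in the global environment,
  # so the parameters are placed there before weaving.
  assign("report_data", data, envir = .GlobalEnv)
  assign("report_group", group, envir = .GlobalEnv)

  # A tiny Sweave template: one hidden code chunk plus inline results
  template <- tempfile(fileext = ".Rnw")
  writeLines(c(
    "\\documentclass{article}",
    "\\begin{document}",
    "<<echo=FALSE>>=",
    "d <- report_data[report_data$group == report_group, ]",
    "@",
    "Group \\Sexpr{report_group} has \\Sexpr{nrow(d)} rows",
    "with mean score \\Sexpr{round(mean(d$score), 2)}.",
    "\\end{document}"
  ), template)

  # Produce the LaTeX file; compiling it to PDF is a separate step
  utils::Sweave(template, output = out_tex, quiet = TRUE)
  readLines(out_tex)
}

# Made-up data, purely for demonstration
toy <- data.frame(group = c("A", "A", "B", "B", "B"),
                  score = c(1, 2, 3, 4, 5))
tex_b <- render_report(toy, "B", tempfile(fileext = ".tex"))
```

Calling `render_report(toy, "A", ...)` would regenerate the same report for a different group, which is the whole point of the pattern: the template stays fixed and only the parameters change.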

# INFORMS Business Analytics 2014 Blog Posts

Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research, in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference’s WordPress web site, and was also obliged to do so!

Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces:

## Operations Research, from the point of view of Data Science

• more insight, less action — deliverables tend towards predictions and storytelling, versus formal optimization
• more openness, less big iron — open source software leads to a low-cost, highly flexible approach
• more scruffy, less neat — data science technologies often come from black-box statistical models, vs. domain-based theory
• more velocity, smaller projects — a hundred $10K projects beats one $1M project
• more science, less engineering — both practitioners and methods have different backgrounds
• more hipsters, less suits — stronger connections to the tech industry than to the boardroom
• more rockstars, less teams — one person can now (roughly) do everything, in simple cases, for better or worse

## What is a “Data Product”?

DJ Patil says “a data product is a product that facilitates an end goal through the use of data.” So, it’s not just an analysis, or a recommendation to executives, or an insight that leads to an improvement to a business process. It’s a visible component of a system. LinkedIn’s People You May Know is viewed by many millions of customers, and it’s based on the complex interactions of the customers themselves.

## Healthcare (and not Education) at INFORMS Analytics

[A]s a DC resident, we often hear of “Healthcare and Education” as a linked pair of industries. Both are systems focused on social good, with intertwined government, nonprofit, and for-profit entities, highly distributed management, and (reportedly) huge opportunities for improvement. Aside from MIT Leaders for Global Operations winning the Smith Prize (and a number of shoutouts to academic partners and mentors), there was not a peep from the education sector at tonight’s awards ceremony. Is education, and particularly K-12 and postsecondary education, not amenable to OR techniques or solutions?

## What’s Changed at the Practice/Analytics Conference?

In 2011, almost every talk seemed to me to be from a Fortune 500 company, or a large nonprofit, or a consulting firm advising a Fortune 500 company or a large nonprofit. Entrepreneurship around analytics was barely to be seen. This year, there are at least a few talks about Hadoop and iPhone apps and more. Has the cost of deploying advanced analytics substantially dropped?

## Why OR/Analytics People Need to Know About Database Technology

It’s worthwhile learning a bit about databases, even if you have no decision-making authority in your organization, and don’t feel like becoming a database administrator (good call). But by getting involved early in the data-collection process, when IT folks are sitting around a table arguing about platform questions, you can get a word in occasionally about the things that matter for analytics — collecting all the data, storing it in a way friendly to later analytics, and so forth.

All in all, I enjoyed blogging the conference, and recommend the practice to others! It’s a great way to organize your thoughts and to summarize and synthesize your experiences.

# Communication and the Data Scientist

I recently gave a presentation on communication issues around the terms “Data Science” and “Data Scientist”, based in part on a survey that I did with my Meetup colleagues Marck and Sean. The basic idea is that these new, extremely-broad buzzwords have resulted in confusion, which has impacted the ability of people with skills and people with data to meet and effectively communicate about who does what and what appropriate expectations should be. The survey was an attempt to bring some clarity to the issue of who are the people in this newly-reformulated community, and how do they view themselves and their skills. For more on the survey, see our post on the Data Community DC blog. Here’s the video of my presentation at DataGotham:

# integrating R with other systems

I just returned from the useR! 2012 conference for developers and users of R. One of the common themes to many of the presentations was integration of R-based statistical systems with other systems, be they other programming languages, web systems, or enterprise data systems. Some highlights for me were an update to Rserve that includes 1-stop web services, and a presentation on ESB integration. Although I didn’t see it discussed, the new httr package for easier access to web services is another outstanding development in integrating R into large-scale systems.

Coincidentally, a week or so earlier I had given a short presentation to the local R Meetup entitled “Annotating Enterprise Data from an R Server.” The topic for the evening was “R in the Enterprise,” and others talked about generating large, automated reports with knitr, and using RPy2 to integrate R into a Python-based web system. I talked about my experiences building and deploying a predictive system, using the corporate database as the common link. Here are the slides:

# Survey of Data Science / Analytics / Big Data / Applied Stats / Machine Learning etc. Practitioners

As I’ve discussed here before, there is a debate raging (ok, maybe not raging) about terms such as “data science”, “analytics”, “data mining”, and “big data”. What do they mean, how do they overlap, and perhaps most importantly, who are the people who work in these fields?

Along with two other DC-area Data Scientists, Marck Vaisman and Sean Murphy, I’ve put together a survey to explore some of these issues. Help us quantitatively understand the space of data-related skills and careers by participating!

It should take 10 minutes or less, data will be kept confidential, and we look forward to sharing our results and insights in a variety of venues, including this blog! Thanks!

# Data Science, Moore’s Law, and Moneyball

I’m fond of navel gazing, meta discussions, and so forth. I’ve recently written about inferring navel gazing from link data, and about the meaning of the “Analytics” buzzword. This post will be my second on that other infectious buzzword, “Data Science”.

When I moved to Washington DC in July, I was struck by the fact that there was no Meetup for analytics/applied statistics/machine learning/data science. There’s a great DC Tech Meetup, a great Big Data Meetup, and a great R Meetup, but nothing like the NYC Predictive Analytics Meetup. So, I and a couple of others I talked to about this (Marck Vaisman, who I first met through the NYC R Meetup a couple years ago, and Matt Bryan, who I met just after moving to town), started a new Meetup, which we decided to call “Data Science DC“.

For our second meetup, we thought we should address some aspect of our name, and so I presented a little bit about the term and the controversies around its definition and its recent dramatic upsurge in popularity. Here are the slides (note that you should be able to click through the links on the slide to the source documents):

I mostly didn’t present a personal opinion about what I thought the term means, or what it should mean, but instead wanted to present a bunch of other people’s points of view to kick off an interesting discussion. And in that sense I succeeded. We had an exceedingly interesting conversation following my slides, and I think a couple of the most interesting ideas from the evening came out of that discussion.

Here are three theses I’d like to propose.

1. “Data Science” is defined as what “Data Scientists” do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. Who Data Scientists are may be the more fundamental question.
2. One reason Data Science is a big thing now is because advances in technology have made it easy for Data Scientists to develop wide-ranging expertise. Even 10 years ago, the idea that the same person could integrate several databases, run a multilevel regression, and generate elegant visualizations would be seen as incredibly rare.
3. The other reason Data Science is a big thing now is because sabermetrics demonstrated that number-crunching brings results. There’s nothing business leaders love more than a sports analogy, and the analytic revolution in professional sports immediately drew attention to the ways that numbers beat intuition.

I tend to like the idea that Data Science is defined by its practitioners, that it’s a career path rather than a category of activities. In my conversations with people, it seems that people who consider themselves Data Scientists typically have eclectic career paths, that might in some ways seem not to make much sense. A typical path might be someone who started out learning to program, then spent some time in a scientific field, then hopped around a variety of different roles, collecting a wide variety of different skills, all of which related to using analytical techniques to make sense of data.

This sort of career path isn’t particularly new, but what is new is that it’s now possible to get started relatively quickly and cheaply in all of the processes involved in Data Science. (Thanks to Taylor Horton for suggesting this at the Meetup!) Fast computers, open source tools, and some programming skills allow someone to try a new data management approach or a new machine learning technology incredibly quickly, and to iterate on approaches until a solution to a particular problem is found. This has two consequences. First of all, the productivity of a modern Data Scientist is remarkable. Projects that a few decades ago would have taken teams of people literally years can now be done in a few days. Second of all, this amazing productivity allows people to spend their 10,000 hours developing expertise in the now vertically integrated process of Data Science, rather than having to spend all of that time focusing on developing skills on just a single aspect of the task. There is a huge number of things that need to be learned to be an effective Data Scientist, but it is now possible to learn those skills quickly enough to make a career out of being a Jack of all Trades and a near-master at many of them.

So now there’s a supply of people who could be Data Scientists. But what about the type of demand that drives an incessant stream of O’Reilly articles and job postings? Where does the demand come from? Justin Grimes had an intriguing idea that resonated with me — analytics in sports, which I propose as the other reason why analytics and data science have become buzzwords. Although business has used mathematical methods for 100 years (thousands if you include finance and insurance), the idea that you could hire a very small number of people to analyze data and beat gut instincts in many aspects of decision making is much newer. The idea that a statistician could turn around the Oakland A’s by radically overturning longstanding recruiting practice was a powerful analogy. Even now, business books about analytics almost always have sports examples in the first chapters. I made a point at work the other day by noting that most professional sports prognosticators predict NFL playoff outcomes wrong because they over-weight last year’s results. Sports analogies get attention.

Does this make sense? Data Science is a buzzword now because a group of people with eclectic talents match a growing demand for and recognition of the value of those talents. I’d love feedback on these thoughts!

This past Friday, the web portal to the US Federal government, USA.gov, organized hackathons across the US for programmers and data scientists to work with and analyze the data from their link-shortening service. It turns out that if you shorten a link to a government web page with bit.ly, the shortened link looks like 1.usa.gov/V6NpL (that one goes to a NASA page). And because this service was paid for by taxpayer money, the data about each clickthrough is freely available.

Shortened-link click-through data is interesting. It tells you the time and approximate geographic location of each click-through, and the web page or service that the link was on (assuming someone didn’t type the URL in by hand). You also know when the shortened link was created, which tells you a little bit about the way links are shared. Bit.ly themselves have several full time data scientists on staff whose job is to learn about what shortened-link data can say about web traffic patterns and link sharing, potentially very lucrative information.

For my part, I just wanted to do some fun visualizations. Along with friends in NYC, I joined the hackathon remotely, following along on twitter and listening to dance music in their turntable.fm room. I managed to get rough drafts of two somewhat non-trivial graphs done during the official hackathon, and I re-built them with larger and more random data later.

This first graph looks at the difference in time between when a link was created (the first time someone tried to shorten the target URL) and when the clickthroughs happened. For each of the 25 most frequently visited target domains (mostly US government agencies), I built a density plot, or smoothed histogram, of the timings.
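A minimal sketch of that computation, using simulated data rather than the real click stream: the column names (`domain`, `created`, `clicked`) and the simulated lag distributions are assumptions made for illustration, not the actual 1.usa.gov fields or the code in my repository.

```r
# Simulated click records; column names and distributions are assumptions
# for illustration only, not the actual 1.usa.gov fields.
set.seed(1)
n <- 200
created <- as.POSIXct("2012-02-01 00:00:00", tz = "UTC") + rep(0, 2 * n)

# Short lags for one domain, long-tailed lags for the other (in hours)
lag_hours <- c(rexp(n, rate = 1 / 3), rexp(n, rate = 1 / 500))

clicks <- data.frame(
  domain  = rep(c("senate.gov", "noaa.gov"), each = n),
  created = created,
  clicked = created + lag_hours * 3600,
  stringsAsFactors = FALSE
)

# Time from link creation to click-through, in hours
clicks$lag <- as.numeric(difftime(clicks$clicked, clicks$created, units = "hours"))

# One smoothed histogram per domain, on a log scale because lags can span
# anything from minutes to hundreds of days
dens <- lapply(split(log10(clicks$lag), clicks$domain), density)
```

Each element of `dens` can then be drawn with `plot()` or `lines()`, one curve or panel per domain.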

There are some interesting differences. Links from senate.gov are mostly clicked through within a few hours of their creation, and links from the NY Courts are clicked through in less than an hour. There appear to be links to NOAA and the State of California pages that are frequently clicked through hundreds of days after their creation. It would be interesting to dive into the content of the target pages, categorize them, and learn what causes these differences.

Speaking of diving into the content, I did a very simple version of that next. When clicking a link to a government web page, are people looking for information about their hometown? Fortunately, clickthrough data includes geocode information for the clicker’s IP address, which includes the nearest city. I decided to find out by scraping the text content of the 100 most frequently accessed web pages, and detecting whether or not each city was in each web page.

This “navel-gazers” plot shows the summarized results. For each city in the data set with more than 5 clickthroughs, I plotted the raw number of clickthroughs from that city (the X axis) against the proportion of clickthroughs that ended up on a web page with the name of the city in it (the Y axis). Many cities are clustered in the lower-left, with few clicks and no instances of their city on the target page. Large cities like New York and London are far to the right, as expected from their population, and they show up in target web pages occasionally. Washington (DC) is both a frequent clicker of shortened links, as well as a city that tends to show up on web pages, unsurprising given that it is the seat of the Federal government. The exceptions are the most interesting. People in Bangalore clicked through more than 15 times in this sample, and about 12% of their clicks were to pages with the name of their city. In Boulder, a quarter of the 12 or so clicks mentioned their town!
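The two quantities on the axes can be sketched in a few lines of base R. The click log and page texts below are made up for illustration, and the column names are assumptions; the real analysis used scraped page content and a minimum-click filter.

```r
# Toy click log and page text; all values are made up for illustration.
clicks <- data.frame(
  city = c("Boulder", "Boulder", "Boulder", "Boulder", "London", "London"),
  url  = c("u1", "u2", "u2", "u3", "u1", "u3"),
  stringsAsFactors = FALSE
)
pages <- c(u1 = "Weather alerts for Boulder and Denver",
           u2 = "Federal student aid overview",
           u3 = "Visa information for London travellers")

# Does the text of the clicked page mention the clicker's city?
clicks$mentions_city <- mapply(
  function(city, url) grepl(city, pages[[url]], fixed = TRUE),
  clicks$city, clicks$url
)

# Per city: raw click count (X axis) and share of clicks that landed on a
# page naming that city (Y axis)
navel <- data.frame(
  city   = sort(unique(clicks$city)),
  clicks = as.vector(table(clicks$city)),
  share  = as.vector(tapply(clicks$mentions_city, clicks$city, mean))
)
```

A plain substring match like this is crude (it would match "Washington" in "Washington State", for instance), which is one reason deeper analysis would be needed before drawing conclusions.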

Deeper analysis would be needed to explain these results, but they were fun to put together! For those interested in checking out my work, including R code to pull a sample of 1.usa.gov data from the archives, please check out my repository on GitHub: https://github.com/HarlanH/hackathon-1usagov

# making meat shares more efficient with R and Symphony

In my previous post, I motivated a web application that would allow small-scale sustainable meat producers to sell directly to consumers using a meat share approach, using constrained optimization techniques to maximize utility for everyone involved. In this post, I’ll walk through some R code that I wrote to demonstrate the technique on a small scale.

Although the problem is set up in R, the actual mathematical optimization is done by Symphony, an open-source mixed-integer solver that’s part of the COIN-OR project. (The problem of optimizing assignments, in this case of cuts of meat to people, is an integer programming problem, because the solution involves assigning either 0 or 1 of each cut to each person. More generally, linear programming and related optimization frameworks allow solving for real-numbered variables.) The Rsymphony package allows problems set up in R to be solved by the C/C++ Symphony code with little hassle.

My code is in a public github repository called groupmeat-demo, and the demo code discussed here is in the subset_test.R file. (The other stuff in the repo is an unfinished version of a larger-scale demo with slightly more realistic data.)

For this toy problem, we want to optimally assign 6 items to 3 people, each of whom has a different utility (value) for each item. In this case, I’m ignoring any fixed utility, such as cost in dollars, but that could be added into the formulation. Additionally, assume that items #1 and #2 cannot both be assigned, as with pork loin and pork chops.

This sort of problem is fairly simple to define mathematically. To set up the problem in code, I’ll need to create some matrices that are used in the computation. Briefly, the goal is to maximize an objective expression, $\mathbf{c}^T\mathbf{x}$, where the $\mathbf{x}$ are variables that will be 0 or 1, indicating an assignment or non-assignment, and the $\mathbf{c}$ is a coefficient vector representing the utilities of assigning each item to each person. Here, there are 6 items for 3 people, so I’ll have a 6×3 matrix, flattened to an 18-vector. The goal will be to find 0’s and 1’s for $\mathbf{x}$ that maximize the whole expression.

Here’s what the $\mathbf{c}$ matrix looks like:

          pers1 pers2  pers3
    item1 0.467 0.221 0.2151
    item2 0.030 0.252 0.4979
    item3 0.019 0.033 0.0304
    item4 0.043 0.348 0.0158
    item5 0.414 0.050 0.0096
    item6 0.029 0.095 0.2311

It appears as if everyone likes item1, but only person1 likes item5.
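To make the flattening concrete, here is a small sketch that rebuilds the utility table above, flattens it the same way `as.double()` does (column by column, so person 1's six items come first), and evaluates $\mathbf{c}^T\mathbf{x}$ for a hand-picked assignment. The names `util`, `assign_mat`, and `total_utility` are mine, chosen for illustration, and the assignment is arbitrary rather than optimal.

```r
# Utilities from the table above: rows are items, columns are people
util <- matrix(
  c(0.467, 0.221, 0.2151,
    0.030, 0.252, 0.4979,
    0.019, 0.033, 0.0304,
    0.043, 0.348, 0.0158,
    0.414, 0.050, 0.0096,
    0.029, 0.095, 0.2311),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("item", 1:6), paste0("pers", 1:3))
)

# Column-major flattening: person 1's six items, then person 2's, then person 3's
obj <- as.double(util)

# A hand-picked (not optimal) assignment:
# item5 -> pers1, item4 -> pers2, item2 -> pers3
assign_mat <- matrix(0, nrow = 6, ncol = 3, dimnames = dimnames(util))
assign_mat["item5", "pers1"] <- 1
assign_mat["item4", "pers2"] <- 1
assign_mat["item2", "pers3"] <- 1
x <- as.double(assign_mat)

# The objective c^T x: total utility of this assignment
total_utility <- sum(obj * x)
```

Here `total_utility` is just the sum of the three chosen utilities, 0.414 + 0.348 + 0.4979. The solver's job is to search over all valid 0/1 vectors `x` for the one that makes this sum largest.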

Additionally, I need to define some constraints. For starters, it makes no sense to assign an item to more than one person. So, for each row of that matrix, the sum of the variables (not the utilities) must be 1, or maybe 0 (if that item is not assigned). I’ll create a constraint matrix, where each row contains 18 columns, and the pattern of 0’s and 1’s defines a row of the assignment matrix. Since there are 6 items, there are 6 rows (for now). Each row needs to be less than or equal to one (I’ll tell the solver to use integers only later), so I also define vectors of inequality symbols and right-hand-sides.

    # for each item/row, enforce that the sum of indicators for its assignment are <= 1
    mat <- laply(1:num.items, function(ii) { x <- mat.0; x[ii, ] <- 1; as.double(x) })
    dir <- rep('<=', num.items)
    rhs <- rep(1, num.items)

To add the loin/chops constraint, I need to add another row, specifying that the sum of the indicators for both rows now must be 1 or less as well.

    # for rows 1 and 2, enforce that the sum of indicators for their assignments are <= 1
    mat <- rbind(mat, matrix(matrix(c(1, 1, rep(0, num.items-2)), nrow=num.items, ncol=num.pers), nrow=1))
    dir <- c(dir, '<=')
    rhs <- c(rhs, 1)

Here’s what those matrices and vectors look like:

    > mat
         1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
    [1,] 1 0 0 0 0 0 1 0 0  0  0  0  1  0  0  0  0  0
    [2,] 0 1 0 0 0 0 0 1 0  0  0  0  0  1  0  0  0  0
    [3,] 0 0 1 0 0 0 0 0 1  0  0  0  0  0  1  0  0  0
    [4,] 0 0 0 1 0 0 0 0 0  1  0  0  0  0  0  1  0  0
    [5,] 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0  0  1  0
    [6,] 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0  0  1
    [7,] 1 1 0 0 0 0 1 1 0  0  0  0  1  1  0  0  0  0
    > dir
    [1] "<=" "<=" "<=" "<=" "<=" "<=" "<="
    > rhs
    [1] 1 1 1 1 1 1 1
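For readers who want to experiment without plyr, the same seven constraint rows can be built in base R. This is a sketch under my own naming (`item_row`, `pair`, and `is_feasible` are illustrative helpers, not part of the repository code), and it adds a simple check of whether a candidate 0/1 vector satisfies every row.

```r
num.items <- 6
num.pers  <- 3

# Row i marks the three variables (one per person) that belong to item i
item_row <- function(i) {
  m <- matrix(0, nrow = num.items, ncol = num.pers)
  m[i, ] <- 1
  as.double(m)   # column-major, matching the flattened utility vector
}
mat <- t(sapply(seq_len(num.items), item_row))

# Extra row: items 1 and 2 together may be assigned at most once
pair <- matrix(0, nrow = num.items, ncol = num.pers)
pair[1:2, ] <- 1
mat <- rbind(mat, as.double(pair))

dir <- rep("<=", nrow(mat))
rhs <- rep(1, nrow(mat))

# Feasibility check for a candidate 0/1 vector x of length 18:
# every constraint row must sum to at most its right-hand side
is_feasible <- function(x) all(mat %*% x <= rhs)
```

For example, a vector that gives item 1 to person 1 and item 2 to person 3 passes the six per-item rows but fails the seventh, so `is_feasible()` returns `FALSE` for it.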

Finally, specify that the variables must be binary (0 or 1), and call SYMPHONY to solve the problem:

    # this is an IP problem, for now
    types <- rep('B', num.items * num.pers)
    max <- TRUE # maximizing utility

    soln <- Rsymphony_solve_LP(obj, mat, dir, rhs, types=types, max=max)

And, with a bit of post-processing to recover matrices from vectors, here’s the result:

    $solution
     [1] 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1

    $objval
    [1] 1.52

    $status
    TM_OPTIMAL_SOLUTION_FOUND
                            0

    Person #1 got Items 5 worth 0.41
    Person #2 got Items 3, 4 worth 0.38
    Person #3 got Items 2, 6 worth 0.73

So that’s great. It found an optimal solution worth more than 50% more than the expected value of a random assignment. But there’s a problem. There’s no guarantee that everyone gets something, and in this case, person #3 gets almost twice as much utility as person #2. Unfair! We need to enforce an additional constraint, that the difference between the maximum utility that any one person gets and the minimum utility that any one person gets is not too high. This is sometimes called a parity constraint.

Adding parity constraints is a little tricky, but the basic idea here is to add two more variables to the 18 I’ve already defined. These variables are positive real numbers, and they are forced by constraints to be the maximum and minimum total utilities per person. In the objective function, then, they are weighted so that their difference is not too big. So, that expression becomes: $\mathbf{c}^T\mathbf{x} - \lambda x_{19} + \lambda x_{20}$. The first variable (the maximum utility of any person) is minimized, while the second variable (the minimum utility of any person) is maximized. The $\lambda$ free parameter defines how much to trade off parity with total utility, and I’ll set it to 1 for now.

For the existing rows of the constraint matrix, these new variables get 0’s. But two more rows need to be added, per person, to force the upper variable to be at least as large as every person’s assigned utility and the lower variable to be no larger than any person’s assigned utility. Combined with the objective, that makes them equal to the maximum and minimum of any person’s assigned utility.

    # now for those upper and lower variables
    # \forall p, \sum_i u_i x_{i,p} - d.upper \le 0
    # \forall p, \sum_i u_i x_{i,p} - d.lower \ge 0
    # so, two more rows per person
    d.constraint <- function(iperson, ul) { # ul = 1 for upper, 0 for lower
      x <- mat.utility.0
      x[, iperson ] <- 1
      x <- x * obj.utility
      c(as.double(x), (if (ul) c(-1,0) else c(0,-1)))
    }
    mat <- rbind(mat, maply(expand.grid(iperson=1:num.pers, ul=c(1,0)), d.constraint, .expand=FALSE))
    dir <- c(dir, c(rep('<=', num.pers), rep('>=', num.pers)))
    rhs <- c(rhs, rep(0, num.pers*2))

The constraint inequalities then become as follows:

    > print(mat, digits=2)
       1     2     3     4    5     6    7    8     9   10   11    12   13  14    15    16     17   18 19 20
    1.00 0.000 0.000 0.000 0.00 0.000 1.00 0.00 0.000 0.00 0.00 0.000 1.00 0.0 0.000 0.000 0.0000 0.00  0  0
    0.00 1.000 0.000 0.000 0.00 0.000 0.00 1.00 0.000 0.00 0.00 0.000 0.00 1.0 0.000 0.000 0.0000 0.00  0  0
    0.00 0.000 1.000 0.000 0.00 0.000 0.00 0.00 1.000 0.00 0.00 0.000 0.00 0.0 1.000 0.000 0.0000 0.00  0  0
    0.00 0.000 0.000 1.000 0.00 0.000 0.00 0.00 0.000 1.00 0.00 0.000 0.00 0.0 0.000 1.000 0.0000 0.00  0  0
    0.00 0.000 0.000 0.000 1.00 0.000 0.00 0.00 0.000 0.00 1.00 0.000 0.00 0.0 0.000 0.000 1.0000 0.00  0  0
    0.00 0.000 0.000 0.000 0.00 1.000 0.00 0.00 0.000 0.00 0.00 1.000 0.00 0.0 0.000 0.000 0.0000 1.00  0  0
    1.00 1.000 0.000 0.000 0.00 0.000 1.00 1.00 0.000 0.00 0.00 0.000 1.00 1.0 0.000 0.000 0.0000 0.00  0  0
    0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00 -1  0
    0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00 -1  0
    0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23 -1  0
    0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00  0 -1
    0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00  0 -1
    0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23  0 -1
    > dir
     [1] "<=" "<=" "<=" "<=" "<=" "<=" "<=" "<=" "<=" "<=" ">=" ">=" ">="
    > rhs
     [1] 1 1 1 1 1 1 1 0 0 0 0 0 0

Looking at just the last row, this constraint says that the sum of the utilities of any assigned items for person #3, minus the lower limit, must be at least 0. That is essentially the definition of the lower limit: the same constraint holds for all three people in this problem. Similar logic applies for the upper limit.

Running the solver with this set of inputs gives the following:

    $solution
     [1] 0.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000
    [18] 0.000 0.498 0.433

    $objval
    [1] 1.31

    $status
    TM_OPTIMAL_SOLUTION_FOUND
                            0

    Person #1 got Items 3, 5 worth 0.43
    Person #2 got Items 4, 6 worth 0.44
    Person #3 got Items 2 worth 0.50

The last two numbers in the solution are the values of the upper and lower bounds. Note that the objective value is only 41% higher than a random assignment, but the utilities assigned to each person are much closer. Dropping the $\lambda$ value to something closer to 0 causes the weights of the parity bounds to be less important, and the solution tends to be closer to the initial result.
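Because the toy problem is so small, both results can be cross-checked without a solver. The sketch below is not how Symphony works; it is a brute-force enumeration I am adding as a sanity check, with illustrative names (`owners`, `person_utility`, `plain_score`, `parity_score`). Each item goes to person 1, 2, 3, or nobody, which gives only $4^6 = 4096$ candidates, and the parity score uses the fact that at the optimum the two extra variables equal the largest and smallest per-person utility.

```r
# Utilities from the post: rows are items, columns are people
util <- matrix(
  c(0.467, 0.221, 0.2151,
    0.030, 0.252, 0.4979,
    0.019, 0.033, 0.0304,
    0.043, 0.348, 0.0158,
    0.414, 0.050, 0.0096,
    0.029, 0.095, 0.2311),
  nrow = 6, byrow = TRUE
)

# Each item is owned by person 1, 2, 3, or nobody (0): 4^6 = 4096 candidates
owners <- as.matrix(expand.grid(rep(list(0:3), 6)))

# Loin/chops rule: items 1 and 2 cannot both be assigned
owners <- owners[!(owners[, 1] > 0 & owners[, 2] > 0), , drop = FALSE]

# Total utility received by each of the three people under one candidate
person_utility <- function(owner) {
  sapply(1:3, function(p) sum(util[owner == p, p]))
}
pu <- t(apply(owners, 1, person_utility))   # one row per candidate

lambda <- 1
plain_score  <- rowSums(pu)
parity_score <- rowSums(pu) - lambda * (apply(pu, 1, max) - apply(pu, 1, min))

# Owner of each item (0 = unassigned) under each objective
best_plain  <- owners[which.max(plain_score), ]
best_parity <- owners[which.max(parity_score), ]
```

With the rounded utilities shown in the post, `best_plain` and `best_parity` pick out the same item-to-person assignments that the solver reported. Enumeration like this stops being practical very quickly as items and people are added, which is why a real mixed-integer solver is the right tool for anything beyond a toy.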

Scaling this up to include constraints in pricing, farm preferences, price vs. preference meta-preferences, etc., is not conceptually difficult, but would just entail careful programming. It is left as an exercise for the well-motivated reader!

If you’ve made it this far, I’d definitely appreciate any feedback about this idea, corrections to my formulation or code or terminology, etc!

(Thanks to Paul Ruben and others on OR-Exchange, who helped me figure out how to think about the parity problem, and to the authors of WP-codebox and WP LaTeX for giving me tools to put nice scrollable R code and math in this post!)

# making meat shares more efficient

A personal interest I have is the ethical and sustainable production of food. I’ve been a member of and helped run Community Supported Agriculture groups, and my wife and I currently purchase the majority of our meat from a group of upstate NY pastured-livestock producers who sell their products through CSAs. It’s an à la carte business model, where I place an order on a website, and the next week I pick up the frozen products cut and packaged as if for retail.

A related way to get meat has become fairly popular recently — the meat CSA or meat share. As the NYC Meatshare group describes it, “Looking for healthy meat raised on pasture by small local farms? It’s expensive, but by banding together to buy whole animals we can support farmers and save money.” Members of a meatshare all pitch in to buy a whole animal, which is then butchered and split among the members. Here’s how a meatshare event described the 10th of a hog each member got: “Each person will get an equal amount of bacon and sausage (about 2 lbs each), chops (center & butt), and will divide the other cuts up as equally as possible (including ham steak, loin, organs, etc.)  If you have preferences please let me know, I will do my best to accommodate.  Or, you can swap with other members at my place.”

These two business models put a substantial burden on either the farmer (in the first case) or the consumer (in the second case). The retail model requires the farmer, or a collective of farmers, to put together a retail-ordering web site, a butchery and inventory system, and a delivery and distribution system. The meat share model takes these burdens off the farmers, but requires the consumers to set up and organize the purchase and payment system, meet at a common location, and either take what is available or perform ad hoc swaps. In a more traditional producer-consumer relationship, the supply chain, payment, inventory, and preferences-matching process is taken care of by the commodification of the animals (all cows are the same) and the services provided by a retail grocery store.

One could argue that there’s a third option, Whole Foods, but it sorta defeats the purpose of non-commodified, high-quality meat, and it tends to defeat the pocketbook too. No connection with the farm, just a promise of ethical standards (probably including the pointless “organic” label), and a substantial cut by middlemen. Not really an option at all.

So what else could be done to build sustainable relationships between animal producers and people who value high-quality, ethically produced meat? Why not leverage technology? And not just selling via web sites, but the kind of logistics technology that allows Whole Foods (and UPS, and Walmart) to efficiently get huge varieties of goods from place to place? A group at a recent food-tech hack-a-thon had the start of this idea. They put together a quick demo of a front-end web site (“groupme.at”, clever!) that would allow consumers to choose smaller sets of cuts in such a way that the orders neatly add up to a whole animal. By setting up a platform that can be easily connected with many small producers all over the country, the problem of every producer needing to be a webmaster is eliminated. And the system to get all of the pieces to add up to whole animals reduces risk for the farmer. It’s a great start. But by leveraging additional open-source tools and some ingenuity, I think it should be possible to do even more.

Imagine a similar web site, but instead of choosing a pre-selected package of cuts, you indicate your preferences and price range. As animals become available, you get an emailed notification of a delivery with a set of products that closely match the preferences you specified. You might love pork belly and boneless loin. Your neighbor might love cured pork belly (bacon) and chops. You might hate liver, but you’d accept some pig ears every once in a while for your dog. And your neighbor might really like the fatback to render for lard, while you’d find that useless. Everyone who might be sharing in an animal indicates their preferences, and the web site would automatically give everyone as much as possible of what they like the most. Equally important, all of the parts add up to whole animals, so the farmer is not stuck with the risk of unsold inventory.

Now imagine that after a few months, you’ve rated the cuts of meat from Alice’s Farm 5 stars, but the ones from Bob’s Ranch only 3 stars. And you’ve told the system that you’re willing to pay more to get more of what you really want, but your neighbor tells the web site that he’s willing to make trade-offs to spend less money. You’ve essentially added other constraints that, if balanced well, will make everyone as happy as possible. Also, notice that I mentioned both boneless loin and pork chops? They’re more-or-less the same part of the animal cut different ways, so you can’t sell them both off the same half of the same animal. Now you have exclusive constraints to add into the mix. Maybe everyone’s better off if you get the boneless loin, or maybe everyone’s better off if your neighbor gets the chops. It’s easy to imagine collecting all of this information, but how do you combine it all and optimize the outcome in a utilitarian way?

Why, operations research and computational optimization! Write some software that plugs everyone’s constraints into a set of equations, push a button, let the computer think for a second or two, and wham, you get a solution that balances the constraints as fairly as possible! Send the cut list to the slaughterhouse and email the product lists and bills to the customers, and you’re basically done.
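To make the idea concrete, here is a toy sketch in Python of what "plugging everyone's constraints into a set of equations" might look like for one half hog and two members. Everything in it is an assumption made up for illustration: the preference scores, the prices, the budgets, and the `best_allocation` helper. It also uses a plain brute-force search rather than a real optimization library, which only works because the example is tiny.

```python
from itertools import product

# All names and numbers are hypothetical, invented for this illustration.
# Preference scores: 0 = unwanted, 5 = favorite.
PREFERENCES = {
    "you":      {"belly": 5, "boneless_loin": 4, "chops": 2, "fatback": 0, "ears": 1},
    "neighbor": {"belly": 4, "boneless_loin": 1, "chops": 5, "fatback": 4, "ears": 0},
}
PRICES = {"belly": 30, "boneless_loin": 40, "chops": 35, "fatback": 5, "ears": 3}

# Cuts that always come off the half hog.
FIXED_CUTS = ["belly", "fatback", "ears"]
# Exclusive constraint: the loin is cut one way or the other, never both.
LOIN_STYLES = ["boneless_loin", "chops"]


def best_allocation(preferences, prices, budgets):
    """Try every way to cut and split the half hog; keep the happiest one.

    Every cut must go to exactly one member (the whole animal is sold),
    and no member may be billed more than their budget.
    Returns (total_score, {cut: member}), or (None, None) if nothing fits.
    """
    members = sorted(preferences)
    best_score, best_plan = None, None
    for loin_style in LOIN_STYLES:
        cuts = FIXED_CUTS + [loin_style]
        # Each cut is assigned to one member: len(members) ** len(cuts) options.
        for owners in product(members, repeat=len(cuts)):
            spend = {m: 0 for m in members}
            for cut, owner in zip(cuts, owners):
                spend[owner] += prices[cut]
            if any(spend[m] > budgets[m] for m in members):
                continue  # someone would be over budget
            score = sum(preferences[o][c] for c, o in zip(cuts, owners))
            if best_score is None or score > best_score:
                best_score, best_plan = score, dict(zip(cuts, owners))
    return best_score, best_plan


if __name__ == "__main__":
    score, plan = best_allocation(PREFERENCES, PRICES, {"you": 80, "neighbor": 45})
    print(score, plan)
```

With these made-up numbers, the loin ends up as chops for the neighbor when his budget is 45, but tightening his budget to 38 flips the answer to a boneless loin for you. The number of combinations grows exponentially with the number of cuts, so anything beyond a toy like this needs a proper integer-programming solver rather than exhaustive search.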

In the past, this sort of supply-chain optimization required massive computing power and complex software design. But now, there are open-source code bases for solving this sort of problem, at least at the scale needed to balance the preferences of a few farmers and a few dozen or hundred customers at a time.

This is the next step in leveraging technology to make at least some aspects of the supply chain for small-scale meat operations as efficient as what Perdue does, while maintaining the high quality and personal connection to the farm that many people want now. All that’s needed are some enterprising hackers to write the code and set up a scalable, configurable web platform for preference-based meatshares.

In my next post, I’ll demonstrate how to write code that uses one of those open-source optimization libraries to solve a small version of this problem. If you’re interested in reading R code, stay tuned!