Patterns for Connecting Predictive Models to Software Products

This was originally published on Medium on June 21st, 2016.


You’re a data scientist, and you’ve got a predictive model — great work! Now what? In many cases, you need to hook it up to some sort of large, complex software product so that users can get access to the predictions. Think of LinkedIn’s People You May Know, which mines your professional graph for unconnected connections, or Hopper’s flight price predictions. Those started out as prototypes on someone’s laptop, and are now running at scale, with many millions of users.

Metaphor (source)

Even if you’re building an internal tool to make a business run better, if you didn’t build the whole app, you’ve got to get the scoring/prediction (as distinct from the fitting/estimation) part of the model connected to a system someone else wrote. In this blog post, I’m going to summarize two methods for doing this that I think are particularly good practices — database mediation and web services.
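To make the database mediation idea concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the `predictions` table and its columns, the in-memory SQLite database standing in for a shared production database, and the toy `score` function standing in for a fitted model. What it tries to show is the shape of the pattern: a batch scoring job writes predictions into a table, and the application reads them with an ordinary query, so neither side needs to know how the other is built.

```python
import sqlite3


def score(features):
    """Toy stand-in for a fitted model's prediction function (hypothetical)."""
    return round(0.1 * features["logins"] + 0.5 * features["purchases"], 3)


def write_predictions(conn, rows):
    """Batch scoring job: score each user and upsert into the shared table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "user_id INTEGER PRIMARY KEY, score REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
        [(user_id, score(features)) for user_id, features in rows],
    )
    conn.commit()


def read_prediction(conn, user_id):
    """Application side: a plain lookup, with no knowledge of the model."""
    row = conn.execute(
        "SELECT score FROM predictions WHERE user_id = ?", (user_id,)
    ).fetchone()
    return None if row is None else row[0]


# An in-memory database stands in for the shared one (assumption).
conn = sqlite3.connect(":memory:")
write_predictions(
    conn,
    [
        (1, {"logins": 10, "purchases": 2}),
        (2, {"logins": 0, "purchases": 1}),
    ],
)
```

A web service version would expose something like `read_prediction` behind an HTTP endpoint and compute the score on request. The trade-off is fresher predictions in exchange for a running service the application depends on.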

Continue reading

Simulating Rent Stabilization Policy at the National Day of Civic Hacking

This post was originally published on Medium on June 5, 2016.


Yesterday was the 2016 National Day of Civic Hacking, a Code for America event that encourages people with technology and related skills to explore projects related to civil society and government. My friend Josh Tauberer wrote a thoughtful post earlier about the event called Why We Hack, on what the value of this sort of event might be. Please read it.

For my part, this year I worked on one of the projects he discusses, understanding the impact of DC’s rent stabilization laws and what potential policy changes might yield. As Josh noted, we discovered that it’s a hard problem. Much of the most relevant data (such as the list of properties under rent stabilization and their current and historical rents) are not available, and have to be estimated. Getting to a realistic understanding of the impact of law and policy on rents seems incredibly valuable, but hard.

So I spun off from the main group, and worked on an easier and much less ambitious project, one that could potentially be useful after just an afternoon’s work. Instead of trying to understand the law’s effect on actual DC rents, I built a little tool to understand the law’s effect on a rather unrealistic set of simulated apartment buildings. Importantly, I did this fully aware that I’m not building with, I’m tinkering; my goal was to do something fun and interesting that might lead to something substantial and usable later, probably by someone else.

Continue reading

Thoughts on Managing Data Science Team Workstreams (and a Shiny app)

This is an updated version of a post originally published on Medium on Jan. 28, 2016. I may have more to say about this sort of thing in the near future.


There are different types of data scientists, with different backgrounds and career paths. With Sean Murphy and Marck Vaisman, I wrote an article about this for O’Reilly a few years back, based on survey research we’d done. Download a copy, if you haven’t read it. This idea is now pretty well established, but I want to talk about a related issue, which is that the type of work that Data Science teams do varies a lot, and that managing those types of work can be an interesting challenge.

As Josh Wills said, data scientists aren’t software developers, but they sometimes do that sort of work, and they aren’t statisticians, but they sometimes do that sort of work too. At EAB, where I lead a Data Science team of people with very diverse backgrounds and skill sets, this issue leads to a lot of complexity and experimentation as we (and the upper management I report to) try to ensure that everyone is working on the right tasks, at the right time, efficiently.

In this post, I’d like to share some thoughts about how we currently think about and manage different types of Data Science work. I also wrote a little Shiny web tool to help us manage our time, and I’ll show that off as well.

Continue reading

Building a Complementary Data Science Team

This is an updated version of an article first posted on Medium on Nov. 23, 2015. I’ve disabled the links to the jobs, as those specific ones are no longer available. If you’re interested in a role at EAB or the Advisory Board, please get in touch, though!


I’m the Director of Data Science at EAB, a firm that provides best-practices research and enterprise software for colleges and universities. My team is responsible for the predictive models and other advanced analytics that are part of the Student Success Collaborative product that’s used by academic advisors and other campus leadership. We’re hiring data scientists, and I wanted to publicly say a few things about the roles we have advertised. (Note that EAB is part of a public company and is in a competitive market, so there are obviously things I’m not saying!)

The most important point is that data scientists specialize, so look for the specializations. My co-authors and I made this point in our 2012 e-book Analyzing the Analyzers, and the folks at Mango Solutions are burning up Twitter with their self-service tool for identifying data science strengths and weaknesses.

Drew Conway’s Data Science Venn Diagram

A related point is that existing framing devices can help you balance a team. Drew Conway’s Venn Diagram remains a great way to think about Data Science aptitude. Combine people with strengths in each part of the diagram, who know enough to collaborate effectively and make each other stronger, and you don’t need a team of unicorns with 3 PhDs each.

I suspect the details of the framing device are less important than the fact that you have one. It forces you to think about variety and complementary skills, and how people work together to solve problems and build systems.


At EAB, we have four career tracks for data scientists — Research, Engineering, Statistical Programming, and Management. Our new roles supplement our existing team by adding several new people, each with different capabilities and seniority.

At a Senior level, we’re looking for a Statistical Programmer-track person who is particularly strong in algorithm development and implementation, perhaps a straight-up Computer Scientist. Think of the “Machine Learning” area in Drew’s diagram. As we look to expand the classes of statistical techniques that we use, we need more people who know the academic literature and can figure out exactly what technical solution will let us build and scale high-quality models. Interested? Please apply!

A little less senior, we’re also looking for a Researcher who can help us apply domain knowledge even more effectively in our analyses, models, and systems. Some software, data visualization, and statistical skills required — maybe a quantitative Social Scientist pivoted into industry? The upper edge of the Substantive Expertise area. Sound like you? Please apply!

I strongly believe that a Data Science team should do all of the Data Science, including building and owning models in production. So, last but not least, we’re looking for another Engineering-focused data scientist, who can help us build model frameworks, data tools, workflow tools, and more. This role can be junior or even entry-level, but we do need programming skills, statistical thinking skills, and some sort of portfolio. Programmer and recent data science boot camp grad, perhaps? Please apply!

Of course, as we talk to people, learn what they’re good at and excited about, and what they bring, we may end up with a different mix of skills. But regardless, they’ll cover the space of data scientists, will provide different perspectives and skills, and will help us own our own tools and systems so that we can move and learn quickly.

Smartwatches with Higher-Bandwidth Vibration Notifications

This is an updated version of an article first published on Medium on Oct. 24, 2015.


I love my smartwatch, way more than I thought I would when I bought it over a year ago. It’s a Moto 360, which is still better looking than the Apple Watch, I think.

Why do I love it? It’s not the health monitoring. I turned that junk off as soon as I got the thing. Do not care. It’s because it separates my phone and its alerts (and temptations) from my interactions with other people.

The killer feature for smartwatches is the vibration notifications. Phone in my bag or across the room? No problem, I can feel a phone call, or a text message, or a news alert, without anybody else knowing. I even downloaded an app that taps my wrist at the top of every hour, giving me the same sense of time as the dorky digital watch I had in 6th grade, but without annoying the people around me. The alerts are mostly different (apps can choose their duration and pattern), and with practice you can tell a few of them apart. It’s fantastic, and even better is the fact that I can see, dismiss, and briefly respond to alerts if necessary, by looking at my watch and interacting with it. Or, better yet, I can choose not to, but still know the type of alert to expect next time I’m not engaged in something more important.

But one key function isn’t perfect, and it highlights a limitation of having vibration notifications on your wrist. The broken UX is turn-by-turn navigation. The navigation app buzzes your wrist whenever you have to turn. But then the next action you have to perform is to look at your wrist. Maybe tolerable when walking, but inadvisable when driving, and particularly dangerous when biking. (Update: The Apple Watch gives you different patterns when you have to turn left versus right, which would be a useful, if limited, enhancement to Android Wear.)

What if you could feel what direction you have to turn next? What if you could just know when and where to turn, without having to listen to your phone, or learn complex tap patterns on your wrist?

Android Wear should support a secondary bracelet for your other wrist. No screen, just Bluetooth and a buzzer. Now, you get twice as much bandwidth when apps want to communicate with you. Apps can buzz both wrists at once, or one then the other, or any other pattern. And even better, spatial apps such as navigation can guide you in the right direction, right from the start. Left buzz? Turn left now. Right buzz? Turn right. Both together? Maybe you’ve arrived!
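As a rough sketch of what that could look like to an app developer, here is a small Python model of two-wrist vibration patterns. None of this is a real Android Wear API: the `Wrist` enum, the `pattern_for` function, the instruction names, and the 200 ms buzz length are all hypothetical, chosen only to illustrate the mapping from a navigation instruction to a spatial buzz.

```python
from enum import Enum


class Wrist(Enum):
    LEFT = "left"
    RIGHT = "right"


SHORT_MS = 200  # hypothetical buzz length in milliseconds


def pattern_for(instruction):
    """Map a navigation instruction to a list of (wrists, duration_ms) steps.

    Each step names the set of wrists that buzz together; steps play in order.
    """
    if instruction == "turn_left":
        return [({Wrist.LEFT}, SHORT_MS)]
    if instruction == "turn_right":
        return [({Wrist.RIGHT}, SHORT_MS)]
    if instruction == "arrived":
        # Both wrists at once signals arrival.
        return [({Wrist.LEFT, Wrist.RIGHT}, SHORT_MS)]
    raise ValueError(f"unknown instruction: {instruction!r}")
```

Representing each step as a set of wrists keeps "both at once" and "one then the other" in the same structure, which is the extra bandwidth the second device buys you.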

What would it take to make this happen? Well, Google would have to make changes, I suspect. The current Vibrator API for Android Wear appears to make the strong assumption that there’s a single vibrating device. Android Wear would have to specifically support multiple, coordinated worn devices with independent vibration support, and probably would have to make additional changes to support Wear devices without a screen.

Once Wear supported these devices, though, they’d presumably be easy to manufacture, and we’d see metal smartbracelets, hipster smartbracelets made out of braided leather, and who knows, maybe smartanklets too! Speaking of anklets, a predecessor to this idea is the vibrating ankle compass, whose wearers always know which way is North. Apparently it was transformative to wearers. (Update: There have been some efforts to sell a Smart Shoe along these lines.)

What other devices could increase your communications bandwidth? Google Glass was a failure, and having Bluetooth things talk in your ear is now entirely passé. People don’t want audio or visual connections to the internet all the time, it turns out. (And even more, people don’t want you to have audio or visual connections to the internet all the time!) But the very-low-bandwidth notifications from vibrating devices may give people just enough connectivity to know what they need to know, without interrupting their interactions with the real world. Google? (And Apple, I suppose…) Get started!

Parameterizable Reproducible Research

The below is a public version of a post originally posted on an internal blog at the Education Advisory Board (EAB), my current employer. We don’t yet have a public tech blog, but I got permission to edit and post it here, along with the referenced code. 

Data Science teams get asked to do a lot of different sorts of things. Some of what the team that I’m part of builds is enterprise-scale predictive analytics, such as the Student Risk Model that’s part of the Student Success Collaborative. That’s basically software development with a statistical twist and machine-learning core. Sometimes we get asked to do quick-and-dirty, one-off sorts of things, to answer a research question. We have a variety of tools and processes for that task. But there’s a third category that I want to focus on – frequently requested but slightly-different reports.


There’s a relatively new theme in the scientific research community called reproducible research. Briefly, the idea is that it should be possible to re-do all steps after data collection automatically, including data cleaning and reformatting, statistical analyses, and even the actual generation of a camera-ready report with charts, graphs, and tables. This means that if you realized that, say, one data point in your analysis was bogus and needed to be removed, you could remove that data point, press a button, and in a minute or two have a shiny new PDF with all of the results automatically updated.

This type of reproducible research has been around for a while, although it’s having a recent resurgence in part due to the so-called “statistical crisis”. The R (and S) statistical programming languages have supported LaTeX, the scientific document creation/typesetting tool, for many years. Using a tool called Sweave, a researcher “weaves” chunks of text and chunks of R code together. The document is then “executed”, where the R code chunks are executed and the results are converted into a single LaTeX document, which is then compiled into a PDF or similar. The code can generate charts and tables, so no manual effort is needed to rebuild a camera-ready document.

This is great, a huge step forward towards validation of often tricky and complex statistical analyses. If you’re writing a conference paper on, say, a biomedical experiment, a reproducible process can drastically improve your ability to be confident in your work. But data scientists often have to generate this sort of thing repeatedly, from different sources of data or with different parameters. And they have to do so efficiently.

Parameterizable reproducible research, then, is a variant of reproducible research tools and workflows where it is easy to specify data sources, options, and parameters to a standardized analytical report, even one that includes statistical or predictive analyses, data manipulation, and graph generation. The report can be emailed or otherwise sent to people, and doesn’t seem as public as, say, a web-based app developed in Shiny or another technology. This isn’t a huge breakthrough or anything, but it’s a useful pattern that seems worth sharing.
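Here is a minimal sketch of the pattern, swapping in Python's standard library for the R-based tooling described above. The column names (`school`, `gpa`, `credits`), the sample data, the report wording, and the `render_report` function are all made up for illustration. The point is only that the data source and the parameters are arguments, and the whole analysis reruns from scratch for each combination.

```python
import csv
import io
import statistics
from string import Template

# The report text lives in one template; only the parameters and results vary.
REPORT_TEMPLATE = Template(
    "Report for $school\n"
    "Students included (GPA >= $min_gpa): $n\n"
    "Mean credits: $mean_credits\n"
)


def render_report(csv_text, school, min_gpa):
    """Rerun the whole analysis for one set of parameters; return the report text."""
    rows = [
        row
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["school"] == school and float(row["gpa"]) >= min_gpa
    ]
    credits = [float(row["credits"]) for row in rows]
    mean_credits = f"{statistics.mean(credits):.1f}" if credits else "n/a"
    return REPORT_TEMPLATE.substitute(
        school=school, min_gpa=min_gpa, n=len(rows), mean_credits=mean_credits
    )


# Hypothetical sample data standing in for a real data source.
SAMPLE = """school,gpa,credits
A,3.5,30
A,2.0,12
A,3.0,24
B,3.9,45
"""

report_a = render_report(SAMPLE, school="A", min_gpa=3.0)
report_b = render_report(SAMPLE, school="B", min_gpa=3.0)
```

Generating the same report for a different school or threshold is then one more function call, which is what makes the frequently requested, slightly different report cheap to produce.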

Continue reading

Inauthenticity

Let me unpack that a bit…

Hugh and Crye t-shirt

Hugh & Crye, a DC-based clothing firm for men with a novel take on sizing, recently ran a Kickstarter campaign for their new line of fitted t-shirts. What the hell? H&C has been around for about 5 years, and based on their product growth and hiring seems to be doing quite well. I like their stuff. Why do they need a Kickstarter? The original goal of Kickstarter was to “kickstart” new products by providing crowdsourced seed funding so that you (you!) can ensure that a great idea gets off the ground. And if a project doesn’t make its goals, no harm done, and no money wasted. A fantastic example is the Oculus Rift, which was a Kickstarted Virtual Reality rig, and is now a subsidiary of Facebook. Kickstarting a project is a rather labor-intensive alternative to trying to get a bank loan, or maxing out your credit cards, but with much less risk. It’s a very community-driven, authentic way of getting support for a new venture, moving it from the prototype phase to the initial manufacturing round.

Continue reading

INFORMS Business Analytics 2014 Blog Posts

Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research, in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference’s WordPress web site, and was also obliged to do so!

Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces:

Operations Research, from the point of view of Data Science

  • more insight, less action — deliverables tend towards predictions and storytelling, versus formal optimization
  • more openness, less big iron — open source software leads to a low-cost, highly flexible approach
  • more scruffy, less neat — data science technologies often come from black-box statistical models, vs. domain-based theory
  • more velocity, smaller projects — a hundred $10K projects beats one $1M project
  • more science, less engineering — both practitioners and methods have different backgrounds
  • more hipsters, less suits — stronger connections to the tech industry than to the boardroom
  • more rockstars, less teams — one person can now (roughly) do everything, in simple cases, for better or worse

What is a “Data Product”?

DJ Patil says “a data product is a product that facilitates an end goal through the use of data.” So, it’s not just an analysis, or a recommendation to executives, or an insight that leads to an improvement to a business process. It’s a visible component of a system. LinkedIn’s People You May Know is viewed by many millions of customers, and it’s based on the complex interactions of the customers themselves.

Healthcare (and not Education) at INFORMS Analytics

[A]s a DC resident, we often hear of “Healthcare and Education” as a linked pair of industries. Both are systems focused on social good, with intertwined government, nonprofit, and for-profit entities, highly distributed management, and (reportedly) huge opportunities for improvement. Aside from MIT Leaders for Global Operations winning the Smith Prize (and a number of shoutouts to academic partners and mentors), there was not a peep from the education sector at tonight’s awards ceremony. Is education, and particularly K-12 and postsecondary education, not amenable to OR techniques or solutions?

What’s Changed at the Practice/Analytics Conference?

In 2011, almost every talk seemed to me to be from a Fortune 500 company, or a large nonprofit, or a consulting firm advising a Fortune 500 company or a large nonprofit. Entrepreneurship around analytics was barely to be seen. This year, there are at least a few talks about Hadoop and iPhone apps and more. Has the cost of deploying advanced analytics substantially dropped?

Why OR/Analytics People Need to Know About Database Technology

It’s worthwhile learning a bit about databases, even if you have no decision-making authority in your organization, and don’t feel like becoming a database administrator (good call). But by getting involved early in the data-collection process, when IT folks are sitting around a table arguing about platform questions, you can get a word in occasionally about the things that matter for analytics — collecting all the data, storing it in a way friendly to later analytics, and so forth.

All in all, I enjoyed blogging the conference, and recommend the practice to others! It’s a great way to organize your thoughts and to summarize and synthesize your experiences.

Why a Data Community is Like a Music Scene — Resources

On Monday, October 28th, 2013, I gave a 5-minute Ignite talk entitled “Why a Data Community is Like a Music Scene” at an event associated with the Strata conference. Here’s the video:

And here are the acknowledgements and references for the talk:

Data Community DC

How Music Works, by David Byrne

my slides for the Ignite talk

my blog post (written first)


Bikeshare hills, incentives, and rewards

A topographic map of Washington in 1791 by Don Alexander Hawkins. I live on the top edge of the map, on one of those hills.

I’m a generally happy user of DC’s Capital Bikeshare system; in fact, I just renewed my annual membership today. But I don’t use it as much as I’d like to, for one critical reason. I live on top of a hill. Riders are happy to take bikes from the neighborhood to their jobs downhill, but are much less likely to ride them uphill. As a result, the bike racks in my neighborhood are frequently completely empty by 8:00 or 8:30am, despite the many stores and businesses in the area. The only days I can reliably take a bike into work are when I leave at 7:15 for 8:00 meetings, which is thankfully not too often. On several occasions I have looked at the handy real-time map of bikeshare bikes, only to observe that there are no bikes available within a 15-minute walk of my home!

What should Bikeshare do to solve this problem? Well, they already do one thing, which is that they hire people to put bicycles in the back of a big van, then drive them up the hill to rebalance the system. This works, but it’s expensive for the system, and it’s not very timely or efficient. In other transportation problems, incentives are used to balance demand. For instance, airlines and Amtrak use pricing to incentivize people who are flexible in their schedule to take off-peak trips. But that won’t work for Bikeshare, as most rides are free. (I pay $75/year, but all trips of 30 minutes or less are free. My rides are mostly 15-25 minutes long.) So people happily ride downhill to their downtown jobs in the mornings, but don’t ride uphill to their reverse-commute jobs, and don’t ride home uphill in the evening as often either. The end result is unhappy customers and excessive costs for the Bikeshare system.


Rough map of a possible incentive line for North-Central DC.

So if you can’t give people the usual financial incentives to drop off bikes in the Columbia Heights rack at 8am, what can you do to reduce the need for rebalancing and provide reasons for people to want to help solve Bikeshare’s problem? I think the answer is swag. Imagine that there were lines on the Bikeshare map. Every time you crossed the line going in an uphill direction (reducing the need for rebalancing), you’d earn some points. If you earned enough points, you could redeem them for Bikeshare-branded, limited-edition swag. Imagine a t-shirt in official CaBi colors that said “I bike up hills”, available only through this point system. Who wouldn’t want that?

It’s easy for Bikeshare to figure this out, as they know exactly where you picked up each bike, and where you dropped it off. Determining whether you crossed a line, and thus biked uphill, is easy. And in addition to making people excited about biking up hills, you get them wearing branded items of clothing, which can only help market the system more broadly. They already sell swag through a Cafepress shop, so much of the infrastructure is in place. It’s a win-win.
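To show how simple the line-crossing check could be, here is a small Python sketch. It makes several assumptions that are not from Bikeshare: the incentive line is a single straight segment, station coordinates are treated as flat (x, y) points (a reasonable approximation at city scale), the trip is treated as a straight path from pickup to dropoff since only the endpoints are known, "uphill" is the left-hand side of the line as drawn from `line_start` to `line_end`, and the point value is invented.

```python
POINTS_PER_CROSSING = 10  # hypothetical reward value


def cross(o, a, b):
    """2D cross product of o->a and o->b; positive if b lies to the left of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def crossed_uphill(line_start, line_end, pickup, dropoff):
    """True if the straight path pickup->dropoff crosses the incentive line
    from its downhill (right) side to its uphill (left) side."""
    start_side = cross(line_start, line_end, pickup)
    end_side = cross(line_start, line_end, dropoff)
    if not (start_side < 0 < end_side):
        return False  # did not go from the downhill side to the uphill side
    # The stations straddle the infinite line; make sure the path actually
    # passes through the segment rather than around one of its ends.
    a = cross(pickup, dropoff, line_start)
    b = cross(pickup, dropoff, line_end)
    return a * b <= 0


def points_for_trips(line_start, line_end, trips):
    """Total points for a list of (pickup, dropoff) coordinate pairs."""
    return sum(
        POINTS_PER_CROSSING
        for pickup, dropoff in trips
        if crossed_uphill(line_start, line_end, pickup, dropoff)
    )
```

For a line drawn west to east, say from `(0, 0)` to `(10, 0)`, north is the uphill side, so a trip from `(5, -1)` to `(5, 1)` earns points while the reverse trip does not. A real incentive line would more likely be a polyline, which could be handled by running the same check against each of its segments.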

Bikeshare people, if you read this and think it’s a good idea, please run with it!


What the swag might look like.