This is an updated version of a post originally published on Medium on Jan. 28, 2016. I may have more to say about this sort of thing in the near future.
There are different types of data scientists, with different backgrounds and career paths. With Sean Murphy and Marck Vaisman, I wrote an article about this for O’Reilly a few years back, based on survey research we’d done. Download a copy, if you haven’t read it. This idea is now pretty well established, but I want to talk about a related issue, which is that the type of work that Data Science teams do varies a lot, and that managing those types of work can be an interesting challenge.
As Josh Wills said, data scientists aren’t software developers, but they sometimes do that sort of work, and they aren’t statisticians, but they sometimes do that sort of work too. At EAB, where I lead a Data Science team of people with very diverse backgrounds and skill sets, this issue leads to a lot of complexity and experimentation as we (and the upper management I report to) try to ensure that everyone is working on the right tasks, at the right time, efficiently.
In this post, I’d like to share some thoughts about how we currently think about and manage different types of Data Science work. I also wrote a little Shiny web tool to help us manage our time, and I’ll show that off as well.
A recent innovation on our team is partial adoption of Agile/Scrum development processes. We now have daily standup meetings, two-week sprints with kickoff and retrospective meetings, and time estimation for (some of) our tasks. In general, I’m not a fan of process for its own sake, but as the team has grown, it’s become more important to have visibility into projects, to ensure that they’re on track and not stuck or moving off on a (fascinating) tangent. So far, the two most valuable pieces of Scrum have been the processes that push us to define projects and tasks more clearly, and the retrospective meeting, where we review the most recent sprint and try to improve our own processes.
Scrum has limitations for us, though. For one thing, most of our projects are fairly specialized, and the standard practice of having the team as a whole estimate task difficulty is not viable when backgrounds differ so much. So we have the people who are working on a project or task estimate for themselves.
The bigger issue is the variety of workstreams. We have four main categories of work:
- Development – writing code for applications, either analytic services that are part of our company’s product, or internal web applications used by us or others. All of us do this work, but some data scientists specialize in it.
- Research – trying to understand something better. We divide this into three subcategories: insights research is about understanding our customers’ domain and data better; algorithm research is about matching data science approaches to business needs; technology research is about improving our technology stack.
- Service – repeatable operations tasks, usually customer specific, such as model-fitting or ad hoc analysis.
- Team Development – interviewing, training, teaching, writing blog posts, volunteer work, etc.
Our Development work is not that different from that of our colleagues who sling Java all day (except that we don’t have to sling Java, thankfully). We use relatively standard Scrum processes for these tasks, including breaking down Epics into smaller Tasks, estimating those Tasks, and having Task kick-off meetings to ensure requirements are clear. And of course we use modern software development tools and practices such as git, code reviews, and continuous testing and integration.
Our Research work is a hybrid. For a six-week project, we might have a high-level project plan that keeps the work focused on specific questions to be answered, as well as a series of technical tasks that are managed mostly like Development work. If you’re trying to figure out what algorithm to use for something, you probably have tasks that look like “read papers” as well as tasks that look like “build data pipeline”. In the former case we time-box the work, to ensure that it’s targeted and not encyclopedic; in the latter case, we usually have enough information to treat the task as Development work once it’s ready to start.
Our Service work is totally different. It’s often kicked off and constrained by other teams, and we manage it with a fairly traditional Kanban process. People who have time pick up a ticket and own it until it’s done. (Our primary Service tasks typically take 4 to 8 hours of work, spread out over a couple of weeks as we run into issues, resolve them, get feedback, and so on.) We track velocity, but don’t estimate it.
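To make "track velocity without estimating" concrete, here is a minimal sketch in Python. It assumes velocity simply means the number of Service tickets completed in each two-week sprint; that definition, the function names, and the dates are all hypothetical and for illustration only, not a description of our actual tooling.

```python
from collections import Counter
from datetime import date

SPRINT_DAYS = 14  # assumption: two-week sprints


def sprint_index(done_on: date, first_sprint_start: date) -> int:
    """Return the zero-based sprint in which a ticket was completed."""
    return (done_on - first_sprint_start).days // SPRINT_DAYS


def velocity_by_sprint(completed: list[date], first_sprint_start: date) -> dict[int, int]:
    """Count completed tickets per sprint. No up-front estimates are used."""
    counts = Counter(sprint_index(d, first_sprint_start) for d in completed)
    return dict(sorted(counts.items()))


# Hypothetical completion dates for four Service tickets.
start = date(2016, 1, 4)
done = [date(2016, 1, 6), date(2016, 1, 15), date(2016, 1, 19), date(2016, 1, 29)]
velocity = velocity_by_sprint(done, start)
print(velocity)
```

Counting finished tickets after the fact gives a rough throughput number without asking anyone to size a ticket in advance, which is the point of the Kanban approach for this kind of work.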
Finally, our Team Development work is even more ad hoc, and we don’t currently do any sort of task tracking here, aside from conversations at one-on-one meetings.
As this tool is not a core competency of our company, I’ve open sourced the application, so you can use it (or improve it), if you’d like!
A sample, public version of the app is on ShinyApps. Feel free to play around with it. Right-click on the table to add/delete rows. (Note that whenever the app restarts, which it might do at arbitrary times, the tables on disk are reset. Our copy is running inside our firewall on a copy of Shiny Server.)