A thing that I do when I cook is to re-write the recipes I’m using (whether they’re from a cookbook or my own invention) onto a piece of paper in a very specific way. I think the approach I use is handy, so I’m describing it here in case you’d like to use it. (Or in case you need more evidence about how weird I am.) There are 4 ideas that I think are important:

Read more →

This is my first new post on harlan.harris.name for a while. The occasion is a change of scenery. For about 10 years, my primary blog has been on WordPress, more recently supplemented by Medium. But WordPress and Medium are limited for technical writing, and the trend among data people recently has been to publish static sites built with Blogdown and Hugo. So that’s what this is. The technology I’m using (more on it below) lets me do fun things like trivially embed math: \(\sum_i a^2_i\), or generate plots with embedded code:

Read more →

This post was originally published on Medium. There’s recently been some interesting opinionated writing in the R statistical programming community about how and when to teach the abstracted, easy-to-use approaches to solving problems, versus the underlying nitty-gritty. David Robinson, Data Scientist at Stack Overflow, wrote a blog post recently called Don’t teach students the hard way first. The primary example was on the data-manipulation tools in the tidyverse, versus the underlying methods in base R, but the discussion was mostly about principles in pedagogy.

Read more →

This post was originally published on Medium. I recently attended two small conferences: the ISBIS (International Society for Business and Industrial Statistics) 2017 conference, held at IBM Research in Westchester County, and the Domino Data Lab Popup, held in West SoHo. I was invited to speak at ISBIS (slides here, if you’re curious), but for this post, I want to summarize some insights from other people’s talks. In chronological (to me) order… First a few talks from ISBIS that I particularly liked (note that I only saw a fraction of all the talks):

Read more →

This post was originally published on Medium. Occasionally when chatting with other data scientists, especially with others who are interested in integrating predictive models into production software systems, the word “scaling” comes up. Not this. Although some West Coast data scientists are into this kind of scaling too. I think this is a great question, but it’s a little underspecified. There seem to be at least three qualitatively different notions of “scaling” in data science, and it’s worth the effort to clarify each of them, and address how people tackle them.

Read more →

This post was originally published on Medium. A particularly good way to get a little more out of professional conferences is to blog about your experiences, I think. It makes you focus your thoughts on things like “what’s the big take-away here,” and “what should I be asking people in the hallways?” Rather than just summarizing what you saw, or making snarky Twitter comments (also worth doing!), a great conference blog post is synthesis: combining insights from multiple presentations and conversations into a coherent new whole that helps clarify ideas.

Read more →

This post was originally published on Medium. A particularly good talk at Strata NY last year was by Brett Goldstein, former CIO of Chicago, who talked about accountability and transparency in predictive models that affect people’s lives. This struck a strong chord with me, so I wanted to take some time to write down some thoughts. (And a rather longer time to publish those thoughts…) I’m sure others have thought about this more and have better takes on it, so please comment and provide links!

Read more →

neveragain.tech

data ethics

I, Harlan D. Harris, hereby commit to the neveragain.tech pledge. Please stand with me and hold me to it. It starts: We, the undersigned, are employees of tech organizations and companies based in the United States. We are engineers, designers, business executives, and others whose jobs include managing or processing data about people. We are choosing to stand in solidarity with Muslim Americans, immigrants, and all people whose lives and livelihoods are threatened by the incoming administration’s proposed data collection policies.

Read more →

This post was originally published on Medium. When building a complex system, it’s often helpful to think about the design of that system using patterns and abstractions. Architects and software engineers do so frequently, and the experience of implementing predictive modeling pipelines has recently led to a variety of patterns and best practices. For instance, when dealing with large amounts of streaming data, some organizations use the Lambda Architecture to handle both real-time and computationally-intensive use-cases.

Read more →

This post was originally published on Medium. You’re a data scientist, and you’ve got a predictive model. Great work! Now what? In many cases, you need to hook it up to some sort of large, complex software product so that users can get access to the predictions. Think of LinkedIn’s People You May Know, which mines your professional graph for unconnected connections, or Hopper’s flight price predictions. Those started out as prototypes on someone’s laptop, and are now running at scale, with many millions of users.

Read more →