p Values Are Useful for A/B Tests, Sometimes

The “best practice”, when evaluating the results of an online controlled experiment (A/B test), is to use classical statistical tests, proceeding with a change if (and only if) the test yields a p value of less than 0.05. But the American Statistical Association (ASA) said in a prominent 2016 statement that “…business… decisions should not be based only on whether a p-value passes a specific threshold.” Wait, what? Are we making bad decisions from A/B tests? Should we stop using p values and do something else?

My answers are yes and sometimes: we are too often making bad decisions, and p values are only useful for certain types of A/B test decisions. In this post, I’ll talk about the specific decision pattern where p values remain a reasonable approach, taking the somewhat contrarian position of defending them against the ASA’s guidance (while still almost entirely agreeing with everything else their statement says).

Why p Values Are Bad

Let’s back up. If you took one of those Introduction to Statistics classes, you were probably taught a few things about “statistical significance”, “p values”, and “null hypotheses.” You probably know that before you run an experiment, you do a “power analysis” to figure out how long you need to collect data for. Then, after you run the experiment, you can plug the results into a formula to compute a value called p, and if that number is less than 0.05, the experiment was “statistically significant”; otherwise you cannot reject the null hypothesis. You might also know that this process was developed in the first half of the 20th century to support academic science, especially agricultural and medical research. The framework is called Null Hypothesis Statistical Testing (NHST), and it is ubiquitous.
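To make the mechanics concrete, here is a minimal sketch of the “plug the results into a formula” step, assuming a two-sample t test on a per-user metric; the data are simulated and every number is purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative per-user metric values from a finished experiment
# (e.g. revenue per visitor); in practice these come from your logs.
control = rng.normal(loc=10.0, scale=4.0, size=5_000)
treatment = rng.normal(loc=10.2, scale=4.0, size=5_000)

# "Plug the results into a formula to compute a value called p"
t_stat, p_value = stats.ttest_ind(treatment, control)

# The textbook NHST decision rule
if p_value < 0.05:
    print(f"p = {p_value:.4f}: statistically significant")
else:
    print(f"p = {p_value:.4f}: cannot reject the null hypothesis")
```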

But you might not know how much pushback has happened in recent years. In 2016, the American Statistical Association, the professional society for statisticians in the US, published a statement on p values. Here is the summary (bold added for the most important two points for our purposes):

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

If you’re like most people I’ve worked with (marketers, product managers, analysts), the above comes as a shock. Here we are, trying to make important decisions about our business, and the experts are specifically saying that our approach is bad!

Why did the ASA say this? It’s not because the p value is being incorrectly calculated; it’s because in many cases it doesn’t mean anything relevant to the decision that’s being made. What it does mean is the following somewhat convoluted statement: it’s the probability, if the experiment were re-run exactly the same way, and if in fact there was zero impact of the experimental manipulation, that the results would be as large as or larger than what was actually observed.

It’s not at all clear how this confusing probability crossing an arbitrary threshold could be anywhere near the right way to decide whether or not to make a change to your web site or app.
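One way to internalize that definition is to simulate it directly. The sketch below (with made-up conversion rates and sample sizes) re-runs an A/B test many times in a world where the change truly does nothing, and reports how often a difference at least as large as the observed one shows up; that fraction is, approximately, a one-sided p value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up observed result from one A/B test
n_per_arm = 20_000
observed_diff = 0.004        # treatment converted 0.4 points better
baseline_rate = 0.10         # conversion rate under "zero impact"

# Re-run the experiment many times assuming zero true impact:
# both arms draw from the same conversion rate.
n_sims = 100_000
control_sims = rng.binomial(n_per_arm, baseline_rate, n_sims) / n_per_arm
treatment_sims = rng.binomial(n_per_arm, baseline_rate, n_sims) / n_per_arm
null_diffs = treatment_sims - control_sims

# Fraction of zero-impact re-runs with a result as large as or larger
# than the one observed (one-sided, to match the wording above)
p_approx = np.mean(null_diffs >= observed_diff)
print(f"simulated p (one-sided): {p_approx:.4f}")
```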

Decision Making in Science and Industry

There seems to be an interesting tension between the needs of practitioners in modern organizations and the problem that p values were originally intended to solve. Put simply, a low p value, if you ran the experiment properly, means that probably something interesting is going on. In science, that’s a great start – it begins a conversation that will take years to resolve, using replication, formal models of mechanisms, and other methods of converging evidence. It’s also worth noting that in science, the goal is often to estimate the impact of a change, or to estimate some parameter of a model. p values only let you, at best, make an argument that an effect size is non-zero. So yes, for science, treating p < 0.05, on its own, as anything like conclusive or even particularly useful, is inappropriate.

But in industry, we need to have conversations about what data will mean for us first, then decide as quickly as we can what to do once we have data. Taking years, or even weeks, to decide what to do with the results of an experiment is not acceptable in almost any business. We need to take the results, then use agreed-upon decision rules and communication patterns to move forward.

So, can we find appropriate ways to use statistical tools to fit into the decisions and the communication patterns that we need to have? I think we can. And in one particular type of A/B test, the NHST approach of a p value crossing a threshold may even be the right tool for the job.

Superiority Decisions

Sometimes, when you’re running an A/B test, you only want to roll out a new experience if it is almost certainly better than the existing experience. This is a conservative decision, appropriate when the thing you’re testing is more expensive to maintain (or to fully build) than the control experience. Perhaps you’re adding a fancy and expensive recommendation module to a page, or you built a quick-and-dirty version of a feature that won’t scale, just for the test. This may not be the most common type of A/B test that people run, but it certainly happens. I like to call this a “Superiority Decision” test.

Before the experiment runs, you need to figure out, and as a team agree upon, two things – how long to run the experiment for, and how you’ll make a decision based on the results you see. Before jumping into statistics, let’s frame these qualitatively:

We want to make a Superiority Decision. We want to run the experiment until we’re quite likely to detect the smallest positive impact that would cause a rollout decision. Then, after we’ve collected the data, we will roll out the change if statistics say that the treatment experience is almost certainly better than the control experience.

I’ve found it extremely valuable to start with statements like this, getting agreement among stakeholders, before the experiment begins. In particular, settling on the smallest positive impact value requires reviewing past experiments and discussing cost-benefit tradeoffs, and is a key decision in its own right.

Translating Business Decisions to Statistics

The statement above is essentially the classical NHST approach. For instance, you could set quite likely to 80%, smallest positive impact to 1% (or whatever seems appropriate for the experiment), and almost certainly to 95%. By agreeing on this, you can then use standard power analysis techniques to determine how long to run the experiment, then use p < 0.05 as the decision criterion.

(As I discussed previously, although in general p < 0.05 does not mean “95% likely to be better”, in the very simple case of a t test for a properly-executed A/B test with adequate power, it’s very close, and certainly adequate to mean “almost certainly better”.)
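As a sketch of how that agreement might be wired up, here is one possible implementation using statsmodels; the 10% baseline conversion rate, the reading of “1%” as one percentage point, and the post-experiment counts are all illustrative assumptions, not numbers from a real test.

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

baseline = 0.10        # assumed control conversion rate
mde = 0.01             # smallest positive impact: one percentage point
alpha = 0.05           # "almost certainly" -> p < 0.05 decision threshold
power = 0.80           # "quite likely" to detect the smallest impact

# How long to run: visitors needed per arm to hit the agreed power
effect = proportion_effectsize(baseline + mde, baseline)
n_required = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, alternative="larger"
)
print(f"run until ~{int(np.ceil(n_required)):,} visitors per arm")

# After the experiment: the pre-agreed Superiority Decision rule
conversions = np.array([680, 600])      # treatment, control (made up)
visitors = np.array([6_000, 6_000])
_, p_value = proportions_ztest(conversions, visitors, alternative="larger")
decision = "roll out" if p_value < alpha else "do not roll out"
print(f"p = {p_value:.4f} -> {decision}")
```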

When I’ve used this approach, I’ve paired this setup with standardized language around the decision, such as:

It is almost certain that the New Experience is better than the Control. The New Experience is quite likely between 0.3% and 0.7% better.

The first sentence is based on the p value being lower than the 0.05 threshold, but notably, I don’t actually report the p value itself in this summary. Instead, I report uncertainty intervals around the expected change, as discussed previously. This avoids all of the confusion around what p means, and avoids anchoring on point estimates or other numbers, but behind the scenes, relies on the NHST pattern.
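As a sketch of that reporting step: the snippet below computes a normal-approximation interval for the difference in conversion rates and renders it as the standardized sentence. The counts are invented to roughly reproduce the example above, and reading “quite likely” as an 80% interval is my assumption, not a rule.

```python
import numpy as np
from scipy import stats

# Invented counts: treatment and control conversions / visitors
conv_t, n_t = 7_875, 75_000
conv_c, n_c = 7_500, 75_000

p_t, p_c = conv_t / n_t, conv_c / n_c
diff = p_t - p_c
se = np.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)

# 80% two-sided interval as one reading of "quite likely"
z = stats.norm.ppf(0.90)
low, high = diff - z * se, diff + z * se

print(
    f"The New Experience is quite likely between "
    f"{low:.1%} and {high:.1%} better."
)
```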

On the ASA Statement

Returning to the ASA statement, here are my comments on each of their assertions:

  1. P-values can indicate how incompatible the data are with a specified statistical model.

Agreed.

  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

In the very simple cases that A/B tests represent, a p value below a threshold can be interpreted as an indication that we know enough to make a specific decision. Is this “true”? Maybe not, but it’s still useful.

  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

In some cases, such as the Superiority Decision described here, business decisions should be made expediently based on pre-determined rules, and those rules can be built on a p value crossing a threshold. But in many other cases, other decision criteria should be used.

  4. Proper inference requires full reporting and transparency.

Yes, but proper reporting to non-statistically-fluent stakeholders also requires not misleading them or burying them in jargon. As data scientists in industry, we are hired for our ability to guide the business using the tools of statistics. We should be intellectually honest without being off-putting. Starting with qualitative statements that everyone can agree on goes a long way in the right direction.

  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

Completely true. If you have to mention p values in a report, it’s much better to say p < 0.05 than p = 0.0034 or whatever, which will cause stakeholders to say wince-worthy things like “very statistically significant.” The misnomer “statistical significance” just means that you’ve crossed a threshold that leads to a particular decision. You’ve pre-determined the importance of the result, and you explicitly do not care, for a Superiority Decision, how big the effect size is.

  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Right, but it can be adequate for making a decision, specifically a Superiority Decision. It’s not a measure of evidence, just a criterion that lets us take our win (or our loss) and move on.