11 Algolia A/B Testing Gotchas, Tips, and Lessons!!

I've supported A/B testing of Algolia search systems at three companies now, and have dived deep into A/B testing generally as well as specifically for search. The Algolia documentation on search A/B testing is technically adequate for getting started, and the dashboard has improved, but there are still many ways that you can go wrong when A/B testing Algolia search results. In the style of a 2014-era Buzzfeed listicle, here are 11 Algolia A/B Testing Gotchas, Tips, and Lessons!! All of the horrible illustrations are generated by AI, the rest are from earlier posts.

Before you start the test

1. Replicas must be identical except for what you're testing. Although some parameter settings can be tested in real-time, most A/B tests will be on replicas with different ranking criteria, tiebreakers, or settings that require a separate index. The catch: Indexes get out of sync incredibly easily, invalidating your tests. If someone adds a synonym or rule to your control index, without replicating to your test index, any difference you see in metrics could be caused by that change, rather than by whatever you actually intended to test. This has happened to me many times.

My recommendations:

  • Use the Algolia CLI to diff the indexes before you start the test, and every couple of days thereafter.
  • Proactively reach out to engineers and business people with Algolia access and let them know that seemingly innocuous changes to production configurations can delay work by weeks.
  • Get used to frustration and re-starting tests.
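
The diff step doesn't need anything fancy. Here's a minimal sketch, assuming you've already exported each index's settings to JSON (e.g. with the CLI's `algolia settings get <index>`); the key names and the "expected diffs" set are illustrative, not prescriptive:

```python
# Compare two exported Algolia settings objects, ignoring keys that are
# *supposed* to differ between a primary and a replica under test.
EXPECTED_DIFFS = {"replicas", "customRanking"}  # adjust for your own test

def settings_drift(control: dict, test: dict, expected=EXPECTED_DIFFS) -> dict:
    """Return {key: (control_value, test_value)} for unexpected differences."""
    drift = {}
    for key in set(control) | set(test):
        if key in expected:
            continue
        if control.get(key) != test.get(key):
            drift[key] = (control.get(key), test.get(key))
    return drift

# Example: a searchable-attribute change slipped into control only.
control = {"searchableAttributes": ["name", "brand"], "typoTolerance": "min"}
test = {"searchableAttributes": ["name"], "typoTolerance": "min"}
print(settings_drift(control, test))
```

Synonyms and rules live outside the settings object, so export and compare those separately as well.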

2. A/B testing when you're using the dynamic reranker is fraught. The dynamic reranking feature is really valuable -- it uses user interaction patterns from events to automatically improve ranking for the most common queries. The impact on relevance can be substantial. But the feature doesn't play well with A/B testing.

Plot twists: When you set up a replica index, dynamic reranking is off by default, even if the original index used that feature. And the standard way of enabling dynamic reranking looks at interactions with that index -- which, for a fresh test index, means starting from scratch! So you're testing a control with a well-tuned reranker against a test index with no or poor reranking -- it'll never win. There is a workaround: the test index can use the control index as its source of events. Annoyingly, this is mentioned in the reranker docs, but not the A/B testing docs.

3. You need to QA before starting the test. As I've written previously, Algolia's otherwise solid Dashboard lacks a good tool for doing head-to-head tests. Fortunately, it's not that hard to build one yourself. Pro tip: Before starting a test, you and stakeholders should use a tool like this to make sure you understand the implications of whatever you're changing.

4. Offline analysis can support QA. In the past, I've written scripts that do the following:
  • Pull a representative sample of hundreds of historical user queries from the data warehouse.
  • Use the Algolia API to pull search results for those queries, on both the control and test indexes. (Remember to set analytics:false to avoid corrupting your own data and AI features!)
  • Build dashboards (I used flexdashboard at the time, but now you could vibe-code anything) that let you see which queries had big changes to the results, and whether the changes affect metrics such as "average price on page 1", "average popularity of top results", etc.
  • Dive into the results and use your head-to-head tool to deeply understand changes before going live.
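
The dashboard metrics in the third step are just aggregations over the hits lists the API returns. A sketch of that comparison step, with hypothetical record fields (`price`, `objectID`):

```python
def page1_metrics(hits: list) -> dict:
    """Summary metrics for one result page (field names are hypothetical)."""
    prices = [h["price"] for h in hits if "price" in h]
    return {
        "avg_price": sum(prices) / len(prices) if prices else None,
        "n_hits": len(hits),
    }

def overlap_at_k(control_ids, test_ids, k=10):
    """Fraction of top-k objectIDs shared between control and test results."""
    c, t = set(control_ids[:k]), set(test_ids[:k])
    return len(c & t) / k

control_hits = [{"objectID": "a", "price": 10}, {"objectID": "b", "price": 30}]
test_hits = [{"objectID": "b", "price": 30}, {"objectID": "c", "price": 50}]
print(page1_metrics(control_hits))                # avg_price 20.0, n_hits 2
print(overlap_at_k(["a", "b"], ["b", "c"], k=2))  # 0.5
```

Queries whose overlap-at-k drops sharply are the ones worth a close look in your head-to-head tool.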

5. Algolia only supports full-population A/B tests. Sometimes certain users shouldn't be included in an A/B test or an A/B test analysis. On a two-sided marketplace, you may want to exclude sellers from a test, to minimize confusion before a feature is released. Or you may only want to test users coming from a specific marketing source. Nope, Algolia's built-in A/B testing can't do this. For anything fancier than testing everybody, you'll need to use separate A/B testing tools to handle bucket-assignment. See also #9 and #10 below.
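
If you do go the external-tool route, the core of bucket assignment is a deterministic hash; the eligibility fields below are hypothetical examples of exclusions Algolia's built-in testing can't express:

```python
import hashlib

def assign_bucket(user_id: str, test_name: str, test_ratio: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'test'.

    Hashing user_id together with the test name keeps assignments stable
    across sessions and independent across concurrent tests.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "test" if point < test_ratio else "control"

def eligible(user: dict) -> bool:
    """Example exclusions: sellers out, one marketing source in (hypothetical fields)."""
    return not user.get("is_seller") and user.get("source") == "email_campaign"

user = {"id": "u123", "is_seller": False, "source": "email_campaign"}
bucket = assign_bucket(user["id"], "tiebreaker-v2") if eligible(user) else "control"
```

Ineligible users simply get the control experience and are dropped from the analysis.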



Analyzing A/B test results

6. Algolia's A/B test analysis only works at the query grain. In A/B test analysis, grain is the unit of a test. For search, the grain can be a query ("blue shoes"), a subquery ("blue sho", if you have search-as-you-type in your UI), a search session (multiple searches for a single intent), or even a user funnel (from the first time you see a user, until they make a purchase, if they do). Algolia's A/B test analysis only works at the query grain. That may not be what you want.

For many e-commerce use cases, users refine their initially-broad queries as they come to understand the scope of your catalog. This is good! And your marketing team has a carefully designed abandoned-cart campaign to get users back on the site days later. An A/B test analysis that penalizes poor conversion rates from initial, broad searches like "shoes" may not serve your users (or your business) well. Ideally, look at search sessions or even multi-session funnels to determine whether a changed search experience increases eventual purchase rates or revenue.
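
If you capture query events yourself, rolling them up to the search-session grain can be as simple as a time-gap heuristic (the 30-minute cutoff is a common convention, not an Algolia feature):

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Group one user's query events into search sessions.

    events: list of (timestamp, query) tuples, sorted by time.
    A new session starts whenever the gap since the previous
    event exceeds `gap`.
    """
    sessions, current = [], []
    last_ts = None
    for ts, query in events:
        if last_ts is not None and ts - last_ts > gap:
            sessions.append(current)
            current = []
        current.append(query)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2024, 1, 1, 12, 0)
events = [(t0, "shoes"), (t0 + timedelta(minutes=2), "blue shoes"),
          (t0 + timedelta(hours=3), "blue running shoes")]
print(sessionize(events))  # [['shoes', 'blue shoes'], ['blue running shoes']]
```

Session-grain metrics (did the *session* convert?) then replace query-grain ones.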

If you're relying on Algolia's A/B test analysis alone, you need to understand the limitations and use the results you get as a proxy for the data you actually want.

7. You can do your own analysis. Good news: When a query is part of an A/B test, the Algolia API will return the abTestID and abTestVariantID in the results object. If you can capture these and send them to your event-tracking system, along with user ID information, you or your data scientists can use your analytics tool or data warehouse to analyze the results. This has many advantages -- you can measure metrics that you aren't sending to Algolia (are users rating their purchases higher?), you can fix the grain issue and look at search sessions or user funnels, you can more carefully exclude internal users and bots, and you can look at subgroups of users separately (do new users respond to the change differently than returning users do?). There's some setup involved, but this is my recommendation. Unless a test is very lightweight, spend the time to set up the analysis the right way for your domain and data.
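
The capture step is straightforward. A sketch of extracting the abTestID and abTestVariantID fields mentioned above and attaching them to your own tracking event:

```python
def extract_ab_fields(search_response: dict):
    """Pull A/B test identifiers out of an Algolia search response.

    Responses for queries that are part of a running test include
    abTestID and abTestVariantID; other responses omit them.
    """
    if "abTestID" not in search_response:
        return None  # this query was not part of a test
    return {
        "ab_test_id": search_response["abTestID"],
        "ab_test_variant_id": search_response["abTestVariantID"],
    }

# Send alongside your own user/session identifiers to your event tracker:
response = {"hits": [], "abTestID": 42, "abTestVariantID": 2}
event = {"user_id": "u123", **(extract_ab_fields(response) or {})}
print(event)  # {'user_id': 'u123', 'ab_test_id': 42, 'ab_test_variant_id': 2}
```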

8. Algolia assumes you're making a Superiority decision. In earlier work, I described how many A/B tests are intrinsically lower-stakes, and being statistically confident that your change is better than the status quo actually is too high a bar. Instead, when making small changes, such as tuning search parameters or tiebreaker scores, you're actually making an Agnostic decision. As long as most of the time you choose the winning setting, the statistical test at the end isn't that important. Just choose whichever variant has even slightly better results.

The statistical method for being comfortable with this requires some understanding of how big the likely changes to your metrics will be, and some pre-test calculations to ensure you're getting enough data. You still need to collect data for a while, but you can often make a decision and move on even with traditionally-inconclusive results. Just don't quote the magnitude of the results you saw -- those numbers are close to meaningless from underpowered tests.
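
For the pre-test calculation, the standard two-proportion normal approximation is enough to ballpark how much data you need. A sketch with z-values fixed at the common alpha = 0.05 (two-sided) and power = 0.8 choice; for an Agnostic decision you might deliberately relax these:

```python
from math import ceil

def sample_size_per_variant(p_base, min_detectable_lift):
    """Rough per-variant sample size for a two-proportion test.

    Normal-approximation formula with z-values hardcoded for
    alpha = 0.05 (two-sided) and power = 0.80.
    """
    z_alpha, z_beta = 1.96, 0.84
    p2 = p_base * (1 + min_detectable_lift)
    var = p_base * (1 - p_base) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * var / (p_base - p2) ** 2
    return ceil(n)

# Detecting a 2% relative lift on a 10% conversion rate takes hundreds of
# thousands of queries per variant; a 10% lift takes far fewer.
print(sample_size_per_variant(0.10, 0.02))
print(sample_size_per_variant(0.10, 0.10))
```

Running this before the test tells you whether your traffic can support the decision at all.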

9. Including all users can weaken effects. Suppose you're making a change that will only be experienced by a small number of users. Maybe your Test adds a synonym for "shooz" and "shoes", I dunno. Algolia doesn't know which users are affected by this change, and treats all users as part of the experiment. But only users who typed either "shooz" or "shoes" could ever have seen anything different. So you're diluting the data you care about with a huge number of cases where any difference is just random noise. If you could only look at the users you cared about, you might see a 10% vs 20% difference -- huge! But when diluted by other users, it might only be 10.00% vs 10.05% -- minuscule, hard to detect, and hard to reason about.

If you do your own analysis (see #7), you can work around this. Define a funnel of users who could have been affected by the test, then compare conversion rates for those users. The numbers will be smaller, but the signal-to-noise ratio will be much better.
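
The dilution arithmetic from the example above is worth working through once. If only a fraction of users can be affected, the overall difference shrinks by exactly that fraction:

```python
def diluted_rates(base_rate, affected_frac, control_rate, test_rate):
    """Overall conversion rates when only a fraction of users see the change."""
    overall_control = (1 - affected_frac) * base_rate + affected_frac * control_rate
    overall_test = (1 - affected_frac) * base_rate + affected_frac * test_rate
    return overall_control, overall_test

# The numbers from the text: a 10% vs 20% difference among affected users,
# when only 0.5% of users are affected and everyone else converts at 10%:
c, t = diluted_rates(base_rate=0.10, affected_frac=0.005,
                     control_rate=0.10, test_rate=0.20)
print(f"{c:.2%} vs {t:.2%}")  # 10.00% vs 10.05%
```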

Other notes

10. You don't have to use Algolia's A/B testing framework to test Algolia. The biggest business impact I ever created (roughly a 1.5% relative increase in conversion rate) came from coordinating a front-end UI change with a tiebreaker weighting change. On their own, neither change would have had the same impact, but the synergistic effect changed user decision-making for the better. This test was not run using Algolia's A/B test feature; instead it ran on an external framework that handled user-bucketing, analytics, and the simple logic that determined which user queries saw results from which index (control or test). To make this work, all that's necessary is a few lines of code somewhere that say "users in Control use index A, while users in Test use index B". We leveraged Algolia's replica-index features without using the A/B test feature.
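
That routing logic really is only a few lines. A sketch with hypothetical index names (the bucket comes from your external A/B framework):

```python
# Hypothetical index names: a primary and a replica with the changed settings.
INDEX_BY_BUCKET = {
    "control": "products",
    "test": "products_tiebreaker_v2",
}

def index_for(bucket: str) -> str:
    """Choose which Algolia index serves this user's queries."""
    return INDEX_BY_BUCKET.get(bucket, INDEX_BY_BUCKET["control"])

print(index_for("test"))     # products_tiebreaker_v2
print(index_for("unknown"))  # products (fail safe to control)
```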

Burying the lede -- this is now my recommendation: in most cases, don't use Algolia's A/B testing framework. Algolia's framework can be useful in limited cases, but you're usually better off using an external A/B testing framework, ideally one that integrates well with your analytics system (see #7). Use Algolia for setting up the indexes, ensure that only the intended change differs (see #1), and use the API for QA (see #3 and #4).

You'll also be able to more easily handle sub-population tests and analysis (#5 and #9), as well as more complex experimental designs, such as tests that ramp up over time while properly maintaining user-bucket assignment.

11. The A/B testing feature can save your bacon. Years ago, a mistake I made corrupted our production Algolia index. Rebuilding the index, with millions of items, would have taken hours. Fortunately, we had a backup copy of the index in the same Algolia app, created daily by a cron job. (If you're not using the CLI and cron to kick off a daily backup, do yourself a favor...) But a production redeploy to point at the backup, or even copying the backup back to production, would have taken time we didn't have. I was able to set up a quasi-A/B test in just seconds. It redirected 100% of traffic from the corrupted primary index to the backup index, buying us time to re-index and fix things the right way.

Now go run better Algolia A/B tests.


Advertisement: Does your organization struggle to get the relevance, user experience, and business impact you need from Algolia? I'm a freelance consultant with years of Algolia experience who can help you get the most out of advanced tools and search algorithms. Get in touch!


Note: This post was primarily human-authored, with AI assistance for research, editing, and organization. The AI filled a Secondary author role. The core ideas and final voice are mine.