11 Algolia A/B Testing Gotchas, Tips, and Lessons!!
I've supported A/B testing of Algolia search systems at three companies now, and have dug deep into A/B testing both in general and specifically for search. The Algolia documentation on search A/B testing is technically adequate for getting started, and the dashboard has improved, but there are still many ways to go wrong when A/B testing Algolia search results. In the style of a 2014-era Buzzfeed listicle, here are 11 Algolia A/B Testing Gotchas, Tips, and Lessons!! All of the horrible illustrations are generated by AI; the rest are from earlier posts.
Before you start the test
- Replicas must be identical except for what you're testing. Although some parameter settings can be tested in real time, most A/B tests will be on replicas with different ranking criteria, tiebreakers, or settings that require a separate index. The catch: indexes get out of sync incredibly easily, invalidating your tests. If someone adds a synonym or rule to your control index without replicating it to your test index, any difference you see in metrics could be caused by that change rather than by whatever you actually intended to test. This has happened to me many times.
My recommendations:
- Use the Algolia CLI to diff the indexes, before you start the test, and every couple of days thereafter.
- Proactively reach out to engineers and business people with Algolia access and let them know that seemingly innocuous changes to production configurations can delay work by weeks.
- Get used to frustration and re-starting tests.
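The CLI diff is my go-to, but the same drift check can be scripted against the API. Here's a minimal sketch using the official Algolia Python client (v2/v3-style `init_index` API); the app ID, API key, and index names are placeholders, and the pure `diff_settings` helper works on any pair of settings dicts:

```python
def diff_settings(control: dict, test: dict) -> dict:
    """Return {setting: (control_value, test_value)} for every key that differs."""
    keys = set(control) | set(test)
    return {k: (control.get(k), test.get(k))
            for k in sorted(keys) if control.get(k) != test.get(k)}

if __name__ == "__main__":
    # Network portion -- assumes the official `algoliasearch` Python client
    # (v2/v3 API) and placeholder credentials/index names.
    from algoliasearch.search_client import SearchClient

    client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
    control_settings = client.init_index("products").get_settings()
    test_settings = client.init_index("products_abtest").get_settings()

    for key, (c, t) in diff_settings(control_settings, test_settings).items():
        print(f"DRIFT {key}: control={c!r} test={t!r}")
```

Run it before the test starts and on a schedule afterward; any `DRIFT` line other than the settings you're deliberately testing means your test is compromised. (Synonyms and rules need the same treatment; the CLI exports both.)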
Plot twists: When you set up a replica index, dynamic reranking is off by default, even if the original index used that feature. Worse, the standard way of enabling dynamic reranking learns from interactions with that index, and a fresh test index is likely starting from scratch! So you're testing a control with a well-tuned reranker against a test index with no or poor reranking -- it'll never win. There is a workaround: the test index can use the control index as the source of events. Annoyingly, this is mentioned in the reranker docs, but not the A/B testing docs.
- Offline analysis can support QA. In the past, I've written scripts that do the following:
- Pull a representative sample of hundreds of historical user queries from the data warehouse.
- Use the Algolia API to pull search results for those queries, on both the control and test indexes.
  (Remember to set `analytics: false` on these API calls to avoid corrupting your own data and AI features!)
- Build dashboards (I used flexdashboard at the time, but now you could vibe-code anything) that let you see which queries had big changes to the results, and whether the changes affect metrics such as "average price on page 1", "average popularity of top results", etc.
- Dive into the results and use your head-to-head tool to deeply understand changes before going live.
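A stripped-down sketch of that kind of script. The `price` and `popularity` attributes and index names are illustrative assumptions about your record schema; the client is any Algolia Python client exposing the v2/v3-style `init_index(...).search(...)` API:

```python
from statistics import mean

def page1_metrics(hits: list) -> dict:
    """Summarize one page of results. Assumes each hit carries numeric
    `price` and `popularity` attributes (adjust to your schema)."""
    if not hits:
        return {"avg_price": None, "avg_popularity": None}
    return {
        "avg_price": mean(h["price"] for h in hits),
        "avg_popularity": mean(h["popularity"] for h in hits),
    }

def compare_indexes(client, queries, control_index, test_index):
    """For each historical query, fetch page 1 from both indexes with
    analytics disabled, and return side-by-side metrics per query."""
    params = {"hitsPerPage": 20, "analytics": False, "clickAnalytics": False}
    rows = []
    for q in queries:
        control_hits = client.init_index(control_index).search(q, params)["hits"]
        test_hits = client.init_index(test_index).search(q, params)["hits"]
        rows.append({"query": q,
                     "control": page1_metrics(control_hits),
                     "test": page1_metrics(test_hits)})
    return rows
```

Sort the output by the size of the metric gap and you have a ready-made worklist for your head-to-head review.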
Analyzing A/B test results
For many e-commerce use cases, users refine their initially-broad queries as they better understand the scope of your catalog. This is good! And your marketing team has a carefully designed abandoned-cart campaign to get users back on the site days later. An A/B test analysis that penalizes poor conversion rates on initial, broad searches like "shoes" may well not be serving your users (or your business) well. Ideally, look at search-session or even multi-session funnels to determine whether a changed search experience increases eventual purchase rates or revenue.
If you're relying on Algolia's A/B test analysis alone, you need to understand the limitations and use the results you get as a proxy for the data you actually want.
Getting statistically comfortable with this requires some sense of how big the likely changes to your metrics will be, plus pre-test power calculations to ensure you'll collect enough data. You still need to collect data for a while, but you can often make a decision and move on even with traditionally-inconclusive results. Just don't quote the magnitude of the results you saw -- those numbers are close to meaningless from underpowered tests.
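The pre-test calculation is the standard two-proportion sample-size formula. A minimal sketch, fixed at 5% two-sided significance and 80% power (the 5% baseline conversion rate and half-point lift in the usage example are purely illustrative):

```python
import math

def sample_size_per_arm(p_control: float, lift: float) -> int:
    """Per-arm sample size for a two-proportion test, normal approximation,
    alpha = 0.05 (two-sided), power = 0.80. `lift` is the absolute change
    you want to detect, e.g. 0.005 for half a point of conversion rate."""
    z_alpha = 1.959964  # two-sided 5% significance
    z_beta = 0.841621   # 80% power
    p_test = p_control + lift
    p_bar = (p_control + p_test) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_control * (1 - p_control)
                                      + p_test * (1 - p_test))) ** 2
    return math.ceil(numerator / lift ** 2)

# Example: detecting a +0.5-point lift on a 5% baseline needs roughly
# 31k users per arm -- a useful sanity check before committing to a test.
needed = sample_size_per_arm(0.05, 0.005)
```

If that number dwarfs your weekly search traffic, you already know the test will be underpowered, and you can plan accordingly rather than finding out a month in.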
If you do your own analysis (see #7), you can work around this. Define a funnel of users who could have been affected by the test, then compare conversion rates for those users. The numbers will be smaller, but the signal-to-noise ratio will be much better.
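The funnel definition -- deciding who counts as "could have been affected" -- is the hard, business-specific part; once you have those exposed populations, the comparison itself is just a two-proportion z-test. A minimal sketch:

```python
import math

def two_proportion_test(conversions_a: int, exposed_a: int,
                        conversions_b: int, exposed_b: int):
    """Compare conversion rates between two exposed populations.
    Returns (z, two_sided_p) using the pooled-variance normal approximation."""
    p_a = conversions_a / exposed_a
    p_b = conversions_b / exposed_b
    p_pool = (conversions_a + conversions_b) / (exposed_a + exposed_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / exposed_a + 1 / exposed_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from |z|
    return z, p_value
```

Because the denominators are only users who actually hit the funnel, the counts shrink but the noise from never-exposed users disappears, which is exactly the signal-to-noise win described above.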
Other notes
Burying the lede -- this is now my recommendation: in most cases, don't use Algolia's A/B testing framework. Algolia's framework can be useful in limited cases, but you're usually better off using an external A/B testing framework, ideally one that integrates well with your analytics system (see #7). Use Algolia for setting up the indexes, ensure that only the intended change differs (see #1), and use the API for QA (see #3 and #4).
You'll also be able to more easily handle sub-population tests and analysis (#5 and #9), as well as more complex experimental designs, such as tests that ramp up over time while properly maintaining user-bucket assignment.
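Stable bucket assignment under a ramp-up is usually done with deterministic hashing; most external frameworks do some version of this internally. A sketch of the idea (the experiment-name salt and threshold scheme are illustrative): each user's hash position is fixed, so raising the test fraction only adds users to the test arm -- nobody silently switches groups mid-test.

```python
import hashlib

def bucket(user_id: str, experiment: str, test_fraction: float) -> str:
    """Deterministically map a user to 'test' or 'control'. Salting with
    the experiment name decorrelates assignments across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    position = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
    return "test" if position < test_fraction else "control"
```

Ramping from 5% to 50% just moves the threshold; every user assigned to test at 5% is still in test at 50%, which keeps the user-bucket assignment consistent across the ramp.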
Now go run better Algolia A/B tests.
Advertisement: Does your organization struggle to get the relevance, user experience, and business impact you need from Algolia? I'm a freelance consultant with years of Algolia experience who can help you get the most out of advanced tools and search algorithms. Get in touch!
Note: This post was primarily human-authored, with AI assistance for research, editing, and organization. The AI filled a Secondary author role. The core ideas and final voice are mine.