GeneralIZE -- How else could IZE's hierarchies be generated?

In previous posts, I showed off an interactive demo of the IZE algorithm, and discussed how the algorithm worked. Now, it's worth considering some ways we could generalIZE the algorithm. 🤦‍♂️ Perhaps variations on the algorithm might yield hierarchies that are even better at showcasing the contents of the texts?

As discussed previously, there's pretty good experimental evidence that in order to be useful and intuitive, search results clustering needs to be based on just one or two words (or facet values, or tags). That is, it's too confusing when the algorithm finds clusters based on overall family resemblance, then tries to name them post-hoc. It's better to do the sort of single-word splitting that the IZE algorithm uses. But, could we do so in a different way or a different order, so that the results feel more like underlying clusters are being identified?

Consider an e-commerce application. If your catalog was just Clothing and Accessories, you would want to ensure that those were top-level splits, even if they didn't happen to be the most common facet values. The existing IZE algorithm is greedy -- at each step, it finds the single tag split that maximizes the number of items in the Yes branch. This is very fast (assuming you can quickly calculate the number of items that result from each split). But it may not be optimal, for some particular definition of optimal. To explain, let's review the way the algorithm works.

(A) Underlying binary tree; (B) IZE hierarchy as displayed

(A) Underlying binary tree; (B) IZE hierarchy as displayed

The recursive splitting algorithm finds the next tag that maximizes the number of items in the next Yes (right) branch. The display converts the underlying tree into an easy-to-read rotated hierarchy. There is one row per leaf node (grey circles, with sets of results), plus one row for completely internal nodes (here, "clothing"). Note that the "footwear" and "sneakers" nodes are combined because they are redundant.

Prefer consistent facets

One simple enhancement is relevant to the e-commerce use-case, where tags are part of facets. The existing algorithm treats all tags similarly, but it would be straightforward to prefer certain tags (facet values) in certain circumstances. For example, when splitting a "no" node, you could prefer facet values that are in the same facet as the parent. The results are more intuitive. The sequence of 'no' nodes that make up the left edge of the tree -- rendered at the same level of the IZE hierarchy -- are more likely to share a facet.

Standard IZE algorithm (left) vs. same-facet-boosted algorithm (right) showing brands consistently at the top level.

Standard IZE algorithm (left) vs. same-facet-boosted algorithm (right) showing brands consistently at the top level.

Optimize for similarity

Even more ambitiously, can we find a way to search through a larger set of possible trees so that the resulting IZE hierarchy is more similar to the underlying clusters?

It's helpful to think of a space of possible trees, where each tree T defines a series of splits that yields an IZE hierarchy. There could easily be billions of possible trees, depending on the number of tags and items, each defining a different IZE hierarchy for the same search results. Each tree T could then have a "goodness" score, G(T). There are many ways to define this score, but a reasonable one is to maximize how similar the items are within each node, and to minimize how similar the items in separate nodes are from each other. For e-commerce items, which typically have a set of tags (facet values), one option is to compute the Jaccard similarity between items. G(T) then becomes the sum of the Jaccard similarities between all pairs of items in each node, minus the sum of the Jaccard similarities between all pairs of items in different nodes.

So, could we enumerate all possible trees, compute the G(T) values, and pick the best? Well, no, not at any scale. Even with just dozens of tags, and hundreds of items, the computational complexity makes a real-time search system infeasible. Some sort of approximate or heuristic solution will be needed. We just need good-enough performance in bounded time (say, 1 second) for realistic use. A number of optimizations would need to be used:

  • Instead of enumerating all possible trees, use a heuristic search approach to limit the search to trees that are most likely to yield good results. (AI techniques from the 20th century!)
  • Store lots of intermediate states. Many trees differ by just a subset of their nodes, so only the differences need to be compared. Everything else can be computed once and looked up.
  • Identify equivalent trees. There are many re-orderings of splits that yield the same nodes, just in a different order. We want to prefer the order where there are more items at the top of the list (nodding to the existing heuristic), and the other re-orderings can be pruned and ignored.
  • Implement inside the search system. Modern search systems have highly-optimized representations of subsets of items that can be used for these algorithms. Certain invariant properties of items and facet values can be pre-computed, as well.

Stay tuned for the final episode of this series -- thoughts on how the AI revolution of the last few years affects what could be built in this space.


Notes:

  • Interested in implementing this with me? Have a compelling commercial use-case? Please reach out!
  • This post was primarily human-authored, with AI assistance for research, editing, and organization. The AI filled a Secondary author role. The core ideas and final voice are mine.