Skip to main content

Hierarchical Embeddings for Text Search

List-processing tricks for generating embeddings at different levels of document abstraction to allow efficient semantic searching.

A proposal for better retrieval on large, complex, hierarchically-structured document corpuses, which can be implemented straightforwardly using heuristics and pairwise embedding distances.

We embed each atomic element, merging them heuristically into larger embeddings repeatedly.

Then we search arbitrary text inputs in the reverse direction hierarchically, from the largest embedding to the smallest, to return the best matches in size order while excluding spuriously-similar hits.

X-dimensional text embeddings have heavily supplemented (or even replaced) traditional keyword search these days: neural nets encode documents as numeric fingerprints, allowing easy retrieval of the nearest matches. You search over documents by using neural nets to generate an embedding, which is a list of numbers which act like a sort of ‘thumbprint’ or ‘semantic summary’, and then finding the numerically-closest embeddings from all the other documents available. These embeddings can also be processed versions of text, like summaries which draw on multiple documents, and can become quite complex, like RAPTOR, which summarizes clusters of documents at multiple levels to try to provide useful summaries for downstream Q&A tasks.

I would like to consider a simpler task here: extending recommender-system-like ‘suggested reading’ lists of retrieval hits, based on arbitrary parts of documents rather than the entire document.

List Clustering

We have a problem of breaking up a document into natural ‘clusters’.

As it happens, I faced a similar problem on Gwern.net, in breaking up the list of similar documents into clusters which I could easily label with a name. I solved it when I noticed, looking at them, that the seriation resulted in clear clusters in the list, and these clusters corresponded to sudden ‘jumps’ in the embedding distance. This made sense in retrospect: if there is a topic A and a topic B, then the seriation will tend to result in a list like ‘A A AB B B’, and at ‘AB’, there will be the largest distance in the list. So you can extract n clusters simply by going through the list, and recording the indexes of the largest n pairwise distances, and splitting the list at those. So if you specified n = 2, then the ‘A A AB B B’ list cleanly splits into ‘A A A’ and ‘B B B’ sub-lists.

How do you decide n? You could try to define some distance which corresponds to a ‘jump’, but I settled for a heuristic: the number of clusters tends to go up fairly steadily with the total number of documents, faster than logarithmically but slower than linearly, and a square root seemed to work reasonably well. So if there are k documents, one just extracts n = √k clusters (rounding up, for a ceiling, as I find better a bit too many clusters than too few). This ghetto clustering algorithm worked surprisingly well in practice, and wasn’t too hard to implement.

Documents As Lists-Of-Lists

So I think you could do something similar for our document problem: embed every block element, like paragraphs and list items; then apply the seriation clustering algorithm to determine the break points. Then, one applies this recursively to build up a data-driven hierarchy.

Each cluster’s embeddings get averaged together, to create the next level. So a document might have 100 block elements, which yield √100 = 10 clusters; each of those 10 cluster’s 10 embeddings get averaged into a ‘prototypical’ embedding of that cluster, its center; then the seriation is done again, and √10 = 3 clusters are taken; then 2 clusters; and then the entire document. (One tracks the ‘spatial’ extent of each cluster as well, for use later on in search.) Now one has broken up a document in the semantically logical ways, without being too hardwired with a specific hierarchy. And there will not be too many embeddings, because it is hierarchical: the square root heuristic means we don’t have to store too many embeddings beyond the atomic level (total embeddings: 𝒪(n + n0.5)?), which we can tune the size of. (If block level is too expensive, we can try to work with a few blocks each time, etc.)

Website Implementation

HTML-implementation-wise, we will assume that IDs are available which define ranges inside documents, and the embeddings propagate ranges appropriately by taking the beginning & end of the cluster, and this is a bit like a range tree. This would work well with the Gwern.net transclusion popups, which can transclude specific ID ranges.

We can also enable the ‘drilling down’ nicely by using lazy-loading collapses/disclosure blocks: we can show the first k hits at each level, but then append a collapse block to each hit, which contains the k-best hits within that that hit, so a user can choose which way to drill—either ‘vertically’ or ‘horizontally’. So a text input returns a popup containing a nested list of “Document-level hits: A + [within A], B + [within B], C + [within C] / §-level: E.1+… U.8+… Z.2+… … / paragraphs-level: M.5.1+… … / paragraph-level: N.22.6.9+…”, each of which can be either popped-up or uncollapsed. (And like the seriated sort-by-magic clusters, we can label each range using a LLM to summarize them with a human-readable name, if they do not match up nicely with a section header or something.)

We can also avoid dynamicism by changing the search: you can’t search for arbitrary text, instead, if you mouse over some text, you simply search for the cluster which contains that text. (If you search on a paragraph, then you start with that atomic element; if you search 2 adjacent paragraphs, whatever is the lowest cluster that contains them both.) The results can then be precomputed, or deferred to an API service.

Because embeddings are fairly cheap and independent at the document level, we can simply throw away embeddings at the document-level whenever a document is modified, and re-embed hierarchically that document.

Similar Links

[Similar links by topic]