Experience Report: Vector Embeddings for Semantic Search
tl;dr: If you are considering using vector embeddings for semantic similarity search as a core part of the system you’re building, then at the very least set up a test system to rapidly validate embedding models early in your development phase.
Our experience with vector embeddings for semantic similarity search
Recently we attempted to use semantic similarity search, built on vector embeddings, to provide semantic search over public company filing data (annual and quarterly filings, press releases, etc.). Our journey began by implementing a better index for vector similarity search, based on the assumption that vector encodings of document chunks would provide good enough distinction. That assumption did not hold in our internal testing, where results often did not match the query, were of low relevance, or were entirely contradictory.
We pivoted to a hybrid keyword + semantic-similarity reranking search, which is what we ultimately demonstrated at Web Summit 2025. Here we again found that vector embeddings did not encode the nuance needed to compare queries to document chunks in a variety of our tests, even after narrowing the search space with keyword search. Although the demo showed promise to those who engaged with it at our booth, we knew from our development experience that vector-embedding-based similarity search was a technological dead end for us. Nor did we see getting into the murky world of embedding model tuning and validation as a sustainable long-term direction.
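For readers who want a concrete picture of what such a hybrid pipeline looks like, here is a minimal sketch, assuming a BM25 keyword stage (via the rank_bm25 package) followed by an embedding rerank (via sentence-transformers). The documents, cut-offs, and scoring choices are illustrative stand-ins, not our production code.

```python
# Illustrative sketch only: a hybrid keyword + embedding-rerank search pipeline.
# Corpus, query, and candidate count are made-up stand-ins.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

documents = [
    "The company remains solvent and reported reduced long-term debt.",
    "Management warned the subsidiary is not solvent without new financing.",
    "Quarterly press release announcing a dividend increase.",
]

# Stage 1: keyword narrowing with BM25 over whitespace-tokenised chunks.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
query = "is the company solvent"
keyword_scores = bm25.get_scores(query.lower().split())
candidates = sorted(range(len(documents)),
                    key=lambda i: keyword_scores[i], reverse=True)[:2]

# Stage 2: rerank the surviving candidates by embedding similarity.
# msmarco-distilbert-dot-v5 was trained for dot-product scoring, so we use that.
model = SentenceTransformer("msmarco-distilbert-dot-v5")
query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode([documents[i] for i in candidates], convert_to_tensor=True)
scores = util.dot_score(query_vec, doc_vecs)[0]

for idx, score in sorted(zip(candidates, scores.tolist()),
                         key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {documents[idx]}")
```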
What are vector embedding models
At its core, an embedding is a vector where each dimension is meant to represent some aspect of meaning. The more or less of that aspect a word or sentence has, the closer it will sit to, or the further from, other words/sentences that share that aspect of meaning, or so the theory goes. This mostly holds for manual feature engineering (experts picking dimensions) and tagging (expert classification), but that process, being manual, is incredibly slow and expensive.
For a sense of scale, typical embeddings often have hundreds of dimensions and day-to-day working vocabularies alone run 20,000 - 30,000 words; even 300 dimensions across 25,000 words is 7.5 million data points that would need to be carefully considered by an expert.
The modern approach is instead to rely on machine learning algorithms to create features by examining things like co-occurrence relations (words that occur near each other, or in the same document) over large corpora (such as Wikipedia). Even with this approach, building the features is not fast enough to run at interactive speeds, so a model is created beforehand and then used repeatedly. Although the machine learning approach reduces the effort of feature curation and tagging, a substantial amount of expertise and work goes into validating the model. Regardless of the technique used, the resulting embedding model can take input word(s) and produce a vector representation of their “meaning” given the feature set.
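As a concrete (and hedged) illustration, here is roughly what using an off-the-shelf embedding model looks like via the sentence-transformers library; the model name is one of the models mentioned later in this post, and the input text is made up.

```python
# Minimal sketch: turn a piece of text into an embedding vector with a
# pre-trained model. Model choice and input are illustrative only.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("msmarco-distilbert-dot-v5")

# The model maps arbitrary text to a fixed-size vector of learned "meaning" features.
vector = model.encode("net income increased year over year")
print(vector.shape)  # (768,) for this model: one value per learned dimension
```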
How are things considered semantically similar
You might be wondering why we go to all this effort to turn word(s) into vectors. The payoff is that we can use mathematical vector operations, such as similarity measures (Euclidean or cosine), to determine semantic similarity.
Obtaining a vector once we have an embedding model is straightforward, although depending on the sophistication of the model it may be computationally expensive. As described in the prior section, we pass word(s) from a query or from chunks of text through the embedding model to generate an n-dimensional vector (where n is determined by the specific model). The resulting vectors can then be compared using various distance measures, the most common of which are Euclidean distance and cosine similarity. In the context of search, the similarity can be treated as a score/rank and results can be sorted by that score. Most models are built and trained with a particular similarity measure in mind, most commonly cosine similarity.
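Here is a toy sketch of that scoring step, using made-up 3-dimensional vectors in place of real model output so the arithmetic is visible; the chunk names and values are purely illustrative.

```python
# Compare a query vector against chunk vectors with cosine similarity and
# Euclidean distance, then sort results by score. Vectors are toy stand-ins.
import numpy as np

query = np.array([0.9, 0.1, 0.3])
chunks = {
    "chunk A": np.array([0.8, 0.2, 0.4]),
    "chunk B": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    # Higher means more similar (angle between vectors is smaller).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Lower means more similar (points are closer together).
    return float(np.linalg.norm(a - b))

# Rank chunks by cosine similarity, highest first.
ranked = sorted(chunks.items(),
                key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
for name, vec in ranked:
    print(name,
          "cosine:", round(cosine_similarity(query, vec), 3),
          "euclidean:", round(euclidean_distance(query, vec), 3))
```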
A Bit of History
The idea of encoding words as vectors and using vector similarity dates back decades, as early as the 1960s when the SMART Information Retrieval System was developed at Cornell University under Gerard Salton (the approach was then called the vector space model). As computing costs came down and training sets became more plentiful, its usage grew over the last few decades (a notable example being word2vec). Beyond semantic similarity, natural language processing research has also shown that encoding words as vectors can capture additional relationships (e.g. https://arxiv.org/pdf/1509.01692). Overall the results are promising, but as far as I know there are few formal proofs behind many of the breakthroughs, just observational evidence that they “work” under various experimental conditions.
Failure modes
Prevailing wisdom is to not use vector embeddings and similarity comparisons for single words, due to lack of context. Here, context means the surrounding words available when producing a vector; the context a word appears in indicates which of its many meanings was intended. This caveat alone is very indicative of the problem with vector embeddings: they effectively collapse the meaning of words, phrases, or sentences down to a single point, and are highly sensitive to the size of the document chunk (how finely we break up a document for search), among other issues.
Some of the issues we encountered:
When comparing antonyms (words or phrases whose predominant senses contradict each other) we found that they quite often produced similar vectors; in particular, “not + WORD” was often closer to WORD than WORD’s synonyms were, e.g. ‘solvent’, ‘not solvent’, ‘insolvent’.
When comparing synonyms (words or phrases whose predominant senses mean the same thing) we found that the similarity score was not high enough to establish a stable threshold. For example, ‘liability’ and ‘debt’ did not score meaningfully differently than ‘liability’ and ‘liability manager’.
These might sound nitpicky, but when attempting to search based on similarity, or even to re-rank results, this lack of distinction introduces a large amount of noise into the results, which undermines good search.
In either case we couldn’t establish a reasonable similarity/dissimilarity threshold that was stable across our real-world test data. We of course tried a variety of vector embedding models (e.g. msmarco-distilbert-dot-v5, uform3-image-text-english-small, etc.), but had to limit ourselves to something that could run at interactive speeds, since we had to encode the user-provided query at query time.
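Had we built the kind of rapid validation harness recommended in the conclusion earlier, it might have looked something like the sketch below: probe each candidate model with antonym/synonym pairs drawn from the target domain and check whether any similarity threshold cleanly separates them. The probe pairs and the second model name (all-MiniLM-L6-v2) are illustrative assumptions, not an exhaustive test suite.

```python
# Sketch of a quick embedding-model validation harness: score domain-specific
# antonym/synonym probes and inspect whether a stable threshold exists.
from sentence_transformers import SentenceTransformer, util

probes = [
    # (text_a, text_b, should_be_similar)
    ("the company is solvent", "the company is insolvent", False),
    ("the company is solvent", "the company is not solvent", False),
    ("total liabilities", "total debt", True),
    ("liability", "liability manager", False),
]

for model_name in ["msmarco-distilbert-dot-v5", "all-MiniLM-L6-v2"]:
    model = SentenceTransformer(model_name)
    print(model_name)
    for text_a, text_b, expected_similar in probes:
        vec_a, vec_b = model.encode([text_a, text_b], convert_to_tensor=True)
        score = util.cos_sim(vec_a, vec_b).item()
        expectation = "high" if expected_similar else "low"
        print(f"  {score:.3f}  expect {expectation}: {text_a!r} vs {text_b!r}")
```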
To further complicate matters, our application involved searching across text from a variety of domains, with vocabulary that evolves over time (true of most domains). This made the already unappealing prospect of tuning that much more troublesome. Model tuning carries a number of hazards of its own, on top of the cost of continuous retuning and validation and the need to recompute all vectors and search indexes after each change.
Conclusion
The aim of this post is to encourage caution in those hoping vector similarity search will work in their context and/or come without significant drawbacks. If we were to persist down the vector embedding route, we’d set up a tuning and testing pipeline that continually ensures we have strong semantic separation for our workload, rather than relying on models working off the shelf. If you are considering going down this route, and semantic similarity search via vector embeddings is a core part of the system you’re building, I urge you to at least set up a test system to rapidly validate the quality of embedding models for your domain early in your development phase.
At Web Summit we met a number of other exhibitors who also ran into similar (hah!) limitations of similarity search and were actively looking for alternatives. In our discussions we found that the limitations of vector-embedding-based search are insufficiently discussed in broader industry discourse, and we’re glad to see more work happening in at least academic circles (e.g. On the Theoretical Limitations of Embedding-Based Retrieval). These limitations are precisely why we’re working on a new approach to semantic similarity search, one built on a novel take on word sense disambiguation; we'll elaborate on this in future posts.
