There has been a surge of interest in vector search, thanks to a new generation of machine learning models that can represent all sorts of content as vectors: text, images, events, and more. Often called "embedding models", they produce powerful representations that capture the similarity between two pieces of content in a way that goes beyond surface-level characteristics.
k-nearest neighbor (kNN) search algorithms find the vectors in a dataset that are most similar to a query vector. Paired with these vector representations, kNN search opens up exciting possibilities for retrieval:
- Finding passages likely to contain the answer to a question
- Detecting near-duplicate images in a large dataset
- Finding songs that sound similar to a given song
Vector search is poised to become an important component of the search toolbox, alongside traditional techniques like term-based scoring.
Elasticsearch currently supports storing vectors through the dense_vector field type and using them to calculate document scores. This lets users perform an exact, brute-force kNN search by scanning all documents. Elasticsearch 8.0 builds on this functionality to support fast, approximate nearest neighbor (ANN) search. This is a much more scalable approach, allowing vector search to run efficiently on large datasets.
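For context, here is a minimal sketch of that exact approach. The index name, field name, dimension count, and vector values are illustrative; the search uses a script_score query with cosineSimilarity, one of the built-in vector functions (adding 1.0 keeps scores non-negative):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "my-vector": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}

GET my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my-vector') + 1.0",
        "params": { "query_vector": [0.1, 3.2, 2.1] }
      }
    }
  }
}
```

Because the script runs against every matching document, latency grows linearly with the size of the dataset, which is exactly the limitation ANN search addresses.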
ANN in Elasticsearch
What is approximate nearest neighbor search?
There are well-established data structures for kNN on low-dimensional vectors, like KD-trees. In fact, Elasticsearch incorporates KD-trees to support searches on geospatial and numeric data. But modern embedding models for text and images typically produce high-dimensional vectors of 100–1000 elements, or even more. These vector representations present a unique challenge, since the curse of dimensionality makes it very difficult to find nearest neighbors efficiently in high dimensions.
Faced with this difficulty, nearest neighbor algorithms usually sacrifice perfect accuracy to improve their speed. These approximate nearest neighbor (ANN) algorithms may not always return the true k nearest vectors. But they run efficiently, scaling to large datasets while still returning most of the true neighbors in practice.
Choosing an ANN algorithm
Designing ANN algorithms is an active area of academic research, and there are many promising algorithms to choose from. They often present different trade-offs in terms of search speed, implementation complexity, and indexing cost. Thankfully, there is a great open source project called ann-benchmarks that tests the leading algorithms against several datasets and publishes comparisons.
Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into a graph based on their similarity to each other. HNSW shows strong search performance across a variety of ann-benchmarks datasets, and also did well in our own testing. Another benefit of HNSW is that it’s widely used in industry, having been implemented in several different systems. In addition to the original academic paper, there are many helpful resources for learning about the algorithm’s details. Although Elasticsearch ANN is currently based on HNSW, the feature is designed in a flexible way to let us incorporate different approaches in the future.
Show me the code!
To index vectors for ANN search, we need to set index: true and specify the similarity metric we'll use to compare them:
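The sketch below assumes an index named my-index and a field named my-vector; the names, the small dimension count (kept at 3 so the example stays short), and the vector values are all illustrative. The mapping enables ANN indexing on the field and selects l2_norm as the similarity metric:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "my-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}
```

Once documents are indexed, the _knn_search endpoint (released as a technical preview in 8.0) runs the approximate search. The k parameter sets how many neighbors to return, while num_candidates controls how many candidates each shard considers, trading speed for accuracy:

```
GET my-index/_knn_search
{
  "knn": {
    "field": "my-vector",
    "query_vector": [54.2, 10.5, 0.3],
    "k": 10,
    "num_candidates": 100
  }
}
```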