There has been a surge of interest in vector search, thanks to a new generation of machine learning models that can represent all sorts of content as vectors: text, images, events, and more. Often called "embedding models", they produce powerful representations that capture the similarity between two pieces of content in a way that goes beyond surface-level characteristics.
k-nearest neighbor (kNN) search algorithms find the vectors in a dataset that are most similar to a query vector. Paired with these vector representations, kNN search opens up exciting possibilities for retrieval:
- Finding passages likely to contain the answer to a question
- Detecting near-duplicate images in a large dataset
- Finding songs that sound similar to a given song
Vector search is poised to become an important component of the search toolbox, alongside traditional techniques like term-based scoring.
Elasticsearch currently supports storing vectors through the dense_vector field type and using them to calculate document scores. This lets users perform an exact, brute-force kNN search by scanning all documents. Elasticsearch 8.0 builds on this functionality to support fast, approximate nearest neighbor (ANN) search. This is a much more scalable approach, allowing vector search to run efficiently on large datasets.
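For context, here is a minimal sketch of that exact approach. The index name, field name, dimension count, and vector values are illustrative; the search uses a script_score query with cosineSimilarity, one of the built-in vector functions (adding 1.0 keeps scores non-negative):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "my-vector": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}

GET my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my-vector') + 1.0",
        "params": { "query_vector": [0.1, 3.2, 2.1] }
      }
    }
  }
}
```

Because the script runs against every matching document, latency grows linearly with the size of the dataset, which is exactly the limitation ANN search addresses.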
ANN in Elasticsearch
What is approximate nearest neighbor search?
There are well-established data structures for kNN on low-dimensional vectors, like KD-trees. In fact, Elasticsearch incorporates KD-trees to support searches on geospatial and numeric data. But modern embedding models for text and images typically produce high-dimensional vectors of 100–1000 elements, or even more. These vector representations present a unique challenge, since the curse of dimensionality makes it very difficult to find nearest neighbors efficiently in high dimensions.
Faced with this difficulty, nearest neighbor algorithms usually sacrifice perfect accuracy to improve their speed. These approximate nearest neighbor (ANN) algorithms may not always return the true k nearest vectors. But they run efficiently, scaling to large datasets while still returning most of the true neighbors in practice.
Choosing an ANN algorithm
Designing ANN algorithms is an active area of academic research, and there are many promising algorithms to choose from. They often present different trade-offs in terms of search speed, implementation complexity, and indexing cost. Thankfully, there is a great open source project called ann-benchmarks that tests the leading algorithms against several datasets and publishes comparisons.
Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into a graph based on their similarity to each other. HNSW shows strong search performance across a variety of ann-benchmarks datasets, and also did well in our own testing. Another benefit of HNSW is that it’s widely used in industry, having been implemented in several different systems. In addition to the original academic paper, there are many helpful resources for learning about the algorithm’s details. Although Elasticsearch ANN is currently based on HNSW, the feature is designed in a flexible way to let us incorporate different approaches in the future.
Show me the code!
To index vectors for ANN search, we need to set index: true and specify the similarity metric we'll use to compare them:
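The sketch below assumes an index named my-index and a field named my-vector; the names, the small dimension count (kept at 3 so the example stays short), and the vector values are all illustrative. The mapping enables ANN indexing on the field and selects l2_norm as the similarity metric:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "my-vector": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}
```

Once documents are indexed, the _knn_search endpoint (released as a technical preview in 8.0) runs the approximate search. The k parameter sets how many neighbors to return, while num_candidates controls how many candidates each shard considers, trading speed for accuracy:

```
GET my-index/_knn_search
{
  "knn": {
    "field": "my-vector",
    "query_vector": [54.2, 10.5, 0.3],
    "k": 10,
    "num_candidates": 100
  }
}
```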