Since the 8.0 release introduced support for third-party natural language processing (NLP) models for text embeddings, users of the Elastic Stack have had access to a wide variety of models for embedding their text documents and performing query-based information retrieval using vector search.
Given the variety of models, retrieval techniques, and parameters involved, and depending on the text corpus you want to search, it can be overwhelming to choose which settings will give the best search relevance.
In this series of blog posts, we will introduce a number of tests we ran using various publicly available data sets and information retrieval techniques that are available in the Elastic Stack. We’ll then provide recommendations of the best techniques to use depending on the setup.
To kick off this series of blogs, we want to set the stage by describing the problem we are addressing and introducing some of the methods we will dig into further in subsequent posts.
Background and terminology
BM25: A sparse, unsupervised model for lexical search
The classic way Elasticsearch ranks documents for relevance against a text query uses the Lucene implementation of the Okapi BM25 model. Although a few hyperparameters of this model have been tuned to optimize results in most scenarios, the technique is considered unsupervised because labeled queries and documents are not required to use it: the model is very likely to perform reasonably well on any corpus of text without relying on annotated data. Indeed, BM25 is known to be a strong baseline in zero-shot retrieval settings.
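In Elasticsearch, a plain match query is scored with BM25 out of the box. Below is a minimal sketch using the Python client; the cluster URL, index name, field name, and query text are placeholders.

```python
from elasticsearch import Elasticsearch

# Placeholder connection; point this at your own cluster.
es = Elasticsearch("http://localhost:9200")

# A plain match query is scored with BM25 by default.
# "articles" and "body" are hypothetical index and field names.
response = es.search(
    index="articles",
    query={"match": {"body": "how to tune search relevance"}},
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["body"][:80])
```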
Under the hood, this kind of model builds a matrix of term frequencies (how many times a term appears in each document) and inverse document frequencies (which down-weight terms that appear in many documents). It then scores each query term against each indexed document based on those frequencies. Because each document typically contains only a small fraction of all the words used in the corpus, the matrix is mostly zeros. This is why this type of representation is called sparse.
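Putting these frequencies together, the BM25 score of a document D for a query Q is a sum of per-term contributions (Lucene's implementation tweaks the IDF term slightly and uses k1 = 1.2 and b = 0.75 by default):

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, and $\mathrm{avgdl}$ is the average document length across the corpus.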
Also note that this model simply sums the relevance score of each individual query term for a document, without taking any semantic knowledge (synonyms, context, and so on) into account. This is called lexical search (as opposed to semantic search). Its main shortcoming is the so-called vocabulary mismatch problem: the query vocabulary may differ from the document vocabulary, so documents that express the same idea with different words are missed. This motivates other scoring models that try to incorporate semantic knowledge to avoid this problem.
Dense models: A dense, supervised model for semantic search
More recently, transformer-based models have allowed for a dense, context-aware representation of text, addressing the principal shortcomings mentioned above.
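Before looking at how such models are built, here is a minimal sketch of what dense retrieval looks like in practice, using the sentence-transformers library; the model name, query, and documents are illustrative placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# A small general-purpose embedding model, chosen only for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "how do vaccines work?"
documents = [
    "Vaccines train the immune system to recognize pathogens.",
    "The stock market closed higher on Friday.",
]

# Encode the query and the documents into dense vectors.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_embs)[0].tolist()
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```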
To build such models, the following steps are required:
1. Pre-training
We first need to train a neural network to understand the basic syntax of natural language.
Using a huge corpus of text, the model learns semantic knowledge by training on unsupervised tasks (like Masked Word Prediction or Next Sentence Prediction).
BERT is probably the best-known example of these models: it was trained on Wikipedia (2.5B words) and BookCorpus (800M words) using Masked Word Prediction, illustrated in the sketch below.
This is called pre-training. The model learns vector representations of language tokens, which can be adapted for other tasks with much less training.
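To make the Masked Word Prediction task concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline with a pre-trained BERT checkpoint; the example sentence is arbitrary.

```python
from transformers import pipeline

# Load a pre-trained BERT model for masked word prediction.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position.
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['score']:.3f}  {prediction['token_str']}")
```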