Since the 8.0 release introduced support for third-party natural language processing (NLP) models for text embeddings, users of the Elastic Stack have had access to a wide variety of models for embedding their text documents and performing query-based information retrieval using vector search.
Given the variety of models, retrieval techniques, and parameters involved, and depending on the text corpus you want to search, it can be overwhelming to choose which settings will give the best search relevance.
In this series of blog posts, we will introduce a number of tests we ran using various publicly available data sets and information retrieval techniques that are available in the Elastic Stack. We’ll then provide recommendations of the best techniques to use depending on the setup.
To kick off this series of blogs, we want to set the stage by describing the problem we are addressing and introducing some of the methods we will dig into further in subsequent posts.
Background and terminology
BM25: A sparse, unsupervised model for lexical search
The classic way Elasticsearch ranks documents for relevance against a text query uses the Lucene implementation of the Okapi BM25 model. Although a few hyperparameters of this model have been tuned to optimize results in most scenarios, the technique is considered unsupervised because labeled queries and documents are not required to use it: the model is very likely to perform reasonably well on any corpus of text without relying on annotated data. Indeed, BM25 is known to be a strong baseline in zero-shot retrieval settings.
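In Elasticsearch, a plain match query is scored with BM25 out of the box. Below is a minimal sketch using the Python client; the cluster URL, index name, field name, and query text are placeholders.

```python
from elasticsearch import Elasticsearch

# Placeholder connection; point this at your own cluster.
es = Elasticsearch("http://localhost:9200")

# A plain match query is scored with BM25 by default.
# "articles" and "body" are hypothetical index and field names.
response = es.search(
    index="articles",
    query={"match": {"body": "how to tune search relevance"}},
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["body"][:80])
```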
Under the hood, this kind of model builds a matrix of term frequencies (how many times a term appears in each document) and inverse document frequencies (which down-weight terms that appear in many documents). It then scores each query term against each indexed document based on those frequencies. Because each document typically contains only a small fraction of all the words used in the corpus, the matrix is mostly zeros. This is why this type of representation is called sparse.
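Putting these frequencies together, the BM25 score of a document D for a query Q is a sum of per-term contributions (Lucene's implementation tweaks the IDF term slightly and uses k1 = 1.2 and b = 0.75 by default):

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \,\frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, and $\mathrm{avgdl}$ is the average document length across the corpus.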
Also note that this model simply sums the relevance score of each individual query term for a document, without taking any semantic knowledge (synonyms, context, and so on) into account. This is called lexical search (as opposed to semantic search). Its main shortcoming is the so-called vocabulary mismatch problem: the query vocabulary may differ from the document vocabulary, so documents that express the same idea with different words are missed. This motivates other scoring models that try to incorporate semantic knowledge to avoid this problem.
Dense models: A dense, supervised model for semantic search
More recently, transformer-based models have allowed for a dense, context-aware representation of text, addressing the principal shortcomings mentioned above.
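Before looking at how such models are built, here is a minimal sketch of what dense retrieval looks like in practice, using the sentence-transformers library; the model name, query, and documents are illustrative placeholders.

```python
from sentence_transformers import SentenceTransformer, util

# A small general-purpose embedding model, chosen only for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "how do vaccines work?"
documents = [
    "Vaccines train the immune system to recognize pathogens.",
    "The stock market closed higher on Friday.",
]

# Encode the query and the documents into dense vectors.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_embs)[0].tolist()
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```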
To build such models, the following steps are required:
1. Pre-training
We first need to train a neural network to understand the basic syntax of natural language.
Using a huge corpus of text, the model learns semantic knowledge by training on unsupervised tasks (like Masked Word Prediction or Next Sentence Prediction).
BERT is probably the best-known example of these models: it was trained on Wikipedia (2.5B words) and BookCorpus (800M words) using Masked Word Prediction, illustrated in the sketch below.
This is called pre-training. The model learns vector representations of language tokens, which can be adapted for other tasks with much less training.
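To make the Masked Word Prediction task concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline with a pre-trained BERT checkpoint; the example sentence is arbitrary.

```python
from transformers import pipeline

# Load a pre-trained BERT model for masked word prediction.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position.
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['score']:.3f}  {prediction['token_str']}")
```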