As part of our natural language processing (NLP) blog series, we will walk through an example of using a text embedding model to generate vector representations of textual content and demonstrate vector similarity search on the generated vectors. We will deploy a publicly available model on Elasticsearch and use it in an ingest pipeline to generate embeddings from textual documents. We will then show how to use those embeddings in vector similarity search to find semantically similar documents for a given query.
Vector similarity search, or semantic search as it is commonly called, goes beyond traditional keyword-based search and allows users to find semantically similar documents that may not share any keywords with the query, thus providing a wider range of results. Vector similarity search operates on dense vectors and uses k-nearest neighbour search to find similar vectors. For this, textual content first needs to be converted to a numeric vector representation using a text embedding model.
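To make the idea concrete, here is a minimal sketch of exact (brute-force) k-nearest neighbour search with cosine similarity in plain Python. The function names and the toy three-dimensional vectors are made up for illustration; Elasticsearch runs an approximate search over much larger vectors rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the ids of the k vectors most similar to the query.

    `vectors` maps a document id to its embedding. This is an exhaustive
    scan: fine for a toy example, too slow for millions of documents.
    """
    scored = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Toy 3-dimensional "embeddings" (hypothetical values).
docs = {
    "weather": [0.9, 0.1, 0.0],
    "climate": [0.8, 0.3, 0.1],
    "biology": [0.0, 0.2, 0.9],
}
nearest = knn([1.0, 0.2, 0.0], docs, k=2)
print(nearest)
```

The two documents whose vectors point in nearly the same direction as the query come back first, even though nothing here compares keywords.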
We will use a public dataset from the MS MARCO Passage Ranking Task for demonstration. It consists of real questions from the Microsoft Bing search engine and human-generated answers to them. This dataset is well suited to testing vector similarity search: firstly, because question answering is one of the most common use cases for vector search, and secondly, because the top papers in the MS MARCO leaderboard use vector search in some form.
In our example we will work with a sample of this dataset, use a model to produce text embeddings, and then run vector search on it. We will also do a quick verification of the quality of the results produced by the vector search.
The first step is to install a text embedding model. For our model we use msmarco-distilbert-base-tas-b from Hugging Face. This is a sentence-transformer model that takes a sentence or a paragraph and maps it to a 768-dimensional dense vector. This model is optimized for semantic search and was specifically trained on the MS MARCO Passage dataset, making it suitable for our task. Besides this model, Elasticsearch supports a number of other models for text embedding. The full list can be found here.
We install the model with the Eland docker agent that we built in the NER example. Running the script below imports the model into our local cluster and deploys it:
eland_import_hub_model \
--url https://<user>:<password>@localhost:9200/ \
--hub-model-id sentence-transformers/msmarco-distilbert-base-tas-b \
--task-type text_embedding \
--start
This time, --task-type is set to text_embedding and the --start option is passed to the Eland script, so the model is deployed automatically without having to start it in the Model Management UI. To speed up inference, you can increase the number of inference threads with the inference_threads parameter.
We can verify that the model was deployed successfully by running this example in Kibana Console:
POST /_ml/trained_models/sentence-transformers__msmarco-distilbert-base-tas-b/deployment/_infer
{
"docs": {
"text_field": "how is the weather in jamaica"
}
}
We should see the predicted dense vector as the result:
{
"predicted_value" : [
-0.0919460579752922,
-0.4940606653690338,
0.035987671464681625,
…
]
}
As mentioned in the introduction, we use the MS MARCO Passage Ranking dataset. The dataset is quite big, consisting of over 8 million passages. For our example, we use a subset of it that was used in the testing stage of the 2019 TREC Deep Learning Track. The file msmarco-passagetest2019-top1000.tsv, used for the re-ranking task, contains 200 queries and, for each query, a list of relevant text passages extracted by a simple IR system. From that file, we extracted all unique passages with their ids and put them into a separate TSV file, totaling 182469 passages. We use this file as our dataset.
We use Kibana’s file upload feature to upload this dataset. Kibana file upload allows us to provide custom names for fields. Let’s call them id, with type long, for the passages’ ids, and text, with type text, for the passages’ contents. The index name is collection. After the upload, we can see an index named collection with 182469 documents.
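The extraction of unique passages can be done in a few lines of Python. The sketch below assumes each row of the TSV file has four tab-separated columns in the order query id, passage id, query text, passage text; that column order is an assumption, so check it against your copy of the file. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def unique_passages(rows):
    """Yield (pid, passage) once per passage id, in first-seen order.

    Each row is assumed to be [qid, pid, query, passage].
    """
    seen = set()
    for qid, pid, query, passage in rows:
        if pid not in seen:
            seen.add(pid)
            yield pid, passage


# Tiny in-memory stand-in for the real TSV file (contents are made up).
sample = (
    "1\t10\tfirst query\tpassage ten\n"
    "1\t11\tfirst query\tpassage eleven\n"
    "2\t10\tsecond query\tpassage ten\n"
)
# QUOTE_NONE keeps quote characters inside passages as literal text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
passages = list(unique_passages(reader))
print(passages)
```

For the real file, replace the `io.StringIO` stand-in with an opened file handle and write each pair out as an `id<TAB>text` line.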
We want to process the initial data with an Inference processor that will add an embedding for each passage. For this, we create a text embedding ingest pipeline and then reindex our initial data with this pipeline.
In the Kibana Console we create an ingest pipeline (as we did in the previous blog post), this time for text embeddings, and call it text-embeddings. The passages are in a field named text. As we did before, we define a field_map to map text to the field text_field that the model expects. Similarly, the on_failure handler is set to index failures into a different index:
PUT _ingest/pipeline/text-embeddings
{
"description": "Text embedding pipeline",
"processors": [
{
"inference": {
"model_id": "sentence-transformers__msmarco-distilbert-base-tas-b",
"target_field": "text_embedding",
"field_map": {
"text": "text_field"
}
}
}
],
"on_failure": [
{
"set": {
"description": "Index document to 'failed-<index>'",
"field": "_index",
"value": "failed-{{{_index}}}"
}
},
{
"set": {
"description": "Set error message",
"field": "ingest.failure",
"value": "{{_ingest.on_failure_message}}"
}
}
]
}
We want to reindex documents from the collection index into a new collection-with-embeddings index by pushing them through the text-embeddings pipeline, so that documents in the collection-with-embeddings index have an additional field for the passages’ embeddings. Before we do that, we need to create the destination index and define its mapping, in particular for the field text_embedding.predicted_value, where the ingest processor will store the embeddings. If we don’t, the embeddings will be indexed into regular float fields and can’t be used for vector similarity search. The model we use produces 768-dimensional vectors, hence we use the indexed dense_vector field type with 768 dims, as follows:
PUT collection-with-embeddings
{
"mappings": {
"properties": {
"text_embedding.predicted_value": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine"
},
"text": {
"type": "text"
}
}
}
}
Finally, we are ready to reindex. Since the reindex will take some time to process all documents and run inference on them, we run it in the background by invoking the API with the wait_for_completion=false flag.
POST _reindex?wait_for_completion=false
{
"source": {
"index": "collection"
},
"dest": {
"index": "collection-with-embeddings",
"pipeline": "text-embeddings"
}
}
The above returns a task id. We can monitor progress of the task with:
GET _tasks/<task_id>
Alternatively, track progress by watching Inference count increase in the model stats API or model stats UI.
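If you poll the task from a script, progress can be estimated from the task's status object: the sum of its created, updated, and deleted counters approaches total as the reindex proceeds. Below is a minimal sketch of that calculation; the function name is hypothetical and the sample response is a trimmed, made-up example rather than real output.

```python
def reindex_progress(task_response):
    """Fraction of documents processed so far, between 0.0 and 1.0."""
    status = task_response["task"]["status"]
    done = status["created"] + status["updated"] + status["deleted"]
    total = status["total"]
    # Guard against a task that has not counted its documents yet.
    return done / total if total else 0.0


# Trimmed, made-up example of a GET _tasks/<task_id> response.
sample = {
    "completed": False,
    "task": {
        "status": {"total": 182469, "created": 45000, "updated": 0, "deleted": 0}
    },
}
print(f"{reindex_progress(sample):.1%}")
```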
The reindexed documents now contain the inference results, that is, the vector embeddings. For example, one of the documents looks like this:
{
"id": 7130104,
"text": "This is the definition of RNA along with examples of types of RNA molecules. This is the definition of RNA along with examples of types of RNA molecules. RNA Definition",
"text_embedding":
{
"predicted_value":
[
0.057356324046850204,
0.1602816879749298,
-0.18122544884681702,
0.022277727723121643,
....
],
"model_id": "sentence-transformers__msmarco-distilbert-base-tas-b"
}
}
Currently, generating embeddings implicitly from query terms during a search request is not supported, so our semantic search is organized as a two-step process:
- Obtaining a text embedding from a textual query. For this, we use the _infer API of our model.
- Using vector search to find documents semantically similar to the query text. In Elasticsearch v8.0, we introduced a new _knn_search endpoint that allows efficient approximate nearest neighbour search on indexed dense_vector fields. We use the _knn_search API to find the closest documents.
For example, given the textual query “how is the weather in jamaica”, we first run the _infer API to get its embedding as a dense vector:
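When scripting this from a client application, the two steps chain together naturally: the predicted_value returned by the first call becomes the query_vector of the second. The sketch below only builds the request paths and bodies, mirroring the Console requests in this post; the helper names are hypothetical, the HTTP client is deliberately left out, and the short vector stands in for a real 768-dimensional embedding.

```python
MODEL_ID = "sentence-transformers__msmarco-distilbert-base-tas-b"
EMBEDDING_FIELD = "text_embedding.predicted_value"


def infer_request(text):
    """Path and body for step 1: embed the query text."""
    path = f"/_ml/trained_models/{MODEL_ID}/deployment/_infer"
    body = {"docs": {"text_field": text}}
    return path, body


def knn_search_request(index, infer_response, k=10, num_candidates=100):
    """Path and body for step 2: search with the embedding from step 1."""
    body = {
        "knn": {
            "field": EMBEDDING_FIELD,
            "query_vector": infer_response["predicted_value"],
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["id", "text"],
    }
    return f"/{index}/_knn_search", body


# Step 1: build the inference request for the query text.
infer_path, infer_body = infer_request("how is the weather in jamaica")

# Stand-in for the response to step 1 (shortened, hypothetical vector).
fake_infer_response = {"predicted_value": [-0.09, -0.49, 0.04]}

# Step 2: feed the embedding into the kNN search request.
search_path, search_body = knn_search_request(
    "collection-with-embeddings", fake_infer_response
)
print(search_path)
```

Send each path and body with the HTTP client of your choice, using POST for the first request and GET for the second, as in the Console examples.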
POST /_ml/trained_models/sentence-transformers__msmarco-distilbert-base-tas-b/deployment/_infer
{
"docs": {
"text_field": "how is the weather in jamaica"
}
}
After that, we plug the resulting dense vector into _knn_search as follows:
GET collection-with-embeddings/_knn_search
{
"knn": {
"field": "text_embedding.predicted_value",
"query_vector": [
-0.0919460579752922,
-0.4940606653690338,
0.035987671464681625,
…
],
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"text"
]
}
As a result, we get the 10 documents closest to the query, sorted by their proximity to the query:
"hits" : [
{
"_index" : "collection-with-embeddings",
"_id" : "6H_OsH8Bi5IvRzQ7g-Aa",
"_score" : 0.9527166,
"_source" : {
"id" : 6140,
"text" : "Ocho Rios Jamaica Weather - Winter ( December, January And February) The winters in this town are usually colder when compared to other parts of the island. The average temperature for December, January and February are 81 °F and 79 °F respectively. All three months usually have a high temperature of 84 °F."
}
},
{
"_index" : "collection-with-embeddings",
"_id" : "6n_OsH8Bi5IvRzQ7g-Aa",
"_score" : 0.95225316,
"_source" : {
"id" : 6142,
"text" : "Jamaica Weather and When to Go. Jamaica weather essentials. For more details on the current temperature, wind, and stuff like that you can check any search engine weather feature. The rainy months, also called the rainy season, are generally from the end of April, or early May, until the end of September or early October."
}
},
{
"_index" : "collection-with-embeddings",
"_id" : "5n_OsH8Bi5IvRzQ7g-Aa",
"_score" : 0.9394933,
"_source" : {
"id" : 6138,
"text" : "Quick Answer. Hurricane season in Jamaica starts on June 1 and ends on Nov. 30. Satellite weather forecasts work to allow tourists and island dwellers adequate time to take precautions when hurricanes approach during those months. Continue Reading."
}
},
…
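The _score values can be related back to cosine similarity. Assuming the scoring rule documented for dense_vector fields with cosine similarity, _score = (1 + cosine) / 2, the conversion is a one-liner. The function name below is our own, and the input is the score of the first hit shown above.

```python
def cosine_from_score(score):
    """Invert _score = (1 + cosine) / 2 to recover the cosine similarity."""
    return 2.0 * score - 1.0


# _score of the first hit in the response above.
top_hit_cosine = cosine_from_score(0.9527166)
print(round(top_hit_cosine, 4))
```

Under that assumption, a score close to 1 means the query and passage vectors point in almost the same direction, while a score of 0.5 corresponds to orthogonal vectors.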
As we used only a subset of the MS MARCO dataset, we cannot do a full evaluation. What we can do instead is a simple verification on a few queries, just to get a sense that we are indeed getting relevant results and not random ones. From the TREC 2019 Deep Learning Track judgments for the Passage Ranking Task, we take the last 3 queries, submit them to our vector similarity search, get the top 10 results, and consult the TREC judgments to see how relevant those results are. For the Passage Ranking task, passages are judged on a four-point scale of Irrelevant (0), Related (the passage is on-topic but does not answer the question) (1), Highly Relevant (2), and Perfectly Relevant (3).
Please note that our verification is not a rigorous evaluation; it is used only for our quick demo. Since we only indexed passages that are known to be related to the queries, it is a much easier task than the original passage retrieval task. In the future we intend to do a rigorous evaluation on the MS MARCO dataset.
Query #1124210 “tracheids are part of _____” submitted to our vector search returns the following results:
Passage id | Relevance rating | Passage |
---|---|---|
2258591 | 2 – highly relevant | Tracheid of oak shows pits along the walls. It is longer than a vessel element and has no perforation plates. Tracheids are elongated cells in the xylem of vascular plants that serve in the transport of water and mineral salts.Tracheids are one of two types of tracheary elements, vessel elements being the other. Tracheids, unlike vessel elements, do not have perforation plates.racheids provide most of the structural support in softwoods, where they are the major cell type. Because tracheids have a much higher surface to volume ratio compared to vessel elements, they serve to hold water against gravity (by adhesion) when transpiration is not occurring. |
2258592 | 3 – perfectly relevant | Tracheid. a dead lignified plant cell that functions in water conduction. Tracheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae.Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.racheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae. Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores. |
2728448 | 2 – highly relevant | The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants. |
7443586 | 2 – highly relevant | 1 The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants. |
8026737 | 2 – highly relevant | Its major components include xylem parenchyma, xylem fibers, tracheids, and xylem vessels. Tracheids are one of the two types of tracheary elements of vascular plants. (The other being the vessel elements). A tracheid cell loses its protoplast at maturity. Thus, at maturity, it becomes one of the non-living components of the xylem. |
2258595 | 2 – highly relevant | Summary: Vessels have perforations at the end plates while tracheids do not have end plates. Tracheids are derived from single individual cells while vessels are derived from a pile of cells. Tracheids are present in all vascular plants whereas vessels are confined to angiosperms.Tracheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements.Vessels are broader than tracheids with which they are associated.Morphology of the perforation plate is different from that in tracheids.racheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements. Vessels are broader than tracheids with which they are associated. Morphology of the perforation plate is different from that in tracheids. |
181177 | 2 – highly relevant | In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. |
131190 | 3 – perfectly relevant | Xylem tracheids are pointed, elongated xylem cells, the simplest of which have continuous primary cell walls and lignified secondary wall thickenings in the form of rings, hoops, or reticulate networks. |
2258597 | 2 – highly relevant | Thank you… In plants xylem and phloem are the complex tissues which are the components parts of conductive system. In higher plants xylem contains tracheids, vessels (tracheae), xylem fibres(wood fibres) and xylem parenchyma (wood parenchyma).Tracheids These are elongated narrow tube like cells with hard thick and lignified walls with large cell cavity.hank you… In plants xylem and phloem are the complex tissues which are the components parts of conductive system. In higher plants xylem contains tracheids, vessels (tracheae), xylem fibres(wood fibres) and xylem parenchyma (wood parenchyma). |
6541866 | 2 – highly relevant | In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.n most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes. |
Query #1129237 “hydrogen is a liquid below what temperature” returns the following results:
Passage id | Relevance rating | Passage |
---|---|---|
128984 | 3 – perfectly relevant | Hydrogen gas has the molecular formula H 2. At room temperature and under standard pressure conditions, hydrogen is a gas that is tasteless, odorless and colorless. Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel. |
5906130 | 3 – perfectly relevant | Rating Newest Oldest. Best Answer: Hydrogen, like water, can exist in 3 states….Solid, Liquid and Gas Its temperature as a solid is −259.14 °C’ Hydrogen melts to liquid at −252.87 °C. It boils and vaporises at -252.125 °C Just cooling or compressing Hydrogen won’t liquefy or freeze it. |
4254815 | 1 – related | Answer The boiling point of liquid hydrogen is 20.268 K (-252.88 °C or -423.184 °F) The freezing point of hydrogen is 14.025 K (-259.125 °C or -434. |
8588222 | 0 – irrelevant | Answer to: Hydrogen is a liquid below what temperature? By signing up, you’ll get thousands of step-by-step solutions to your homework questions…. for Teachers for Schools for Companies |
8588219 | 3 – perfectly relevant | User: Hydrogen is a liquid below what temperature? a. 100 degrees C c. -183 degrees C b. -253 degrees C d. 0 degrees C Weegy: Hydrogen is a liquid below 253 degrees C. User: What is the boiling point of oxygen? a. 100 degrees C c. -57 degrees C b. 8 degrees C d. -183 degrees C Weegy: The boiling point of oxygen is -183 degrees C. |
4254811 | 3 – perfectly relevant | At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at … -434 °F, it starts to solidify. |
128989 | 3 – perfectly relevant | Confidence votes 11.4K. At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at -434 °F, it starts to solidify. |
2697752 | 2 – highly relevant | Hydrogen’s state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold… Hydrogen’s state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold temperatures. Hydrogen’s state of matter can change when the temperature changes, becoming a liquid at temperatures between minus 423.18 and minus 434.49 degrees Fahrenheit. It becomes a solid at temperatures below minus 434.49 F.Due to its high flammability, hydrogen gas is commonly used in combustion reactions, such as in rocket and automobile fuels. |
6080460 | 3 – perfectly relevant | Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel.ydrogen is found in large amounts in giant gas planets and stars, it plays a key role in powering stars through fusion reactions. Hydrogen is one of two important elements found in water (H 2 O). Each molecule of water is made up of two hydrogen atoms bonded to one oxygen atom. |
3905802 | 3 – perfectly relevant | Hydrogen is found naturally in the molecular H2 form. To exist as a liquid, H2 must be cooled below hydrogen’s critical point of 33 K. However, for hydrogen to be in a fully liquid state without boiling at atmospheric pressure, it needs to be cooled to 20.28 K (−423.17 °F/−252.87 °C). |
Query #1133167 “how is the weather in jamaica” returns the following results:
Passage id | Relevance rating | Passage |
---|---|---|
3023123 | 2 – highly relevant | Climate – Jamaica. Temperature, rainfall, prevailing weather conditions, when to go, what to pack. In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C, and minimum temperatures around 20/23 °C. |
434121 | 2 – highly relevant | Temperature, rainfall, prevailing weather conditions, when to go, what to pack. In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C (81/86 °F), and minimum temperatures around 20/23 °C (68/73 °F). |
4922619 | 2 – highly relevant | Weather. Jamaica averages about 80 degrees year-round, so climate is less a factor in booking travel than other destinations. The days are warm and the nights are cool. Rain usually falls for short periods in the late afternoon, with sunshine the rest of the day. |
434119 | 2 – highly relevant | Map from Google – Jamaica. 1 In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C (81/86 °F), and minimum temperatures around 20/23 °C (68/73 °F). |
8255706 | 2 – highly relevant | And it’s absolutely true. This is Jamaica weather! Most of our days are filled with warmth and sunshine, even during the rainy season. Jamaica has a tropical climate with hot and humid weather at sea level. The higher inland regions have a more temperate climate. (Bring a light jacket just in case you travel to the mountains where temperatures can be 10 degrees cooler or in case you go on a windy boat ride). |
190806 | 2 – highly relevant | It is always important to know what the weather in Jamaica will be like before you plan and take your vacation. For the most part, the average temperature in Jamaica is between 80 °F and 90 °F (27 °FCelsius-29 °Celsius). Luckily, the weather in Jamaica is always vacation friendly. You will hardly experience long periods of rain fall, and you will become accustomed to weeks upon weeks of sunny weather. |
1824486 | 2 – highly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably… |
4498474 | 3 – perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
1824480 | 3 – perfectly relevant | Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
434125 | 3 – perfectly relevant | The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. |
As we can see, for all 3 queries Elasticsearch returned mostly relevant results, and the top result for each query was either highly or perfectly relevant.
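The same observation can be put into numbers with a simple precision-at-10 calculation over the relevance ratings listed in the tables. In the sketch below, the grades are copied from those tables; treating grades 2 and 3 as "relevant" is our own assumption for this quick check, and the function name is hypothetical.

```python
# Relevance grades of the top 10 results per query, copied from the tables.
grades = {
    "1124210": [2, 3, 2, 2, 2, 2, 2, 3, 2, 2],
    "1129237": [3, 3, 1, 0, 3, 3, 3, 2, 3, 3],
    "1133167": [2, 2, 2, 2, 2, 2, 2, 3, 3, 3],
}


def precision_at_k(result_grades, k=10, min_grade=2):
    """Share of the top-k results whose grade is at least min_grade."""
    top = result_grades[:k]
    return sum(1 for grade in top if grade >= min_grade) / len(top)


precisions = {qid: precision_at_k(g) for qid, g in grades.items()}
for qid, value in precisions.items():
    print(qid, value)
```

Only the hydrogen query has results below the threshold (one "related" and one "irrelevant" passage), which matches the tables.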
NLP is a powerful feature in the Elastic Stack with an exciting roadmap. Discover new features and keep up with the latest developments by building your cluster in Elastic Cloud. Sign up for a free 14-day trial today and try the examples in this blog.