A quantitative approach to prompt tuning and LLM evaluation
Elastic has long been developing machine learning (ML) and AI-powered security detections, and we constantly adopt new technologies as they become available to make our users’ lives easier. With the rise of generative AI (GenAI), we have developed even more Elastic Security features that put this powerful new technology to work. Among them are:
- Elastic AI Assistant for Security: Our chatbot is built to answer questions about Elastic Security, help generate or translate natural language queries to ES|QL, provide context on alerts, and integrate with custom knowledge sources for bespoke organizational questions.
- Attack Discovery (AD): This feature reviews alerts and discovers any active attacks, prioritizing and summarizing them for the user.
- Automatic Import: This feature creates custom integrations based on a few sample log lines, alleviating the burden of creating parsing logic and normalization pipelines.
As anyone familiar with GenAI development knows, the field is growing rapidly. Elastic is in a unique position: we have real, proven GenAI-powered products serving users at scale, not tinkering or proofs of concept. This unique position is two-fold. First, we closely partner with and use leading GenAI development frameworks; in fact, LangChain named us #2 in the Top 5 LangGraph Agents in Production 2024, and Amazon Web Services named us GenAI Infrastructure and Data Partner of the Year.
Driving GenAI development
Elastic also creates GenAI development tools, which power not only our own products but also those built by users of the Elastic Stack. Elasticsearch is the world’s most widely downloaded vector database, supporting RAG applications around the world. This combination gives us a driver’s-seat view of GenAI development, which we aim to share with anyone interested in building a production-grade GenAI system.
In this blog, we’ll take you behind the scenes of how our Security GenAI and Security ML teams develop and improve these GenAI features. How do we quantitatively ensure that each improvement is really better? Because we are in production and serving enterprise users at scale, we needed a robust, reproducible way to tune prompts and evaluate various large language model (LLM) providers.
Constant improvements: Making security analysts’ lives easier
Since the release of Elastic AI Assistant in June 2023, delivering high-quality results to our users has been a top priority. Fast forward to 2025, we’ve not only rolled out numerous enhancements to the AI Assistant but also introduced groundbreaking features, such as Attack Discovery and Automatic Import. Throughout the development of these features and enhancements, we meticulously evaluated the quality of the outputs generated by various LLMs, continuously refining prompts and underlying code to meet our high standards.
Elastic AI Assistant for Security
One notable example is AI Assistant’s natural language-to-ES|QL generation functionality. To ensure AI Assistant returned valid ES|QL queries from natural language inputs, we started with a hands-on, largely manual approach. We created a spreadsheet of realistic queries that an analyst might use in a security operations center (SOC). Each query was manually entered into the AI Assistant, and the responses were recorded and compared against the expected outputs.
While effective, this process was time-intensive. When LangSmith became available, we quickly integrated it into our workflow, enabling us to trace and debug with greater efficiency. LangSmith’s evaluation capabilities also allowed us to build the first iteration of our internal evaluation framework. This framework supports automated evaluations based on a set of parameters, including a list of LLMs and input datasets. With these tools, we successfully transitioned from manual to automated evaluations, significantly improving our workflow.
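To make this concrete, here is a minimal sketch of what an automated evaluation run can look like with the LangSmith Python SDK. This is not Elastic’s production code; the dataset name, target function, and evaluator below are illustrative assumptions.

```python
# A minimal sketch of an automated evaluation run with the LangSmith Python SDK.
# Requires a LangSmith API key (LANGSMITH_API_KEY or LANGCHAIN_API_KEY) in the environment.
# The dataset name, target function, and evaluator are illustrative assumptions.
from langsmith.evaluation import evaluate

def generate_esql(inputs: dict) -> dict:
    """Target under test: natural-language-to-ES|QL generation (stubbed here)."""
    question = inputs["question"]
    # ... call the AI Assistant / LLM under test with `question` here ...
    return {"query": 'FROM logs-* | WHERE user.name == "foo" | LIMIT 10'}

def esql_matches_reference(run, example) -> dict:
    """Evaluator: does the generated query match the curated reference query?"""
    generated = (run.outputs or {}).get("query", "").strip().lower()
    expected = (example.outputs or {}).get("query", "").strip().lower()
    return {"key": "esql_exact_match", "score": float(generated == expected)}

results = evaluate(
    generate_esql,                        # the system under test
    data="esql-generation-goldens",       # a curated LangSmith dataset (hypothetical name)
    evaluators=[esql_matches_reference],  # one or more scoring functions
    experiment_prefix="esql-prompt-v2",   # groups runs for later comparison
)
```

Looping a run like this over a list of LLM connectors and prompt variants is what turns a one-off check into a reproducible comparison.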
Attack Discovery
Evaluating Attack Discovery presented a more complex challenge for two key reasons.
- AD’s input consists of sets of alerts representing one or more malicious attack scenarios. Creating realistic input alerts was essential to assess AD’s performance effectively.
- Determining the ideal output required expertise in cybersecurity. AD’s goal is to explain malicious attacks chronologically and in a narrative style that can be easily understood by security analysts of all levels. This need for expert judgment meant that early evaluations relied heavily on manual review from Elastic’s security experts, who also provided the engineering team with realistic alert sets for testing.
Over time, our evaluation process has evolved into a robust framework designed to ensure that our GenAI features deliver tangible value to our security customers. In the sections that follow, we’ll dive deeper into the latest state of this framework and explore how we use it to ensure the quality and reliability of our AI-powered solutions.
GenAI evaluation framework: Knowing — not guessing — that each improvement is better
As mentioned in the previous section, we started using LangSmith and LangGraph together, enabling us to capture traces of each LLM call. On top of that, we developed a tailored evaluation framework, which became an essential tool in our development process. As we developed more improvements, there was more to consider: Which LLM should we pick? (Our recommended LLM matrix is one outcome of those tests.) And which prompts and variations perform best?
Here are the components of the evaluation framework, which we will walk through in detail in the following sections:
- Test scenarios: Diverse scenarios that the user may come across, each with its own gold standard examples
- Curated test dataset: A collection of gold standard examples covering the various test scenarios
- Tracing: Capturing the AI Agent execution graph as well as LLM calls and run metadata
- Evaluation rubrics: Various behavior rubrics, for example: Does this response seem like a hallucination? Does this response capture all the known user IDs in the query?
- Scoring mechanism: A mathematical way to calculate final scores based on business requirements or desired heuristics
First, we’ll go through the test scenarios and curated test dataset, as well as how we easily created and tracked them with LangSmith.
Test scenarios and curated test datasets
Since Attack Discovery helps Elastic Security users find attacks from alerts, we needed to consider various attack types. We initially validated against datasets from detonated malware samples hosted and shared on Elastician James Spiteri’s ohmymalware.com project, but we have since come up with many new attack scenarios, covering, for example, living-off-the-cloud attacks, various advanced persistent threats, and well-known vulnerabilities like Log4j (2021). Credit also goes to the incredible Elastic Security Labs team; one such evaluation scenario came from work presented at AWS re:Invent 2024.
For each scenario, we created a few expected responses. For some use cases, this might involve human-written outputs to compare with GenAI responses. But for our use case, we were able to run the scenarios through any LLM with a human-in-the-loop to decide if the result was good enough based on our criteria. For example, was the output clear to read from a user standpoint? And was the LLM summary accurate enough?
If the output is qualitatively good enough, we add it to our curated test dataset. Since we use LangGraph and LangSmith, this step is simple: the LangSmith UI lets us add an existing output directly to a dataset.
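The same thing can also be scripted. Here is a minimal sketch using the LangSmith Python SDK; the dataset name and example contents are hypothetical, with field names borrowed from the Attack Discovery rubric excerpt shown later in this post.

```python
# A minimal sketch of adding a vetted example to a curated dataset with the LangSmith SDK.
# The dataset name and example contents are hypothetical.
from langsmith import Client

client = Client()  # requires a LangSmith API key in the environment

dataset = client.create_dataset(
    dataset_name="attack-discovery-goldens",
    description="Curated gold standard examples for Attack Discovery scenarios",
)

client.create_examples(
    inputs=[{"alerts": ["<alert documents for a simulated Log4j intrusion scenario>"]}],
    outputs=[{"attackDiscoveries": [{"title": "...", "summaryMarkdown": "..."}]}],
    dataset_id=dataset.id,
)
```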
It is important to have these scenarios and test datasets in order to establish a baseline of “goodness” for GenAI outputs. We didn’t get to this point immediately; creating scenarios takes time up front, and since LLM outputs can vary widely, selecting curated examples can be difficult.
However, this ongoing effort has been well worth it because it tells us whether our improvements are actually making the product better. It also enables us to run automated LLM evaluations (“LLM-as-judge”) and experiments whenever we deploy a new change. The prompts used for the LLM judge can also be tuned. For simplicity, in this article we will refer to both the prompts used to generate outputs and the prompts used to judge them as prompts.
Tracing
Next, we’ll touch on the tracing components. As mentioned above, we use LangGraph to design and run our AI Agent workflows behind the scenes, while LangSmith provides the tracing capabilities as well as streamlined tools for us to create test datasets and run evaluations.
For completeness, the following image illustrates the high-level workflow of how the Elastic Security AI Agents work, from receiving a user request to generating the response. We use Elasticsearch as a vector database to power retrieval-augmented generation (RAG) functionality.
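As a rough illustration of that shape (not Elastic’s actual agent), a LangGraph workflow with a retrieval step and a generation step looks roughly like the sketch below; the state fields and node bodies are placeholders.

```python
# A heavily simplified RAG agent sketch in LangGraph. The state fields and node
# bodies are placeholders, not Elastic's actual implementation.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    question: str
    context: list[str]
    answer: str

def retrieve(state: AgentState) -> dict:
    # In a real system: run a vector or hybrid search against Elasticsearch and
    # return the documents most relevant to the user's question.
    return {"context": ["<retrieved knowledge base passages>"]}

def generate(state: AgentState) -> dict:
    # In a real system: call the configured LLM connector with a prompt built
    # from state["question"] and state["context"].
    return {"answer": "<LLM response grounded in the retrieved context>"}

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
app = graph.compile()

# With LangSmith tracing enabled (e.g., LANGCHAIN_TRACING_V2=true and an API key
# in the environment), every invocation of the graph is traced automatically.
result = app.invoke({"question": "Which hosts triggered the most alerts today?",
                     "context": [], "answer": ""})
```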
Note: For users to enable AI Assistant and Attack Discovery, an LLM connector is required. We support all major providers — see our documentation page for an up-to-date list.
Evaluation rubrics and scoring mechanism
Rubrics are a way of evaluating LLM outputs against defined desired behaviors, and a rubric can contain many items, each responsible for checking a subset of those behaviors. For instance, for the desired behavior “the LLM should respond in plain language,” the rubric would include the item “Is the response written in plain language?”
Here is an excerpt from one of our Elastic Security rubric prompts, which contains many evaluation items:
5. Evaluate the value of the “summaryMarkdown” field of all the “attackDiscoveries” in the submission JSON. Are the values of “summaryMarkdown” in the “submission” at least partially similar to that of the “expected response”, regardless of the order in which they appear, and summarize the same incident(s)? Summarize each summary, and explain your answer with lettered steps.
6. Evaluate the value of the “title” field of all the “attackDiscoveries” in the submission json. Are the “title” values in the submission at least partially similar to the title(s) of the “expected response”, regardless of the order in which they appear, and mention the same incident(s)?
With this rubric, we then use an LLM evaluator to check whether the responses satisfy the rubric, as illustrated in the image below. This happens directly in the flow between the user submitting a query and the response being displayed: the rubric prompt checks in real time whether the LLM output is good enough, and if not, the flow goes back to the initial generator LLM to regenerate a response. See an example LangSmith trace.
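To illustrate the idea only (this is not our exact implementation), an LLM-as-judge check can be as simple as sending the rubric items, the expected response, and the submission to a judge model and parsing its verdict. The judge model, prompt wording, and pass/fail parsing below are assumptions.

```python
# A minimal LLM-as-judge sketch: grade a submission against rubric items.
# The judge model, prompt wording, and rubric phrasing are illustrative assumptions.
from langchain_openai import ChatOpenAI

RUBRIC = [
    "Are the summaryMarkdown values at least partially similar to the expected "
    "response, and do they summarize the same incident(s)?",
    "Are the title values at least partially similar to the expected titles, "
    "and do they mention the same incident(s)?",
]

judge = ChatOpenAI(model="gpt-4o", temperature=0)

def passes_rubric(submission: str, expected: str) -> bool:
    items = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(RUBRIC))
    prompt = (
        "You are grading an Attack Discovery output against a rubric.\n"
        f"Rubric:\n{items}\n\n"
        f"Expected response:\n{expected}\n\n"
        f"Submission:\n{submission}\n\n"
        "Answer PASS if every rubric item is satisfied; otherwise answer FAIL."
    )
    verdict = judge.invoke(prompt).content
    return "PASS" in str(verdict).upper()

# If passes_rubric(...) returns False, the flow can route back to the generator
# LLM to regenerate the response, as described above.
```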
For your use case, you may want to compare a few LLMs to determine which ones work best for you. With this framework, we can also quantitatively evaluate the “evaluator” LLM and the rubric prompts themselves.
Lastly, the scoring mechanism produces a final score based on your defined behaviors. For example, if you want to weigh a certain rubric more heavily, you can multiply its score by a weight. In our case, we wanted a minimum accuracy threshold, so we drop a prompt if its accuracy falls below 85%. This can be implemented in your programming language of choice.
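As a concrete, hypothetical example of such a scoring mechanism, you might weight the per-rubric scores and apply the accuracy threshold like this:

```python
# A hypothetical weighted scoring mechanism with an accuracy threshold.
# The rubric names, weights, and scores are illustrative.
ACCURACY_THRESHOLD = 0.85

def final_score(rubric_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-rubric scores (each between 0 and 1)."""
    total_weight = sum(weights.values())
    return sum(rubric_scores[name] * weights[name] for name in weights) / total_weight

scores = {"accuracy": 0.90, "readability": 0.80, "no_hallucination": 1.00}
weights = {"accuracy": 2.0, "readability": 1.0, "no_hallucination": 2.0}

if scores["accuracy"] < ACCURACY_THRESHOLD:
    print("Drop this prompt: accuracy is below the 85% threshold")
else:
    print(f"Final weighted score: {final_score(scores, weights):.2f}")  # 0.92
```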
Putting it all together, you get an easily understandable results table that lets you see at a glance:
- Is this new prompt doing better? Is it doing better on certain rubric items or not?
- Which LLM was best at a specific task?
- Which LLM has the highest score per our scoring mechanism?
Rubrics themselves can also be treated as prompts to improve on! For example, we tightened up the wording of the rubrics in one improvement, and when we reran this framework, it confirmed that they performed better.
Looking ahead
In this blog, we’ve walked through our GenAI development process, particularly how we improve prompts and compare different configurations, such as selecting different LLMs. The same approach extends to comparing and selecting any component, such as a vector database. This framework is the backing behind future improvements to Attack Discovery, Elastic AI Assistant, and more.
If you’re a user of Attack Discovery or Elastic AI Assistant for Security, thank you for using our tools. We look forward to your feedback! If you’re interested in learning more and using AI to speed up attack triage, check out the Attack Discovery page and Security Labs articles.
Lastly, if you’re a GenAI developer, we hope this article helps you structure your own evaluation workflow. We’re continuously improving our GenAI development systems and look forward to sharing more.
If you’re interested in learning more about how Elastic enables and powers GenAI tools around the world, check out our Elasticsearch Labs articles.
The release and timing of any features or functionality described in this post remain at Elastic’s sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.
Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.