At Cisco, AI threat research is fundamental to informing the ways we evaluate and protect models. In a space that is dynamic and rapidly evolving, these efforts help ensure that our customers are protected against emerging vulnerabilities and adversarial techniques.
This regular threat roundup shares useful highlights and critical intelligence from third-party threat research with the broader AI security community. As always, please remember that this is not an exhaustive list of AI threats, but rather a curation that our team believes is particularly noteworthy.
Notable threats and developments: February 2025
Adversarial reasoning at jailbreaking time
Cisco’s own AI security researchers at Robust Intelligence, in close collaboration with researchers from the University of Pennsylvania, developed an Adversarial Reasoning approach to automated model jailbreaking via test-time computation. This technique uses advanced model reasoning to effectively exploit the feedback signals provided by a large language model (LLM) to bypass its guardrails and execute harmful objectives.
The research in this paper expands on a recently published Cisco blog evaluating the security alignment of DeepSeek R1, OpenAI o1-preview, and various other frontier models. Researchers achieved a 100% attack success rate (ASR) against the DeepSeek model, revealing serious security flaws and potential usage risks. The findings suggest that future work on model alignment must consider not only individual prompts, but entire reasoning paths, in order to develop robust defenses for AI systems.
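To make the mechanism concrete, the minimal sketch below illustrates how test-time reasoning over a target model's feedback can drive iterative prompt refinement. This is not the authors' implementation; the attacker, target, and judge callables are assumed stand-ins.

```python
from typing import Callable, Optional

def adversarial_reasoning_attack(
    objective: str,
    target_llm: Callable[[str], str],          # model under attack
    attacker_llm: Callable[[str], str],        # reasoning model that proposes refinements
    judge_score: Callable[[str, str], float],  # scores how close a response is to the objective
    max_iters: int = 20,
) -> Optional[str]:
    """Iteratively refine a candidate prompt using the target model's feedback signal."""
    candidate = objective  # begin with the raw objective as the first attempt
    history = []           # (prompt, response, score) triples used as reasoning context

    for _ in range(max_iters):
        response = target_llm(candidate)
        score = judge_score(objective, response)  # feedback signal exploited by the attack
        history.append((candidate, response, score))

        if score >= 1.0:      # judge considers the objective achieved
            return candidate  # successful jailbreak prompt

        # The attacker model reasons over the full attempt history to propose
        # a refined prompt for the next test-time iteration.
        candidate = attacker_llm(
            "Given these past attempts, responses, and scores, propose a revised "
            f"prompt that better advances the objective:\n{history!r}"
        )
    return None  # attack budget exhausted without success
```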
MITRE ATLAS: AML.T0054 – LLM Jailbreak
Reference: arXiv
Voice-based jailbreaks for multimodal LLMs
Researchers from the University of Sydney and the University of Chicago have introduced a novel attack methodology called the Flanking Attack, the first instance of a voice-based jailbreak aimed at multimodal LLMs. The technique leverages voice modulation and context obfuscation to bypass model safeguards, proving to be a significant threat even when traditional text-based vulnerabilities have been extensively addressed.
In initial evaluations, the Flanking Attack achieved average attack success rates (ASR) between 0.67 and 0.93 across harm scenarios including illegal activities, misinformation, and privacy violations. These findings highlight a significant risk to models like Gemini and GPT-4o that accept audio inputs and reinforce the need for rigorous security measures around multimodal AI systems.
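As an illustration only, the sketch below captures the general "flanking" pattern of embedding a target request between benign conversational turns; the helper name and prompt wording are assumptions rather than material from the paper, and the separate text-to-speech step is omitted.

```python
def build_flanking_sequence(target_request: str) -> list[str]:
    """Sketch of the flanking pattern: a disallowed request is sandwiched between
    benign conversational turns before the sequence is rendered to speech and
    sent to an audio-capable model."""
    benign_opening = [
        "Let's play a collaborative storytelling game set in a fictional world.",
        "First, describe the setting and introduce the narrator.",
    ]
    benign_closing = [
        "Now bring the story to a close.",
        "Finally, summarize the key events in one sentence.",
    ]
    # The flanked request inherits the benign conversational framing; in the
    # voice-based setting, each turn would then be synthesized to audio.
    return benign_opening + [target_request] + benign_closing
```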
MITRE ATLAS: AML.T0054 – LLM Jailbreak
Reference: arXiv
Terminal DiLLMa: LLM terminal hijacking
Security researcher and red teaming expert Johann Rehberger shared a post on his personal blog exploring how LLM applications can be abused to hijack terminals, building on a vulnerability first identified by researcher Leon Derczynski. The issue affects, for example, terminal services and command-line interface (CLI) tools that integrate LLM responses without proper sanitization.
The vulnerability stems from ANSI escape codes in outputs from LLMs like GPT-4; these sequences can control terminal behavior and lead to harmful consequences such as terminal state alteration, command execution, and data exfiltration. The vector is most potent when LLM outputs are displayed directly in terminal interfaces; in these cases, protections must be in place to prevent manipulation by an adversary.
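As a simple mitigation sketch (ours, not taken from the original write-ups), the snippet below strips ANSI escape sequences from model output before it reaches a terminal; the regex covers common CSI and OSC sequences but is not exhaustive.

```python
import re

# Matches common ANSI escape sequences: CSI (e.g. \x1b[31m) and OSC (e.g. \x1b]0;title\x07).
ANSI_ESCAPE = re.compile(r"\x1b(\[[0-9;?]*[ -/]*[@-~]|\][^\x07\x1b]*(\x07|\x1b\\))")

def sanitize_llm_output(text: str) -> str:
    """Remove ANSI escape sequences before displaying LLM output in a terminal."""
    return ANSI_ESCAPE.sub("", text)

# Example: a model response carrying a color-changing escape code is rendered as plain text.
untrusted = "Hello \x1b[31mworld\x1b[0m"
print(sanitize_llm_output(untrusted))  # -> "Hello world"
```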
MITRE ATLAS: AML.T0050 – Command and Scripting Interpreter
Reference: Embrace the Red; Inter Human Agreement (Substack)
ToolCommander: Manipulating LLM tool-calling systems
A team of researchers representing three universities in China developed ToolCommander, an attack framework that injects malicious tools into an LLM application to perform privacy theft, denial of service, and unscheduled tool calling. The framework operates in two stages: it first captures user queries by injecting a privacy-theft tool, then uses that information to sharpen the second stage, which injects commands to call specific tools or disrupt tool scheduling.
Evaluations revealed vulnerabilities in multiple LLM systems, including GPT-4o mini, Llama 3, and Qwen2, with varying success rates; GPT and Llama models showed greater vulnerability, with ASRs as high as 91.67%. As LLM agents become increasingly common across applications, this research underscores the importance of robust security measures for tool-calling capabilities.
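One practical hardening step, sketched below with assumed names rather than anything from the paper, is to allow-list tool names and integrity-check tool manifests before exposing them to the model's tool-calling interface.

```python
import hashlib
import json

# Allow-list mapping approved tool names to the SHA-256 digest of their registered
# manifests. The entry below is a placeholder for illustration.
APPROVED_TOOLS: dict[str, str] = {
    "get_weather": "placeholder-sha256-of-registered-manifest",
}

def manifest_digest(manifest: dict) -> str:
    """Stable hash of a tool manifest (name, description, parameter schema)."""
    return hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()

def verify_tool(manifest: dict) -> bool:
    """Accept a tool only if it is allow-listed and its manifest is unmodified."""
    name = manifest.get("name", "")
    if name not in APPROVED_TOOLS:
        return False  # reject unknown or attacker-injected tools outright
    return manifest_digest(manifest) == APPROVED_TOOLS[name]  # reject tampered descriptions

def filter_tools(candidate_manifests: list[dict]) -> list[dict]:
    """Expose only verified tools to the LLM's tool-calling interface."""
    return [m for m in candidate_manifests if verify_tool(m)]
```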
MITRE ATLAS: AML.T0029 – Denial of ML Service; AML.T0053 – LLM Plugin Compromise
Reference: arXiv
We’d love to hear what you think. Ask a Question, Comment Below, and Stay Connected with Cisco Secure on social!