26 matches found
Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the...
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually...
Jailbreaking Frontier Foundation Models through Intention Deception
Large vision-language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe, based on the user's intent. It has been found that this binary training regime ofte...
Your LLM Agent Can Leak Your Data: Data Exfiltration Via Backdoored Tool Use
Tool-use large language model LLM agents are increasingly deployed to support sensitive workflows, relying on tool calls for retrieval, external API access, and session memory management. While prior research has examined various threats, the risk of systematic data exfiltration by backdoored...
The Echo Chamber Multi-Turn LLM Jailbreak
The availability of Large Language Models LLMs has led to a new generation of powerful chatbots that can be developed at relatively low cost. As companies deploy these tools, security challenges need to be addressed to prevent financial loss and reputational damage. A key security challenge is...
Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models
Large Language Models LLMs face a significant threat from multi-turn jailbreak attacks, where adversaries progressively steer conversations to elicit harmful outputs. However, the practical effectiveness of existing attacks is undermined by several critical limitations: they struggle to maintain ...
Multi-Turn Jailbreaking Attack in Multi-Modal Large Language Models
In recent years, the security vulnerabilities of Multi-modal Large Language Models MLLMs have become a serious concern in the Generative Artificial Intelligence GenAI research. These highly intelligent models, capable of performing multi-modal tasks with high accuracy, are also severely susceptib...
Cisco Finds Open-Weight AI Models Easy to Exploit in Long Chats
Cisco’s new research shows that open-weight AI models, while driving innovation, face serious security risks as multi-turn attacks, including conversational persistence, can bypass safeguards and expose data...
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Large Language Models LLMs remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework fo...
HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models
Large Language Models LLMs remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework comprising ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack...
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Large language models LLMs are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly...
Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
Large language models LLMs remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories like malware generation, harassment, or fraud through distinct conversational approaches...
NEXUS: Network Exploration for EXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Large Language Models LLMs have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing approaches often explore the adversarial space...
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining STAC, a novel multi-turn attack framework that exploits agent tool use. STA...
Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs
Large language models LLMs have undergone safety alignment efforts to mitigate harmful outputs. However, as LLMs become more sophisticated in reasoning, their intelligence may introduce new security risks. While traditional jailbreak attacks relied on singlestep attacks, multi-turn jailbreak...
Searching for Privacy Risks in LLM Agents Via Simulation
The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. These dynamic dialogues enable adaptive attack strategies that can cause severe privacy...
Amazon Nova AI Challenge -- Trusted AI: Advancing Secure, AI-Assisted Software Development
AI systems for software development are rapidly gaining prominence, yet significant challenges remain in ensuring their safety. To address this, Amazon launched the Trusted AI track of the Amazon Nova AI Challenge, a global competition among 10 university teams to drive advances in secure AI. In...
ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls
Large Language Models LLMs have demonstrated impressive fluency and reasoning capabilities, but their potential for misuse has raised growing concern. In this paper, we present ScamAgent, an autonomous multi-turn agent built on top of LLMs, capable of generating highly realistic scam call scripts...
Large Reasoning Models Are Autonomous Jailbreak Agents
Jailbreaking -- bypassing built-in safety mechanisms in AI models -- has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models LRMs simplify and scale jailbreaking, converting it into a...
RedCoder: Automated Multi-Turn Red Teaming for Code LLMs
Large Language Models LLMs for code generation i.e., Code LLMs have demonstrated impressive capabilities in AI-assisted software development and testing. However, recent studies have shown that these models are prone to generating vulnerable or even malicious code under adversarial settings...