3 matches found
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually...
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Large language models LLMs are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly...
NEXUS: Network Exploration for EXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Large Language Models LLMs have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing approaches often explore the adversarial space...