3 matches found
Jailbreaking in the Haystack
Recent advances in long-context language models LMs have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet, the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA short for...
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Large language models LLMs are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly...
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms
Large Language Models LLMs have been equipped with safety mechanisms to prevent harmful outputs, but these guardrails can often be bypassed through "jailbreak" prompts. This paper introduces a novel graph-based approach to systematically generate jailbreak prompts through semantic transformations...