371 matches found
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models
As Large Language Models evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts in safety training. Traditional jailbreak techniques typically focus on a single prompt injection, neglecting the models' ability to remember the flow of...
Evolving Skill-Structured Attack Memory Enhances LLM Jailbreaking
Jailbreak attacks on large language models LLMs aim to induce LLMs to produce content that they are expected to refuse. Automated black-box jailbreak generation is especially important for safety evaluation, where the attacker observes only model outputs and needs to automatically search for...
BAIT: Boundary-Guided Disclosure Escalation Via Self-Conditioned Reasoning
In this work, we propose BAIT Boundary-Aware Iterative Trap, a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed...
Reasoning As an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
Large Reasoning Models LRMs have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in real-world applications. However, their explicit chain-of-thought CoT mechanism introduces new security risks, making them particularly vulnerable to jailbreak...
Babel: Jailbreaking Safety Attention Via Obfuscation Distribution Optimized Sampling
Despite rigorous safety alignment, Large Language Models LLMs remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic...
Re-Triggering Safeguards within LLMs for Jailbreak Detection
This paper proposes a jailbreaking prompt detection method for large language models LLMs to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts ar...
Position: AI Security Policy Should Target Systems, Not Models
We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software...
The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security beyond Binary Scoring
Jailbreak attacks -- adversarial prompts that bypass LLM alignment through purely linguistic manipulation -- pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper...
OrchJail: Jailbreaking Tool-Calling Text-To-Image Agents by Orchestration-Guided Fuzzing
Tool-calling text-to-image T2I agents can plan and execute multi-step tool chains to accomplish complex generation and editing queries. However, this capability introduces a new safety attack surface: harmful outputs may arise from tool orchestration, where individually benign steps combine into...
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
Defending large language models LLMs against jailbreak attacks, such as Greedy Coordinate Gradient GCG, remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in so...
ContextualJailbreak: Evolutionary Red-Teaming Via Simulated Conversational Priming
Large language models LLMs remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn...
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory-a persiste...
Jailbroken Frontier Models Retain Their Capabilities
As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a "jailbreak tax" that degrades the target model's task performance. We show that this tax scales inversely with model...
Evaluating Jailbreaking Vulnerabilities in LLMs Deployed As Assistants for Smart Grid Operations: A Benchmark against NERC Standards
The deployment of Large Language Models LLMs as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignmen...
AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather tha...
AVISE: Framework for Evaluating the Security of AI Systems
As artificial intelligence AI systems are increasingly deployed across critical domains, their security vulnerabilities pose growing risks of high-profile exploits and consequential system failures. Yet systematic approaches to evaluating AI security remain underdeveloped. In this paper, we...
HarmChip: Evaluating Hardware Security Centric LLM Safety Via Jailbreak Benchmarking
The integration of large language models LLMs into electronic design automation EDA workflows has introduced powerful capabilities for RTL generation, verification, and design optimization, but also raises critical security concerns. Malicious LLM outputs in this domain pose hardware-level threat...
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
Large Language Models LLMs are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks, where adversarial inputs bypass safety mechanisms to elicit harmful outputs, poses significant security risks. While prior work has primarily focused on prompt injection...
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering HMNS, a circuit-level intervention that i identifies attentio...
ha-ps4-jb
🎮 PS4 JB Web Server — Home Assistant Add-on A Home Assistant...