12 matches found
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models
As Large Language Models evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts in safety training. Traditional jailbreak techniques typically focus on a single prompt injection, neglecting the models' ability to remember the flow of...
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
Defending large language models LLMs against jailbreak attacks, such as Greedy Coordinate Gradient GCG, remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in so...
TrojanPraise: Jailbreak LLMs Via Benign Fine-Tuning
The demand of customized large language models LLMs has led to commercial LLMs offering black-box fine-tuning APIs, yet this convenience introduces a critical security loophole: attackers could jailbreak the LLMs by fine-tuning them with malicious data. Though this security issue has recently bee...
Jailbreaking Large Vision Language Models in Intelligent Transportation Systems
Large Vision Language Models LVLMs demonstrate strong capabilities in multimodal reasoning and many real-world applications, such as visual question answering. However, LVLMs are highly vulnerable to jailbreaking attacks. This paper systematically analyzes the vulnerabilities of LVLMs integrated ...
Jailbreaking in the Haystack
Recent advances in long-context language models LMs have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet, the safety implications of these extended contexts remain unclear. To bridge this gap, we introduce NINJA short for...
CAVGAN: Unifying Jailbreak and Defense of LLMs Via Generative Adversarial Attacks on Their Internal Representations
Security alignment enables the Large Language Model LLM to gain the protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have isolated LLM jailbreak attacks and defenses. We analyze the security protection...
Universal Jailbreak Suffixes Are Strong Attention Hijackers
We study suffix-based jailbreaks$\unicodex2013$a powerful family of attacks against large language models LLMs that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack Zou et al., 2023, we observe that suffixes vary in efficacy: some...
InfoFlood: Jailbreaking Large Language Models with Information Overload
Large Language Models LLMs have demonstrated remarkable capabilities across various domains. However, their potential to generate harmful responses has raised significant societal and regulatory concerns, especially when manipulated by adversarial techniques known as "jailbreak" attacks. Existing...
Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity
Large Language Models LLMs have demonstrated remarkable capabilities, yet their susceptibility to adversarial attacks, particularly jailbreaking, poses significant safety and ethical concerns. While numerous jailbreak methods exist, many suffer from computational expense, high token usage, or...
PIG: Privacy Jailbreak Attack on LLMs Via Gradient-Based Iterative In-Context Optimization
Large Language Models LLMs excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-alignment models can easily block. Meanwhile, Jailbreak attacks bypass...
Practical Reasoning Interruption Attacks on Reasoning Large Language Models
Reasoning large language models RLLMs have demonstrated outstanding performance across a variety of tasks, yet they also expose numerous security vulnerabilities. Most of these vulnerabilities have centered on the generation of unsafe content. However, recent work has identified a distinct...
Token-Level Constraint Boundary Search for Jailbreaking Text-To-Image Models
Recent advancements in Text-to-Image T2I generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt...