24 matches found
Analysis of LLMs against Prompt Injection and Jailbreak Attacks
Large Language Models (LLMs) are widely deployed in real-world systems. Given their broader applicability, prompt engineering has become an efficient tool for resource-scarce organizations to adopt LLMs for their own purposes. At the same time, LLMs are vulnerable to prompt-based attacks. Thus,...
ShallowJail: Steering Jailbreaks against Large Language Models
Large Language Models (LLMs) have been successful in numerous fields. Alignment has usually been applied to prevent them from harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either...
Jailbreaking LLMs and VLMs: Mechanisms, Evaluation, and Unified Defense
This paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models (LLMs) and Vision-Language Models (VLMs), emphasizing that jailbreak vulnerabilities stem from structural factors such as incomplete training data, linguistic ambiguity, and generative uncertainty. It...
Odysseus: Jailbreaking Commercial Multimodal LLM-Integrated Systems Via Dual Steganography
By integrating language understanding with perceptual modalities such as images, multimodal large language models (MLLMs) constitute a critical substrate for modern AI systems, particularly intelligent agents operating in open and interactive environments. However, their increasing accessibility al...
NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks
Jailbreak attacks designed to bypass safety mechanisms pose a serious threat by prompting LLMs to generate harmful or inappropriate content, despite alignment with ethical guidelines. Crafting universal filtering rules remains difficult due to their inherent dependence on specific contexts. To...
HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models
Large Language Models (LLMs) remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework comprising ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack...
EUVD-2025-14804
Malicious code in bioql PyPI...
Breaking the Code: Security Assessment of AI Code Agents through Systematic Jailbreaking Attacks
Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass "jailbreak" attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection...
Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism Via Probabilistically Ablating Refusal Direction
Jailbreak attacks pose persistent threats to large language models (LLMs). Current safety alignment methods have attempted to address these issues, but they experience two significant limitations: insufficient safety alignment depth and unrobust internal defense mechanisms. These limitations make...
Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that...
NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models
In deployment and application, large language models (LLMs) typically undergo safety alignment to prevent illegal and unethical outputs. However, the continuous advancement of jailbreak attack techniques, designed to bypass safety mechanisms with adversarial prompts, has placed increasing pressure ...
Mitigating Jailbreaks with Intent-Aware LLMs
Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. In this work, we propose Intent-FT, a simple and lightweight fine-tuning approach that...
Attention Slipping: a Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs
As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In thi...
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks
The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensi...
Investigating Vulnerabilities and Defenses against Audio-Visual Attacks: a Comprehensive Survey Emphasizing Multimodal Models
Multimodal large language models (MLLMs), which bridge the gap between audio-visual and natural language processing, achieve state-of-the-art performance on several audio-visual tasks. Despite the superior performance of MLLMs, the scarcity of high-quality audio-visual training data and computation...
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks against harmful queries and adversarial attacks. While recent mainstream safety efforts on LRMs,...
JailbreaksOverTime: Detecting Jailbreak Attacks under Distribution Shift
Safety and security remain critical concerns in AI deployment. Despite safety training through reinforcement learning with human feedback (RLHF) [32], language models remain vulnerable to jailbreak attacks that bypass safety guardrails. Universal jailbreaks - prefixes that can circumvent alignment fo...
Concept Enhancement Engineering: a Lightweight and Efficient Robust Defense against Jailbreak Attacks in Embodied AI
Embodied Intelligence (EI) systems integrated with large language models (LLMs) face significant security risks, particularly from jailbreak attacks that manipulate models into generating harmful outputs or executing unsafe physical actions. Traditional defense strategies, such as input filtering and...
GHSA-F3MF-HM6V-JFHH Mesop Class Pollution vulnerability leads to DoS and Jailbreak attacks
From @jackfromeast and @superboy-zjc: We have identified a class pollution vulnerability in Mesop (<= 0.14.0) application that allows attackers to overwrite global variables and class attributes in certain Mesop modules during runtime. This vulnerability could directly lead to a denial of service Do...