5 matches found
Jailbreaking LLMs and VLMs: Mechanisms, Evaluation, and Unified Defense
This paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models LLMs and Vision-Language Models VLMs, emphasizing that jailbreak vulnerabilities stem from structural factors such as incomplete training data, linguistic ambiguity, and generative uncertainty. It...
ARMOR: Aligning Secure and Safe Large Language Models Via Meticulous Reasoning
Large Language Models LLMs have demonstrated remarkable generative capabilities. However, their susceptibility to misuse has raised significant safety concerns. While post-training safety alignment methods have been widely adopted, LLMs remain vulnerable to malicious instructions that can bypass...
Agent Safety Alignment Via Reinforcement Learning
The emergence of autonomous Large Language Model LLM agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user-initiated threats e.g., adversarial prompts and...
Jailbreak Distillation: Renewable Safety Benchmarking
Large language models LLMs are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation JBDistill, a novel benchmark construction framework that "distills" jailbreak attacks into high-quality and easily-updatable safety...
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
LLM-based autonomous agents possess capabilities such as reasoning, tool invocation, and environment interaction, enabling the execution of complex multi-step tasks. The internal reasoning process, i.e., thought, of behavioral trajectory significantly influences tool usage and subsequent actions...