2 matches found
SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents Via Distilled Structured Reasoning
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We emplo...
Probe Before You Talk: Towards Black-Box Defense against Backdoor Unalignment for Large Language Models
Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service...