2 matches found
Blue Teaming Function-Calling Agents
We present an experimental evaluation that assesses the robustness of four open-source LLMs claiming function-calling capabilities against three different attacks, and we measure the effectiveness of eight different defences. Our results show how these models are not safe by default, and how the...
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly...