Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
Defending large language models LLMs against jailbreak attacks, such as Greedy Coordinate Gradient GCG, remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in so...