Probe Before You Talk: Towards Black-Box Defense against Backdoor Unalignment for Large Language Models
Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to applications of LLMs in the real-world Large Language Model as a Service...