SDD: Self-Degraded Defense against Malicious Fine-Tuning
Open-source Large Language Models (LLMs) often employ safety alignment methods to resist harmful instructions. However, recent research shows that maliciously fine-tuning these LLMs on harmful data can easily bypass these safeguards. To counter this, we theoretically uncover why malicious fine-tuni...