Why LLM Safety Guardrails Collapse after Fine-Tuning: a Similarity Analysis between Alignment and Fine-Tuning Datasets
Recent advancements in large language models LLMs have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrail...