Safety Alignment Can Be Not Superficial with Explicit Safety Signals
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data...