Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain e.g., writing insecure code can become broadly misaligned -- a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models. We finetune reasoning models on...