3 matches found
Emergent Misalignment As Prompt Sensitivity: a Research Note
Betley et al. 2025 find that language models finetuned on insecure code become emergently misaligned EM, giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear as to why emergent misalignment occurs. We evaluate insecure models across...
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain e.g., writing insecure code can become broadly misaligned -- a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models. We finetune reasoning models on...
“Emergent Misalignment” in LLMs
Interesting research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs": Abstract: We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model act...