Detecting backdoored language models at scale
Today, we are releasing new research on detecting backdoors in open-weight language models. Our research highlights several key properties of language model backdoors, laying the groundwork for a practical scanner designed to detect backdoored models at scale and improve overall trust in AI...