Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
Large language models remain vulnerable to jailbreak attacks -- inputs designed to bypass safety mechanisms and elicit harmful responses -- despite advances in alignment and instruction tuning. We propose Head-Masked Nullspace Steering HMNS, a circuit-level intervention that i identifies attentio...