3 matches found
Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that...
Activation Surgery: Jailbreaking White-Box LLMs without Touching the Prompt
Most jailbreak techniques for Large Language Models LLMs primarily rely on prompt modifications, including paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques also known as targeted ablations of internal components have been used to study and explain LLM...
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Safely aligning large language models LLMs often demands extensive human-labeled preference data, a process that's both costly and time-consuming. While synthetic data offers a promising alternative, current methods frequently rely on complex iterative prompting or auxiliary models. To address...