2 matches found
Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that...
LPASS: Linear Probes As Stepping Stones for Vulnerability Detection Using Compressed LLMs
Large Language Models LLMs are being extensively used for cybersecurity purposes. One of them is the detection of vulnerable codes. For the sake of efficiency and effectiveness, compression and fine-tuning techniques are being developed, respectively. However, they involve spending substantial...