Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that...
Threat Modelling Using Domain-Adapted Language Models: Empirical Evaluation and Insights
Large Language Models (LLMs) are increasingly explored for cybersecurity applications such as vulnerability detection. In the domain of threat modelling, prior work has primarily evaluated a number of general-purpose Large Language Models under limited prompting settings. In this study, we extend th...
Bypassing AI Control Protocols Via Agent-As-A-Proxy Attacks
As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these...
Leaner Training, Lower Leakage: Revisiting Memorization in LLM Fine-Tuning with LoRA
Memorization in large language models (LLMs) makes them vulnerable to data extraction attacks. While pre-training memorization has been extensively studied, fewer works have explored its impact in fine-tuning, particularly for LoRA fine-tuning, a widely adopted parameter-efficient method. In this...