2 matches found
Sparse Autoencoders Are Capable LLM Jailbreak Mitigators
Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering CC-Delta, an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without...
TechniqueRAG: Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text
Accurately identifying adversarial techniques in security texts is critical for effective cyber defense. However, existing methods face a fundamental trade-off: they either rely on generic models with limited domain precision or require resource-intensive pipelines that depend on large labeled...