4 matches found
The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers
Detecting whether a model has been poisoned is a longstanding problem in AI security. In this work, we present a practical scanner for identifying sleeper agent-style backdoors in causal language models. Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning dat...
Probe Before You Talk: Towards Black-Box Defense against Backdoor Unalignment for Large Language Models
Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service...
MixBridge: Heterogeneous Image-To-Image Backdoor Attack through Mixture of Schrödinger Bridges
This paper focuses on implanting multiple heterogeneous backdoor triggers in bridge-based diffusion models designed for complex and arbitrary input distributions. Existing backdoor formulations mainly address single-attack scenarios and are limited to Gaussian noise input models. To fill this gap...
RADEP: a Resilient Adaptive Defense Framework against Model Extraction Attacks
Machine Learning as a Service (MLaaS) enables users to leverage powerful machine learning models through cloud-based APIs, offering scalability and ease of deployment. However, these services are vulnerable to model extraction attacks, where adversaries repeatedly query the application programming...