12 matches found
Guaranteed Jailbreaking Defense Via Disrupt-And-Rectify Smoothing
This paper proposes a guaranteed defense method for large language models LLMs to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify...
Re-Triggering Safeguards within LLMs for Jailbreak Detection
This paper proposes a jailbreaking prompt detection method for large language models LLMs to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts ar...
Cybersecurity Predictions 2026: The Hype We Can Ignore (And the Risks We Can't)
As organizations plan for 2026, cybersecurity predictions are everywhere. Yet many strategies are still shaped by headlines and speculation rather than evidence. The real challenge isn't a lack of forecasts—it's identifying which predictions reflect real, emerging risks and which can safely be...
Secure Retrieval-Augmented Generation against Poisoning Attacks
Large language models LLMs have transformed natural language processing NLP, enabling applications from content generation to decision support. Retrieval-Augmented Generation RAG improves LLMs by incorporating external knowledge but also introduces security risks, particularly from data poisoning...
NEXUS: Network Exploration for EXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Large Language Models LLMs have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing approaches often explore the adversarial space...
RLCracker: Exposing the Vulnerability of LLM Watermarks with Adaptive RL Attacks
Large Language Models LLMs watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical...
Coward: toward Practical Proactive Federated Backdoor Defense Via Collision-Based Watermark
Backdoor detection is currently the mainstream defense against backdoor attacks in federated learning FL, where malicious clients upload poisoned updates that compromise the global model and undermine the reliability of FL deployments. Existing backdoor detection techniques fall into two...
TuneShield: Mitigating Toxicity in Conversational AI While Fine-Tuning on Untrusted Data
Recent advances in foundation models, such as LLMs, have revolutionized conversational AI. Chatbots are increasingly being developed by customizing LLMs on specific conversational datasets. However, mitigating toxicity during this customization, especially when dealing with untrusted training dat...
Mitigating Data Poisoning Attacks to Local Differential Privacy
The distributed nature of local differential privacy LDP invites data poisoning attacks and poses unforeseen threats to the underlying LDP-supported applications. In this paper, we propose a comprehensive mitigation framework for popular frequency estimation, which contains a suite of novel...
A Critical Evaluation of Defenses against Prompt Injection Attacks
Large Language Models LLMs are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue...
DataSentinel: a Game-Theoretic Detection of Prompt Injection Attacks
LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection...
NCorr-FP: a Neighbourhood-Based Correlation-Preserving Fingerprinting Scheme for Intellectual Property Protection of Structured Data
Ensuring data ownership and traceability of unauthorised redistribution are central to safeguarding intellectual property in shared data environments. Data fingerprinting addresses these challenges by embedding recipient-specific marks into the data, typically via content modifications. We propos...