10 matches found
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection
We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a...
Guaranteed Jailbreaking Defense Via Disrupt-And-Rectify Smoothing
This paper proposes a guaranteed defense method for large language models LLMs to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify...
Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs
Large Language ModelsLLMs are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of...
LLM Causality Analysis Framework
A comprehensive framework for multi-level causality analysis in Large Language Models LLMs, enabling systematic investigation of safety mechanisms and misbehavior detection across token, neuron, layer, and representation levels. Includes the whitepaper 2512.04841.pdf titled SoK: A Comprehensive...
EUVD-2015-0048
Malware in sbrugna...
Watermarking Autoregressive Image Generation
Watermarking the outputs of generative models has emerged as a promising approach for tracking their provenance. Despite significant interest in autoregressive image generation models and their potential for misuse, no prior work has attempted to watermark their outputs at the token level. In thi...
Learning Obfuscations of LLM Embedding Sequences: Stained Glass Transform
The high cost of ownership of AI compute infrastructure and challenges of robust serving of large language models LLMs has led to a surge in managed Model-as-a-service deployments. Even when enterprises choose on-premises deployments, the compute infrastructure is typically shared across many tea...
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning...
Token-Level Constraint Boundary Search for Jailbreaking Text-To-Image Models
Recent advancements in Text-to-Image T2I generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt...
Deposits into vaults are not tracked at a token level
Handle itsmeSTYJ Vulnerability details Impact If the strategy used accepts multiple tokens, a user can deposit in a cheaper token and withdraw in a more expensive token because the vault only tracks ownership based on how many shares they own. Proof of Concept 1. An approved strategist adds 2...