A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
Existing benchmarks of language-model refusal on malicious-coding tasks routinely conflate requests for executable malicious software with requests for harmful security knowledge. This conflation matters because the two request types plausibly trigger distinct refusal pathways in safety-aligned...
Jailbroken Frontier Models Retain Their Capabilities
As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a "jailbreak tax" that degrades the target model's task performance. We show that this tax scales inversely with model...
OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs
The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for...
Prompt Injection Through Poetry
In a new paper, "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models," researchers found that turning LLM prompts into poetry resulted in jailbreaking the models: Abstract: We present evidence that adversarial poetry functions as a universal single-turn...
False Sense of Security: Why Probing-Based Malicious Input Detection Fails to Generalize
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers ha...
Towards Safety and Security Testing of Cyberphysical Power Systems by Shape Validation
The increasing complexity of cyberphysical power systems leads to larger attack surfaces to be exploited by malicious actors and a higher risk of faults through misconfiguration. We propose to meet those risks with a declarative approach to describe cyberphysical power systems and to automaticall...
The Scales of Justitia: a Comprehensive Survey on Safety Evaluation of LLMs
With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP), including areas such as content generation, human-computer interaction, machine translation, and code generation,...
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a...
SAGE: a Generic Framework for LLM Safety Evaluation
Whitepaper called SAGE: A Generic Framework For LLM Safety Evaluation...
T2VShield: Model-Agnostic Jailbreak Defense for Text-To-Video Models
The rapid development of generative artificial intelligence has made text-to-video models essential for building future multimodal world simulators. However, these models remain vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of...