376 matches found
Claude Fable 5 and Mythos 5 “abruptly disabled” after US gov. ban
Anthropic has been ordered by the US government to cut off its newest Claude Fable 5 and Mythos 5 models for fear of abuse by adversaries. Reuters reports that Anthropic said it will "abruptly disable" its most advanced AI models for all users after the US government ordered it to suspend access...
Criminal AI-as-a-Service in 2026: How the Underground Market Is Operationalizing Cybercrime
Introduction The underground market for criminally oriented generative AI has moved beyond the early hype surrounding 'malicious chatbots.' The gradual integration of AI as a productivity layer within cybercrime operations has become the dominant story, indicating that while the potential for ful...
Malicious code in jailbreak-code (npm)
--- -= Per source details. Do not edit below this line.=- Source: amazon-inspector 9f729dde017c78154685be850893a9f3ebd58bf0b5cb1229e7e49fb09b14f5d5 The package presents itself as an AI developer CLI but is engineered as a credential and payment harvester. src/c2.ts hardcodes a Discord webhook URL...
MAL-2026-5543 Malicious code in jailbreak-code (npm)
--- -= Per source details. Do not edit below this line.=- Source: amazon-inspector 9f729dde017c78154685be850893a9f3ebd58bf0b5cb1229e7e49fb09b14f5d5 The package presents itself as an AI developer CLI but is engineered as a credential and payment harvester. src/c2.ts hardcodes a Discord webhook URL...
Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
Large Language Models LLMs are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding GCD has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this...
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models
As Large Language Models evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts in safety training. Traditional jailbreak techniques typically focus on a single prompt injection, neglecting the models' ability to remember the flow of...
Evolving Skill-Structured Attack Memory Enhances LLM Jailbreaking
Jailbreak attacks on large language models LLMs aim to induce LLMs to produce content that they are expected to refuse. Automated black-box jailbreak generation is especially important for safety evaluation, where the attacker observes only model outputs and needs to automatically search for...
BAIT: Boundary-Guided Disclosure Escalation Via Self-Conditioned Reasoning
In this work, we propose BAIT Boundary-Aware Iterative Trap, a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed...
Reasoning As an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
Large Reasoning Models LRMs have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in real-world applications. However, their explicit chain-of-thought CoT mechanism introduces new security risks, making them particularly vulnerable to jailbreak...
Babel: Jailbreaking Safety Attention Via Obfuscation Distribution Optimized Sampling
Despite rigorous safety alignment, Large Language Models LLMs remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic...
Re-Triggering Safeguards within LLMs for Jailbreak Detection
This paper proposes a jailbreaking prompt detection method for large language models LLMs to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts ar...
Position: AI Security Policy Should Target Systems, Not Models
We present swarm-attack, an open-source adversarial testing framework in which multiple lightweight LLM agents coordinate through shared memory, parallel exploration, and evolutionary optimization. Together, our results demonstrate that both safety bypass of frontier models and software...
The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security beyond Binary Scoring
Jailbreak attacks -- adversarial prompts that bypass LLM alignment through purely linguistic manipulation -- pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper...
OrchJail: Jailbreaking Tool-Calling Text-To-Image Agents by Orchestration-Guided Fuzzing
Tool-calling text-to-image T2I agents can plan and execute multi-step tool chains to accomplish complex generation and editing queries. However, this capability introduces a new safety attack surface: harmful outputs may arise from tool orchestration, where individually benign steps combine into...
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
Defending large language models LLMs against jailbreak attacks, such as Greedy Coordinate Gradient GCG, remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in so...
ContextualJailbreak: Evolutionary Red-Teaming Via Simulated Conversational Priming
Large language models LLMs remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn...
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory-a persiste...
Jailbroken Frontier Models Retain Their Capabilities
As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a "jailbreak tax" that degrades the target model's task performance. We show that this tax scales inversely with model...
Evaluating Jailbreaking Vulnerabilities in LLMs Deployed As Assistants for Smart Grid Operations: A Benchmark against NERC Standards
The deployment of Large Language Models LLMs as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignmen...
AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather tha...