216 matches found
MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems
Hierarchical multi-agent systems MAS are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particular...
PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections
Large Language Models LLMs are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at...
Updating the taxonomy of failure modes in agentic AI systems: What a year of red teaming taught us
In this article 1. Why the Taxonomy Needed Updating 2. Seven new failure modes 3. Operational findings: What red teaming showed 4. New mitigations 5. What to do this quarter When the Microsoft AI Red Team published the Taxonomy of Failure Modes in Agentic AI Systems in April 2025, the goal was a...
RedEdit: Agentic Red-Teaming of Image Safety Classifiers Via MCTS-Guided Photo-Editing
Image safety classifiers serve as a critical component of contemporary content moderation systems on the internet. However, their resilience against user-style malicious image editing remains underexplored. Such behaviors are highly prevalent in daily scenarios but difficult to fully reproduce. T...
Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming
Recent computer-using-agent CUA red-teaming papers report prompt-injection attack success rates ASR of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still...
-cascade-scan
cascade-scan AI Agent security evaluation framework — autom...
ADR: An Agentic Detection System for Enterprise Agentic AI Security
We present the Agentic AI Detection and Response ADR system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol MCP. We identify three persistent challenges in this domain: 1 limited observability -- existing Endpoint...
A Red Teaming Framework for Evaluating Robustness of AI-Enabled Security Orchestration, Automation, and Response Systems
AI-enabled Security Orchestration, Automation, and Response SOAR systems increasingly employ autonomous agents for cyber defense, yet their resilience to adaptive adversaries is underexplored. We introduce an autonomous red teaming framework that integrates large language models LLMs with...
IPI-Proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents against Indirect Prompt Injection
Web-browsing AI agents are increasingly deployed in enterprise settings under strict whitelists of approved domains, yet adversaries can still influence them by embedding hidden instructions in the HTML pages those domains serve. Existing red-teaming resources fall short of this scenario:...
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue tha...
Your Purple Team Isn't Purple — It's Just Red and Blue in the Same Room
Defending a network at 2 am looks a lot like this: an analyst copy-pasting a hash from a PDF into a SIEM query. A red team script is being rewritten by hand so the blue team can use it. A patch waiting on a change-approval window that's longer than the exploitation window itself. Nobody in that...
LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges
The integration of Large Language Models LLMs into Electronic Design Automation EDA and hardware security is rapidly reshaping the semiconductor industry. While LLMs offer unprecedented capabilities in generating Register Transfer Level RTL code, automating testbenches, and bridging the semantic...
MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring
We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. First, mode collapse in attack generation,...
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually...
Autonomous Adversary: Red-Teaming in the Age of LLM
Language Model Agents LMAs are emerging as a powerful primitive for augmenting red-team operations. They can support attack planning, adversary emulation, and the orchestration of multi-step activity such as lateral movement, a core enabling capability of advanced persistent threat APT campaigns...
DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have...
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library-specific workflows. Operators spend weeks hand-crafting workflows -...
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work:...
STARE: Step-Wise Temporal Alignment and Red-Teaming Engine for Multi-Modal Toxicity Attack
Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semanti...
Training a General Purpose Automated Red Teaming Model
Automated methods for red teaming LLMs are an important tool to identify LLM vulnerabilities that may not be covered in static benchmarks, allowing for more thorough probing. They can also adapt to each specific LLM to discover weaknesses unique to it. Most current automated red teaming methods a...