646 matches found
Towards Cybersecurity SuperIntelligence (CSI): What'S the Best Harness for Cybersecurity?
What is the best harness for cybersecurity AI? Cybersecurity systems are converging on a single execution scaffold per agent, an iterative shell loop driven by a Large Language Model LLM. However, scaffolds are not interchangeable, rarely interoperable, and no single scaffold dominates across all...
Relevance As a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling...
SEC-Bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?
Large language models LLMs now support automated software security tasks, including vulnerability discovery and proof-of-concept PoC generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug hunting scenarios because they rely on fuzzing harnesses, target-specific...
CyberMaskQA: A Privacy-Aware Benchmark for Evaluating Large Language Models in Cybersecurity Question Answering
Large language models LLMs are increasingly applied to cybersecurity question answering QA for critical tasks such as incident response and vulnerability analysis. However, real-world operational contexts, including system logs and network configurations, inherently contain sensitive identifiers,...
Parser-Free Querying of Security Logs
Security analysts routinely query system logs to detect threats and investigate incidents, but each log source uses its own semi-structured format: logs are cheap to produce, but expensive to use. The standard approach, building per-source parsers to normalize logs into structured schemas, is...
Measuring Security without Fooling Ourselves: Why Benchmarking Agents Is Hard
The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then...
UNAD+: An Explainable Hybrid Framework for Unknown Network Attack Detection
The detection of previously unseen network attacks remains a major challenge for intrusion detection systems. Although supervised learning methods often perform well on known attack classes, they are limited when new attack types are not represented in the training data. Unsupervised methods are...
HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
Recent benchmark efforts have advanced the evaluation of large language models LLMs in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work...
Not What You Asked For: Typographic Attacks in Household Robot Manipulation
Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical...
ContraFix: Agentic Vulnerability Repair Via Differential Runtime Evidence and Skill Reuse
Large language model LLM agents are increasingly used for automated vulnerability repair AVR, where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities...
ADR: An Agentic Detection System for Enterprise Agentic AI Security
We present the Agentic AI Detection and Response ADR system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol MCP. We identify three persistent challenges in this domain: 1 limited observability -- existing Endpoint...
How AI Hallucinations Are Creating Real Security Risks
AI hallucinations are introducing serious security risks into critical infrastructure decision-making by exploiting human trust through highly confident yet incorrect outputs. When an AI model lacks certainty, it doesn’t have a mechanism to recognize that. Instead, it generates the most probable...
Do Coding Agents Understand Least-Privilege Authorization?
As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive surfaces.To study whether current...
DCVD: Dual-Channel Cross-Modal Fusion for Joint Vulnerability Detection and Localization
Software vulnerability detection plays a critical role in ensuring system security, where real-world auditing requires not only determining whether a function is vulnerable but also pinpointing the specific lines responsible. However, existing approaches either rely on a single information source...
Defense at AI speed: Microsoft’s new multi-model agentic security system tops leading industry benchmark
In this article 1. AI-powered vulnerability discovery at hyper-scale 2. Codename: MDASH—Microsoft Security’s new multi-model agentic scanning harness 3. Using codename MDASH for security research 4. The 5.12.2026 Patch Tuesday cohort 5. Two deep dives 1. CVE-2026-33827—Remote unauthenticated UAF ...
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: eve...
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue tha...
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible consequences. Existing...
MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring
We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. First, mode collapse in attack generation,...
ai.timefold.solver:timefold-solver-quarkus-benchmark-deployment (>=0.8.38 <=0.9.38), ai.timefold.solver:timefold-solver-quarkus-benchmark-integration-test (>=0.8.38 <=0.9.38) +3086 more potentially affected by CVE-2026-6860 via io.vertx:vertx-core (>=4.3.4 <=4.3.8)
io.vertx:vertx-core MAVEN version =4.3.4, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =0.8.38, =22.9.0, =22.9.0, =22.9.0, =22.9.0, =22.9.5 and more Source cves: CVE-2026-6860 Source advisory: OSV:GHSA-3G76-F9XQ-8VP6https://vulners.com/osv/OSV...