78 matches found
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code...
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from...
Spring Office Hours Podcast: S5E14 - Spec Driven Development with Simon Martinelli
Join Dan Vega and DaShaun Carter for the latest updates from the Spring Ecosystem. In this episode, Dan and DaShaun are joined by Java Champion, Vaadin Champion, and Oracle ACE Pro Simon Martinelli to talk about Spec-Driven Development. With AI reshaping how we write code, Simon makes the case th...
A Ground-Truth-Based Evaluation of Vulnerability Detection across Multiple Ecosystems
Automated vulnerability detection tools are widely used to identify security vulnerabilities in software dependencies. However, the evaluation of such tools remains challenging due to the heterogeneous structure of vulnerability data sources, inconsistent identifier schemes, and ambiguities in...
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model LLM agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The...
RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code
How do security scanners perform on real-world code? We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories educational and Capture-The-Flag applications with 796...
RedShell: A Generative AI-Based Approach to Ethical Hacking
The application of Machine Learning techniques in code generation is now a common practice for most developers. Tools such as ChatGPT from OpenAI leverage the natural language processing capabilities of Large Language Models to generate machine code from natural language descriptions. In the...
SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents
We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only...
The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities
System prompt configuration can make the difference between near-total phishing blindness and near-perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt-model interaction is a first-order security variable: a single...
A Large-Scale Empirical Study on the Generalizability of Disclosed Java Library Vulnerability Exploits
Open-source software supply chain security relies heavily on assessing affected versions of library vulnerabilities. While prior studies have leveraged exploits for verifying vulnerability affected versions, they point out a key limitation that exploits are version-specific and cannot be directly...
OrgForge-IT: A Verifiable Synthetic Benchmark for LLM-Based Insider Threat Detection
Synthetic insider threat benchmarks face a consistency problem: corpora generated without an external factual constraint cannot rule out cross-artifact contradictions. The CERT dataset -- the field's canonical benchmark -- is also static, lacks cross-surface correlation scenarios, and predates th...
New e-book: Establishing a proactive defense with Microsoft Security Exposure Management
Effective exposure management begins by illuminating and hardening risks across the entire attack surface. Some of the most meaningful shifts in security happen quietly—when teams take a clear look at their exposure landscape and acknowledge the gap between where they stand today and where they...
Persistent Human Feedback, LLMs, and Static Analyzers for Secure Code Generation and Vulnerability Detection
Existing literature heavily relies on static analysis tools to evaluate LLMs for secure code generation and vulnerability detection. We reviewed 1,080 LLM-generated code samples, built a human-validated ground-truth, and compared the outputs of two widely used static security tools, CodeQL and...
PIDSMaker: Building and Evaluating Provenance-Based Intrusion Detection Systems
Recent provenance-based intrusion detection systems PIDSs have demonstrated strong potential for detecting advanced persistent threats APTs by applying machine learning to system provenance graphs. However, evaluating and comparing PIDSs remains difficult: prior work uses inconsistent preprocessi...
AutoDFBench 1.0: A Benchmarking Framework for Digital Forensic Tool Testing and Generated Code Evaluation
The National Institute of Standards and Technology NIST Computer Forensic Tool Testing CFTT programme has become the de facto standard for providing digital forensic tool testing and validation. However to date, no comprehensive framework exists to automate benchmarking across the diverse forensi...
PentestEval: Benchmarking LLM-Based Penetration Testing with Modular and Stage-Level Design
Penetration testing is essential for assessing and strengthening system security against real-world threats, yet traditional workflows remain highly manual, expertise-intensive, and difficult to scale. Although recent advances in Large Language Models LLMs offer promising opportunities for...
EUVD-2025-35304
Nautobot Single Source of Truth SSoT is an app for Nautobot. Prior to version 3.10.0, an unauthenticated attacker could access this page to view the Service Now public instance name e.g. companyname.service-now.com. This is considered low-value information. This does not expose the Secret, the...
CVE-2025-62607 Nautobot Single Source of Truth (SSoT) has an unauthenticated ServiceNow configuration URL
Nautobot Single Source of Truth SSoT is an app for Nautobot. Prior to version 3.10.0, an unauthenticated attacker could access this page to view the Service Now public instance name e.g. companyname.service-now.com. This is considered low-value information. This does not expose the Secret, the...
CVE-2025-60511
Moodle OpenAI Chat Block plugin 3.0.1 2025021700 suffers from an Insecure Direct Object Reference IDOR vulnerability due to insufficient validation of the blockId parameter in /blocks/openaichat/api/completion.php. An authenticated student can impersonate another user's block e.g., administrator...
Malicious code in test-mlw2-gyron-terts-mayed-truth (npm)
The package test-mlw2-gyron-terts-mayed-truth was found to contain malicious code...