17 matches found
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code...
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from...
A Ground-Truth-Based Evaluation of Vulnerability Detection across Multiple Ecosystems
Automated vulnerability detection tools are widely used to identify security vulnerabilities in software dependencies. However, the evaluation of such tools remains challenging due to the heterogeneous structure of vulnerability data sources, inconsistent identifier schemes, and ambiguities in...
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model LLM agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The...
RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code
How do security scanners perform on real-world code? We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories educational and Capture-The-Flag applications with 796...
SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents
We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only...
RedShell: A Generative AI-Based Approach to Ethical Hacking
The application of Machine Learning techniques in code generation is now a common practice for most developers. Tools such as ChatGPT from OpenAI leverage the natural language processing capabilities of Large Language Models to generate machine code from natural language descriptions. In the...
The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities
System prompt configuration can make the difference between near-total phishing blindness and near-perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt-model interaction is a first-order security variable: a single...
A Large-Scale Empirical Study on the Generalizability of Disclosed Java Library Vulnerability Exploits
Open-source software supply chain security relies heavily on assessing affected versions of library vulnerabilities. While prior studies have leveraged exploits for verifying vulnerability affected versions, they point out a key limitation that exploits are version-specific and cannot be directly...
OrgForge-IT: A Verifiable Synthetic Benchmark for LLM-Based Insider Threat Detection
Synthetic insider threat benchmarks face a consistency problem: corpora generated without an external factual constraint cannot rule out cross-artifact contradictions. The CERT dataset -- the field's canonical benchmark -- is also static, lacks cross-surface correlation scenarios, and predates th...
Persistent Human Feedback, LLMs, and Static Analyzers for Secure Code Generation and Vulnerability Detection
Existing literature heavily relies on static analysis tools to evaluate LLMs for secure code generation and vulnerability detection. We reviewed 1,080 LLM-generated code samples, built a human-validated ground-truth, and compared the outputs of two widely used static security tools, CodeQL and...
PIDSMaker: Building and Evaluating Provenance-Based Intrusion Detection Systems
Recent provenance-based intrusion detection systems PIDSs have demonstrated strong potential for detecting advanced persistent threats APTs by applying machine learning to system provenance graphs. However, evaluating and comparing PIDSs remains difficult: prior work uses inconsistent preprocessi...
AutoDFBench 1.0: A Benchmarking Framework for Digital Forensic Tool Testing and Generated Code Evaluation
The National Institute of Standards and Technology NIST Computer Forensic Tool Testing CFTT programme has become the de facto standard for providing digital forensic tool testing and validation. However to date, no comprehensive framework exists to automate benchmarking across the diverse forensi...
PentestEval: Benchmarking LLM-Based Penetration Testing with Modular and Stage-Level Design
Penetration testing is essential for assessing and strengthening system security against real-world threats, yet traditional workflows remain highly manual, expertise-intensive, and difficult to scale. Although recent advances in Large Language Models LLMs offer promising opportunities for...
Uncovering Reliable Indicators: Improving IoC Extraction from Threat Reports
Indicators of Compromise IoCs are critical for threat detection and response, marking malicious activity across networks and systems. Yet, the effectiveness of automated IoC extraction systems is fundamentally limited by one key issue: the lack of high-quality ground truth. Current extraction too...
Towards a Standardized Methodology and Dataset for Evaluating LLM-Based Digital Forensic Timeline Analysis
Large language models LLMs have seen widespread adoption in many domains including digital forensics. While prior research has largely centered on case studies and examples demonstrating how LLMs can assist forensic investigations, deeper explorations remain limited, i.e., a standardized approach...
LAVA - Large-scale Automated Vulnerability Addition
Evaluating and improving bug-finding tools is currently difficult due to a shortage of ground truth corpora i.e., software that has known bugs with triggering inputs. LAVA attempts to solve this problem by automatically injecting bugs into software. Every LAVA bug is accompanied by an input that...