Lucene search
K

17 matches found

Packet Storm News
Packet Storm News
added 2026/05/12 12:0 a.m.9 views

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code...

6.1AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/05/05 12:0 a.m.10 views

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from...

5.9AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/04/22 12:0 a.m.6 views

A Ground-Truth-Based Evaluation of Vulnerability Detection across Multiple Ecosystems

Automated vulnerability detection tools are widely used to identify security vulnerabilities in software dependencies. However, the evaluation of such tools remains challenging due to the heterogeneous structure of vulnerability data sources, inconsistent identifier schemes, and ambiguities in...

5.3AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/04/21 12:0 a.m.21 views

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model LLM agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The...

5.8AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/04/15 12:0 a.m.5 views

RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code

How do security scanners perform on real-world code? We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories educational and Capture-The-Flag applications with 796...

5.8AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/04/13 12:0 a.m.7 views

SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents

We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only...

5.8AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/04/13 12:0 a.m.3 views

RedShell: A Generative AI-Based Approach to Ethical Hacking

The application of Machine Learning techniques in code generation is now a common practice for most developers. Tools such as ChatGPT from OpenAI leverage the natural language processing capabilities of Large Language Models to generate machine code from natural language descriptions. In the...

5.9AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/03/26 12:0 a.m.5 views

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

System prompt configuration can make the difference between near-total phishing blindness and near-perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt-model interaction is a first-order security variable: a single...

5.9AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/03/26 12:0 a.m.0 views

A Large-Scale Empirical Study on the Generalizability of Disclosed Java Library Vulnerability Exploits

Open-source software supply chain security relies heavily on assessing affected versions of library vulnerabilities. While prior studies have leveraged exploits for verifying vulnerability affected versions, they point out a key limitation that exploits are version-specific and cannot be directly...

6.2AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/03/23 12:0 a.m.5 views

OrgForge-IT: A Verifiable Synthetic Benchmark for LLM-Based Insider Threat Detection

Synthetic insider threat benchmarks face a consistency problem: corpora generated without an external factual constraint cannot rule out cross-artifact contradictions. The CERT dataset -- the field's canonical benchmark -- is also static, lacks cross-surface correlation scenarios, and predates th...

5.8AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/02/05 12:0 a.m.5 views

Persistent Human Feedback, LLMs, and Static Analyzers for Secure Code Generation and Vulnerability Detection

Existing literature heavily relies on static analysis tools to evaluate LLMs for secure code generation and vulnerability detection. We reviewed 1,080 LLM-generated code samples, built a human-validated ground-truth, and compared the outputs of two widely used static security tools, CodeQL and...

5.5AI score
Exploits0
Packet Storm News
Packet Storm News
added 2026/01/30 12:0 a.m.4 views

PIDSMaker: Building and Evaluating Provenance-Based Intrusion Detection Systems

Recent provenance-based intrusion detection systems PIDSs have demonstrated strong potential for detecting advanced persistent threats APTs by applying machine learning to system provenance graphs. However, evaluating and comparing PIDSs remains difficult: prior work uses inconsistent preprocessi...

5.6AI score
Exploits0
Packet Storm News
Packet Storm News
added 2025/12/18 12:0 a.m.5 views

AutoDFBench 1.0: A Benchmarking Framework for Digital Forensic Tool Testing and Generated Code Evaluation

The National Institute of Standards and Technology NIST Computer Forensic Tool Testing CFTT programme has become the de facto standard for providing digital forensic tool testing and validation. However to date, no comprehensive framework exists to automate benchmarking across the diverse forensi...

7.3AI score
Exploits0
Packet Storm News
Packet Storm News
added 2025/12/16 12:0 a.m.42 views

PentestEval: Benchmarking LLM-Based Penetration Testing with Modular and Stage-Level Design

Penetration testing is essential for assessing and strengthening system security against real-world threats, yet traditional workflows remain highly manual, expertise-intensive, and difficult to scale. Although recent advances in Large Language Models LLMs offer promising opportunities for...

6.6AI score
Exploits0
Packet Storm News
Packet Storm News
added 2025/06/12 12:0 a.m.2 views

Uncovering Reliable Indicators: Improving IoC Extraction from Threat Reports

Indicators of Compromise IoCs are critical for threat detection and response, marking malicious activity across networks and systems. Yet, the effectiveness of automated IoC extraction systems is fundamentally limited by one key issue: the lack of high-quality ground truth. Current extraction too...

6.9AI score
Exploits0
Packet Storm News
Packet Storm News
added 2025/05/05 12:0 a.m.2 views

Towards a Standardized Methodology and Dataset for Evaluating LLM-Based Digital Forensic Timeline Analysis

Large language models LLMs have seen widespread adoption in many domains including digital forensics. While prior research has largely centered on case studies and examples demonstrating how LLMs can assist forensic investigations, deeper explorations remain limited, i.e., a standardized approach...

6.9AI score
Exploits0
Kitploit
Kitploit
added 2020/01/12 9:18 p.m.60 views

LAVA - Large-scale Automated Vulnerability Addition

Evaluating and improving bug-finding tools is currently difficult due to a shortage of ground truth corpora i.e., software that has known bugs with triggering inputs. LAVA attempts to solve this problem by automatically injecting bugs into software. Every LAVA bug is accompanied by an input that...

7AI score
Exploits0References4
Rows per page
Query Builder