16 matches found
ExploitGym AI Exploit Benchmark Tool
ExploitGym is a large-scale, realistic benchmark built from real-world vulnerabilities and designed to evaluate AI agents' ability to develop exploits...
CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-To-End Cybersecurity Capabilities
AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world...
Towards Demystifying and Repairing LLM-In-The-Loop Vulnerabilities
Large Language Models (LLMs) have been actively integrated into modern software systems as critical components. LLM-in-the-loop vulnerabilities, where vulnerabilities are introduced by LLMs and their dependent downstream components, such as frameworks, introduce new risks. Although some benchmark...
Automating Agent Hijacking Via Structural Template Injection
Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate execution by injecting malicious instructions into retrieved content. Most existing attacks rely on manually crafted, semantics-driven prompt manipulation, which...
Execution-State-Aware LLM Reasoning for Automated Proof-Of-Vulnerability Generation
Proof-of-Vulnerability (PoV) generation is a critical task in software security, serving as a cornerstone for vulnerability validation, false positive reduction, and patch verification. While directed fuzzing effectively drives path exploration, satisfying complex semantic constraints remains a...
Breaking Isolation: A New Perspective on Hypervisor Exploitation Via Cross-Domain Attacks
Hypervisors are under threat by critical memory safety vulnerabilities, with pointer corruption being one of the most prevalent and severe forms. Existing exploitation frameworks depend on identifying highly-constrained structures in the host machine and accurately determining their runtime...
Practical-Vulnerability-Exploitation
Practical-Vulnerability-Exploitation Hands-on exploi...
Prompting the Priorities: A First Look at Evaluating LLMs for Vulnerability Triage and Prioritization
Security analysts face increasing pressure to triage large and complex vulnerability backlogs. Large Language Models (LLMs) offer a potential aid by automating parts of the interpretation process. We evaluate four models (ChatGPT, Claude, Gemini, and DeepSeek) across twelve prompting techniques to...
LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?
Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing. One critical but underexplored application is automated web vulnerability reproduction, which...
A Systematic Study on Generating Web Vulnerability Proof-Of-Concepts Using Large Language Models
Recent advances in Large Language Models (LLMs) have brought remarkable progress in code understanding and reasoning, creating new opportunities and raising new concerns for software security. Among many downstream tasks, generating Proof-of-Concept (PoC) exploits plays a central role in vulnerabilit...
SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios
Large language model (LLM)-powered code agents are rapidly transforming software engineering by automating tasks such as testing, debugging, and repairing, yet the security risks of their generated code have become a critical concern. Existing benchmarks have offered valuable insights but remain...
LLaVul: a Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code
Increasing complexity in software systems places a growing demand on reasoning tools that unlock vulnerabilities manifest in source code. Many current approaches focus on vulnerability analysis as a classifying task, oversimplifying the nuanced and context-dependent real-world scenarios. Even...
PatchProve
PatchProve A PoC-Driven Benchmark for Evaluating Large Lang...
VADER: a Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation
Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key...
Code Security Advent Calendar 2021
We are happy to announce our sixth consecutive Code Security Advent Calendar! Born at RIPS in 2016, each calendar comprises 24 little code puzzles containing hidden security vulnerabilities that wait to be spotted. This is our way to share good vibes with the community while learning and having f...
EVABS - Extremely Vulnerable Android Labs
An open-source Android application that is intentionally vulnerable so as to act as a learning platform for Android application security beginners. The effort is to introduce beginners with very limited or zero knowledge to some of the major and commonly found real-world based Android application...