4 matches found
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges
Large Language Model LLM agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag CTF challenges ...
AnyPoC: Universal Proof-Of-Concept Test Generation for Scalable LLM-Based Bug Detection
While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an...
Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains
As AI agents become integral to enterprise workflows, their reliance on shared tool libraries and pre-trained components creates significant supply chain vulnerabilities. While previous work has demonstrated behavioral backdoor detection within individual LLM architectures, the critical question ...
IDAsec - IDA plugin for reverse-engineering and dynamic interactions with the Binsec platform
IDA plugin for reverse-engineering and dynamic interactions with the Binsec platform Features Decoding an instruction in DBA IR Loading execution traces generated by Pinsec Triggering analyzes on Binsec and retrieving results Dependencies protobuf ZMQ capstone for trace disassembly graphviz to dr...