658 matches found
Multilingual Source Tracing of Speech Deepfakes: a First Benchmark
Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages...
Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM As a Judge, and a Lightweight CTF Benchmark
Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag CTF challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive securi...
Malicious code in astro-benchmark (npm)
The package communicates with a domain associated with malicious activity. --- -= Per source details. Do not edit below this line.=-...
MAL-2025-6696 Malicious code in astro-benchmark (npm)
The package communicates with a domain associated with malicious activity. --- -= Per source details. Do not edit below this line.=-...
Dedupe Python Library 操作系统命令注入漏洞
Dedupe Python Library is an open source Python library for accurate and scalable fuzzy matching, de-duplication from Dedupe.io. Dedupe Python Library suffers from an operating system command injection vulnerability that stems from issuecomment triggering the execution of untrusted code in the...
Breaking Obfuscation: Cluster-Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection
With the rapid expansion of web-based applications and cloud services, malicious JavaScript code continues to pose significant threats to user privacy, system integrity, and enterprise security. But, detecting such threats remains challenging due to sophisticated code obfuscation techniques and...
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we...
Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security
As large language models LLMs increasingly integrate native code interpreters, they enable powerful real-time execution capabilities, substantially expanding their utility. However, such integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based...
Tab-MIA: a Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMs
Large language models LLMs are increasingly trained on tabular data, which, unlike unstructured text, often contains personally identifiable information PII in a highly structured and explicit format. As a result, privacy risks arise, since sensitive records can be inadvertently retained by the...
PhreshPhish: a Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark
Phishing remains a pervasive and growing threat, inflicting heavy economic and reputational damage. While machine learning has been effective in real-time detection of phishing attacks, progress is hindered by lack of large, high-quality datasets and benchmarks. In addition to poor-quality due to...
The Man behind the Sound: Demystifying Audio Private Attribute Profiling Via Multimodal Large Language Model Agents
Our research uncovers a novel privacy risk associated with multimodal large language models MLLMs: the ability to infer sensitive personal attributes from audio data -- a technique we term audio private attribute profiling. This capability poses a significant threat, as audio can be covertly...
AICrypto: a Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models
Whitepaper called AICrypto: A Comprehensive Benchmark For Evaluating Cryptography Capabilities Of Large Language Models...
Implementing and Evaluating Post-Quantum DNSSEC in CoreDNS
The emergence of quantum computers poses a significant threat to current secure service, application and/or protocol implementations that rely on RSA and ECDSA algorithms, for instance DNSSEC, because public-key cryptography based on number factorization or discrete logarithm is vulnerable to...
Towards Privacy-Preserving and Personalized Smart Homes Via Tailored Small Language Models
Large Language Models LLMs have showcased remarkable generalizability in language comprehension and hold significant potential to revolutionize human-computer interaction in smart homes. Existing LLM-based smart home assistants typically transmit user commands, along with user profiles and home...
Post-Processing in Local Differential Privacy: an Extensive Evaluation and Benchmark Platform
Local differential privacy LDP has recently gained prominence as a powerful paradigm for collecting and analyzing sensitive data from users' devices. However, the inherent perturbation added by LDP protocols reduces the utility of the collected data. To mitigate this issue, several post-processin...
DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective
The widespread application of Deep Learning across diverse domains hinges critically on the quality and composition of training datasets. However, the common lack of disclosure regarding their usage raises significant privacy and copyright concerns. Dataset auditing techniques, which aim to...
BackFed: an Efficient and Standardized Benchmark Suite for Backdoor Attacks in Federated Learning
Federated Learning FL systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, a...
Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties
Mobile GUI agents are designed to autonomously execute diverse device-control tasks by interpreting and interacting with mobile screens. Despite notable advancements, their resilience in real-world scenarios where screen content may be partially manipulated by untrustworthy third parties remains...
Evaluating the Evaluators: Trust in Adversarial Robustness Tests
Despite significant progress in designing powerful adversarial evasion attacks for robustness verification, the evaluation of these methods often remains inconsistent and unreliable. Many assessments rely on mismatched models, unverified implementations, and uneven computational budgets, which ca...
JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation
Deobfuscating JavaScript JS code poses a significant challenge in web security, particularly as obfuscation techniques are frequently used to conceal malicious activities within scripts. While Large Language Models LLMs have recently shown promise in automating the deobfuscation process,...