7 matches found
Attention Is Where You Attack
Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We introduce the Attention Redistribution Attack ARA, a white-box adversarial attack that identifies...
Many Hands Make Light Work: An LLM-Based Multi-Agent System for Detecting Malicious PyPI Packages
Malicious code in open-source repositories such as PyPI poses a growing threat to software supply chains. Traditional rule-based tools often overlook the semantic patterns in source code that are crucial for identifying adversarial components. Large language models LLMs show promise for software...
Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs
Backdoor attacks embed malicious behaviors into Large Language Models LLMs, enabling adversaries to trigger harmful outputs or bypass safety controls. However, the persistence of the implanted backdoors under user-driven post-deployment continual fine-tuning has been rarely examined. Most prior...
Adapting Large Language Models to Emerging Cybersecurity Using Retrieval Augmented Generation
Security applications are increasingly relying on large language models LLMs for cyber threat detection; however, their opaque reasoning often limits trust, particularly in decisions that require domain-specific cybersecurity knowledge. Because security threats evolve rapidly, LLMs must not only...
Pentesting-Assistant
Pentesting-Assistant AI-powered penetration testing assist...
Invariant-Based Robust Weights Watermark for Large Language Models
Watermarking technology has gained significant attention due to the increasing importance of intellectual property IP rights, particularly with the growing deployment of large language models LLMs on billions resource-constrained edge devices. To counter the potential threats of IP theft by...
Model Inversion Attacks on Llama 3: Extracting PII from Large Language Models
Large language models LLMs have transformed natural language processing, but their ability to memorize training data poses significant privacy risks. This paper investigates model inversion attacks on the Llama 3.2 model, a multilingual LLM developed by Meta. By querying the model with carefully...