6 matches found
Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities
LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt...
Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
Large language models LLMs remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories like malware generation, harassment, or fraud through distinct conversational approaches...
Security Concerns for Large Language Models: a Survey
Large Language Models LLMs such as GPT-4 and its recent iterations, Google's Gemini, Anthropic's Claude 3 models, and xAI's Grok have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. In this survey, we provide a comprehensive...
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Multi-turn interactions with language models LMs pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn...
Teaching LLMs to Be Deceptive
Interesting research: "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training": Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given th...
Poisoning AI Models
New research into poisoning AI models: The researchers first trained the AI models using supervised learning and then used additional "safety training" methods, including more supervised learning, reinforcement learning, and adversarial training. After this, they checked if the AI still had hidde...