5 matches found
ARES: Adaptive Red-Teaming and End-To-End Repair of Policy-Reward System
Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches...
SecCodePRM: A Process Reward Model for Code Security
Large Language Models are rapidly becoming core components of modern software development workflows, yet ensuring code security remains challenging. Existing vulnerability detection pipelines either rely on static analyzers or use LLM/GNN-based detectors trained with coarse program-level...
PRM-Free Security Alignment of Large Models Via Red Teaming and Adversarial Training
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains. Current security alignment methodologies predominantly rely on Process Reward Models (PRMs) to evaluate...
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning...
BadReward: Clean-Label Poisoning of Reward Models in Text-To-Image RLHF
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF's feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of...