5 matches found

Packet Storm News
added 2026/04/20 12:00 a.m., 2 views

ARES: Adaptive Red-Teaming and End-To-End Repair of Policy-Reward System

Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches...

AI score: 5.8
Exploits: 0

Packet Storm News
added 2026/02/10 12:00 a.m., 5 views

SecCodePRM: A Process Reward Model for Code Security

Large Language Models are rapidly becoming core components of modern software development workflows, yet ensuring code security remains challenging. Existing vulnerability detection pipelines either rely on static analyzers or use LLM/GNN-based detectors trained with coarse program-level...

AI score: 5.7
Exploits: 0

Packet Storm News
added 2025/07/14 12:00 a.m., 2 views

PRM-Free Security Alignment of Large Models Via Red Teaming and Adversarial Training

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains. Current security alignment methodologies predominantly rely on Process Reward Models (PRMs) to evaluate...

AI score: 7.1
Exploits: 0

Packet Storm News
added 2025/06/06 12:00 a.m., 1 view

Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance

Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning...

AI score: 7
Exploits: 0

Packet Storm News
added 2025/06/03 12:00 a.m., 3 views

BadReward: Clean-Label Poisoning of Reward Models in Text-To-Image RLHF

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF's feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of...

AI score: 6.9
Exploits: 0