Chasing Shadows: Pitfalls in LLM Security Research
Large language models (LLMs) are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning researc...
IF-GUIDE: Influence Function-Guided Detoxification of LLMs
We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained and potentially toxic models to align them with human values. In contrast, we propose a...