IF-GUIDE: Influence Function-Guided Detoxification of LLMs
We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts $reactive$ approaches, such as fine-tuning pre-trained and potentially toxic models to align them with human values. In contrast, we propose a...