LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge
Large Language Models (LLMs) have demonstrated remarkable intelligence across various tasks, which has inspired the development and widespread adoption of LLM-as-a-Judge systems for automated model testing, such as red teaming and benchmarking. However, these systems are susceptible to adversarial...