R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of...