NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality,...