Investigating Vulnerabilities and Defenses against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models
Multimodal large language models (MLLMs), which bridge the gap between audio-visual and natural language processing, achieve state-of-the-art performance on several audio-visual tasks. Despite the superior performance of MLLMs, the scarcity of high-quality audio-visual training data and computation...