Description: AIM: Adapting Image Models for Efficient Video Action Recognition
Taojiannan Yang ♠,♣, Yi Zhu ♠, Yusheng Xie ♠, Aston Zhang ♠, Chen Chen ♣, Mu Li ♠
♠ Amazon Web Services, ♣ University of Central Florida
Multimodal models are sensitive to image/text perturbations (original image-text pairs are shown in blue boxes, perturbed ones in red boxes). Image captioning (top): adding image perturbations can result in incorrect captions, e.g., the tabby kitten is mistakenly described as a woman or a dog. Text-to-image generation (bottom): applying text perturbations can result in generated images with incomplete visual information, e.g., the tree is missing in the example above.