Test-Time Backdoor Attacks on
Multimodal Large Language Models

Dong Lu1*, Tianyu Pang2*☨, Chao Du2, Qian Liu2, Xianjun Yang3, Min Lin2
*Equal Contribution, Corresponding Author
1Southern University of Science and Technology
2Sea AI Lab, Singapore
3University of California, Santa Barbara    

Abstract

Backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. In this work, we present AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs), which involves injecting the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. AnyDoor employs similar techniques used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects.

In our experiments, we validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, as well as provide comprehensive ablation studies. Notably, because the backdoor is injected by a universal perturbation, AnyDoor can dynamically change its backdoor trigger prompts/harmful effects, exposing a new challenge for defending against backdoor attacks.

Data Generation for DALL-E

As detailed in our paper, the DALL-E dataset utilizes a generative method. Initially, we randomly select textual descriptions from MS-COCO captions and subsequently use these as prompts to generate images via DALL-E. Following this, we craft questions related to the contents of images using ChatGPT-4. To conclude the process, we generate the original answers with LLaVa-1.5 as reference.

Overview of Proposed Method

Experiment Results

Demonstrations of test-time backdoor attacks. One practical way to carry out test-time backdoor attacks is to craft a universal perturbation using our AnyDoor method and then stick it onto the camera of an MLLM agent, following previous strategies used for physical-world attacks. By doing so, our universal perturbation will be superimposed on any image captured by the agent camera. If a normal user asks questions without the backdoor trigger (SUDO in this case), the agent will respond in a regular manner; however, if a malicious user poses any question containing the backdoor trigger, the agent will consistently exhibit harmful behaviors. In addition to these demos, our test-time backdoor attacks are effective for any trigger or target harmful behavior.

Demonstrations of attacking under continuously changing scenes, where we apply a universal adversarial perturbation to randomly selected frames in a video.

Visualization

Related Work

There's a lot of excellent works that attack multimodal large language models in recent days, please refer to the Related Work section in our paper for more details.

BibTeX

@article{
      lu2024testtime,
      title={Test-Time Backdoor Attacks on Multimodal Large Language Models},
      author={Lu, Dong and Pang, Tianyu 
        and Du, Chao and Liu, Qian and Yang, Xianjun and Lin, Min},
      journal={arXiv preprint arXiv:2402.08577},
      year={2024},
      }