MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

1USC  2UPenn  3UMN  4UC Davis  5UW-Madison  6UCLA  7OSU  8Microsoft Research
*Equal Leadership  Equal Contribution

What is MuirBench?

MuirBench is a benchmark containing 11,264 images and 2,600 multiple-choice questions, providing a robust evaluation across 12 multi-image understanding tasks.

Each example comes from one task in MuirBench, presenting diverse multi-image relations.
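To make the composition above concrete, the snippet below sketches one way to load and inspect the benchmark with the Hugging Face datasets library. The dataset ID and the field names (task, question, options, answer) are assumptions made for illustration and should be checked against the released data.

    # Minimal loading sketch (assumed hub ID and field names).
    from datasets import load_dataset

    muirbench = load_dataset("MUIR-BENCH/MUIRBENCH", split="test")  # hypothetical dataset ID

    for example in muirbench.select(range(3)):
        # Each instance is assumed to carry a task label, a question,
        # candidate options, and the ground-truth answer.
        print(example["task"], example["question"], example["options"], example["answer"])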

MuirBench -- Novel Features


  • MuirBench evaluates a comprehensive range of 12 multi-image understanding abilities (e.g., geographic understanding, diagram understanding, and visual retrieval), whereas prior benchmarks generally contain only single-image questions.
  • MuirBench covers 10 diverse multi-image relations (e.g., narrative, complementary).


  • MuirBench provides a robust evaluation of models through unanswerable instance variants. The three major ways of creating unanswerable instances (changing the question, removing the correct option, and reordering or replacing images) are shown below; a minimal implementation sketch follows this list.
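As a rough illustration, the sketch below shows how two of these strategies could be implemented for a dictionary-like instance: removing the correct option so that "None of the other options" becomes the answer, and reordering the input images. The field names (options, answer, images) are illustrative assumptions, not the benchmark's actual schema.

    import copy
    import random

    def drop_correct_option(instance):
        # Remove the correct option; "None of the other options" becomes the answer.
        variant = copy.deepcopy(instance)
        variant["options"] = [o for o in instance["options"] if o != instance["answer"]]
        variant["options"].append("None of the other options")
        variant["answer"] = "None of the other options"
        return variant

    def shuffle_images(instance, seed=0):
        # Reorder the input images so the original answer no longer applies.
        variant = copy.deepcopy(instance)
        variant["images"] = list(instance["images"])
        random.Random(seed).shuffle(variant["images"])
        return variant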

Abstract

We introduce MuirBench, a comprehensive benchmark that focuses on the robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, enabling reliable assessment. Evaluating 20 recent multimodal LLMs, our results reveal that even the best-performing models, such as GPT-4o and Gemini Pro, find MuirBench challenging, achieving 68.0% and 49.3% accuracy, respectively. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, and suggest potential pathways for future improvements.

Qualitative Results

For each task, we show the ground truth (in blue) and the choices of GPT-4o, Gemini Pro, and Mantis. Note that the markers are added only for visualization purposes.

Quantitative Results

Results of different models on MuirBench. The first row shows task names and the number of test instances. We see that most models perform close to random choice and remain far from human performance.

Overall performance: the average accuracies of even the most advanced multimodal LLMs on MuirBench are no better than 68%, which is still far from satisfactory utility. The mean accuracies of open-source multimodal LLMs that consider multi-image inputs hover between 23.73% and 44.50%, falling behind the advanced proprietary models. Notably, there is no obvious correlation between model size and performance, indicating the importance of training data and training procedures in developing multimodal LLMs with multi-image understanding capabilities. For certain models and tasks, some results are only on par with, or even below, random guessing.
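For reference, the helper below sketches how the per-task and overall accuracies reported in the table can be computed from a list of prediction records; the record fields (task, prediction, answer) are illustrative assumptions.

    from collections import defaultdict

    def accuracy_by_task(records):
        # records: list of dicts with assumed keys "task", "prediction", "answer".
        correct, total = defaultdict(int), defaultdict(int)
        for r in records:
            total[r["task"]] += 1
            correct[r["task"]] += int(r["prediction"] == r["answer"])
        per_task = {t: correct[t] / total[t] for t in total}
        overall = sum(correct.values()) / sum(total.values())
        return per_task, overall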

Experiment Analysis


1. In which multi-image tasks do multimodal LLMs show relative strengths and weaknesses?

As shown in the figure, we observe that multimodal LLMs perform relatively better on image-text matching, visual retrieval, and diagram understanding. In contrast, multi-image ordering and visual grounding appear to be more challenging for these models, because these tasks require understanding the entire multi-image context and then performing more complex reasoning across images and modalities.


2. Do multimodal LLMs perform worse on the unanswerable set?

We compare performance on the answerable and unanswerable sets for several of the best-performing models. All of the studied models suffer a severe performance drop when answerable instances are replaced with their unanswerable counterparts. A closer look at the error cases reveals that models often fail to abstain when facing unanswerable questions. These observations not only highlight the importance of assessing model behavior under a more realistic setting, but also show that the pairwise design improves the reliability of MuirBench.
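A minimal sketch of this comparison, assuming each prediction record carries a boolean answerable flag in addition to the prediction and ground truth:

    def answerable_gap(records):
        # Accuracy on answerable vs. unanswerable instances, and the drop between them.
        def acc(rs):
            return sum(r["prediction"] == r["answer"] for r in rs) / max(len(rs), 1)
        answerable = [r for r in records if r["answerable"]]
        unanswerable = [r for r in records if not r["answerable"]]
        return acc(answerable), acc(unanswerable), acc(answerable) - acc(unanswerable)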


3. Are errors caused by specific image positions or unanswerable types?

  • As shown in the left figure, we analyze error rates across different input positions of images and report the performance of GPT-4o, GeminiProVision, and Mantis-8B-Idefics2. The highest accuracy is achieved when images are placed in the options, while the highest error rate occurs when images appear in the middle of the question. This consistent trend across models suggests that the position of images within a question correlates with the error rate.
  • As shown in the right figure, the error rate also correlates with the type of unanswerable instance (see the sketch after this list). All three models perform relatively better when only the question is changed to make it incompatible with the original images and options. However, all models are confused when the correct option is removed and fail to choose “none of the other options” in this scenario. Performance on unanswerable instances created by reordering or replacing images is more divergent; notably, GPT-4o performs much better than the other models in these cases.
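Both breakdowns reduce to grouping error rates by a metadata field. The sketch below assumes hypothetical field names such as image_position and unanswerable_type on each prediction record.

    from collections import Counter

    def error_rate_by(records, key):
        # Error rate grouped by an assumed metadata field,
        # e.g. "image_position" or "unanswerable_type".
        errors, counts = Counter(), Counter()
        for r in records:
            counts[r[key]] += 1
            errors[r[key]] += int(r["prediction"] != r["answer"])
        return {k: errors[k] / counts[k] for k in counts}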

BibTeX


        @article{wang2024muirbench,
          title={MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding},
          author={Wang, Fei and Fu, Xingyu and Huang, James Y and Li, Zekun and Liu, Qin and Liu, Xiaogeng and Ma, Mingyu Derek and Xu, Nan and Zhou, Wenxuan and Zhang, Kai and others},
          journal={arXiv preprint arXiv:2406.09411},
          year={2024}
        }