II-Bench

An Image Implication Understanding Benchmark for Multimodal Large Language Models

NeurIPS 2024 D&B Track


Overview of II-Bench. II-Bench comprises 1,222 images spanning six domains: life, art, society, psychology, environment, and others.

🔔News

🔥[2024-09-26]: II-Bench has been accepted to the NeurIPS 2024 Datasets and Benchmarks Track.

🔥[2024-06-26]: We released the II-Bench challenge on EvalAI. You can submit your results and have them evaluated there.😆

🌟[2024-06-25]: We added the results of the latest Claude 3.5 Sonnet, which achieves SOTA performance among all models so far.

Introduction

Rapid advancements in multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to assess the capabilities of MLLMs more accurately. However, the higher-order perceptual capabilities of MLLMs remain largely unexplored. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate models' higher-order perception of images. Our extensive experiments on 20 MLLMs reveal significant performance gaps between models and humans, particularly in abstract domains such as Art and Psychology. Our findings highlight the need for improved emotional understanding in MLLMs and suggest that incorporating emotional polarity information can enhance model performance. II-Bench aims to inspire advancements in multimodal AI research and foster the development of more sophisticated artificial general intelligence (AGI).

II-Benchmark

Overview

We introduce the Image Implication Understanding Benchmark (II-Bench), a new benchmark measuring the higher-order perception, reasoning and comprehension abilities of MLLMs when presented with complex implication images. These images, including abstract artworks, comics and posters, carry visual implications that require an understanding of visual details and reasoning ability. II-Bench reveals whether current MLLMs, leveraging their inherent comprehension abilities, can accurately decode the metaphors embedded within the complex and abstract information presented in these images.

II-Bench contains 1,222 images in total. These images are manually collected and annotated by 50 undergraduate students from various disciplines and institutions, with sources from multiple renowned illustration websites. Each image is paired with one to three manually designed multiple-choice questions, each with six options and exactly one correct answer. The questions cover the metaphors, symbolism, and detailed understanding of the images. The benchmark includes a total of 1,434 multiple-choice questions: 1,399 questions form the test set, and 35 questions form the development and validation sets for few-shot tasks.
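
For illustration, a single II-Bench example might look like the record below; the field names (image, question, options, answer, domain, emotion, rhetoric) are assumptions made for this sketch rather than the official data schema.

# Minimal sketch of one II-Bench-style record and a loader for a JSON-Lines file.
# Field names are illustrative assumptions, not the official schema.
import json

example = {
    "id": "ii_bench_0001",
    "image": "images/ii_bench_0001.jpg",
    "question": "What implication does the image convey?",
    "options": {  # six options, exactly one correct
        "A": "...", "B": "...", "C": "...",
        "D": "...", "E": "...", "F": "...",
    },
    "answer": "C",
    "domain": "life",        # life, art, society, psychology, environment, others
    "emotion": "negative",   # emotional polarity label
    "rhetoric": "metaphor",  # rhetorical device label
}

def load_examples(path):
    """Load a JSON-Lines file of II-Bench-style records (hypothetical format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]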

Statistics

Experiment Results

Leaderboard

We conduct experiments on II-Bench using both open-source and closed-source MLLMs. For each model, we employ eight different settings: 1-shot, 2-shot, 3-shot, zero-shot (None), CoT, Domain, Emotion and Rhetoric. "Emotion" denotes prompts in which the model is informed of the emotional polarity of the image (e.g., positive, negative), "Domain" adds information about the image's domain (e.g., life, environment) to the prompt, and "Rhetoric" adds information about the rhetorical devices used in the image (e.g., metaphor, personification), while "None" indicates standard prompts without any additional information. Uniform prompts are applied across all MLLMs.
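
For concreteness, the sketch below shows one way the text part of these prompt settings could be assembled; the templates and function signature are illustrative assumptions, not the exact prompts used in our experiments.

# Sketch of how the prompt settings described above could be assembled.
# The templates below are illustrative assumptions, not the exact prompts
# used in the II-Bench experiments.

def build_prompt(question, options, setting="None",
                 emotion=None, domain=None, rhetoric=None):
    option_block = "\n".join(f"{key}. {text}" for key, text in sorted(options.items()))
    prompt = f"{question}\n{option_block}\n"

    # Label settings prepend the corresponding image-level information.
    if setting == "Emotion" and emotion:
        prompt = f"The emotional polarity of this image is {emotion}.\n" + prompt
    elif setting == "Domain" and domain:
        prompt = f"This image belongs to the {domain} domain.\n" + prompt
    elif setting == "Rhetoric" and rhetoric:
        prompt = f"This image uses the rhetorical device of {rhetoric}.\n" + prompt

    # CoT asks for step-by-step reasoning before the final option letter.
    if setting == "CoT":
        prompt += "Let's think step by step, then answer with a single option letter."
    else:
        prompt += "Answer with a single option letter."
    return prompt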

Model Overall Life Art Society Psychology Environment Others Positive Neutral Negative
Claude 3.5 Sonnet 80.9 81.4 77.6 80.9 78.3 86.3 83.1 81.1 80.9 80.9
Qwen-VL-MAX 74.8 74.7 71.8 74.6 73.0 76.5 84.6 80.1 74.5 72.9
Gemini-1.5 Pro 73.9 73.7 74.1 74.4 63.2 80.4 83.1 80.1 70.8 75.4
GPT-4o 72.6 72.5 72.9 73.3 68.4 76.5 75.4 78.6 71.2 72.5
GPT-4V 65.9 65.0 69.4 65.3 59.9 76.5 80.0 69.4 66.0 64.0
LLaVA-1.6-34B 73.8 73.8 71.8 73.3 71.1 78.4 81.5 79.1 72.9 72.9
CogVLM2-Llama3-Chat 70.3 68.9 68.2 70.9 67.8 72.5 86.2 69.9 71.1 69.1
MiniCPM-Llama3-V 2.5 69.4 68.4 71.8 69.4 64.5 80.4 78.5 75.0 69.3 66.9
Yi-VL-34B-Chat 67.9 67.5 70.6 67.7 63.8 70.6 76.9 74.0 68.2 64.5
Idefics2-8B 67.7 67.2 74.1 67.7 62.5 74.5 70.8 68.9 67.0 68.4
InternVL-Chat-1.5 66.3 63.6 65.9 68.5 65.8 64.7 76.9 73.5 65.4 64.5
InternLM-XComposer2-VL 62.1 61.7 62.4 62.3 58.6 70.6 66.2 65.8 63.0 58.7
Yi-VL-6B-Chat 61.3 60.9 63.5 60.7 56.6 66.7 72.3 61.7 61.7 61.1
DeepSeek-VL-Chat-7B 60.3 59.0 58.8 58.4 61.8 68.6 76.9 65.8 60.1 58.0
BLIP-2 FLAN-T5-XXL 57.8 57.1 63.5 57.0 53.3 66.7 66.2 67.9 57.2 54.3
Mantis-8B-siglip-Llama3 57.5 56.8 61.2 57.5 53.9 64.7 61.5 59.2 58.0 55.6
InstructBLIP-T5-XXL 56.7 56.2 58.8 58.6 45.4 64.7 64.6 63.3 56.1 54.6
Qwen-VL-Chat 53.4 53.2 49.4 52.1 50.0 60.8 72.3 56.1 52.6 53.6
mPLUG-Owl2 53.2 54.0 56.5 50.5 52.0 60.8 56.9 55.6 52.6 53.1
BLIP-2 FLAN-T5-XL 52.8 53.0 58.8 52.5 42.8 64.7 58.5 56.1 52.9 51.0
InstructBLIP-T5-XL 47.3 45.6 48.2 48.8 44.7 52.9 50.8 46.9 48.3 45.4
Human_avg 90.3 90.0 88.2 91.4 86.6 96.1 92.3 84.7 89.1 92.2
Human_best 98.2 97.9 98.8 98.3 97.4 100.0 100.0 98.0 98.0 98.8

Overall results of different MLLMs and humans on different domains and emotions. The best-performing model in each category is in bold, and the second best is underlined.
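
As a small illustration of how the per-domain and per-emotion columns above can be derived from per-question results, the sketch below groups accuracy by an arbitrary key; the record fields (domain, emotion, correct) are assumptions for this sketch.

# Sketch: aggregate accuracy by domain or emotion from per-question results.
# Record fields ("domain", "emotion", "correct") are illustrative assumptions.
from collections import defaultdict

def accuracy_by_group(results, key):
    """results: list of dicts like {"domain": "life", "emotion": "negative", "correct": True}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {group: 100.0 * hits[group] / totals[group] for group in totals}

# Usage: accuracy_by_group(results, "domain") or accuracy_by_group(results, "emotion")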

Different Prompt Skills

Analysis of Chain-of-Thought (CoT). The results indicate that CoT has no significant effect on improving accuracy. In some cases, particularly with smaller open-source models, accuracy even declines when CoT is used. For example, CogVLM2-Llama3-Chat-19B scores 70.3% without CoT and drops to 69.3% with CoT, and InternVL-Chat-1.5 similarly falls from 66.3% to 63.3%. These findings align with other benchmarks, which show that CoT is not particularly effective for image understanding tasks.

Analysis of Different Types and Domains. To evaluate the impact of different label information on model accuracy, we conduct an ablation study by providing the corresponding label information (Emotion, Domain, Rhetoric) for the images in the prompt. The results show that Emotion labels yield the largest accuracy gains, while Domain and Rhetoric labels bring smaller improvements. This outcome is consistent with how humans comprehend image metaphors: emotion labels provide intuitive, salient cues that align closely with human interpretative processes, thereby facilitating better model performance. In contrast, Domain and Rhetoric labels, while still beneficial, are not as immediately intuitive or universally applicable, resulting in slightly lower effectiveness in improving model accuracy. From the perspective of model training, models already have a general understanding of emotion, whereas the specific terms we define for the Rhetoric and Domain labels appear rarely in pre-training data, so these labels help accuracy less.

Overall results of different prompts on II-Bench. The labels (Emotion, Domain, Rhetoric) indicate that the corresponding information for the images is provided in the prompt. The best-performing model in each category is in bold, and the second best is underlined.

Analysis of Few-shot Examples. The results show that few-shot examples do not improve accuracy; performance tends to drop as more examples are provided. This can be attributed to the models' weaker multi-image capabilities compared to their single-image capabilities, leading to a decline in accuracy as the number of shots increases. Additionally, more shots make the input longer, and the models' limited long-context ability leads to poor performance; Qwen-VL-Max, for example, returns errors when inputs exceed 6,000 tokens. Moreover, chat models generally exhibit good instruction-following ability, which reduces the need for few-shot examples.

Few-shot results of different models on II-Bench. * indicates that the input exceeds the model's context length.
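
The sketch below shows one generic way an n-shot request with interleaved images and text could be assembled; the message structure and field names are illustrative assumptions, not a specific model's API.

# Sketch of assembling an n-shot request with interleaved images and text,
# in a generic "list of content parts" style. The message structure and
# field names are illustrative assumptions, not a specific model's API.

def build_few_shot_messages(dev_examples, test_example, n_shot=2):
    parts = []
    for ex in dev_examples[:n_shot]:
        parts.append({"type": "image", "path": ex["image"]})
        parts.append({"type": "text",
                      "text": f"{ex['question']}\nAnswer: {ex['answer']}"})
    # The test question comes last, with the answer left blank for the model.
    parts.append({"type": "image", "path": test_example["image"]})
    parts.append({"type": "text", "text": f"{test_example['question']}\nAnswer:"})
    return [{"role": "user", "content": parts}]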

Error Analysis

To perform a comprehensive error analysis of GPT-4V's performance on II-Bench, we randomly select 100 erroneous samples, drawn from each domain in proportion to its representation in the dataset. These samples are meticulously analyzed by expert annotators. GPT-4V's errors can be categorized into the following types: Metaphorical Misunderstanding, Detail Misunderstanding, Detail Ignorance, Surface-Level Interpretation, Reasoning Error, Reject to Answer, and Answer Extraction Error.
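
A minimal sketch of the stratified sampling described above (drawing erroneous samples per domain in proportion to each domain's share of the dataset); the function and field names are illustrative assumptions.

# Sketch of the stratified error sampling described above: draw erroneous
# samples per domain in proportion to each domain's share of the dataset.
# Function and field names are illustrative assumptions; rounding may make
# the quotas sum to slightly more or less than `total`.
import random

def stratified_error_sample(errors, domain_counts, total=100, seed=0):
    rng = random.Random(seed)
    dataset_size = sum(domain_counts.values())
    sample = []
    for domain, count in domain_counts.items():
        quota = round(total * count / dataset_size)
        pool = [e for e in errors if e["domain"] == domain]
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample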

GPT-4V error response distribution.

Error Examples

Correct Examples

BibTeX

@article{liu2024ii,
  title={II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models},
  author={Liu, Ziqiang and Fang, Feiteng and Feng, Xi and Du, Xinrun and Zhang, Chenhao and Wang, Zekun and Bai, Yuelin and Zhao, Qixuan and Fan, Liyang and Gan, Chengguang and others},
  journal={arXiv preprint arXiv:2406.05862},
  year={2024}
}