Multimodal Large Language Models (MLLMs), which leverage the power of Large Language Models (LLMs), have recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence (AGI). However, achieving AGI requires more than comprehension and reasoning: a crucial additional capability is effective planning in diverse scenarios, i.e., making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning ability of current MLLMs across varied scenarios remains underexplored, leaving a significant gap in our understanding of their full potential.
In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life. It is constructed through a semi-automatic process utilizing egocentric videos, complemented by manual verification. Grounded in a first-person perspective, it mirrors the way humans approach problem-solving in everyday life. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. To further improve the planning proficiency of current MLLMs, we propose a training-free approach based on multimodal Chain-of-Thought (CoT) prompting, developed by investigating the effectiveness of various multimodal prompts for complex planning. Our approach enhances the performance of GPT-4V by 10.24% on EgoPlan-Bench2 without additional training. Our work not only sheds light on the current limitations of MLLMs in planning, but also provides insights for future enhancements in this critical area.
This repository describes the usage of our proposed EgoPlan-Bench2 and provides the corresponding code for benchmarking MLLMs and for enhancing GPT-4V's performance via multimodal CoT prompting. You are welcome to evaluate your own models on our benchmark and to explore methods for enhancing their egocentric planning capabilities!
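As a rough illustration of the evaluation protocol (not the repository's actual evaluation script), the sketch below scores a model on multiple-choice questions. The file name `egoplan_bench2_questions.json` and the `answer` field are placeholders for whatever schema the released benchmark uses, and the model is abstracted as a callable that returns an option label.

```python
import json


def evaluate(question_file, answer_question):
    """Compute multiple-choice accuracy on an EgoPlan-Bench2-style question file.

    `answer_question` is any callable mapping a sample dict to the option
    label the model selects (e.g., "A").
    """
    with open(question_file, "r", encoding="utf-8") as f:
        samples = json.load(f)

    correct = sum(
        # "answer" is a placeholder field name for the ground-truth option.
        answer_question(sample) == sample["answer"]
        for sample in samples
    )
    return correct / len(samples)


if __name__ == "__main__":
    # Trivial baseline that always picks option "A", just to show the interface.
    accuracy = evaluate("egoplan_bench2_questions.json", lambda sample: "A")
    print(f"Accuracy: {accuracy:.2%}")
```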
Figure 2. Left: Scenario distribution of EgoPlan-Bench2, which covers 4 major domains and 24 fine-grained scenarios. Right: Video length distribution. Our benchmark covers a full spectrum of video durations, ranging from a few seconds to five minutes.
Figure 3. Models' performance across different scenarios and video lengths.
Figure 4. The accuracy of 21 MLLMs across the 4 major domains of daily human life.
Figure 5. The pipeline of our training-free multimodal Chain-of-Thought (CoT) prompting method. We utilize predicted action sequences as a prompt representing historical task progress, and bounding boxes of key objects as a prompt to enhance the understanding of visual observations. By combining these elements with CoT reasoning and a self-consistency mechanism, we strengthen GPT-4V's planning capabilities without additional training.
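To make the pipeline in Figure 5 concrete, here is a minimal, hypothetical sketch of how the textual cues could be assembled into a CoT prompt and combined with self-consistency voting. `query_model` stands for any wrapper around a GPT-4V-style API that takes a prompt plus video frames and returns an option letter; it is not part of this repository's actual implementation, and the prompt wording is illustrative only.

```python
from collections import Counter


def build_cot_prompt(task_goal, predicted_actions, key_objects, choices):
    """Assemble the textual part of a multimodal CoT prompt.

    `predicted_actions` summarizes historical task progress; `key_objects`
    maps object names to bounding boxes detected in the current observation.
    """
    progress = "; ".join(predicted_actions) if predicted_actions else "none"
    objects = ", ".join(f"{name} at {box}" for name, box in key_objects.items())
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (
        f"Task goal: {task_goal}\n"
        f"Actions completed so far: {progress}\n"
        f"Key objects in the current frame (with bounding boxes): {objects}\n"
        f"Candidate next actions:\n{options}\n"
        "Think step by step about the task progress and the current visual "
        "observation, then answer with a single option letter."
    )


def self_consistent_answer(query_model, prompt, frames, n_samples=5):
    """Query the model several times and return the majority-voted option."""
    votes = Counter(query_model(prompt, frames) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```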
Dataset example (question, candidate choices, and ground-truth answer).
@article{qiu2024egoplanbench2,
title = {EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios},
author = {Qiu, Lu and Ge, Yuying and Chen, Yi and Ge, Yixiao and Shan, Ying and Liu, Xihui},
year = {2024},
journal = {arXiv preprint arXiv:2412.04447}
}