mathverse-cuhk.github.io - MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Description: Evaluating mathematical reasoning of foundation models in visual contexts


The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within their textual questions, which potentially assists MLLMs in deducing answers without truly interpreting the input diagrams.

To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into 6 distinct versions, each offering a different degree of information content across the two modalities, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether, and how much, MLLMs can truly understand the visual diagrams for mathematical reasoning.
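As a minimal sketch of this design, the snippet below models one problem expanded into its six versions. The version names follow the MathVerse paper, while the field names and schema are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass
from typing import Optional

# Version names as used in the MathVerse paper; each version shifts a
# different amount of the problem's information between the textual
# question and the diagram.
VERSIONS = [
    "Text Dominant",
    "Text Lite",
    "Text Only",        # the diagram is removed entirely
    "Vision Intensive",
    "Vision Dominant",
    "Vision Only",      # the question is rendered entirely in the diagram
]

@dataclass
class MathVerseSample:
    """One of the 6 versions of a single problem (hypothetical schema)."""
    problem_id: str
    version: str                 # one of VERSIONS
    question_text: str           # may be empty for "Vision Only"
    diagram_path: Optional[str]  # None for "Text Only"
    answer: str

# 2,612 problems x 6 versions contribute roughly 15K test samples in total.
```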

With MathVerse, we unveil that most existing MLLMs struggle to understand math diagrams, relying heavily on the textual questions. Surprisingly, some of them even achieve 5%+ higher accuracy without the visual input, e.g., Qwen-VL-Max and InternLM-XComposer2. In contrast, GPT-4V and ShareGPT4V demonstrate relatively better comprehension of the visual content for mathematical reasoning. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs.
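As a rough illustration of the comparison behind this finding, the sketch below computes the accuracy gap between a text-only version and a diagram-included version of the same problems. Here `evaluate` is an assumed helper, not part of any released MathVerse code, that returns accuracy for a given model and problem version:

```python
from typing import Callable

def visual_reliance_gap(
    model: str,
    evaluate: Callable[[str, str], float],
) -> float:
    """Accuracy without the diagram minus accuracy with it.

    A positive gap means the model scores higher when the diagram is
    removed, suggesting it answers from the text rather than the image.
    `evaluate` is a hypothetical helper: (model, version) -> accuracy.
    """
    acc_text_only = evaluate(model, "Text Only")         # diagram removed
    acc_with_diagram = evaluate(model, "Text Dominant")  # diagram included
    return acc_text_only - acc_with_diagram

# A gap above 0.05 would correspond to the "5%+ higher accuracy without
# the visual input" reported for some models above.
```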
