The most advanced MLLMs (e.g. Gemini-1.5) still struggle to comprehend multimodal documents. All MLLMs exhibit poor performance on image needles. MLLMs fail to recognize the exact number of images in ...