James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu, Tristan Naumann, Muhao Chen, Hoifung Poon

Paper: https://arxiv.org/pdf/2511.19417

<aside>

TL;DR

We introduce BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating conversation-based collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners. By combining the complementary strengths of perceiver and reasoner agents, BeMyEyes avoids the need to train large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks multimodal reasoning capabilities for LLMs. Notably, we demonstrate that, using Qwen2.5-VL-7B as its “eyes”, DeepSeek-R1 can outperform large-scale VLMs such as GPT-4o across a wide range of knowledge-intensive, multimodal reasoning tasks.

</aside>


bme_bars.jpg

Figure 1: Using BeMyEyes enables text-only models such as DeepSeek-R1 and GPT-4 to reach state-of-the-art performance on challenging multimodal benchmarks without modifying their parameters. Grey bars denote text-only baselines, where models receive only the benchmark questions without images. Dotted lines indicate GPT-4o performance.

Turning LLMs into VLMs is costly

Large language models (LLMs) excel at reasoning over textual information but cannot directly process other modalities. One major approach to extending LLMs to new modalities (e.g., vision) is to build vision-language models (VLMs) that couple pre-trained visual encoders with powerful LLM backbones. However, training or adapting such models typically requires substantial computational resources and large-scale multimodal datasets, and often demands non-trivial architectural modifications when extending to new modalities.

BeMyEyes: A Multi-Agent Approach to Modality Extension

We introduce BeMyEyes, a multi-agent framework for extending LLMs to new modalities with the help of perceiver agents that serve as the LLMs' “eyes”. In this paradigm, an LLM acts as a reasoning agent that leverages its extensive world knowledge and advanced reasoning capabilities, while collaborating with a perceiver agent that processes and conveys information from non-textual inputs.

main.jpg

Figure 2: Overview of the BeMyEyes framework.

More specifically, our framework orchestrates collaboration between a small, adaptable VLM as the perceiver agent, and a large, frozen LLM as the reasoner agent through multi-turn conversations. This collaboration combines the complementary strengths of small VLMs and large LLMs: the perceiver serves as the “eyes” that ground the task in visual evidence, and the reasoner acts as an intelligent expert that drives decision making. To further improve collaboration, we introduce a data synthesis and supervised fine-tuning pipeline that allows us to train perceiver agents by distilling strong perceptual and instruction-following capabilities from larger VLMs.
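To make the orchestration concrete, below is a minimal sketch of the multi-turn perceiver–reasoner loop. It assumes generic `call_reasoner` and `call_perceiver` chat functions and an illustrative "FINAL ANSWER:" stopping convention; these names and prompts are our own placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of a BeMyEyes-style perceiver-reasoner loop.
# `call_reasoner` (frozen text-only LLM, e.g. DeepSeek-R1) and `call_perceiver`
# (small VLM, e.g. Qwen2.5-VL-7B) are assumed chat interfaces supplied by the user.

def solve_multimodal_task(question, image, call_reasoner, call_perceiver, max_turns=5):
    """Let a text-only reasoner answer a question about an image it cannot see
    by querying a perceiver agent over multiple conversation turns."""
    reasoner_msgs = [{
        "role": "user",
        "content": (
            "You cannot see the image, but a perceiver agent can. Ask it questions "
            "to gather visual evidence. When confident, reply with 'FINAL ANSWER: ...'.\n"
            f"Task: {question}"
        ),
    }]
    for _ in range(max_turns):
        reasoner_reply = call_reasoner(reasoner_msgs)  # reasoner drives decision making
        reasoner_msgs.append({"role": "assistant", "content": reasoner_reply})
        if "FINAL ANSWER:" in reasoner_reply:
            return reasoner_reply.split("FINAL ANSWER:", 1)[1].strip()
        # The perceiver grounds the reasoner's latest question in the visual input.
        observation = call_perceiver(image=image, prompt=reasoner_reply)
        reasoner_msgs.append({"role": "user", "content": f"Perceiver: {observation}"})
    # If no explicit answer was produced, return the reasoner's last message.
    return reasoner_msgs[-1]["content"]
```

Because the reasoner only ever exchanges text, its parameters stay frozen and any sufficiently capable LLM can be plugged in; extending to a new modality amounts to swapping in a different perceiver agent.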

BeMyEyes offers three key advantages: