Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Zhixin Zhang1, Yiyuan Zhang1,2†, Xiaohan Ding3, Xiangyu Yue1
1MMLab, CUHK   2Shanghai AI Lab   3Tencent
†Corresponding Author

Teaser

Vision Search Assistant acquires unknown visual knowledge through web search.
Above is an intuitive comparison of answering the user's question about an unseen image.
Vision Search Assistant is built on LLaVA-1.6-7B, and its ability to answer questions
about unseen images outperforms state-of-the-art models including LLaVA-1.6-34B,
Qwen2-VL-72B, and InternVL2-76B.

Abstract

Search engines enable the retrieval of unknown information with text. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large Vision-Language Models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question regarding that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical due to heavy computational burdens. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments conducted on both open-set and closed-set QA benchmarks demonstrate that Vision Search Assistant significantly outperforms other models and can be widely applied to existing VLMs.

Vision Search Assistant

framework

Illustration of the Vision Search Assistant framework. We first identify the critical objects and generate their descriptions considering their correlations, named the Correlated Formulation, using the Vision-Language Model (VLM). We then use an LLM to generate sub-questions that lead to the final answer, which is referred to as the Planning Agent. The web pages returned by the search engine are analyzed, selected, and summarized by the same LLM, which is referred to as the Searching Agent. We use the original image, the user's prompt, and the Correlated Formulation together with the obtained web knowledge to generate the final answer. Vision Search Assistant produces reliable answers, even for novel images, by leveraging the collaboration between the VLM and web agents to gather visual information from the web effectively.
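
To make the data flow concrete, below is a minimal Python sketch of the four stages described above, assuming the VLM, the two LLM agents, and the search engine are supplied as callables. The function names and signatures are illustrative assumptions, not the released implementation.

from typing import Callable, List

def vision_search_assistant(
    image,
    user_prompt: str,
    correlate: Callable[[object, str], str],         # VLM: builds the Correlated Formulation
    plan: Callable[[str, str], List[str]],           # Planning Agent: prompt -> sub-questions
    search: Callable[[str], List[str]],              # search engine: query -> web pages
    summarize: Callable[[str, List[str]], str],      # Searching Agent: pages -> web knowledge
    answer: Callable[[object, str, str, str], str],  # VLM: generates the final answer
) -> str:
    # 1. Correlated Formulation: critical objects described with their correlations.
    formulation = correlate(image, user_prompt)
    # 2. Planning Agent: decompose the user's prompt into sub-questions.
    sub_questions = plan(user_prompt, formulation)
    # 3. Searching Agent: search each sub-question, then analyze/select/summarize the pages.
    web_knowledge = "\n".join(summarize(q, search(q)) for q in sub_questions)
    # 4. Final answer from the image, prompt, Correlated Formulation, and web knowledge.
    return answer(image, user_prompt, formulation, web_knowledge)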

Web Knowledge Search: The Chain of Search

The core of Web Knowledge Search is an iterative algorithm named Chain of Search, which is designed to obtain comprehensive web knowledge about the correlated formulations.

framework

Here, we derive the update of the directed graph for k = 1, 2, ..., and web knowledge is progressively extracted at each update. For example, in the first update we generate sub-questions based on V_0. Each question is sent to the search engine, and the model receives a set of returned web pages. The content of those pages is summarized by the LLM to obtain the web knowledge at the first step, X_w^1.
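
The loop below is a minimal sketch of this iterative procedure. It assumes the planning LLM, the search engine, and the summarization step are provided as callables, and that iteration stops after a fixed budget or once no further sub-questions are generated; these are simplifying assumptions for illustration rather than the exact released algorithm.

from typing import Callable, List

def chain_of_search(
    formulation: str,                                 # V_0: the correlated formulation
    plan: Callable[[str, List[str]], List[str]],      # LLM: generate sub-questions
    search: Callable[[str], List[str]],               # search engine: query -> web pages
    summarize: Callable[[str, List[str]], str],       # LLM: pages -> summarized knowledge
    max_steps: int = 3,
) -> List[str]:
    knowledge: List[str] = []                         # collected [X_w^1, X_w^2, ...]
    for k in range(1, max_steps + 1):
        # At k = 1 the sub-questions are based on V_0 alone; later updates also
        # condition on the web knowledge gathered so far.
        sub_questions = plan(formulation, knowledge)
        if not sub_questions:                         # nothing left to ask
            break
        summaries = [summarize(q, search(q)) for q in sub_questions]
        knowledge.append(" ".join(summaries))         # X_w^k for this update
    return knowledge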


Samples

main

We present a series of demos of Vision Search Assistant on novel images, novel events, and in-the-wild scenarios. Vision Search Assistant shows promising potential as a powerful multimodal search engine.

Ablation


Ablation Study on "What to Search". We use object-level descriptions to avoid the visual redundancy of the image. If we instead use an image-level caption, the search agent cannot precisely focus on the key information (the handbag in this figure).
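
The snippet below sketches the difference between the two choices, assuming a hypothetical open-vocabulary detector (detect_objects) and a VLM captioner (caption) supplied as callables; neither name comes from the released code.

from typing import Callable, List, Sequence, Tuple
from PIL import Image

def object_level_queries(
    image: Image.Image,
    user_prompt: str,
    detect_objects: Callable[[Image.Image], Sequence[Tuple[int, int, int, int]]],
    caption: Callable[[Image.Image], str],
) -> List[str]:
    # Image-level baseline (may miss the key object, e.g. the handbag in the figure):
    #   return [f"{caption(image)} {user_prompt}"]
    # Object-level descriptions keep each search query focused on one critical region.
    return [f"{caption(image.crop(box))} {user_prompt}" for box in detect_objects(image)]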


Ablation Study on "How to Search". We propose the Chain of Search to progressively obtain related web knowledge for VLMs. In this sample, the agent should first search for "Which conference did this paper submit to" and then find "Best papers of ICML 2024". In contrast, it is difficult to obtain the required knowledge directly, since page-rank-based retrieval favors heavily hyperlinked pages over exact relevance, especially when multi-hop associations are involved.


Ablation Study on "Complex Scenarios". We use visual correlation to improve performance in multiple-object scenarios. In this sample, the caption of Biden alone cannot answer questions about the group debate; the visual correlation with Trump ("debate" in this demo) effectively improves the answer quality.

Open-Set Results


Open-Set Evaluation. We conduct a human expert evaluation on open-set QA tasks. Vision Search Assistant significantly outperforms Perplexity.ai Pro and GPT-4o-Web across three key criteria: factuality, relevance, and supportiveness.

Closed-Set Results


Closed-Set Evaluation on the LLaVA-W benchmark. We use GPT-4o (0806) for evaluation. Naive search here denotes the VLM with Google image search.

BibTeX

If you find our work useful, please cite our paper. The BibTeX entry is provided below:
@article{zhang2024visionsearchassistantempower,
  title={Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines},
  author={Zhang, Zhixin and Zhang, Yiyuan and Ding, Xiaohan and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2410.21220},
  year={2024}
}