Illustration of the Vision Search Assistant framework. We first identify the critical objects and generate their descriptions, taking their correlations into account, using the Vision Language Model (VLM); we call this the Correlated Formulation. We then use the LLM to generate sub-questions that lead to the final answer, which is referred to as the Planning Agent. The web pages returned by the search engine are analyzed, selected, and summarized by the same LLM, which is referred to as the Searching Agent. Finally, we use the original image, the user's prompt, and the Correlated Formulation, together with the obtained web knowledge, to generate the final answer. Vision Search Assistant produces reliable answers, even for novel images, by leveraging the collaboration between the VLM and web agents to gather visual information from the web effectively.
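At a high level, this pipeline can be summarized by the minimal sketch below. It is a sketch only: `vlm_describe`, `llm_plan`, `web_search`, `llm_summarize`, and `vlm_answer` are hypothetical stand-ins for the VLM, the Planning Agent, the search engine, the Searching Agent, and the final answer generation, and are not names from the released code.

```python
# Minimal sketch of the Vision Search Assistant pipeline (hypothetical helpers
# stand in for the actual VLM, LLM, and search-engine calls).
from typing import Callable, List


def vision_search_assistant(
    image,                       # input image
    prompt: str,                 # user's question
    vlm_describe: Callable,      # VLM: image + prompt -> correlated object descriptions
    llm_plan: Callable,          # Planning Agent: descriptions + prompt -> sub-questions
    web_search: Callable,        # search engine: query -> list of web pages
    llm_summarize: Callable,     # Searching Agent: (query, pages) -> condensed web knowledge
    vlm_answer: Callable,        # VLM: image + prompt + formulation + knowledge -> answer
) -> str:
    # 1. Correlated Formulation: describe critical objects and their correlations.
    correlated_formulation = vlm_describe(image, prompt)

    # 2. Planning Agent: break the request into searchable sub-questions.
    sub_questions: List[str] = llm_plan(correlated_formulation, prompt)

    # 3. Searching Agent: query the web and summarize the returned pages.
    web_knowledge = []
    for question in sub_questions:
        pages = web_search(question)
        web_knowledge.append(llm_summarize(question, pages))

    # 4. Answer generation: combine image, prompt, formulation, and web knowledge.
    return vlm_answer(image, prompt, correlated_formulation, web_knowledge)
```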
The core of Web Knowledge Search is an iterative algorithm named Chain of Search, which is designed to obtain comprehensive web knowledge about the correlated formulations.
Here, we illustrate how the directed graph is updated for k = 1, 2, ..., with web knowledge progressively extracted at each update. For example, in the first update, we generate sub-questions based on V_0. Each question is sent to the search engine, and the model receives a set of returned web pages. The content of those pages is summarized by the LLM to obtain the web knowledge at the first step, X_w^1.
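A minimal sketch of one possible Chain of Search loop is shown below. The fixed step count and the refinement of the formulation between steps are assumptions for illustration; `llm_subquestions`, `web_search`, `llm_summarize`, and `llm_refine` are hypothetical stand-ins for the LLM and search-engine calls.

```python
# Hypothetical sketch of the Chain of Search loop: at step k, sub-questions are
# generated from the current formulation, each is searched, and the returned
# pages are summarized into web knowledge X_w^k that informs the next step.
from typing import Callable, List


def chain_of_search(
    v0: str,                          # initial correlated formulation V_0
    llm_subquestions: Callable,       # formulation -> list of sub-questions
    web_search: Callable,             # query -> web pages
    llm_summarize: Callable,          # (query, pages) -> summary
    llm_refine: Callable,             # (formulation, knowledge) -> updated formulation
    max_steps: int = 3,               # assumed stopping criterion
) -> List[str]:
    formulation = v0
    web_knowledge: List[str] = []     # X_w^1, X_w^2, ...
    for k in range(1, max_steps + 1):
        sub_questions = llm_subquestions(formulation)
        summaries = [llm_summarize(q, web_search(q)) for q in sub_questions]
        x_w_k = "\n".join(summaries)  # web knowledge obtained at step k
        web_knowledge.append(x_w_k)
        # Refine the formulation with the new knowledge before step k + 1.
        formulation = llm_refine(formulation, x_w_k)
    return web_knowledge
```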
We present a series of demos of Vision Search Assistant on novel images, novel events, and in-the-wild scenarios. Vision Search Assistant demonstrates promising potential as a powerful multimodal search engine.
Ablation Study on "What to Search". We use the object-level description to avoid the visual redundancy of the image. If we use the image-based caption, the search agent can not precisely focus on the key information (the handbag in this figure).
Ablation Study on "How to search". We propose the "Chain of Search" to progressively obtain related web knowledge for VLMs. In this sample, the agent should first search for "Which conference did this paper submit to" and then find "Best papers of ICML 2024". Conversely, it's difficult to directly obtain the required knowledge since the page-rank method prefers more hyper-link pages instead of exact relevance, especially when there are multi-hop associations.
Ablation Study on "Complex Scenarios". We use the visual correlation to improve the ability in multiple-object scenarios. In this sample, the caption of Biden can not answer the questions on the groupwise debate, the visual correlation ("debate" in this demo) between Trump can effectively improve the answer quality.
Open-Set Evaluation. We conduct a human expert evaluation on open-set QA tasks. Vision Search Assistant significantly outperforms Perplexity.ai Pro and GPT-4o-Web across three key criteria: factuality, relevance, and supportiveness.
Closed-Set Evaluation on the LLaVA-W benchmark. We use GPT-4o (0806) for evaluation. Naive search here denotes the VLM with Google image search.
@article{zhang2024visionsearchassistantempower,
title={Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines},
author={Zhang, Zhixin and Zhang, Yiyuan and Ding, Xiaohan and Yue, Xiangyu},
journal={arXiv preprint arXiv:2410.21220},
year={2024}
}