clova-tool.github.io - CLOVA


Leveraging large language models (LLMs) to integrate off-the-shelf tools (e.g., visual models and image processing functions) is a promising research direction for building powerful visual assistants that solve diverse visual tasks. However, existing methods rarely explore learning capability: they freeze the tools after deployment, which limits generalization to new environments that require specific knowledge. In this paper, we propose CLOVA, a Closed-Loop Visual Assistant, to address this limitation.

CLOVA has three phases: inference, reflection, and learning, as shown in Figure 2. In the inference phase, CLOVA uses LLMs to generate programs and executes the corresponding tools to solve the task. The reflection phase introduces a multimodal global-local reflection scheme that uses LLMs to generate critiques, identifying which tool needs to be updated. During learning, we employ three methods to collect training data and use a training-validation prompt tuning scheme to update the tools.
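The three-phase loop described above can be sketched as a minimal toy in Python. All function names, the stub tools, and the repair behavior here are illustrative assumptions for exposition, not CLOVA's actual implementation; the point is only the control flow: infer, check, reflect on failure, update the flagged tool, and retry.

```python
# Toy sketch of an inference -> reflection -> learning closed loop.
# Everything below is a hypothetical stand-in, not the authors' code.

def run_closed_loop(task, tools, max_rounds=3):
    result = None
    for _ in range(max_rounds):
        program = plan_and_generate(task)        # inference: LLM -> plan -> program
        result = execute(program, tools)         # run the selected tools
        if check(result, task):                  # task solved: exit the loop
            return result, True
        faulty = reflect(task, program, result)  # reflection: name the failing tool
        tools[faulty] = learn(tools[faulty])     # learning: update that tool
    return result, False

# --- toy stand-ins so the sketch runs end to end ---
def plan_and_generate(task):
    return ["detect", "count"]                   # fixed two-step program

def execute(program, tools):
    out = 3                                      # pretend the image has 3 objects
    for step in program:
        out = tools[step](out)
    return out

def check(result, task):
    return result == task["answer"]

def reflect(task, program, result):
    return "count"                               # critique blames the counting tool

def learn(tool):
    return lambda x: x + 1                       # the "tuned" tool now counts correctly

task = {"question": "how many?", "answer": 4}
tools = {"detect": lambda x: x, "count": lambda x: x}  # count is initially wrong
result, ok = run_closed_loop(task, tools)
print(result, ok)  # first round fails, the loop repairs count, second round succeeds
```

Here the first pass returns the wrong count, the reflection step flags the counting tool, and the updated tool succeeds on the retry, which is the closed-loop behavior the phases are meant to produce.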

Our inference phase is based on VISPROG, but differs in that CLOVA first uses LLMs to generate plans and then generates programs based on those plans, rather than generating programs directly. Plans can be seen as intermediate reasoning chains that benefit both the inference and reflection phases. Given a task, CLOVA selects in-context examples from a demonstration pool, including both correct examples and incorrect examples paired with error critiques. These examples are used to create prompts that are then sent to LLMs.
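The example-selection and prompt-construction step might look like the sketch below. The naive word-overlap retrieval, the pool format, and the prompt layout are all assumptions made for illustration; the source only says that correct and incorrect examples (with critiques) are selected from a pool and assembled into a prompt.

```python
# Hypothetical demonstration-pool retrieval and prompt assembly.
# Word-overlap scoring is a placeholder for whatever selector CLOVA uses.

def select_examples(task, pool, k=2):
    """Pick the k pool entries sharing the most words with the task."""
    task_words = set(task.lower().split())
    def overlap(ex):
        return len(task_words & set(ex["task"].lower().split()))
    return sorted(pool, key=overlap, reverse=True)[:k]

def build_prompt(task, examples):
    """Format demonstrations (with critiques for failed ones) plus the new task."""
    parts = []
    for ex in examples:
        block = f"Task: {ex['task']}\nPlan: {ex['plan']}"
        if ex.get("critique"):                  # incorrect example: keep its error critique
            block += f"\nCritique: {ex['critique']}"
        parts.append(block)
    parts.append(f"Task: {task}\nPlan:")        # the LLM completes this final plan
    return "\n\n".join(parts)

pool = [
    {"task": "count the red cars", "plan": "DETECT -> COUNT", "critique": None},
    {"task": "count dogs in the image", "plan": "DETECT -> COUNT",
     "critique": "detector missed small objects"},
    {"task": "caption the photo", "plan": "CAPTION", "critique": None},
]
query = "count cats in the image"
prompt = build_prompt(query, select_examples(query, pool))
print(prompt)
```

Including failed demonstrations together with their critiques, rather than only successes, is what lets the LLM avoid repeating a known failure mode when it drafts the new plan.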