

- Introducing the Multimodal Conversable Agent and the LLaVA Agent to enhance LMM functionalities.
- Users can input text and images simultaneously using the <img img_path> tag to specify image loading.
- Demonstrated through the GPT-4V notebook.
- Demonstrated through the LLaVA notebook.
Introduction
Large multimodal models (LMMs) augment large language models (LLMs) with the ability to process multi-sensory data. This blog post and the latest AutoGen update concentrate on visual comprehension. Users can input images, pose questions about them, and receive text-based responses from these LMMs. We currently support the gpt-4-vision-preview model from OpenAI and the LLaVA model from Microsoft.
Here, we emphasize the Multimodal Conversable Agent and the LLaVA Agent due to their growing popularity.
GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model, fine-tuned from Llama-2.
Installation
Incorporate the lmm feature during AutoGen installation:
pip install "pyautogen[lmm]"
Usage
A simple syntax has been defined to incorporate both messages and images within a single string, as in an in-context learning prompt that mixes example images with text. The MultimodalConversableAgent interprets the input prompt, extracting images from local or internet sources.
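To make the tag syntax concrete, here is a minimal sketch of how a prompt containing <img ...> tags could be split into text and image parts. It is not AutoGen's actual parsing code: the split_prompt helper, the dictionary layout, and the file names and URL in the prompt are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Matches tags such as <img happy.jpg> or <img http://example.com/sad.jpg>.
IMG_TAG = re.compile(r"<img\s+([^>]+)>")


def split_prompt(prompt: str) -> List[Dict[str, str]]:
    """Split a prompt into ordered text and image parts (hypothetical helper)."""
    parts: List[Dict[str, str]] = []
    cursor = 0
    for match in IMG_TAG.finditer(prompt):
        text = prompt[cursor:match.start()]
        if text:
            parts.append({"type": "text", "text": text})
        source = match.group(1).strip()
        # Distinguish internet sources from local paths by their scheme.
        kind = "url" if source.startswith(("http://", "https://")) else "local"
        parts.append({"type": "image", "source": source, "kind": kind})
        cursor = match.end()
    tail = prompt[cursor:]
    if tail:
        parts.append({"type": "text", "text": tail})
    return parts


# An in-context learning prompt; the image names and URL are placeholders.
prompt = """You are an image classifier for facial expressions.
<img happy.jpg> depicts a happy expression.
<img http://example.com/sad.jpg> represents a sad expression.
Now, identify the facial expression of this individual: <img unknown.png>
"""

parts = split_prompt(prompt)
image_sources = [p["source"] for p in parts if p["type"] == "image"]
```

The sketch only classifies each image source as local or remote. Actually loading the image (reading the file or fetching the URL) is left out, since that step depends on the agent and model in use.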
Advanced Usage
Similar to other AutoGen agents, multimodal agents support multi-round dialogues with other agents, code generation, factual queries, and management via a GroupChat interface. For example, the FigureCreator in our GPT-4V notebook and LLaVA notebook integrates two agents: a coder (an AssistantAgent) and critics (a multimodal agent).
The coder drafts Python code for visualizations, while the critics provide insights for enhancement. Collaboratively, these agents aim to refine visual outputs.
With human_input_mode=ALWAYS, you can also contribute suggestions for better visualizations.
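The coder and critic collaboration can be pictured as a simple refinement loop. The sketch below is not the FigureCreator implementation from the notebooks: refine_figure, stub_coder, and stub_critic are hypothetical stand-ins, with plain functions replacing the LLM-backed agents and an optional human callback playing the role of human input.

```python
from typing import Callable, List, Optional


def refine_figure(
    coder: Callable[[str, List[str]], str],
    critic: Callable[[str], str],
    task: str,
    rounds: int = 2,
    human: Optional[Callable[[str], str]] = None,
) -> str:
    """Alternate between a coder and a critic for a fixed number of rounds."""
    feedback: List[str] = []
    code = coder(task, feedback)
    for _ in range(rounds):
        # The critic reviews the current draft and suggests an improvement.
        feedback.append(critic(code))
        # Optionally collect a human suggestion on the same draft.
        if human is not None:
            feedback.append(human(code))
        # The coder redrafts using all feedback gathered so far.
        code = coder(task, feedback)
    return code


def stub_coder(task: str, feedback: List[str]) -> str:
    """Stand-in for an LLM coder: records the feedback it was given."""
    notes = "".join(f"# addressed: {item}\n" for item in feedback)
    return f"# task: {task}\n{notes}plt.plot(x, y)"


def stub_critic(code: str) -> str:
    """Stand-in for a multimodal critic: returns a canned suggestion."""
    if "axis labels" not in code:
        return "add axis labels"
    return "increase font size"


final_code = refine_figure(stub_coder, stub_critic, "plot weather data", rounds=2)
```

In a real setup the critic would look at the rendered figure rather than the code text, and the number of rounds would be governed by the chat configuration instead of a fixed counter.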