Users can input text and images simultaneously, using the `<img img_path>` tag to specify image loading. We now support the `gpt-4-vision-preview` model from OpenAI and the `LLaVA` model from Microsoft.
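For instance, a single prompt string can interleave text with an image reference; the URL below is only a placeholder:

```python
# One message mixing text and an image; the <img ...> tag accepts a local path or a URL.
prompt = """Describe what is shown in this figure.
<img https://example.com/sales_chart.png>"""
```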
Here, we emphasize the Multimodal Conversable Agent and the LLaVA Agent due to their growing popularity.
GPT-4V represents the forefront in image comprehension, while LLaVA is an efficient model fine-tuned from Llama-2.
Incorporate the `lmm` feature during AutoGen installation.
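A minimal example, assuming the package is published on PyPI as `pyautogen`: `pip install "pyautogen[lmm]"`.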
The `MultimodalConversableAgent` interprets the input prompt, extracting images from local or internet sources.
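A minimal sketch of this, assuming an OpenAI config for `gpt-4-vision-preview` and a placeholder image URL:

```python
from autogen import UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Hypothetical config for illustration; substitute your own API key.
config_list = [{"model": "gpt-4-vision-preview", "api_key": "<your-openai-api-key>"}]

image_agent = MultimodalConversableAgent(
    name="image-explainer",
    llm_config={"config_list": config_list, "temperature": 0.5},
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=0,
    code_execution_config=False,
)

# The agent pulls in the image referenced by the <img ...> tag before querying the LMM.
user_proxy.initiate_chat(
    image_agent,
    message="""What objects appear in this picture?
<img https://example.com/street_scene.jpg>""",
)
```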
The `FigureCreator` in our GPT-4V notebook and LLaVA notebook integrates two agents: a coder (an `AssistantAgent`) and critics (a multimodal agent).
The coder drafts Python code for visualizations, while the critics provide insights for enhancement. Collaboratively, these agents aim to refine visual outputs.
With `human_input_mode="ALWAYS"`, you can also contribute suggestions for better visualizations.
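The notebooks package this loop inside the `FigureCreator` agent itself; the sketch below drives a similar coder/critic exchange by hand, with placeholder model configs, file names, and prompts rather than the notebook's actual implementation:

```python
from autogen import AssistantAgent, UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Hypothetical model configs for illustration only.
coder_config = {"config_list": [{"model": "gpt-4", "api_key": "<your-openai-api-key>"}]}
critic_config = {"config_list": [{"model": "gpt-4-vision-preview", "api_key": "<your-openai-api-key>"}]}

# The coder writes plotting code; the critic looks at the rendered figure and suggests fixes.
coder = AssistantAgent(
    name="coder",
    system_message="Write matplotlib code that saves the requested figure to result.png.",
    llm_config=coder_config,
)
critic = MultimodalConversableAgent(
    name="critic",
    system_message="You review figures and suggest concrete improvements.",
    llm_config=critic_config,
)

# Executes the coder's code; human_input_mode="ALWAYS" lets you add your own
# suggestions for better visualizations at every turn.
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="ALWAYS",
    code_execution_config={"work_dir": "figures", "use_docker": False},
)

# 1) Ask the coder to draft and run the plotting code.
user_proxy.initiate_chat(coder, message="Plot monthly sales as a bar chart and save it to result.png.")

# 2) Show the rendered figure to the critic via the <img ...> tag and collect feedback.
user_proxy.initiate_chat(critic, message="How can this chart be improved? <img figures/result.png>")

# 3) Relay the critic's feedback to the coder for another revision pass.
feedback = user_proxy.last_message(critic)["content"]
user_proxy.initiate_chat(
    coder,
    message=f"Please revise the chart based on this feedback:\n{feedback}",
    clear_history=False,
)
```

Because the user proxy pauses for input at each turn, you can steer the revision (for example, "use a log scale") before the coder's next pass.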