Composed Agents
Combine grounding models with any LLM for computer-use capabilities
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning. Use the format `grounding_model+thinking_model` to create a composed agent with any vision-enabled LiteLLM-compatible model.
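For example (a minimal sketch; `ComputerAgent` and the `computer` tool are assumed to be set up as in the Usage Examples below):

```python
# Grounding model on the left of "+", thinking model on the right.
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o",
    tools=[computer]
)
```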
How Composed Agents Work
- Planning Phase: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
- Grounding Phase: The grounding model converts element descriptions into precise coordinates
- Execution: Actions are performed using the predicted coordinates
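One pass through this loop looks roughly like the sketch below. The names (`PlannedAction`, `plan`, `ground`, `execute`) are illustrative stand-ins, not the library's internals:

```python
from dataclasses import dataclass

@dataclass
class PlannedAction:
    kind: str                 # e.g. "click" or "type"
    element_description: str  # natural-language target, no coordinates yet

def plan(task: str) -> PlannedAction:
    # Planning phase: the thinking model picks an action and describes
    # the target element in words.
    return PlannedAction(kind="click", element_description="the login button")

def ground(description: str) -> tuple[int, int]:
    # Grounding phase: the grounding model maps the description to
    # screen coordinates (dummy values here).
    return (640, 360)

def execute(action: PlannedAction, x: int, y: int) -> None:
    # Execution phase: the action is performed at the predicted point.
    print(f"{action.kind} at ({x}, {y}) -> {action.element_description}")

action = plan("Log in to the site")
x, y = ground(action.element_description)
execute(action, x, y)
```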
Supported Grounding Models
Any model that supports `predict_click()` can be used as the grounding component:
- `omniparser` (OSS set-of-marks model)
- `huggingface-local/HelloKKMe/GTA1-7B` (OSS grounding model)
- `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (OSS unified model)
- `claude-3-5-sonnet-20241022` (Anthropic CUA)
- `openai/computer-use-preview` (OpenAI CUA)
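Any of these can also be used standalone for direct click prediction; a minimal sketch, mirroring the Capabilities section at the end of this page:

```python
# Grounding model alone: the model string has no "+thinking_model" suffix.
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
coords = agent.predict_click("find the submit button")  # returns (x, y)
```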
Supported Thinking Models
Any vision-enabled LiteLLM-compatible model can be used as the thinking component:
- Anthropic: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-3-opus-20240229`
- OpenAI: `openai/gpt-4o`, `openai/gpt-4-vision-preview`
- Google: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
- Local models: Any Hugging Face vision-language model
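Because the two halves are independent, you can swap planners without touching the grounding model. A small sketch (assuming `ComputerAgent` and `computer` from the examples below):

```python
# Same grounding model, different thinking models; only the part
# after "+" changes.
grounding = "huggingface-local/HelloKKMe/GTA1-7B"
claude_agent = ComputerAgent(f"{grounding}+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
gemini_agent = ComputerAgent(f"{grounding}+gemini/gemini-1.5-pro", tools=[computer])
```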
Usage Examples
GTA1 + Claude 3.5 Sonnet
Combine state-of-the-art grounding with powerful reasoning:
```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer]
)

async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
    pass

# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element
```
GTA1 + Gemini Pro
Use Google's Gemini for planning with specialized grounding:
```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
    tools=[computer]
)

async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
    pass
```
UI-TARS + GPT-4o
Combine two different vision models for enhanced capabilities:
```python
agent = ComputerAgent(
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
    tools=[computer]
)

async for _ in agent.run("Help me fill out this form with my personal information"):
    pass
```
Benefits of Composed Agents
- Specialized Grounding: Use models optimized for click prediction accuracy
- Flexible Planning: Choose any LLM for task reasoning and planning
- Cost Optimization: Pair small, inexpensive grounding models with larger planning models, paying for the big model only where reasoning is needed
- Performance: Leverage the strengths of different model architectures
Capabilities
Composed agents support both full agentic runs and direct click prediction:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022")
# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
pass
# Direct click prediction (uses grounding model only)
coords = agent.predict_click("find the submit button")
For more information on individual model capabilities, see Computer-Use Agents and Grounding Models.