# Composed Agents

Combine grounding models with any LLM for computer-use capabilities.
Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning. Use the format `"grounding_model+planning_model"` to create a composed agent with any vision-enabled LiteLLM-compatible model.
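For example, the grounding model goes on the left of the `+` and the planning model on the right. The identifiers below appear elsewhere on this page and are only illustrative choices; the `split` call just shows how the two halves relate, not the library's own parsing:

```python
# Illustrative composed model string: grounding model, "+", planning model.
model = "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022"
grounding_model, planning_model = model.split("+", 1)  # left = grounding, right = planning
```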
## How Composed Agents Work
- Planning Phase: The planning model (LLM) analyzes the task and decides what actions to take (e.g., `click("find the login button")`, `type("username")`)
- Grounding Phase: The grounding model converts element descriptions into precise coordinates
- Execution: Actions are performed using the predicted coordinates (see the sketch after this list)
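Roughly, one step of a composed agent looks like the sketch below. The helper names (`plan()`, the action object's fields) are hypothetical and only illustrate the division of labor; they are not the library's internals.

```python
# Illustrative sketch of one composed-agent step (hypothetical helper names).
async def composed_step(planner, grounder, screenshot: bytes, task: str):
    # Planning phase: the LLM proposes a descriptive action,
    # e.g. click("find the login button") or type("username").
    action = await planner.plan(screenshot, task)

    if action.name == "click":
        # Grounding phase: resolve the element description to precise coordinates.
        x, y = await grounder.predict_click(screenshot, action.element_description)
        # Execution: the click is performed at the predicted coordinates.
        return ("click", x, y)

    # Actions that need no visual grounding (e.g. typing) pass through as-is.
    return (action.name, action.args)
```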
## Supported Grounding Models
Any model that supports `predict_click()` can be used as the grounding component; a grounding-only sketch follows the list below. See the full list on Grounding Models.
- OpenCUA: `huggingface-local/xlangai/OpenCUA-{7B,32B}`
- GTA1 family: `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}`
- Holo 1.5 family: `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}`
- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
- UI‑TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full CU)
- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
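Any of these identifiers goes on the left-hand side of the `+` in a composed model string. As a rough sketch, a grounding model can also be queried on its own for coordinates, mirroring the `predict_click()` call shown in the Capabilities section below (the exact signature may vary by version):

```python
# Grounding-only sketch: ask a single grounding model where an element is.
grounder = ComputerAgent("huggingface-local/Hcompany/Holo1.5-7B")
coords = grounder.predict_click("find the search box")  # e.g. (x, y)
```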
## Supported Planning Models
Any vision-enabled LiteLLM-compatible model can be used as the planning component:
- Any All‑in‑one CUA (planning-capable). See All‑in‑one CUAs.
- Any VLM via LiteLLM providers: `anthropic/*`, `openai/*`, `openrouter/*`, `gemini/*`, `vertex_ai/*`, `huggingface-local/*`, `mlx/*`, etc. (see the sketch after this list)
- Examples:
  - Anthropic: `anthropic/claude-3-5-sonnet-20241022`, `anthropic/claude-opus-4-1-20250805`
  - OpenAI: `openai/gpt-5`, `openai/gpt-o3`, `openai/gpt-4o`
  - Google: `gemini/gemini-1.5-pro`, `vertex_ai/gemini-pro-vision`
  - Local models: Any Hugging Face vision-language model
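Any provider string above slots into the right-hand side of the composed model. For instance, here is a sketch swapping in Gemini as the planner; it assumes the provider's API key is set in the environment and that `computer` is the same tool used in the examples below:

```python
# Same grounding model, different planner: only the part after "+" changes.
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
    tools=[computer],
)
```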
## Usage Examples
### GTA1 + GPT-5
Use OpenAI's GPT-5 for planning with specialized grounding:
```python
# Assumes ComputerAgent and a configured `computer` tool from the library's setup docs.
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
    tools=[computer]
)

async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
    pass
```
### GTA1 + Claude 3.5 Sonnet
Combine state-of-the-art grounding with powerful reasoning:
```python
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer]
)

async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
    pass

# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element
```
### UI-TARS + GPT-4o
Combine two different vision models for enhanced capabilities:
```python
agent = ComputerAgent(
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
    tools=[computer]
)

async for _ in agent.run("Help me fill out this form with my personal information"):
    pass
```
## Benefits of Composed Agents
- Specialized Grounding: Use models optimized for click prediction accuracy
- Flexible Planning: Choose any LLM for task reasoning and planning
- Cost Optimization: Keep per-click grounding on a small, inexpensive model and reserve the larger planning model for reasoning
- Performance: Leverage the strengths of different model architectures
## Capabilities
Composed agents support both full computer-use runs and direct click prediction:
```python
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022", tools=[computer])

# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
    pass

# Direct click prediction (uses the grounding model only)
coords = agent.predict_click("find the submit button")
```
For more information on individual model capabilities, see Computer-Use Agents and Grounding Models.