Supported Agents
Grounding Models
Models that support click prediction with ComputerAgent.predict_click()
These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.
Use ComputerAgent.predict_click() to get coordinates for specific UI elements.
All models that support ComputerAgent.run() also support ComputerAgent.predict_click(). See All-in-one CUAs.
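As a quick illustration of that split, here is a minimal sketch. It assumes an already-configured Computer instance named computer (setup and imports not shown), and the run() call is written in the same shorthand the error example further down uses; see All-in-one CUAs for the full run interface.

# Minimal sketch (assumes `computer` is an already-configured Computer instance)
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])

# Grounding: returns (x, y) coordinates for a described UI element
coords = agent.predict_click("find the login button")

# Autonomous task planning: only all-in-one CUAs support this
agent.run("Log in with the saved credentials")  # shorthand; see All-in-one CUAs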
Anthropic CUAs
- Claude 4.1: claude-opus-4-1-20250805
- Claude 4: claude-opus-4-20250514, claude-sonnet-4-20250514
- Claude 3.7: claude-3-7-sonnet-20250219
- Claude 3.5: claude-3-5-sonnet-20241022
OpenAI CUA Preview
- Computer-use-preview: computer-use-preview
UI-TARS 1.5 (Unified VLM with grounding support)
- huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B
- huggingface/ByteDance-Seed/UI-TARS-1.5-7B (requires a TGI endpoint)
Specialized Grounding Models
These models are optimized specifically for click prediction and UI element grounding:
OpenCUA
- huggingface-local/xlangai/OpenCUA-{7B,32B}
GTA1 Family
- huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}
Holo 1.5 Family
- huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}
InternVL 3.5 Family
- huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}
OmniParser (OCR)
OCR-focused set-of-marks model that requires an LLM for click prediction:
- omniparser (requires combination with any LiteLLM vision model)
Usage Examples
# Using any grounding model for click prediction
# (assumes the Cua SDK import path and an already-configured Computer instance `computer`)
from agent import ComputerAgent

agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])

# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")

print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
# OmniParser only performs OCR-style set-of-marks parsing, so it must be
# composed with an LLM before predict_click will work
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])

# Predict click coordinates using the composed agent
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # e.g. (450, 320)

# Note: omniparser cannot be used alone for click prediction.
# This would raise an error:
# agent = ComputerAgent("omniparser", tools=[computer])
# coords = agent.predict_click("find button")  # Error!
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
For information on combining grounding models with planning capabilities, see Composed Agents and All‑in‑one CUAs.
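As a sketch of what composition looks like, the same "+" syntax used in the OmniParser example above can pair a specialized grounding model with a planning LLM. The particular pairing below is illustrative only, it again assumes a configured Computer instance named computer, and the run() call uses the same shorthand as above; see Composed Agents for details.

# Illustrative composed agent: specialized grounding model + planning LLM
agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer],
)

# The grounding model resolves coordinates; the planning model handles task planning
coords = agent.predict_click("find the submit button")
agent.run("Fill out the form and submit it")  # shorthand; see All-in-one CUAs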