Grounding Models
Models that support click prediction with ComputerAgent.predict_click()
These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.
Use ComputerAgent.predict_click()
to get coordinates for specific UI elements.
All Computer-Use Agents
All models that support ComputerAgent.run()
also support ComputerAgent.predict_click()
:
Anthropic CUAs
- Claude 4.1:
claude-opus-4-1-20250805
- Claude 4:
claude-opus-4-20250514
,claude-sonnet-4-20250514
- Claude 3.7:
claude-3-7-sonnet-20250219
- Claude 3.5:
claude-3-5-sonnet-20240620
OpenAI CUA Preview
- Computer-use-preview:
computer-use-preview
UI-TARS 1.5
huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B
huggingface/ByteDance-Seed/UI-TARS-1.5-7B
(requires TGI endpoint)
Specialized Grounding Models
These models are optimized specifically for click prediction and UI element grounding:
OmniParser
OCR-focused set-of-marks model that requires an LLM for click prediction:
omniparser
(requires combination with any LiteLLM vision model)
GTA1-7B
State-of-the-art grounding model from the GUI Agent Grounding Leaderboard:
huggingface-local/HelloKKMe/GTA1-7B
Usage Examples
# Using any grounding model for click prediction
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")
print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")
# OmniParser is just for OCR, so it requires an LLM for predict_click
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
# Predict click coordinates using composed agent
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: Cannot use omniparser alone for click prediction
# This will raise an error:
# agent = ComputerAgent("omniparser", tools=[computer])
# coords = agent.predict_click("find button") # Error!
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}") # (450, 320)
# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
For information on combining grounding models with planning capabilities, see Composed Agents.