Grounding Models

Grounding models specialize in UI element grounding and click prediction: given a natural language description of a UI element, they return its screen coordinates. Models that only do grounding cannot perform autonomous task planning; the all-in-one CUAs listed below support both.

Use ComputerAgent.predict_click() to get coordinates for specific UI elements.

All models that support ComputerAgent.run() also support ComputerAgent.predict_click(). See All‑in‑one CUAs.

Anthropic CUAs

Claude 4.1: claude-opus-4-1-20250805
Claude 4: claude-opus-4-20250514, claude-sonnet-4-20250514
Claude 3.7: claude-3-7-sonnet-20250219
Claude 3.5: claude-3-5-sonnet-20241022

OpenAI CUA Preview

Computer-use-preview: computer-use-preview

UI-TARS 1.5 (Unified VLM with grounding support)

huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B
huggingface/ByteDance-Seed/UI-TARS-1.5-7B (requires TGI endpoint)

Specialized Grounding Models

These models are optimized specifically for click prediction and UI element grounding:

OpenCUA

huggingface-local/xlangai/OpenCUA-{7B,32B}

GTA1 Family

huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}

Holo 1.5 Family

huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}

InternVL 3.5 Family

huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}
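
The `{7B,32B,72B}` notation in the listings above reads as shorthand for several model strings, one per size. The sketch below expands it into concrete strings, assuming shell-style brace expansion is what is meant. `expand_model_pattern` is a hypothetical helper, not part of any library. The InternVL list ends in `...` in the listing, so it is deliberately not expanded here.

```python
import re
from itertools import product
from typing import List


def expand_model_pattern(pattern: str) -> List[str]:
    """Expand shell-style {a,b,c} groups into concrete model strings.

    Hypothetical helper for illustration only.
    """
    # re.split with a capture group alternates literal text (even indices)
    # and the comma-separated options found inside braces (odd indices).
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [
        part.split(",") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in product(*choices)]


gta1_models = expand_model_pattern("huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}")
for name in gta1_models:
    print(name)
```

A string without braces comes back unchanged as a one-element list, so the helper can be applied to every entry on this page.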

OmniParser (OCR)

OCR-focused set-of-marks model that requires an LLM for click prediction:

omniparser (requires combination with any LiteLLM vision model)
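
The usage examples on this page pair OmniParser with an LLM by joining the two names with `+`. The sketch below builds and checks such a string before it is handed to `ComputerAgent`. The helper names and the `NEEDS_LLM` set are assumptions made for illustration. The library performs its own validation, which this does not replace.

```python
from typing import Optional, Tuple

# Assumption: models that cannot predict clicks without a paired LLM.
# Only "omniparser" is named as such on this page.
NEEDS_LLM = {"omniparser"}


def split_model(model: str) -> Tuple[str, Optional[str]]:
    """Split 'grounding+llm' into its two parts; the LLM part may be absent."""
    grounding, separator, llm = model.partition("+")
    return grounding, (llm if separator else None)


def check_click_model(model: str) -> str:
    """Return the model string unchanged, or raise if it lacks a required LLM."""
    grounding, llm = split_model(model)
    if grounding in NEEDS_LLM and not llm:
        raise ValueError(
            f"{grounding!r} must be combined with an LLM, e.g. '{grounding}+<llm-model>'"
        )
    return model


composed = check_click_model("omniparser+anthropic/claude-3-5-sonnet-20241022")
print(split_model(composed))
```

Checking the string up front gives a clearer message than waiting for the first `predict_click()` call to fail.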

Usage Examples

# Using any grounding model for click prediction
agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])

# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")

print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")

# OmniParser only handles OCR (set-of-marks parsing), so predict_click needs it composed with an LLM
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])

# Predict click coordinates using composed agent
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # e.g. (450, 320)

# Note: omniparser cannot be used alone for click prediction
# This will raise an error:
# agent = ComputerAgent("omniparser", tools=[computer])
# coords = agent.predict_click("find button")  # Error!

# Using a specialized grounding model (GTA1) for click prediction
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])

# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # e.g. (450, 320)

# Note: GTA1 cannot perform autonomous task planning
# This will raise an error:
# agent.run("Fill out the form and submit it")
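
The examples above call `predict_click()` once per element. A small wrapper can collect several lookups and sanity-check the results. This is a sketch under stated assumptions: `predict_click()` is taken to be a plain synchronous call returning an `(x, y)` pair, as the examples show, and it is assumed that it may return `None` when nothing is found. If the installed version exposes it as a coroutine, the call would need `await`. `locate_all`, `within_screen`, and `StubAgent` are hypothetical names. The stub stands in for a real agent so the snippet runs without a model.

```python
from typing import Dict, Iterable, Optional, Protocol, Tuple

Coords = Tuple[int, int]


class ClickPredictor(Protocol):
    """Anything with a predict_click(instruction) method (assumed synchronous)."""

    def predict_click(self, instruction: str) -> Optional[Coords]: ...


def locate_all(
    agent: ClickPredictor, instructions: Iterable[str]
) -> Dict[str, Optional[Coords]]:
    """Map each instruction to its predicted coordinates, or None if absent."""
    results: Dict[str, Optional[Coords]] = {}
    for instruction in instructions:
        coords = agent.predict_click(instruction)
        results[instruction] = tuple(coords) if coords is not None else None
    return results


def within_screen(coords: Coords, width: int, height: int) -> bool:
    """Check that a predicted point lies inside a width x height screen."""
    x, y = coords
    return 0 <= x < width and 0 <= y < height


class StubAgent:
    """Stand-in for a real agent: returns canned coordinates for known prompts."""

    def __init__(self, known: Dict[str, Coords]) -> None:
        self._known = dict(known)

    def predict_click(self, instruction: str) -> Optional[Coords]:
        return self._known.get(instruction)


stub = StubAgent({"find the login button": (450, 320)})
found = locate_all(stub, ["find the login button", "find the hamburger menu icon"])
print(found)
```

With a real agent, pass it in place of `stub`. The bounds check is useful before forwarding coordinates to a click action, since a point outside the screenshot usually indicates a failed prediction.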

For information on combining grounding models with planning capabilities, see Composed Agents and All‑in‑one CUAs.
