Cua Documentation
Supported Agents

Grounding Models

Models that support click prediction with ComputerAgent.predict_click()

These models specialize in UI element grounding and click prediction. They can identify precise coordinates for UI elements based on natural language descriptions, but cannot perform autonomous task planning.

Use ComputerAgent.predict_click() to get coordinates for specific UI elements.
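The contract of predict_click() can be sketched with a stand-in grounder: it takes a natural-language element description and returns pixel coordinates (it does not plan or execute actions). The function name, coordinate table, and None-on-failure behavior below are illustrative assumptions so the sketch runs anywhere, not the library's implementation:

```python
from typing import Optional, Tuple

def predict_click_stub(instruction: str) -> Optional[Tuple[int, int]]:
    """Stand-in for ComputerAgent.predict_click(): map an element
    description to (x, y) screen coordinates. A real grounding model
    infers these from a screenshot; this sketch uses a fixed table."""
    known_elements = {
        "login button": (450, 320),
        "search text field": (600, 80),
    }
    for name, coords in known_elements.items():
        if name in instruction:
            return coords
    return None  # grounding failed: no matching element

print(predict_click_stub("find the login button"))   # (450, 320)
print(predict_click_stub("find the logout button"))  # None
```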

All Computer-Use Agents

All models that support ComputerAgent.run() also support ComputerAgent.predict_click():

Anthropic CUAs

  • Claude 4.1: claude-opus-4-1-20250805
  • Claude 4: claude-opus-4-20250514, claude-sonnet-4-20250514
  • Claude 3.7: claude-3-7-sonnet-20250219
  • Claude 3.5: claude-3-5-sonnet-20240620

OpenAI CUA Preview

  • Computer-use-preview: computer-use-preview

UI-TARS 1.5

  • huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B
  • huggingface/ByteDance-Seed/UI-TARS-1.5-7B (requires TGI endpoint)

Specialized Grounding Models

These models are optimized specifically for click prediction and UI element grounding:

OmniParser

OCR-focused set-of-marks model that requires an LLM for click prediction:

  • omniparser (requires combination with any LiteLLM vision model)

GTA1-7B

State-of-the-art grounding model from the GUI Agent Grounding Leaderboard:

  • huggingface-local/HelloKKMe/GTA1-7B

Usage Examples

# Example 1: Using any computer-use agent for click prediction
from agent import ComputerAgent
from computer import Computer

computer = Computer()  # configure per the Computer docs (OS, provider, etc.)

agent = ComputerAgent("claude-3-5-sonnet-20240620", tools=[computer])

# Predict coordinates for specific elements
login_coords = agent.predict_click("find the login button")
search_coords = agent.predict_click("locate the search text field")
menu_coords = agent.predict_click("find the hamburger menu icon")

print(f"Login button: {login_coords}")
print(f"Search field: {search_coords}")
print(f"Menu icon: {menu_coords}")


# Example 2: OmniParser performs OCR only, so it must be composed with an LLM
agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20240620", tools=[computer])

# Predict click coordinates using the composed agent
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # e.g. (450, 320)

# Note: omniparser alone cannot perform click prediction.
# This will raise an error:
# agent = ComputerAgent("omniparser", tools=[computer])
# coords = agent.predict_click("find button")  # Error!


# Example 3: Using a specialized grounding model (GTA1-7B)
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])

# Predict click coordinates for UI elements
coords = agent.predict_click("find the submit button")
print(f"Click coordinates: {coords}")  # e.g. (450, 320)

# Note: GTA1 cannot perform autonomous task planning.
# This will raise an error:
# agent.run("Fill out the form and submit it")

For information on combining grounding models with planning capabilities, see Composed Agents.
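The composed-agent strings above follow a "grounding+planning" convention, as in "omniparser+anthropic/...". A minimal sketch of how such a string splits into its two roles — the helper name and parsing rule are assumptions for illustration, not Cua's API:

```python
def split_composed_model(model: str):
    """Hypothetical helper: split a 'grounding+planning' model string.
    Mirrors the convention seen in strings like
    'omniparser+anthropic/claude-3-5-sonnet-20240620'; not the library's API."""
    if "+" in model:
        grounding, planning = model.split("+", 1)
        return grounding, planning
    return model, None  # a single model fills both roles

print(split_composed_model("omniparser+anthropic/claude-3-5-sonnet-20240620"))
# ('omniparser', 'anthropic/claude-3-5-sonnet-20240620')
```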