Composed Agents

Combine grounding models with any LLM for computer-use capabilities

Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.

Use the model string format "grounding_model+planning_model" to compose any supported grounding model with any vision-enabled, LiteLLM-compatible planning model.

How Composed Agents Work

  1. Planning Phase: The planning model (LLM) analyzes the task and decides what actions to take (e.g., click("find the login button"), type("username"))
  2. Grounding Phase: The grounding model converts element descriptions to precise coordinates
  3. Execution: Actions are performed using the predicted coordinates

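The sketch below walks through a single step of this loop. It is conceptual only, not the library's internals: the action dictionary and the computer.interface.left_click call are illustrative assumptions; predict_click() is the grounding capability documented on this page.

async def composed_step(grounder, computer):
    # 1. Planning: the planning LLM proposes an action whose target is described
    #    in natural language rather than pixel coordinates
    action = {"type": "click", "target": "the login button"}  # example planner output

    # 2. Grounding: the grounding model resolves the description to coordinates
    #    (the predict_click() capability described on this page)
    x, y = grounder.predict_click(action["target"])

    # 3. Execution: the click is performed at the predicted coordinates
    #    (interface.left_click is an illustrative call; method names may differ)
    await computer.interface.left_click(x, y)
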
Supported Grounding Models

Any model that supports predict_click() can be used as the grounding component. See the full list on Grounding Models.

  • OpenCUA: huggingface-local/xlangai/OpenCUA-{7B,32B}
  • GTA1 family: huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}
  • Holo 1.5 family: huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}
  • InternVL 3.5 family: huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}
  • UI‑TARS 1.5: huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B (also works as a full computer-use agent on its own)
  • OmniParser (OCR): omniparser (must be combined with a LiteLLM vision model; see the example below)
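
For example, because omniparser only provides grounding, it has to be paired with a vision-capable planning model in the composed string (the pairing below is illustrative):

agent = ComputerAgent(
    "omniparser+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer]
)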

Supported Planning Models

Any vision-enabled LiteLLM-compatible model can be used as the planning component:

  • Any All‑in‑one CUA (planning-capable). See All‑in‑one CUAs.
  • Any VLM via LiteLLM providers: anthropic/*, openai/*, openrouter/*, gemini/*, vertex_ai/*, huggingface-local/*, mlx/*, etc.
  • Examples:
    • Anthropic: anthropic/claude-3-5-sonnet-20241022, anthropic/claude-opus-4-1-20250805
    • OpenAI: openai/gpt-5, openai/o3, openai/gpt-4o
    • Google: gemini/gemini-1.5-pro, vertex_ai/gemini-pro-vision
    • Local models: Any Hugging Face vision-language model

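The same grounding model can be paired with a planner from any of these providers by changing the part after the +. The strings below are illustrative pairings:

# Illustrative composed strings: one grounding model, different planning providers
ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o", tools=[computer])
ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
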
Usage Examples

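The examples below assume a ComputerAgent and a computer tool have already been set up. A minimal setup sketch follows; the import paths and Computer configuration are assumptions based on the Cua Python SDK and may differ for your environment and provider:

# Assumed setup for the examples on this page (configuration is illustrative)
from agent import ComputerAgent
from computer import Computer

computer = Computer()  # configure OS type, provider, and credentials as needed
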
GTA1 + GPT-5

Use OpenAI's GPT-5 for planning with specialized grounding:

agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-5",
    tools=[computer]
)

async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
    pass

GTA1 + Claude 3.5 Sonnet

Combine state-of-the-art grounding with powerful reasoning:

agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022", 
    tools=[computer]
)

async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
    pass
# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element

UI-TARS + GPT-4o

Combine two different vision models for enhanced capabilities:

agent = ComputerAgent(
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
    tools=[computer]
)

async for _ in agent.run("Help me fill out this form with my personal information"):
    pass

Benefits of Composed Agents

  • Specialized Grounding: Use models optimized for click prediction accuracy
  • Flexible Planning: Choose any LLM for task reasoning and planning
  • Cost Optimization: Use smaller grounding models with larger planning models only when needed
  • Performance: Leverage the strengths of different model architectures

Capabilities

Composed agents support both the full agentic run() loop and direct click prediction via predict_click():

agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022")

# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
    pass

# Direct click prediction (uses grounding model only)
coords = agent.predict_click("find the submit button")

For more information on individual model capabilities, see Computer-Use Agents and Grounding Models.