
Composed Agents

Combine grounding models with any LLM for computer-use capabilities

Composed agents combine the best of both worlds: specialized grounding models for precise click prediction and powerful LLMs for task planning and reasoning.

Use the format "grounding_model+thinking_model" to create a composed agent with any vision-enabled LiteLLM-compatible model.

How Composed Agents Work

  1. Planning Phase: The thinking model (LLM) analyzes the task and decides what actions to take (e.g., click("find the login button"), type("username"))
  2. Grounding Phase: The grounding model converts element descriptions to precise coordinates
  3. Execution: Actions are performed using the predicted coordinates (see the sketch after this list)
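
The loop can be pictured roughly as follows. This is a conceptual sketch only, not the library's internals; plan(), action.target, and the left_click() call are illustrative assumptions:

# One composed-agent step, conceptually (illustrative names, not real APIs).
async def composed_step(thinking_model, grounding_model, computer, task, screenshot):
    # 1. Planning: the LLM reads the task and screenshot and proposes an
    #    action with a natural-language target, e.g. click("the login button").
    action = await thinking_model.plan(task, screenshot)
    # 2. Grounding: the grounding model turns that description into coordinates.
    x, y = await grounding_model.predict_click(action.target, screenshot)
    # 3. Execution: the action runs at the predicted coordinates.
    await computer.interface.left_click(x, y)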

Supported Grounding Models

Any model that supports predict_click() can be used as the grounding component:

  • omniparser (OSS set-of-marks model)
  • huggingface-local/HelloKKMe/GTA1-7B (OSS grounding model)
  • huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B (OSS unified model)
  • claude-3-5-sonnet-20241022 (Anthropic CUA)
  • openai/computer-use-preview (OpenAI CUA)
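
Any of these can also run standalone. A minimal sketch, assuming `computer` is an already-connected Computer instance and that the import path matches the standard Cua packages:

from agent import ComputerAgent  # import path assumed from the Cua quickstart

# Grounding-only agent: no thinking model attached.
grounder = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])

# predict_click() maps an element description to (x, y) screen coordinates.
coords = grounder.predict_click("find the submit button")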

Supported Thinking Models

Any vision-enabled LiteLLM-compatible model can be used as the thinking component:

  • Anthropic: anthropic/claude-3-5-sonnet-20241022, anthropic/claude-3-opus-20240229
  • OpenAI: openai/gpt-4o, openai/gpt-4-vision-preview
  • Google: gemini/gemini-1.5-pro, vertex_ai/gemini-pro-vision
  • Local models: Any Hugging Face vision-language model
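
Because the thinking component is just a LiteLLM model string, swapping planners is a one-line change. A sketch reusing the GTA1-7B grounder, assuming `computer` is set up as above:

# Same grounder, different planners; only the part after "+" changes.
for planner in ("openai/gpt-4o", "gemini/gemini-1.5-pro"):
    agent = ComputerAgent(f"huggingface-local/HelloKKMe/GTA1-7B+{planner}", tools=[computer])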

Usage Examples

GTA1 + Claude 3.5 Sonnet

Combine state-of-the-art grounding with powerful reasoning:

from agent import ComputerAgent  # assumes the standard cua-agent package
# `computer` must be a connected Computer instance created beforehand.

agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022",
    tools=[computer]
)

# run() is an async generator, so iterate it from inside an async function.
async for _ in agent.run("Open Firefox, navigate to github.com, and search for 'computer-use'"):
    pass
# Success! 🎉
# - Claude 3.5 Sonnet plans the sequence of actions
# - GTA1-7B provides precise click coordinates for each UI element

GTA1 + Gemini Pro

Use Google's Gemini for planning with specialized grounding:

agent = ComputerAgent(
    "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro",
    tools=[computer]
)

async for _ in agent.run("Take a screenshot, analyze the UI, and click on the most prominent button"):
    pass

UI-TARS + GPT-4o

Combine two different vision models for enhanced capabilities:

agent = ComputerAgent(
    "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B+openai/gpt-4o",
    tools=[computer]
)

async for _ in agent.run("Help me fill out this form with my personal information"):
    pass

Benefits of Composed Agents

  • Specialized Grounding: Use models optimized for click prediction accuracy
  • Flexible Planning: Choose any LLM for task reasoning and planning
  • Cost Optimization: Pair a small, inexpensive grounding model with a larger planning model, so you pay for heavyweight reasoning only where it is needed
  • Performance: Leverage the strengths of different model architectures

Capabilities

Composed agents support both the full agentic loop via run() and direct click prediction via predict_click():

agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+anthropic/claude-3-5-sonnet-20241022")

# Full computer-use agent capabilities
async for _ in agent.run("Complete this online form"):
    pass

# Direct click prediction (uses grounding model only)
coords = agent.predict_click("find the submit button")

For more information on individual model capabilities, see Computer-Use Agents and Grounding Models.