Cua Documentation

Introduction

Overview of benchmarking in the c/ua agent framework

The c/ua agent framework uses benchmarks to test the performance of supported models and providers at various agentic tasks.

Benchmark Types

Computer-Agent benchmarks evaluate two key capabilities:

  • Plan Generation: Breaking down complex tasks into a sequence of actions
  • Coordinate Generation: Predicting precise click locations on GUI elements
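To make the split between the two capabilities concrete, here is a toy sketch. Everything in it is hypothetical and is not part of the c/ua API: `generate_plan` returns a hard-coded action list in place of a planning model, and `generate_coordinates` reads from a made-up lookup table in place of a grounding model. It only illustrates that a plan names what to do, while coordinate generation decides where to click.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table standing in for a grounding model's output.
FAKE_SCREEN: Dict[str, Tuple[int, int]] = {
    "Firefox icon": (40, 300),
    "address bar": (640, 60),
}


def generate_plan(task: str) -> List[dict]:
    """Stand-in for plan generation: returns a fixed action sequence."""
    return [
        {"action": "click", "target": "Firefox icon"},
        {"action": "click", "target": "address bar"},
        {"action": "type", "text": "github.com"},
    ]


def generate_coordinates(target: str) -> Tuple[int, int]:
    """Stand-in for coordinate generation: maps a description to (x, y)."""
    return FAKE_SCREEN[target]


def resolve(task: str) -> List[dict]:
    """Attach click coordinates to every click step of the plan."""
    steps = []
    for step in generate_plan(task):
        if step["action"] == "click":
            step = {**step, "xy": generate_coordinates(step["target"])}
        steps.append(step)
    return steps


if __name__ == "__main__":
    for step in resolve("Open Firefox and go to github.com"):
        print(step)
```

A unified model performs both stand-in roles itself; a grounding-only model covers just the second one.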

Using State-of-the-Art Models

Let's see how to use the SOTA vision-language models in the c/ua agent framework.

Plan Generation + Coordinate Generation

OS-World - Benchmark for complete computer-use agents

This leaderboard tests models that can understand instructions and automatically perform the full sequence of actions needed to complete tasks.

# UI-TARS-1.5 is a SOTA VLM that unifies plan generation and coordinate generation,
# which makes it suitable for computer-use agentic loops.
# `computer` is assumed to be an already-initialized computer tool.
agent = ComputerAgent("huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉

Coordinate Generation Only

GUI Agent Grounding Leaderboard - Benchmark for click prediction accuracy

This leaderboard tests models that specialize in finding exactly where to click on screen elements, but that need to be told what specific action to take.

# GTA1-7B is a SOTA coordinate generation VLM.
# It can only generate coordinates, not plans:
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
agent.predict_click("find the button to open the settings")  # e.g. (27, 450)
# Calling run() on a coordinate-only model will raise an error:
# agent.run("Open Firefox and go to github.com")
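As a rough illustration of what "click prediction accuracy" can mean, the sketch below scores a prediction as correct when the point falls inside the target element's bounding box. This is an assumption about the metric, not the leaderboard's documented scoring code; `GroundingExample`, `click_hits`, and `click_accuracy` are hypothetical names, and the boxes and points are made-up data.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class GroundingExample:
    """One hypothetical benchmark item: an instruction and its target box."""
    instruction: str
    bbox: Box


def click_hits(point: Point, bbox: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: List[Point], examples: List[GroundingExample]) -> float:
    """Fraction of predictions that land inside their target box."""
    if not examples:
        return 0.0
    hits = sum(click_hits(p, e.bbox) for p, e in zip(predictions, examples))
    return hits / len(examples)


if __name__ == "__main__":
    examples = [
        GroundingExample("find the button to open the settings", (0, 400, 60, 480)),
        GroundingExample("close the window", (900, 0, 960, 40)),
    ]
    predictions = [(27, 450), (500, 500)]  # first hits, second misses
    print(click_accuracy(predictions, examples))
```

In practice the predictions would come from calls such as `agent.predict_click(...)` rather than a hand-written list.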

Composed Agent

The c/ua agent framework also supports composed agents, which pair a plan generation model with a coordinate generation model to get the strengths of both. Any model supported by LiteLLM can be used as the plan generation model.

# A coordinate generation model such as GTA1-7B can be paired with any LLM to form a composed agent.
# The two model names are joined with "+"; here "gemini/gemini-1.5-pro" is the plan generation LLM.
agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro", tools=[computer])
agent.run("Open Firefox and go to github.com")
# Success! 🎉
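The composed model string above follows the pattern "coordinate model + plan model". The helper below is a hypothetical sketch of how such a string could be split into its two parts; it is not taken from the framework's source, and the real parsing logic may differ.

```python
from typing import Optional, Tuple


def split_composed_model(model: str) -> Tuple[str, Optional[str]]:
    """Split "grounding+planning" into (grounding_model, planning_model).

    Hypothetical helper for illustration only. If the string has no "+",
    the planning part is None, meaning a single model is used on its own.
    """
    grounding, sep, planning = model.partition("+")
    if sep and (not grounding or not planning):
        raise ValueError(f"Malformed composed model string: {model!r}")
    return grounding, (planning if sep else None)


if __name__ == "__main__":
    composed = "huggingface-local/HelloKKMe/GTA1-7B+gemini/gemini-1.5-pro"
    grounding_model, planning_model = split_composed_model(composed)
    print("coordinate generation:", grounding_model)
    print("plan generation:", planning_model)
```

Using `str.partition` splits on the first "+" only, so the sketch assumes that model identifiers themselves do not contain a "+" before the separator.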