Benchmarks
Computer Agent SDK benchmarks for agentic GUI tasks
The benchmark system evaluates models on GUI grounding tasks, measuring both agent loop success rate and click prediction accuracy. It supports:
- Computer Agent SDK providers, using model strings such as "huggingface-local/HelloKKMe/GTA1-7B"
- Reference agent implementations, i.e. custom model classes implementing the ModelProtocol
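As a rough illustration of the second option, a custom model only needs to satisfy the protocol's interface. The sketch below is a minimal, self-contained stand-in: the actual ModelProtocol is defined in the benchmark repository, and the method name and signature used here (predict_click) are assumptions for illustration, not the SDK's real API.

```python
# Hedged sketch of a custom model class; the real ModelProtocol lives in the
# benchmark repo, and predict_click's name/signature are assumed here.
from typing import Protocol, Tuple


class ClickModel(Protocol):
    """Minimal stand-in for the benchmark's ModelProtocol (assumed shape)."""

    def predict_click(
        self, image_size: Tuple[int, int], instruction: str
    ) -> Tuple[int, int]:
        """Return (x, y) pixel coordinates for the described UI element."""
        ...


class CenterBaseline:
    """Toy reference model: always predicts the screen center."""

    def predict_click(
        self, image_size: Tuple[int, int], instruction: str
    ) -> Tuple[int, int]:
        w, h = image_size
        return w // 2, h // 2


# Structural typing: CenterBaseline satisfies ClickModel without inheriting it.
model: ClickModel = CenterBaseline()
print(model.predict_click((1920, 1080), "Click the Submit button"))  # (960, 540)
```

A baseline this simple is still useful in practice: it gives the benchmark harness a floor score to compare real models against.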
Available Benchmarks
- ScreenSpot-v2 - Standard resolution GUI grounding
- ScreenSpot-Pro - High-resolution GUI grounding
- Interactive Testing - Real-time testing and visualization
Quick Start
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd cua/libs/python/agent/benchmarks
# Install dependencies
pip install "cua-agent[all]"
# Run a benchmark
python ss-v2.py