Benchmarks
Computer Agent SDK benchmarks for agentic GUI tasks
The benchmark system evaluates models on GUI grounding tasks, measuring both agent loop success rate and click prediction accuracy. It supports:
- Computer Agent SDK providers, using model strings such as "huggingface-local/HelloKKMe/GTA1-7B"
- Reference agent implementations, i.e. custom model classes implementing the ModelProtocol
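As a rough illustration of the second option, a custom model only needs to satisfy the protocol's interface. The sketch below is a minimal, self-contained stand-in: the actual ModelProtocol is defined in the benchmark repository, and the method name and signature used here (predict_click) are assumptions for illustration, not the SDK's real API.

```python
# Hedged sketch of a custom model class; the real ModelProtocol lives in the
# benchmark repo, and predict_click's name/signature are assumed here.
from typing import Protocol, Tuple


class ClickModel(Protocol):
    """Minimal stand-in for the benchmark's ModelProtocol (assumed shape)."""

    def predict_click(
        self, image_size: Tuple[int, int], instruction: str
    ) -> Tuple[int, int]:
        """Return (x, y) pixel coordinates for the described UI element."""
        ...


class CenterBaseline:
    """Toy reference model: always predicts the screen center."""

    def predict_click(
        self, image_size: Tuple[int, int], instruction: str
    ) -> Tuple[int, int]:
        w, h = image_size
        return w // 2, h // 2


# Structural typing: CenterBaseline satisfies ClickModel without inheriting it.
model: ClickModel = CenterBaseline()
print(model.predict_click((1920, 1080), "Click the Submit button"))  # (960, 540)
```

A baseline this simple is still useful in practice: it gives the benchmark harness a floor score to compare real models against.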
Available Benchmarks
- ScreenSpot-v2 - Standard resolution GUI grounding
- ScreenSpot-Pro - High-resolution GUI grounding
- Interactive Testing - Real-time testing and visualization
Quick Start
# Clone the benchmark repository
git clone https://github.com/trycua/cua
cd cua/libs/python/agent/benchmarks
# Install dependencies
pip install "cua-agent[all]"
# Run a benchmark
python ss-v2.py