Agent Loops

Supported computer-using agent loops and models

A corresponding Jupyter Notebook is available for this documentation.

An agent can be thought of as a loop - it generates actions, executes them, and repeats until done:

  1. Generate: Your model generates output items such as output_text, computer_call, or function_call
  2. Execute: The computer safely executes those items
  3. Complete: If the model has no more calls, it's done!
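
In pseudocode, that loop looks roughly like the sketch below (illustrative only, not the library's internal implementation; generate and execute are hypothetical stand-ins for the model and the computer):

def generate(history):
    """Hypothetical: ask the model for the next batch of output items."""
    ...

def execute(call):
    """Hypothetical: have the computer perform a computer_call or function_call."""
    ...

def agent_loop(history):
    while True:
        items = generate(history)          # 1. Generate
        history.extend(items)
        calls = [item for item in items
                 if item["type"] in ("computer_call", "function_call")]
        if not calls:                      # 3. Complete: no more calls
            return history
        for call in calls:                 # 2. Execute
            history.append(execute(call))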

To run an agent loop, simply do:

import asyncio

from agent import ComputerAgent
from computer import Computer

async def main():
    computer = Computer()  # Connect to a cua container

    agent = ComputerAgent(
        model="anthropic/claude-3-5-sonnet-20241022",
        tools=[computer]
    )

    prompt = "Take a screenshot and tell me what you see"

    async for result in agent.run(prompt):
        if result["output"][-1]["type"] == "message":
            print("Agent:", result["output"][-1]["content"][0]["text"])

asyncio.run(main())

For a list of supported models and configurations, see the Supported Agents page.

Response Format

Each result yielded by agent.run() has the following structure:

{
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "I can see..."}]
        },
        {
            "type": "computer_call",
            "action": {"type": "screenshot"},
            "call_id": "call_123"
        },
        {
            "type": "computer_call_output",
            "call_id": "call_123",
            "output": {"image_url": "data:image/png;base64,..."}
        }
    ],
    "usage": {
        "prompt_tokens": 150,
        "completion_tokens": 75,
        "total_tokens": 225,
        "response_cost": 0.01,
    }
}
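
For example, a minimal sketch (reusing the agent and prompt from the first example, inside an async function) that walks each chunk's output list and prints a line per item type along with the usage totals:

async def summarize_run():
    async for result in agent.run(prompt):
        for item in result["output"]:
            if item["type"] == "message":
                print("Assistant:", item["content"][0]["text"])
            elif item["type"] == "computer_call":
                print("Action:", item["action"]["type"])
            elif item["type"] == "computer_call_output":
                print("Action output received for", item["call_id"])
        print("Tokens:", result["usage"]["total_tokens"],
              "| cost:", result["usage"]["response_cost"])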

Environment Variables

Use the following environment variables to configure the agent and its access to cloud computers and LLM providers:

# Computer instance (cloud)
export CUA_CONTAINER_NAME="your-container-name"
export CUA_API_KEY="your-cua-api-key"

# LLM API keys
export ANTHROPIC_API_KEY="your-anthropic-key"
export OPENAI_API_KEY="your-openai-key"
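
Because these are plain environment variables, a quick sanity check before constructing the agent can catch missing credentials early (a generic sketch; adjust the required names to the provider you use):

import os

# Fail fast if credentials for the cloud computer or LLM provider are missing
required = ["CUA_CONTAINER_NAME", "CUA_API_KEY", "ANTHROPIC_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")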

Input and Output

The input prompt passed to ComputerAgent.run can be either a string or a list of message dictionaries:

messages = [
    {
        "role": "user",
        "content": "Take a screenshot and describe what you see"
    },
    {
        "role": "assistant", 
        "content": "I'll take a screenshot for you."
    }
]

The output is an AsyncGenerator that yields response chunks.
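
A message list is passed to run exactly like a prompt string; a minimal sketch (reusing the agent from the earlier examples):

async def run_conversation():
    async for result in agent.run(messages):
        # Each chunk follows the response format shown above
        for item in result["output"]:
            if item["type"] == "message":
                print("Agent:", item["content"][0]["text"])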

Parameters

The ComputerAgent constructor provides a wide range of options for customizing agent behavior, tool integration, callbacks, resource management, and more.

  • model (str): Required. The LLM or agent model to use; determines which agent loop is selected unless custom_loop is provided (e.g., "claude-3-5-sonnet-20241022", "computer-use-preview", "omni+vertex_ai/gemini-pro").
  • tools (List[Any]): List of tools the agent can use (e.g., Computer, sandboxed Python functions, etc.).
  • custom_loop (Callable): Optional custom agent loop function. If provided, overrides automatic loop selection.
  • only_n_most_recent_images (int): If set, only the N most recent images are kept in the message history. Useful for limiting memory usage. Automatically adds ImageRetentionCallback.
  • callbacks (List[Any]): List of callback instances for advanced preprocessing, postprocessing, logging, or custom hooks. See Callbacks & Extensibility.
  • verbosity (int): Logging level (e.g., logging.INFO). If set, adds a logging callback.
  • trajectory_dir (str): Directory path to save full trajectory data, including screenshots and responses. Adds TrajectorySaverCallback.
  • max_retries (int): Maximum number of retries for failed API calls. Default: 3.
  • screenshot_delay (float | int): Delay (in seconds) before taking screenshots. Default: 0.5.
  • use_prompt_caching (bool): Enables prompt caching for repeated prompts (mainly for Anthropic models). Default: False.
  • max_trajectory_budget (float | dict): If set (float or dict), adds a budget manager callback that tracks usage costs and stops execution if the budget is exceeded. Dict allows advanced options (e.g., { "max_budget": 5.0, "raise_error": True }).
  • **kwargs (any): Any additional keyword arguments are passed through to the agent loop or model provider.

Example with advanced options:

import logging

from agent import ComputerAgent
from computer import Computer
from agent.callbacks import ImageRetentionCallback

agent = ComputerAgent(
    model="anthropic/claude-3-5-sonnet-20241022",
    tools=[Computer(...)],
    only_n_most_recent_images=3,
    callbacks=[ImageRetentionCallback(only_n_most_recent_images=3)],
    verbosity=logging.INFO,
    trajectory_dir="trajectories",
    max_retries=5,
    screenshot_delay=1.0,
    use_prompt_caching=True,
    max_trajectory_budget={"max_budget": 5.0, "raise_error": True}
)

Streaming Responses

Pass stream=True to agent.run() to process output items as they arrive:

async for result in agent.run(messages, stream=True):
    # Process streaming chunks
    for item in result["output"]:
        if item["type"] == "message":
            print(item["content"][0]["text"], end="", flush=True)
        elif item["type"] == "computer_call":
            action = item["action"]
            print(f"\n[Action: {action['type']}]")

Error Handling

When a trajectory budget is configured with raise_error, exceeding it raises BudgetExceededException; other failures surface as regular exceptions:

try:
    async for result in agent.run(messages):
        # Process results
        pass
except BudgetExceededException:
    print("Budget limit exceeded")
except Exception as e:
    print(f"Agent error: {e}")