
Studio

The studio command launches a web-based dashboard for browsing evaluation runs, inspecting individual test results, and reviewing scores. It shows both local runs and runs synced from a remote results repository.

AgentV Studio showing evaluation runs with pass rates, targets, and experiment names
agentv studio

Studio auto-discovers run workspaces from .agentv/results/runs/ in the current directory and opens at http://localhost:3117.

You can also point it at a specific run workspace or index.jsonl manifest:

agentv studio .agentv/results/runs/2026-03-30T11-45-56-989Z/index.jsonl
# or
agentv studio .agentv/results/runs/2026-03-30T11-45-56-989Z
Option              Description
--port, -p          Port to listen on (precedence: flag > PORT env var > 3117)
--dir, -d           Working directory (default: current directory)
--multi             Launch in multi-project dashboard mode (deprecated; use auto-detect or --single)
--single            Force single-project dashboard mode
--add <path>        Register a project by path
--remove <id>       Unregister a project by ID
--discover <path>   Scan a directory tree for repos with .agentv/

Studio's dashboard includes the following views:
  • Recent Runs — table of all evaluation runs with source badge (local / remote), target, experiment, timestamp, test count, pass rate, and mean score
  • Experiments — group and compare runs by experiment name
  • Targets — group runs by target (model/agent)
  • Run Detail — drill into a run to see per-test results, scores, and evaluator output
  • Human Review — add feedback annotations to individual test results
  • Compare — two modes: an aggregated experiment × target matrix, and a per-run view for selecting individual runs to compare side-by-side with optional retroactive tags
  • Remote Results — sync and browse runs pushed from other machines or CI (see Remote Results)

Click any run to see a breakdown by suite, per-test scores, target, duration, and cost. The source label (local or remote) tells you where the run came from.

AgentV Studio run detail showing 100% pass rate across 5 tests with scores and duration

The Experiments tab groups runs by experiment name so you can compare the impact of changes — for example, with_skills vs without_skills.

AgentV Studio experiments tab comparing with_skills (100%) vs without_skills (60%) pass rates

The Compare tab has two modes: Aggregated, the classic experiment × target matrix, and Per run, which lets you select individual runs and compare them side by side. Toggle between them with the mode switch on the right of the masthead.

AgentV Studio side-by-side comparison of two runs tagged improved-prompt and baseline, with per-test pass rates

Aggregated mode (the default) shows a cross-experiment, cross-target performance matrix. Numbers are colour-coded by pass rate — green (80%+), amber (50–80%), red (below 50%) — and each cell shows passed/total and the mean score. Click any cell to expand the per-test-case breakdown.

AgentV Studio aggregated compare matrix showing experiment × target pass rates

Run the same eval against multiple providers or experiment variants, then open the Compare tab:

agentv eval my.EVAL.yaml --target azure --experiment baseline
agentv eval my.EVAL.yaml --target azure --experiment with-caching
agentv eval my.EVAL.yaml --target gemini --experiment baseline
agentv eval my.EVAL.yaml --target gemini --experiment with-caching
agentv studio # Compare tab shows 2x2 matrix

Running the same (experiment, target) combination twice no longer collapses the results into a single cell. Switch to Per run mode to see every run as its own row, select two or more, and compare them head-to-head.

AgentV Studio per-run compare mode with a filter-by-tag chip row and individual runs listing timestamp, tags, experiment, target, and pass rate; experiment-prefixed runs surface the experiment name under the timestamp

Use per-run mode when you want to:

  • Compare back-to-back runs of the same agent + eval after a prompt or parameter tweak
  • Pit a fresh run against a tagged baseline without touching the eval YAML
  • Debug flakiness by inspecting two identical-configuration runs side-by-side

Select 2+ rows with the checkboxes and click the sticky Compare N action to open the side-by-side view. Column headers show the run’s timestamp, with any assigned tags as chips below it. The per-test breakdown reuses the same scoring and colour tones as the aggregated matrix.

Click any row’s Tags cell to tag a run after the fact. Each run can carry multiple free-form tags (max 20, up to 60 characters each); tags are stored in a tags.json sidecar next to index.jsonl in the run workspace, so they’re mutable, non-destructive, and won’t touch your eval YAML or run manifest. The chip editor supports Enter/comma to commit a new tag, Backspace to remove the last chip, and Clear all to remove every tag (deletes the sidecar). Remote runs are read-only.
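
Only the sidecar's location and purpose are documented above; its exact schema is not, so the following sketch of a `tags.json` next to `index.jsonl` is an assumption (a flat array of tag strings) offered purely to show that tags live outside the run manifest:

```json
["baseline", "v2-prompt", "after-retry-fix"]
```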

Use tags to annotate ad-hoc variants, experiment cross-cuts, or status flags you didn’t plan for up front — baseline, v2-prompt, slow, after-retry-fix, regression, etc. Unlike experiment — which groups runs and is baked into the JSONL at eval-run time — tags are mutable, multi-valued, and never touch the original run data.

Once runs are tagged, a chip row appears above the compare view listing every distinct tag with a usage count. Click a chip to narrow both the aggregated matrix and the per-run table to runs carrying at least one of the selected tags (OR semantics — clicking a second chip widens the set). A Clear link resets the filter, and filter selections persist as you switch between Aggregated and Per-run modes.

The same filter is available to API consumers via GET /api/compare?tags=baseline,v2-prompt, which returns only the cells and runs whose tags intersect the query.
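
For example, with Studio running on the default port, a session might look like this (the endpoint and `tags` query parameter are as documented above; the port assumes the default of 3117 and no `--port` override):

```shell
# Fetch compare data restricted to runs tagged baseline OR v2-prompt.
# Assumes `agentv studio` is already running on the default port.
curl -s "http://localhost:3117/api/compare?tags=baseline,v2-prompt"
```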

By default, Studio shows results for the current directory. Register multiple benchmark repos to view them from a single dashboard.

Register benchmark repos one at a time:

agentv studio --add /path/to/my-evals
agentv studio --add /path/to/other-evals

Each path must contain a .agentv/ directory. Registered benchmarks are stored in ~/.agentv/projects.yaml.
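
The registry file is plain YAML, but its exact schema is not documented here; the sketch below is only an assumed shape (IDs mapping to paths), included to show where registrations end up:

```yaml
# ~/.agentv/projects.yaml — assumed shape, for illustration only
projects:
  my-evals: /path/to/my-evals
  other-evals: /path/to/other-evals
```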

Scan a parent directory to find and register all benchmark repos:

agentv studio --discover /path/to/repos

This recursively searches (up to 2 levels deep) for directories containing .agentv/ and registers them.
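
The depth rule can be sketched with a plain `find` over a sample tree — this is an illustration of the documented semantics (repos up to two levels below the scan root, i.e. their `.agentv/` at depth ≤ 3), not AgentV's actual implementation:

```shell
# Build a sample tree, then emulate the discovery scan.
root=$(mktemp -d)
mkdir -p "$root/my-evals/.agentv"           # repo at level 1 -> found
mkdir -p "$root/group/other-evals/.agentv"  # repo at level 2 -> found
mkdir -p "$root/a/b/c/too-deep/.agentv"     # repo at level 4 -> skipped

# Repos up to 2 levels deep have their .agentv/ directory at depth <= 3.
find "$root" -maxdepth 3 -type d -name .agentv | sort
```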

Studio auto-detects the mode based on how many benchmarks are registered:

  • 0 or 1 registered: single-project view
  • 2+ registered: Benchmarks dashboard
agentv studio # auto-detects
agentv studio --single # force single-project view

The landing page shows a card for each benchmark with run count, pass rate, and last run time.

AgentV Studio benchmarks dashboard showing benchmark cards with pass rates

Unregister a benchmark by its ID:

agentv studio --remove my-evals

IDs are derived from the directory name (e.g., /home/user/repos/my-evals becomes my-evals).
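
You can predict the ID before unregistering by taking the final path component, which matches what `basename` does:

```shell
# The project ID is the final component of the registered path.
basename /home/user/repos/my-evals
# → my-evals
```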

Studio can display runs pushed to a remote git repository by other machines or CI — alongside your local runs. Each run in the list carries a source badge: local (green) or remote (amber).

Add a results.export block to .agentv/config.yaml:

results:
  export:
    repo: EntityProcess/agentv-evals # GitHub repo (owner/repo or full URL)
    path: runs # Directory within the repo
    auto_push: true # Push automatically after every eval run
    branch_prefix: eval-results # Branch naming prefix (default: eval-results)

With auto_push: true, every agentv eval run or agentv pipeline bench automatically creates a draft PR in the configured repo with a structured results table.

Export uses the gh CLI and git credentials already configured on the machine. If authentication is missing, AgentV warns and skips the export — the eval run itself is never blocked.

Once configured, Studio fetches remote runs on load. Use the Sync Remote Results button in the source toolbar to pull the latest. The toolbar also shows when results were last synced and the configured repo.

Use the All Sources / Local Only / Remote Only filter to narrow the run list by origin.