
Dataset summary (across all EXPs)

Experiments table

Click a row to load that EXP and switch to its Tasks view.
Keyboard: ←/→ navigate · Esc back

Question

Context files (input data)

Gold answer + script

Output (gold.csv)

Script (gold_scripts/<task_id>.py)


    

Agent runs

Agent A

Prediction
Final answer

        
Tool trace

Agent B

Prediction
Final answer

        
Tool trace

Local vs LB accuracy (all EXPs)

Each point = one EXP. X = local accuracy (binary mean from this EDA's scoring). Y = leaderboard accuracy (manually entered below or in artifacts/eda/lb_scores.json). Diagonal = perfect agreement; below = LB worse than local; above = LB better. Hover for run_id. Click to open that EXP.

Color: v3 paired runs · Copilot Public-50 EXPs · grey = LB missing
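
How to read the plot can be sketched in a few lines of Python. This is only an illustration, not the page's actual scoring or plotting code: the function names, the tolerance, and the run_ids and scores are all hypothetical. It assumes "binary mean" means the mean of per-task 0/1 correctness flags.

```python
from typing import Optional, Sequence


def binary_mean(flags: Sequence[int]) -> float:
    """Local accuracy as the mean of per-task 0/1 correctness flags (assumed definition)."""
    return sum(flags) / len(flags)


def diagonal_position(local_acc: float, lb_acc: Optional[float], tol: float = 1e-9) -> str:
    """Describe where an EXP's point falls relative to the y = x diagonal."""
    if lb_acc is None:
        return "LB missing (grey)"
    gap = lb_acc - local_acc
    if abs(gap) <= tol:
        return "on diagonal (perfect agreement)"
    if gap > 0:
        return "above (LB better than local)"
    return "below (LB worse than local)"


# Hypothetical run_ids and scores, for illustration only.
runs = {
    "exp_a": (binary_mean([1, 0, 1, 1]), 0.58),
    "exp_b": (0.50, 0.54),
    "exp_c": (0.70, None),
}
for run_id, (local_acc, lb_acc) in runs.items():
    print(run_id, diagonal_position(local_acc, lb_acc))
```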

LB score input (per EXP)

Edit values inline. Edits are saved to the browser (localStorage) and reflected in the plot above. Click Download lb_scores.json to get the merged file, then commit it to artifacts/eda/lb_scores.json for persistence.
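
The merge step might look roughly like the sketch below. The schema of lb_scores.json is not shown on this page, so the layout used here (a mapping from run_id to an object with `lb_acc` and `note` fields) is an assumption, as are the function name and the sample values. Inline edits are taken to override committed values field by field.

```python
import json


def merge_lb_scores(committed: dict, edited: dict) -> dict:
    """Overlay inline edits on the committed scores without mutating either input.

    Assumed schema (hypothetical): {run_id: {"lb_acc": float, "note": str}}.
    """
    merged = {run_id: dict(entry) for run_id, entry in committed.items()}
    for run_id, entry in edited.items():
        merged.setdefault(run_id, {}).update(entry)
    return merged


# Hypothetical contents, for illustration only.
committed = {"exp_a": {"lb_acc": 0.58, "note": ""}}
edited = {"exp_a": {"note": "resubmitted"}, "exp_b": {"lb_acc": 0.54}}

merged = merge_lb_scores(committed, edited)
print(json.dumps(merged, indent=2, sort_keys=True))
```

Copying each entry before updating keeps the committed data unchanged, so the merged result can be compared against it before committing.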

run_id · label · dataset · local acc · LB acc · note

EXP overview

Accuracy breakdown (this EXP)

Agent A config + prompt

Agent B config + prompt

Tool catalog