
🌐 Cross-dataset accuracy by EXP base (Public × Syn v3 × Ext v1)

Dataset summary (across all EXPs)

Experiments table

Click a row to load that EXP and switch to its Tasks view.
💡 Hover over the Question column to expand it · the Recent K column shows pass/fail dots for the last K experiments on each task (★ = currently selected EXP)
Compare with:  

Question

Context files (input data)

Gold answer + script

Output (gold.csv)

Script (gold_scripts/<task_id>.py)

Predictions (model outputs)

Selected EXP — prediction

Comparison EXP — prediction

Side-by-side dialogue (LINE-style)

Selected EXP

Comparison EXP

Keyboard: ←/→ to navigate · Esc to go back


Flow comparison (Gold vs Agents)

Side-by-side: the gold script's logical steps (left) vs each agent's distilled tool calls (right). Same color = matched pair (the agent reproduced that gold step). Gold steps with a grey border = the agent did not perform that step. Agent steps with a grey border = exploration or extra computation not present in the gold script.

Agent runs

Agent A

Prediction
Final answer

Tool trace

Agent B

Prediction
Final answer

Tool trace

Local vs LB accuracy (all EXPs)

Each point = one EXP. X = local accuracy (the binary mean from this EDA's scoring). Y = leaderboard accuracy (entered manually below or in artifacts/eda/lb_scores.json). The diagonal = perfect agreement; points below it = LB worse than local; points above = LB better. Hover to see the run_id; click to open that EXP.
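The binary-mean local accuracy on the X axis can be sketched as below. This is a minimal illustration, assuming per-task results carry a boolean pass/fail flag; the record shape and field names (`task_id`, `correct`) are assumptions, not this EDA's actual scoring code.

```python
def local_accuracy(results):
    """Binary mean: fraction of tasks whose 'correct' flag is True.

    results: list of dicts, each with a boolean 'correct' key
    (field names are hypothetical, for illustration only).
    """
    if not results:
        return 0.0
    return sum(1 for r in results if r["correct"]) / len(results)

scores = [
    {"task_id": "t1", "correct": True},
    {"task_id": "t2", "correct": False},
    {"task_id": "t3", "correct": True},
]
print(local_accuracy(scores))  # 2 of 3 tasks pass
```

One such value per EXP gives the X coordinate of each point in the scatter plot.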

Color: v3 paired runs · Copilot Public-50 EXPs · grey = LB missing

LB score input (per EXP)

Edit values inline. Changes are saved to the browser (localStorage) and reflected in the plot above. Click Download lb_scores.json to get the merged file, then commit it to artifacts/eda/lb_scores.json for persistence.
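A plausible shape for artifacts/eda/lb_scores.json, keyed by run_id, is sketched below. The exact schema, key names, and values are assumptions for illustration; only the file path comes from this page.

```json
{
  "exp_041": {
    "label": "syn-v3 paired",
    "dataset": "Public",
    "lb_acc": 0.72,
    "note": "manually entered from leaderboard"
  }
}
```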

run_id · label · dataset · local acc · LB acc · note

EXP overview

Accuracy breakdown (this EXP)

Agent A config + prompt

Agent B config + prompt

Tool catalog