
🌐 Cross-dataset accuracy by EXP base (Public × Syn v3 × Ext v1)

Dataset summary (across all EXPs)

Experiments table

Click a row to load that EXP and switch to its Tasks view.
💡 Hover over the Question column to expand it · the Recent K column shows pass/fail dots for the last K experiments on each task (★ = currently selected EXP)
Compare with:  

Question

Context files (input data)

Gold answer + script

Output (gold.csv)

Script (gold_scripts/<task_id>.py)

Predictions (model outputs)

Selected EXP — prediction

Comparison EXP — prediction

Side-by-side dialogue (LINE-style)

Selected EXP

Comparison EXP

Keyboard: ←/→ to navigate · Esc to go back


Flow comparison (Gold vs Agents)

Side-by-side: the gold script's logical steps (left) vs each agent's distilled tool calls (right). Same color = matched pair (the agent reproduced that gold step). Gold steps with a grey border = the agent did not perform that step. Agent steps with a grey border = exploration or extra computation not present in the gold script.

Agent runs

Agent A

Prediction
Final answer

Tool trace

Agent B

Prediction
Final answer

Tool trace

Local vs LB accuracy (all EXPs)

Each point = one EXP. X = local accuracy (the binary mean from this EDA's scoring). Y = leaderboard accuracy (entered manually below or in artifacts/eda/lb_scores.json). The diagonal = perfect agreement; points below it = LB worse than local; points above = LB better. Hover to see the run_id; click to open that EXP.
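The binary-mean local accuracy on the X axis can be sketched as below. This is a minimal illustration, assuming per-task results carry a boolean pass/fail flag; the record shape and field names (`task_id`, `correct`) are assumptions, not this EDA's actual scoring code.

```python
def local_accuracy(results):
    """Binary mean: fraction of tasks whose 'correct' flag is True.

    results: list of dicts, each with a boolean 'correct' key
    (field names are hypothetical, for illustration only).
    """
    if not results:
        return 0.0
    return sum(1 for r in results if r["correct"]) / len(results)

scores = [
    {"task_id": "t1", "correct": True},
    {"task_id": "t2", "correct": False},
    {"task_id": "t3", "correct": True},
]
print(local_accuracy(scores))  # 2 of 3 tasks pass
```

One such value per EXP gives the X coordinate of each point in the scatter plot.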

Color: v3 paired runs · Copilot Public-50 EXPs · grey = LB missing

LB score input (per EXP)

Edit values inline. Changes are saved to the browser (localStorage) and reflected in the plot above. Click Download lb_scores.json to get the merged file, then commit it to artifacts/eda/lb_scores.json for persistence.
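A plausible shape for artifacts/eda/lb_scores.json, keyed by run_id, is sketched below. The exact schema, key names, and values are assumptions for illustration; only the file path comes from this page.

```json
{
  "exp_041": {
    "label": "syn-v3 paired",
    "dataset": "Public",
    "lb_acc": 0.72,
    "note": "manually entered from leaderboard"
  }
}
```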

run_id · label · dataset · local acc · LB acc · note

EXP overview

Accuracy breakdown (this EXP)

Agent A config + prompt

Agent B config + prompt

Tool catalog