Blind Arena

Two anonymized answers to the same real task. Pick the one you'd ship — no logos, no hype. After enough votes, Crucible reveals the models and builds your ranking.

0 votes

Keyboard: 1 left wins · 2 right wins · T tie · S skip

Weighted Leaderboard

Public benchmarks weight everything equally. You don't. Set what matters and the ranking recomputes live.

Profile of the leader

Build Tasks

Real, small, gradeable jobs. Open one to read every model's answer side by side with per-dimension scores.

The Field

Six contenders. Operational numbers are illustrative; capability scores come from the task set.

Bring Your Own

Paste a task and real outputs from models you actually use. Crucible runs entirely in your browser — nothing is uploaded, nothing leaves this tab.

What Crucible is

Most LLM leaderboards hand you one number and ask you to trust it. But the model that tops MMLU might write code your team can't maintain, and the "best" model on paper might be the wrong call when latency and cost actually bite.

Crucible flips the test. It judges models the way you really judge them — by looking at what they build:

Blind Arena — you compare anonymized answers to real tasks and vote. A Bradley-Terry / Elo update runs after every vote, so your ranking emerges from your own taste, not a logo you already trust.
The reveal — once you've voted enough, Crucible unmasks the models and shows the gap between your blind ranking and the dimension-weighted scores. That gap is the point: it's where your gut and the spec disagree.
Weighted Leaderboard — drag sliders for correctness, readability, efficiency, security, robustness, cost, and speed. The board recomputes instantly. There is no single "best" — only best for what you weight.
Import — paste outputs from the models you actually pay for and evaluate them the same way, fully offline.
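The per-vote update behind the Blind Arena can be sketched in a few lines. This is a minimal illustration, not Crucible's actual code: K=24 and the 1500 starting rating come from the Method section, while the 400-point logistic scale, the function names, and the model labels are assumptions borrowed from the standard Elo formulation.

```python
K = 24          # update step, as stated in the Method section
BASE = 1500.0   # starting rating, as stated in the Method section
SCALE = 400.0   # assumption: the conventional Elo scale, not confirmed by the source


def expected_score(r_a: float, r_b: float) -> float:
    """Chance that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / SCALE))


def elo_update(r_a: float, r_b: float, outcome: float, k: float = K):
    """Return new (r_a, r_b). outcome: 1.0 left wins, 0.0 right wins, 0.5 tie."""
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta  # zero-sum: what A gains, B loses


# Hypothetical labels; a skipped pair simply produces no vote and no update.
ratings = {"model_a": BASE, "model_b": BASE}
votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)]
for left, right, outcome in votes:
    ratings[left], ratings[right] = elo_update(ratings[left], ratings[right], outcome)
```

Because each update moves both ratings by the same amount in opposite directions, the total rating across the field stays constant, and a tie between unequally rated models still shifts points toward the lower-rated one.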

Honesty notes

The bundled dataset is illustrative. Model names are fictional archetypes so we never put invented numbers next to a real product. Scores are hand-authored to make each model's character legible — one ships a subtle bug, one is verbose-but-safe, one is terse. The tool is real; trust the bundled scores only as a demo. For real signal, import real outputs.

No network calls. No analytics. No accounts. No data leaves your browser. Your votes and weights live in localStorage and you can clear them anytime.

Method

Blind ranking uses online Elo (K=24, every model starts at 1500), updated after each pairwise vote; a tie counts as half a win for each side. The weighted score normalizes every dimension to 0–100, inverts cost so that cheaper scores higher, maps speed onto the same scale, then takes a weighted mean. Divergence is a Spearman-style rank gap: the difference in each model's position between the blind ordering and the weighted ordering.
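The weighted score and the divergence can be sketched as follows. This is a sketch under stated assumptions rather than Crucible's implementation: min-max scaling across the field, reading "Spearman-style rank gap" as a per-model difference in rank position, and every name and number below are hypothetical.

```python
def min_max(values, invert=False):
    """Scale raw values to 0-100 across the field; invert when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [50.0 for _ in values]  # assumption: neutral score when there is no spread
    scaled = [100.0 * (v - lo) / (hi - lo) for v in values]
    return [100.0 - s for s in scaled] if invert else scaled


def weighted_score(scores, weights):
    """Weighted mean of 0-100 dimension scores; weights need not sum to 1."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(scores[dim] * w for dim, w in weights.items()) / total


def ranks(score_by_model):
    """Map each model to its 1-based rank, highest score first."""
    ordered = sorted(score_by_model, key=score_by_model.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_gap(blind, weighted):
    """Per-model rank difference: negative means the blind vote ranked it higher."""
    rb, rw = ranks(blind), ranks(weighted)
    return {model: rb[model] - rw[model] for model in rb}


# Invented numbers for illustration only.
models = ["alpha", "beta", "gamma"]
correctness = [90.0, 70.0, 80.0]                  # already on a 0-100 scale
cost = min_max([4.0, 1.0, 2.0], invert=True)      # raw cost: lower is better
speed = min_max([50.0, 100.0, 75.0])              # raw speed: higher is better
weights = {"correctness": 3, "cost": 1, "speed": 1}

weighted = {
    m: weighted_score({"correctness": c, "cost": k, "speed": s}, weights)
    for m, c, k, s in zip(models, correctness, cost, speed)
}
blind = {"alpha": 1540.0, "beta": 1480.0, "gamma": 1490.0}  # hypothetical Elo ratings
gap = rank_gap(blind, weighted)
```

Summing the absolute gaps gives a single divergence figure for the whole field. A squared-difference variant would match Spearman's rank correlation more closely.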