How to A/B test a Claude Skill in your browser
A SKILL.md is a system prompt with metadata. The body of the file gets prepended to every conversation Claude has with the user once the Skill loads — which means subtle wording changes there propagate to every output, often in ways you cannot predict by reading the diff. The only honest way to know whether your edit made the Skill better is to run the same user prompt with and without the new wording, side by side, and read the structural differences. This page shows the 30-second method, in the browser, with no account.
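To make "system prompt with metadata" concrete, here is a minimal loader sketch in TypeScript. It assumes the usual SKILL.md layout of YAML frontmatter between `---` fences followed by a markdown body; the file path and the split logic are illustrative, not Skillbench internals:

```ts
import { readFileSync } from "node:fs";

// Split a SKILL.md into metadata (YAML frontmatter) and body.
// The body is the part that gets prepended to the conversation
// as the system prompt once the Skill loads.
function loadSkill(path: string): { frontmatter: string; body: string } {
  const raw = readFileSync(path, "utf8");
  const match = raw.match(/^---\r?\n([\s\S]*?)\r?\n---\r?\n([\s\S]*)$/);
  if (!match) return { frontmatter: "", body: raw }; // no frontmatter: whole file is the prompt
  return { frontmatter: match[1], body: match[2].trim() };
}

const skillBody = loadSkill("code-reviewer/SKILL.md").body;
```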
The wrong way (and why it fails)
Run the Skill once, eyeball the output, decide it looks fine, ship it. Three reasons this breaks:
- Variance. The same prompt against the same Skill produces measurably different outputs run-to-run. Output length alone drifts by 20–30% across ten runs of a well-behaved Skill; for an under-specified Skill it drifts by 50% or more. One sample is noise.
- Comparison-blind reading. When you read a single output, your brain auto-corrects for missing structure. You see a paragraph and think "fine." Put that paragraph next to the same prompt's output with a tighter Skill, and the missing headings, missing line citations, missing verdict become obvious. Structure, not vocabulary, is what Skills actually change.
- Cost of ship-and-pray. A Skill that drifts in production costs more than a Skill that drifts in your sandbox. Catch it before it hits your users.
The right way: parallel streaming with and without the Skill
To compare Claude system prompts honestly you need three things on screen at once: the user prompt, the baseline (no Skill loaded), and the skill-loaded run. Both runs use the same model, the same temperature, the same prompt. Only the system prompt differs. Run them in parallel, so you read both outputs fresh instead of judging the second against a first you have already internalized.
Step 1 — Pick a prompt that represents the real use case
Not a toy prompt. If your Skill reviews pull requests, paste a real PR diff. If it triages emails, paste a real email. The reason: short toy prompts under-exercise the Skill's instructions — they're easy for both baseline and skill-loaded runs to nail, which hides the contribution the Skill is actually making. The longer and more realistic the prompt, the clearer the structural delta becomes.
Step 2 — Run it twice in parallel
Same prompt. Once with an empty system prompt (baseline). Once with your SKILL.md loaded as the system prompt. Same model, same max-tokens, same temperature. In Skillbench's Compare mode this is a single click — the two streams render side by side and finish at slightly different times, because streaming latency genuinely varies run to run.
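Under the hood this is just two Messages API calls that differ only in their `system` field. A minimal sketch with the `@anthropic-ai/sdk` TypeScript client, assuming `ANTHROPIC_API_KEY` is set in the environment; the model name and the `skillBody`/`prDiff` placeholders are illustrative:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // picks up ANTHROPIC_API_KEY from the environment

const skillBody = "..."; // body of your SKILL.md (see the loader sketch above)
const prDiff = "...";    // a real PR diff, pasted in

// One arm of the A/B run. Model, max_tokens, and temperature are fixed;
// only `system` differs between the baseline and the skill-loaded arm.
async function run(system: string | undefined, prompt: string): Promise<string> {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-20250514", // illustrative; pin whichever model you actually test
    max_tokens: 1024,
    temperature: 1, // any value works, as long as both arms use the same one
    ...(system ? { system } : {}),
    messages: [{ role: "user", content: prompt }],
  });
  return msg.content.map((b) => (b.type === "text" ? b.text : "")).join("");
}

const prompt = `Review this PR — it adds cookie-based session auth.\n${prDiff}`;

// Fire both arms at once so neither goes second.
const [baseline, skilled] = await Promise.all([
  run(undefined, prompt), // baseline: no system prompt
  run(skillBody, prompt), // skill-loaded: SKILL.md body as system prompt
]);
```

A streaming version swaps `create` for the SDK's `stream` helper, but for reading a structural diff the finished texts are all you need.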
Step 3 — Read the structural diff, not the wording
Once both runs finish, count: number of headings, number of list items, total length, presence of a verdict line, presence of cited line numbers. The Skill's contribution lives in those structural choices. If the baseline produces three vague paragraphs and the skill-loaded run produces five numbered findings plus a verdict, the Skill is doing real work. If the two outputs are structurally identical and only the vocabulary differs, your Skill is decorative — strip it back to the constraints that actually moved structure.
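If you would rather make the tally mechanical than eyeball it, the count is a few regexes. A rough sketch; these heuristics are illustrative, not Skillbench's actual scoring:

```ts
interface StructuralProfile {
  headings: number;      // markdown heading lines
  listItems: number;     // bulleted or numbered lines
  length: number;        // total characters
  hasVerdict: boolean;   // an explicit verdict line is present
  lineCitations: number; // references like "L. 24" or "line 24"
}

function profile(output: string): StructuralProfile {
  const lines = output.split("\n");
  return {
    headings: lines.filter((l) => /^#{1,6}\s/.test(l)).length,
    listItems: lines.filter((l) => /^\s*(?:[-*]|\d+\.)\s/.test(l)).length,
    length: output.length,
    hasVerdict: /\bverdict\b/i.test(output),
    lineCitations: (output.match(/\b(?:L\.?|line)\s*\d+/gi) ?? []).length,
  };
}
```

Run `profile` on both outputs and diff the fields. If every field matches and only the vocabulary differs, that is the decorative-Skill signal described above.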
A worked example: reviewing a PR diff
Open the code-reviewer skill in Skillbench's builder. The Skill is in the gallery — one click loads it. Then in the sandbox, paste this prompt:
Review this PR — it adds cookie-based session auth.
With the baseline run (no Skill, empty system prompt), Claude produces something like this:
Baseline · no Skill
Sure — this PR adds cookie auth. Looks broadly fine. Some thoughts: consider httpOnly. You may want CSRF protection. Tests look thin.
Skill-loaded · code-reviewer.md
L. 24 — `httpOnly` is missing → token readable from JS, **bug**.
L. 31 — no `SameSite=Lax`, CSRF surface.
L. 47 — silent catch swallows DB errors → security.
Missing test: expired session should 401.
**Verdict: no-ship.**
The vocabulary overlap is high — both mention httpOnly, both mention CSRF, both mention tests. The structural difference is total. Baseline produces a four-sentence paragraph with hedged language ("consider", "may want"). The skill-loaded run produces line-cited findings, categorizes each issue, and ends with a binary verdict. Same model, same prompt, same temperature — the SKILL.md is the only thing that changed.
This is the kind of output that makes the case for shipping the Skill into your team's workflow. It also makes the case for keeping the Skill tight: every constraint in the body of the SKILL.md ("cite line numbers", "categorize each issue", "end with a verdict") corresponds to something visible in the diff.
When to escalate to Measure mode
A/B compare answers the question "does the Skill produce better output than no Skill." It does not answer "does the Skill produce the same output run after run." For that, run N iterations of the same prompt against the same Skill and look at the distribution.
Measure mode in Skillbench runs the Skill 10 times (configurable up to 20) on a single prompt, plots output-length distribution as a small histogram, shows median, stddev, and per-run latency. The first run is dropped as warmup because token-streaming latency on a cold connection skews everything. Here's the rule of thumb: if stddev exceeds 15% of the mean, your Skill is non-deterministic in a way that will bite you in production. Tighten the body — more specific output format requirements, more concrete examples of the desired structure, fewer adjectives. Re-run Measure. Repeat until stddev drops below 15%.
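Written out, the rule is a one-liner over the surviving runs. A sketch of the check, not Skillbench's code; it takes the output lengths of runs 2 through N, with the warmup already dropped:

```ts
// Flag a Skill whose output length drifts too much run-to-run.
// `lengths` are character counts for runs 2..N (run 1 dropped as warmup).
function isStable(lengths: number[], threshold = 0.15): boolean {
  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance =
    lengths.reduce((acc, x) => acc + (x - mean) ** 2, 0) / lengths.length;
  const stddev = Math.sqrt(variance);
  return stddev / mean <= threshold; // stable if stddev ≤ 15% of the mean
}
```

Character count is a crude proxy, but it is the metric Measure mode plots, and a Skill that cannot hold its length steady is rarely holding its structure steady either.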
For the code-reviewer example above, ten runs produce outputs between 270 and 312 characters with a stddev of about 12. That's roughly 4% of the mean — a tight, predictable Skill. For comparison, a Skill that says "be helpful and thorough" with no structural constraints will produce stddevs of 40% or higher because every run is free to choose a different length.
What you cannot A/B test with Skillbench
Three things are out of scope, and it's worth being honest about them:
- Tool calls. Skills can declare `allowed-tools` like `read_file`, `grep`, `code_execution`. Those tools live in Claude Code (or the agent runtime that loads the Skill), not in the browser. Skillbench shows you what Claude would do with the Skill loaded as text; it does not execute the tool calls. If your Skill's value is mostly in orchestrating tools, A/B-test it inside Claude Code instead.
- Retrieval / RAG. Skillbench doesn't load external corpora into context. If your Skill depends on a vector store of documents, test it where that vector store lives.
- Long-running multi-turn. Each run is single-shot — one user message, one assistant response. Skills that depend on a multi-turn back-and-forth (the agent asks the user a follow-up, the user replies, the agent acts) need a multi-turn harness; Skillbench is for the single-message contract.
For everything else — and especially for the 80% of Skills that are essentially a tight system prompt with structural constraints — A/B compare in the browser catches the majority of regressions before they reach production.