Grader bot

ADDDDD MEEEEEE!!!

Using claude code to run experiments and grade itself using sub agents!

Top marks #

"Eval" is overloaded — a grader is the specific, useful thing: it scores one model’s output against a rubric
Use one for subjective, usually creative tasks (prose, images, translation) where "good" is hard to check programmatically
It works best with examples and a specific, numerically scored rubric typically using a 0-10 scale
Cheapest first move: hand the original agent the rubric and ask it to score itself — often fixes the output in the same turn
Don’t grade with the same model and prompt that produced the output — the bias compounds (use a subagent)
The real work is the feedback loop: expect 3+ rounds, and re-check when the model changes

Can we call it a grader? #

"Eval" has become the catch-all term in AI for everything from benchmark suites to vibe checks to production monitoring. It’s so overloaded it barely means anything.

A grader is a specific, useful thing: a system (usually an LLM, sometimes a script) that scores the output of another LLM against a rubric. You know what a grader does. You had one in school. It looks at your work, checks it against criteria, and gives you a score with feedback.

An eval is a category. A grader is a tool you can build and iterate on.

When you need one #

Any AI task where "good" is subjective or hard to define programmatically
Image generation, prose writing, code review, translation, summarization, content moderation
Any task where you plan to iterate on the prompt — you need a way to know if the new version is actually better
Probably don’t need a grader if:
- The output is deterministically checkable (JSON schema validation, unit tests, exact string match)
- The output is temporary/intermediate: EG brainstorming, riffing/vibing, other collaboration

What goes wrong #

No examples: Asking an LLM to grade without showing it what "good" looks like produces random scores
Vague rubric: "Score originality from 1-10" means little. What’s the difference between a 5 and a 7? If you can’t describe it, the grader can’t score it.
Vague or incomplete prompts: If the original prompt producing the output is not well fleshed out, the grader just confirms the output is bad. Fix the prompt first.
Grading your own homework: Using the same model and same system prompt to both produce and grade output. The biases compound — the model will be systematically generous toward outputs that match its own patterns.¹

Try self-scoring first #

Before you build a separate grader and a full iteration loop, try the cheapest thing: give the original agent the rubric and ask it to score its own output.

Often the model can see the gap once the criteria are explicit in front of it
A low self-score frequently triggers a better second attempt in the same turn — no separate grader needed
This can save several iterations of building and tuning a dedicated grader
It works precisely because most bad output comes from a vague task, not a weak model — the rubric supplies the missing definition of "good"

The caveat is self-enhancement bias: a model grading its own work tends to be generous.¹ Self-scoring is a fast first pass, not a substitute for an independent grader on anything that matters.

What you actually need #

2+ examples with scores and short explanations of why each got that score
Specific rubric on a 0-10 scale with descriptions at key anchor points (what does a 0 look like? a 5? an 8? a 10?)
Clear system prompt that includes the task context, the rubric, and the examples
Separate user prompt for each item being graded — the thing you’re actually scoring, e.g. the blog post text
A different model - not always possible or necessary, but a totally different model from a totally different provider can help keep things objective

An example: our blog-post rubric #

Here’s the rubric we grade wrunk.dev posts against — this is the canonical version. Each dimension scores 0–10:

One clear point, led with — a single takeaway, stated up front, not buried
- ✓ Opens with "Use a window manager — here’s the 5-minute setup"
- ✗ Three paragraphs of backstory before the actual advice
- ✗ Two competing theses fighting for the same post
Concise & information-dense — no filler; survives a "cut 30%" pass
- ✓ "Sonnet is usually plenty for writing"
- ✗ "It is also worth bearing in mind that, in many cases, one might consider…"
- ✗ A 400-word section a 6-bullet list would cover
Scannable — H2 signposting, short paragraphs, bullets, disciplined emphasis
- ✓ You can grasp the whole post from the H2s alone
- ✗ A wall of 8-sentence paragraphs, no headings
- ✗ Half the sentences bolded "for emphasis"
Accurate & current — claims correct, sourced where load-bearing, nothing outdated
- ✓ "Meta added an AI-assisted interview (Oct 2025)" — with a link
- ✗ "GPT-4 is the latest model" (outdated)
- ✗ A confident statistic with no source
Voice — friendly, technical-but-readable, neutral; consistent with the rest of the site
- ✓ "It had a good run, but I don’t want to install Rosetta"
- ✗ Hype or snark ("this changes EVERYTHING")
- ✗ Stiff corporate tone ("leveraging synergies to…")

The gate — a hard cap. Some flaws are disqualifying no matter how good everything else is. If a post has a factual error, a load-bearing claim with no source, or an obviously outdated take, its score is capped at 4/10 — even if all five dimensions would otherwise be 9s and 10s. The reason: a polished post that’s wrong costs more trust than a plain post that’s right, so no amount of style buys back a credibility miss.

It’s deliberately about the quality bar, not the topic. The ✓/✗ above are illustrative anchors; a production grader still needs 5+ fully scored example posts before it judges reliably.

Another example: grading vector drawings #

LLMs are still shaky at vector art — the SVG infographics in these posts took real iteration. A grader makes that iteration tractable. The dimensions, each 0–10:

Subject match — recognizably the thing asked for; a "dog" that reads as a dog
Legibility at size — clear at a glance and at thumbnail size, not a tangle of paths
Composition — balanced and aligned; uses the canvas, nothing crammed or floating in dead space
Palette — on-brand colors, enough contrast, not garish
Text (if any) — labels fit and align; nothing clipped or overflowing its box
Technical validity — valid SVG, renders the same across viewers, a sane path count

The gate — a hard cap. If it doesn’t render, or you can’t tell what it’s supposed to be, it’s capped low — nothing else matters.

The tell that this rubric works: scores track the squint test. Shrink the drawing to a thumbnail — a 9 still reads, a 3 turns to mush.

The same brief — "a house" — drawn three ways: a 3 with clashing colors, a crooked roof, and a floating door; a 6 that's recognizable but plain and slightly off; and a 9 that's clean, balanced, and on-brand.

The feedback loop is the real work #

A four-step cycle: score a batch, review where you disagree, refine the rubric and examples, re-score the same batch, and repeat until the grader's scores match your judgment.

Grader scores a batch of outputs
You review the scores — especially the ones you disagree with
You refine the rubric and examples based on what the grader got wrong
Re-run the grader on the same outputs
Repeat until the grader’s scores match your judgment consistently

Expect 5-10 iterations before a grader is useful. The rubric and examples are living documents. Update them as you learn what "good" actually means for your use case.

When the underlying model changes (and it will — regularly), re-run your grader on a fixed set of outputs. Scores will drift.

Practice is the thing people skip #

Building a good grader requires you to deeply understand what you want. Most people skip this step and then wonder why their AI output is mediocre.

The act of writing examples forces you to define "good" concretely
The act of writing rubric entries forces you to articulate the differences between good, acceptable, and bad
This is true for image generation, prose, code, translation — anything creative or subjective
The grader is a byproduct. The real output is your own clarity about the task.

Footnotes #

Known as "self-enhancement bias" in the LLM-as-judge literature. Models systematically rate their own output higher than output from other models. Position bias (preferring the first option in a comparison) is another well-documented issue. Using a different model or different prompting strategy for grading helps mitigate both. ↩