Grader bot

Can we please stop using the word 'eval' for everything?

aistrategy

Using Claude Code to run experiments and grade itself using subagents!

Top marks #

Can we call it a grader? #

"Eval" has become the catch-all term in AI for everything from benchmark suites to vibe checks to production monitoring. It’s so overloaded it barely means anything.

A grader is a specific, useful thing: a system (usually an LLM, sometimes a script) that scores the output of another LLM against a rubric. You know what a grader does. You had one in school. It looks at your work, checks it against criteria, and gives you a score with feedback.

An eval is a category. A grader is a tool you can build and iterate on.

When you need one #

What goes wrong #

Try self-scoring first #

Before you build a separate grader and a full iteration loop, try the cheapest thing: give the original agent the rubric and ask it to score its own output.

The caveat is self-enhancement bias: a model grading its own work tends to be generous.[1] Self-scoring is a fast first pass, not a substitute for an independent grader on anything that matters.
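A minimal sketch of that first pass, under stated assumptions: `ask_model` is a hypothetical callable you supply (it sends a prompt to whatever agent produced the output and returns its reply), and the `SCORE: <n>` reply format is an invented convention, not something any particular model or SDK requires.

```python
import re
from typing import Callable, Optional


def build_self_score_prompt(rubric: str, output: str) -> str:
    """Assemble a prompt asking the agent to grade its own output."""
    return (
        "Score the output below against the rubric. Be strict.\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"OUTPUT:\n{output}\n\n"
        "Give one line of feedback per dimension, then a final line "
        "in the form 'SCORE: <0-10>'."
    )


def parse_score(reply: str) -> Optional[int]:
    """Pull the overall score out of the reply; None if absent or out of range."""
    match = re.search(r"SCORE:\s*(\d{1,2})\b", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 10 else None


def self_score(ask_model: Callable[[str], str], rubric: str, output: str) -> Optional[int]:
    """Run one self-scoring pass. ask_model is a placeholder for your own model call."""
    return parse_score(ask_model(build_self_score_prompt(rubric, output)))
```

Returning `None` for an unparseable reply keeps a malformed answer from being silently counted as a score.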

What you actually need #

An example: our blog-post rubric #

Here’s the rubric we grade wrunk.dev posts against — this is the canonical version. Each dimension scores 0–10:

  1. One clear point, led with — a single takeaway, stated up front, not buried
    • ✓ Opens with "Use a window manager — here’s the 5-minute setup"
    • ✗ Three paragraphs of backstory before the actual advice
    • ✗ Two competing theses fighting for the same post
  2. Concise & information-dense — no filler; survives a "cut 30%" pass
    • ✓ "Sonnet is usually plenty for writing"
    • ✗ "It is also worth bearing in mind that, in many cases, one might consider…"
    • ✗ A 400-word section a 6-bullet list would cover
  3. Scannable — H2 signposting, short paragraphs, bullets, disciplined emphasis
    • ✓ You can grasp the whole post from the H2s alone
    • ✗ A wall of 8-sentence paragraphs, no headings
    • ✗ Half the sentences bolded "for emphasis"
  4. Accurate & current — claims correct, sourced where load-bearing, nothing outdated
    • ✓ "Meta added an AI-assisted interview (Oct 2025)" — with a link
    • ✗ "GPT-4 is the latest model" (outdated)
    • ✗ A confident statistic with no source
  5. Voice — friendly, technical-but-readable, neutral; consistent with the rest of the site
    • ✓ "It had a good run, but I don’t want to install Rosetta"
    • ✗ Hype or snark ("this changes EVERYTHING")
    • ✗ Stiff corporate tone ("leveraging synergies to…")

The gate — a hard cap. Some flaws are disqualifying no matter how good everything else is. If a post has a factual error, a load-bearing claim with no source, or an obviously outdated take, its score is capped at 4/10 — even if all five dimensions would otherwise be 9s and 10s. The reason: a polished post that’s wrong costs more trust than a plain post that’s right, so no amount of style buys back a credibility miss.
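The gate is easy to express in code. The sketch below assumes the overall score is an unweighted mean of the five dimensions (the rubric above does not say how dimensions combine), and the dimension keys and function name are illustrative.

```python
from statistics import mean

# Illustrative keys for the five rubric dimensions above.
DIMENSIONS = ("one_clear_point", "concise", "scannable", "accurate", "voice")
GATE_CAP = 4.0  # the hard cap from the rubric: 4/10


def overall_score(scores: dict[str, float], gate_tripped: bool) -> float:
    """Combine per-dimension scores, then apply the hard cap if the gate tripped.

    Assumption: the overall score is a plain average of the dimensions.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} must be between 0 and 10")

    raw = float(mean(scores[d] for d in DIMENSIONS))
    # A cap, not a penalty: a weak post that trips the gate keeps its lower score.
    return min(raw, GATE_CAP) if gate_tripped else raw
```

For example, a post scoring 9 on every dimension comes out at 9.0, but the same post with a factual error comes out at 4.0.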

The rubric is deliberately about the quality bar, not the topic. The ✓/✗ above are illustrative anchors; a production grader still needs 5+ fully scored example posts before it judges reliably.

Another example: grading vector drawings #

LLMs are still shaky at vector art — the SVG infographics in these posts took real iteration. A grader makes that iteration tractable. The dimensions, each 0–10:

  1. Subject match — recognizably the thing asked for; a "dog" that reads as a dog
  2. Legibility at size — clear at a glance and at thumbnail size, not a tangle of paths
  3. Composition — balanced and aligned; uses the canvas, nothing crammed or floating in dead space
  4. Palette — on-brand colors, enough contrast, not garish
  5. Text (if any) — labels fit and align; nothing clipped or overflowing its box
  6. Technical validity — valid SVG, renders the same across viewers, a sane path count

The gate — a hard cap. If it doesn’t render, or you can’t tell what it’s supposed to be, it’s capped low — nothing else matters.
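Part of the technical-validity dimension can be checked by a script before any LLM looks at the drawing. This is a rough sketch using only Python's standard library: it tests whether the file is well-formed XML with an `<svg>` root and counts `<path>` elements. It does not prove the file renders, and the `MAX_PATHS` threshold is a made-up number.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
MAX_PATHS = 200  # hypothetical "sane path count" threshold; tune for your drawings


def check_svg(svg_text: str, max_paths: int = MAX_PATHS) -> dict:
    """Cheap structural checks on an SVG string. Not a rendering test."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as err:
        # Not well-formed XML: treat it as failing the gate.
        return {"gate_tripped": True, "path_count": 0,
                "too_many_paths": False, "reason": f"parse error: {err}"}

    # Accept both namespaced and bare <svg> roots.
    if root.tag not in (SVG_NS + "svg", "svg"):
        return {"gate_tripped": True, "path_count": 0,
                "too_many_paths": False, "reason": "root element is not <svg>"}

    path_count = sum(
        1 for el in root.iter() if el.tag in (SVG_NS + "path", "path")
    )
    return {"gate_tripped": False, "path_count": path_count,
            "too_many_paths": path_count > max_paths, "reason": ""}
```

An excessive path count is reported as a flag rather than a gate failure, since the rubric treats it as one scored dimension among six.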

The tell that this rubric works: scores track the squint test. Shrink the drawing to a thumbnail — a 9 still reads, a 3 turns to mush.

The same brief — "a house" — drawn three ways: a 3 with clashing colors, a crooked roof, and a floating door; a 6 that's recognizable but plain and slightly off; and a 9 that's clean, balanced, and on-brand.

The feedback loop is the real work #

A four-step cycle, repeated: score a batch, review where you disagree, refine the rubric and examples, and re-score the same batch, until the grader's scores match your judgment.

  1. Grader scores a batch of outputs
  2. You review the scores — especially the ones you disagree with
  3. You refine the rubric and examples based on what the grader got wrong
  4. Re-run the grader on the same outputs
  5. Repeat until the grader’s scores match your judgment consistently
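"Match your judgment" is easier to track with a number. One possible way to measure it is sketched below; the function name, the one-point tolerance, and the choice of metrics are assumptions rather than an established standard.

```python
def agreement(grader: list[float], human: list[float], tolerance: float = 1.0) -> dict:
    """Compare grader scores with your own scores for the same outputs."""
    if not grader or len(grader) != len(human):
        raise ValueError("need two non-empty score lists of equal length")

    diffs = [abs(g - h) for g, h in zip(grader, human)]
    # Indices of the biggest disagreements: the outputs worth reviewing first.
    worst = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)[:3]
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
        "worst_indices": worst,
    }
```

Running this after every iteration on the same batch shows whether a rubric change moved the grader closer to you or further away, and `worst_indices` points at the disagreements to study in step 2.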

Expect 5-10 iterations before a grader is useful. The rubric and examples are living documents. Update them as you learn what "good" actually means for your use case.

When the underlying model changes (and it will — regularly), re-run your grader on a fixed set of outputs. Scores will drift.
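A drift check can be as simple as diffing two runs over the fixed set. In this sketch the output IDs, the function name, and the one-point threshold are all illustrative.

```python
def score_drift(baseline: dict[str, float], rerun: dict[str, float],
                threshold: float = 1.0) -> dict:
    """Compare scores for the same fixed outputs before and after a model change."""
    shared = sorted(baseline.keys() & rerun.keys())
    if not shared:
        raise ValueError("no outputs in common between the two runs")

    deltas = {key: rerun[key] - baseline[key] for key in shared}
    return {
        # Signed average: positive means the grader got more generous overall.
        "mean_shift": sum(deltas.values()) / len(deltas),
        # Individual outputs whose score moved by more than the threshold.
        "flagged": [key for key in shared if abs(deltas[key]) > threshold],
    }
```

A non-zero `mean_shift` suggests the grader as a whole became stricter or more lenient, while `flagged` lists the specific outputs to re-review by hand.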

Practice is the thing people skip #

Building a good grader requires you to deeply understand what you want. Most people skip this step and then wonder why their AI output is mediocre.


Footnotes #

  1. Known as "self-enhancement bias" in the LLM-as-judge literature. Models systematically rate their own output higher than output from other models. Position bias (preferring the first option in a comparison) is another well-documented issue. Using a different model or different prompting strategy for grading helps mitigate both.