Stop Trusting the Leaderboard: Evaluate AI on Your Own Work

The leaderboard is not your job.

A new frontier model has landed almost every month this year. Claude's Fable line arrived and was then pulled from sale for foreign nationals by a government order. OpenAI made GPT-5.5 Instant the ChatGPT default. Google pushed Gemini 3.5 Flash as an agent default. Microsoft shipped seven in-house MAI models. Mistral put frontier-class weights on four GPUs. Every one of those launches came with a chart showing it beating the others on some benchmark. If you are trying to choose an AI tool for real work, that chart is close to useless, and the speed of the churn means whatever you pick on benchmark scores will be "beaten" within weeks anyway.

The way out of this is not to track the leaderboard more closely. It is to stop trusting it, and to evaluate any tool the only way that actually predicts whether it will work for you: on your own tasks, against your own standard, before you commit. That sounds like more effort than reading a scoreboard. It is, by a few hours, and those few hours are the cheapest insurance you will buy on an AI decision.

It is worth being honest about where AI buyer's remorse actually comes from. It is rarely a tool that scored badly on a benchmark. It is the tool that demoed beautifully, got bought on the strength of that demo, and then quietly underperformed on the real, repetitive work nobody tested it against. The gap between the demo and the day job is where the money is lost. An evaluation built from your own tasks is simply the act of testing the day job before you pay for it, instead of after.

Why the scoreboard cannot carry your decision

The benchmarks vendors quote are weaker evidence than their confidence suggests. A large review titled "Measuring what Matters: Construct Validity in Large Language Model Benchmarks", carried out by a team of 29 reviewers and reported in November 2025, examined 445 separate LLM benchmarks from leading AI conferences. The finding was damning in its plainness: "almost all articles have weaknesses in at least one area" that undermine the performance claims built on them. Twenty-seven per cent relied on convenience sampling, reusing existing benchmark data or human exam questions rather than anything representative. Only 16 per cent used uncertainty estimates or statistical tests to compare results at all.

There is a second, more basic problem: contamination. Models train on enormous slices of the internet, which increasingly include the benchmark questions and answers themselves. When a test's questions sit in a model's training data, a high score measures memory, not capability. The review notes this directly, pointing to widely used tests whose results are undermined when their questions and answers appear in the model's pre-training data.

Even a clean benchmark would not settle your decision, because a public score is measured under conditions that are not yours. The prompt, the scaffolding, the tools the model could call, the temperature: all of it differs from how the tool will run in your environment on your data. A leaderboard tells you something about a model in a lab. It tells you almost nothing about how it will handle the specific, messy, repetitive task you actually need done. And the vendor demo is worse, because it was built to succeed.

None of this makes benchmarks worthless. They have a real job: helping researchers track whether the field is making progress over time, and giving vendors a shared yardstick to aim at. That job is legitimate. The mistake is borrowing a research instrument and treating it as procurement evidence, as if a score designed to compare models in the abstract could tell you whether a particular tool will do your particular work. It was never built to answer that question, and it does not.

The shift: evidence has to come from your work

The principle is simple once stated. The only evidence that should move your money is the tool's performance on tasks that look like your tasks, judged by your definition of good. Everything else is marketing, including the parts that look like science.

This reframes the whole buying decision. You are not asking "which model is best", a question with no stable answer in a market that turns over monthly. You are asking "which tool does my work well enough, reliably enough, safely enough, to be worth what it costs", a question only your own evaluation can answer, and one whose answer stays useful even as the leaderboard churns underneath it.

The method: build a private evaluation Project

You do not need a research team for this. You need a small, honest test built from your real work, the discipline to run it the same way for every candidate, and somewhere to keep it. That somewhere is a dedicated Project. Both ChatGPT (Projects) and Claude (Projects) let you create a contained workspace with its own standing instructions and its own uploaded files, separate from your general chat history. That container is exactly what an evaluation needs: a fixed brief, a fixed task set, and a fixed scoring rubric that you run identically against every candidate, instead of re-typing the ground rules into a fresh chat each time and quietly letting the test drift.

A standing note before any of the prompts below. Never paste real personal, claim, health or incident data into a model that is not an approved enterprise instance. The tool you are evaluating is, by definition, not yet approved. De-identify every task before it goes near it: swap real values for placeholder tokens such as [EMPLOYEENAME], [CLAIMNUMBER], [INCIDENTID], [TEAM], [ROLE], [SITE] and [DATE]. For an evaluation set, public source material (a published regulator release, a public annual report) is ideal, because it carries no confidentiality risk at all and still exercises the tool on work that looks like yours.
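
As a rough illustration of the de-identification step, the Python sketch below swaps known names and pattern-matched identifiers for the placeholder tokens listed above. The identifier formats (CLM-, INC-, day/month/year dates), the `deidentify` function and the sample sentence are all hypothetical assumptions for illustration; substitute the formats your own records use.

```python
import re

# Hypothetical identifier formats; replace with the patterns your organisation uses.
PATTERNS = [
    (re.compile(r"\bCLM-\d{6}\b"), "[CLAIMNUMBER]"),
    (re.compile(r"\bINC-\d{4,}\b"), "[INCIDENTID]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
]


def deidentify(text: str, known_names: list) -> str:
    """Replace known names and pattern-matched identifiers with placeholder tokens."""
    # Names cannot be caught reliably by pattern, so they come from a list you supply.
    for name in known_names:
        text = re.sub(re.escape(name), "[EMPLOYEENAME]", text, flags=re.IGNORECASE)
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


# Invented sample text, not real data.
sample = "Jane Citizen lodged CLM-204981 on 3/11/2025 after incident INC-77120."
print(deidentify(sample, ["Jane Citizen"]))
```

A pattern pass like this is only a first filter. It will miss anything that does not match the patterns you wrote, so a person should still read every task before it goes into the Project.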

Setting up the evaluation Project

Create a new Project and paste the following block into its custom instructions or project description. It fixes the model's role for the whole evaluation so that every candidate is briefed identically.

Prompt

ROLE: You are an evaluation assistant. You are one of several AI tools being
tested against the same fixed task set and the same scoring rubric. You are not
being asked to impress me. You are being measured on accuracy, correct format,
and correct failure behaviour.

STANDING RULES:
- Answer only from the source material I provide in each task. Do not use
  outside knowledge to fill gaps.
- If the source does not contain the answer, say "Not stated in the source."
  Never invent a fact, a figure, a citation, a clause or a reference.
- When you cite, quote the exact text and name the document and section it
  came from. If you cannot point to it, do not cite it.
- Flag uncertainty explicitly. A wrong answer given confidently scores worse
  than a correct "I am not sure, here is why".
- Keep to the output format each task specifies. Do not add commentary,
  preamble or sign-off unless asked.

DATA RULE: All material I give you is de-identified or public. If you ever see
what looks like a real name, claim number or personal detail, stop and tell me
before continuing.

Figure: Illustrative ChatGPT interface mockup showing a Project named Tool Evaluation with the standing evaluation instructions pasted in, ready to brief every candidate tool identically.

Files to upload to the Project

Keep the evaluation self-contained. Upload these so every candidate works from the same material, and so the test does not drift between runs:

The task set. Ten to thirty de-identified real tasks, each with its source material attached or pasted. This is the heart of the evaluation.
The scoring rubric. Your written definition of good for each task type (accuracy, format, tone, and the right failure behaviour).
Known-good answers. Where a task has a correct answer, record it in a separate reference file you do not give the model, so you can mark against it.
A source pack. The published documents the tasks draw from (for the worked example below, the relevant APRA and ASIC releases), so the tool answers from the same evidence every time.

The prompt library

Four prompts run the whole evaluation: define the tasks, write the rubric, run the shortlist blind, then score against the rubric with kill criteria. Run every candidate tool through prompt three identically, then score all of their outputs together with prompt four. Remember the standing note above: de-identify everything, and prefer public source material where you can.

1. Define the real tasks.

Prompt

Help me build an evaluation task set for an AI tool I am considering buying.

Here is what I would actually use the tool for: [DESCRIBE THE REAL WORK, e.g.
"summarise regulatory updates and pull out the obligations that changed"].

Draft 15 evaluation tasks that represent the REAL distribution of this work,
not clever edge cases. For each task give me: a short task name, the exact
instruction I would give the tool, and the source material it needs (I will
attach de-identified or public documents). Bias the set toward the repetitive
day-job work, not the showpiece. Flag any task where a wrong answer would do
real damage in a regulated setting, so I can weight it.

2. Write the scoring rubric.

Prompt

Using the 15-task set above, write a scoring rubric I can apply to any tool's
output, BEFORE I see the output.

For each task type, define what "good" looks like on four axes:
- Accuracy: is it correct against the source / known answer?
- Format: is it in the exact structure I asked for?
- Tone: is it in the register my work needs?
- Failure behaviour: does it say "Not stated in the source" when it should,
  refuse what it should refuse, and never fabricate a citation or requirement?

Make each axis a 0-2 score with a one-line description of what 0, 1 and 2 mean.
Mark failure behaviour as the highest-weighted axis for any task I flagged as
high-damage. Output the rubric as a table I can reuse for every candidate.
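
If you would rather total the rubric yourself than rely on a model's arithmetic, the scoring above reduces to a few lines. This Python sketch assumes the four axes named in the prompt, each scored 0 to 2. The weight of 3 on failure behaviour for high-damage tasks is an illustrative assumption, not a recommended value, and `weighted_score` is a hypothetical helper name.

```python
AXES = ("accuracy", "format", "tone", "failure_behaviour")


def weighted_score(scores: dict, high_damage: bool, failure_weight: int = 3) -> int:
    """Sum the 0-2 axis scores for one task.

    On a high-damage task, failure behaviour counts `failure_weight` times
    (an assumed weight; choose your own before you see any output).
    """
    for axis in AXES:
        if scores[axis] not in (0, 1, 2):
            raise ValueError(f"{axis} must be 0, 1 or 2, got {scores[axis]!r}")
    total = 0
    for axis in AXES:
        weight = failure_weight if (high_damage and axis == "failure_behaviour") else 1
        total += weight * scores[axis]
    return total


example = {"accuracy": 2, "format": 2, "tone": 1, "failure_behaviour": 2}
print(weighted_score(example, high_damage=False))  # 7
print(weighted_score(example, high_damage=True))   # 11
```

Fixing the weight in code before the run serves the same purpose as writing the rubric first: nobody can adjust it afterwards to rescue a favourite.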

3. Run the shortlist blind.

Prompt

You are Candidate [A / B / C]. Complete every task in the attached task set,
one at a time, in order.

For each task: follow the instruction exactly, answer only from the attached
source, use the output format the task specifies, and apply the standing rules
in this Project (cite exact text, say "Not stated in the source" when the
answer is not there, flag uncertainty, never fabricate).

Number your answers to match the task numbers. Add no preamble. I will record
your answers against an unlabelled candidate ID so the scoring is blind to which
tool you are.
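
The blind labelling can be done with a shuffled lookup that someone other than the scorer holds. The Python sketch below is one minimal way to do it; the tool names are placeholders and `assign_blind_ids` is a hypothetical helper, not a feature of any product.

```python
import random
from typing import Optional


def assign_blind_ids(tools: list, seed: Optional[int] = None) -> dict:
    """Map each tool name to a shuffled label such as 'Candidate A'.

    Supports up to 26 tools, since labels run from A to Z.
    """
    rng = random.Random(seed)
    shuffled = list(tools)
    rng.shuffle(shuffled)
    return {tool: f"Candidate {chr(ord('A') + i)}" for i, tool in enumerate(shuffled)}


# Placeholder tool names for illustration.
key = assign_blind_ids(["tool-one", "tool-two", "tool-three"])

# The scorer sees only the labels; keep `key` sealed until scoring is finished.
print(sorted(key.values()))  # ['Candidate A', 'Candidate B', 'Candidate C']
```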

4. Score against the rubric with kill criteria.

Prompt

Here are the unlabelled outputs from three candidate tools (Candidate A, B, C)
for the same 15-task set, plus the rubric and the known-good answers.

Score each candidate task-by-task on the four rubric axes. Then produce:
- A scoreboard: total and per-axis score for each candidate.
- A fabrication log: every instance where a candidate invented a citation,
  figure, clause or requirement not in the source. List each one.
- A recommendation against these KILL CRITERIA: reject any candidate that
  fabricated a citation on a high-damage task, OR scored 0 on failure
  behaviour on any task, regardless of how fluent it was elsewhere.

Tell me which candidate, if any, is fit to take to a bounded pilot, and what
its remaining weaknesses are.
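
The kill criteria in this prompt are mechanical enough to check by hand or in code once the scores and the fabrication log exist. The sketch below encodes the two conditions as stated: a fabricated citation on a high-damage task, or a 0 on failure behaviour on any task. The `TaskResult` record and its field names are assumptions made for illustration, and the sample results are invented.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: int
    high_damage: bool           # flagged when the task set was defined
    failure_behaviour: int      # 0-2 rubric score
    fabricated_citation: bool   # taken from the fabrication log


def kill_reasons(results: list) -> list:
    """Return every reason a candidate trips the kill criteria (empty list = none)."""
    reasons = []
    for r in results:
        if r.high_damage and r.fabricated_citation:
            reasons.append(f"task {r.task_id}: fabricated citation on a high-damage task")
        if r.failure_behaviour == 0:
            reasons.append(f"task {r.task_id}: scored 0 on failure behaviour")
    return reasons


# Invented results for one candidate.
candidate_a = [
    TaskResult(1, high_damage=False, failure_behaviour=2, fabricated_citation=False),
    TaskResult(2, high_damage=True, failure_behaviour=0, fabricated_citation=True),
]
reasons = kill_reasons(candidate_a)
print("REJECT" if reasons else "eligible for pilot", reasons)
```

The function returns reasons rather than a bare yes or no so that the person making the decision can record exactly which task tripped which criterion.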

The five-step logic underneath these prompts holds even if you run the whole thing by hand. Define the real tasks. Write down what good looks like before you test. Keep the eval set private and out of any tool's training path. Test the shortlist the same way, blind where you can. Then run a bounded pilot with explicit kill criteria, because a pilot without a stop condition is just a slow, unmonitored purchase that nobody ever decides to end.

A worked example: a bank compliance team evaluating a regulatory-update summariser

An Australian bank's compliance team wants an assistant to help summarise regulatory updates. The leaderboard says one tool is on top; the vendor demo is slick. Instead of buying on either, they build the evaluation properly.

Setup. A team member creates a Project called "Reg-Update Summariser Eval", pastes in the standing instructions, and uploads a source pack of recent public APRA and ASIC releases. Nothing confidential goes in, because the source material is already public. The team assembles fifteen real tasks, each a version of "summarise this release and list the obligations that changed", drawn from genuine recent work but using only the published source. They write the rubric first: a summary has to be accurate, cite the right document and section, flag where it is unsure, and never invent a reference or a requirement that is not in the source. Fabricating a citation is flagged as the high-damage failure and weighted highest.

Run. They run three tools through the same fifteen tasks, recording each tool's answers under an unlabelled candidate ID so the person scoring cannot see which model produced what. The blind-run prompt is identical for every candidate.

Illustrative output (Candidate A, one task). Asked to summarise a public ASIC release and list the changed obligations, the leaderboard leader returns fluent, confident prose, and on this task includes the line: "Entities must lodge the updated attestation under section 14B within 30 days." The team's reference check finds no section 14B and no 30-day attestation requirement anywhere in that release. The tool invented a clause and a deadline. A quieter candidate, on the same task, writes less elegantly but ends with: "The release does not state a specific lodgement timeframe. Not stated in the source."

The human decision gate. This is the step that cannot be delegated. A senior compliance analyst reads the fabrication log, not the marketing. Candidate A scored highest on fluency and would have won a demo. But it fabricated a regulatory obligation on a high-damage task, which trips the kill criterion the team set in advance: reject any candidate that invents a citation on a high-damage task, regardless of polish. The analyst records the decision in writing, rejects Candidate A, and selects the quieter tool that flagged uncertainty correctly every time. No tool output makes this call. A named person does, against criteria fixed before anyone saw the results.

The team then pilots the chosen tool with a clear kill criterion and a fixed end date, and keeps the fifteen-task set to re-run the moment the next regulatory update lands or the vendor ships a model update. The demo would never have surfaced the fabrication. The benchmark could not have. Their evaluation did, in an afternoon.

The hype check, both ways

Two cautions, because the goal is a useful test, not a research project.

Do not over-engineer it. You do not need a formal eval harness, a statistics team or a hundred test cases to make a far better decision than the leaderboard gives you. Twenty real tasks, a clear rubric and a single Project will tell you more than any public benchmark. The point is to test on your work at all, not to do it perfectly. A rough evaluation run honestly beats a polished benchmark trusted blindly, every time.

And do not treat the result as permanent. The same product changes underneath you. A model update can shift behaviour, sometimes improving it, sometimes quietly breaking a task that used to work. The tool you evaluated in March is not guaranteed to be the tool you are running in September. Keep your evaluation Project, and re-run it when a major update lands or on a regular cadence. Your eval set is not a one-time gate. It is a permanent instrument for watching a thing that does not hold still.
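
Re-running the set is only useful if you compare it with the last run. A minimal Python sketch of that comparison follows, assuming you keep each run as a mapping from task number to total rubric score; the `regressions` helper and the sample scores are hypothetical.

```python
def regressions(baseline: dict, rerun: dict) -> list:
    """Return the task IDs whose total rubric score dropped since the baseline run."""
    return sorted(t for t in baseline if t in rerun and rerun[t] < baseline[t])


# Invented per-task totals (four axes, 0-2 each, so 8 is the maximum).
march = {1: 8, 2: 7, 3: 8}
september = {1: 8, 2: 5, 3: 8}
print(regressions(march, september))  # [2]
```

Any task that appears in the list is worth re-reading in full, because a drop on a high-damage task may trip a kill criterion even when the overall total barely moves.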

Why this matters more in regulated work

If you work in a regulated environment, the stakes on this are higher and the failure modes are specific. A general buyer cares whether the tool is good. A bank, a general insurer or a super fund also has to care whether it fails safely: whether it refuses to produce something it should not, whether it leaks information across a boundary, whether it fabricates a citation or a clause with total confidence. Those behaviours rarely show up on a capability leaderboard, and they are exactly what your own evaluation set should probe. Build the tests that matter to your risk, not just the ones that flatter the tool. Reading a vendor's published safety card is a useful start, but a card describes intent. Your evaluation measures behaviour, and behaviour is what you are accountable for.

So the move this week is small and concrete. Open a new Project and write down ten tasks you would actually hand an AI tool, de-identified, drawn where you can from public material. That list is the first version of your evaluation set, and it is already a better basis for a decision than every benchmark chart you have been sent this year. The leaderboard belongs to the vendors. The evaluation belongs to you, and it is the only one that knows what your work looks like.

TheAICommand. Intelligence, At Your Command.
