The leaderboard is not your job.
A new frontier model has landed almost every month this year. Claude's Fable line arrived and was then pulled from sale for foreign nationals by a government order. OpenAI made GPT-5.5 Instant the ChatGPT default. Google pushed Gemini 3.5 Flash as an agent default. Microsoft shipped seven in-house MAI models. Mistral put frontier-class weights on four GPUs. Every one of those launches came with a chart showing it beating the others on some benchmark. If you are trying to choose an AI tool for real work, that chart is close to useless, and the speed of the churn means whatever you pick on benchmark scores will be "beaten" within weeks anyway.
The way out of this is not to track the leaderboard more closely. It is to stop trusting it, and to evaluate any tool the only way that actually predicts whether it will work for you: on your own tasks, against your own standard, before you commit. That sounds like more effort than reading a scoreboard. It is, by a few hours, and those few hours are the cheapest insurance you will buy on an AI decision.
It is worth being honest about where AI buyer's remorse actually comes from. It is rarely a tool that scored badly on a benchmark. It is the tool that demoed beautifully, got bought on the strength of that demo, and then quietly underperformed on the real, repetitive work nobody tested it against. The gap between the demo and the day job is where the money is lost. An evaluation built from your own tasks is simply the act of testing the day job before you pay for it, instead of after.
Why the scoreboard cannot carry your decision
The benchmarks vendors quote are weaker evidence than their confidence suggests. A large review titled "Measuring what Matters: Construct Validity in Large Language Model Benchmarks", carried out by a team of 29 reviewers and reported in November 2025, examined 445 separate LLM benchmarks from leading AI conferences. The finding was damning in its plainness: "almost all articles have weaknesses in at least one area" that undermine the performance claims built on them. Twenty-seven per cent relied on convenience sampling, reusing existing benchmark data or human exam questions rather than anything representative. Only 16 per cent used uncertainty estimates or statistical tests to compare results at all.
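To make "uncertainty estimate" concrete, here is a minimal sketch of one common choice, the Wilson score interval for a pass rate. The function names and the example counts (12 of 15 versus 10 of 15) are made up for illustration and do not come from the review.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate of passes out of n."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share any range, so the gap may be noise."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Invented example: two models scored on the same 15 questions.
    leader = wilson_interval(12, 15)
    runner_up = wilson_interval(10, 15)
    print(leader, runner_up, intervals_overlap(leader, runner_up))
```

Overlapping intervals are a rough reading, not a formal significance test, but they show why a two-question lead on a small test set is thin ground for a headline chart.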

There is a second, more basic problem: contamination. Models train on enormous slices of the internet, which increasingly include the benchmark questions and answers themselves. When a test's questions sit in a model's training data, a high score measures memory, not capability. The review notes this directly, pointing to widely used tests whose results are undermined when their questions and answers appear in the model's pre-training data.
Even a clean benchmark would not settle your decision, because a public score is measured under conditions that are not yours. The prompt, the scaffolding, the tools the model could call, the temperature: all of it differs from how the tool will run in your environment on your data. A leaderboard tells you something about a model in a lab. It tells you almost nothing about how it will handle the specific, messy, repetitive task you actually need done. And the vendor demo is worse, because it was built to succeed.
None of this makes benchmarks worthless. They have a real job: helping researchers track whether the field is making progress over time, and giving vendors a shared yardstick to aim at. That job is legitimate. The mistake is borrowing a research instrument and treating it as procurement evidence, as if a score designed to compare models in the abstract could tell you whether a particular tool will do your particular work. It was never built to answer that question, and it does not.
The shift: evidence has to come from your work
The principle is simple once stated. The only evidence that should move your money is the tool's performance on tasks that look like your tasks, judged by your definition of good. Everything else is marketing, including the parts that look like science.
This reframes the whole buying decision. You are not asking "which model is best", a question with no stable answer in a market that turns over monthly. You are asking "which tool does my work well enough, reliably enough, safely enough, to be worth what it costs", a question only your own evaluation can answer, and one whose answer stays useful even as the leaderboard churns underneath it.
The method: build a private evaluation Project
You do not need a research team for this. You need a small, honest test built from your real work, the discipline to run it the same way for every candidate, and somewhere to keep it. That somewhere is a dedicated Project. Both ChatGPT (Projects) and Claude (Projects) let you create a contained workspace with its own standing instructions and its own uploaded files, separate from your general chat history. That container is exactly what an evaluation needs: a fixed brief, a fixed task set, and a fixed scoring rubric that you run identically against every candidate, instead of re-typing the ground rules into a fresh chat each time and quietly letting the test drift.
A standing note before any of the prompts below. Never paste real personal, claim, health or incident data into a model that is not an approved enterprise instance. The tool you are evaluating is, by definition, not yet approved. De-identify every task before it goes near it: swap real values for placeholder tokens such as [EMPLOYEENAME], [CLAIMNUMBER], [INCIDENTID], [TEAM], [ROLE], [SITE] and [DATE]. For an evaluation set, public source material (a published regulator release, a public annual report) is ideal, because it carries no confidentiality risk at all and still exercises the tool on work that looks like yours.
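If you have more than a handful of tasks to clean, the swap can be scripted. The sketch below assumes you already hold a list of the real values to remove; the names, claim number, site and date format are invented for illustration, and a script like this is a first pass, not a substitute for reading each task before it leaves your hands.

```python
import re

# Hypothetical mapping from real values to the placeholder tokens named above.
# In practice this list comes from whoever owns the source records.
REPLACEMENTS = {
    "Jane Citizen": "[EMPLOYEENAME]",
    "CLM-2024-00817": "[CLAIMNUMBER]",
    "Parramatta": "[SITE]",
}

# Illustrative pattern only: matches dates written like 03/02/2025.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def deidentify(text: str) -> str:
    """Swap known real values and simple numeric dates for placeholder tokens."""
    # Replace longer values first so a short value never clips a longer one.
    for real in sorted(REPLACEMENTS, key=len, reverse=True):
        token = REPLACEMENTS[real]
        text = re.sub(re.escape(real), lambda _m, t=token: t, text, flags=re.IGNORECASE)
    return DATE_PATTERN.sub("[DATE]", text)


def leftover_values(text: str) -> list[str]:
    """Return any known real value that still appears in the text."""
    lowered = text.lower()
    return [real for real in REPLACEMENTS if real.lower() in lowered]


if __name__ == "__main__":
    raw = "Jane Citizen lodged CLM-2024-00817 at Parramatta on 03/02/2025."
    clean = deidentify(raw)
    print(clean)
    print("Still present:", leftover_values(clean))
```

The second function is the more important one: it only catches values you listed, so anything you did not know to list still needs a human read.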
Setting up the evaluation Project
Create a new Project and paste the following block into its custom instructions or project description. It fixes the model's role for the whole evaluation so that every candidate is briefed identically.

Files to upload to the Project
Keep the evaluation self-contained. Upload these so every candidate works from the same material, and so the test does not drift between runs:
- The task set. Ten to thirty de-identified real tasks, each with its source material attached or pasted. This is the heart of the evaluation.
- The scoring rubric. Your written definition of good for each task type (accuracy, format, tone, and the right failure behaviour).
- Known-good answers. Where a task has a correct answer, record it in a separate reference file you do not give the model, so you can mark against it.
- A source pack. The published documents the tasks draw from (for the worked example below, the relevant APRA and ASIC releases), so the tool answers from the same evidence every time.
The prompt library
Four prompts run the whole evaluation: define the tasks, write the rubric, run the shortlist blind, then score against the rubric with kill criteria. Run each candidate tool through prompts three and four identically. Remember the standing note above: de-identify everything, and prefer public source material where you can.
1. Define the real tasks.
2. Write the scoring rubric.
3. Run the shortlist blind.
4. Score against the rubric with kill criteria.

The five-step logic underneath these prompts holds even if you run the whole thing by hand. Define the real tasks. Write down what good looks like before you test. Keep the eval set private and out of any tool's training path. Test the shortlist the same way, blind where you can. Then run a bounded pilot with explicit kill criteria, because a pilot without a stop condition is just a slow, unmonitored purchase that nobody ever decides to end.
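If you keep marks in a spreadsheet or a script, the scoring and the kill criterion can be written down as plainly as this. It is a sketch under stated assumptions: the criterion names, the weights and the 0 to 2 marking scale are hypothetical, and the marks themselves still come from a person reading each answer against the reference.

```python
from dataclasses import dataclass

# Hypothetical rubric weights; set your own before any candidate is run.
WEIGHTS = {"accuracy": 3, "citation": 3, "flags_uncertainty": 2, "format": 1}
MAX_MARK = 2  # assumed scale: 0 = fail, 1 = partial, 2 = pass


@dataclass
class TaskResult:
    task_id: str
    high_damage: bool          # decided per task before testing starts
    marks: dict[str, int]      # criterion name -> mark given by the human scorer
    fabricated_citation: bool  # found by checking the answer against the source


def weighted_score(results: list[TaskResult]) -> float:
    """Share of available weighted marks earned, from 0.0 to 1.0."""
    earned = sum(WEIGHTS[c] * m for r in results for c, m in r.marks.items())
    available = sum(WEIGHTS[c] * MAX_MARK for r in results for c in r.marks)
    return earned / available if available else 0.0


def kill_reasons(results: list[TaskResult]) -> list[str]:
    """Every task that trips the kill criterion, regardless of overall score."""
    return [
        f"{r.task_id}: fabricated citation on a high-damage task"
        for r in results
        if r.high_damage and r.fabricated_citation
    ]


if __name__ == "__main__":
    full = {"accuracy": 2, "citation": 2, "flags_uncertainty": 2, "format": 2}
    results = [
        TaskResult("T01", high_damage=False, marks=full, fabricated_citation=False),
        TaskResult("T02", high_damage=True, marks=full, fabricated_citation=True),
    ]
    print(f"score={weighted_score(results):.2f}", kill_reasons(results))
```

The design choice worth keeping is that the kill check is separate from the score. A candidate can top the table and still be rejected, which is the whole point of fixing the stop condition in advance.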
A worked example: a bank compliance team evaluating a regulatory-update summariser
An Australian bank's compliance team wants an assistant to help summarise regulatory updates. The leaderboard says one tool is on top; the vendor demo is slick. Instead of buying on either, they build the evaluation properly.
Setup. A team member creates a Project called "Reg-Update Summariser Eval", pastes in the standing instructions, and uploads a source pack of recent public APRA and ASIC releases. Nothing confidential goes in, because the source material is already public. The team assembles fifteen real tasks, each a version of "summarise this release and list the obligations that changed", drawn from genuine recent work but using only the published source. They write the rubric first: a summary has to be accurate, cite the right document and section, flag where it is unsure, and never invent a reference or a requirement that is not in the source. Fabricating a citation is flagged as the high-damage failure and weighted highest.
Run. They run three tools through the same fifteen tasks, recording each tool's answers under an unlabelled candidate ID so the person scoring cannot see which model produced what. The blind-run prompt is identical for every candidate.
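One way to keep the scorer blind is to assign the candidate IDs at random and hold the key separately. The sketch below assumes answers are stored as tool name, then task ID, then text; the tool names are placeholders and the function names are hypothetical.

```python
import random
import string
from typing import Optional


def assign_blind_ids(tool_names: list[str], seed: Optional[int] = None) -> dict[str, str]:
    """Map each tool name to an unlabelled ID such as 'Candidate A', in random order."""
    if len(tool_names) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 candidates")
    shuffled = list(tool_names)
    random.Random(seed).shuffle(shuffled)
    return {
        name: f"Candidate {letter}"
        for name, letter in zip(shuffled, string.ascii_uppercase)
    }


def blind_answers(
    answers: dict[str, dict[str, str]], key: dict[str, str]
) -> dict[str, dict[str, str]]:
    """Re-key answers (tool -> task -> text) by blind ID for the scorer."""
    return {key[tool]: per_task for tool, per_task in answers.items()}


if __name__ == "__main__":
    key = assign_blind_ids(["tool-x", "tool-y", "tool-z"])
    answers = {
        "tool-x": {"T01": "..."},
        "tool-y": {"T01": "..."},
        "tool-z": {"T01": "..."},
    }
    print(sorted(blind_answers(answers, key)))  # the scorer sees only these IDs
```

Whoever runs this keeps the key and is not the person who scores. The key is opened only after the marks are recorded.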
Illustrative output (Candidate A, one task). Asked to summarise a public ASIC release and list the changed obligations, the leaderboard leader returns fluent, confident prose, and on this task includes the line: "Entities must lodge the updated attestation under section 14B within 30 days." The team's reference check finds no section 14B and no 30-day attestation requirement anywhere in that release. The tool invented a clause and a deadline. A quieter candidate, on the same task, writes less elegantly but ends with: "The release does not state a specific lodgement timeframe. Not stated in the source."
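Part of a reference check like this can be given a mechanical first pass. The sketch below assumes citations are written as "section" followed by a number; the pattern and the sample source text are invented for illustration, and it misses other forms such as "sections 12 and 13" or "s 14B". It narrows what a person has to read; it does not replace the read.

```python
import re

# Illustrative pattern: matches references written like "section 14B".
SECTION_REF = re.compile(r"\bsection\s+(\d+[A-Z]*)\b", re.IGNORECASE)


def cited_sections(text: str) -> set[str]:
    """Section identifiers mentioned in a piece of text, upper-cased."""
    return {m.group(1).upper() for m in SECTION_REF.finditer(text)}


def unsupported_sections(summary: str, source: str) -> list[str]:
    """Sections cited in the summary that never appear in the source."""
    return sorted(cited_sections(summary) - cited_sections(source))


if __name__ == "__main__":
    # Made-up source text standing in for a published release.
    source = "Section 12 sets out reporting. Section 13A covers record keeping."
    summary = "Entities must lodge the updated attestation under section 14B within 30 days."
    print(unsupported_sections(summary, source))
```

An empty result does not prove the summary is right. A tool can cite a real section and still misstate what it requires, which is why the known-good answers stay in the loop.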
The human decision gate. This is the step that cannot be delegated. A senior compliance analyst reads the fabrication log, not the marketing. Candidate A scored highest on fluency and would have won a demo. But it fabricated a regulatory obligation on a high-damage task, which trips the kill criterion the team set in advance: reject any candidate that invents a citation on a high-damage task, regardless of polish. The analyst records the decision in writing, rejects Candidate A, and selects the quieter tool that flagged uncertainty correctly every time. No tool output makes this call. A named person does, against criteria fixed before anyone saw the results.
The team then pilots the chosen tool with a clear kill criterion and a fixed end date, and keeps the fifteen-task set to re-run the moment the next regulatory update lands or the vendor ships a model update. The demo would never have surfaced the fabrication. The benchmark could not have. Their evaluation did, in an afternoon.
The hype check, both ways
Two cautions, because the goal is a useful test, not a research project.
Do not over-engineer it. You do not need a formal eval harness, a statistics team or a hundred test cases to make a far better decision than the leaderboard gives you. Twenty real tasks, a clear rubric and a single Project will tell you more than any public benchmark. The point is to test on your work at all, not to do it perfectly. A rough evaluation run honestly beats a polished benchmark trusted blindly, every time.
And do not treat the result as permanent. The same product changes underneath you. A model update can shift behaviour, sometimes improving it, sometimes quietly breaking a task that used to work. The tool you evaluated in March is not guaranteed to be the tool you are running in September. Keep your evaluation Project, and re-run it when a major update lands or on a regular cadence. Your eval set is not a one-time gate. It is a permanent instrument for watching a thing that does not hold still.
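Re-running is only useful if you compare the new results with the old ones task by task. A minimal sketch, assuming each run is recorded as task ID mapped to pass or fail; the task IDs and function names are hypothetical.

```python
def regressions(baseline: dict[str, bool], rerun: dict[str, bool]) -> list[str]:
    """Tasks that passed in the baseline but fail, or are missing, in the re-run."""
    return sorted(
        task for task, passed in baseline.items()
        if passed and not rerun.get(task, False)
    )


def improvements(baseline: dict[str, bool], rerun: dict[str, bool]) -> list[str]:
    """Tasks that failed in the baseline but pass in the re-run."""
    return sorted(
        task for task, passed in baseline.items()
        if not passed and rerun.get(task, False)
    )


if __name__ == "__main__":
    march = {"T01": True, "T02": True, "T03": False}
    september = {"T01": True, "T02": False, "T03": True}
    print("Regressed:", regressions(march, september))
    print("Improved:", improvements(march, september))
```

The same overall pass rate can hide one task breaking while another improves, so the per-task list matters more than the total.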
Why this matters more in regulated work
If you work in a regulated environment, the stakes are higher and the failure modes are specific. A general buyer cares whether the tool is good. A bank, a general insurer or a super fund also has to care whether it fails safely: whether it refuses to produce something it should not, whether it leaks information across a boundary, whether it fabricates a citation or a clause with total confidence. Those behaviours rarely show up on a capability leaderboard, and they are exactly what your own evaluation set should probe. Build the tests that matter to your risk, not just the ones that flatter the tool. Reading a vendor's published safety card is a useful start, but a card describes intent. Your evaluation measures behaviour, and behaviour is what you are accountable for.
So the move this week is small and concrete. Open a new Project and write down ten tasks you would actually hand an AI tool, de-identified, drawn where you can from public material. That list is the first version of your evaluation set, and it is already a better basis for a decision than every benchmark chart you have been sent this year. The leaderboard belongs to the vendors. The evaluation belongs to you, and it is the only one that knows what your work looks like.
TheAICommand. Intelligence, At Your Command.



