Ground the Model, Do Not Trust Its Memory, practitioner guidance from TheAICommand
← AI News
Capability

Ground the Model, Do Not Trust Its Memory

A weak model with the right tool beat a strong model working from memory. Anthropic's biology-agent benchmark put numbers on it: accuracy ran as low as 16.9% from recall and cleared 99.7% once grounded in the authoritative source. The lesson holds for every regulated workflow that turns on a rule, rate or standard. Ground the model, do not trust its memory.

·TheAICommand

Quick answer

Grounding an AI agent in the authoritative source beats raw model capability. In Anthropic's biology benchmark, accuracy ran as low as 16.9% when models answered from memory and reached 99.7% once given a deterministic tool over the real database. For regulated work, paste in the rule, rate or standard rather than trusting recall.

A weak model with the right tool beat a strong model working from memory.

That single sentence is the most useful thing to carry out of Anthropic putting AI for science at the centre of its week. On 30 June 2026 the company ran a briefing built around its science work. The finding worth taking back to ordinary office work was not a cure for anything. It was a measurement of how unreliable a capable model is when you let it answer from memory, and how completely that changes when you ground it in the real source.

What actually happened

In research published on 8 June, Paving the way for AI agents in biology, Anthropic and a group of collaborators tested something narrow and concrete: how reliably an AI agent can pull viral sequence data out of public scientific databases. They built a benchmark for it, VirBench, made of 120 realistic queries across 40 pathogens, each with a manually verified correct answer. The matching paper, Deterministic access to global viral sequence data enables robust agentic scientific discovery, went up on arXiv on 4 June.

The first results were poor, and worse than poor, they were erratic. Across a set of frontier models, mean accuracy ran from 16.9% at the bottom to 91.3% at the top. The same model, asked the same question more than once, often gave a different answer. For a task with a single correct count, that is not a near miss. It is a system you cannot rely on.

Then the researchers gave the agents a deterministic retrieval tool, gget virus, built with researchers at the United States National Center for Biotechnology Information. A deterministic tool does the same thing every time: ask it the question, it queries the authoritative database and returns the real answer, with no guessing in between. With that tool in hand, accuracy rose above 90% for every agent tested and peaked at 99.7%. The run-to-run wobble largely disappeared.

A single large accuracy figure inside a soft gold halo over deep navy, with two small comparison ticks below it
From recall to the real record: accuracy ran as low as 16.9% from memory and cleared 99.7% once grounded in the source.

Anthropic's own read is the part to underline. The bottleneck was not the models' reasoning. It was the absence of a reliable, deterministic way to query the data. Adding that retrieval layer, in their words, "made model choice much less important". The cheaper model with the tool beat the better model without it.

What grounding actually means

Strip out the virology and you are left with a clean statement about how these systems behave. A large language model is very good at language and pattern, and genuinely unreliable as a store of specific facts. When you ask it to recall a precise figure, a clause, a count or a rule from memory, you are asking the thing it is worst at, and you get an answer that sounds confident whether it is right or not. The 16.9% floor is what that looks like when you measure it.

Grounding is the fix, and it is not the same as a better prompt or a bigger model. It means wiring the model to the authoritative source and a reliable path to query it, so the answer comes from the record rather than from the model's impression of the record. This is adjacent to the context-engineering point we have made before, that what you put in front of the model matters more than how you phrase the request, but it goes a step further. Context engineering curates what sits in the window. Grounding makes the model fetch the truth from outside itself, deterministically, every time.

Why this lands on regulated Australian work

Almost every high-stakes use of AI in compliance, workers compensation, human resources and audit turns on a precise instrument. A modern award rate. A section of the SRC Act. A prudential standard. A clause in your own policy. These are the facts a model is worst at recalling, and the facts you can least afford it to get wrong.

A side by side split, the left half scattered uncertain marks for a model answering from memory, the right half a single anchored mark for a model grounded in the source
Answering from memory versus answering from the record. The difference is reliability, not intelligence.

So the practical move is the one VirBench demonstrates. Do not ask the model what the award rate is. Give it the award and ask it to read the figure out. Do not ask it to recall a section of the SRC Act. Paste the section in. Do not let it summarise a standard from memory when you can hand it the standard. We made the narrower version of this case in our piece on how AI can read a modern award but the award, not the model, sets the pay. VirBench is the general rule underneath it. The model is the reader, the instrument is the authority, and the reliability comes from keeping those two roles apart.

There is a governance line here too. Once you accept that grounding is what makes an AI answer trustworthy, "what is this grounded in" becomes a question you can audit. For any AI use that touches a rule or a number, you should be able to name the authoritative source it draws from and show how you know the model actually used it. An answer with no traceable source is not a compliant answer, however polished it reads.

Ground one workflow this Monday

You do not need a data engineering project to act on this. Pick one workflow and change how it gets its facts.

Five labelled nodes left to right joined as a single gold path, from naming the source to tracing the answer
Grounding one workflow, step by step. Name the source, paste the record, ask from it, test for consistency, trace before you rely.
  1. Open the one AI workflow you run most often that depends on a rule, a rate, a section or a standard. Name the single authoritative source it should be reading from.
  2. Find that source in full: the award, the section text, the standard, the policy. Have the exact words in front of you, not a summary.
  3. Start a fresh chat in ChatGPT, Claude or equivalent. Paste the authoritative text in, or attach the document, so the model is reading the record and not its memory.
  4. Ask your question with an instruction to answer only from the pasted source and to quote the words it relies on. The first prompt below does exactly that.
  5. Run the same question three or four times. If the answers drift, you have a grounding problem, and the fix is a tighter source, not a cleverer prompt.
  6. Before you rely on the output, trace every figure and claim back to the source text yourself. If a claim has no line in the source, treat it as a draft, not a finding.

Two prompts to paste

The first prompt forces the model to read from the source you give it. The second checks an answer before you rely on it. Both are tool agnostic and use square bracket placeholders you replace.

Prompt
You are assisting a [ROLE] in a regulated Australian workplace. I am pasting the
authoritative source text below. Answer only from the text I provide. Do not use
prior knowledge or your memory of [TOPIC]. If the answer is not in the pasted text,
say "Not stated in the source provided" and stop.

Authoritative source:
[PASTE THE EXACT CLAUSE, RATE, SECTION OR STANDARD]

Question:
[YOUR SPECIFIC QUESTION]

For your answer, quote the exact words from the source that support it, then give
the plain English reading. Do not summarise beyond what the quoted words say.
Prompt
Review the AI answer below before I rely on it. For each factual claim it makes
about a rule, rate, section or standard, tell me:
1. Which exact words in the source support the claim.
2. Any claim that has no support in the source, flagged as unverified.
3. Any figure or date that does not match the source.

Authoritative source:
[PASTE SOURCE]

AI answer to review:
[PASTE THE AI OUTPUT]

Return a short list. Do not add new information. If a claim cannot be traced to the
source, mark it "no source".

The five question grounding audit

Add these five questions to how you sign off any AI answer that touches a rule or a number. If you cannot answer them, the output is a draft.

  • What is the single authoritative source this answer is grounded in, and can you name it?
  • Did the model read that source in this session, or is it answering from memory?
  • Can every figure, date and claim be traced to specific words in the source?
  • Does the answer hold when you run the same question three or four times?
  • If the source is silent, does the model say so, or does it fill the gap with a confident guess?

A worked example

Picture an HR adviser checking the Saturday overtime rate for a casual classified at [CLASSIFICATION] under [AWARDNAME], before it goes into a letter for [EMPLOYEENAME].

The wrong move is to ask the assistant, from a cold start, what the Saturday overtime rate is for that classification. That is a recall question, and recall is exactly where the 16.9% floor lives.

The grounded move: open the current award, copy the overtime and penalty clauses for that classification, and paste them into a fresh chat with the first prompt above.

What came back: the model quoted the exact penalty clause, gave the multiplier as a plain English reading, and, for one edge case the adviser had not pasted, returned "Not stated in the source provided" rather than inventing a figure.

What the human verified and decided: the adviser checked the quoted words against the clause on screen, confirmed the classification matched, ran the question a second time to confirm the answer held, and only then used the figure. The missing edge case was resolved by pasting the relevant clause and asking again. The adviser, not the model, signed off the rate.

Hype check

This is not AI solving science, and the briefing was a positioning moment as much as a research one. The honest version is smaller and more useful than the headline. AI agents are only as reliable as the data layer beneath them, and a deterministic source closes a gap that raw capability cannot. The percentages belong to a biology benchmark and do not transfer to your domain. The pattern does. Treat the numbers as a vivid illustration, not a promise about your own workflow.

The week's loud story was which lab shipped what. The quieter, more durable one is this. A model answering from memory is a guess, and a model answering from the source is a tool. Build for the second.

TheAICommand. Intelligence, At Your Command.

Frequently asked questions

What does grounding an AI model actually mean?
Grounding means wiring the model to the authoritative source and a reliable path to query it, so the answer comes from the record rather than from the model's impression of the record. It is not a better prompt or a bigger model. In practice it means pasting in the exact rule, rate, section or standard, or connecting the model to that source, rather than asking it to recall the fact from memory.
What were the VirBench numbers, and do they apply to my work?
On Anthropic's VirBench benchmark of 120 factual retrieval queries, frontier models answering from memory ran a mean accuracy of 16.9% to 91.3%, with answers drifting between runs. Once given a deterministic tool over the real database, every agent cleared 90% and the top reached 99.7%. The exact percentages belong to a biology task and do not transfer to your domain. The pattern does.
How do I ground a model in a rule or award rather than trusting its memory?
Open the authoritative source in full, start a fresh chat in ChatGPT, Claude or equivalent, and paste the exact text in or attach the document. Ask your question with an instruction to answer only from the pasted source and to quote the words it relies on. The model becomes the reader and the instrument stays the authority.
How can I tell if I have a grounding problem?
Run the same question three or four times. If the answers drift, you have a grounding problem, not a prompt problem, and the fix is a tighter source rather than cleverer wording. A second signal is an answer you cannot trace back to specific words in a named source. If nothing anchors the output, treat it as a draft, not a finding.
Does grounding replace human sign-off?
No. Grounding raises the reliability of the input, but a person still checks the quoted words against the source, confirms the context matches, and makes the decision. For regulated work the rule holds that AI assists and the accountable person decides. Grounding makes that sign-off faster and more defensible because there is a traceable source behind the answer.

Tags

groundingai-agentsreliabilitydata-governanceanthropic
← Back to AI News