A weak model with the right tool beat a strong model working from memory.
That single sentence is the most useful thing to carry out of Anthropic putting AI for science at the centre of its week. On 30 June 2026 the company ran a briefing built around its science work. The finding worth taking back to ordinary office work was not a cure for anything. It was a measurement of how unreliable a capable model is when you let it answer from memory, and how completely that changes when you ground it in the real source.
What actually happened
In research published on 8 June, Paving the way for AI agents in biology, Anthropic and a group of collaborators tested something narrow and concrete: how reliably an AI agent can pull viral sequence data out of public scientific databases. They built a benchmark for it, VirBench, made of 120 realistic queries across 40 pathogens, each with a manually verified correct answer. The matching paper, Deterministic access to global viral sequence data enables robust agentic scientific discovery, went up on arXiv on 4 June.
The first results were poor, and worse than poor, they were erratic. Across a set of frontier models, mean accuracy ran from 16.9% at the bottom to 91.3% at the top. The same model, asked the same question more than once, often gave a different answer. For a task with a single correct count, that is not a near miss. It is a system you cannot rely on.
Then the researchers gave the agents a deterministic retrieval tool, gget virus, built with researchers at the United States National Center for Biotechnology Information. A deterministic tool does the same thing every time: ask it the question, it queries the authoritative database and returns the real answer, with no guessing in between. With that tool in hand, accuracy rose above 90% for every agent tested and peaked at 99.7%. The run-to-run wobble largely disappeared.

Anthropic's own read is the part to underline. The bottleneck was not the models' reasoning. It was the absence of a reliable, deterministic way to query the data. Adding that retrieval layer, in their words, "made model choice much less important". The cheaper model with the tool beat the better model without it.
What grounding actually means
Strip out the virology and you are left with a clean statement about how these systems behave. A large language model is very good at language and pattern, and genuinely unreliable as a store of specific facts. When you ask it to recall a precise figure, a clause, a count or a rule from memory, you are asking the thing it is worst at, and you get an answer that sounds confident whether it is right or not. The 16.9% floor is what that looks like when you measure it.
Grounding is the fix, and it is not the same as a better prompt or a bigger model. It means wiring the model to the authoritative source and a reliable path to query it, so the answer comes from the record rather than from the model's impression of the record. This is adjacent to the context-engineering point we have made before, that what you put in front of the model matters more than how you phrase the request, but it goes a step further. Context engineering curates what sits in the window. Grounding makes the model fetch the truth from outside itself, deterministically, every time.
Why this lands on regulated Australian work
Almost every high-stakes use of AI in compliance, workers compensation, human resources and audit turns on a precise instrument. A modern award rate. A section of the SRC Act. A prudential standard. A clause in your own policy. These are the facts a model is worst at recalling, and the facts you can least afford it to get wrong.

So the practical move is the one VirBench demonstrates. Do not ask the model what the award rate is. Give it the award and ask it to read the figure out. Do not ask it to recall a section of the SRC Act. Paste the section in. Do not let it summarise a standard from memory when you can hand it the standard. We made the narrower version of this case in our piece on how AI can read a modern award but the award, not the model, sets the pay. VirBench is the general rule underneath it. The model is the reader, the instrument is the authority, and the reliability comes from keeping those two roles apart.
There is a governance line here too. Once you accept that grounding is what makes an AI answer trustworthy, "what is this grounded in" becomes a question you can audit. For any AI use that touches a rule or a number, you should be able to name the authoritative source it draws from and show how you know the model actually used it. An answer with no traceable source is not a compliant answer, however polished it reads.
Ground one workflow this Monday
You do not need a data engineering project to act on this. Pick one workflow and change how it gets its facts.

- Open the one AI workflow you run most often that depends on a rule, a rate, a section or a standard. Name the single authoritative source it should be reading from.
- Find that source in full: the award, the section text, the standard, the policy. Have the exact words in front of you, not a summary.
- Start a fresh chat in ChatGPT, Claude or equivalent. Paste the authoritative text in, or attach the document, so the model is reading the record and not its memory.
- Ask your question with an instruction to answer only from the pasted source and to quote the words it relies on. The first prompt below does exactly that.
- Run the same question three or four times. If the answers drift, you have a grounding problem, and the fix is a tighter source, not a cleverer prompt.
- Before you rely on the output, trace every figure and claim back to the source text yourself. If a claim has no line in the source, treat it as a draft, not a finding.
Two prompts to paste
The first prompt forces the model to read from the source you give it. The second checks an answer before you rely on it. Both are tool agnostic and use square bracket placeholders you replace.
The five question grounding audit
Add these five questions to how you sign off any AI answer that touches a rule or a number. If you cannot answer them, the output is a draft.
- What is the single authoritative source this answer is grounded in, and can you name it?
- Did the model read that source in this session, or is it answering from memory?
- Can every figure, date and claim be traced to specific words in the source?
- Does the answer hold when you run the same question three or four times?
- If the source is silent, does the model say so, or does it fill the gap with a confident guess?
A worked example
Picture an HR adviser checking the Saturday overtime rate for a casual classified at [CLASSIFICATION] under [AWARDNAME], before it goes into a letter for [EMPLOYEENAME].
The wrong move is to ask the assistant, from a cold start, what the Saturday overtime rate is for that classification. That is a recall question, and recall is exactly where the 16.9% floor lives.
The grounded move: open the current award, copy the overtime and penalty clauses for that classification, and paste them into a fresh chat with the first prompt above.
What came back: the model quoted the exact penalty clause, gave the multiplier as a plain English reading, and, for one edge case the adviser had not pasted, returned "Not stated in the source provided" rather than inventing a figure.
What the human verified and decided: the adviser checked the quoted words against the clause on screen, confirmed the classification matched, ran the question a second time to confirm the answer held, and only then used the figure. The missing edge case was resolved by pasting the relevant clause and asking again. The adviser, not the model, signed off the rate.
Hype check
This is not AI solving science, and the briefing was a positioning moment as much as a research one. The honest version is smaller and more useful than the headline. AI agents are only as reliable as the data layer beneath them, and a deterministic source closes a gap that raw capability cannot. The percentages belong to a biology benchmark and do not transfer to your domain. The pattern does. Treat the numbers as a vivid illustration, not a promise about your own workflow.
The week's loud story was which lab shipped what. The quieter, more durable one is this. A model answering from memory is a guess, and a model answering from the source is a tool. Build for the second.
TheAICommand. Intelligence, At Your Command.



