Two-million-token multimodal context is here. Most teams should still build RAG.
GPT-5, Gemini 2.5 Pro and Claude Opus 4.7 now all support context windows around 2 million tokens with image and video understanding inside the context (OpenAI, Google AI, Anthropic, accessed April 2026). The marketing positions this as the end of retrieval-augmented generation. The production data does not.
This is the short version.
The headline finding
Long multimodal context is genuinely valuable for a small number of specific workflows and unhelpful for the majority. The default assumption that "longer context replaces RAG" is wrong for most enterprise use cases. The decision is workload-dependent.
Context
Long context windows have been steadily growing for two years. The shift in 2026 is twofold. First, the windows are now big enough to fit genuinely large documents (a 5,000-page contract, an hour of video, an entire 800,000-line codebase). Second, the multimodal capability is real: the model can reason about images and video frames inside the context, not just text descriptions of them.
The pricing is the catch. A 1.8 million token prompt on GPT-5 enterprise costs around USD 14 per request before output. On Gemini 2.5 Pro, around USD 2.20. On Claude Opus 4.7, around USD 27. The headline rate is one thing. The cost per useful answer is the only number that matters.
Why it matters
After six months of production data from the teams we have seen, the decision rule is now reasonably clear.
Long context pays off when you have a small number of large, complex documents that need holistic reasoning, where the cost per query is not the dominant constraint, and where the question requires the model to find subtle cross-references that retrieval would miss.
Long context does not pay off when you have many users asking many questions against a large but structured corpus, where freshness matters (the corpus changes daily), or where cost per query is the dominant constraint.
Three workflows where it pays off:
- Complex contract review. A single 1,500-page contract with hundreds of cross-references. Loading the full document beats retrieval because the model can spot inconsistencies that no chunking strategy will surface.
- Video evidence review. An hour of footage where the question is "what happened, why, and what was inconsistent". Multimodal context puts the visual and audio reasoning in one pass.
- Large codebase reasoning. An entire repository in context lets the model answer architectural questions that fragmentary retrieval misses.
Three workflows where well-built RAG still wins:
- Customer support knowledge bases. Thousands of articles, hundreds of users, freshness sensitive. RAG is faster, cheaper and more current.
- High-volume document Q&A. Same query pattern, different document each time. Long context burns money. RAG with a good index does not.
- Cost-sensitive consumer products. USD 14 per query is fine for an internal compliance tool. It is not fine for a feature serving 100,000 monthly users.
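The decision rule above can be sketched as a small classifier. This is a minimal illustration, not a definitive policy: the `Workload` fields and the `recommend` function are hypothetical names, and the choice to default to RAG when neither set of conditions clearly holds is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    few_large_documents: bool     # small number of large, complex documents
    needs_cross_references: bool  # answers depend on subtle cross-references
    cost_is_dominant: bool        # cost per query is the main constraint
    freshness_sensitive: bool     # corpus changes daily
    high_volume: bool             # many users asking many questions


def recommend(w: Workload) -> str:
    """Return 'long context' or 'RAG' following the rule described in the text."""
    # Any of these conditions points to RAG regardless of document shape.
    if w.cost_is_dominant or w.freshness_sensitive or w.high_volume:
        return "RAG"
    if w.few_large_documents and w.needs_cross_references:
        return "long context"
    # Assumption: default to RAG when neither set of conditions clearly holds.
    return "RAG"


contract_review = Workload(True, True, False, False, False)
support_kb = Workload(False, False, True, True, True)
print(recommend(contract_review))  # long context
print(recommend(support_kb))       # RAG
```

Booleans keep the sketch honest: real thresholds (how many documents, how much volume) depend on your own cost and error tolerances.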

The cost per useful answer math
Per-token rate cards mislead. The number that matters is dollars per accepted answer. A worked example: a compliance team asks 200 questions per week against a 1,200-page master agreement. Long-context route: USD 12 per query to load the full document and ~95 percent acceptance after human review gives USD 12.63 per useful answer (12 / 0.95) and USD 2,400 per week in query spend. RAG route on the same corpus with a tuned chunker and reranker: ~USD 0.18 per query and ~88 percent acceptance gives USD 0.20 per useful answer and USD 36 per week. The long-context route gets you marginally better recall on cross-references: 190 accepted answers per week instead of 176. The question is whether the extra USD 2,364 per week is worth the 14 additional questions answered correctly, at roughly USD 169 each. Almost always no.
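The arithmetic in the worked example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: weekly spend is taken as queries times per-query cost, accepted answers as queries times acceptance rate, and the `Route` class is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    cost_per_query: float   # USD per query, before human review
    acceptance_rate: float  # fraction of answers accepted after review

    def cost_per_useful_answer(self) -> float:
        return self.cost_per_query / self.acceptance_rate

    def weekly_spend(self, queries: int) -> float:
        return self.cost_per_query * queries

    def weekly_accepted(self, queries: int) -> float:
        return self.acceptance_rate * queries


QUERIES_PER_WEEK = 200
long_context = Route("long context", 12.00, 0.95)
rag = Route("RAG", 0.18, 0.88)

for route in (long_context, rag):
    print(
        f"{route.name}: USD {route.cost_per_useful_answer():.2f} per useful answer, "
        f"USD {route.weekly_spend(QUERIES_PER_WEEK):,.0f} per week"
    )

# Marginal view: what each extra accepted answer costs on the long-context route.
extra_spend = long_context.weekly_spend(QUERIES_PER_WEEK) - rag.weekly_spend(QUERIES_PER_WEEK)
extra_accepted = long_context.weekly_accepted(QUERIES_PER_WEEK) - rag.weekly_accepted(QUERIES_PER_WEEK)
print(f"Marginal cost per extra accepted answer: USD {extra_spend / extra_accepted:,.0f}")
```

The marginal figure is the one to compare against the cost of a missed answer in your own workload.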
The exception is when the cross-reference catch matters disproportionately. Missing one indemnity caveat in a buy-side M&A review is a different magnitude of error than missing one FAQ pointer. Score the workload by error cost, not by query volume.
Caching changes the picture
Both Anthropic's and OpenAI's prompt caching cut the cost of repeat queries against the same long document by 75 to 90 percent. If your workload is "ask 50 questions of one 1,500-page document in a single session", cached long context lands close to RAG on cost while keeping the holistic reasoning advantage. If the workload is "one query each against 1,000 different documents", caching does not help and RAG wins clearly. The shape of the access pattern, not the token budget, decides the architecture.
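A rough model shows how the access pattern drives the cost. This is a simplified sketch, not vendor pricing logic: it assumes the first query pays the full prompt cost and every later query in the session pays the discounted rate, and it ignores cache-write surcharges, cache expiry and output tokens. The USD 12 prompt cost is borrowed from the worked example and the 90 percent discount is the top of the range quoted above; `long_context_session_cost` is a hypothetical helper.

```python
def long_context_session_cost(
    queries: int,
    full_prompt_cost: float,
    cache_discount: float,
) -> float:
    """Total USD for `queries` questions against one cached document.

    Assumption: query 1 pays full price; queries 2..n pay
    full_prompt_cost * (1 - cache_discount).
    """
    if queries <= 0:
        return 0.0
    cached_cost = full_prompt_cost * (1.0 - cache_discount)
    return full_prompt_cost + (queries - 1) * cached_cost


# Pattern A: 50 questions against one long document in a single session.
one_doc = long_context_session_cost(50, full_prompt_cost=12.0, cache_discount=0.90)
# Pattern B: one question each against 50 different documents (no cache hits).
many_docs = 50 * long_context_session_cost(1, full_prompt_cost=12.0, cache_discount=0.90)

print(f"50 queries, one document: USD {one_doc:.2f} (USD {one_doc / 50:.2f} per query)")
print(f"50 queries, 50 documents: USD {many_docs:.2f} (USD {many_docs / 50:.2f} per query)")
```

Under these assumed numbers the per-query cost falls from USD 12.00 to about USD 1.42 when every question hits the same cached document, and stays at USD 12.00 when every question needs a new one.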
The quiet failure modes at length
Three failure modes surface only above the 500K-token mark, and the model cards do not advertise them.
- Attention degradation. Multiple independent benchmarks now show that recall on facts buried in the middle of very long contexts drops 20 to 30 percent versus facts in the first or last 100K tokens. The "needle in haystack" tests vendors publish are easier than realistic multi-fact reasoning at length.
- Latency. A 1.8M-token request on GPT-5 takes 90 to 180 seconds to first token. That is fine for a batch compliance review. It kills any conversational UX.
- Hallucination at length. The longer the context, the more confidently the model invents content that looks consistent with what it has read. Human review effort scales with context length, not with response length, because the assertion surface to verify grows.
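The attention degradation described above can be checked on your own documents with a small position-sweep harness. This is a minimal sketch: `ask_model` is a hypothetical callable you would wire to your real model client, the probe fact is invented for illustration, and `stub_model` is a stand-in that deliberately ignores the middle of the context so the harness can be run without any API. A single-fact probe like this is the easy case; a realistic audit would use multi-fact questions.

```python
from typing import Callable


def build_context(filler: list[str], fact: str, depth: float) -> str:
    """Insert `fact` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    index = round(depth * len(filler))
    return "\n".join(filler[:index] + [fact] + filler[index:])


def recall_by_depth(
    ask_model: Callable[[str, str], str],
    filler: list[str],
    probes: list[tuple[str, str, str]],  # (fact sentence, question, expected answer)
    depths: list[float],
) -> dict[float, float]:
    """Fraction of probes answered correctly at each insertion depth."""
    results: dict[float, float] = {}
    for depth in depths:
        hits = 0
        for fact, question, expected in probes:
            context = build_context(filler, fact, depth)
            answer = ask_model(context, question)
            # Simple substring match; a real audit may need a stricter grader.
            hits += expected.lower() in answer.lower()
        results[depth] = hits / len(probes)
    return results


def stub_model(context: str, question: str) -> str:
    """Stand-in for a real model call: only 'sees' the first and last 20% of lines."""
    lines = context.split("\n")
    edge = max(1, len(lines) // 5)
    return " ".join(lines[:edge] + lines[-edge:])


filler = [f"Clause {i}: standard boilerplate." for i in range(100)]
probes = [("The indemnity cap is USD 4 million.", "What is the indemnity cap?", "USD 4 million")]
report = recall_by_depth(stub_model, filler, probes, depths=[0.0, 0.5, 1.0])
print(report)
```

Replace `stub_model` with a function that calls your provider, use filler drawn from your real corpus, and run the sweep at the context length you actually deploy.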
Bottom line
- Treat long multimodal context as a complement to RAG, not a replacement. The two answer different questions.
- The cost per useful answer is the only sensible benchmark. Per-token pricing is misleading.
- Cache aggressively when access patterns repeat. Caching narrows the long-context cost gap to RAG on the right workloads.
- For internal tooling on small numbers of complex documents, long context is now genuinely better than building a retrieval pipeline. The engineering effort saved is real.
- For anything customer-facing or high-volume, RAG remains the right architecture. The long-context option is a fallback for hard queries, not a default path.
- Audit attention behaviour at the length you actually use. The marketing benchmarks understate the recall drop in the middle of very long contexts.
- Vendors selling "you don't need RAG anymore" are selling you a mismatch between the demo workload and your actual workload.
The right question is not "should we use long context or RAG". It is "which workloads sit on which architecture, and what is the cost per useful answer for each, with caching modelled in".
TheAICommand. Intelligence, At Your Command.



