2M-Token Multimodal Contexts: Where They Actually Pay Off

Two-million-token multimodal context is here. Most teams should still build RAG.

GPT-5, Gemini 2.5 Pro and Claude Opus 4.7 now all support context windows around 2 million tokens with image and video understanding inside the context (OpenAI, Google AI, Anthropic, accessed April 2026). The marketing positions this as the end of retrieval-augmented generation. The production data does not.

This is the short version.

The headline finding

Long multimodal context is genuinely valuable for a small number of specific workflows and unhelpful for the majority. The default assumption that "longer context replaces RAG" is wrong for most enterprise use cases. The decision is workload-dependent.

Context

Context windows have grown steadily for two years. The shift in 2026 is twofold. First, the windows are now big enough to fit genuinely large documents (a 5,000-page contract, an hour of video, an entire 800,000-line codebase). Second, the multimodal capability is real: the model can reason about images and video frames inside the context, not just text descriptions of them.

The pricing is the catch. A 1.8-million-token prompt on GPT-5 enterprise costs around USD 14 per request before output. On Gemini 2.5 Pro, around USD 2.20. On Claude Opus 4.7, around USD 27. The headline rate is one thing. The cost per useful answer is the only number that matters.
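
To make "cost per useful answer" concrete, here is a minimal sketch. The per-request prices are the ones quoted above. The useful_rate parameter (the fraction of requests that return a usable answer) and the 70% figure in the example are assumptions for illustration, not measured or vendor numbers.

```python
# Approximate input cost (USD) of one ~1.8M-token prompt, as quoted in the text.
PROMPT_COST_USD = {
    "GPT-5 enterprise": 14.00,
    "Gemini 2.5 Pro": 2.20,
    "Claude Opus 4.7": 27.00,
}


def cost_per_useful_answer(cost_per_request: float, useful_rate: float) -> float:
    """Expected spend to obtain one useful answer.

    useful_rate is the fraction of requests whose answer is actually usable.
    It is a hypothetical, team-measured number, not a vendor figure.
    """
    if not 0 < useful_rate <= 1:
        raise ValueError("useful_rate must be in (0, 1]")
    return cost_per_request / useful_rate


if __name__ == "__main__":
    # Assumed for illustration: 70% of answers are usable.
    for model, cost in PROMPT_COST_USD.items():
        per_useful = cost_per_useful_answer(cost, 0.7)
        print(f"{model}: USD {per_useful:.2f} per useful answer")
```

The point of dividing by the useful rate is that a cheaper request that often misses can end up costing more per useful answer than a pricier one that rarely does.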

Why it matters

After six months of production data from the teams we have observed, the decision rule is now reasonably clear.

Long context pays off when you have a small number of large, complex documents that need holistic reasoning, where the cost per query is not the dominant constraint, and where the question requires the model to find subtle cross-references that retrieval would miss.

Long context does not pay off when many users ask many questions against a large but structured corpus, when freshness matters (the corpus changes daily), or when cost per query is the dominant constraint.
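
The two paragraphs above can be read as a simple decision rule. The sketch below is one possible encoding, not a definitive one: the Workload fields and the choose_architecture function are hypothetical names, and defaulting to RAG when neither description fits is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload, using the criteria in the text."""
    few_large_documents: bool       # small number of large, complex documents
    needs_holistic_reasoning: bool  # subtle cross-references retrieval would miss
    cost_is_dominant: bool          # cost per query is the dominant constraint
    corpus_changes_daily: bool      # freshness matters
    high_query_volume: bool         # many users asking many questions


def choose_architecture(w: Workload) -> str:
    """Return "long_context" or "rag" for a workload."""
    # Any one of the conditions under which long context does not pay off
    # is enough to pick RAG.
    if w.cost_is_dominant or w.corpus_changes_daily or w.high_query_volume:
        return "rag"
    # Long context needs both conditions: few large documents and holistic reasoning.
    if w.few_large_documents and w.needs_holistic_reasoning:
        return "long_context"
    # Assumption: fall back to RAG when the workload fits neither description.
    return "rag"


if __name__ == "__main__":
    contract_review = Workload(True, True, False, False, False)
    support_kb = Workload(False, False, True, True, True)
    print(choose_architecture(contract_review))  # long_context
    print(choose_architecture(support_kb))       # rag
```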

Three workflows where it pays off:

Complex contract review. A single 1,500-page contract with hundreds of cross-references. Loading the full document beats retrieval because the model can spot inconsistencies that no chunking strategy will surface.
Video evidence review. An hour of footage where the question is "what happened, why, and what was inconsistent". Multimodal context handles the visual and audio reasoning in a single pass.
Large codebase reasoning. An entire repository in context lets the model answer architectural questions that fragmentary retrieval misses.

Three workflows where well-built RAG still wins:

Customer support knowledge bases. Thousands of articles, hundreds of users, freshness sensitive. RAG is faster, cheaper and more current.
High-volume document Q&A. Same query pattern, different document each time. Long context burns money. RAG with a good index does not.
Cost-sensitive consumer products. USD 14 per query is fine for an internal compliance tool. It is not fine for a feature serving 100,000 monthly users.
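
The consumer-product case is easiest to see as arithmetic. This sketch uses the USD 14 per-query figure and the 100,000 monthly users from the text. The queries-per-user values and the 200-query internal tool are assumptions for illustration.

```python
def monthly_spend(users: int, queries_per_user: float, cost_per_query: float) -> float:
    """Monthly input-token spend in USD for a feature, before output costs."""
    return users * queries_per_user * cost_per_query


if __name__ == "__main__":
    # Assumption: each of the 100,000 monthly users sends just one query.
    consumer = monthly_spend(100_000, 1, 14.0)
    print(f"Consumer feature: USD {consumer:,.0f}")  # Consumer feature: USD 1,400,000

    # Hypothetical internal compliance tool: 200 queries per month in total.
    internal = monthly_spend(1, 200, 14.0)
    print(f"Internal tool: USD {internal:,.0f}")  # Internal tool: USD 2,800
```

Even at one query per user per month, the consumer feature is three orders of magnitude more expensive than the internal tool at the same per-query price.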

Bottom line

Treat long multimodal context as a complement to RAG, not a replacement. The two answer different questions.
The cost per useful answer is the only sensible benchmark. Per-token pricing is misleading.
For internal tooling on small numbers of complex documents, long context is now genuinely better than building a retrieval pipeline. The engineering effort saved is real.
For anything customer-facing or high-volume, RAG remains the right architecture. The long-context option is a fallback for hard queries, not a default path.
Vendors selling "you don't need RAG anymore" are selling you a mismatch between the demo workload and your actual workload.
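
The "fallback for hard queries" pattern for customer-facing workloads can be sketched as a small router. Everything here is hypothetical: the Answer type, the confidence score, the 0.6 threshold, and the two callables stand in for whatever RAG pipeline and long-context call a team actually has.

```python
from typing import Callable, NamedTuple, Tuple


class Answer(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0, however the pipeline scores its own answer


def answer_with_fallback(
    question: str,
    rag: Callable[[str], Answer],
    long_context: Callable[[str], Answer],
    min_confidence: float = 0.6,  # assumed threshold; tune per workload
) -> Tuple[Answer, str]:
    """Try RAG first; escalate to long context only when RAG is not confident."""
    first = rag(question)
    if first.confidence >= min_confidence:
        return first, "rag"
    # Hard query: pay the long-context price for this request only.
    return long_context(question), "long_context"


if __name__ == "__main__":
    # Stub backends for illustration; real ones would call a retriever and a model.
    def stub_rag(q: str) -> Answer:
        return Answer("short answer", 0.9 if "refund" in q else 0.2)

    def stub_long_context(q: str) -> Answer:
        return Answer("detailed answer", 0.8)

    print(answer_with_fallback("What is the refund policy?", stub_rag, stub_long_context))
    print(answer_with_fallback("Which clauses conflict?", stub_rag, stub_long_context))
```

Routing this way keeps the default path cheap and reserves the expensive call for the fraction of queries that retrieval cannot answer well.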

The right question is not "should we use long context or RAG". It is "which workloads sit on which architecture, and what is the cost per useful answer for each".

TheAICommand. Intelligence, At Your Command.