Context window

The context window is the total token budget a model can process in one inference call. Every token you put in (the system prompt, conversation history, retrieved documents, tool outputs) and every token the model produces in reply draws from this shared pool. Misunderstand that and you will design products that hit invisible walls in production. The model does not “remember” your conversation across sessions, because it has no memory outside the context window. Every call starts fresh. What looks like memory is just previous messages being re-fed each turn.
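The shared-pool arithmetic can be sketched in a few lines. This is an illustration only: the function name `remaining_budget`, the 200k window, and the per-component token counts are all assumptions, and real counts would come from the provider's tokenizer.

```python
def remaining_budget(window: int, components: dict[str, int]) -> int:
    """Tokens left for the reply once every input component is counted."""
    used = sum(components.values())
    if used > window:
        raise ValueError(f"input uses {used} tokens, window is {window}")
    return window - used


# Hypothetical token counts for a single call.
call = {
    "system_prompt": 1_200,
    "history": 18_500,
    "retrieved_docs": 42_000,
    "tool_outputs": 6_300,
}

print(remaining_budget(200_000, call))  # 132000
```

Because history is re-sent on every call, the "history" entry grows each turn while the window stays fixed.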

The input/output asymmetry most PM explanations skip

The headline number is the input window. The output cap is separate and often much tighter. As of mid-2026:

Claude Opus 4.7: 1M input tokens / 128k max output
Claude Sonnet 4.6: 1M input tokens / 64k max output
Claude Haiku 4.5: 200k input tokens / 64k max output
GPT-5.4: 1M long-context, with a documented 272k tier threshold above which premium pricing applies
Llama 4 Maverick: 1M input
Llama 4 Scout: 10M input (the largest commercially accessible window as of mid-2026)

These numbers come from base model specs. Vendor deployments often impose tighter limits: Amazon Bedrock and Google Vertex regularly cap at lower thresholds than the underlying model supports and charge premium rates above certain tiers. A PM who quotes “1M token context” without checking the cloud provider’s actual limit is quoting marketing copy, not production reality.
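A rough sketch of checking a request against the deployed limit rather than the headline spec. The 200k provider cap below is a made-up number for illustration, not a real Bedrock or Vertex quota, and the function names are hypothetical.

```python
from typing import Optional


def effective_limits(
    model_input: int, model_output: int, provider_input_cap: Optional[int] = None
) -> tuple[int, int]:
    """Return (usable input tokens, usable output tokens) for one deployment."""
    if provider_input_cap is None:
        return model_input, model_output
    return min(model_input, provider_input_cap), model_output


def fits(input_tokens: int, requested_output: int, limits: tuple[int, int]) -> bool:
    """True only if both the input and the requested output are within limits."""
    max_in, max_out = limits
    return input_tokens <= max_in and requested_output <= max_out


# Base spec of 1M input / 64k output, behind a hypothetical 200k provider cap.
limits = effective_limits(1_000_000, 64_000, provider_input_cap=200_000)

print(fits(350_000, 8_000, limits))   # False: under the spec, over the provider cap
print(fits(150_000, 80_000, limits))  # False: input fits, output cap does not
print(fits(150_000, 8_000, limits))   # True
```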

Multimodal adds another wrinkle: images and audio consume tokens differently depending on the provider, sometimes drawn from a separate allocation, sometimes directly from the shared pool.

Why bigger context does not automatically mean better

Three distinct failure modes matter for product decisions:

Lost in the middle. Performance degrades when relevant content sits in the middle of a long context. Llama 3.1 405B shows meaningful degradation around 32k tokens in practice, despite a rated window of 128k. In multi-turn adversarial settings with conflicting information added mid-conversation, accuracy dropped 39% on average across models; OpenAI o3 fell from 98.1% to 64.1% (Microsoft/Salesforce research). Long context is not equivalent to long attention.

Cost and latency compound. Input tokens cost money on every call. Stuffing a 1M-token window with everything you have is not a product strategy; it is a cost blowup waiting to happen. GPT-5.4 applies premium pricing above its 272k-token tier threshold. Claude’s output costs more per token than input. A product that naively concatenates conversation history re-sends the whole history on every turn, so it erodes margin every time a user stays in a session longer than expected.
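To see why naive concatenation hurts, here is a back-of-the-envelope sketch. It assumes every turn adds a fixed number of tokens and that the full history is re-sent on each call; the function name and the 500-token turn size are illustrative, not measured.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens billed over a session when full history is re-sent each turn."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn         # this turn's new messages join the history
        total += system_tokens + history   # the whole history is billed again on this call
    return total


short_session = cumulative_input_tokens(turns=10, tokens_per_turn=500)
long_session = cumulative_input_tokens(turns=40, tokens_per_turn=500)

print(short_session)  # 27500
print(long_session)   # 410000
```

A session four times longer bills roughly fifteen times the input tokens, because billed tokens grow with the square of the turn count. Multiply by your provider's input price to get the margin impact.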

Tool overload in agents. Context management is not just text. Llama 3.1 8B failed a function-calling task when given 46 tools but succeeded with 19. Dynamic tool selection, exposing only the tools relevant to the current step, improved performance by 44% and execution speed by 77%, and cut power usage by 18%, according to the Berkeley Function-Calling Leaderboard. The PM lesson: the information diet the agent sees is as important as the model’s rated capability.
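One simple way to picture dynamic tool selection is tag matching. This is a toy sketch, not the method used in the cited results: the `Tool` class, the tag scheme, and the tool names are hypothetical, and a production system would more likely rank tools with embeddings or a classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    tags: frozenset[str]


def select_tools(tools: list[Tool], step_tags: set[str], limit: int = 5) -> list[Tool]:
    """Keep only tools sharing at least one tag with the current step, best overlap first."""
    scored = [(len(t.tags & step_tags), t) for t in tools]
    relevant = [(score, t) for score, t in scored if score > 0]
    relevant.sort(key=lambda pair: (-pair[0], pair[1].name))  # ties broken by name
    return [t for _, t in relevant[:limit]]


# Hypothetical tool catalog for a support agent.
catalog = [
    Tool("lookup_invoice", frozenset({"billing"})),
    Tool("refund_order", frozenset({"billing", "refund"})),
    Tool("reset_password", frozenset({"account"})),
    Tool("search_flights", frozenset({"travel"})),
]

exposed = select_tools(catalog, step_tags={"billing", "refund"})
print([t.name for t in exposed])  # ['refund_order', 'lookup_invoice']
```

The agent step only ever sees the two billing tools; the rest of the catalog costs no tokens and creates no confusion.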

The production case that changes how you think about this

A support-ticket routing system loaded roughly 140k tokens per query by including full ticket history, knowledge base, and policy docs. Accuracy was 70%. After engineering the context down to around 6k tokens (surfacing only what was relevant to the specific ticket type), accuracy rose to 90%+ with latency measured in seconds. The model did not change. The context discipline did. Window size was never the constraint; information selection was.
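The kind of selection described above can be sketched as a filter plus a token budget. The `Snippet` type, the topic labels, and the token counts are invented for illustration; the source does not say how the real system chose its context.

```python
from typing import NamedTuple


class Snippet(NamedTuple):
    topic: str
    tokens: int
    text: str


def build_context(snippets: list[Snippet], ticket_type: str, budget: int) -> list[Snippet]:
    """Keep snippets matching the ticket type, in order, while they fit the token budget."""
    chosen: list[Snippet] = []
    used = 0
    for s in snippets:
        if s.topic != ticket_type:
            continue  # irrelevant to this ticket type
        if used + s.tokens > budget:
            continue  # would overflow the budget; a later, smaller snippet may still fit
        chosen.append(s)
        used += s.tokens
    return chosen


# Hypothetical knowledge-base snippets with pre-computed token counts.
kb = [
    Snippet("billing", 2_500, "Refund policy"),
    Snippet("shipping", 4_000, "Carrier SLAs"),
    Snippet("billing", 3_000, "Invoice disputes"),
    Snippet("billing", 1_500, "Chargeback steps"),
]

context = build_context(kb, ticket_type="billing", budget=6_000)
print([s.text for s in context])       # ['Refund policy', 'Invoice disputes']
print(sum(s.tokens for s in context))  # 5500
```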

Context compression tools like Provence can cut context by up to 95% while retaining the information the model needs. Anthropic’s extended thinking approach, which gives the model room for intermediate reasoning, showed up to 54% improvement on some agent benchmarks. The implication for product design: the value is in what you choose to include and what you discard, not in whether you technically have headroom to include more.

RAG vs long-context: the decision that actually matters

RAG retrieves relevant documents at query time and injects only those. Long-context loads large amounts of material in full. Neither is universally better.

Use long-context when you need cross-chunk reasoning on a single artifact, the content is bounded (a contract, a code file, a report), and latency and cost are acceptable. Use RAG when the corpus is unbounded, freshness matters, or cost-per-query at scale is a constraint.
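The rule of thumb above can be written down as a small decision function. The parameter names are illustrative, and the "either" fallback is an assumption: the text does not say what to do when the content is bounded but needs no cross-chunk reasoning.

```python
def choose_strategy(
    *,
    bounded_corpus: bool,
    cross_chunk_reasoning: bool,
    freshness_matters: bool,
    cost_constrained: bool,
) -> str:
    """Map the conditions in the text to 'rag', 'long-context', or 'either'."""
    # Any one of these pushes toward retrieval.
    if not bounded_corpus or freshness_matters or cost_constrained:
        return "rag"
    # Bounded artifact, cost acceptable, and the task reasons across chunks.
    if cross_chunk_reasoning:
        return "long-context"
    # Not covered by the rule of thumb; decide on measured accuracy and cost.
    return "either"


# Reviewing one contract end to end.
print(choose_strategy(bounded_corpus=True, cross_chunk_reasoning=True,
                      freshness_matters=False, cost_constrained=False))  # long-context

# Support bot over a growing knowledge base.
print(choose_strategy(bounded_corpus=False, cross_chunk_reasoning=False,
                      freshness_matters=True, cost_constrained=True))  # rag
```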

The common mistake is treating long context as a substitute for retrieval architecture. A 1M-token window does not make RAG irrelevant: it shifts which problems benefit from raw context and which benefit from selective retrieval. For most production support, knowledge-base, and document-processing products, RAG plus compaction beats naive long-context on both accuracy and cost.

For the full decision frame, see RAG vs fine-tuning vs prompting.

What this means for product decisions

Context window size shapes four concrete PM decisions:

Document upload limits: what file sizes and types can users attach, given your provider’s effective limit and token cost?
Conversation length: how long can a chat session run before history truncation or summarization kicks in, and what does the user experience when that happens?
Cost model: what is the token cost per session at P90 usage, and does that support your pricing tier?
Agent architecture: which tools, documents, and history does each agent step actually need? More is not more.
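For the conversation-length decision, the simplest truncation policy keeps the most recent messages that fit a history budget. The sketch below assumes token counts are already known per message; the function name and the numbers are illustrative. Summarizing the dropped turns instead of discarding them is the usual refinement and is not shown here.

```python
def trim_history(messages: list[tuple[str, int]], budget: int) -> list[tuple[str, int]]:
    """Keep the most recent (text, token_count) messages that fit the budget, in order."""
    kept: list[tuple[str, int]] = []
    used = 0
    for text, tokens in reversed(messages):  # walk from newest to oldest
        if used + tokens > budget:
            break  # everything older than this point is dropped
        kept.append((text, tokens))
        used += tokens
    kept.reverse()  # restore chronological order
    return kept


# Hypothetical session with pre-computed token counts.
history = [
    ("hi", 50),
    ("order status?", 120),
    ("long reply", 900),
    ("follow-up", 200),
]

print(trim_history(history, budget=1_200))
# [('long reply', 900), ('follow-up', 200)]
```

What the user experiences when "hi" and "order status?" silently fall out of the window is the product question; the code only decides where the cut lands.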

In 2026, a 10M-token context window exists. Feasibility is not the question. The question is what viable and lovable context management looks like: users never see truncation errors, never re-explain their history, their documents just work, while your cost per query supports the business. That is an information diet problem, not a window size problem.

Interview answer

Strong

"The context window is the total token budget shared by everything the model sees and produces in one call: system prompt, conversation history, retrieved documents, tool outputs, and the reply itself. The headline number is the input window; the output cap is separate and tighter. Claude Sonnet 4.6 is 1M input but only 64k output. GPT-5.4 charges premium rates above 272k tokens even though 1M is advertised. In practice, vendor deployments on Bedrock or Vertex often cap below the base model spec, which is a common production trap. The bigger PM insight is that bigger windows do not automatically mean better results. Accuracy dropped 39% on average in multi-turn adversarial benchmarks, and a support routing system went from 70% to 90%+ accuracy by cutting context from 140k tokens to 6k, keeping only what mattered. So my product decisions around context windows are about information diet: which documents, tools, and history does each step actually need, and what does that cost per query at scale? I'd evaluate with latency benchmarks at P90 session length, token cost projections by usage tier, and retrieval evals checking that the right content is being surfaced, not just included."

Weak

"The context window is basically the model's working memory, how much text it can hold at once. Bigger context windows mean the model can handle more information." This stops at the analogy without touching the asymmetry between input and output caps, the cost tiers, the degradation at scale, or any product decision it informs. A candidate who gives this answer at an AI-native company signals they read a definition but have never had to design around a token budget in production.

What the follow-up tests

Interviewers at AI-native companies follow context window questions with: “How would you design a product that handles long documents without hitting context limits?” or “What is the difference between RAG and long-context, and when would you use each?” Both questions test whether you can translate a technical constraint into a product architecture and a cost model, not just recite the definition.

The answer chain they want: window size is a budget, not a ceiling to ignore; context discipline (what you include, what you retrieve, what you compress or summarize) drives quality and cost more than raw window size; RAG and long-context are complements with specific conditions of use, not alternatives. For the cost-per-query lens that connects this to pricing decisions, see LLM unit economics.