MCP & AI Infrastructure: Deep Dives

Context Delivery Patterns: Feeding AI the Right Information

Static context, dynamic context, lazy loading, and context windowing.

The Prompt Engineering Project February 27, 2025 11 min read

Quick Answer

AI context delivery patterns define how external data reaches a language model at inference time. The primary patterns are retrieval-augmented generation for dynamic knowledge, static context injection for fixed reference data, conversation summarization for history management, tool-based retrieval for on-demand data fetching, and hierarchical context layering that prioritizes information by relevance. Choosing the right pattern depends on data freshness needs, volume, and latency constraints.

The instructions you give an AI model are only half the equation. The other half is the information you feed it -- the context. A perfectly written system prompt paired with the wrong context produces wrong answers. A mediocre prompt paired with precisely the right context often produces good ones. Context is not supplementary to prompting. It is co-equal with it, and in many production systems, it is the harder problem.

The challenge is that context delivery is not one problem. It is at least four distinct problems, each with its own architecture, tradeoffs, and failure modes. Treating them as interchangeable -- dumping everything into the system prompt and calling it done -- is the most common and most expensive mistake in production AI architecture.

This article covers four patterns for context delivery: Static Context, Dynamic Context, Lazy Loading, and Context Windowing. For each, we walk through the architecture, the tradeoffs, when to use it, and how to implement it in TypeScript.

Pattern 1: Static Context

Static context is information baked directly into the system prompt. It is present in every request, regardless of user input. This is the simplest pattern and the right default for information that is always relevant: role definitions, behavioral rules, output format specifications, company policies, and domain knowledge that applies universally.

static-context-architecture.txt
┌───────────────────────────────────────────────────┐
│                  System Prompt                    │
│                                                   │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────┐  │
│  │    Role     │  │  Rules &     │  │  Output  │  │
│  │  Definition │  │  Constraints │  │  Format  │  │
│  └─────────────┘  └──────────────┘  └──────────┘  │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │      Domain Knowledge (always present)      │  │
│  └─────────────────────────────────────────────┘  │
│                                                   │
├───────────────────────────────────────────────────┤
│  User Message                                     │
└───────────────────────────────────────────────────┘
                        │
                        ▼
                   [ LLM Response ]

The strength of static context is predictability. Every request gets the same information, so behavior is consistent. The weakness is cost. Every token in the system prompt is billed on every request. A 2,000-token system prompt costs nothing when you make 10 requests a day. When you make 100,000 requests a day, those tokens add up to real money.

The other weakness is relevance. Static context cannot adapt to the user's question. If your system prompt contains detailed pricing information but the user asks a technical question, those pricing tokens are pure waste -- they consume context window space without contributing to the response.
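To make the cost concrete, a quick back-of-envelope calculation. The $3-per-million-input-tokens rate below is an assumed placeholder; substitute your model's actual pricing:

```typescript
// Rough daily cost of a static system prompt, billed on every request.
// PRICE_PER_MILLION_INPUT is an assumed rate -- use your model's real pricing.
const PROMPT_TOKENS = 2_000
const PRICE_PER_MILLION_INPUT = 3.0 // USD per 1M input tokens (assumption)

function dailySystemPromptCost(requestsPerDay: number): number {
  const tokensPerDay = PROMPT_TOKENS * requestsPerDay
  return (tokensPerDay / 1_000_000) * PRICE_PER_MILLION_INPUT
}

console.log(dailySystemPromptCost(10))       // ~$0.06/day -- negligible
console.log(dailySystemPromptCost(100_000))  // $600/day -- real money
```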

static-context.ts
// Static context: baked into the system prompt at build time
const systemPrompt = `
You are a customer support agent for Acme SaaS.

## Product Knowledge
- Free tier: 1,000 API calls/month, 1 user, community support
- Pro tier: 50,000 API calls/month, 10 users, email support, $49/month
- Enterprise: unlimited calls, unlimited users, dedicated CSM, custom pricing

## Support Policies
- Refunds available within 30 days of purchase
- SLA: 4-hour response for Enterprise, 24-hour for Pro, best-effort for Free
- Escalation path: L1 agent -> L2 specialist -> engineering on-call

## Behavioral Rules
- Never share internal pricing formulas or discount structures
- Always confirm the customer's tier before quoting capabilities
- If unsure about a technical issue, escalate rather than guess
`

async function handleRequest(userMessage: string) {
  return await callModel({
    system: systemPrompt,  // Same every time
    messages: [{ role: 'user', content: userMessage }],
  })
}

Audit your static context quarterly. Information that was relevant six months ago may now be outdated, and every stale token degrades both cost efficiency and response quality.

Pattern 2: Dynamic Context

Dynamic context is information injected into the prompt at request time based on the user's input. This is the pattern behind RAG (Retrieval-Augmented Generation), database lookups, API calls, and any system that fetches relevant information before calling the model. It is the most common context delivery pattern in production AI systems because it solves the fundamental limitation of static context: relevance.

dynamic-context-architecture.txt
                   User Message
                        │
                        ▼
              ┌────────────────────┐
              │   Context Router   │
              │   (query analysis) │
              └─────────┬──────────┘
                        │
           ┌────────────┼────────────┐
           ▼            ▼            ▼
     ┌──────────┐ ┌──────────┐ ┌──────────┐
     │  Vector  │ │ Database │ │ External │
     │  Search  │ │  Lookup  │ │   API    │
     └─────┬────┘ └─────┬────┘ └─────┬────┘
           │            │            │
           └────────────┼────────────┘
                        ▼
              ┌────────────────────┐
              │  Context Assembly  │
              │   (rank, trim,     │
              │    format)         │
              └─────────┬──────────┘
                        │
                        ▼
    ┌─────────────────────────────────────┐
    │  System Prompt + Assembled Context  │
    │  + User Message                     │
    └─────────────────────────────────────┘
                        │
                        ▼
                   [ LLM Response ]

The implementation has three stages: retrieval, ranking, and assembly. Retrieval fetches candidate context from one or more sources. Ranking scores each candidate by relevance to the user's query and selects the top results. Assembly formats the selected context and injects it into the prompt at the right location.

dynamic-context.ts
interface ContextChunk {
  content: string
  source: string
  relevanceScore: number
  tokenCount: number
}

async function buildDynamicContext(
  userMessage: string,
  tokenBudget: number
): Promise<string> {
  // Stage 1: Retrieve candidates from multiple sources
  const [vectorResults, dbResults] = await Promise.all([
    searchVectorStore(userMessage, { limit: 10 }),
    searchDatabase(userMessage, { limit: 5 }),
  ])

  // Stage 2: Rank by relevance, deduplicate identical content
  const seen = new Set<string>()
  const allChunks: ContextChunk[] = [...vectorResults, ...dbResults]
    .filter(chunk => {
      if (seen.has(chunk.content)) return false
      seen.add(chunk.content)
      return true
    })
    .sort((a, b) => b.relevanceScore - a.relevanceScore)

  // Stage 3: Assemble within token budget
  let usedTokens = 0
  const selectedChunks: ContextChunk[] = []

  for (const chunk of allChunks) {
    if (usedTokens + chunk.tokenCount > tokenBudget) break
    selectedChunks.push(chunk)
    usedTokens += chunk.tokenCount
  }

  return selectedChunks
    .map(c => `[Source: ${c.source}]\n${c.content}`)
    .join('\n\n---\n\n')
}

async function handleRequest(userMessage: string) {
  const context = await buildDynamicContext(userMessage, 4000)

  return await callModel({
    system: systemPrompt,
    messages: [
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${userMessage}` },
    ],
  })
}

The critical tradeoff with dynamic context is latency versus relevance. Every retrieval step adds time before the model can start generating. Vector search typically adds 50-200ms. Database queries add 10-100ms. External API calls can add seconds. For real-time applications, the retrieval pipeline must be fast, which means investing in index optimization, caching, and parallel fetching.
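One common mitigation is caching retrieval results for repeated queries. A minimal sketch, assuming a generic async search function; the cache shape, TTL, and names here are invented for illustration:

```typescript
// Minimal TTL cache in front of a retrieval call.
// Repeated queries within the TTL skip the retrieval round trip entirely.
const retrievalCache = new Map<string, { results: string[]; expires: number }>()
const CACHE_TTL_MS = 60_000 // 1 minute (assumed; tune for your freshness needs)

async function cachedSearch(
  query: string,
  search: (q: string) => Promise<string[]>
): Promise<string[]> {
  const hit = retrievalCache.get(query)
  if (hit && hit.expires > Date.now()) return hit.results

  const results = await search(query)
  retrievalCache.set(query, { results, expires: Date.now() + CACHE_TTL_MS })
  return results
}
```

The tradeoff is staleness: cached context can lag the source of truth by up to the TTL, so this works best for slowly changing reference data.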

The quality of your AI system is bounded by the quality of your retrieval pipeline. A model cannot reason about information it never received.

Pattern 3: Lazy Loading

Lazy loading inverts the dynamic context pattern. Instead of retrieving context before calling the model, you start with minimal context and let the model request additional information as it needs it, via tool calls. The model decides what information is relevant, not the retrieval pipeline.

This pattern is central to how AI agents work. The agent receives a task, considers what information it needs, calls the appropriate tools to fetch that information, reasons about the results, and repeats until it has enough context to produce a final answer.

lazy-loading-architecture.txt
         User Message (minimal initial context)
                        │
                        ▼
                   ┌─────────┐
            ┌─────>│   LLM   │─────┐
            │      └─────────┘     │
            │           │          │
            │    Tool call needed? │
            │      │          │    │
            │     YES         NO ──┼──> Final Response
            │      │               │
            │      ▼               │
            │ ┌───────────┐        │
            │ │ Execute   │        │
            │ │ Tool Call │        │
            │ └─────┬─────┘        │
            │       │              │
            │       ▼              │
            │ ┌───────────┐        │
            │ │ Return    │        │
            └─┤ Results   │        │
              └───────────┘        │
                                   │
       (loop until model has       │
        enough context)  ◄─────────┘

lazy-loading.ts
const tools = [
  {
    name: 'get_customer_profile',
    description: 'Retrieve customer details by ID or email',
    parameters: {
      type: 'object',
      properties: {
        identifier: { type: 'string', description: 'Customer ID or email' },
      },
    },
  },
  {
    name: 'search_knowledge_base',
    description: 'Search product documentation and FAQs',
    parameters: {
      type: 'object',
      properties: {
        query: { type: 'string', description: 'Search query' },
      },
    },
  },
  {
    name: 'get_recent_tickets',
    description: 'Get recent support tickets for a customer',
    parameters: {
      type: 'object',
      properties: {
        customerId: { type: 'string' },
        limit: { type: 'number', description: 'Max tickets to return' },
      },
    },
  },
]

async function handleWithLazyLoading(userMessage: string) {
  const messages: Array<{ role: 'user' | 'assistant' | 'tool'; content: string }> = [
    { role: 'user', content: userMessage },
  ]

  // Loop: let the model request context as needed.
  // Bounded so a confused model cannot loop forever.
  const MAX_ITERATIONS = 10

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const response = await callModel({
      system: systemPrompt,
      messages,
      tools,
    })

    // If the model wants to call tools, execute them
    if (response.toolCalls && response.toolCalls.length > 0) {
      // Record the assistant turn once, then append each tool result
      messages.push({ role: 'assistant', content: response.content })
      for (const call of response.toolCalls) {
        const result = await executeTool(call.name, call.arguments)
        messages.push({ role: 'tool', content: JSON.stringify(result) })
      }
      continue  // Let the model reason about the new context
    }

    // No more tool calls: model has enough context
    return response.content
  }

  throw new Error('Exceeded maximum tool-call iterations')
}

The advantage of lazy loading is token efficiency. The model only retrieves information it actually needs, which can be dramatically less than what a pre-retrieval pipeline would inject. A customer asks "what is my account status?" -- the model calls one tool and gets one result. The same question through a dynamic context pipeline might have injected five documents from the knowledge base that were never needed.

The disadvantage is latency. Each tool call is a round trip: the model generates a tool call, your server executes it, the result goes back to the model, and the model generates again. Two tool calls means three model invocations instead of one. For simple questions this overhead is wasteful. For complex questions that genuinely need multiple pieces of information, it is the right architecture.

- 60% fewer tokens vs. pre-retrieval
- 2-3x added latency per tool call
- 1-5 typical tool calls per request

Pattern 4: Context Windowing

Context windowing addresses a problem unique to multi-turn conversations: the conversation history grows with every message, eventually exceeding the model's context window. Without management, the system either truncates the history (losing important early context) or fails entirely when the token limit is hit.

The pattern maintains a sliding window of recent messages at full fidelity, while older messages are compressed into summaries. The model always sees the full detail of recent exchanges and a condensed version of earlier ones, keeping the total token count within budget.

context-windowing-architecture.txt
Turn 1  ─┐
Turn 2   │  Older messages:
Turn 3   ├─ compressed into      ┌──────────────────────┐
Turn 4   │  a running summary ──>│ Summary: "Customer   │
Turn 5  ─┘                       │ asked about billing, │
                                 │ was upgraded to Pro, │
                                 │ had API key issue    │
                                 │ resolved..."         │
                                 └──────────┬───────────┘
                                            │
Turn 6  ─┐                                  │
Turn 7   │  Recent messages:                │
Turn 8   ├─ kept at full fidelity           │
Turn 9   │                                  │
Turn 10 ─┘                                  │
                                            │
         ┌──────────────────────────────────┘
         │
         ▼
  ┌────────────────────────────────────────┐
  │  System Prompt                         │
  │  + Conversation Summary (older turns)  │
  │  + Full Messages (recent turns)        │
  │  + Current User Message                │
  └────────────────────────────────────────┘
                    │
                    ▼
              [ LLM Response ]

context-windowing.ts
interface Message {
  role: 'user' | 'assistant'
  content: string
  tokenCount: number
}

interface ConversationWindow {
  summary: string
  summaryTokens: number
  recentMessages: Message[]
  recentTokens: number
}

async function manageConversationWindow(
  allMessages: Message[],
  maxTokens: number,
  systemPromptTokens: number,
  reserveForResponse: number = 2000
): Promise<ConversationWindow> {
  const budget = maxTokens - systemPromptTokens - reserveForResponse

  // Start from the most recent message and work backwards
  const recent: Message[] = []
  let recentTokens = 0

  for (let i = allMessages.length - 1; i >= 0; i--) {
    const msg = allMessages[i]
    if (recentTokens + msg.tokenCount > budget * 0.7) break  // 70% for recent
    recent.unshift(msg)
    recentTokens += msg.tokenCount
  }

  // Summarize older messages that did not fit
  const olderMessages = allMessages.slice(0, allMessages.length - recent.length)
  let summary = ''
  let summaryTokens = 0

  if (olderMessages.length > 0) {
    summary = await summarizeMessages(olderMessages, Math.floor(budget * 0.3))
    summaryTokens = countTokens(summary)
  }

  return { summary, summaryTokens, recentMessages: recent, recentTokens }
}

async function summarizeMessages(
  messages: Message[],
  maxTokens: number
): Promise<string> {
  const transcript = messages
    .map(m => `${m.role}: ${m.content}`)
    .join('\n')

  const result = await callModel({
    system: 'Summarize this conversation history concisely. Preserve key facts, decisions, and unresolved questions. Omit pleasantries and redundant exchanges.',
    messages: [{ role: 'user', content: transcript }],
    maxTokens,
  })

  return result.content
}
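
The `countTokens` helper above is left abstract. When the provider's tokenizer is not wired in, a character-based estimate is a common stand-in; the 4-characters-per-token ratio is a rough heuristic for English prose, not a guarantee:

```typescript
// Rough token estimate: ~4 characters per token for English prose.
// Prefer the model provider's real tokenizer when accuracy matters,
// since this heuristic can be badly off for code or non-English text.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4)
}
```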

The critical implementation detail is the summarization quality. A bad summary loses important context and causes the model to ask questions the user already answered, which is the single most frustrating experience in a conversational AI product. The summary must preserve: factual decisions made, user preferences expressed, unresolved questions, and any context the model will need to continue the conversation coherently.

Never silently truncate conversation history without summarization. Users who repeat themselves because the model "forgot" will lose trust faster than users who experience any other failure mode.

Choosing the Right Pattern

These four patterns are not mutually exclusive. Most production systems use two or three in combination. A typical architecture uses static context for rules and role definitions, dynamic context for RAG retrieval, and context windowing for conversation history management. Lazy loading is added when the agent needs access to tools that fetch specialized information.

The decision framework is straightforward.

1

Is the information always relevant?

Use static context. Rules, role definitions, output formats, and behavioral constraints belong in the system prompt. They apply to every request.

2

Does relevance depend on the user's input?

Use dynamic context. RAG retrieval, database lookups, and API calls fetch information specific to the current request. This is the right pattern when you have a large knowledge base but only a small portion is relevant to any given query.

3

Is the information needed unpredictable?

Use lazy loading. When you cannot anticipate what information the model will need, let it decide. This is the right pattern for agent-style interactions where the task is complex and the information requirements emerge during reasoning.

4

Is the conversation long-running?

Use context windowing. Any conversation that exceeds 10-15 turns needs a strategy for managing history. Summarize older messages, keep recent ones at full fidelity, and never silently truncate.
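The four questions above can be collapsed into a small selection helper. This is an illustrative sketch -- the type and field names are invented here, and since the questions are not mutually exclusive, the helper returns every pattern that applies:

```typescript
// Map the decision framework's four questions to the patterns they select.
type Pattern = 'static' | 'dynamic' | 'lazy-loading' | 'windowing'

interface ContextNeeds {
  alwaysRelevant: boolean     // Q1: needed on every request?
  inputDependent: boolean     // Q2: relevance depends on the user's input?
  unpredictable: boolean      // Q3: info needs emerge during reasoning?
  longConversation: boolean   // Q4: conversation exceeds ~10-15 turns?
}

function choosePatterns(needs: ContextNeeds): Pattern[] {
  const patterns: Pattern[] = []
  if (needs.alwaysRelevant) patterns.push('static')
  if (needs.inputDependent) patterns.push('dynamic')
  if (needs.unpredictable) patterns.push('lazy-loading')
  if (needs.longConversation) patterns.push('windowing')
  return patterns
}
```

A typical production chat assistant ticks Q1, Q2, and Q4 and lands on exactly the static + dynamic + windowing combination described above.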


Key Takeaways

1

Context delivery is co-equal with prompt design. A perfect prompt with wrong context produces wrong answers. Treat context architecture as a first-class engineering problem.

2

Static context is cheapest to implement but most expensive per-token at scale. Audit it regularly and move variable information to dynamic retrieval.

3

Dynamic context (RAG) is the most common production pattern. Its quality depends on the retrieval pipeline -- invest in ranking, deduplication, and token budgeting.

4

Lazy loading saves tokens by letting the model request only what it needs, at the cost of added latency from tool call round trips. Best for agent-style interactions.

5

Context windowing is essential for any conversation longer than 10 turns. Summarize older messages rather than truncating them, and preserve key facts and decisions in the summary.

