
The AI Search Stack: How to Build Search That Actually Works

Traditional search meets AI re-ranking and semantic understanding.

The Prompt Engineering Project · February 24, 2025 · 11 min read

Quick Answer

An AI search architecture combines vector embeddings, traditional keyword search, and neural reranking into a hybrid retrieval pipeline. Documents are chunked and embedded into a vector database, then queries use both semantic similarity and BM25 scoring. A reranking model sorts the merged candidates by relevance, delivering results that understand user intent rather than just matching keywords.

Most search systems are broken in the same way. They use one technique -- usually keyword matching -- and expect it to handle every query a user can construct. The user searches for "how to fix authentication errors" and gets results that contain those exact words but answer the wrong question. Or they search for "login problems" and get nothing because the documentation says "authentication failure" instead. Single-layer search fails because language is ambiguous, intent is complex, and no one retrieval method handles both precision and recall well.

The solution is a three-layer architecture that combines the strengths of traditional search, semantic search, and AI re-ranking into a system where each layer compensates for the weaknesses of the others. Traditional search provides broad recall and exact matching. Semantic search provides meaning-based relevance. AI re-ranking provides precision by scoring results against the actual intent of the query. Together, they produce search results that feel like the system actually understood what the user wanted.

Layer 1: Traditional Search

Traditional search is keyword-based retrieval using algorithms like BM25 and inverted indexes. It has been the backbone of search systems for decades, and for good reason: it is fast, predictable, and handles exact matches perfectly. When a user searches for an error code, a product name, or a specific phrase, traditional search returns the right documents immediately because it is matching literal strings.

BM25 is the standard algorithm. It scores documents based on term frequency (how often the query terms appear in a document), inverse document frequency (how rare those terms are across all documents), and document length normalization. Documents that contain rare query terms frequently and are not padded with irrelevant content score highest. The math is simple, the implementation is well-understood, and the performance is excellent.

traditional-search.ts
interface SearchResult {
  id: string
  score: number
  content: string
  metadata: Record<string, any>
}

async function traditionalSearch(
  query: string,
  index: SearchIndex,
  options: { limit: number; minScore: number }
): Promise<SearchResult[]> {
  // Tokenize and normalize the query
  const tokens = tokenize(query)
    .map((t) => t.toLowerCase())
    .filter((t) => !STOP_WORDS.has(t))

  // Score each document using BM25
  const scores: Map<string, number> = new Map()
  const k1 = 1.2 // term frequency saturation
  const b = 0.75 // length normalization
  const avgDocLength = index.averageDocumentLength

  for (const token of tokens) {
    const idf = Math.log(
      (index.totalDocuments - index.docFrequency(token) + 0.5) /
        (index.docFrequency(token) + 0.5) +
        1
    )

    for (const doc of index.documentsContaining(token)) {
      const tf = doc.termFrequency(token)
      const docLength = doc.length
      const score =
        idf *
        ((tf * (k1 + 1)) /
          (tf + k1 * (1 - b + b * (docLength / avgDocLength))))

      scores.set(
        doc.id,
        (scores.get(doc.id) ?? 0) + score
      )
    }
  }

  return Array.from(scores.entries())
    .filter(([, score]) => score >= options.minScore)
    .sort((a, b) => b[1] - a[1])
    .slice(0, options.limit)
    .map(([id, score]) => ({
      id,
      score,
      content: index.getContent(id),
      metadata: index.getMetadata(id),
    }))
}
Typical latency: <10ms · Per-query cost: $0 · Recall for exact terms: high · Semantic understanding: low

The role of traditional search in the three-layer stack is broad recall. It casts a wide net, retrieving every document that contains relevant terms. It will include false positives -- documents that match on keywords but not on intent -- and that is acceptable. The subsequent layers will filter them out. What matters at this stage is that no relevant documents are missed. A typical configuration retrieves the top 100 to 200 candidates.

Layer 2: Semantic Search

Semantic search uses vector embeddings to find documents that are similar in meaning, regardless of the specific words used. The query and all documents are converted into high-dimensional vectors using an embedding model. Documents whose vectors are close to the query vector -- measured by cosine similarity or dot product -- are considered relevant, even if they share no keywords with the query.

This solves the vocabulary mismatch problem that cripples traditional search. A user searching for "login problems" will find documents about "authentication failures" because the embedding model understands that these phrases are semantically equivalent. A search for "how to deploy to production" will find documents about "release management" and "shipping to prod" because the model captures the underlying concept, not just the surface text.

semantic-search.ts
import { generateEmbedding } from './embedding-model'
import { db } from './db' // PostgreSQL client with pgvector enabled

async function semanticSearch(
  query: string,
  candidates: SearchResult[],
  options: { limit: number; minSimilarity: number }
): Promise<SearchResult[]> {
  // Generate embedding for the query
  const queryEmbedding = await generateEmbedding(query)

  // Score each candidate by cosine similarity
  const scored = await Promise.all(
    candidates.map(async (candidate) => {
      const docEmbedding = await getStoredEmbedding(
        candidate.id
      )
      const similarity = cosineSimilarity(
        queryEmbedding,
        docEmbedding
      )
      return {
        ...candidate,
        semanticScore: similarity,
        combinedScore:
          candidate.score * 0.3 + similarity * 0.7,
      }
    })
  )

  return scored
    .filter((r) => r.semanticScore >= options.minSimilarity)
    .sort((a, b) => b.combinedScore - a.combinedScore)
    .slice(0, options.limit)
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Using pgvector for storage and retrieval
async function getStoredEmbedding(
  docId: string
): Promise<number[]> {
  const result = await db.query(
    'SELECT embedding FROM documents WHERE id = $1',
    [docId]
  )
  return result.rows[0].embedding
}

Traditional search asks: does this document contain these words? Semantic search asks: does this document mean the same thing?

For storage, pgvector is the pragmatic choice. It adds vector operations to PostgreSQL, which means you can store embeddings alongside your relational data without introducing a separate vector database. Create an embedding column, build an index using HNSW or IVFFlat, and query with the <=> operator for cosine distance. For most applications under ten million documents, pgvector performs well enough that a dedicated vector database is unnecessary complexity.
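A minimal sketch of that pgvector setup. The table name, column name, and embedding dimension are illustrative assumptions, not from the article:

```sql
-- Enable the extension and store embeddings next to relational data
CREATE EXTENSION IF NOT EXISTS vector;

-- Dimension must match your embedding model's output (1536 is illustrative)
ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- HNSW index for approximate nearest-neighbor search on cosine distance
CREATE INDEX documents_embedding_idx
  ON documents USING hnsw (embedding vector_cosine_ops);

-- <=> is cosine distance: smaller means closer, so order ascending
SELECT id, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 25;
```

With this index in place, the nearest-neighbor query stays fast as the table grows, and the embeddings live in the same transactional store as the documents themselves.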

The role of semantic search in the stack is relevance filtering. It takes the broad set of candidates from Layer 1 and narrows it to the documents that are actually about what the user is asking. A typical configuration reduces 200 candidates to the top 20 to 30. These are the documents that pass both the keyword test and the meaning test.

Pre-compute and store embeddings at indexing time, not query time. Generating embeddings for thousands of candidate documents per query is too slow and too expensive. Only the query embedding should be generated at request time.
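Indexing-time work also includes chunking the documents before embedding them. A minimal sketch of a fixed-size chunker with overlap, where the sizes and the `chunkText` name are illustrative assumptions, not from the article:

```typescript
// Split a document into overlapping chunks so each fits the embedding
// model's input limit and content at a boundary appears in two chunks
// rather than being split mid-thought. Sizes are in characters.
function chunkText(
  text: string,
  chunkSize = 1000,
  overlap = 200
): string[] {
  const chunks: string[] = []
  let start = 0
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize))
    if (start + chunkSize >= text.length) break
    start += chunkSize - overlap
  }
  return chunks
}
```

Each chunk is embedded and stored as its own row at indexing time, so query-time work is limited to embedding the query itself.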

Layer 3: AI Re-ranking

The final layer is the most expensive and the most powerful. An LLM examines each candidate result alongside the original query and produces a relevance score that accounts for nuance, context, and intent that neither keyword matching nor vector similarity can capture. The model reads both the query and the candidate and answers the question: is this actually what the user is looking for?

This is qualitatively different from the first two layers. BM25 counts words. Embeddings measure geometric distance in vector space. An LLM reasons about whether the content answers the question. It can determine that a document is topically related but does not actually address the specific problem the user described. It can determine that a document from a different domain contains the exact answer because it understands analogy and transfer. It can determine that the third-ranked document is actually the best result because it is the most actionable, even if it scores lower on keyword and semantic metrics.

ai-reranker.ts
interface RerankResult extends SearchResult {
  relevanceScore: number
  reasoning: string
}

async function aiRerank(
  query: string,
  candidates: SearchResult[],
  options: { limit: number; model: string }
): Promise<RerankResult[]> {
  const reranked: RerankResult[] = []

  // Process in batches to manage token costs
  const batchSize = 5
  for (let i = 0; i < candidates.length; i += batchSize) {
    const batch = candidates.slice(i, i + batchSize)

    const response = await llm.complete({
      model: options.model,
      system: `You are a search relevance evaluator.
Given a user query and a list of candidate documents,
score each document from 0.0 to 1.0 based on how well
it answers the user's actual question.

Consider:
- Does the document directly answer the query?
- Is the information actionable and specific?
- Would a user be satisfied finding this result?

Return JSON: { results: [{ index, score, reasoning }] }`,
      messages: [
        {
          role: 'user',
          content: `Query: "${query}"

Candidates:
${batch.map((c, idx) => `[${idx}] ${c.content.slice(0, 500)}`).join('\n\n')}`,
        },
      ],
    })

    const scores = JSON.parse(response.content)
    for (const scored of scores.results) {
      reranked.push({
        ...batch[scored.index],
        relevanceScore: scored.score,
        reasoning: scored.reasoning,
      })
    }
  }

  return reranked
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, options.limit)
}
Typical latency: 1-3s · Per-query cost: $0.01-0.05 · Precision: very high · Max candidates: 20-30

The reason AI re-ranking is the final layer rather than the only layer is cost. Running an LLM evaluation on every document in your index for every query is economically impossible. A corpus of one hundred thousand documents, even at a modest per-evaluation cost, would make each search query cost hundreds of dollars. The three-layer architecture solves this by using cheap, fast methods to progressively narrow the candidate set until the expensive method only needs to evaluate twenty to thirty documents.
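The arithmetic behind that claim can be sketched directly. The token count per candidate and the input-token price below are illustrative assumptions, not figures from the article:

```typescript
// Back-of-envelope cost of LLM re-ranking, showing why it must be the
// final layer of the funnel. Prices and token counts are illustrative.
function rerankCostUSD(
  candidates: number,
  tokensPerCandidate = 600, // ~500 chars of content plus prompt overhead
  pricePerMillionInputTokens = 3 // illustrative input-token price
): number {
  const tokens = candidates * tokensPerCandidate
  return (tokens / 1_000_000) * pricePerMillionInputTokens
}

// Re-ranking the full 100k-document corpus vs. the 25 funnel survivors
const fullCorpus = rerankCostUSD(100_000) // $180 per query
const funneled = rerankCostUSD(25) // under five cents per query
```

Under these assumptions, the funnel turns a $180 query into one that costs a few cents, which is exactly the gap between an impossible product and a viable one.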

The Full Pipeline

With all three layers in place, the search pipeline works as a funnel. Each layer narrows the candidate set while increasing the quality of the remaining results.

search-pipeline.ts
async function search(
  query: string,
  config: SearchConfig
): Promise<RerankResult[]> {
  // Layer 1: Traditional search - broad recall
  // 100,000 documents -> top 200 candidates
  const keywords = await traditionalSearch(
    query,
    config.index,
    { limit: 200, minScore: 0.1 }
  )

  // Layer 2: Semantic search - relevance filter
  // 200 candidates -> top 25
  const semantic = await semanticSearch(
    query,
    keywords,
    { limit: 25, minSimilarity: 0.6 }
  )

  // Layer 3: AI re-ranking - precision
  // 25 candidates -> top 5
  const reranked = await aiRerank(
    query,
    semantic,
    { limit: 5, model: 'claude-sonnet-4-20250514' }
  )

  return reranked
}

The numbers at each stage are tunable and depend on your corpus size, query patterns, and quality requirements. The principle is constant: each layer should reduce the candidate set by roughly an order of magnitude while increasing relevance. Layer 1 goes from the full corpus to hundreds. Layer 2 goes from hundreds to tens. Layer 3 goes from tens to the final results.

Use cheap, fast methods to get close. Use expensive, slow methods to get precise. Never use the expensive method on the full corpus.

The Cost-Quality Tradeoff

Each layer increases result quality and increases cost. The question every team must answer is: which layers do you actually need?

For internal documentation search where queries are keyword-heavy and the corpus is small, Layer 1 alone might be sufficient. For customer-facing search where users phrase queries in natural language, Layers 1 and 2 are the minimum. For high-stakes applications where surfacing the wrong result has real consequences -- medical information, legal research, financial analysis -- all three layers are justified.

The cost of AI re-ranking is dominated by token usage. Each candidate document that the model evaluates consumes input tokens. You can control this cost by limiting the content sent to the re-ranker -- send the first 500 characters of each document rather than the full text. You can also batch candidates into a single prompt rather than evaluating each one individually, which reduces the overhead of system prompt tokens and per-request latency.

A practical middle ground is to use AI re-ranking selectively. If Layer 2 produces a top result with a semantic similarity score above a high threshold, skip Layer 3 and return it directly. Only invoke the re-ranker when the top candidates are close in score and the system cannot confidently determine which result is best. This reduces re-ranking costs by sixty to eighty percent in practice while preserving quality for ambiguous queries.
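The confidence gate described above can be sketched as a pure check on the Layer 2 scores. The threshold and margin values are illustrative assumptions:

```typescript
// Decide whether the semantic results are confident enough to skip
// the expensive re-ranking layer. Thresholds are illustrative.
function shouldSkipRerank(
  semanticScores: number[], // sorted descending
  highConfidence = 0.85, // top result is clearly relevant...
  minMargin = 0.1 // ...and clearly ahead of the runner-up
): boolean {
  if (semanticScores.length === 0) return false
  const [top, second = 0] = semanticScores
  return top >= highConfidence && top - second >= minMargin
}

// A clear winner skips Layer 3; a close race invokes the re-ranker
shouldSkipRerank([0.92, 0.71, 0.68]) // true: confident, wide margin
shouldSkipRerank([0.88, 0.86, 0.84]) // false: candidates too close to call
```

The gate is cheap to evaluate, so it can sit in front of every call to the re-ranker without affecting latency for the queries that do need Layer 3.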

Measure your search quality with real user queries, not synthetic benchmarks. The gap between what test queries reveal and what production traffic exposes is where search systems fail.

Search is not a feature you build once. It is a system you tune continuously against real user behavior. The three-layer architecture gives you the knobs to turn: adjust the recall threshold in Layer 1, tune the similarity cutoff in Layer 2, refine the scoring prompt in Layer 3. Each adjustment has a measurable impact on the quality of results and the cost of serving them.

Start with Layer 1. Add Layer 2 when keyword matching is not enough. Add Layer 3 when precision matters more than speed. And measure everything -- because in search, the only opinion that matters is whether users are finding what they need.

Key Takeaways

1. Single-layer search fails because no one retrieval method handles both precision and recall. Combine traditional, semantic, and AI re-ranking into a progressive funnel.

2. Traditional search with BM25 provides fast, cheap, broad recall. Use it to retrieve the initial candidate set of one to two hundred documents.

3. Semantic search with vector embeddings handles vocabulary mismatch and intent matching. Use it to filter candidates down to the top twenty to thirty.

4. AI re-ranking uses an LLM to score results against actual query intent. It is expensive but dramatically improves precision on the final result set.

5. Control costs by narrowing the candidate set at each layer. Never run the expensive method on the full corpus. Use selective re-ranking to skip Layer 3 when confidence is already high.

