
Anatomy of a System Prompt

Dissecting a production system prompt, section by section.

The Prompt Engineering Project · March 12, 2025 · 14 min read

Quick Answer

Effective system prompt design follows a layered architecture: identity and role definition first, then behavioral constraints, output format specifications, context handling rules, and finally edge-case instructions. Each section serves a distinct function. Well-structured system prompts reduce hallucination, improve consistency, and make outputs predictable across thousands of interactions.

A system prompt is the most important piece of code in any LLM-powered application. It is the interface contract between your intent and the model's behavior. It defines personality, sets boundaries, specifies output formats, and handles the edge cases that separate a demo from a production system. Yet most system prompts are written as afterthoughts -- a paragraph of instructions tacked onto the beginning of a conversation and never revisited.

This article dissects a production-grade system prompt section by section. We will examine eight distinct components, explain why each exists, show real code examples, and identify the failure modes that emerge when any section is missing or poorly written. By the end, you will have a mental model for structuring system prompts that holds up under the pressure of real-world inputs.

The prompt we are dissecting is for a hypothetical code review assistant -- a tool that analyzes pull requests and provides structured feedback. This is a realistic production use case with enough complexity to exercise every section of a well-designed system prompt.

The Eight Sections

Before we dive into each section, here is the architectural overview. A production system prompt is not a monolithic block of text. It is a structured document with distinct sections, each serving a specific purpose. The ordering matters -- models attend to information differently based on position, and earlier sections establish context that later sections depend on.

system-prompt-structure.txt
SYSTEM PROMPT ARCHITECTURE
==========================

[1] ROLE DEFINITION
    Who the AI is and is not

[2] CONTEXT BOUNDARIES
    What the AI knows and does not know

[3] BEHAVIORAL CONSTRAINTS
    Rules, guardrails, things to never do

[4] OUTPUT FORMAT SPECIFICATION
    JSON, markdown, structured output requirements

[5] EXAMPLE INJECTION (Few-Shot)
    Concrete examples of desired behavior

[6] EDGE CASE HANDLING
    What to do when uncertain or outside scope

[7] TONE AND STYLE
    Voice, formality, personality parameters

[8] TOOL USE INSTRUCTIONS
    When and how to use available tools

Each section builds on the previous ones. Role definition establishes identity. Context boundaries establish knowledge. Constraints establish limits. Formats establish structure. Examples establish patterns. Edge cases establish fallbacks. Tone establishes voice. Tool instructions establish capabilities. Remove any one section and the prompt degrades predictably.
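The layering described above can be made concrete in the application code that assembles the prompt. A minimal sketch in Python, assuming each section is authored as a separate versioned string (the section names and the helper function are illustrative, not a prescribed API):

```python
# Canonical section order for a production system prompt.
# Each section is authored, reviewed, and versioned independently.
SECTION_ORDER = [
    "role_definition",
    "context_boundaries",
    "behavioral_constraints",
    "output_format",
    "few_shot_examples",
    "edge_case_handling",
    "tone_and_style",
    "tool_use_instructions",
]

def assemble_system_prompt(sections: dict[str, str]) -> str:
    """Concatenate sections in the canonical order, failing loudly
    if any section is missing rather than silently degrading."""
    missing = [name for name in SECTION_ORDER if name not in sections]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return "\n\n".join(sections[name].strip() for name in SECTION_ORDER)
```

Keeping the order in one place means a removed or reordered section is a loud failure at build time, not a silent behavior change in production.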

1. Role Definition

The role definition is the identity layer of your system prompt. It answers a deceptively simple question: who is this AI? The answer should be specific enough to constrain behavior but general enough to handle the breadth of expected inputs. A vague role produces vague outputs. An overly narrow role produces brittle refusals.

Critically, a good role definition includes both what the AI is and what it is not. The negative definition prevents the model from drifting into adjacent behaviors that feel helpful but violate the system's purpose. A code review assistant should not start writing code. A medical information system should not start diagnosing patients. Defining the boundary explicitly is more reliable than hoping the model infers it.

role-definition.txt
You are CodeReview, a senior code review assistant specialized in
TypeScript and React codebases. You analyze pull requests and provide
structured, actionable feedback focused on correctness, performance,
security, and maintainability.

You ARE:
- A code reviewer who identifies issues and suggests improvements
- An educator who explains WHY something is a problem, not just WHAT
- A prioritizer who distinguishes critical issues from style preferences

You ARE NOT:
- A code generator. Do not write implementation code unless explicitly
  asked for a specific fix suggestion.
- A project manager. Do not comment on timelines, scope, or priorities.
- A style enforcer for subjective preferences. Flag only patterns that
  impact correctness, performance, or maintainability.

The "You ARE / You ARE NOT" pattern is one of the most effective structures for role definition. It gives the model both a positive identity to lean into and negative boundaries to respect. Models follow explicit negations more reliably than implied ones.

2. Context Boundaries

Context boundaries define the model's epistemic state -- what it knows, what it does not know, and how it should handle the gap. This section prevents two common failure modes: hallucinating knowledge the model does not have, and refusing to help with things it actually can help with.

In a production system, context boundaries interact with your retrieval pipeline. If you inject documentation, codebase context, or user data into the prompt, the context boundaries section tells the model what that injected context represents and how to treat it. Without this, models make unpredictable assumptions about the provenance and reliability of the information they receive.

context-boundaries.txt
CONTEXT YOU HAVE ACCESS TO:
- The full diff of the current pull request (provided below)
- The repository's TypeScript configuration and ESLint rules
- File-level context for each changed file (surrounding 50 lines)
- The PR description and any linked issue descriptions

CONTEXT YOU DO NOT HAVE:
- The full repository codebase (only changed files and their context)
- Runtime behavior or test results
- Previous review history or team conventions not in the config files
- The author's intent beyond what is stated in the PR description

IMPORTANT: If your review depends on information you do not have (e.g.,
"this might conflict with another module"), explicitly state the
assumption and mark the feedback as conditional.

The last paragraph is the most important. It establishes a protocol for how the model should behave at the boundary of its knowledge. Rather than silently guessing or silently omitting, it instructs the model to be transparent about uncertainty. This pattern -- explicitly scripting boundary behavior -- is what separates production prompts from prototypes.

3. Behavioral Constraints

Behavioral constraints are the guardrails of your system. They define what the AI must always do, what it must never do, and the operational limits within which it functions. If the role definition is the identity and the context boundaries are the knowledge, constraints are the rules of engagement.

Effective constraints are specific, testable, and ordered by severity. A constraint that says "be helpful" is useless -- it is too vague to influence behavior. A constraint that says "never suggest removing error handling code, even if it appears redundant" is specific enough to be actionable and testable enough to be evaluated.

behavioral-constraints.txt
RULES (ordered by priority):

CRITICAL (never violate):
- Never approve code that contains known security vulnerabilities
  (SQL injection, XSS, path traversal, etc.)
- Never suggest removing error handling, input validation, or
  authentication checks
- Never include secrets, API keys, or credentials in any output
- Always flag breaking changes to public APIs

IMPORTANT (violate only with explicit justification):
- Limit each review to a maximum of 10 issues to avoid overwhelming
  the author
- Prioritize correctness bugs over style issues
- When multiple issues exist in the same block, group them into a
  single comment

GUIDELINES (follow unless context suggests otherwise):
- Prefer suggesting specific fixes over describing the problem abstractly
- Reference documentation or established patterns when available
- Acknowledge what the PR does well before listing issues

The three-tier constraint model -- critical, important, guidelines -- gives the model a decision framework for resolving conflicts between constraints. Without explicit priority ordering, models resolve constraint conflicts unpredictably, sometimes following the most recently stated rule, sometimes the most specific one, and sometimes neither.
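One way to keep the tiers reviewable is to store the rules as data and render the prompt text from them, so constraints can be diffed and code-reviewed like any other source. A sketch, with hypothetical names and rules abbreviated from the example above:

```python
# Constraint tiers as data; rendering produces the prompt text.
# Tier labels mirror the example above; rule lists are abbreviated.
TIERS = {
    "CRITICAL (never violate)": [
        "Never include secrets, API keys, or credentials in any output",
        "Always flag breaking changes to public APIs",
    ],
    "IMPORTANT (violate only with explicit justification)": [
        "Limit each review to a maximum of 10 issues",
    ],
    "GUIDELINES (follow unless context suggests otherwise)": [
        "Prefer suggesting specific fixes over abstract descriptions",
    ],
}

def render_constraints(tiers: dict[str, list[str]]) -> str:
    """Render the tiered rules into the prompt's plain-text format."""
    lines = ["RULES (ordered by priority):", ""]
    for tier, rules in tiers.items():
        lines.append(f"{tier}:")
        lines.extend(f"- {rule}" for rule in rules)
        lines.append("")
    return "\n".join(lines).rstrip()
```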

4. Output Format Specification

Output format specification is where prompt engineering intersects with systems engineering. Your downstream code needs to parse the model's output. If that output is free-form text, parsing is fragile. If it is structured data with a defined schema, parsing is reliable. This section defines the schema.

The tension in output formatting is between strictness and quality. Overly rigid schemas can cause models to prioritize format compliance over reasoning quality -- producing perfectly structured outputs that contain shallow analysis. The solution is to structure the output at the right level of granularity: constrain the shape of the data, but leave room for the model to express nuanced reasoning within each field.

output-schema.json
{
  "summary": "One-paragraph overall assessment of the PR",
  "verdict": "approve | request_changes | comment",
  "issues": [
    {
      "severity": "critical | warning | suggestion",
      "file": "path/to/file.ts",
      "line": 42,
      "title": "Short issue title (under 10 words)",
      "description": "Detailed explanation of the issue and WHY it matters",
      "suggestion": "Specific code or approach to fix the issue (optional)"
    }
  ],
  "positives": [
    "Things the PR does well (include at least one)"
  ]
}

Notice the schema includes both constrained fields (severity as an enum, line as a number) and open fields (description as free text). This hybrid approach gets the benefits of structure where parsing needs it and the benefits of natural language where reasoning needs it. The comments within the schema itself act as inline formatting instructions -- a technique that is surprisingly effective because models process the schema as part of their instructions.
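On the consuming side, the schema only pays off if the application actually verifies it. A minimal stdlib-only validation sketch, assuming the output has already been parsed from JSON (the function name is hypothetical; the enum values match the schema above):

```python
# Enum values from the output schema.
VERDICTS = {"approve", "request_changes", "comment"}
SEVERITIES = {"critical", "warning", "suggestion"}

def validate_review(review: dict) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    if review.get("verdict") not in VERDICTS:
        errors.append(f"invalid verdict: {review.get('verdict')!r}")
    for i, issue in enumerate(review.get("issues", [])):
        if issue.get("severity") not in SEVERITIES:
            errors.append(f"issues[{i}]: invalid severity")
        if not isinstance(issue.get("line"), int):
            errors.append(f"issues[{i}]: line must be an integer")
    if not review.get("positives"):
        errors.append("positives must include at least one entry")
    return errors
```

Validating the constrained fields (enums, integers) while leaving the free-text fields untouched mirrors the hybrid schema design: structure is checked where parsing depends on it, and natural language is left alone.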

5. Example Injection (Few-Shot)

Few-shot examples are the most powerful calibration tool in prompt engineering. Instructions tell the model what to do. Examples show the model what "doing it well" looks like. When instructions and examples conflict, models tend to follow the examples. This makes examples both powerful and dangerous -- they calibrate behavior precisely, but a poorly chosen example can override explicit instructions.

The key principles of effective few-shot injection: examples should be representative of the real input distribution, not cherry-picked easy cases. They should demonstrate edge case handling, not just the happy path. They should be minimal -- showing exactly the behavior you want without unnecessary complexity. And they should match the output format specification exactly, reinforcing the schema through demonstration.

few-shot-example.txt
EXAMPLE INPUT:
---
// PR diff: auth/middleware.ts
- if (token) {
+ if (token && token !== "undefined") {
    const user = await verifyToken(token)
    req.user = user
  }
---

EXAMPLE OUTPUT:
{
  "summary": "Fixes a token validation edge case where the string
    'undefined' was being treated as a valid token. The fix is
    directionally correct but could be more robust.",
  "verdict": "request_changes",
  "issues": [
    {
      "severity": "warning",
      "file": "auth/middleware.ts",
      "line": 2,
      "title": "String comparison is fragile for token validation",
      "description": "Checking against the literal string 'undefined'
        fixes the immediate bug but does not protect against other
        invalid-but-truthy strings such as 'null'.
        A more robust check would validate the token format.",
      "suggestion": "Consider: if (token && typeof token === 'string'
        && token.length > 0 && token !== 'undefined' && token !== 'null')
        or, better, validate against an expected token format (e.g.,
        JWT regex pattern)."
    }
  ],
  "positives": [
    "Identified a real authentication edge case that could cause
    unauthorized access with a malformed token."
  ]
}

A single well-chosen example is worth more than a hundred words of instruction. Examples anchor model behavior in concrete patterns rather than abstract rules.

The number of examples matters. One example establishes the pattern. Two examples confirm it. Three or more examples begin to consume context budget that may be better spent elsewhere. For most production use cases, one to two carefully crafted examples provide the best balance of calibration quality and token efficiency.
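Examples can live inside the system prompt text, as above, or be injected as a synthetic prior exchange in the message list. A sketch of the latter approach, assuming a generic chat-style message format (`build_messages` is a hypothetical helper):

```python
def build_messages(system_prompt: str, example_input: str,
                   example_output: str, real_input: str) -> list[dict]:
    """Inject one few-shot pair as a synthetic prior exchange,
    then append the real input as the final user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_input},
        {"role": "assistant", "content": example_output},
        {"role": "user", "content": real_input},
    ]
```

Keeping examples in the message list rather than the system prompt makes them easy to swap per use case without touching the prompt itself, at the cost of the model treating them as conversation history rather than instructions.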

6. Edge Case Handling

Edge case handling is the section most prompts are missing and most prompts need. It defines behavior for situations that fall outside the normal operating parameters: ambiguous inputs, conflicting instructions, missing context, requests that are adjacent to but outside the defined scope. Without explicit edge case handling, models improvise -- and improvisation in production is another word for unpredictable behavior.

The structure of this section follows a conditional pattern: "When X happens, do Y." Each condition should be specific enough to match real situations and each response should be concrete enough to produce consistent behavior across invocations.

edge-case-handling.txt
EDGE CASES:

When the diff is too large to analyze completely:
- Focus on the files with the highest risk (auth, database, API routes)
- State explicitly: "This review covers [N] of [M] changed files,
  prioritized by risk. The remaining files were not reviewed."

When you are unsure whether something is a bug or intentional:
- Flag it as a "question" rather than an "issue"
- Frame it as: "Is this intentional? If so, a comment explaining
  the reasoning would help future maintainers."

When the PR contains changes in a language or framework you lack
deep expertise in:
- Limit feedback to general principles (error handling, naming, structure)
- State the limitation: "I have limited context for [framework]-specific
  patterns. Consider requesting a specialist review for these files."

When the PR description is empty or unhelpful:
- Note this as an issue: "The PR description does not explain the
  motivation or scope of these changes."
- Proceed with review based on the code diff alone, but flag
  any assumptions you are making.

Each of these edge cases was identified from real production failures. The "too large to analyze" case prevents the model from silently truncating its review. The "unsure" case prevents false positives. The "unknown framework" case prevents hallucinated expertise. The "empty description" case handles sloppy but common input. Every edge case you do not handle explicitly becomes a case the model handles implicitly -- and implicit handling is unreliable handling.
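The "too large to analyze" case can also be enforced upstream, before the diff ever reaches the model. A sketch that orders changed files high-risk first and truncates to a budget; the risk prefixes are illustrative, not a prescribed list:

```python
# Path prefixes treated as high-risk, mirroring the edge-case rule
# above (auth, database, API routes). Illustrative values.
HIGH_RISK_PREFIXES = ("auth/", "db/", "api/")

def prioritize_files(files: list[str], limit: int) -> list[str]:
    """Order changed files high-risk first (stable within each group),
    then truncate to `limit` files for the review prompt."""
    ranked = sorted(files, key=lambda f: not f.startswith(HIGH_RISK_PREFIXES))
    return ranked[:limit]
```

The application then knows exactly which files were dropped, so the "[N] of [M] changed files" disclosure can be stated as fact rather than left to the model's self-report.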

7. Tone and Style

Tone and style control how the model communicates, not what it communicates. In code review, tone is particularly important -- reviews that are technically correct but condescending or aggressive damage team dynamics. Reviews that are friendly but vague fail to communicate the severity of real issues. The tone section calibrates the balance.

Effective tone instructions work on two levels: they describe the desired voice in abstract terms ("direct but respectful") and they provide concrete linguistic patterns that embody that voice. Abstract descriptions set the direction. Concrete patterns anchor the execution.

tone-and-style.txt
VOICE AND TONE:

Style: Direct, technical, respectful. Write like a senior engineer
talking to a peer, not a teacher grading a student.

DO:
- Use "Consider..." or "This could be improved by..." rather than
  "You should..." or "You need to..."
- Explain the reasoning: "This allocates on every render because..."
  not just "This is inefficient"
- Acknowledge tradeoffs: "This is simpler but trades off [X]"

DO NOT:
- Use exclamation marks
- Use hedging language ("maybe", "perhaps", "I think")
  -- be direct about your assessment
- Use emoji or decorative formatting
- Apologize for providing feedback ("Sorry, but..." / "I hate to
  say this, but...")

SEVERITY LANGUAGE:
- Critical: "This will cause [specific failure mode]."
- Warning: "This could lead to [specific risk] under [conditions]."
- Suggestion: "Consider [alternative] for [specific benefit]."

The "severity language" sub-section is a particularly useful pattern. By providing sentence templates for each severity level, you ensure that the model's language accurately signals the importance of each issue. Without this, models tend to use uniformly alarming language for all issues or uniformly tentative language -- both of which erode the reader's ability to triage.

8. Tool Use Instructions

When your system provides tools -- functions the model can call to retrieve data, execute actions, or interact with external systems -- the system prompt must include explicit instructions for when and how to use them. Without tool use instructions, models make their own decisions about tool invocation, which leads to over-calling (wasting latency and cost), under-calling (missing information they need), or calling tools with incorrect parameters.

Tool use instructions should cover three dimensions: when to use each tool (the trigger), how to use it (the parameters), and what to do with the result (the integration).

tool-use-instructions.txt
AVAILABLE TOOLS:

1. fetch_file_context(filepath: string, lines: [start, end])
   WHEN: You need to see code surrounding the diff to understand
         the change in context.
   HOW:  Request the specific file and line range. Keep ranges
         under 100 lines. Prefer targeted requests over broad ones.
   THEN: Incorporate the context into your analysis. Cite specific
         line numbers when referencing the surrounding code.

2. check_type_definitions(type_name: string)
   WHEN: A type is referenced in the diff but its definition is not
         visible, and understanding the type is necessary for your review.
   HOW:  Pass the exact type name as it appears in the code.
   THEN: Use the type definition to validate that the code handles
         all fields and variants correctly.

3. search_similar_patterns(code_pattern: string)
   WHEN: You want to check if a pattern used in the PR is consistent
         with the rest of the codebase.
   HOW:  Pass a normalized version of the pattern (remove variable names,
         keep structure).
   THEN: Note whether the pattern is consistent or divergent. If
         divergent, mention it as a style issue, not a bug.

TOOL USAGE LIMITS:
- Maximum 5 tool calls per review
- Always explain why you are calling a tool before calling it
- If a tool call fails, proceed with your best analysis and note
  the limitation

Tool use instructions are where prompt engineering and system architecture intersect most directly. The quality of your tool definitions, parameter schemas, and usage instructions determines whether tool use helps or hurts the model's performance. Poorly documented tools are worse than no tools at all -- they invite errors that contaminate the entire output.
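Limits like "maximum 5 tool calls per review" are more reliable when the application loop enforces them rather than trusting the prompt alone. A sketch of such a loop, assuming hypothetical `call_model` and `execute_tool` callables that exchange simple dicts:

```python
MAX_TOOL_CALLS = 5  # mirrors the limit stated in the prompt

def run_review(call_model, execute_tool, initial_messages):
    """Drive a tool-use loop, cutting off tool access after the limit.
    `call_model` returns either a tool request or a final answer."""
    messages = list(initial_messages)
    for _ in range(MAX_TOOL_CALLS):
        response = call_model(messages)
        if response.get("type") != "tool_call":
            return response  # final answer before the limit
        result = execute_tool(response["name"], response["args"])
        messages.append({"role": "tool", "content": result})
    # Limit reached: request a final answer with no further tool access.
    messages.append({"role": "system",
                     "content": "Tool limit reached; produce your review now."})
    return call_model(messages)
```

Stating the limit in the prompt shapes the model's planning; enforcing it in the loop guarantees the budget holds even when the model ignores the instruction.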

Putting It All Together

A complete system prompt is not a collection of independent sections. It is an integrated document where each section reinforces and depends on the others. The role definition establishes what the model is. The context boundaries tell it what it knows. The constraints tell it what it must and must not do. The format tells it how to structure its output. The examples show it what good looks like. The edge cases prepare it for the unexpected. The tone tells it how to communicate. And the tool instructions tell it what capabilities are available.

When assembling a production system prompt, follow this process: write each section independently, then read the complete prompt as a single document and check for contradictions, redundancies, and gaps. Test it against a diverse set of inputs, including adversarial ones. Measure the outputs against a rubric. Iterate. Version it. Treat it as code, because that is what it is.
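"Measure the outputs against a rubric" can be as simple as executable assertions run over a fixed input set. A minimal sketch, with a hypothetical `call_model` standing in for your model client and a rubric specialized to the review schema above:

```python
def score_against_rubric(output: dict) -> bool:
    """A rubric is just executable assertions about the output:
    schema compliance plus the prompt's own behavioral rules."""
    return (
        output.get("verdict") in {"approve", "request_changes", "comment"}
        and all(i.get("severity") in {"critical", "warning", "suggestion"}
                for i in output.get("issues", []))
        and len(output.get("issues", [])) <= 10  # the 10-issue limit
    )

def regression_suite(call_model, cases: list[str]) -> float:
    """Run every fixed test input through the model; return pass rate."""
    passed = sum(1 for case in cases if score_against_rubric(call_model(case)))
    return passed / len(cases)
```

Tracking the pass rate across prompt versions turns "iterate" from guesswork into a measurable regression check: a prompt change that drops the rate is a regression, regardless of how much better it reads.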

The sections presented here total approximately 800-1000 tokens. That is a significant portion of your context budget. But consider the alternative: a 200-token prompt that handles the happy path correctly and fails unpredictably on everything else. The investment in a thorough system prompt pays dividends in every interaction it governs.

A system prompt is not prose you write once and forget. It is an interface contract you design, test, version, and maintain -- just like any other critical piece of your system.


Key Takeaways

1. Structure system prompts into eight distinct sections: role definition, context boundaries, behavioral constraints, output format, few-shot examples, edge case handling, tone/style, and tool use instructions.

2. Use the "You ARE / You ARE NOT" pattern for role definitions and the three-tier priority model (critical, important, guidelines) for behavioral constraints.

3. Few-shot examples are the most powerful calibration tool available -- one well-chosen example anchors behavior more effectively than paragraphs of instructions.

4. Edge case handling is the most commonly missing section and the most important for production reliability. Every unhandled edge case is an implicit decision you are delegating to the model.

5. Treat the system prompt as code: version it, test it against diverse inputs, measure outputs against rubrics, and iterate based on failure analysis.
