
Writing Prompts for Claude vs GPT vs Gemini: What Transfers

Model-specific nuances and universal principles.

The Prompt Engineering Project March 4, 2025 12 min read

Quick Answer

Claude, GPT, and Gemini respond differently to the same prompts because they differ in training approach, instruction-following patterns, and behavioral tendencies. Claude tends to be more literal with instructions and better at following complex system prompts; GPT models may require more explicit constraint repetition; Gemini processes multimodal input natively. Writing model-specific prompts that account for these differences significantly improves output quality and reliability.

There is a comfortable fiction in prompt engineering that a well-written prompt works everywhere. Write clear instructions, provide good examples, specify the output format, and any model will produce the right result. This is approximately true for simple tasks and dangerously false for anything production-grade. The differences between Claude, GPT, and Gemini are not minor implementation details -- they are architectural distinctions that change how prompts should be structured, what instructions the model prioritizes, and where the failure modes live.

This article maps the territory. It identifies the prompting principles that genuinely transfer across all major models, the techniques that are model-specific, and the practical strategies for teams that need to support multiple models without maintaining entirely separate prompt codebases. The goal is not model advocacy -- it is engineering clarity about what works where.

What Transfers Across Every Model

Some prompting principles are grounded in how language models work at a fundamental level, not in any specific model's training or architecture. These transfer reliably across Claude, GPT-4, Gemini, Llama, Mistral, and every other model that shares the transformer architecture.

Clear, specific instructions outperform vague ones. This is universal because it is a property of language itself, not any particular model. "Summarize this document in three bullet points, each under 20 words" produces better results than "summarize this" on every model, because specificity reduces the space of valid completions. The model has less room to interpret your intent creatively, which means less room to interpret it wrong.

Few-shot examples improve consistency. Providing two to five examples of the input-output pattern you expect anchors the model's behavior more reliably than instruction alone. This works because in-context learning -- the model's ability to generalize from examples in the prompt -- is a fundamental capability of transformer-based models, not a feature of any specific implementation.
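The pattern above can be sketched as a small prompt-assembly helper. Everything here is illustrative -- the function name, the `Input:`/`Output:` template, and the sentiment examples are assumptions, not part of any SDK:

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Anchor model behavior with input/output demonstrations."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts.append(f"Input: {inp}")
        parts.append(f"Output: {out}")
        parts.append("")
    # End on a bare "Output:" so the model completes the pattern.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("Great battery life, fast shipping.", "positive"),
        ("Broke after two days.", "negative"),
    ],
    "Works fine, but the manual is useless.",
)
```

The same assembled string can be sent to any of the three models; only the surrounding structure (XML tags, system message placement) changes per model.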

Structured output instructions improve format compliance. Asking for JSON, specifying field names, and describing the expected structure works on every model. The compliance rate varies -- some models are more format-reliable than others -- but the technique itself transfers perfectly.

Constraints reduce errors. Telling the model what not to do is as important as telling it what to do. "Do not include information not present in the provided context." "Do not use technical jargon." "Do not exceed 200 words." Negative constraints narrow the output distribution on every model, reducing the probability of undesirable outputs.

The principles that transfer are the principles rooted in how language works, not how any particular model was trained.

Claude: XML Tags and Literal Compliance

Claude, built by Anthropic, has distinctive characteristics that reward specific prompting strategies. The most significant is its affinity for XML-structured prompts. Claude was trained to recognize XML tags as structural delimiters, and using them produces measurably better instruction following than equivalent prompts using markdown headers or plain text sections.

Claude-optimized prompt structure

xml
<system>
You are a senior financial analyst. You provide data-driven
analysis based exclusively on the provided documents.

<constraints>
- Only reference data from the provided <documents> section
- If the data needed to answer is not in the documents, say so
- Use specific numbers and cite which document they come from
- Never speculate beyond what the data supports
</constraints>

<output_format>
Respond in JSON with the following structure:
{
  "analysis": "Your analysis text",
  "data_points": ["Specific numbers cited"],
  "confidence": "high | medium | low",
  "sources": ["Document names referenced"]
}
</output_format>

<tone>
Professional, precise, conservative. Qualify uncertain
conclusions explicitly.
</tone>
</system>

<documents>
{{retrieved_documents}}
</documents>

<query>
{{user_question}}
</query>

Claude also demonstrates notably literal instruction following. When you tell Claude "never use bullet points," it will not use bullet points -- even in situations where bullet points would clearly be the best format. This literalness is a feature when your constraints are well-designed and a liability when they are overly broad. Prompt engineers working with Claude need to be precise about scope: instead of "never use bullet points," write "in the main analysis section, use paragraph form rather than bullet points. Bullet points are acceptable in the data_points array."

Claude's extended thinking capability -- available in certain model tiers -- is another differentiator. When enabled, Claude performs chain-of-thought reasoning in a dedicated thinking block before producing the visible response. This is architecturally different from asking GPT or Gemini to "think step by step," because the thinking happens in a separate token space that the user can inspect or hide. Prompts designed for Claude's extended thinking should frame complex tasks as problems that benefit from deliberation, and should avoid requesting step-by-step reasoning in the output format (since the thinking block already provides it).
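As a concrete sketch, a Messages API request with extended thinking enabled might look like the payload below. The `thinking` block with a token budget follows Anthropic's documented shape as of early 2025, but the model name is an assumption -- verify both against current documentation:

```python
# Sketch of an Anthropic Messages API request payload with extended
# thinking enabled. Model name is assumed; check current docs.
request = {
    "model": "claude-3-7-sonnet-latest",
    "max_tokens": 4096,                 # must exceed the thinking budget
    "thinking": {
        "type": "enabled",
        "budget_tokens": 2048,          # tokens reserved for deliberation
    },
    "messages": [
        {"role": "user", "content": "Analyze the attached financials."}
    ],
}
```

Note that the visible response no longer needs a "show your reasoning" instruction -- the reasoning lives in the inspectable thinking block instead.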

Claude handles long contexts exceptionally well. If you are working with documents over 50,000 tokens, Claude's ability to attend to information throughout the full context window -- not just the beginning and end -- is a meaningful advantage over models that exhibit stronger "lost in the middle" effects.

GPT: System Message Authority and Creative Interpretation

GPT-4 and its variants (GPT-4 Turbo, GPT-4o) have a distinctive relationship with the system message. OpenAI's models treat the system message as having higher authority than user messages -- instructions in the system message are more resistant to being overridden by user-level requests. This makes system message design particularly important for GPT-based applications.

In practice, this means that the most critical behavioral constraints should live in the system message, not in the user prompt. If you need the model to never reveal its system prompt, never produce harmful content, or always respond in a specific format, those instructions are most effective in the system message. User-level reminders can reinforce them but should not be the primary enforcement mechanism.

GPT-optimized prompt structure

json
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior financial analyst. You provide data-driven analysis based exclusively on provided documents.\n\nCRITICAL RULES (never override):\n1. Only reference data from documents provided in this conversation\n2. If data needed to answer is not in the documents, state this clearly\n3. Use specific numbers and cite source documents\n4. Never speculate beyond what data supports\n\nOUTPUT FORMAT:\nAlways respond with valid JSON:\n{\n  \"analysis\": \"string\",\n  \"data_points\": [\"string\"],\n  \"confidence\": \"high | medium | low\",\n  \"sources\": [\"string\"]\n}\n\nTONE: Professional, precise, conservative. Qualify uncertain conclusions."
    },
    {
      "role": "user",
      "content": "Based on the following documents, answer my question.\n\nDocuments:\n{{retrieved_documents}}\n\nQuestion: {{user_question}}"
    }
  ]
}

GPT models also exhibit what might be called creative interpretation of instructions. Where Claude tends toward literal compliance, GPT tends to infer intent behind instructions and act on the inferred intent. If you ask GPT to "keep responses brief," it may decide that brevity means omitting caveats and qualifications -- not because you asked it to, but because it interpreted "brief" as a higher priority than "thorough." This behavior makes GPT prompts require more explicit priority ordering: "Keep responses under 200 words, but always include caveats for uncertain conclusions even if this increases length."

OpenAI's function calling implementation is the most mature in the industry. GPT models make highly reliable tool-use decisions when given well-described function schemas. If your application relies heavily on function calling, GPT's tool-use reliability is a genuine competitive advantage that may outweigh other model differences.
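A well-described schema is most of the work. The sketch below shows the shape of a tool definition for the Chat Completions `tools` parameter -- the `type`/`function`/`parameters` structure with JSON Schema is OpenAI's documented format, but the function itself (`get_stock_price`) is hypothetical:

```python
# Sketch of an OpenAI function-calling tool definition.
# The get_stock_price function is hypothetical.
get_stock_price_tool = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest closing price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Exchange ticker, e.g. 'AAPL'",
                },
            },
            "required": ["ticker"],
        },
    },
}

# Passed to the API roughly as:
#   client.chat.completions.create(..., tools=[get_stock_price_tool])
```

The `description` fields matter more than they look: GPT decides when to call the tool largely from them, so they deserve the same care as any other instruction.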

Gemini: Multimodal Native and System Instructions

Google's Gemini models are architecturally multimodal -- they process text, images, audio, and video through the same model rather than through bolted-on adapters. This has practical implications for prompting. When you send an image to Gemini, the model has genuine visual understanding, not just OCR or image captioning layered on top of a text model. Prompts that reference specific visual elements, spatial relationships, or image-text interactions perform materially better on Gemini than on models where vision is an add-on capability.

Gemini's system instruction mechanism is simpler than Claude's or GPT's. System instructions are set once at the beginning of a conversation and cannot be modified mid-conversation. This is actually a security advantage -- it prevents prompt injection attacks that attempt to override system instructions through user messages. But it means your system instructions need to be comprehensive enough to handle the full range of expected interactions without runtime modification.

Gemini-optimized prompt structure

python
import google.generativeai as genai

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    system_instruction="""You are a senior financial analyst.
You provide data-driven analysis based exclusively on provided
documents.

Rules:
- Only reference data from provided documents
- If needed data is absent, state this clearly
- Use specific numbers with source citations
- Never speculate beyond data support
- Respond in JSON format:
  {"analysis": "", "data_points": [], "confidence": "", "sources": []}
- Tone: professional, precise, conservative""",
)

# Gemini handles multimodal inputs natively
response = model.generate_content([
    "Analyze the financial data in this document and chart.",
    document_text,       # Text content
    chart_image,         # PIL Image or file
    f"Question: {user_question}",
])

Gemini also has distinctive behavior around grounding. Google's grounding features allow Gemini to verify its responses against Google Search results, which can reduce hallucination for factual queries. When prompting Gemini for factual tasks, leveraging grounding produces more reliable outputs than relying solely on the model's parametric knowledge -- a capability that Claude and GPT do not offer natively.

One notable difference: Gemini's context window varies significantly across model tiers. Gemini 1.5 Pro supports up to 2 million tokens -- more than an order of magnitude larger than GPT-4's 128K maximum. If your use case involves processing very long documents, entire codebases, or hours of video, Gemini's context capacity may be the deciding factor regardless of other model differences.

Model Capability Comparison

The following comparison captures the practical differences that matter most for prompt engineering decisions. These are not theoretical capabilities -- they are the observable behaviors that change how you should structure prompts for each model.

text
Capability           | Claude           | GPT-4            | Gemini
─────────────────────┼──────────────────┼──────────────────┼──────────────────
Instruction style    | XML tags         | System msg focus | Flat text / rules
Instruction follow.  | Literal          | Interpretive     | Moderate
Context window       | 200K tokens      | 128K tokens      | Up to 2M tokens
Long-context quality | Excellent        | Good (weaker in  | Good (varies by
                     |                  | the middle)      | model tier)
Structured output    | Good (XML/JSON)  | Excellent (JSON) | Good (JSON)
Function calling     | Good             | Excellent        | Good
Multimodal           | Images + docs    | Images + audio   | Native multi-modal
Extended thinking    | Built-in (paid)  | Via prompting    | Via prompting
Grounding            | Not native       | Not native       | Google Search
Tone consistency     | Very consistent  | Somewhat drifts  | Consistent
Safety behavior      | Conservative     | Moderate         | Moderate
Best for             | Long docs,       | Tool use,        | Multimodal,
                     | precise tasks,   | creative tasks,  | very long context,
                     | literal follow.  | system msg apps  | grounded answers

These comparisons reflect the state of each model family as of early 2025. Model capabilities change rapidly -- verify against current documentation before making architectural decisions.

The Same Task, Three Prompt Styles

To make the differences concrete, here is the same task -- extracting structured data from a product review -- optimized for each model. The task is identical. The prompt structure is adapted to each model's strengths.

Claude Version

xml
<task>
Extract structured data from the following product review.
</task>

<review>
{{review_text}}
</review>

<output_requirements>
Return valid JSON with these fields:
- sentiment: "positive" | "negative" | "mixed"
- rating_implied: number 1-5 (infer from tone if not stated)
- key_pros: string[] (max 3)
- key_cons: string[] (max 3)
- purchase_intent: boolean (would the reviewer buy again?)
- summary: string (one sentence, under 25 words)
</output_requirements>

<constraints>
- Only extract information explicitly stated or clearly implied
- If a field cannot be determined, use null
- Do not infer pros/cons that the reviewer did not mention
</constraints>

GPT Version

text
[System message]
You extract structured data from product reviews. Always respond
with valid JSON matching the specified schema. Never include
explanatory text outside the JSON object.

SCHEMA:
{
  "sentiment": "positive | negative | mixed",
  "rating_implied": "number 1-5, infer from tone if not stated",
  "key_pros": ["max 3 items"],
  "key_cons": ["max 3 items"],
  "purchase_intent": "boolean, would they buy again?",
  "summary": "one sentence, under 25 words"
}

RULES:
- Only extract explicitly stated or clearly implied information
- Use null for undeterminable fields
- Do not infer pros/cons not mentioned by the reviewer

[User message]
Extract data from this review:

{{review_text}}

Gemini Version

text
[System instruction]
You extract structured data from product reviews. Respond only
with valid JSON. Use null for fields that cannot be determined
from the review text.

[User message]
Extract structured data from this product review. Return JSON
with these exact fields:

- sentiment: "positive", "negative", or "mixed"
- rating_implied: number 1-5 (infer from tone if not explicit)
- key_pros: array of up to 3 strings
- key_cons: array of up to 3 strings
- purchase_intent: true or false (would they buy again?)
- summary: one sentence, under 25 words

Rules: only extract explicitly stated or clearly implied info.
Do not infer unmentioned pros or cons.

Review:
{{review_text}}

The same information appears in all three prompts. The structural differences are deliberate: XML tags for Claude, heavy system message for GPT, and flatter instruction structure for Gemini. These are not cosmetic choices -- they align with how each model processes and prioritizes instructions.

Write Once, Test Everywhere

For teams that need to support multiple models -- whether for redundancy, cost optimization, or feature-specific routing -- the practical approach is to write a canonical prompt in a model-neutral format and maintain per-model adaptations as thin wrappers.

The canonical prompt captures the intent: the task description, the constraints, the output schema, and the examples. The per-model wrappers handle structural formatting: XML tags for Claude, system message emphasis for GPT, flat instructions for Gemini. When the intent changes, you update the canonical prompt. The wrappers propagate the change with minimal per-model adjustments.
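The canonical-plus-wrapper pattern can be sketched in a few lines. All names here (`CanonicalPrompt`, `to_claude`, `to_gpt_system`) are illustrative, not from any library:

```python
# Sketch: one canonical prompt, thin per-model structural wrappers.
from dataclasses import dataclass

@dataclass
class CanonicalPrompt:
    task: str
    constraints: list[str]
    output_schema: str

def to_claude(p: CanonicalPrompt) -> str:
    """Wrap the canonical prompt in Claude-style XML sections."""
    rules = "\n".join(f"- {c}" for c in p.constraints)
    return (f"<task>\n{p.task}\n</task>\n\n"
            f"<constraints>\n{rules}\n</constraints>\n\n"
            f"<output_format>\n{p.output_schema}\n</output_format>")

def to_gpt_system(p: CanonicalPrompt) -> str:
    """Emit a flat system message with numbered rules for GPT."""
    rules = "\n".join(f"{i}. {c}" for i, c in enumerate(p.constraints, 1))
    return (f"{p.task}\n\nCRITICAL RULES:\n{rules}\n\n"
            f"OUTPUT FORMAT:\n{p.output_schema}")

canonical = CanonicalPrompt(
    task="Extract structured data from product reviews.",
    constraints=["Only extract stated information.",
                 "Use null for unknown fields."],
    output_schema='{"sentiment": "", "summary": ""}',
)
```

When the task or constraints change, only `canonical` is edited; each wrapper re-renders the model-appropriate structure automatically.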

Testing must run against every supported model. A test suite that only validates against Claude tells you nothing about GPT compatibility. The minimum viable test matrix includes: format compliance (does the output match the schema?), accuracy on held-out examples (does the model get the right answers?), edge case behavior (does the model handle ambiguous inputs correctly?), and constraint compliance (does the model respect the specified boundaries?).
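The format-compliance leg of that matrix is the easiest to automate. Below is a minimal sketch; `call_model` is a hypothetical dispatch function you would replace with your actual provider clients, and the field names are taken from the extraction example earlier:

```python
# Sketch: cross-model format-compliance check. `call_model` is a
# hypothetical dispatcher (model_name, prompt) -> raw output string.
import json

REQUIRED_FIELDS = {"sentiment", "key_pros", "key_cons", "summary"}

def format_compliant(raw_output: str) -> bool:
    """Does the output parse as JSON and contain the schema fields?"""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

def run_matrix(call_model, prompt: str, models: list[str]) -> dict[str, bool]:
    """Run the same prompt on every supported model, record compliance."""
    return {m: format_compliant(call_model(m, prompt)) for m in models}

# Stubbed call for illustration:
fake = lambda model, prompt: (
    '{"sentiment": "positive", "key_pros": [], '
    '"key_cons": [], "summary": "ok"}'
)
results = run_matrix(fake, "extract...", ["claude", "gpt-4o", "gemini"])
```

Accuracy, edge-case, and constraint checks follow the same pattern: one assertion function, run against every model in the list.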

Prompt portability is not about identical prompts. It is about identical outcomes from model-appropriate prompts.

The decision of when to specialize versus when to generalize is ultimately economic. If a task runs exclusively on one model and there is no plan to switch, invest in model-specific optimization. If a task may need to run on multiple models -- because of cost changes, rate limits, or regional availability -- invest in the canonical-plus-wrapper approach. If you are uncertain, default to the portable approach. The cost of maintaining thin wrappers is far lower than the cost of rewriting a deeply model-specific prompt when you need to migrate.

The differences between models are real, but they are not random. They follow from architectural decisions, training methodologies, and design philosophies that each provider has documented to varying degrees. Understanding these differences does not require memorizing a compatibility matrix. It requires developing an intuition for how each model processes instructions, where it excels, and where it needs guardrails.

That intuition comes from testing. Not from reading about model differences -- from observing them firsthand on your specific tasks with your specific data. The comparison table in this article is a starting point, not a conclusion. Run the same task on all three models. Compare the outputs. Note where they diverge. Build your own understanding of which model fits which problem. That understanding is the most valuable model-specific knowledge you can develop.


Key Takeaways

1. Universal principles transfer across all models: clear instructions, few-shot examples, structured output specs, and explicit constraints. These work because they are properties of language, not any specific model.

2. Claude excels with XML-structured prompts and literal instruction following. GPT excels with system message authority and function calling. Gemini excels with native multimodal processing and massive context windows.

3. The same task should be prompted differently for each model. Structural formatting -- not content -- is what changes between model-specific prompt versions.

4. For multi-model support, write a canonical prompt capturing intent and maintain thin per-model wrappers for structural adaptation. Test against every supported model.

5. When to specialize vs generalize is an economic decision. Default to portable prompts unless single-model optimization delivers measurable gains on your specific task.

