
Chain-of-Thought: When to Use It and When It Hurts

More reasoning isn't always better reasoning.

The Prompt Engineering Project March 7, 2025 6 min read

Quick Answer

Chain-of-thought prompting instructs language models to show their reasoning step by step before producing a final answer. This technique dramatically improves accuracy on math, logic, multi-step reasoning, and complex analysis tasks. By making the model externalize its thinking process, you can identify reasoning errors, improve transparency, and guide the model toward more reliable conclusions. On simple tasks, however, it adds cost and latency without improving accuracy, and this article covers both sides.

Chain-of-thought prompting has become the default recommendation for improving AI output quality. Ask the model to "think step by step," and it reasons better. The research supports this -- on complex tasks. What the research also shows, and what most practitioners ignore, is that chain-of-thought prompting on simple tasks adds latency, burns tokens, and produces no measurable improvement in accuracy. Sometimes it makes accuracy worse.

The difference between knowing that chain-of-thought exists and knowing when to deploy it is the difference between a technique and a tool. Techniques are applied universally. Tools are selected based on the job. This article draws the line between the two categories and gives you a practical framework for deciding which side of that line your task falls on.

When Chain-of-Thought Helps

Chain-of-thought (CoT) prompting excels when the task requires the model to hold intermediate results in working memory, apply sequential logic, or navigate decision trees with multiple branches. The "thinking out loud" mechanism forces the model to commit to intermediate conclusions before reaching a final answer, which reduces the chance of logical shortcuts that skip critical steps.

Multi-step mathematical reasoning. A word problem that requires setting up an equation, solving it, and interpreting the result benefits enormously from CoT. Without it, the model attempts to jump directly from problem statement to answer, and the error rate on problems requiring more than two steps increases dramatically.

Complex code generation. When generating code that involves multiple functions, data transformations, or algorithmic logic, CoT prompting produces significantly better results. The model plans the approach, identifies edge cases, and structures the solution before writing code -- the same process a skilled developer follows.

Ethical and nuanced reasoning. Questions with trade-offs, competing values, or context-dependent answers benefit from explicit reasoning. CoT forces the model to articulate the factors it is weighing, which both improves the quality of the conclusion and makes the reasoning auditable.

Multi-hop question answering. When the answer requires combining information from multiple sources or reasoning through a chain of facts (A implies B, B implies C, therefore A implies C), CoT prevents the model from guessing at the final conclusion without verifying the intermediate links.

Chain-of-thought does not make the model smarter. It makes the model show its work -- and showing work catches errors before they reach the final answer.
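To make this concrete, here is a minimal sketch of the pattern in plain Python. No particular provider or SDK is assumed; `build_cot_prompt` and `parse_final_answer` are illustrative names, and the "Answer:" marker is one common convention for separating auditable reasoning from the result your application actually consumes.

```python
def build_cot_prompt(task: str) -> str:
    """Wrap a task in a chain-of-thought instruction.

    Asking for reasoning first, then a clearly marked final answer,
    makes the intermediate steps visible and the result easy to parse.
    """
    return (
        f"{task}\n\n"
        "Think step by step. Write out each intermediate step, "
        "then give the final answer on its own line as 'Answer: <result>'."
    )


def parse_final_answer(model_output: str) -> str:
    """Extract the final answer from a CoT response, ignoring the reasoning."""
    for line in reversed(model_output.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    # Fall back to the whole output if the model skipped the marker.
    return model_output.strip()
```

Send `build_cot_prompt(...)` to any chat model and parse the reply with `parse_final_answer`: the reasoning stays inspectable in logs while downstream code sees only the marked result.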

When Chain-of-Thought Hurts

Here is the part that most articles omit. Chain-of-thought prompting is actively counterproductive on a significant category of common tasks. Not neutral -- counterproductive. It adds latency, increases cost, and in some cases degrades accuracy by overthinking problems that have obvious answers.

Simple classification. Sentiment analysis, topic categorization, spam detection, language identification. These tasks have clear, direct mappings from input to output. When you ask a model to "think step by step" about whether "I love this product" is positive or negative, the model generates 50-200 tokens of unnecessary reasoning to arrive at the same answer it would have produced in one token. Worse, the reasoning occasionally talks itself into an incorrect answer by over-analyzing hedging language that does not exist.

Data extraction. Pulling names, dates, email addresses, or phone numbers from text is a pattern-matching task. CoT adds no value. The model either recognizes the pattern or it does not -- reasoning about why something looks like an email address does not help it extract the email address.

Format conversion. Converting CSV to JSON, reformatting dates, translating between markup languages. These are deterministic transformations. Asking the model to reason about them is like asking a calculator to show its work on 2 + 2.

Instead of

You are a sentiment classifier. Think step by step about the sentiment of the following review, considering tone, word choice, and context, then provide your classification.
Review: "The shipping was fast and the product works great."
Output: 147 tokens of reasoning + "positive"
Latency: 1.8s | Cost: ~$0.004

Try this

Classify the sentiment as positive, negative, or neutral. Reply with one word only.
Review: "The shipping was fast and the product works great."
Output: "positive"
Latency: 0.3s | Cost: ~$0.0002
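One practical wrinkle with the concise version: even when told to reply with one word, models occasionally add punctuation or a short preamble. A thin normalization layer keeps the no-CoT classifier robust without reintroducing reasoning tokens. This is an illustrative helper, not any library's API; the label set is the one from the prompt above.

```python
# Allowed labels from the one-word classification prompt.
VALID_LABELS = {"positive", "negative", "neutral"}


def normalize_label(reply: str) -> str:
    """Map a raw model reply like 'Positive.' to a canonical label."""
    words = reply.lower().replace(".", " ").replace(",", " ").split()
    for word in words:
        if word in VALID_LABELS:
            return word
    # Surface unexpected replies instead of silently guessing.
    raise ValueError(f"unrecognized label in reply: {reply!r}")
```

Raising on unrecognized replies is deliberate: in a pipeline doing 100,000 requests a day, you want malformed outputs counted and inspected, not quietly coerced.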

The Thinking Tax

Every chain-of-thought token has a cost. Not a metaphorical cost -- a literal one. Output tokens are typically 3-4x more expensive than input tokens across major API providers. When you ask a model to reason through 200 tokens before delivering a 10-token answer, you have increased the output cost by a factor of 20. At scale, this is not a rounding error.

3-4x: output token cost vs. input
200-500: typical CoT reasoning tokens
2-5x: latency increase with CoT
0%: accuracy gain on simple tasks

Consider a production system processing 100,000 classification requests per day. Without CoT, each request generates roughly 5 output tokens. With CoT, each generates 150. That is 14.5 million additional output tokens per day -- tokens that provide no accuracy improvement on a classification task. At GPT-4 pricing, that is hundreds of dollars per day in wasted compute. At Claude pricing, the numbers are similar. The thinking tax is real, it is measurable, and it compounds.

Latency is the other tax. Every generated token takes time. A response that could arrive in 300 milliseconds instead takes 1.5-2 seconds because the model is generating reasoning tokens that nobody reads. In user-facing applications, that latency difference is the difference between an interface that feels responsive and one that feels sluggish.

The Three-Second Rule

Here is a practical heuristic that holds up remarkably well across task types: if a competent human could answer the question correctly in under three seconds, chain-of-thought prompting will not improve the model's accuracy. It will only increase cost and latency.

"Is this email spam?" -- under three seconds. No CoT needed. "What is the capital of France?" -- under three seconds. No CoT needed. "Extract the date from this invoice" -- under three seconds. No CoT needed.

"Given these three conflicting data sources, which revenue figure should we use in the quarterly report and why?" -- that takes a human minutes of reasoning. CoT helps. "Write a function that merges two sorted linked lists while handling edge cases" -- that takes a human focused effort. CoT helps. "Evaluate whether this contract clause creates liability under GDPR Article 28" -- that takes a human significant analysis. CoT helps.

The three-second rule is a heuristic, not a law. Test it against your specific tasks. But when you are unsure whether CoT is worth it, this rule gets you to the right answer faster than A/B testing every prompt variant.

The rule works because it tracks cognitive complexity. Tasks that humans answer instantly are tasks with direct pattern matches -- the kind of tasks where models already perform well without scaffolding. Tasks that require human deliberation are tasks with intermediate reasoning steps -- exactly what CoT is designed to support.
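In a system that handles known task types, the rule can be operationalized as a simple router. The task-type taxonomy here is illustrative; in practice you would tag tasks at design time rather than guess at runtime.

```python
# Tasks a competent human answers in under ~3 seconds: skip CoT.
FAST_TASKS = {"sentiment", "spam_detection", "data_extraction", "format_conversion"}

# Tasks requiring human deliberation: CoT earns its cost.
SLOW_TASKS = {"math_word_problem", "code_generation", "legal_analysis", "multi_hop_qa"}


def wants_cot(task_type: str) -> bool:
    """Apply the three-second rule to a known task type."""
    if task_type in FAST_TASKS:
        return False
    if task_type in SLOW_TASKS:
        return True
    # Unknown task: default to CoT, then measure -- per the advice above
    # to test the heuristic against your specific workload.
    return True
```

Routing at the task-type level, rather than per request, keeps the decision cheap and auditable.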


Key Takeaways

1. Chain-of-thought prompting improves accuracy on complex reasoning, multi-step math, code generation, and nuanced analysis. It is not a universal improvement.

2. On simple classification, data extraction, and format conversion, CoT adds cost and latency with zero accuracy gain -- and occasionally reduces accuracy by overthinking.

3. The thinking tax is real: CoT can increase output token costs by 20x and latency by 2-5x. At production scale, this costs hundreds of dollars per day on tasks that do not benefit.

4. Apply the three-second rule: if a human could answer correctly in under three seconds, skip CoT. If the task requires deliberation, use it.

