How To Read This Page
The mistakes below are the failure modes that show up repeatedly when teams move from playground experiments to production prompts. None are exotic. All are easy to fix once you see them. Use this page as a checklist when a prompt is producing poor output: walk through it top to bottom and rule out each mistake before reaching for a more sophisticated technique.
Vague Instructions
The most common failure. The prompt asks for something general, the model produces something generic, the writer is disappointed.
Bad
Make this email better.
Better
Rewrite the email below to be 30 percent shorter, in second person, with the
core ask in the first sentence and a single clear next step at the end.
Email:
{email}
The fix is always the same: replace adjectives with attributes. “Better” becomes “shorter, second person, one clear ask, one clear next step”. The model can act on attributes; it cannot act on a vibe.
Negative Instructions Without Positive Replacements
Negative instructions tell the model what not to do. The model has to imagine the forbidden behavior to avoid it, which sometimes produces exactly that behavior. Whenever possible, replace negation with positive direction.
Weak
Do not be too formal. Do not use jargon. Do not be boring.
Strong
Use a conversational tone, plain English, and short sentences. Open with a
specific concrete benefit.
There are exceptions. Some constraints are genuinely negative, like “do not invent statistics” or “do not name specific people”. Keep those. The rule is: prefer positive when both forms are available, and when a constraint must be negative, follow it with a positive replacement where possible.
Stacking Too Many Rules
A prompt with thirty bullet points of rules is rarely better than a prompt with the five most important ones. Past a certain length, the model starts treating the rules as decorative rather than binding.
When a prompt has accumulated rules over time, audit it. Group related rules, drop rules that are redundant, and rank the rest by importance. Move the top three to the most prominent position and accept that the rest will hold most of the time but not always.
A useful diagnostic: if you cannot describe the prompt’s intent in one sentence, the prompt is overgrown. Refactor before adding more.
Conflicting Rules
Worse than too many rules: rules that contradict each other. The model picks one, sometimes inconsistently, and the writer cannot diagnose why.
Conflict
Be concise. Provide thorough explanations of every step. Use bullet points where
appropriate. Always write in flowing prose.
The model cannot satisfy all four. It will pick whichever one is most recent or most specific, which means the output drifts depending on the rest of the context. Fix by removing one of each pair: either concise or thorough, either bullets or prose.
Conflicts are easy to introduce when prompts are edited by multiple people over time. This is why every change to a PANTA OS assistant system prompt should be reviewed end to end before publishing, looking specifically for contradictions.
Over Engineering
The other extreme of stacking too many rules: writing a 2000 token system prompt for a task that needs 200. Modern models are highly capable; a focused short prompt often outperforms an exhaustive long one.
The signs of over engineering:
- The system prompt has more than ten distinct sections.
- The same instruction appears in three different forms.
- There are rules for cases that have never come up.
- The prompt has been edited many times and never simplified.
The fix: rewrite the prompt from scratch, keeping only the rules you can demonstrate are necessary. If you cannot point to a specific failure that a rule prevents, the rule is probably noise. Start with the smallest prompt that passes your evaluations, and add blocks only when they fix a measured failure mode.
Mixing Instructions and Data
When user input is concatenated into the prompt without clear delimiters, the model sometimes treats the input as additional instructions. The output suffers, and the prompt is also vulnerable to prompt injection, where a hostile user includes commands in their input.
Bad
Summarize this article: {article}
Good
Summarize the article inside the <article> tags below.
<article>
{article}
</article>
XML tags, triple backticks, or triple quotes all work. The wrong choice is no delimiter at all.
Few Shot Examples That Contradict The Instructions
When the prompt says one thing and the examples show another, the examples win. Models follow the patterns in examples more reliably than they follow the rules in prose, which is exactly why few shot is so powerful and exactly why bad examples are so destructive.
If the instructions say “respond in formal English” and one of the three examples uses contractions, half the outputs will use contractions. Audit examples whenever you change instructions.
A specific variant to watch for: examples that all happen to share an irrelevant property. If all three of your few shot examples use customer names starting with A, the model will sometimes pick up that pattern and apply it. Vary the examples on every dimension that should not matter.
Using Few Shot For Reasoning Tasks
Few shot is excellent for classification and formatting. It is often counterproductive for math, logic, and multi step reasoning. The examples bias the model toward following the surface pattern of the demonstrated reasoning rather than reasoning fresh from the new input.
For reasoning tasks, prefer zero shot with explicit chain of thought. “Let’s think step by step” or a structured “first explain your reasoning, then state the answer” instruction outperforms most few shot reasoning prompts on the latest models.
Asking For Unverifiable Confidence
A prompt that asks “how confident are you?” without anchoring the confidence in something concrete produces noise. The model will say “high” or “85 percent” with no real basis.
Weak
Rate your confidence in the answer above from 1 to 100.
Stronger
Rate your confidence on a 1 to 5 scale, where 5 means the answer is directly
quoted from the source, 4 means closely paraphrased from the source, 3 means
inferred from the source, 2 means partially supported by the source, and 1
means not supported.
The stronger version anchors each level in observable criteria. The model can apply the criteria; it cannot meaningfully introspect on a continuous probability scale.
Treating Temperature As A Quality Knob
Temperature controls variety, not quality. For factual tasks, higher temperature produces more hallucinations, not more creativity. For creative tasks, very low temperature produces predictable boring output. The trade off is task specific and not a general “good versus bad” axis.
Default to temperature 0 for anything where there is a single correct answer, and raise it only when the task benefits from variety: drafting alternatives, brainstorming, exploring tone.
Forgetting That Prompts Are Versioned Artifacts
A prompt that lives in a single file in production for six months is a prompt that no one tested against the current model. Models update, evals drift, and what worked in March may not work in September.
Treat prompts the way you treat code: keep them in version control, write tests for them, and re run the tests when the model version changes. Most production prompts go through three or four major revisions before they stabilize, and then need light maintenance every few months as the underlying model improves. Inside PANTA OS, every change to an assistant’s system prompt is versioned automatically, so you always have a rollback path.
What Comes Next
The next page, Patterns and Templates, provides reusable skeletons for the recurring tasks: classification, extraction, summarization, drafting, transformation. Use it as reference material when starting a new prompt.Last modified on June 1, 2026