How to Write System Prompts That Actually Work

If you've spent any time working with large language models, you've probably encountered this scenario: you write a system prompt that looks perfect in your head, paste it into the model, and get output that's nowhere near what you expected. You tweak it, add more instructions, try again — and maybe it improves, maybe it gets worse.

This is one of the most frustrating aspects of working with LLMs. The problem isn't usually the model. The problem is almost always the prompt structure.

Over the past year, I've experimented with dozens of prompt formats, studied how different LLM providers structure their system messages, and iterated on hundreds of system prompts for our open source Prompt Library. Here's what I've learned about writing prompts that actually produce consistent, reliable results.

What Is a System Prompt?

A system prompt (sometimes called a "system message" or "developer instruction") is the initial instruction set you provide to a language model before any user interaction begins. It tells the model who it is, what it should do, how it should behave, and what constraints it should follow.

In the API context, this is typically the message with the role set to system. In chat interfaces, it's often hidden behind the scenes but serves the same purpose. The system prompt sets the foundation for everything that follows.

A well-crafted system prompt can mean the difference between a model that produces publication-ready responses and one that wanders off-topic, ignores constraints, or hallucinates information. A poorly structured one wastes tokens, confuses the model, and produces inconsistent results.

Why XML Tags Work

The approach I use — and recommend — is to structure your system prompts using XML-style tags. Here's what a basic structure looks like:

<purpose>
What the agent is designed to do.
</purpose>

<identity>
Who the agent is. Professional background, expertise areas.
</identity>

<personality>
Tone, communication style, attitude toward users.
</personality>

<output>
Expected response structure and formatting rules.
</output>

<rules>
Behavioral constraints and best practices.
</rules>

<constraints>
What the agent should NOT do.
</constraints>

But why does this work? There are several reasons rooted in how language models process text.

1. XML Tags Create Explicit Boundaries

Language models are trained on vast amounts of code and structured text. They have learned that XML tags serve as delimiters — they mark where one section of content begins and ends. When you use tags like <purpose> and <rules>, you're giving the model explicit structural signals about how to parse and prioritize different parts of the prompt.

Without these boundaries, instructions blend together. A model reading a wall of text may struggle to distinguish between high-level identity statements and specific behavioral constraints. Tags make the hierarchy obvious.

2. Named Tags Act as Semantic Anchors

When you use descriptive tag names like <personality> or <constraints>, you're providing semantic context. The model has seen these terms in training data alongside similar structured content. The tag name itself reinforces what category of instruction follows.

This is different from using headings like "Section 1" or "Part A." A named tag carries meaning. <rules> tells the model "these are behavioral guidelines." <constraints> tells it "these are things to avoid." The model can weight these differently because they mean different things.

3. Tags Improve Instruction Adherence

Research and community testing have consistently shown that LLMs follow instructions more reliably when those instructions are clearly separated and labeled. A study by Anthropic (the creators of Claude) on prompt injection defenses found that clearly delineated instruction sections are harder for adversarial inputs to overwrite. While this research focused on security, the same structural clarity benefits general-purpose prompt engineering.

When your system prompt is organized into tagged sections, the model is better able to maintain those instructions across long conversations. Without clear boundaries, the model may "forget" or deprioritize earlier instructions as the conversation context grows.

4. XML Structure Matches How Models Are Trained

Modern LLMs are trained on enormous corpora that include source code, documentation, configuration files, and structured markup. XML-style tags appear frequently in these training datasets. This means the models have internalized XML syntax patterns — they understand that opening and closing tags encapsulate related content.

This isn't about the model "preferring" XML. It's about alignment. The model's training data includes thousands of examples of well-structured XML documents. Using a similar structure in your prompts leverages those learned patterns.

Breaking Down Each Section

Let's look at each section of the XML structure and what makes it effective.

`<purpose>` — The Why

The purpose section answers one question: what is this agent designed to accomplish? This should be a single, clear statement of function.

Good example:

<purpose>
Reviews code for correctness, readability, performance, security
vulnerabilities, and adherence to best practices. Provides
actionable feedback to improve code quality.
</purpose>

Bad example:

You are supposed to be a code reviewer who looks at code and tells
people what's wrong with it. Also maybe suggest improvements if
you feel like it. You know, just be helpful I guess.

The first example is specific, action-oriented, and unambiguous. The second is vague, informal, and leaves too much to interpretation.

`<identity>` — The Who

The identity section establishes the model's role and expertise. This isn't just a label — it primes the model to access relevant knowledge patterns from its training data.

When you say "You are a Senior Code Reviewer with deep experience across multiple languages," the model activates different knowledge pathways than if you said "You are a helpful assistant."

Be specific about areas of expertise. The more precise your identity statement, the more focused the model's responses will be.

`<personality>` — The Tone

Personality defines how the model communicates. This includes tone (formal, casual, authoritative), attitude (constructive, skeptical, encouraging), and communication style (concise, thorough, educational).

Personality tags are particularly important because LLMs tend toward a default "helpful assistant" tone that may not match your use case. A Creative Writer prompt without a personality section will produce dry, generic content. Adding "Imaginative, expressive, and empathetic" fundamentally changes the output.

`<output>` — The Format

This is one of the most impactful sections. By specifying exactly how responses should be structured, you dramatically improve consistency across generations.

Instead of hoping the model formats things nicely, tell it. Use numbered lists, bullet points, markdown formatting, or any structure that matches your needs. The model will follow this structure because you've given it a clear template to fill in.

`<rules>` — The Do's

Rules are positive behavioral instructions. They tell the model what to do, not what to avoid. Each rule should be a single, clear action item.

Rules work best when they're specific and actionable. "Be helpful" is too vague. "Provide specific, actionable suggestions — not vague feedback" gives the model a concrete standard to meet.

`<constraints>` — The Don'ts

Constraints are negative instructions — what the model should not do. They're often the most overlooked section, but they're critical for preventing common failure modes.

For a Code Reviewer, a constraint like "Do not rewrite entire files — focus on targeted feedback" prevents the model from generating massive diffs when you only wanted specific comments. For a Data Analyst, "Do not fabricate data" prevents the model from inventing statistics to support its analysis.

Constraints work because they close off common failure paths. Without them, the model defaults to the most general behavior it can find, which is often not what you want.

Practical Examples

Let's look at a before-and-after comparison using a real system prompt.

Before — Flat Text

You are a helpful assistant that helps with code reviews. Be constructive and thorough. Look for bugs, security issues, and style problems. Give specific feedback with code examples. Also mention what was done well. Don't rewrite entire files. Consider the project context before making suggestions. Format your response with a summary first, then critical issues, then improvements, then things that were done well.

This prompt contains all the right information, but it's a dense paragraph. The model has to parse it all at once, and instructions blur together. There's no clear hierarchy between what's important and what's supplementary.

After — XML Structured

<purpose>
Reviews code for correctness, readability, performance, and security.
Provides actionable feedback to improve code quality.
</purpose>

<identity>
You are a Code Reviewer — a senior software engineer with deep
experience across multiple languages and frameworks.
</identity>

<personality>
Constructive, thorough, and respectful. Focus on helping the author
improve rather than criticizing. Acknowledge good practices.
</personality>

<output>
1. **Summary** — Overall impression (strengths and concerns)
2. **Critical Issues** — Bugs, security flaws (Priority: HIGH)
3. **Improvements** — Style and readability (Priority: MEDIUM)
4. **Recommendations** — Optimization ideas (Priority: LOW)
5. **Positive Notes** — Good practices observed
</output>

<rules>
- Provide specific, actionable suggestions — not vague feedback
- Include before/after code examples
- Consider the project context before suggesting changes
</rules>

<constraints>
- Do not rewrite entire files — focus on targeted feedback
- Do not introduce breaking changes in suggested fixes
</constraints>

The same information, but now the model can clearly distinguish between identity, behavior, format, positive instructions, and constraints. The output section gives a literal template. The result is significantly more consistent and predictable.

Common Mistakes to Avoid

Too Much Information

A system prompt should be comprehensive but not exhaustive. If your prompt is thousands of words long, you're likely including information that belongs in the user message or in conversational context. The system prompt should define the agent's core behavior, not every possible edge case.

Vague Language

"Be professional," "be helpful," "make it good" — these instructions are too vague for a model to act on. Replace them with specific behavioral descriptions. Instead of "be professional," use "Maintain formal language appropriate for business stakeholders."

Conflicting Instructions

If you tell the model to be "concise and direct" in one section and "thorough and detailed" in another, it will struggle to reconcile those directives. Ensure all sections work together toward the same behavioral goal.

Forgetting Constraints

As mentioned above, constraints are the most overlooked section. Always ask: "What bad things might this model do with this prompt?" Then add a constraint to prevent it.

Testing and Iteration

No system prompt is perfect on the first try. The key is systematic testing:

Write your prompt using the XML structure
Test with edge cases — try inputs that push the boundaries of your prompt
Identify failures — where does the model ignore instructions or behave unexpectedly?
Add constraints or rules to close those gaps
Refine the output format based on what actually worked
Repeat until the output is consistently what you need

This iteration process is where most people give up. But the difference between a "good" system prompt and a "great" one is usually just three or four rounds of targeted refinement.

Why This Matters for Everyone

Whether you're building a production AI agent, automating workflows with n8n, or just trying to get better answers from a chat interface, how you structure your system prompt directly impacts the quality of the output.

XML-tagged system prompts are not a hack or a workaround. They're a structured approach that leverages how language models actually process text. The tags provide explicit boundaries, semantic context, and structural clarity — three things that all modern LLMs rely on to produce consistent, reliable results.

If you've been struggling with inconsistent AI outputs, start with your system prompt. Structure it with clear XML tags, be specific in every section, and add constraints for every failure mode you've observed. You'll be surprised at how much better the results become.

And if you want to try these prompts right now, check out the Prompt Builder tool on this site, where you can load, edit, and test these system prompts directly. Or browse the full Prompt Library on CodeVault for the complete collection.