
Prompt Iteration & Debugging

Learn the systematic process of improving prompts — from first draft to production-ready.

1. The Iteration Loop

Prompt engineering is not a one-shot process. Great prompts are built through iteration: Write → Test → Analyze → Refine → Repeat. Each pass through this cycle yields measurably better output.

The Iteration Loop (Detail):

  1. Write: Draft your initial prompt based on the requirements
  2. Test: Run the prompt on representative test cases (10-20 examples minimum)
  3. Analyze: Examine failures and patterns. Ask "Why did it fail?"
  4. Refine: Make targeted changes to address root causes
  5. Repeat: Continue until results are acceptable
💡 Analogy: Software Development

Prompt engineering is like writing code. Your first draft won't be perfect. You test, debug, refactor, and iterate. The difference: with prompts, your "tests" are small API calls, not full test suites.

The Loop in Practice:

Python — Iteration Loop
import anthropic

client = anthropic.Anthropic()

# Test cases
test_cases = [
    ("I love this!", "positive"),
    ("Terrible product.", "negative"),
    ("It's okay.", "neutral"),
]

def test_prompt(prompt, test_data):
    """Run prompt on test cases and check accuracy."""
    correct = 0
    for text, expected in test_data:
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"{prompt}\n\nText: {text}"
            }]
        )
        result = response.content[0].text.strip().lower()
        if expected in result:
            correct += 1
    accuracy = (correct / len(test_data)) * 100
    return accuracy

# Iteration cycle
prompt_v1 = "Classify sentiment: positive, negative, or neutral."
acc_v1 = test_prompt(prompt_v1, test_cases)
print(f"V1: {acc_v1}%")

# Analyze and improve
prompt_v2 = """Classify sentiment.
Examples: "love" → positive, "bad" → negative.
Return ONE word: positive, negative, or neutral."""
acc_v2 = test_prompt(prompt_v2, test_cases)
print(f"V2: {acc_v2}%")

Measure Everything

Define success metrics upfront (accuracy, latency, cost). Without metrics, you're iterating blindly. Measure before and after each change.
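To make "measure everything" concrete, here is a minimal sketch of timing a call and estimating its cost from the token counts the API reports. The per-million-token prices are placeholders, not real pricing; substitute your model's actual rates.

```python
import time

# Hypothetical per-million-token prices -- placeholders, NOT real pricing.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

def measure_call(client, prompt):
    """Run one API call; return (text, latency_seconds, estimated_cost_usd)."""
    start = time.perf_counter()
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.perf_counter() - start
    # The response's usage object reports token counts for this call.
    cost = (
        response.usage.input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
        + response.usage.output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    )
    return response.content[0].text, latency, cost
```

Logging these three numbers for every prompt version gives you a before/after comparison for each change, not just accuracy.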

2. Common Failure Modes

Most prompt failures fall into predictable patterns. Learn to recognize them, and you'll debug faster.

Failure Mode 1: Overly Verbose Prompts

❌ Problem
Your job is to carefully read the
text and think deeply about whether
the sentiment is positive or negative.
Please consider all nuances...
✅ Fix
Classify sentiment: positive or
negative. Be concise. One word
answer only.

Why it fails: Long prompts confuse the model and waste tokens. How to fix: Remove filler words. Be specific, not verbose.

Failure Mode 2: Hallucinations

❌ Problem
What is the annual revenue of
company X?
✅ Fix
Based ONLY on this text, extract
the annual revenue. If not mentioned,
say "Not found".

Why it fails: Model makes up information when uncertain. How to fix: Tell it what to do when info is unavailable.

Failure Mode 3: Wrong Format Output

❌ Problem
Return the data as JSON.
✅ Fix
Return valid JSON (no markdown).
{
  "name": "...",
  "score": 0-100
}

Why it fails: "JSON" is ambiguous. Model might return markdown-wrapped JSON. How to fix: Show exact expected format with an example.
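A defensive complement to the prompt fix above is validating the output in code. This is a sketch of a parser that tolerates a stray markdown fence and signals failure so the caller can retry with a stronger instruction:

```python
import json

def parse_json_output(raw: str):
    """Parse model output as JSON, tolerating a markdown code fence."""
    text = raw.strip()
    # Strip a ```json ... ``` wrapper if the model added one anyway.
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # caller can retry with a stronger instruction
```

Belt-and-suspenders: the prompt asks for bare JSON, and the parser copes if the model wraps it anyway.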

Failure Mode 4: Ignoring Instructions

❌ Problem
Don't be too creative. Just summarize
the text factually.
✅ Fix
Summarize ONLY facts from the text.
Do NOT add interpretations or
creative additions.

Why it fails: Weak negatives ("don't") are less effective. How to fix: Use system prompt for hard rules. Be explicit.

Failure Mode 5: Lazy Responses

❌ Problem
Summarize the meeting.
✅ Fix
Summarize the meeting covering:
1. Key decisions (at least 3)
2. Action items with owners
3. Next meeting date
Format: markdown with headers

Why it fails: Vague instructions allow low-effort responses. How to fix: Specify exactly what you want. Set expectations.

⚠️ Test Against Failure Modes

When you notice a failure, check which mode it is. Then apply the targeted fix. Don't just say "the prompt is bad" — diagnose the specific problem.

3. Debugging Techniques

When a prompt fails, you need to isolate the cause. Is it the system prompt? The input data? The output format spec? Debugging is detective work.

Systematic Debugging Process:

  1. Isolate: Test each component separately (system prompt, examples, format specs)
  2. Simplify: Create minimal test cases that reproduce the failure
  3. Hypothesis: Form a theory about what's wrong
  4. Test: Make ONE change and measure the effect
  5. Iterate: If the hypothesis was right, apply to all cases

Debugging Example: "Model is returning markdown when I want JSON"

Python — Systematic Debugging
# Debug: Is it the format spec or the system prompt?
# (Reuses the `client` from the iteration-loop example above.)

# Test 1: Just format spec (no system prompt)
response_1 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": "Return valid JSON: {\"name\": \"...\"}"
    }]
)
print("Test 1 (no system):", response_1.content[0].text)

# Test 2: With system prompt that says to be helpful
response_2 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=200,
    system="Be helpful",
    messages=[{
        "role": "user",
        "content": "Return valid JSON: {\"name\": \"...\"}"
    }]
)
print("Test 2 (vague system):", response_2.content[0].text)

# Test 3: Strong system prompt about JSON only
response_3 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=200,
    system="Return ONLY valid JSON. No markdown. No explanation.",
    messages=[{
        "role": "user",
        "content": "Return valid JSON: {\"name\": \"...\"}"
    }]
)
print("Test 3 (strong system):", response_3.content[0].text)

# Conclusion: Hypothesis was right! The system prompt was too weak.
💡 One Change at a Time

Never change multiple things at once. Change one variable, measure, then move to the next. Otherwise, you won't know which change caused the improvement.

4. A/B Testing Prompts

When you have two competing prompt versions, A/B test them on your dataset. Don't rely on gut feeling. Let data decide.

A/B Testing Process:

  1. Create a test dataset (50-200 examples with known correct answers)
  2. Run Prompt A on all examples, measure accuracy/quality
  3. Run Prompt B on all examples, measure the same metrics
  4. Compare: Which performed better? By how much?
  5. If B is better, adopt B. If tie or marginal, stick with A (simpler).

Python: A/B Testing Code

Python — A/B Testing Framework
import json
from anthropic import Anthropic

client = Anthropic()

# Test dataset
test_data = [
    {"text": "I love this!", "expected": "positive"},
    {"text": "Awful experience.", "expected": "negative"},
    {"text": "It's okay.", "expected": "neutral"},
]

def test_prompt_version(prompt, dataset):
    """Test a prompt and return accuracy."""
    results = []
    for item in dataset:
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"{prompt}\n\nText: {item['text']}"
            }]
        )
        result = response.content[0].text.strip().lower()
        correct = item["expected"] in result
        results.append({
            "input": item["text"],
            "expected": item["expected"],
            "output": result,
            "correct": correct
        })
    accuracy = (sum(1 for r in results if r["correct"])) / len(results) * 100
    return accuracy, results

# Prompt A (basic)
prompt_a = "Classify sentiment: positive, negative, or neutral."

# Prompt B (improved with examples)
prompt_b = """Classify sentiment: positive, negative, neutral.
Examples:
- "love" → positive
- "hate" → negative
- "ok" → neutral

Return ONE word only."""

# Run A/B test
acc_a, results_a = test_prompt_version(prompt_a, test_data)
acc_b, results_b = test_prompt_version(prompt_b, test_data)

print(f"Prompt A accuracy: {acc_a}%")
print(f"Prompt B accuracy: {acc_b}%")
print(f"Winner: {'B' if acc_b > acc_a else 'A'}")

Statistical Significance

Small improvements (1-2%) might be noise. Look for bigger gaps (5%+) before declaring a winner. On larger datasets (1000+ examples), smaller improvements become meaningful.
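One way to put numbers behind "is this gap real?" is an approximate two-proportion z-test, sketched here with only the standard library (assuming both prompts ran on the same number of examples):

```python
import math

def two_proportion_z(correct_a: int, correct_b: int, n: int) -> float:
    """Approximate z-statistic comparing two accuracies on n examples each."""
    p_a, p_b = correct_a / n, correct_b / n
    p_pool = (correct_a + correct_b) / (2 * n)  # pooled success rate
    se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
    if se == 0:
        return 0.0
    return (p_b - p_a) / se

# |z| > 1.96 roughly corresponds to p < 0.05 (two-sided).
z = two_proportion_z(correct_a=72, correct_b=85, n=100)
```

With 72 vs 85 correct out of 100, z comes out well above 1.96, so the 13-point gap is unlikely to be noise. A 2-point gap on the same n would not clear the bar.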

5. Prompt Versioning

Track your prompt versions like you track code. Versioning lets you compare, rollback, and understand what changed and why. This is essential in production.

Prompt Changelog Example:

Text — Prompt Changelog
## Sentiment Classifier Prompt Changelog

### v1.0 (2025-01-15)
- Initial: "Classify sentiment: positive, negative, neutral"
- Accuracy: 72%
- Issue: Too vague, model guesses

### v1.1 (2025-01-16)
- Added few-shot examples
- Added "Return ONE word only" constraint
- Accuracy: 85%
- Improvement: +13%

### v1.2 (2025-01-17)
- Moved examples to system prompt
- Added "If unsure, say 'unclear'"
- Accuracy: 87%
- Improvement: +2%

### v2.0 (2025-01-18)
- Full XML structure with tags
- Explicit JSON output format
- Accuracy: 89%
- Improvement: +2%, better production-grade

💡 Store Prompts in Version Control

Keep prompts in Git (or equivalent) alongside your code. Tag releases. This lets you roll back to a previous version if a new prompt performs worse.

6. Real-World Case Study: Code Review Prompt

Let's walk through a real example of iterating on a code review prompt from terrible (v1) to excellent (v5). Notice the pattern of progressive improvements.

V1 (Terrible) — Initial Attempt

Prompt — V1: Too Vague
Review this code.

Problem: No specificity. Model gives generic feedback. Accuracy: 30% (misses bugs, gives irrelevant comments)

V2 (Okay) — Add Some Context

Prompt — V2: Basic Structure
You are a code reviewer.
Review this Python code for bugs and improvements.
List issues and suggestions.

Code:
[code here]

Improvement: More specific, but still loose. Accuracy: 60% (catches some issues, but output is rambling)

V3 (Better) — Add Format & Examples

Prompt — V3: Structured Format
You are a senior Python engineer.
Review code focusing on:
1. Security issues (highest priority)
2. Performance problems
3. Code quality improvements

Output format:
## Security Issues
[list with severity]

## Performance
[specific suggestions]

## Code Quality
[improvements with examples]

Code:
[code here]

Improvement: Clear format, prioritization. Accuracy: 75% (catches most issues, better organized)

V4 (Excellent) — Add XML Tags & Guardrails

Prompt — V4: Production-Grade
<system_role>
You are a senior Python engineer at a Fortune 500
company with 15 years of experience. Your code
reviews are known for catching subtle bugs and
suggesting high-impact improvements.
</system_role>

<instructions>
Review this Python code. Flag issues in priority order.
</instructions>

<priorities>
1. Security vulnerabilities (SQL injection, etc.)
2. Race conditions or concurrency issues
3. Memory leaks or performance problems
4. Code quality and style issues
</priorities>

<output_format>
## 🔴 Critical Issues
- Issue 1 (line X): description
- Issue 2 (line Y): description

## 🟡 Warnings
- Warning 1: description

## 💡 Improvements
- Suggestion 1: description

## ✅ Positive Notes
- What's good about the code
</output_format>

<rules>
- Cite specific line numbers
- Provide code examples for fixes
- Be constructive, not dismissive
- Don't comment on style preferences
- If you're unsure, say "needs clarification"
</rules>

Code:
[code here]

Improvement: Strong role, clear rules, structured format. Accuracy: 88% (catches nearly all issues, highly useful)

V5 (Optimal) — Add Few-Shot & Error Handling

Prompt — V5: Fully Optimized
<system_role>
You are a senior Python engineer at a Fortune 500
company with 15 years of experience. Your code
reviews are known for catching subtle bugs and
suggesting high-impact improvements.
</system_role>

<examples>
Example 1: SQL injection vulnerability
```python
result = db.execute(f"SELECT * FROM users WHERE id = {user_id}")
```
Review: "🔴 Critical security issue on line 2. This is vulnerable to
SQL injection. Fix: Use parameterized queries."

Example 2: Good code
```python
def calculate_total(items: list[float]) -> float:
    """Calculate sum of items with validation."""
    return sum(item for item in items if isinstance(item, (int, float)))
```
Review: "✅ Good: Type hints, docstring, and defensive code."
</examples>

<instructions>
Review this Python code. Flag issues in priority order.
</instructions>

... [rest of v4 structure] ...

Code:
[code here]

Improvement: Few-shot examples teach desired behavior. Accuracy: 92% (production-ready, consistent, actionable)

The Progression

Notice each version builds on the previous: v1 → v2 (add context) → v3 (structure) → v4 (XML tags, guardrails) → v5 (examples). This is the path most prompts follow to production.

Check Your Understanding

Quick Quiz — 4 Questions

1. What are the steps of the prompt iteration loop?

2. Which failure mode is fixed by adding "Return ONLY JSON. No markdown"?

3. When debugging a prompt, what should you do?

4. Why should you version control your prompts?

Topic 5 Summary

Here's what you've learned:

  1. Iteration is everything. Start simple, test, and refine.
  2. Failure modes are predictable. Learn to recognize and fix them.
  3. Debugging is systematic. Isolate, hypothesize, test one change at a time.
  4. A/B testing eliminates guessing. Let data decide between versions.
  5. Versioning prevents chaos. Track changes like code.
  6. Case studies show the path. Most prompts follow a v1 → v5 progression to production.

Next up → Topic 6: Hands-On — Building with LLM APIs
Write real Python code to call Claude and OpenAI APIs with streaming, error handling, and practical patterns.
