Why AI Testing Is a Different Problem

Traditional QA assumes your code either works or it doesn't. LLMs don't follow that rule. Most teams find out the hard way. Here's how to think about it.

Standard testing doesn't map onto LLMs

When you write a function, you can assert the output. When you call an LLM, you get a distribution of possible outputs. That one difference breaks most of what we know about software testing.

Non-determinism

The same prompt can return different outputs on every call. You can't write a simple assertEqual. Any evaluation method that requires exact string matching will miss most real failures and cry wolf on acceptable variation.

No ground truth

For most LLM tasks like summarisation, tone, helpfulness, and reasoning quality, there's no single correct answer. Quality is a spectrum. You need rubrics, not assertions. Defining what "good" means is often the hardest part of the whole problem.

Emergent failure modes

LLMs fail in ways that don't exist in traditional software: hallucination, sycophancy, prompt injection, context confusion. You can't find these by reading the code or checking logs. You find them by probing the model's behaviour directly.

LLM failure modes, explained

These aren't edge cases. Every production LLM application encounters most of these. Knowing what to look for is half the work.

Hallucination

Confident fabrication

The model generates plausible-sounding content that is factually wrong: invented citations, non-existent regulations, made-up product specs. The danger is the confidence: users can't tell the difference.

  • Factual claims with no grounding in context
  • Cited sources that don't exist
  • Numbers and dates that are just wrong
  • Over-extrapolation from partial information
Prompt Sensitivity

Fragile prompt behaviour

Small changes to phrasing, such as a different word order or an added sentence, can produce dramatically different outputs. This makes prompt engineering feel like whack-a-mole: fix one thing, break another.

  • Output changes with semantically identical prompts
  • Fixes in one area silently break another
  • Different behaviour across model versions
  • Language or locale sensitivity
Safety & Guardrails

Bypassed safety layers

Safety instructions in the system prompt are guidelines, not hard constraints. Sufficiently creative inputs can work around them, and users will find these paths, intentionally or not.

  • Jailbreaks via roleplay or hypothetical framing
  • Prompt injection from user-controlled content
  • Indirect policy violations through multi-turn context
  • Guardrail bypasses via language switching
Consistency

Inconsistent outputs

The model gives different answers to the same question across sessions, within a session, or when the question is rephrased slightly. This breaks trust and makes downstream processing brittle.

  • Contradicts itself within a conversation
  • Same query returns opposite conclusions
  • Format inconsistency breaks parsers
  • Tone drift across long sessions
RAG & Retrieval

Retrieval pipeline failures

If your product uses RAG, the LLM's output quality is only as good as what it retrieves. The model can ignore retrieved context, misattribute it, or blend it incorrectly with its training knowledge.

  • Ignoring retrieved context entirely
  • Contradicting source documents
  • Mixing training knowledge with retrieved facts
  • Citation errors in grounded responses
Regression

Silent regression on update

Model providers update their models without always announcing what changed. Your prompts that worked last month may behave differently today. Without eval coverage, you won't know until users tell you.

  • Model version updates with no changelog
  • Prompt changes that break unrelated flows
  • Fine-tune updates altering base behaviour
  • System prompt interactions that shift over time

How LLM evaluation actually works

Since you can't use exact-match assertions, you need a different measurement layer. Here are the main approaches, each with different trade-offs.

Human Evaluation

A human rates outputs against a rubric. Highest quality signal. Slow and expensive to scale. Essential for establishing ground truth on a calibration set.

LLM-as-Judge

A separate LLM (often GPT-4 or Claude) scores outputs against your rubric. Fast and scalable. Requires careful prompt design as the judge has its own biases.

Custom Rubrics

Scoring criteria written specifically for your use case, not generic accuracy. The question is: does this response meet our tone policy and answer the user's actual question?

Regression Suites

A fixed set of test cases with expected behaviour. Run automatically on every prompt or model change. The closest thing to unit tests for LLM behaviour.

What I use and why

No single tool covers everything. The right stack depends on your use case. Here's what I reach for and what each is good at.

PromptFoo

Open-source LLM testing framework used for prompt regression suites and head-to-head model comparisons. Helps catch behavioural changes when prompts or model versions change.

SHAP / LIME / PDP

Explainability tools used in ML testing engagements. SHAP and LIME explain individual predictions, while PDP plots reveal how features influence model behaviour globally.

Custom Harnesses

When off-the-shelf tools don't fit the use case, I build lightweight Python-based eval harnesses tailored to your specific inputs, outputs, and scoring criteria.

Manual Probing

Not everything can be automated. Structured manual exploration, especially for safety, edge cases, and novel failure discovery, remains essential at every engagement.

LLM-as-Judge

Using a strong model (GPT-4o, Claude) to score outputs at scale. I design the judge prompts carefully and validate them against a human-rated calibration set first.

Pytest + CI

For teams that want eval in their existing dev workflow. Wrapping LLM assertions into Pytest lets regressions surface in the same pipeline as regular tests.

Want to talk through your specific situation?

Every LLM product has different failure modes. Bring yours and we'll figure out where to start.

Book a Free Call