How to Test LLM Applications — AI Testing Guide

The core problem

Standard testing doesn't map onto LLMs

When you write a function, you can assert the output. When you call an LLM, you get a distribution of possible outputs. That one difference breaks most of what we know about software testing.

Non-determinism

The same prompt can return different outputs on every call. You can't write a simple assertEqual. Any evaluation method that requires exact string matching will miss most real failures and cry wolf on acceptable variation.

No ground truth

For most LLM tasks like summarisation, tone, helpfulness, and reasoning quality, there's no single correct answer. Quality is a spectrum. You need rubrics, not assertions. Defining what "good" means is often the hardest part of the whole problem.

Emergent failure modes

LLMs fail in ways that don't exist in traditional software: hallucination, sycophancy, prompt injection, context confusion. You can't find these by reading the code or checking logs. You find them by probing the model's behaviour directly.

What goes wrong

LLM failure modes, explained

These aren't edge cases. Every production LLM application encounters most of these. Knowing what to look for is half the work.

Hallucination

Confident fabrication

The model generates plausible-sounding content that is factually wrong: invented citations, non-existent regulations, made-up product specs. The danger is the confidence: users can't tell the difference.

Factual claims with no grounding in context
Cited sources that don't exist
Numbers and dates that are just wrong
Over-extrapolation from partial information

Prompt Sensitivity

Fragile prompt behaviour

Small changes to phrasing, such as a different word order or an added sentence, can produce dramatically different outputs. This makes prompt engineering feel like whack-a-mole: fix one thing, break another.

Output changes with semantically identical prompts
Fixes in one area silently break another
Different behaviour across model versions
Language or locale sensitivity

Safety & Guardrails

Bypassed safety layers

Safety instructions in the system prompt are guidelines, not hard constraints. Sufficiently creative inputs can work around them, and users will find these paths, intentionally or not.

Jailbreaks via roleplay or hypothetical framing
Prompt injection from user-controlled content
Indirect policy violations through multi-turn context
Guardrail bypasses via language switching

Consistency

Inconsistent outputs

The model gives different answers to the same question across sessions, within a session, or when the question is rephrased slightly. This breaks trust and makes downstream processing brittle.

Contradicts itself within a conversation
Same query returns opposite conclusions
Format inconsistency breaks parsers
Tone drift across long sessions

RAG & Retrieval

Retrieval pipeline failures

If your product uses RAG, the LLM's output quality is only as good as what it retrieves. The model can ignore retrieved context, misattribute it, or blend it incorrectly with its training knowledge.

Ignoring retrieved context entirely
Contradicting source documents
Mixing training knowledge with retrieved facts
Citation errors in grounded responses

Regression

Silent regression on update

Model providers update their models without always announcing what changed. Your prompts that worked last month may behave differently today. Without eval coverage, you won't know until users tell you.

Model version updates with no changelog
Prompt changes that break unrelated flows
Fine-tune updates altering base behaviour
System prompt interactions that shift over time

The methodology

How LLM evaluation actually works

Since you can't use exact-match assertions, you need a different measurement layer. Here are the main approaches, each with different trade-offs.

Human Evaluation

A human rates outputs against a rubric. Highest quality signal. Slow and expensive to scale. Essential for establishing ground truth on a calibration set.

LLM-as-Judge

A separate LLM (often GPT-4 or Claude) scores outputs against your rubric. Fast and scalable. Requires careful prompt design as the judge has its own biases.

Custom Rubrics

Scoring criteria written specifically for your use case, not generic accuracy. The question is: does this response meet our tone policy and answer the user's actual question?

Regression Suites

A fixed set of test cases with expected behaviour. Run automatically on every prompt or model change. The closest thing to unit tests for LLM behaviour.

Tools & approach

What I use and why

No single tool covers everything. The right stack depends on your use case. Here's what I reach for and what each is good at.

PromptFoo

Open-source LLM testing framework used for prompt regression suites and head-to-head model comparisons. Helps catch behavioural changes when prompts or model versions change.

SHAP / LIME / PDP

Explainability tools used in ML testing engagements. SHAP and LIME explain individual predictions, while PDP plots reveal how features influence model behaviour globally.

Custom Harnesses

When off-the-shelf tools don't fit the use case, I build lightweight Python-based eval harnesses tailored to your specific inputs, outputs, and scoring criteria.

Manual Probing

Not everything can be automated. Structured manual exploration, especially for safety, edge cases, and novel failure discovery, remains essential at every engagement.

LLM-as-Judge

Using a strong model (GPT-4o, Claude) to score outputs at scale. I design the judge prompts carefully and validate them against a human-rated calibration set first.

Pytest + CI

For teams that want eval in their existing dev workflow. Wrapping LLM assertions into Pytest lets regressions surface in the same pipeline as regular tests.

Why AI Testing Is a Different Problem