Traditional QA assumes your code either works or it doesn't. LLMs don't follow that rule. Most teams find out the hard way. Here's how to think about it.
When you write a function, you can assert the output. When you call an LLM, you get a distribution of possible outputs. That one difference breaks most of what we know about software testing.
The same prompt can return different outputs on every call. You can't write a simple assertEqual. Any evaluation method that requires exact string matching will miss most real failures and cry wolf on acceptable variation.
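To make that concrete, here is a minimal sketch of checking properties of an output instead of its exact text. The function name, the word limit, and the five-digit order number are all assumptions made up for illustration, not a fixed recipe.

```python
import re


def check_summary(output: str, max_words: int = 60) -> list[str]:
    """Return the names of failed property checks (an empty list means pass)."""
    failures = []
    if not output.strip():
        failures.append("empty output")
    if len(output.split()) > max_words:
        failures.append("too long")
    # Hypothetical requirement: the reply must mention a five-digit order number.
    if not re.search(r"\b\d{5}\b", output):
        failures.append("missing order number")
    return failures


# Two differently worded outputs that are both acceptable.
a = "Order 48213 was delayed by weather and will arrive Friday."
b = "Because of weather, delivery of order 48213 has moved to Friday."

assert a != b                    # an exact-match test would flag one of them
assert check_summary(a) == []    # the property checks accept both
assert check_summary(b) == []
```

The checks pass for any wording that satisfies the requirements, which is the behaviour an exact string comparison can't give you.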
For most of what you want from an LLM (good summaries, the right tone, helpfulness, sound reasoning), there's no single correct answer. Quality is a spectrum. You need rubrics, not assertions. Defining what "good" means is often the hardest part of the whole problem.
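A rubric can be as simple as a handful of weighted criteria. The sketch below is one possible shape, assuming a 0 to 5 rating per criterion; the criterion names and weights are invented for the example and would come from your own definition of "good".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    description: str


# Hypothetical rubric for a support assistant.
RUBRIC = [
    Criterion("answers_question", 0.5, "Addresses what the user actually asked"),
    Criterion("tone", 0.3, "Matches the tone policy"),
    Criterion("concise", 0.2, "No padding or repetition"),
]


def weighted_score(ratings: dict[str, int], rubric=RUBRIC, scale_max: int = 5) -> float:
    """Combine per-criterion ratings (0..scale_max) into one score in [0, 1]."""
    missing = [c.name for c in rubric if c.name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    total_weight = sum(c.weight for c in rubric)
    weighted = sum(c.weight * ratings[c.name] / scale_max for c in rubric)
    return weighted / total_weight


score = weighted_score({"answers_question": 5, "tone": 3, "concise": 4})
```

The ratings themselves can come from a human reviewer or a judge model; the rubric only fixes what gets rated and how much each criterion counts.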
LLMs fail in ways that don't exist in traditional software: hallucination, sycophancy, prompt injection, context confusion. You can't find these by reading the code or checking logs. You find them by probing the model's behaviour directly.
These aren't edge cases. Every production LLM application encounters most of these. Knowing what to look for is half the work.
The model generates plausible-sounding content that is factually wrong: invented citations, non-existent regulations, made-up product specs. The danger is that wrong answers arrive with the same confidence as right ones, so users can't tell the difference.
Small changes to phrasing, such as a different word order or an added sentence, can produce dramatically different outputs. This makes prompt engineering feel like whack-a-mole: fix one thing, break another.
Safety instructions in the system prompt are guidelines, not hard constraints. Sufficiently creative inputs can work around them, and users will find these paths, intentionally or not.
The model gives different answers to the same question across sessions, within a session, or when the question is rephrased slightly. This breaks trust and makes downstream processing brittle.
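One rough way to put a number on this is to ask the same question several times and measure how often the answers agree. The sketch below assumes short, closed-form answers (a refund window, a yes/no) where matching after light normalisation is meaningful; for free-form text you would need a semantic comparison instead. The sample answers are made up.

```python
from collections import Counter


def normalise(answer: str) -> str:
    """Lower-case, collapse whitespace, and drop trailing full stops."""
    return " ".join(answer.lower().split()).rstrip(".")


def consistency_rate(answers: list[str]) -> float:
    """Share of answers that match the most common normalised answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalise(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


# Hypothetical answers collected from four calls with the same question.
samples = ["30 days", "30 days.", "30 Days", "14 days"]
rate = consistency_rate(samples)  # 3 of 4 agree after normalisation
```

Running the same measurement over rephrasings of the question, rather than repeats of it, gives a similar signal for phrasing sensitivity.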
If your product uses RAG, the LLM's output quality is only as good as what it retrieves. The model can ignore retrieved context, misattribute it, or blend it incorrectly with its training knowledge.
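As a cheap first-pass check for answers that drift away from the retrieved context, you can flag any number in the answer that doesn't appear in the context. This is a deliberately crude heuristic sketched for illustration, not a full groundedness metric; the function name and example texts are assumptions.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer: str, context: str) -> set[str]:
    """Numbers that appear in the answer but nowhere in the retrieved context."""
    return set(NUMBER.findall(answer)) - set(NUMBER.findall(context))


# Hypothetical retrieved passage and two candidate answers.
context = "Refunds are accepted within 30 days of purchase."
grounded = "You can get a refund within 30 days."
ungrounded = "You can get a refund within 45 days."

assert unsupported_numbers(grounded, context) == set()
assert unsupported_numbers(ungrounded, context) == {"45"}
```

It won't catch misattribution or subtle blending with training knowledge, but it is fast enough to run on every response and surfaces a common class of RAG errors.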
Model providers update their models without always announcing what changed. Prompts that worked last month may behave differently today. Without eval coverage, you won't know until users tell you.
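With eval coverage in place, spotting drift can be as simple as comparing today's scores with a stored baseline. The sketch below assumes you already have a score between 0 and 1 per test case; the case names, scores, and the 0.05 tolerance are illustrative only.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return {case_id: drop} for cases whose score fell by more than `tolerance`."""
    regressions = {}
    for case_id, old_score in baseline.items():
        new_score = current.get(case_id)
        if new_score is None:
            continue  # case was not run this time; nothing to compare
        drop = old_score - new_score
        if drop > tolerance:
            regressions[case_id] = round(drop, 3)
    return regressions


# Hypothetical scores from last month's run and today's run.
baseline = {"refund_policy": 0.95, "tone_check": 0.90}
current = {"refund_policy": 0.94, "tone_check": 0.70}

regressions = find_regressions(baseline, current)
```

The tolerance exists because scores move a little from run to run even when nothing has changed; set it from the noise you observe on repeated runs of an unchanged setup.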
Since you can't use exact-match assertions, you need a different measurement layer. Here are the main approaches, each with different trade-offs.
A human rates outputs against a rubric. Highest quality signal. Slow and expensive to scale. Essential for establishing ground truth on a calibration set.
A separate LLM (often GPT-4 or Claude) scores outputs against your rubric. Fast and scalable. Requires careful prompt design, because the judge has its own biases.
Scoring criteria written specifically for your use case, not generic accuracy. The question is: does this response meet our tone policy and answer the user's actual question?
A fixed set of test cases with expected behaviour. Run automatically on every prompt or model change. The closest thing to unit tests for LLM behaviour.
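A golden set doesn't need much machinery. Here is a minimal sketch, assuming each case lists phrases the answer must and must not contain; the cases and the `fake_model` stand-in are hypothetical, and a real suite would call your actual model and usually apply richer checks.

```python
from typing import Callable

# Hypothetical golden cases with expected behaviour expressed as phrase checks.
GOLDEN_SET = [
    {
        "id": "refund_window",
        "prompt": "How long do I have to return an item?",
        "must_include": ["30 days"],
        "must_not_include": ["no refunds"],
    },
    {
        "id": "no_legal_advice",
        "prompt": "Can I sue my landlord?",
        "must_include": ["not legal advice"],
        "must_not_include": ["you will win"],
    },
]


def run_golden_set(generate: Callable[[str], str], cases=GOLDEN_SET) -> dict[str, bool]:
    """Run every case through `generate` and report pass/fail per case id."""
    results = {}
    for case in cases:
        output = generate(case["prompt"]).lower()
        has_required = all(p.lower() in output for p in case["must_include"])
        has_forbidden = any(p.lower() in output for p in case["must_not_include"])
        results[case["id"]] = has_required and not has_forbidden
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the example runs offline."""
    if "return" in prompt:
        return "You can return items within 30 days of delivery."
    return "You will win easily, just file the claim."


results = run_golden_set(fake_model)
# The refund case passes; the legal-advice case fails on both checks.
```

Because `generate` is just a function from prompt to text, the same set can be replayed against a new prompt version or a different model without changing the cases.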
No single tool covers everything. The right stack depends on your use case. Here's what I reach for and what each is good at.
PromptFoo
Open-source LLM testing framework used for prompt regression suites and head-to-head model comparisons. Helps catch behavioural changes when prompts or model versions change.
SHAP / LIME / PDP
Explainability tools used in ML testing engagements. SHAP and LIME explain individual predictions, while partial dependence plots (PDPs) reveal how features influence model behaviour globally.
Custom Harnesses
When off-the-shelf tools don't fit the use case, I build lightweight Python-based eval harnesses tailored to your specific inputs, outputs, and scoring criteria.
Manual Probing
Not everything can be automated. Structured manual exploration, especially for safety, edge cases, and novel failure discovery, remains essential in every engagement.
LLM-as-Judge
Using a strong model (GPT-4o, Claude) to score outputs at scale. I design the judge prompts carefully and validate them against a human-rated calibration set first.
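Validating a judge against human ratings comes down to measuring agreement on the calibration set. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. It assumes pass/fail labels (1 or 0) from one human rater and one judge, and the label lists are made up for the example.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of items where the judge gave the same label as the human."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list[int], judge: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels on an eight-item calibration set.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1]
judge_labels = [1, 1, 0, 1, 1, 0, 1, 0]

raw = agreement_rate(human_labels, judge_labels)   # 0.75
kappa = cohen_kappa(human_labels, judge_labels)    # about 0.47
```

Raw agreement looks flattering when one label dominates, which is why kappa is worth reporting alongside it. What counts as "good enough" agreement depends on the task and is a judgment call.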
Pytest + CI
For teams that want eval in their existing dev workflow. Wrapping LLM assertions into Pytest lets regressions surface in the same pipeline as regular tests.
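Because outputs vary between calls, a Pytest-style LLM test usually asserts on a pass rate over several samples and not on a single response. Here is a minimal sketch: `call_model` is a hypothetical stand-in that replays canned answers so the example runs offline, and the 0.7 threshold is an arbitrary choice. Pytest collects any function named `test_*`, so no plugin is needed.

```python
import itertools
from typing import Callable

# Canned responses standing in for a real, non-deterministic model (hypothetical).
_CANNED = itertools.cycle(
    [
        "You have 30 days to return an item.",
        "Returns are accepted within 30 days.",
        "Our return window is 30 days from delivery.",
        "Please contact support about returns.",
        "Items can be sent back within 30 days.",
    ]
)


def call_model(prompt: str) -> str:
    """Replace this with your real model client."""
    return next(_CANNED)


def pass_rate(prompt: str, check: Callable[[str], bool], n: int = 10) -> float:
    """Sample the model `n` times and return the share of outputs passing `check`."""
    return sum(check(call_model(prompt)) for _ in range(n)) / n


def test_refund_answer_mentions_window():
    rate = pass_rate(
        "How long do I have to return an item?",
        lambda out: "30 days" in out.lower(),
    )
    # Tolerate some variation, but fail the build if most samples miss the fact.
    assert rate >= 0.7
```

Sampling costs more per test than a normal unit test, so teams often run a small smoke subset on every commit and the full suite on a schedule.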
Every LLM product has different failure modes. Bring yours and we'll figure out where to start.