Your Test Suite Should Hit the LLM, Stop Mocking It

Your test suite should call the LLM API. For real. Not a mock, not a recorded cassette, not a fake that returns canned JSON.


I know. That makes some of you uncomfortable.

"Never hit external services in tests." We've been repeating this for so long it feels like a law of physics. Mock the HTTP layer. Stub the client. Record the response once and replay it forever.

This made sense when external services had stable contracts. You mock Stripe because Stripe's response to the same input is always the same. The mock is a faithful stand-in.

LLMs are not Stripe.

LLMs are part of your algorithm

When your app calls an LLM, that call is not a side effect. It's the core logic. The model decides what tool to call and how to interpret ambiguous input. Mocking that is like testing your sorting algorithm by replacing the comparison function with a hardcoded list. You're not testing anything.

Say your product routes support tickets through an LLM orchestrator that picks a tool and generates a response. If you mock the orchestrator, you've tested the plumbing around the thing that matters. You still have no idea if the product works.

"But it's slow and expensive"

Run them in parallel. Tag them with @tag :llm and run them alongside your unit tests. A dozen LLM calls in parallel take seconds. The cost per run is cents. You spend more writing a pull request.
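A minimal sketch of what that tagging looks like in ExUnit. The module and client function (MyApp.LLM.chat/1) are hypothetical stand-ins for your own code:

```elixir
# Hypothetical module names; the point is the tag and async: true.
defmodule MyApp.LLMSmokeTest do
  use ExUnit.Case, async: true  # runs in parallel with the rest of the suite

  @moduletag :llm  # tag the whole module so you can include/exclude it

  test "the provider is reachable and returns a completion" do
    assert {:ok, _completion} = MyApp.LLM.chat("ping")
  end
end
```

mix test --only llm runs just these; plain mix test runs them alongside everything else.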

"But it's non-deterministic"

So is your product.

Assert on structure, not strings. Did the orchestrator call the right tool? Did it pass the right arguments? Did the response contain the key information? You're testing behavior, not verbatim output.
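Concretely, that means pattern matching on the tool call and its arguments rather than string-comparing the prose. A sketch, assuming a hypothetical client that returns the model's tool call as a map:

```elixir
# MyApp.LLM.chat/1 and the response shape are illustrative assumptions.
test "picks the refund tool with the right order id" do
  {:ok, response} = MyApp.LLM.chat("Refund order 4521, it arrived broken")

  # Structure, not strings: which tool, which arguments.
  assert %{tool: "issue_refund", args: %{"order_id" => "4521"}} = response.tool_call

  # Key information present; exact wording is free to vary.
  assert response.text =~ "4521"
end
```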

If you need semantic assertions, use embeddings to compare meaning instead of characters. In Elixir I built Alike for this: assert "30 day return policy" <~> "you can return items within a month" passes because the meaning matches, even though the strings don't. No regex gymnastics, no substring hacks.

For the harder stuff (hallucination detection, faithfulness checks, safety gates) there's Tribunal, which wraps all of this into ExUnit assertions you can run in CI. assert_faithful response, context fails if the model made something up. refute_hallucination does what you'd expect.

Don't match against a snapshot from three months ago with a model version that doesn't exist anymore.

"But what about flaky tests"

Mostly overblown if you're asserting on the right things. Tool calls and structured arguments are pretty deterministic. If you ask the model to classify a ticket and call the route_ticket tool, it either calls it or it doesn't. That's not flaky.

Set temperature to 0 for test runs. You won't eliminate variance entirely, but you'll cut most of it.
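One way to pin that down, assuming your client reads its settings from application config (the config key and module name here are hypothetical):

```elixir
# config/test.exs -- a sketch; adapt the key to your provider client.
import Config

config :my_app, MyApp.LLM,
  temperature: 0.0,
  # Pin the model version too, so test runs stay comparable over time.
  model: "your-pinned-model-version"
```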

And if a test does flake, pay attention. A prompt that produces inconsistent tool calls across runs is a fragile prompt. The flaky test just found a real problem for you.

If you actually want to test with some variability (temperature > 0, evaluating prose quality), run multiple passes and set a threshold. 8 out of 10 passes should succeed. That's an eval, not a unit test, and tools like Tribunal are built for exactly this.
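The hand-rolled shape of such an eval, before reaching for a library: run the same check N times concurrently and assert on the pass rate. Module and function names are illustrative:

```elixir
# Pass-rate eval sketch: 8 of 10 runs must succeed.
defmodule MyApp.ProseQualityEval do
  use ExUnit.Case, async: true

  @moduletag :llm
  @runs 10
  @required 8

  test "summary mentions the refund in at least 8 of 10 runs" do
    passes =
      1..@runs
      |> Task.async_stream(
        fn _ ->
          {:ok, text} = MyApp.LLM.summarize("Customer wants a refund for order 4521")
          String.contains?(String.downcase(text), "refund")
        end,
        timeout: :timer.seconds(30)
      )
      |> Enum.count(fn {:ok, passed?} -> passed? end)

    assert passes >= @required
  end
end
```

Task.async_stream keeps the ten calls concurrent, so the wall-clock cost is one call, not ten.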

The real risk

Here's what happens when you mock everything: you ship a prompt change, all tests pass, production breaks because the model interprets your system prompt differently now. Your mocks didn't care. They returned the same thing they always return.

You refactor your tool definitions, tests pass, the model stops calling tools correctly. Your cassettes are three model versions behind. Nobody noticed because the tests were green.

Green tests that don't test reality are worse than no tests. They're a green light to ship confidently into a broken product.

Do it

Add a handful of integration tests that call your LLM provider. Tag them, run them in parallel, assert on tool calls and response structure. Run them in CI by default, not as some optional nightly job nobody checks.

LLMs are not external services you happen to call. They're the brains of your application. Test the brains.
