Hallmark: detect LLM hallucinations locally in Elixir

Score whether an LLM's output follows from your source material or is hallucinated. Runs locally in Elixir; no API calls.

[Image: a raw gold nugget being scratched against a dark touchstone, leaving a bright streak.]

I just open-sourced Hallmark, a small Elixir library I wrote to score whether an LLM's output is actually grounded in source material or if it's making stuff up.

It runs Vectara's HHEM model (a fine-tuned FLAN-T5, 184M params) entirely on your machine. No API calls, no sending your data anywhere.

What it does

You give it a premise (your source text) and a hypothesis (what the LLM said). It returns a score between 0 and 1.

{:ok, model} = Hallmark.load(compiler: EXLA)

{:ok, score} = Hallmark.score(model,
  "The capital of France is Paris.",
  "Paris is the capital of France."
)
# => 0.98

{:ok, score} = Hallmark.score(model,
  "The capital of France is Paris.",
  "The capital of France is Berlin."
)
# => 0.01

It's not checking if things are true in general. It's checking entailment: does the hypothesis follow from the premise? "Dogs are mammals" is a true statement, but if your premise only says "the dog sat on the mat", the score will be low because the premise doesn't support that claim.

This is what I wanted for RAG pipelines: a way to verify that the LLM's response is grounded in the retrieved context, not invented from training data.

The API

Two modules, four public functions.

# load the model (~440MB download on first run, cached after)
{:ok, model} = Hallmark.load(compiler: EXLA)

# score a single pair
{:ok, score} = Hallmark.score(model, premise, hypothesis)

# score a batch (way more efficient than looping)
{:ok, scores} = Hallmark.score_batch(model, [{premise1, hyp1}, {premise2, hyp2}])

# get a binary label instead of a score
{:ok, :consistent} = Hallmark.evaluate(model, premise, hypothesis)
{:ok, :hallucinated} = Hallmark.evaluate(model, premise, hypothesis, threshold: 0.8)

That's it. The model struct is opaque; you just pass it around. I wanted it to feel like calling any other function.
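Conceptually, evaluate is just score plus a cutoff. A minimal sketch of that thresholding logic, with an illustrative module name and an arbitrary example threshold (this is not Hallmark's internals, and 0.5 is not necessarily its documented default):

```elixir
# Illustrative only: turn a raw entailment score into a binary label,
# the way a threshold-based evaluate would.
defmodule GroundingLabel do
  # Scores at or above the threshold count as grounded.
  def classify(score, threshold) when score >= threshold, do: :consistent
  def classify(_score, _threshold), do: :hallucinated
end

GroundingLabel.classify(0.98, 0.5)
# => :consistent

GroundingLabel.classify(0.65, 0.8)
# => :hallucinated
```

The threshold is the knob you tune per use case: a strict 0.8 catches more hallucinations at the cost of flagging some valid paraphrases.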

It catches subtle stuff

The model understands asymmetric relationships, which surprised me at first. "I am in California" entails "I am in the United States" (score: 0.65), but "I am in the United States" does not entail "I am in California" (score: 0.13). Makes sense once you think about it: California is in the US but the US is not in California.

It also catches relational swaps. "Mark Wahlberg was a fan of Manny" scores 0.05 against "Manny was a fan of Mark Wahlberg". Same words, different meaning, low score.

Speed

With EXLA, I get ~170ms per scoring call. Batch scoring is faster per pair. Without a compiler it falls back to pure Nx, which takes minutes per call, so you really want EXLA or EMLX (for Apple Silicon).

The model is about 440MB. The first Hallmark.load/1 call downloads it; after that it's cached locally.

Where I'm using it

I'm building a grounding feature in Arcana where I verify that the LLM's responses are actually supported by the context we fed it. Hallmark runs as a post-processing step: score each claim against the source material, flag anything below the threshold.
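The shape of that post-processing step can be sketched like this. Only Hallmark.score_batch/2 is the library's API; the module, function names, and wiring below are illustrative:

```elixir
defmodule GroundingCheck do
  # Pure flagging step: given {claim, score} pairs (scores from
  # Hallmark.score_batch/2), return the claims below the threshold.
  def flag(scored_claims, threshold) do
    for {claim, score} <- scored_claims, score < threshold, do: claim
  end
end

# Wiring it up (sketch; assumes `model`, `context`, and `claims` are in scope):
#
#   pairs = Enum.map(claims, &{context, &1})
#   {:ok, scores} = Hallmark.score_batch(model, pairs)
#   flagged = GroundingCheck.flag(Enum.zip(claims, scores), 0.8)
```

Keeping the flagging step pure makes it trivial to unit-test without loading the model, and the single score_batch call gets the per-pair speedup mentioned above.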

The fact that it runs locally matters to me. I don't want to send customer data to yet another API just to verify it. And at ~170ms per check, it's fast enough to run in the request path.

Try it

# mix.exs
{:hallmark, "~> 1.0"},
{:exla, "~> 0.10"}  # or {:emlx, ...} for Apple Silicon

GitHub: georgeguimaraes/hallmark


George Guimarães builds agentic commerce infrastructure at New Generation. Previously: Principal Engineer at a unicorn fintech, co-founder of Plataformatec (acqui-hired by Nubank).

