Testing LLM outputs is frustrating. The model might say "The return policy is 30 days" or "You have thirty days to return items" or "Returns are accepted within a month of purchase." All correct, all different strings. Traditional assertions break immediately.
I built Alike to solve this with a single operator: <~>. It compares two strings by meaning, not by characters. It runs locally, costs nothing per assertion, and feels like a natural ExUnit extension.
The wave operator
import Alike.WaveOperator
assert "The quick brown fox jumps over the lazy dog" <~>
         "A fast auburn fox leaps over a sleeping canine"
That's it. Two sentences that mean the same thing, verified by semantic embeddings. No regex gymnastics, no substring hacks. If the meaning matches, the assertion passes.
How it works
Under the hood, Alike runs two models locally via Bumblebee:
- Sentence embeddings: converts both strings to 384-dimensional vectors using all-MiniLM-L6-v2, then computes cosine similarity
- Natural Language Inference (NLI): classifies the relationship as entailment, contradiction, or neutral to catch logical conflicts that raw similarity would miss
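The cosine-similarity step is simple enough to sketch directly. Here is a minimal, self-contained version, with short lists of floats standing in for the 384-dimensional embedding vectors (the real vectors come from the model, not from code like this):

```elixir
defmodule CosineSketch do
  # cosine(a, b) = dot(a, b) / (|a| * |b|)
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm_a = :math.sqrt(a |> Enum.map(&(&1 * &1)) |> Enum.sum())
    norm_b = :math.sqrt(b |> Enum.map(&(&1 * &1)) |> Enum.sum())
    dot / (norm_a * norm_b)
  end
end

CosineSketch.cosine([1.0, 0.0], [1.0, 0.0])
# => 1.0 (same direction, maximally similar)
```

A score of 1.0 means the vectors point the same way; orthogonal vectors score 0.0, which is why semantically unrelated sentences land near the bottom of the range.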
Everything runs locally. No API keys, no network calls, no cost per assertion. The models download once on first run (~460MB total, stored in ~/.cache/bumblebee/), and subsequent runs are instant.
Catching contradictions
NLI is what makes Alike more than a similarity score. If your LLM says "The price is $10" when it should have said "The price is $50", cosine similarity can still be high, since both sentences are about prices, so a similarity-only check would pass and hide the bug. Alike catches this by flagging it as a contradiction:
{:ok, result} = Alike.classify("The price is $10", "The price is $50")
result.label
# => "contradiction"
This matters for testing correctness, not just relevance.
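To make the interplay concrete, here is a hypothetical helper (not part of Alike's API) that combines the two signals the way the text describes: pass only when similarity clears a threshold and NLI does not flag a contradiction. The real `<~>` also weighs contradiction confidence, which this sketch omits.

```elixir
defmodule SemanticCheck do
  # Hypothetical helper: combines Alike.similarity/2 and Alike.classify/2
  # from the examples above. Passes only if the texts are similar enough
  # AND the NLI model does not classify them as contradictory.
  def semantically_ok?(actual, expected, threshold \\ 0.45) do
    {:ok, score} = Alike.similarity(actual, expected)
    {:ok, nli} = Alike.classify(actual, expected)
    score >= threshold and nli.label != "contradiction"
  end
end
```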
Beyond the operator
For more control, use the functions directly:
# Get raw similarity score (0.0 to 1.0)
{:ok, score} = Alike.similarity("machine learning", "artificial intelligence")
# score => 0.70
# Check with custom threshold
Alike.alike?("fast car", "quick automobile", threshold: 0.6)
# => true
Thresholds are configurable globally or per-call. The defaults (0.45 similarity, 0.8 contradiction confidence) work well for most cases, but you can tune them for your domain.
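Per-call tuning uses the `threshold:` option shown above. For global tuning, a config entry along these lines would be the natural fit; the key names here are illustrative assumptions, not Alike's documented config schema, so check the library docs for the exact keys:

```elixir
# config/test.exs -- key names are hypothetical, shown for illustration
config :alike,
  similarity_threshold: 0.6,
  contradiction_threshold: 0.9
```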
Where it fits
Alike powers the semantic assertions in Tribunal, my LLM evaluation framework for Elixir. When you write assert_semantic_similar response, "expected meaning" in a Tribunal test case, Alike is doing the heavy lifting underneath.
But it's useful on its own too. Any test where you're comparing natural language output benefits from semantic comparison: chatbot responses, translation quality, summarization accuracy, NLP pipeline validation.
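A chatbot test might look like the sketch below. `MyBot.reply/1` is a placeholder for whatever produces the text under test; the operator and import are Alike's as shown earlier:

```elixir
defmodule ReturnPolicyTest do
  use ExUnit.Case
  import Alike.WaveOperator

  test "states the 30-day return window" do
    # MyBot.reply/1 is a stand-in for your LLM-backed function
    response = MyBot.reply("What is your return policy?")
    assert response <~> "Items can be returned within 30 days of purchase"
  end
end
```

Any of the phrasings from the opening paragraph ("30 days", "thirty days", "within a month") would pass this assertion, which is the whole point.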
# In your mix.exs (test only)
{:alike, "~> 0.3.0", only: :test}