When I was building the retrieval layer for Arcana, single-vector search kept disappointing me on technical queries. A document could talk extensively about GenServers but only briefly mention concurrency, and it would rank below a shallow article that mentioned both terms once. I wanted ColBERT-style per-token matching in Elixir, and nothing existed, so I built Stephen.
Most vector search compresses an entire document into a single embedding, then compares it to a query embedding. It's fast and simple, but you lose granularity. A long document about multiple topics gets averaged into one vector, and subtle matches between specific query terms and document terms disappear.
Stephen takes a different approach: ColBERT-style late interaction retrieval. Instead of one embedding per document, it keeps one embedding per token. At search time, each query token finds its best match among all document tokens, and those per-token scores are summed. The result is much better retrieval for complex queries where individual terms matter.
Why this matters
Say you search for "Elixir concurrency patterns for GenServer." With single-vector search, that entire query becomes one point in embedding space. If a document talks extensively about GenServer but only briefly mentions concurrency, it might rank lower than a generic article that mentions both terms once.
With Stephen's per-token matching, "GenServer" in your query matches strongly against "GenServer" tokens in the document. "Concurrency" matches against concurrency discussion. "Patterns" matches against code examples. Each concept gets matched independently, and the scores combine. Documents that actually cover the terms you searched for rank higher.
If you're doing technical search and care about matching specific terms, this is a big deal. And unlike cross-encoder reranking (which runs the full model for every query-document pair), late interaction pre-computes document embeddings. The per-token matching at query time is just matrix operations.
How late interaction works
Traditional dense retrieval: compress text into a single vector, compute cosine similarity. Simple, fast, lossy.
Late interaction: keep per-token embeddings, match query tokens to document tokens individually using MaxSim (each query token takes the maximum similarity over all document tokens), then sum the scores. More computation, but far better at matching specific terms and concepts.
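To make MaxSim concrete, here's a minimal sketch over plain lists. This is not Stephen's internals (which work on tensors); it just shows the scoring rule: for each query token embedding, take its maximum dot product against all document token embeddings, then sum those maxima.

```elixir
defmodule MaxSimSketch do
  # Dot product of two embedding vectors represented as lists.
  def dot(a, b) do
    Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
  end

  # MaxSim: each query token keeps only its best document-token match,
  # and the per-token maxima are summed into one relevance score.
  def score(query_embs, doc_embs) do
    query_embs
    |> Enum.map(fn q -> doc_embs |> Enum.map(&dot(q, &1)) |> Enum.max() end)
    |> Enum.sum()
  end
end

# Two query tokens, three document tokens (toy 2-d embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
MaxSimSketch.score(query, doc)
# per-token maxima are 0.9 and 0.8, so the score is ~1.7
```

Note that an off-topic token in the document can't drag the score down: each query token only contributes its single best match.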
```elixir
# Load encoder and create index
encoder = Stephen.Encoder.load()
index = Stephen.Index.new(encoder)

# Add documents
index = Stephen.Index.add(index, [
  "Elixir is a functional programming language",
  "Phoenix is a web framework for Elixir",
  "Nx brings numerical computing to Elixir"
])

# Search
results = Stephen.search(index, encoder, "web development with Elixir")
# => [{1, 24.5}, {0, 18.2}, {2, 15.1}]
```
Three index types
Stephen has three index backends:
- Stephen.Index is the standard in-memory backend, backed by HNSW for approximate nearest-neighbor search. Best for small-to-medium collections with frequent updates.
- Stephen.Plaid uses centroid-based partitioning for sub-linear search. Use this when the standard index gets slow.
- Stephen.Index.Compressed quantizes embeddings with 4-32x compression. Use this when memory is the bottleneck.
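As a sketch of how the choice might look in practice. I'm assuming here that Stephen.Plaid and Stephen.Index.Compressed mirror the new/add shape of Stephen.Index; that's an assumption, not something the examples above document, so the alternative constructors are shown commented out:

```elixir
encoder = Stephen.Encoder.load()

# Small-to-medium collection, frequent updates:
# the standard HNSW-backed in-memory index.
index = Stephen.Index.new(encoder)

# Larger, mostly-static collection: centroid partitioning
# for sub-linear search (constructor shape assumed).
# index = Stephen.Plaid.new(encoder)

# Memory-bound deployment: quantized embeddings at 4-32x
# compression (constructor shape assumed).
# index = Stephen.Index.Compressed.new(encoder)
```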
Reranking
You can also use Stephen as a reranker on top of a faster initial retrieval system. Pass in candidates from your vector search or BM25 index, and Stephen re-scores them with token-level precision:
```elixir
# Rerank results from another system
candidates = ["doc1 text", "doc2 text", "doc3 text"]
reranked = Stephen.rerank(encoder, "your query", candidates)
```
This is the shape Arcana's reranking pipeline can take: fast vector search for recall, then Stephen for precision on the top candidates.
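A sketch of that two-stage shape, using Stephen only for the second stage. The fetch_candidates function here is a stand-in I've made up for whatever fast retriever you already run (pgvector, BM25, and so on), not part of Stephen:

```elixir
encoder = Stephen.Encoder.load()
query = "web development with Elixir"

# Stage 1: cheap recall from your existing retriever.
# Stand-in implementation; a real one would return the top-k
# candidate texts from vector search or BM25.
fetch_candidates = fn _q ->
  ["doc1 text", "doc2 text", "doc3 text"]
end

candidates = fetch_candidates.(query)

# Stage 2: token-level precision over just those candidates.
reranked = Stephen.rerank(encoder, query, candidates)
```

The point of the split is cost: the expensive per-token matching only ever runs over a handful of candidates, never the whole corpus.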
Query expansion
Stephen includes pseudo-relevance feedback for query expansion. It takes the top results from an initial search, extracts related terms, and re-searches with an enriched query. You get better recall for the cost of one extra search pass.
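The mechanism is easy to sketch by hand using nothing beyond Stephen.search. This assumes the index, encoder, and docs list from the earlier example, and the extract_terms function is a deliberately naive stand-in for real term selection (which would typically be IDF-weighted):

```elixir
# Manual pseudo-relevance feedback, for illustration only;
# Stephen's built-in expansion handles this internally.
query = "web development with Elixir"

# Naive stand-in: split a document into lowercase terms.
extract_terms = fn doc -> doc |> String.downcase() |> String.split() end

initial = Stephen.search(index, encoder, query)

expansion_terms =
  initial
  |> Enum.take(3)
  |> Enum.flat_map(fn {doc_id, _score} ->
    extract_terms.(Enum.at(docs, doc_id))
  end)
  |> Enum.uniq()
  |> Enum.take(5)

# Re-search with the enriched query.
expanded = Enum.join([query | expansion_terms], " ")
results = Stephen.search(index, encoder, expanded)
```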
See what matched
One of my favorite features: Stephen can show you exactly which query tokens matched which document tokens, with similarity scores for each pair. When your search results look off, you can inspect the token-level matching to understand why.
Installation
Add Stephen to the deps in your mix.exs:

```elixir
{:stephen, "~> 1.0"}
```