The previous post about AI agent frameworks and Elixir landed on Hacker News and the pushback was fair. Some of it pointed at real gaps worth addressing directly.
Criticism 1: State is lost on restart — you don't have durable execution
The original post didn't address this one. The commenter who put it best:
BEAM is great at surviving process crashes, but if the whole cluster goes down or you redeploy, that in-memory state evaporates. It's not magic. For agents that might hang around for days, pure Elixir isn't enough, you still need a persistence layer.
True. BEAM solves process-level fault tolerance: crash a process, the supervisor restarts it. It doesn't solve execution-level durability: if the whole node goes down or you deploy new code, in-memory state is gone. Supervision trees can't save you from kill -9 beam.smp.
Temporal.io solves a different problem. It records every decision your workflow makes (which activities were scheduled, what their results were, what signals arrived) as an append-only event log in a database. When a worker crashes and restarts, it replays that log to reconstruct exactly where it was. Your code is sequential and imperative; the infrastructure makes it durable.
```python
# This Temporal workflow survives the death of the worker
# at any point between any two awaits
@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, session_id: str) -> str:
        response = await workflow.execute_activity(call_llm, ...)
        tool_result = await workflow.execute_activity(run_tool, ...)
        final = await workflow.execute_activity(call_llm, [response, tool_result], ...)
        return final
```
If the worker dies after call_llm completes but before run_tool starts, Temporal restarts the workflow, replays the history (returning the recorded LLM result without re-calling it), and resumes at run_tool. Nothing is repeated, nothing is lost.
Elixir doesn't have this out of the box. If your GenServer crashes mid-handle_call and the supervisor restarts it, you get a clean state. The LLM call you were in the middle of is gone.
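You can close that gap by hand, if tediously. Here's a minimal sketch of the Temporal idea in plain Elixir: record each step's result before moving on, and on restart return recorded results instead of re-running the step. Everything here is hypothetical (a real version would persist the log to Postgres rather than thread a map around):

```elixir
# Hypothetical sketch of event-log replay, Temporal-style.
defmodule DurableRun do
  # `log` maps step name => recorded result. In real use this map
  # would be persisted durably before each reply, not held in memory.
  def step(log, name, fun) do
    case Map.fetch(log, name) do
      {:ok, result} ->
        # This step completed before the crash: replay the recorded
        # result without re-running the side effect.
        {result, log}

      :error ->
        result = fun.()
        {result, Map.put(log, name, result)}
    end
  end
end

# First run: both steps execute and are recorded.
{response, log} = DurableRun.step(%{}, :call_llm, fn -> "draft answer" end)
{_tool, log} = DurableRun.step(log, :run_tool, fn -> "tool output" end)

# "Restart": replaying against the saved log returns the recorded
# result; the function is never invoked again.
{replayed, _log} = DurableRun.step(log, :call_llm, fn -> raise "should not re-run" end)
```

The whole trick is that the step functions never run twice; that's what Temporal's event history buys you, minus the retry policies, timers, and tooling.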
What exists today
There are real answers in the ecosystem, just not a Temporal equivalent yet.
durable_object is an Elixir library that takes the Cloudflare Durable Objects approach: every state change is synchronously persisted to Postgres before the reply is sent. State survives restarts because it's always on disk.
```elixir
defmodule MyApp.AgentSession do
  use DurableObject

  state do
    field :messages, {:array, :map}, default: []
    field :status, :string, default: "active"
    field :last_tool_result, :map, default: nil
  end

  handlers do
    handler :add_message, args: [:message]
    handler :get_history
  end

  def handle_add_message(message, state) do
    new_messages = state.messages ++ [message]
    # State is persisted to Postgres before this reply is sent.
    # Crash after this? Next request loads from DB and continues.
    {:reply, :ok, %{state | messages: new_messages}}
  end

  def after_load(state) do
    # Called on process start after state loads from DB.
    # Inspect state, heal inconsistencies, resume in-progress work.
    case state.status do
      "tool_executing" -> %{state | status: "tool_interrupted"}
      _ -> state
    end
  end
end
```
It covers the core complaint: redeploy your cluster, the agent resumes from where it was. You don't get execution replay, activity retry with independent policies, or a history log you can inspect, but for most agent session use cases, you don't need those.
For job-style workloads (async tasks, scheduled processing, retry logic), Oban is mature and battle-tested. It's a solid backbone for anything that looks like a queue with durable jobs. Oban recently came to Python too.
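For flavor, a minimal Oban worker. The queue name, job args, and the MyApp.Tools.run/2 helper are illustrative, not from any real app:

```elixir
defmodule MyApp.ToolJob do
  # Jobs are rows in Postgres: they survive node death and redeploys,
  # and max_attempts gives per-job retries with backoff.
  use Oban.Worker, queue: :agents, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"session_id" => id, "tool" => tool}}) do
    MyApp.Tools.run(tool, id)
  end
end

# Enqueueing is a durable database insert; the job then runs on
# whichever node is up, even if this one dies a moment later.
%{session_id: "abc123", tool: "search"}
|> MyApp.ToolJob.new()
|> Oban.insert()
```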
What's missing
A proper Temporal equivalent for Elixir. The community knows this, and it's probably the biggest gap in Elixir's agentic story right now.
Elixir is better suited for building one than Python. Enforcing determinism is the core challenge for Temporal's Python SDK, and Elixir's functional, immutable-by-default style makes replay naturally deterministic. The primitives map cleanly onto OTP. Even the hardest part (versioning, making code changes backward-compatible with running workflow histories) feels like an area where Elixir's hot code reloading and pattern matching could offer a better foundation than Temporal's patched() guards, though that's still an open design problem. I know people are working on that.
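To make the versioning point concrete, here's pure speculation, not any existing library: imagine replay dispatching on a version tag recorded in each history event, so old and new event shapes coexist as ordinary function clauses instead of patched() branches inside the workflow body:

```elixir
# Speculative sketch: versioned history replay via pattern matching.
# None of this is a real API.
defmodule Replay do
  # Old histories recorded a single :summarize step...
  def apply_event({:v1, :summarize, text}, state) do
    %{state | summary: text}
  end

  # ...new histories record :summarize_and_tag. Both clauses coexist,
  # so in-flight v1 histories still replay correctly under new code.
  def apply_event({:v2, :summarize_and_tag, text, tags}, state) do
    %{state | summary: text, tags: tags}
  end
end

state = %{summary: nil, tags: []}
state = Replay.apply_event({:v1, :summarize, "old event"}, state)
state = Replay.apply_event({:v2, :summarize_and_tag, "new event", ["billing"]}, state)
```

Whether this scales to the hard cases Temporal's patched() handles (changing control flow mid-history, not just event shapes) is exactly the open design problem.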
Criticism 2: "Let it crash" doesn't work for probabilistic errors
The argument from the thread:
If an LLM returns garbage, restarting the process with the same prompt and temperature 0 yields the same garbage. An Erlang Supervisor restarts a process in a clean state. For an agent "clean state" = lost conversation context.
It's a misapplication of "let it crash", one I implicitly encouraged by not being precise about what the pattern is actually for.
"Let it crash" is designed for unexpected, unrecoverable failures: hardware errors, network timeouts, malformed data from external systems, bugs that put a process in an inconsistent state. Don't try to handle the unexpected: restart into a known-good state. It works because those failures are transient and rare.
An LLM returning garbage is neither transient nor a process-state problem. Restarting doesn't help because the process state wasn't the issue. The LLM output was. That needs semantic error handling, not process supervision.
Elixir has the primitives for this. They just require explicit design.
```elixir
defmodule MyApp.AgentSession do
  use GenServer

  def handle_call({:ask, question}, _from, state) do
    with {:ok, response} <- LLM.chat(state.messages ++ [question]),
         :ok <- validate_response(response) do
      {:reply, {:ok, response}, update_messages(state, question, response)}
    else
      # Each recovery helper returns a full {:reply, ...} tuple.
      {:error, :hallucination} -> retry_with_grounding(question, state)
      {:error, :context_overflow} -> state |> compress_context() |> retry_question(question)
      {:error, :rate_limited} -> {:reply, {:error, :rate_limited}, state}
      # Genuinely unexpected errors don't match — process crashes, supervisor handles it
    end
  end
end
```
Errors you can anticipate (bad LLM output, context overflow, rate limits) need application-level handling. Errors you can't anticipate (corrupted state, unexpected nils, library bugs) are what supervision trees are for.
What Elixir doesn't have out of the box is what one commenter called "semantic supervision trees": supervisors that can inspect the crash reason and choose different restart strategies based on what failed. You can build this with Supervisor callback customization and careful use of terminate/2, but it's not automatic. It's a problem worth solving, and it's just as unsolved in other runtimes, which don't even have supervision infrastructure to build it on.
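A hand-rolled sketch of the idea, using nothing but spawn_monitor; the crash reasons and strategy names are made up for illustration:

```elixir
# Sketch of "semantic supervision": route on *why* a process died,
# not just *that* it died. A stock Supervisor applies one uniform
# restart strategy; this watcher picks a recovery per exit reason.
defmodule SemanticWatcher do
  def watch(fun, on_crash) do
    {pid, ref} = spawn_monitor(fun)

    receive do
      {:DOWN, ^ref, :process, ^pid, :normal} ->
        :ok

      {:DOWN, ^ref, :process, ^pid, reason} ->
        on_crash.(reason)
    end
  end
end

strategy =
  SemanticWatcher.watch(
    # Simulate a worker that dies with a semantic exit reason.
    fn -> exit({:llm_error, :hallucination}) end,
    fn
      {:llm_error, _} -> :retry_with_different_prompt
      {:timeout, _} -> :retry_same_prompt
      _ -> :restart_clean
    end
  )
```

A production version would live inside a supervisor's restart logic rather than a bare receive loop, but the primitive doing the work is the same monitor message.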
Criticism 3: Your concurrency argument doesn't apply to I/O-bound workloads
The most common rebuttal: if 95% of your agent's time is spent waiting on OpenAI, the BEAM's scheduling model is irrelevant. Any async runtime (Node.js, Python asyncio, Go goroutines) handles "wait for HTTP response" at scale. You're not doing work; you're waiting. The BEAM can't make OpenAI faster.
That's partially right. For a simple single-tier pipeline (receive request, call LLM, return response) the runtime doesn't matter much. Node.js handles 10,000 concurrent await fetch(openai) calls without drama.
The concurrency argument still holds once you move past simple pipelines, though. Two reasons:
GC pauses compound at scale. Node.js has stop-the-world garbage collection. With 10,000 open connections, a GC pause stalls all of them simultaneously. BEAM's garbage collector is per-process: each process has its own heap and collects independently. At low concurrency this is invisible. At scale it shows up as latency spikes that hit every connection at once.
Real agents aren't pure I/O. Parsing structured outputs, running embedding comparisons, managing context windows, routing between models: this is real CPU work happening concurrently with other agents. Python's GIL means CPU work on one coroutine can starve others. BEAM preempts after 4,000 reductions regardless of what the process is doing.
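The preemption claim is easy to demonstrate: a process stuck in a tight CPU loop doesn't stop another process from answering. (On a multi-core machine this passes trivially; run the VM with a single scheduler, elixir --erl "+S 1", to see preemption alone doing the work.)

```elixir
# A hot CPU loop with no receive and no sleeping: on a cooperative
# runtime this could starve everything sharing its thread.
busy =
  spawn(fn ->
    Stream.iterate(0, &(&1 + 1)) |> Enum.each(fn _ -> :ok end)
  end)

# The scheduler preempts the loop every few thousand reductions,
# so this unrelated task still gets CPU time and replies promptly.
task = Task.async(fn -> :pong end)
result = Task.await(task, 1_000)

Process.exit(busy, :kill)
```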
And if you're running models locally, Elixir can do that natively. Bumblebee runs transformer models (embeddings, rerankers, vision models, local LLMs) directly in the BEAM via Nx and EXLA. That inference workload lives in the same supervision tree as everything else, with the same fault tolerance and process model, no separate Python sidecar to manage. CPU-bound GPU inference running alongside I/O-bound API calls, all coordinated by the same runtime. The "it's all I/O anyway" argument doesn't hold there.
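A sketch of what that looks like with Bumblebee; the model checkpoint here is just an example, and the first run downloads weights from Hugging Face:

```elixir
repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

serving = Bumblebee.Text.text_embedding(model_info, tokenizer)

# Inference runs under the same supervision tree as the rest of
# the app: if it crashes, the supervisor restarts it like any
# other child, and concurrent callers are batched automatically.
children = [
  {Nx.Serving, serving: serving, name: MyApp.Embedder, batch_size: 8}
]

Supervisor.start_link(children, strategy: :one_for_one)

%{embedding: embedding} = Nx.Serving.batched_run(MyApp.Embedder, "hello agents")
```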
The criticism assumes agents stay simple. As they get more complex (running local models, doing structured output parsing, coordinating sub-agents, streaming results while handling tool calls in parallel) the workload stops being pure I/O and the runtime starts mattering a lot. The BEAM was built for exactly that mix of concurrent, heterogeneous work.
What actually stands
BEAM gives you the right primitives for multi-agent coordination, concurrency, and transient fault recovery. For durable execution across full restarts and long-running workflows, you need to add a persistence layer, and it's not fully there yet. Those are different problems.
The durability gap is real, but it's not unique to Elixir. Every runtime needs a database to solve it. Node.js needs one. Python needs one. The difference is that the BEAM ships with Mnesia, a distributed, fault-tolerant database built into the runtime itself. Mnesia has its own quirks and isn't the right fit for everything, but the option is there without reaching for an external service.
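For the curious, the out-of-the-box path looks like this (a RAM table for the example; a disc_copies table would survive restarts):

```elixir
# Mnesia ships with OTP; no external service to stand up.
:mnesia.start()
:mnesia.create_table(:agent_session, attributes: [:id, :messages])

# Writes and reads happen inside transactions.
:mnesia.transaction(fn ->
  :mnesia.write({:agent_session, "s1", ["hello"]})
end)

{:atomic, [{:agent_session, "s1", messages}]} =
  :mnesia.transaction(fn -> :mnesia.read(:agent_session, "s1") end)
```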
One gap doesn't cancel out everything else. Process isolation, preemptive scheduling, per-process GC, supervision trees, hot code reloading, built-in distribution, PubSub, Registry, and runtime observability (:observer, telemetry, live process inspection without restarting): these are real primitives you get the moment you pick Elixir. You don't have to bolt them on, design around them, or fight the language to get them. That's a lot of hard problems you don't have to solve yourself.
Once you have this infrastructure in place (durable state, fault isolation, real-time coordination, semantic error handling), the question stops being "how do I solve these distributed systems problems" and starts being "what can I actually build now that those problems are out of the way."
George Guimarães builds agentic commerce infrastructure at New Generation. Previously: Principal Engineer at a unicorn fintech, co-founder of Plataformatec (acqui-hired by Nubank).