## How the cluster works

MirrorNeuron’s cluster stack has four components:

- BEAM node distribution — provides transparent cross-node message passing.
- libcluster — handles peer discovery and join coordination.
- Horde — provides distributed supervision so agent and job processes can migrate between nodes on failure.
- Shared Redis — stores durable job state, handles leader election via lease locks, and acts as the ultimate arbiter during network partitions.
## Leader election

Cluster-wide coordination tasks — such as sweeping orphaned jobs — are handled by a dynamically elected leader:

- The leader acquires a `cluster:leader` lease in Redis and refreshes it periodically.
- If the leader node crashes or becomes unreachable, the lease expires.
- Once the lease expires, another node acquires it and takes over leadership.
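The lease handoff above can be sketched in a few lines. This is an illustration of the pattern, not MirrorNeuron's implementation: `FakeRedis` is a hypothetical in-memory stand-in for a real Redis client's `SET key value NX PX ttl` semantics, and only the `cluster:leader` key name comes from the text.

```python
import time

class FakeRedis:
    """In-memory stand-in for Redis SET NX PX semantics (illustration only)."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set_nx_px(self, key, value, ttl_ms):
        """Acquire the lease only if the key is absent or its lease has expired."""
        now = time.monotonic()
        current = self.store.get(key)
        if current is None or current[1] <= now:
            self.store[key] = (value, now + ttl_ms / 1000)
            return True
        return False

    def refresh(self, key, value, ttl_ms):
        """Extend the lease, but only if this node still owns it."""
        now = time.monotonic()
        current = self.store.get(key)
        if current and current[0] == value and current[1] > now:
            self.store[key] = (value, now + ttl_ms / 1000)
            return True
        return False

def try_become_leader(redis, node_name, ttl_ms=150):
    return redis.set_nx_px("cluster:leader", node_name, ttl_ms)

redis = FakeRedis()
assert try_become_leader(redis, "node_a")              # node_a wins the lease
assert not try_become_leader(redis, "node_b")          # node_b is blocked while the lease is live
assert redis.refresh("cluster:leader", "node_a", 150)  # the leader keeps refreshing

time.sleep(0.2)                                        # node_a "crashes": no refresh, lease expires
assert try_become_leader(redis, "node_b")              # node_b takes over leadership
```

Note that the follower never steals a live lease; it only wins after the previous owner stops refreshing and the TTL runs out, which is what makes the handoff safe.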
## Job failover

When a job is submitted, a Job Coordinator process is started and managed by Horde. The coordinator holds a `job:<job_id>` ownership lease in Redis. If the node running the coordinator dies:
- Horde detects the failure and schedules the coordinator on a surviving peer.
- The new coordinator finds the job already exists in Redis, waits for the previous lease to expire if necessary, then resumes the job by reloading its persisted state.
Because all durable state lives in Redis, a node failure does not lose job progress. Agents that had already completed remain completed — only in-flight execution is replayed.
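The takeover sequence can be sketched as follows. Every name here (`job:42`, the agent names, the lease-tuple layout) is hypothetical; the sketch only illustrates the wait-for-lease-expiry-then-reload behaviour described above.

```python
import time

# Hypothetical lease table and persisted job state, standing in for Redis.
leases = {"job:42": ("node_a", time.monotonic() + 0.1)}   # dying node's lease
job_state = {"job:42": {"completed_agents": ["agent_1"], "in_flight": ["agent_2"]}}

def resume_job(job_key, new_owner):
    owner, expires_at = leases[job_key]
    # Wait out the previous owner's lease rather than stealing it.
    remaining = expires_at - time.monotonic()
    if remaining > 0:
        time.sleep(remaining)
    leases[job_key] = (new_owner, time.monotonic() + 5.0)
    state = job_state[job_key]        # reload persisted state from the durable store
    return state["in_flight"]         # only in-flight work is replayed

replayed = resume_job("job:42", "node_b")
print(replayed)  # → ['agent_2']  (completed agents are not re-run)
```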
## Required environment variables

Every node in the cluster must agree on the following values. Set them identically on all machines before starting any node.

| Variable | Purpose |
|---|---|
| `MIRROR_NEURON_COOKIE` | Shared secret that authenticates BEAM distribution connections. Must be identical on all nodes. |
| `MIRROR_NEURON_CLUSTER_NODES` | Comma-separated list of `name@ip` addresses for all nodes in the cluster. |
| `MIRROR_NEURON_REDIS_URL` | Connection URL for the shared Redis instance. |
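A quick pre-start sanity check might look like the following sketch. The helper name and example values are illustrative; only the three variable names come from the table above.

```python
import os

# The three variables from the table above; every node must set all of them.
REQUIRED = ["MIRROR_NEURON_COOKIE", "MIRROR_NEURON_CLUSTER_NODES", "MIRROR_NEURON_REDIS_URL"]

def missing_cluster_env(env):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example with one variable missing (values are made up for illustration).
example = {
    "MIRROR_NEURON_COOKIE": "s3cret",
    "MIRROR_NEURON_CLUSTER_NODES": "mn@192.168.4.20,mn@192.168.4.35",
}
print(missing_cluster_env(example))      # → ['MIRROR_NEURON_REDIS_URL']
print(missing_cluster_env(os.environ))   # check the real environment
```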
## Recommended dev-mode networking

Use fixed distribution ports during development to make network failures easier to reason about. Dynamic ephemeral ports make it harder to identify which connection is failing.

## Start a two-box cluster

Use the `start_cluster_node.sh` helper script to start each box. The script sets up the correct BEAM node name, distribution settings, and environment for the given box number.
### Start box 2

Run the helper script on the machine at 192.168.4.35. Both nodes connect to the same Redis instance running on box 1.

## Inspect cluster nodes
At any time, you can check which nodes the local runtime sees in the cluster. If a node is missing, check the `MIRROR_NEURON_CLUSTER_NODES` value and verify that port 4369 (epmd) and the distribution port are reachable between both machines.
## Submit a job to the cluster

Once both nodes are running and connected, submit a distributed job. For example, the `prime_sweep_scale` benchmark can be run across the two-node cluster with four parallel workers.
## Common failure patterns

### `:nodistribution` error
This error means BEAM cannot establish a distribution connection. Common causes:
- epmd is not running on one of the machines. Start it with `epmd -daemon`.
- Port 4369 (epmd) is blocked by a firewall between the two machines.
- The configured distribution port (default 4370 in dev mode) is blocked.
You can verify that epmd is reachable from the other machine:

```
telnet 192.168.4.35 4369
```

### Invalid challenge reply (cookie mismatch)

The `MIRROR_NEURON_COOKIE` value differs between the nodes. BEAM distribution uses this shared cookie to authenticate peers, so set it to the same value on every node and restart.
### Port 4000 already in use (eaddrinuse)
Two processes are competing for the same port. This typically happens when:

- Two runtime nodes are started on the same machine and both try to bind the web API on port 4000.
- The Erlang distribution port binding conflicts with the web API port.
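The symptom can be reproduced in isolation. This sketch binds two sockets to the same port, the way two colliding nodes would; the port itself is chosen dynamically so the demo does not collide with a real service.

```python
import socket

# First socket plays the role of the node that started first and owns the port.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
port = first.getsockname()[1]
first.listen()

# Second socket plays the second node trying to bind the same port.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # same port -> OSError (eaddrinuse)
    conflict = False
except OSError:
    conflict = True
finally:
    second.close()
    first.close()

print(conflict)  # → True
```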
### Node name already in use

A previous runtime instance or CLI control node is still registered with the same BEAM node name. This can happen after an unclean shutdown. Check for running beam processes with `pgrep -a beam` or `epmd -names`, and kill any stale instances before restarting.

### Split-brain
MirrorNeuron uses Redis lease locks as the authoritative source for cluster leadership and job ownership. If a network partition splits the cluster in two:
- The partition that can still reach Redis retains leadership and job ownership.
- The isolated partition cannot acquire or renew leases, so it cannot drive new coordination.