MirrorNeuron distributes work across multiple machines using a peer-to-peer model where every node runs the same binary and carries the same responsibilities. There is no master or designated coordinator — any node can accept a job submission and any node can host agent processes. This guide walks you through the cluster architecture, environment setup, starting a two-box cluster, inspecting membership, submitting a distributed job, and diagnosing common failure patterns.

How the cluster works

MirrorNeuron’s cluster stack has four components:
  • BEAM node distribution — provides transparent cross-node message passing
  • libcluster — handles peer discovery and join coordination
  • Horde — provides distributed supervision so agent and job processes can migrate between nodes on failure
  • Shared Redis — stores durable job state, handles leader election via lease locks, and acts as the ultimate arbiter during network partitions

Leader election

Cluster-wide coordination tasks — such as sweeping orphaned jobs — are handled by a dynamically elected leader:
  1. The leader acquires a cluster:leader lease in Redis and refreshes it periodically.
  2. If the leader node crashes or becomes unreachable, the lease expires.
  3. Another node immediately acquires the lease and takes over leadership.
This design eliminates a single point of failure for cluster coordination.
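The lease handshake above can be sketched with plain redis-cli calls. The cluster:leader key name comes from this guide; the 10-second TTL and the helper function names are illustrative assumptions, not MirrorNeuron's actual internals:

```shell
# Illustrative sketch of the Redis lease pattern, not MirrorNeuron's
# internal implementation. SET ... NX PX prints OK only when no other
# node currently holds the cluster:leader key.
try_acquire_leadership() {
  local node_name="$1"
  redis-cli -u "$MIRROR_NEURON_REDIS_URL" \
    SET cluster:leader "$node_name" NX PX 10000
}

# The elected leader refreshes the lease well before the TTL elapses;
# if the leader node dies, the key expires and another node's SET NX
# call succeeds, completing the takeover.
refresh_leadership() {
  redis-cli -u "$MIRROR_NEURON_REDIS_URL" PEXPIRE cluster:leader 10000
}
```

Because NX makes the SET succeed only when the key is absent, at most one node can hold the lease at a time, and the TTL bounds how long a dead leader can block a takeover.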

Job failover

When a job is submitted, a Job Coordinator process is started and managed by Horde. The coordinator holds a job:<job_id> ownership lease in Redis. If the node running the coordinator dies:
  1. Horde detects the failure and schedules the coordinator on a surviving peer.
  2. The new coordinator finds the job already exists in Redis, waits for the previous lease to expire if necessary, then resumes the job by reloading its persisted state.
Because all durable state lives in Redis, a node failure does not lose job progress. Agents that had already completed remain completed — only in-flight execution is replayed.
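You can observe the ownership lease directly in Redis while a failover is in progress. The job:<job_id> key pattern is from this guide; the helper name is an illustrative assumption:

```shell
# Check whether a job's ownership lease is still held (key pattern from
# this guide). PTTL prints the remaining lease time in milliseconds, or
# -2 if the key has expired or never existed.
job_lease_remaining_ms() {
  redis-cli -u "$MIRROR_NEURON_REDIS_URL" PTTL "job:$1"
}
```

A positive value means the previous coordinator's lease has not yet expired, which is exactly the window the replacement coordinator waits out before resuming the job.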

Required environment variables

Every node in the cluster must agree on the following values. Set them identically on all machines before starting any node.
export MIRROR_NEURON_COOKIE="mirrorneuron"
export MIRROR_NEURON_CLUSTER_NODES="mn1@192.168.4.29,mn2@192.168.4.35"
export MIRROR_NEURON_REDIS_URL="redis://192.168.4.29:6379/0"
  • MIRROR_NEURON_COOKIE: Shared secret that authenticates BEAM distribution connections. Must be identical on all nodes.
  • MIRROR_NEURON_CLUSTER_NODES: Comma-separated list of name@ip addresses for all nodes in the cluster.
  • MIRROR_NEURON_REDIS_URL: Connection URL for the shared Redis instance.
If MIRROR_NEURON_COOKIE differs between machines, all inter-node connections will fail with an “invalid challenge reply” error. This is the most common cluster setup mistake.
Use fixed distribution ports during development to make network failures easier to reason about. Dynamic ephemeral ports make it harder to identify which connection is failing.
export ERL_AFLAGS="-kernel inet_dist_listen_min 4370 inet_dist_listen_max 4370"
export MIRROR_NEURON_DIST_PORT="4370"
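With a fixed port you can confirm what the local node actually bound by asking epmd, which lists every registered BEAM node on the machine together with its distribution port. The wrapper function name is an illustrative assumption:

```shell
# Confirm the local node registered the expected distribution port with
# epmd. `epmd -names` prints one "name <node> at port <port>" line per
# registered BEAM node on this machine.
verify_dist_port() {
  epmd -names | grep -q "port $MIRROR_NEURON_DIST_PORT"
}
```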

Start a two-box cluster

Use the start_cluster_node.sh helper script to start each box. The script sets up the correct BEAM node name, distribution settings, and environment for the given box number.
1. Start box 1

Run this on the machine at 192.168.4.29:
bash scripts/start_cluster_node.sh --box1-ip 192.168.4.29 --box2-ip 192.168.4.35 --box 1
2. Start box 2

Run this on the machine at 192.168.4.35:
bash scripts/start_cluster_node.sh --box1-ip 192.168.4.29 --box2-ip 192.168.4.35 --box 2 --redis-host 192.168.4.29
Both nodes connect to the same Redis instance running on box 1.
3. Verify the cluster formed

From box 1, inspect the node list:
bash scripts/cluster_cli.sh \
  --box1-ip 192.168.4.29 \
  --box2-ip 192.168.4.35 \
  --self-ip 192.168.4.29 \
  -- inspect nodes
You should see both mn1@192.168.4.29 and mn2@192.168.4.35 in the output.

Inspect cluster nodes

At any time, you can check which nodes the local runtime sees in the cluster:
./mirror_neuron node list
From a control node connected to the cluster, use the cluster CLI wrapper:
bash scripts/cluster_cli.sh \
  --box1-ip 192.168.4.29 \
  --box2-ip 192.168.4.35 \
  --self-ip 192.168.4.29 \
  -- inspect nodes
A healthy two-node cluster shows:
mn1@192.168.4.29
mn2@192.168.4.35
If only one node appears, the nodes have not yet discovered each other — check your MIRROR_NEURON_CLUSTER_NODES value and verify that port 4369 (epmd) and the distribution port are reachable between both machines.
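That reachability check can be scripted from either box. This sketch uses nc as a stand-in for any port probe, and the 4370 value assumes the fixed development distribution port configured above:

```shell
# Probe the two ports cluster formation depends on: 4369 for epmd and
# the fixed distribution port (4370 in the dev setup from this guide).
check_cluster_ports() {
  local peer_ip="$1"
  local port
  for port in 4369 4370; do
    if nc -z -w 2 "$peer_ip" "$port"; then
      echo "port $port reachable"
    else
      echo "port $port BLOCKED"
    fi
  done
}
```

Run it against the other box's IP from each machine; both ports must be reachable in both directions for the nodes to discover each other.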

Submit a job to the cluster

Once both nodes are running and connected, submit a distributed job. The following example runs the prime_sweep_scale benchmark across the two-node cluster with four parallel workers:
bash examples/prime_sweep_scale/run_scale_test.sh \
  --workers 4 \
  --start 1000003 \
  --box1-ip 192.168.4.29 \
  --box2-ip 192.168.4.35 \
  --self-ip 192.168.4.29
Horde distributes the agent processes across both nodes. You can watch placement in real time using the cluster monitor:
./mirror_neuron monitor \
  --box1-ip 192.168.4.29 \
  --box2-ip 192.168.4.35 \
  --self-ip 192.168.4.29

Common failure patterns

Distribution connection failures

BEAM cannot establish a distribution connection between the nodes. Common causes:
  • epmd is not running on one of the machines. Start it with epmd -daemon.
  • Port 4369 (epmd) is blocked by a firewall between the two machines.
  • The configured distribution port (default 4370 in dev mode) is blocked.
Verify connectivity: telnet 192.168.4.35 4369

Port already in use

Two processes are competing for the same port. This typically happens when:
  • Two runtime nodes are starting on the same machine and both try to bind the web API on port 4000.
  • The Erlang distribution port binding conflicts with the web API port.
Change the web API port on one of the nodes:
export MIRROR_NEURON_API_PORT=4001
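Before changing the port, it can help to confirm what is already bound there. This sketch uses ss, which ships with most Linux distributions (lsof -i :4000 is a common alternative); the wrapper name is an illustrative assumption:

```shell
# Show the listener currently bound to the given TCP port, if any.
# -l: listening sockets, -t: TCP, -n: numeric, -p: owning process.
who_holds_port() {
  ss -ltnp "sport = :$1"
}
```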

Stale node name

A previous runtime instance or CLI control node is still registered with the same BEAM node name. This can happen after an unclean shutdown. Check for running BEAM processes with pgrep -a beam or epmd -names, then kill any stale instances before restarting.

Network partitions

MirrorNeuron uses Redis lease locks as the authoritative source for cluster leadership and job ownership. If a network partition splits the cluster in two:
  • The partition that can still reach Redis retains leadership and job ownership.
  • The isolated partition cannot acquire or renew leases, so it cannot drive new coordination.
When the partition heals, the previously isolated nodes rejoin and reconcile against Redis state. No manual intervention is needed for job state recovery.