## How the cluster works

MirrorNeuron’s cluster stack has four components:

- BEAM node distribution — provides transparent cross-node message passing.
- libcluster — handles peer discovery and join coordination.
- Horde — provides distributed supervision so agent and job processes can migrate between nodes on failure.
- Shared Redis — stores durable job state, handles leader election via lease locks, and acts as the ultimate arbiter during network partitions.
## Leader election

Cluster-wide coordination tasks — such as sweeping orphaned jobs — are handled by a dynamically elected leader:

- The leader acquires a `cluster:leader` lease in Redis and refreshes it periodically.
- If the leader node crashes or becomes unreachable, the lease expires.
- Once the lease expires, another node acquires it and takes over leadership.
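The lease handoff above can be sketched in a few lines. This is an illustration of the pattern, not MirrorNeuron's implementation: `FakeRedis` is a hypothetical in-memory stand-in for a real Redis client's `SET key value NX PX ttl` semantics, and only the `cluster:leader` key name comes from the text.

```python
import time

class FakeRedis:
    """In-memory stand-in for Redis SET NX PX semantics (illustration only)."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set_nx_px(self, key, value, ttl_ms):
        """Acquire the lease only if the key is absent or its lease has expired."""
        now = time.monotonic()
        current = self.store.get(key)
        if current is None or current[1] <= now:
            self.store[key] = (value, now + ttl_ms / 1000)
            return True
        return False

    def refresh(self, key, value, ttl_ms):
        """Extend the lease, but only if this node still owns it."""
        now = time.monotonic()
        current = self.store.get(key)
        if current and current[0] == value and current[1] > now:
            self.store[key] = (value, now + ttl_ms / 1000)
            return True
        return False

def try_become_leader(redis, node_name, ttl_ms=150):
    return redis.set_nx_px("cluster:leader", node_name, ttl_ms)

redis = FakeRedis()
assert try_become_leader(redis, "node_a")              # node_a wins the lease
assert not try_become_leader(redis, "node_b")          # node_b is blocked while the lease is live
assert redis.refresh("cluster:leader", "node_a", 150)  # the leader keeps refreshing

time.sleep(0.2)                                        # node_a "crashes": no refresh, lease expires
assert try_become_leader(redis, "node_b")              # node_b takes over leadership
```

Note that the follower never steals a live lease; it only wins after the previous owner stops refreshing and the TTL runs out, which is what makes the handoff safe.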
## Job failover

When a job is submitted, a Job Coordinator process is started and managed by Horde. The coordinator holds a `job:<job_id>` ownership lease in Redis. If the node running the coordinator dies:
- Horde detects the failure and schedules the coordinator on a surviving peer.
- The new coordinator finds the job already exists in Redis, waits for the previous lease to expire if necessary, then resumes the job by reloading its persisted state.
Because all durable state lives in Redis, a node failure does not lose job progress. Agents that had already completed remain completed — only in-flight execution is replayed.
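The takeover sequence can be sketched as follows. Every name here (`job:42`, the agent names, the lease-tuple layout) is hypothetical; the sketch only illustrates the wait-for-lease-expiry-then-reload behaviour described above.

```python
import time

# Hypothetical lease table and persisted job state, standing in for Redis.
leases = {"job:42": ("node_a", time.monotonic() + 0.1)}   # dying node's lease
job_state = {"job:42": {"completed_agents": ["agent_1"], "in_flight": ["agent_2"]}}

def resume_job(job_key, new_owner):
    owner, expires_at = leases[job_key]
    # Wait out the previous owner's lease rather than stealing it.
    remaining = expires_at - time.monotonic()
    if remaining > 0:
        time.sleep(remaining)
    leases[job_key] = (new_owner, time.monotonic() + 5.0)
    state = job_state[job_key]        # reload persisted state from the durable store
    return state["in_flight"]         # only in-flight work is replayed

replayed = resume_job("job:42", "node_b")
print(replayed)  # → ['agent_2']  (completed agents are not re-run)
```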
## Required environment variables

Every node in the cluster must agree on the following values. Set them identically on all machines before starting any node.

| Variable | Purpose |
|---|---|
| `MIRROR_NEURON_COOKIE` | Shared secret that authenticates BEAM distribution connections. Must be identical on all nodes. |
| `MIRROR_NEURON_CLUSTER_NODES` | Comma-separated list of `name@ip` addresses for all nodes in the cluster. |
| `MIRROR_NEURON_REDIS_URL` | Connection URL for the shared Redis instance. |
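A quick pre-start sanity check might look like the following sketch. The helper name and example values are illustrative; only the three variable names come from the table above.

```python
import os

# The three variables from the table above; every node must set all of them.
REQUIRED = ["MIRROR_NEURON_COOKIE", "MIRROR_NEURON_CLUSTER_NODES", "MIRROR_NEURON_REDIS_URL"]

def missing_cluster_env(env):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example with one variable missing (values are made up for illustration).
example = {
    "MIRROR_NEURON_COOKIE": "s3cret",
    "MIRROR_NEURON_CLUSTER_NODES": "mn@192.168.4.20,mn@192.168.4.35",
}
print(missing_cluster_env(example))      # → ['MIRROR_NEURON_REDIS_URL']
print(missing_cluster_env(os.environ))   # check the real environment
```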
## Recommended dev-mode networking

Use fixed distribution ports during development to make network failures easier to reason about. Dynamic ephemeral ports make it harder to identify which connection is failing.

## Start a two-box cluster

Use the `start_cluster_node.sh` helper script to start each box. The script sets up the correct BEAM node name, distribution settings, and environment for the given box number.
### Start box 2

Run the helper script on the machine at 192.168.4.35. Both nodes connect to the same Redis instance running on box 1.

## Inspect cluster nodes
At any time, you can check which nodes the local runtime sees in the cluster. If a node is missing, check the `MIRROR_NEURON_CLUSTER_NODES` value and verify that port 4369 (epmd) and the distribution port are reachable between both machines.
## Submit a job to the cluster

Once both nodes are running and connected, submit a distributed job. For example, the `prime_sweep_scale` benchmark can be run across the two-node cluster with four parallel workers.
## Common failure patterns

### `:nodistribution` error
This error means BEAM cannot establish a distribution connection. Common causes:
- epmd is not running on one of the machines. Start it with `epmd -daemon`.
- Port 4369 (epmd) is blocked by a firewall between the two machines.
- The configured distribution port (default 4370 in dev mode) is blocked.
You can verify that epmd is reachable from the other machine:

```
telnet 192.168.4.35 4369
```

### Invalid challenge reply (cookie mismatch)

The `MIRROR_NEURON_COOKIE` value differs between the nodes. BEAM distribution uses this shared cookie to authenticate peers, so set it to the same value on every node and restart.
### Port 4000 already in use (eaddrinuse)
Two processes are competing for the same port. This typically happens when:

- Two runtime nodes are started on the same machine and both try to bind the web API on port 4000.
- The Erlang distribution port binding conflicts with the web API port.
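The symptom can be reproduced in isolation. This sketch binds two sockets to the same port, the way two colliding nodes would; the port itself is chosen dynamically so the demo does not collide with a real service.

```python
import socket

# First socket plays the role of the node that started first and owns the port.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
port = first.getsockname()[1]
first.listen()

# Second socket plays the second node trying to bind the same port.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # same port -> OSError (eaddrinuse)
    conflict = False
except OSError:
    conflict = True
finally:
    second.close()
    first.close()

print(conflict)  # → True
```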
### Node name already in use

A previous runtime instance or CLI control node is still registered with the same BEAM node name. This can happen after an unclean shutdown. Check for running beam processes with `pgrep -a beam` or `epmd -names`, and kill any stale instances before restarting.

### Split-brain
MirrorNeuron uses Redis lease locks as the authoritative source for cluster leadership and job ownership. If a network partition splits the cluster in two:
- The partition that can still reach Redis retains leadership and job ownership.
- The isolated partition cannot acquire or renew leases, so it cannot drive new coordination.