MirrorNeuron Clustering Architecture
Documentation for MirrorNeuron Clustering Architecture.
MirrorNeuron Clustering Architecture
MirrorNeuron supports horizontal scaling by seamlessly clustering multiple Elixir/Docker nodes. Under the hood, this relies on Distributed Erlang for real-time messaging, and Redis for shared durable state.
1. Network Requirements
For two nodes, for example 192.168.4.25 and 192.168.4.173, to communicate successfully:
- Erlang Port Mapper Daemon (EPMD): Port
4369must be open and reachable. - Erlang Distribution Port: MirrorNeuron helpers pin BEAM distribution to port
4370withMN_DIST_PORTandERL_AFLAGS. - gRPC: The deployed host gRPC port, usually
55051, must be reachable for CLI, SDK, and node operator calls. The core container listens on50051internally. - Redis: Development clusters can use one shared Redis on port
6379. Multi-box reliability should use Redis Sentinel HA so each box has a replicated Redis and MirrorNeuron reconnects to the Sentinel-elected primary. - Redis Sentinel: Sentinel uses port
26379when HA mode is enabled.
2. Docker Configuration
When deploying a cluster across different host operating systems (like macOS and Linux), Docker networking behaves differently.
- Local Docker Compose: MirrorNeuron uses a named bridge network and a persisted
MN_NODE_ALIASso the single-node BEAM identity is stable across laptop IP changes. - Multi-host Docker: Use an existing attachable Docker overlay network. The CLI validates the overlay and uses Docker DNS aliases for Erlang distribution and Redis.
- Legacy IP mode: Set
MN_DOCKER_NETWORK_MODE=disabledto use host/IP-backed Erlang names.
In Docker network mode, we inject MN_NODE_NAME=mirror_neuron@<MN_NODE_ALIAS> so the Erlang node is explicitly addressable without depending on the current LAN IP. In legacy mode, MN_NODE_NAME=mirror_neuron@<IP> remains available.
3. Remote Payload Execution (HostLocal Runner)
When a Job is submitted via the REST API to the Leader node, its artifacts (like Python scripts inside payloads/) are extracted to a temporary directory in the Leader's /tmp/bundle_xxx path.
As tasks are scheduled to execute remotely, the MirrorNeuron.Runner.HostLocal module dynamically detects if the execution is happening locally or remotely:
- Local: It simply uses
File.cp_rto natively copy the files to the sandbox. - Remote: It establishes a synchronous
rpc:callback to thecoordinator_node(the Leader) to recursively read the directory tree over the Erlang distribution network and writes the payloads to the remote container's Sandbox filesystem.
This ensures that workflows can seamlessly map/reduce completely agnostic to the underlying hardware execution plane.
4. Nomad-Inspired Control Loop
The clustered runtime now has a Nomad-inspired control loop:
- the scheduler places agents on eligible nodes using resources, devices, ports, volumes, runtime drivers, constraints, and service requirements
- node state in Redis decides whether a node is healthy, joining, draining, in maintenance, disconnected, offline, or quarantined
- the reconciler handles node loss, orphaned jobs, and policy-driven reschedules
- the job coordinator restarts agents locally first, then asks the reconciler to move safe work when restart policy is exhausted
- node drain marks a node ineligible, moves safe service work, lets batch work finish, and leaves the node in maintenance until undrained
- the leader sweeps recovery evals, due drains, orphaned jobs, and due schedules
See Nomad-Inspired Runtime Features, Reliability Guide, and Cluster Guide.
5. Starting a Cluster
On Node 1 (The Leader / Initial Node)
mn runtime startStarts Redis, the API, and sets itself up as the coordinating node. Ensure your firewall permits access to 4369, Redis/Sentinel ports, and the configured Erlang distribution ports.
On Node 2 (The Worker)
mn runtime start --worker-nodeBack On Node 1
mn node join <WORKER_IP> --token <worker-token>
# e.g., mn node join 192.168.4.25 --token <worker-token>Promotes the main runtime to cluster mode if needed, then connects the worker. If Node 1 has multiple LAN addresses, pass --local-host <NODE_1_IP> to choose the advertised address.
Verifying Connection
mn node listYou should see multiple items under nodes, and their respective hardware capacities pooled together in the executor_pools.
6. Avoiding Local Resource Exhaustion
When running heavy distributed load tests such as parallel_worker_benchmark, very high worker counts across a small two-node development setup may exhaust CPU and networking file descriptors, causing nodes to miss Erlang heartbeats (timed out waiting for recovered agent ...).
To test scaling logic without overloading small development VMs, lower the worker count in the blueprint configuration before running the benchmark.
This enables the framework to accurately demonstrate Map/Reduce scaling topologies, Remote RPC artifact synchronization (MirrorNeuron.Runner.HostLocal), and cross-node swarm orchestration entirely under manageable resource constraints.