Troubleshoot common MirrorNeuron issues

Most MirrorNeuron problems fall into a small number of categories: missing or unhealthy dependencies (Redis, OpenShell), cluster networking and configuration errors, job manifest or sandbox failures, and monitor connectivity. Work through the relevant section below to identify and fix the problem you are seeing.

Installation issues

Problem: Redis is not running

Symptoms

Runtime tests fail immediately on startup.
mirror_neuron run ... hangs without output or exits with a connection error.

Diagnosis

Check whether the Redis container is running and accepting connections:

docker ps
docker exec mirror-neuron-redis redis-cli ping

A healthy Redis responds with PONG. If the container is missing or not responding, continue to the fix below.

Fix

Remove any stale container and start a fresh one:

docker rm -f mirror-neuron-redis 2>/dev/null || true
docker run -d --name mirror-neuron-redis -p 6379:6379 redis:7

Run docker exec mirror-neuron-redis redis-cli ping again to confirm Redis is available before retrying your command.
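A freshly started container can take a moment before it accepts connections, so a single ping right after docker run may fail spuriously. The sketch below polls until Redis answers PONG. The helper names wait_until and redis_ready are hypothetical and not part of MirrorNeuron; the container name is the one used above.

```shell
# wait_until ATTEMPTS DELAY CMD...: rerun CMD until it succeeds or the
# attempts run out. Returns 0 on the first success, 1 otherwise.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # No need to pause after the final failed attempt
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# redis_ready: succeeds only when the container answers PONG.
redis_ready() {
  [ "$(docker exec mirror-neuron-redis redis-cli ping 2>/dev/null)" = "PONG" ]
}

# Usage: try up to 10 times, one second apart.
#   wait_until 10 1 redis_ready && echo "Redis is up"
```

If the loop gives up, inspect the container with docker logs mirror-neuron-redis before retrying.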

Problem: OpenShell gateway is not reachable

Symptoms

Transport errors or “connection reset by peer” when jobs start.
Jobs fail before any worker code runs.
openshell status reports the gateway as unreachable.

Diagnosis

openshell status
openshell sandbox list

If the gateway shows as stopped or reports errors, proceed to the fix.

Fix

Destroy the stale gateway and start a clean one:

openshell gateway destroy --name openshell
openshell gateway start
openshell status

Confirm the gateway is running before submitting jobs again.

Problem: Stale sandboxes cause slow provisioning

Symptoms

Long provisioning delays even for small jobs.
Repeated benchmark runs get progressively slower.
openshell sandbox list shows many old sandbox entries.

Fix

Clean up leftover sandboxes from previous runs. The command below removes sandboxes whose names start with prime-worker-:

NO_COLOR=1 openshell sandbox list \
  | awk 'NR>1 && index($1, "prime-worker-")==1 {print $1}' \
  | xargs -I{} openshell sandbox delete {}

Adjust the prefix pattern if your workflow uses a different naming convention.
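Because the delete is not reversible, it can help to preview which sandboxes a prefix matches before removing anything. The sketch below wraps the same awk filter in a function that takes the prefix as an argument. The name select_sandboxes is hypothetical, and the filter assumes, as the command above does, that the list output has one header row and the sandbox name in the first column.

```shell
# select_sandboxes PREFIX: read `openshell sandbox list` output on stdin and
# print the first column of every non-header row whose name starts with PREFIX.
select_sandboxes() {
  awk -v prefix="$1" 'NR>1 && index($1, prefix)==1 {print $1}'
}

# Usage: preview first, then delete only after reviewing the list.
#   NO_COLOR=1 openshell sandbox list | select_sandboxes prime-worker-
#   NO_COLOR=1 openshell sandbox list | select_sandboxes prime-worker- \
#     | xargs -I{} openshell sandbox delete {}
```

The match is anchored at the start of the name, so a sandbox called my-prime-worker-3 is not selected by the prefix prime-worker-.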

If provisioning latency is still high after cleanup, also check gateway health with openshell status. A stale gateway state can add overhead independent of sandbox count.

Cluster issues

Problem: :nodistribution error on startup

Symptoms

The runtime exits at startup with a :nodistribution error.
Nodes cannot reach each other even when IP addresses are correct.

Diagnosis

Check whether epmd (the Erlang port mapper daemon) is running and whether port 4369 is reachable:

epmd -names
nc -vz 127.0.0.1 4369

Fix

Start epmd if it is not running: epmd -daemon

Pin the Erlang distribution port range to avoid random port allocation:

export ERL_AFLAGS="-kernel inet_dist_listen_min 4370 inet_dist_listen_max 4370"
export MIRROR_NEURON_DIST_PORT="4370"

Verify that your firewall allows traffic on port 4369 and the distribution port you chose.

Problem: Invalid challenge reply (cookie mismatch)

Symptoms

Error log: Connection attempt from node :"node2@..." rejected. Invalid challenge reply.
Nodes cannot form a cluster even though IP addresses and ports are fully reachable.

Cause

Each physical machine auto-generates a different Erlang cookie by default. Both nodes must share the exact same secret to authenticate.

Fix

Set MIRROR_NEURON_COOKIE to the same value on every box before starting the runtime:

export MIRROR_NEURON_COOKIE="my_shared_secret"

Restart both nodes after setting the variable. The cookie must be identical — including case — on every machine in the cluster.

Do not rely on the auto-generated cookie when running nodes on separate physical machines. They will never match by default.
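One way to confirm that two boxes really hold the same cookie, without printing the secret on screen, is to compare a digest of it. This is a minimal sketch: cookie_fingerprint is a hypothetical helper, and it assumes sha256sum (GNU coreutils) is installed; on systems without it, shasum -a 256 is a common substitute.

```shell
# cookie_fingerprint: print a SHA-256 digest of MIRROR_NEURON_COOKIE so the
# value can be compared across machines without displaying the secret itself.
# printf '%s' avoids hashing a trailing newline.
cookie_fingerprint() {
  printf '%s' "${MIRROR_NEURON_COOKIE:-}" | sha256sum | cut -d' ' -f1
}

# Usage: run on every box and compare the printed digests.
#   cookie_fingerprint
```

Any difference in the cookie, including letter case or stray whitespace, produces a different digest. A digest of a short or guessable cookie can still be brute-forced, so compare the values privately.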

Problem: Port 4000 already in use (eaddrinuse)

Symptoms

Error log: Running MirrorNeuron.API.Router with Bandit at http failed, port 4000 already in use
Startup fails with ** (EXIT) shutdown: failed to start child: :listener

Cause

MirrorNeuron’s HTTP API binds to port 4000 by default. If you run two nodes on the same machine, or if your Erlang distribution --bind port is also set to 4000, the second process fails to start.

Fix

Override the HTTP API port for the second node:

export MIRROR_NEURON_API_PORT=4001

Make sure this value is completely different from your Erlang distribution port (e.g. 4370). The two must not overlap.
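A quick pre-flight check can catch an overlap before the runtime starts. The sketch below compares the two environment variables used in this guide. The function name check_ports is hypothetical, the fallback of 4000 is the API default stated above, and the fallback of 4370 is only the example distribution port from this page, not a confirmed default.

```shell
# check_ports: fail if the HTTP API port and the Erlang distribution port
# resolve to the same value.
check_ports() {
  api_port=${MIRROR_NEURON_API_PORT:-4000}    # documented API default
  dist_port=${MIRROR_NEURON_DIST_PORT:-4370}  # assumed fallback (example value)
  if [ "$api_port" = "$dist_port" ]; then
    echo "conflict: API and distribution both use port $api_port" >&2
    return 1
  fi
  echo "ok: API=$api_port distribution=$dist_port"
}

# Usage:
#   check_ports || exit 1
```

This only compares configured values. It does not detect another process already listening on the port.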

Problem: Node name already in use

Symptoms

Error: the name mn1@... seems to be in use
eaddrinuse on startup without an obvious port conflict.

Fix

A previous runtime process is still registered under that node name. Stop it before starting again:

./mirror_neuron node list

Identify the stale process, stop it, and retry. Avoid starting the same node twice on the same box.

Problem: Cluster forms but work only runs on one box

Symptoms

Both nodes appear in the cluster view, but executor activity only shows on one box.
Jobs complete but remote box capacity is idle.

Possible causes

Jobs are too small to benefit from remote distribution.
The remote bundle failed to sync to the second node.
A stale CLI or control node is interfering with routing.
One node has significantly less executor pool capacity configured.

Diagnosis

Inspect cluster membership and live job distribution:

./mirror_neuron node list

./mirror_neuron monitor \
  --box1-ip 192.168.4.29 \
  --box2-ip 192.168.4.35 \
  --self-ip 192.168.4.29

Confirm both nodes are visible and check whether executor capacity is balanced.

Split-brain is not a concern here — Redis acts as the single arbiter for leader election and job ownership, so the partition that can reach Redis retains control.

Job execution issues

Problem: Manifest validation errors

Symptoms

mirror_neuron run exits early with a validation error.
Error message references a missing field, wrong type, or unknown agent type.

Diagnosis

Run the validator directly to get a detailed error message:

./mirror_neuron validate path/to/your/bundle

Common causes and fixes

Missing required top-level fields (name, agents, edges).
Agent type is not one of the built-in primitives: router, executor, aggregator, sensor.
An edge references an agent ID that does not exist in the agents list.
Payload files referenced in the manifest are missing from the payloads/ directory.

Fix each reported error, then re-run validate until it passes before submitting.

Problem: Sandbox execution failure

Symptoms

Jobs start but individual executor tasks fail with sandbox errors.
Error logs include transport errors, connection reset, or sandbox startup failures.

Diagnosis

Check gateway health and current sandbox state:

openshell status
openshell sandbox list

Review recent job events for the failing job:

./mirror_neuron events <job_id>

Fix

The executor retries transient sandbox failures automatically with backoff. If failures persist:

Reset the OpenShell gateway (see OpenShell gateway not reachable above).
Clean up stale sandboxes.
Re-submit the job.

If worker code succeeds on one box but fails on another, check that both machines run the same Python version: python3 --version. Syntax errors from version mismatches are a common cross-box failure mode.

Problem: Job is stuck in pending

Symptoms

A job is submitted successfully but stays in pending state indefinitely.
No executor activity appears in the monitor.

Possible causes

All executor pool slots are occupied by a previous job.
The job coordinator failed to start and was not yet rescheduled by Horde.
Redis is unavailable, preventing the coordinator from reading job state.

Diagnosis

./mirror_neuron node list
./mirror_neuron agent list <job_id>

Also verify Redis is healthy:

docker exec mirror-neuron-redis redis-cli ping

If the coordinator node is healthy and Redis is up, wait briefly for Horde to reschedule the coordinator. If it does not recover, restart the runtime on the affected node.

Monitor issues

Problem: Monitor shows too many old or completed jobs

Symptoms

The monitor view is cluttered with jobs from previous runs.
It is hard to distinguish active jobs from historical ones.

Fix

Filter to running jobs only:

./mirror_neuron monitor --running-only

To permanently remove old job metadata from Redis, delete the relevant records manually. Use node list and event history to confirm a job is truly complete before deleting.

Problem: Monitor output contains build noise or garbage characters

Symptoms

JSON output from the monitor includes Elixir compiler messages.
Terminal rendering is corrupted or hard to read.

Fix

Use the checked-in wrapper instead of running mix run directly:

./mirror_neuron monitor

The wrapper starts the application in a cleaner mode that suppresses build-time output before rendering the monitor UI.

Problem: Monitor cannot connect to the cluster

Symptoms

Monitor starts but shows no nodes or agents.
All metrics read as zero despite jobs being active.

Diagnosis

Confirm the monitor is pointing at the correct node addresses:

./mirror_neuron monitor \
  --box1-ip 192.168.4.29 \
  --box2-ip 192.168.4.35 \
  --self-ip 192.168.4.29

Also run a direct node check to confirm the cluster is reachable:

./mirror_neuron node list

If nodes do not appear, work through the cluster issues section above.

Useful diagnostic commands

When you are not sure where a problem originates, run these commands to get a quick picture of cluster and job health:

./mirror_neuron node list
./mirror_neuron events <job_id>
./mirror_neuron agent list <job_id>
./mirror_neuron monitor
openshell status
openshell sandbox list
epmd -names
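To capture this output in one pass, for example to attach to a bug report, the commands can be wrapped in a small helper that keeps going when one of them fails. This is a sketch only: run_step and the output file name mn-diagnostics.txt are hypothetical, and the job-specific commands are left out because they need a job ID.

```shell
# run_step CMD...: print a header, run the command, and report a non-zero
# exit status without aborting, so one failing tool does not hide the rest.
run_step() {
  echo "== $* =="
  "$@" 2>&1 || echo "(exit status $?)"
}

# Usage: collect cluster-level diagnostics into a single file.
#   {
#     run_step ./mirror_neuron node list
#     run_step openshell status
#     run_step openshell sandbox list
#     run_step epmd -names
#   } > mn-diagnostics.txt
```

Each section of the resulting file starts with the command that produced it, which makes it easier to see which dependency is unhealthy.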
