Troubleshooting

This guide collects the most common operational failures seen during local and two-box testing.

Redis issues

Redis is not running

Symptoms:

runtime tests fail immediately
mn blueprint run ... hangs or errors

Check:

docker ps
docker exec mirror-neuron-redis redis-cli ping

Expected output:

PONG

Fix:

docker rm -f mirror-neuron-redis 2>/dev/null || true
docker run -d --name mirror-neuron-redis -p 6379:6379 redis:7
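
A freshly started container can take a moment before it answers. The sketch below polls the same ping check until it prints PONG. `wait_for_pong` is a hypothetical helper written for this guide, not part of MirrorNeuron, and the attempt count is arbitrary.

```shell
# Retry a command until it prints exactly PONG, up to a fixed number of attempts.
# Hypothetical helper for illustration; not shipped with MirrorNeuron.
wait_for_pong() {
  attempts="$1"
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Compare stdout only; a container that is still starting reports on stderr.
    if [ "$("$@" 2>/dev/null)" = "PONG" ]; then
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
  done
  return 1
}

# Usage with the container name from this guide:
#   wait_for_pong 10 docker exec mirror-neuron-redis redis-cli ping && echo "Redis is up"
```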

Redis Sentinel two-box smoke test reports that the replica did not become online

Symptoms:

remote replica did not become online

or remote Redis logs show:

Error condition on socket for SYNC: No route to host

Possible causes:

the remote box cannot route to the local Redis test port
remote Docker bridge networking cannot reach the local LAN IP
firewall rules block the test Redis port

Check from the remote box:

nc -vz -w 3 192.168.4.25 46379

If this fails, let the smoke test auto-select the remote side as the initial primary:

python3 mn-system-tests/test_all.py --redis-ha \
  --redis-ha-remote-host 192.168.4.173 \
  --redis-ha-local-ip 192.168.4.25 \
  --redis-ha-remote-ip 192.168.4.173

Expected output includes:

Remote cannot reach local Redis at 192.168.4.25:46379; using remote as initial primary.
two_box_post_failover_write_read_ok

For direct script control:

cd MirrorNeuron
bash scripts/test_redis_sentinel_two_box_ha.sh \
  --remote-host 192.168.4.173 \
  --local-ip 192.168.4.25 \
  --remote-ip 192.168.4.173 \
  --remote-network auto \
  --initial-primary auto

Redis failover returns `READONLY` or connection errors

During Sentinel promotion, Redis clients can briefly see:

READONLY You can't write against a read only replica

or:

%Redix.ConnectionError{}

MirrorNeuron retries reconnectable Redis errors with bounded backoff, so brief errors during promotion are expected. If the errors persist, ask Sentinel which node it currently considers primary:

redis-cli -p 26379 SENTINEL get-master-addr-by-name mirror-neuron

Expected output is the current primary host and port.
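
When its output is piped, redis-cli prints this reply as two lines, the host followed by the port. A minimal sketch for turning that into a single host:port string, assuming that raw two-line output. `primary_addr` is a hypothetical helper name.

```shell
# Join the two-line Sentinel reply (host, then port) into host:port.
# Hypothetical helper; assumes redis-cli's raw, non-TTY output format.
primary_addr() {
  paste -sd: -
}

# Usage with the Sentinel port and primary name from this guide:
#   redis-cli -p 26379 SENTINEL get-master-addr-by-name mirror-neuron | primary_addr
```

Running this a few times during a failover shows whether the reported primary is still changing or has settled.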

OpenShell issues

Gateway is not reachable

Symptoms:

transport errors
connection reset by peer
jobs fail before worker code starts

Check:

openshell status
openshell sandbox list

Expected output includes:

Status: Connected

Reset:

openshell gateway destroy --name openshell
openshell gateway start
openshell status
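
To use the status check in a script, the expected line can be matched with grep. This is a sketch: `gateway_connected` is a hypothetical helper, and it assumes the status output contains the literal text shown above and that NO_COLOR=1 suppresses color codes for this subcommand as it does for the sandbox list.

```shell
# Succeed only if the text on stdin contains the expected status line.
# Hypothetical helper; the matched string is taken from this guide.
gateway_connected() {
  grep -q 'Status: Connected'
}

# Usage:
#   NO_COLOR=1 openshell status | gateway_connected || echo "gateway is not connected"
```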

Stale sandboxes slow everything down

Symptoms:

long provisioning delays
tiny jobs feel slow
repeated benchmark runs degrade over time

Clean prime-test sandboxes:

NO_COLOR=1 openshell sandbox list | awk 'NR>1 && index($1, "prime-worker-")==1 {print $1}' | xargs -I{} openshell sandbox delete {}
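
Because the one-liner deletes immediately, it can help to preview the matches first. The sketch below wraps the same awk filter in a function so the list can be inspected before anything is removed. `prime_sandboxes` is a hypothetical name, and the filter assumes a header row followed by one sandbox per line with the name in the first column.

```shell
# Print only sandbox names that start with "prime-worker-", skipping the header row.
# Same filter as the cleanup command above; hypothetical wrapper name.
prime_sandboxes() {
  awk 'NR>1 && index($1, "prime-worker-")==1 {print $1}'
}

# Preview:
#   NO_COLOR=1 openshell sandbox list | prime_sandboxes
# Delete once the list looks right:
#   NO_COLOR=1 openshell sandbox list | prime_sandboxes | xargs -I{} openshell sandbox delete {}
```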

Cluster issues

`:nodistribution`

Check:

epmd -names
nc -vz 127.0.0.1 4369

Fix:

make sure epmd is running
use fixed Erlang distribution ports
verify local firewall rules
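
One way to pin the distribution port range is through the Erlang kernel parameters inet_dist_listen_min and inet_dist_listen_max. The sketch below only builds the flag string. `dist_port_flags` is a hypothetical helper, and whether MirrorNeuron's launcher honors ELIXIR_ERL_OPTIONS is an assumption; if the launcher exposes its own distribution port option, prefer that.

```shell
# Build Erlang kernel flags that restrict distribution to a fixed port range.
# Hypothetical helper; with one argument, min and max are the same port.
dist_port_flags() {
  min="$1"
  max="${2:-$1}"
  printf '%s\n' "-kernel inet_dist_listen_min $min inet_dist_listen_max $max"
}

# Usage (assumes the runtime is started through elixir, iex, or mix):
#   export ELIXIR_ERL_OPTIONS="$(dist_port_flags 4370)"
# If epmd is not running, it can be started on its own:
#   epmd -daemon
```

A fixed range also makes the firewall rule simple: port 4369 for epmd plus the chosen distribution port or ports.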

Invalid challenge reply

Symptoms:

[error] ** Connection attempt from node :"node2@192.168.4.173" rejected. Invalid challenge reply. **
Nodes fail to form a cluster even when IP and ports are fully reachable

Fix:

This is an Erlang cookie mismatch. Both nodes must share exactly the same secret cookie.
Nodes running on different physical machines auto-generate different cookies by default.
Set the cookie explicitly on both boxes before starting: export MN_COOKIE="my_shared_secret"
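
To confirm that both boxes really hold the same cookie without printing the secret, a short hash of the value can be compared. This is a sketch: `cookie_fingerprint` is a hypothetical helper, and it assumes either sha256sum or shasum is installed.

```shell
# Print the first 12 hex characters of the SHA-256 of a value.
# Hypothetical helper; lets two boxes compare cookies without revealing them.
cookie_fingerprint() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -c1-12
  else
    # Fallback for systems that ship shasum instead (for example macOS).
    printf '%s' "$1" | shasum -a 256 | cut -c1-12
  fi
}

# Run on each box and compare the output:
#   cookie_fingerprint "$MN_COOKIE"
```

Different fingerprints mean the cookies differ, often because of a stray space or newline in one of the values.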

HTTP port `eaddrinuse` (4000 already in use)

Symptoms:

[error] Running MirrorNeuron.API.Router with Bandit 1.10.4 at http failed, port 4000 already in use
** (EXIT) shutdown: failed to start child: :listener

Fix:

By default, the MirrorNeuron HTTP API binds to port 4000.
If you run multiple nodes on the same machine, or accidentally set the Erlang --bind distribution port to 4000, the ports clash.
Override the HTTP API port for one of the nodes: export MN_API_PORT=4001
Make sure your Erlang --bind distribution port (e.g., 4370) differs from your MN_API_PORT.
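
A quick preflight can catch the clash before the node starts. This is a sketch with a hypothetical helper, `ports_distinct`; the port values are the examples used in this section.

```shell
# Fail if the HTTP API port and the distribution port are the same.
# Hypothetical helper for illustration.
ports_distinct() {
  api_port="$1"
  dist_port="$2"
  if [ "$api_port" = "$dist_port" ]; then
    echo "conflict: API and distribution both use port $api_port" >&2
    return 1
  fi
  return 0
}

# Usage:
#   ports_distinct "${MN_API_PORT:-4000}" 4370
# To see which process already listens on port 4000:
#   lsof -nP -iTCP:4000 -sTCP:LISTEN
```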

Runtime node name already in use

Symptoms:

the name mn1@... seems to be in use
eaddrinuse

Fix:

stop the old runtime first
avoid starting the runtime twice on the same box
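
Before starting a node, epmd can be asked whether the short name is still registered. The sketch below parses the "name <node> at port <port>" lines that epmd -names prints. `node_registered` is a hypothetical helper, and mn1 is the example name from the symptom above.

```shell
# Succeed if the given short node name appears in `epmd -names` output on stdin.
# Hypothetical helper; matches lines of the form "name mn1 at port 12345".
node_registered() {
  awk -v want="$1" '$1 == "name" && $2 == want { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage:
#   epmd -names | node_registered mn1 && echo "mn1 is still registered; stop the old runtime first"
```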

Cluster forms but work does not land on both boxes

Possible causes:

job is too small
remote bundle sync failed
stale CLI/control nodes are confusing routing
one box has less executor capacity

Check:

bash scripts/cluster_cli.sh --box1-ip 192.168.4.29 --box2-ip 192.168.4.35 --self-ip 192.168.4.29 -- inspect nodes
mn node list

Monitor issues

Monitor shows too many old jobs

This usually means Redis still contains older job metadata.

Options:

ignore completed jobs with --running-only
delete old jobs manually if needed

Monitor JSON has build noise

Use the CLI command:

mn job monitor <job_id>

If you need all jobs first, run mn job list.

LLM example issues

Gemini API key missing

Symptoms:

local LLM e2e fails quickly
cluster LLM harness fails at the first codegen stage

Fix:

export GEMINI_API_KEY="..."
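
A guard at the top of a test script makes the missing key fail with a clear message instead of an error from the first codegen stage. This is a sketch; `require_gemini_key` is a hypothetical helper name.

```shell
# Fail early with a readable message if GEMINI_API_KEY is unset or empty.
# Hypothetical helper for illustration.
require_gemini_key() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set; export it in the shell that launches the run" >&2
    return 1
  fi
  return 0
}

# Usage:
#   require_gemini_key || exit 1
```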

Python version mismatch across boxes

Symptoms:

code works on box 1 but fails on box 2
typing-related syntax errors on older Python versions

Check:

python3 --version

Keep both boxes on the same Python version where possible, or at least on versions that support the syntax the code uses.
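
Comparing the major.minor part of the version string is usually enough to spot a mismatch. In the sketch below, `py_minor` and `same_python_minor` are hypothetical helpers, and box2 in the usage line stands in for whatever SSH host name the second box has.

```shell
# Reduce "Python 3.11.4" on stdin to "3.11".
py_minor() {
  awk '{ split($2, v, "."); print v[1] "." v[2] }'
}

# Succeed if two `python3 --version` strings share the same major.minor.
same_python_minor() {
  [ "$(printf '%s\n' "$1" | py_minor)" = "$(printf '%s\n' "$2" | py_minor)" ]
}

# Usage (box2 is a placeholder host):
#   same_python_minor "$(python3 --version)" "$(ssh box2 python3 --version)" || echo "Python versions differ"
```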

When a run feels slower than expected

Common reasons:

OpenShell provisioning cost
cold image pulls
stale gateway state
large numbers of very tiny executor tasks
low executor concurrency

If the workflow itself is tiny but the run is still slow, look first at:

sandbox lifecycle overhead
gateway health
whether jobs are being oversharded

Useful diagnostic commands

mn node list
mn job status <job_id>
mn job monitor <job_id>
mn job dead-letters <job_id>
openshell status
openshell sandbox list
epmd -names
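
When reporting a problem, it can be convenient to capture several of these in one file. The sketch below records each command line, its output, and a non-zero exit status if there is one. `capture` is a hypothetical helper and mn-diagnostics.txt is an arbitrary file name.

```shell
# Print the command line, then its combined output, and note a failing exit status.
# Hypothetical helper; a failing command does not stop the remaining captures.
capture() {
  printf '$ %s\n' "$*"
  "$@" 2>&1 || printf '(exit status %s)\n' "$?"
  printf '\n'
}

# Usage, with the commands listed above that need no job id:
#   {
#     capture mn node list
#     capture openshell status
#     capture openshell sandbox list
#     capture epmd -names
#   } > mn-diagnostics.txt
```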
