Troubleshooting
A collection of common operational failures and their fixes during local and two-box testing.
Troubleshooting
This guide collects the most common operational failures seen during local and two-box testing.
Redis issues
Redis is not running
Symptoms:
- runtime tests fail immediately
mn blueprint run ...hangs or errors
Check:
docker ps
docker exec mirror-neuron-redis redis-cli pingExpected output:
PONGFix:
docker rm -f mirror-neuron-redis 2>/dev/null || true
docker run -d --name mirror-neuron-redis -p 6379:6379 redis:7Redis Sentinel two-box smoke says replica did not become online
Symptoms:
remote replica did not become onlineor remote Redis logs show:
Error condition on socket for SYNC: No route to hostCause:
- the remote box cannot route to the local Redis test port
- remote Docker bridge networking cannot reach the local LAN IP
- firewall rules block the test Redis port
Check from the remote box:
nc -vz -w 3 192.168.4.25 46379If this fails, let the smoke test auto-select the remote side as the initial primary:
python3 mn-system-tests/test_all.py --redis-ha \
--redis-ha-remote-host 192.168.4.173 \
--redis-ha-local-ip 192.168.4.25 \
--redis-ha-remote-ip 192.168.4.173Expected output includes:
Remote cannot reach local Redis at 192.168.4.25:46379; using remote as initial primary.
two_box_post_failover_write_read_okFor direct script control:
cd MirrorNeuron
bash scripts/test_redis_sentinel_two_box_ha.sh \
--remote-host 192.168.4.173 \
--local-ip 192.168.4.25 \
--remote-ip 192.168.4.173 \
--remote-network auto \
--initial-primary autoRedis failover returns READONLY or connection errors
During Sentinel promotion, Redis clients can briefly see:
READONLY You can't write against a read only replicaor:
%Redix.ConnectionError{}MirrorNeuron retries reconnectable Redis errors with bounded backoff. If errors persist, check Sentinel:
redis-cli -p 26379 SENTINEL get-master-addr-by-name mirror-neuronExpected output is the current primary host and port.
OpenShell issues
gateway is not reachable
Symptoms:
- transport errors
- connection reset by peer
- jobs fail before worker code starts
Check:
openshell status
openshell sandbox listExpected output includes:
Status: ConnectedReset:
openshell gateway destroy --name openshell
openshell gateway start
openshell statusstale sandboxes slow everything down
Symptoms:
- long provisioning delays
- tiny jobs feel slow
- repeated benchmark runs degrade over time
Clean prime-test sandboxes:
NO_COLOR=1 openshell sandbox list | awk 'NR>1 && index($1, "prime-worker-")==1 {print $1}' | xargs -I{} openshell sandbox delete {}Cluster issues
:nodistribution
Check:
epmd -names
nc -vz 127.0.0.1 4369Fix:
- make sure
epmdis running - use fixed Erlang distribution ports
- verify local firewall rules
Invalid challenge reply
Symptoms:
[error] ** Connection attempt from node :"node2@192.168.4.173" rejected. Invalid challenge reply. **- Nodes fail to form a cluster even when IP and ports are fully reachable
Fix:
- This is an Erlang Cookie mismatch. Both nodes must share the exact same secret cookie.
- If you are running nodes on different physical machines, they will auto-generate different cookies by default.
- Set the cookie explicitly on both boxes before starting:
export MN_COOKIE="my_shared_secret"
HTTP port eaddrinuse (4000 already in use)
Symptoms:
[error] Running MirrorNeuron.API.Router with Bandit 1.10.4 at http failed, port 4000 already in use** (EXIT) shutdown: failed to start child: :listener
Fix:
- By default, the MirrorNeuron HTTP API binds to port
4000. - If you run multiple nodes on the same machine, or if you accidentally configure the Erlang
--bindto port 4000, they will clash. - Override the HTTP API port for one of the nodes:
export MN_API_PORT=4001 - Make sure your Erlang
--binddistribution port (e.g.,4370) is completely different from yourMN_API_PORT.
runtime node name already in use
Symptoms:
the name mn1@... seems to be in useeaddrinuse
Fix:
- stop the old runtime first
- avoid starting the same box twice
cluster forms but work does not land on both boxes
Possible causes:
- job is too small
- remote bundle sync failed
- stale CLI/control nodes are confusing routing
- one box has less executor capacity
Check:
bash scripts/cluster_cli.sh --box1-ip 192.168.4.29 --box2-ip 192.168.4.35 --self-ip 192.168.4.29 -- inspect nodes
mn node listMonitor issues
monitor shows too many old jobs
This usually means Redis still contains older job metadata.
Options:
- ignore completed jobs with
--running-only - delete old jobs manually if needed
monitor JSON has build noise
Use the CLI command:
mn job monitor <job_id>
If you need all jobs first, run mn job list.
LLM example issues
Gemini API key missing
Symptoms:
- local LLM e2e fails quickly
- cluster LLM harness fails at the first codegen stage
Fix:
export GEMINI_API_KEY="..."Python version mismatch across boxes
Symptoms:
- code works on box 1 but fails on box 2
- typing-related syntax errors on older Python versions
Check:
python3 --versionTry to keep both boxes on a compatible Python version.
When a run feels slower than expected
Common reasons:
- OpenShell provisioning cost
- cold image pulls
- stale gateway state
- large numbers of very tiny executor tasks
- low executor concurrency
If the workflow itself is tiny but runtime is slow, look first at:
- sandbox lifecycle overhead
- gateway health
- whether jobs are being oversharded
Good diagnostic commands
mn node list
mn job status <job_id>
mn job monitor <job_id>
mn job dead-letters <job_id>
openshell status
openshell sandbox list
epmd -namesunlink(content/docs/md-legacy/troubleshooting.md)