MirrorNeuron is a durable, retryable workflow runtime — not a consensus-based control plane. Its reliability model is intentionally practical: it persists enough state to recover individual jobs and agents after node loss, elects a cluster leader dynamically, and keeps work moving on healthy nodes when an executor box fails. This page explains what the runtime guarantees today, what it does not yet guarantee, and how to reason about recovery when things go wrong.

What the reliability model provides

MirrorNeuron’s current design targets four goals:
  • No single executor node causes total workflow failure by default.
  • Job shards are small enough to replay individually rather than restarting an entire workflow.
  • Enough durable state is persisted to restart work after process or node loss.
  • Recovery is at-least-once, with lightweight deduplication for common patterns.
It does not yet target exactly-once delivery, Redis failover, or quorum-based control.

Reliability mechanisms

Redis-backed job persistence

MirrorNeuron writes all durable workflow state to Redis. This includes:
  • Job records and current status.
  • Job events and history.
  • Agent snapshots (assigned node, processed message count, in-flight and pending messages, local agent state, heartbeat timestamp).
  • Cluster and job leases.
Because this state lives outside any individual node, it survives node restarts and process crashes. When a coordinator or agent is rescheduled, it reads its state back from Redis and resumes from where it left off.
Redis is the single source of truth for all durable state. Keep it healthy and reachable from all nodes. See current limitations for what happens when Redis itself is unavailable.
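For intuition, a resuming coordinator's first step looks roughly like the sketch below. Redix and Jason as client libraries, the key name, and the module name are assumptions for illustration, not MirrorNeuron's actual schema.
defmodule JobResumeSketch do
  # Hypothetical sketch: re-read durable job state from Redis after a restart.
  # Redix/Jason usage and the key layout are illustrative, not the real schema.
  def resume(job_id) do
    {:ok, conn} = Redix.start_link("redis://localhost:6379")
    case Redix.command(conn, ["GET", "job:" <> job_id <> ":record"]) do
      {:ok, nil}  -> {:error, :no_durable_state}
      {:ok, json} -> {:ok, Jason.decode!(json)}   # resume from the persisted record
    end
  end
end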

Agent heartbeats and health checks

Agents periodically write a fresh snapshot to Redis, including a last_heartbeat_at timestamp. The job coordinator polls these snapshots on an interval and treats a missing or stale heartbeat as a recovery signal. This lets the coordinator detect agent failures and initiate recovery without relying on Erlang process monitors alone.
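The check itself is simple. A minimal sketch, assuming last_heartbeat_at is stored as a Unix-millisecond timestamp and using an illustrative 30-second threshold (not the runtime's configured value):
defmodule HeartbeatSketch do
  # Treat a missing snapshot or an old heartbeat as a recovery signal.
  # The 30_000 ms threshold and snapshot field name are assumptions.
  @stale_after_ms 30_000

  def needs_recovery?(nil, _now_ms), do: true
  def needs_recovery?(%{"last_heartbeat_at" => beat_ms}, now_ms) do
    now_ms - beat_ms > @stale_after_ms
  end
end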

Dynamic leader election

One node in the cluster holds the cluster:leader lease in Redis at any given time. The leader is responsible for cluster-wide health checks such as sweeping and recovering orphaned jobs — jobs whose coordinator crashed before Horde could reschedule them. Leader election works as follows:
  1. Acquire the lease. A node acquires the cluster:leader key in Redis. Acquisition is exclusive: only one node holds it at a time.
  2. Refresh periodically. The leader refreshes its lease on an interval to signal liveness.
  3. Automatic failover. If the leader node crashes or becomes unresponsive, the lease expires. Another node immediately acquires it and assumes leadership responsibilities.
Because Redis is the sole arbiter, this design avoids split-brain: the partition that can communicate with Redis retains leadership and job ownership.
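The underlying pattern is a standard Redis lease: an atomic SET NX PX for exclusive acquisition, plus an owner-checked refresh. A sketch of that general pattern, assuming Redix and an illustrative 15-second TTL (not MirrorNeuron's implementation or actual values):
defmodule LeaderLeaseSketch do
  # Sketch of the general lease pattern, not MirrorNeuron's code.
  @key "cluster:leader"
  @ttl_ms 15_000   # illustrative TTL

  # SET NX PX is atomic, so only one node can acquire the lease at a time.
  def try_acquire(conn, node_id) do
    case Redix.command(conn, ["SET", @key, node_id, "NX", "PX", Integer.to_string(@ttl_ms)]) do
      {:ok, "OK"} -> :acquired
      {:ok, nil}  -> :held_by_another_node
    end
  end

  # Refresh only if this node still owns the key, so a stale leader cannot
  # extend a lease that has already passed to another node.
  def refresh(conn, node_id) do
    script = """
    if redis.call('GET', KEYS[1]) == ARGV[1] then
      return redis.call('PEXPIRE', KEYS[1], ARGV[2])
    end
    return 0
    """
    Redix.command(conn, ["EVAL", script, "1", @key, node_id, Integer.to_string(@ttl_ms)])
  end
end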

Horde-based job coordinator failover

Job runners and coordinators are managed dynamically by Horde across the peer cluster. When a job is submitted, the coordinator acquires a job:<job_id> lease in Redis. If the node running that coordinator dies:
  1. Horde detects the failure and reschedules the coordinator on another available node.
  2. The new coordinator reads the existing job state from Redis.
  3. It acquires the lease (waiting for the previous lease to expire if necessary).
  4. Work resumes from the last durable checkpoint.
This means job coordination is not pinned to the submission node. Any healthy peer can take over.
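The rescheduling itself is standard Horde behavior: children started under a cluster-wide Horde.DynamicSupervisor are restarted on a surviving member when their node leaves. The supervisor and coordinator module names below are hypothetical, shown only to illustrate the shape:
# Illustrative only: a cluster-wide Horde supervisor for job coordinators.
# The supervisor and coordinator module names are hypothetical.
children = [
  {Horde.DynamicSupervisor,
   name: MirrorNeuron.CoordinatorSupervisor,
   strategy: :one_for_one,
   members: :auto}
]
Supervisor.start_link(children, strategy: :one_for_one)

# When a job is submitted, its coordinator is started under that supervisor,
# so Horde can restart it on a surviving node if the current one dies.
Horde.DynamicSupervisor.start_child(
  MirrorNeuron.CoordinatorSupervisor,
  {MirrorNeuron.JobCoordinator, job_id: "job-123"}
)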

Agent recovery from persisted snapshots

When an agent disappears (process crash, node loss), the coordinator can restart it from its last snapshot. Recovery restores:
  • Local agent state.
  • Pending messages.
  • The in-flight message (replayed from the snapshot).
This works for both coordinator-initiated recovery and agent redistribution when an executor node joins or rejoins the cluster.
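Conceptually, recovery rehydrates the agent's state and re-enqueues its in-flight message ahead of the pending backlog. The field names below mirror the snapshot contents listed above but are illustrative, not the persisted schema:
defmodule SnapshotRecoverySketch do
  # Conceptual sketch of snapshot-based recovery; field names are illustrative.
  def restore(snapshot) do
    %{
      "local_state"       => state,
      "in_flight_message" => in_flight,
      "pending_messages"  => pending
    } = snapshot
    # Replay the in-flight message first, then the pending backlog (at-least-once).
    messages = if in_flight, do: [in_flight | pending], else: pending
    {state, messages}
  end
end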

Replay of completed executor outputs

Executors persist their last emitted output payload. If an executor had already finished its sandbox work before a node died, recovery re-emits that logical result rather than losing it silently. This closes the gap where sandbox execution completed but the downstream collector had not yet durably observed the result.

Aggregator deduplication

The built-in aggregator deduplicates replayed results by agent_id when that field is present in the payload. This makes replay safe for common fan-out/fan-in patterns such as prime sweep workers or single-result executor shards.
Deduplication is keyed on agent_id. It works well for one-result-per-worker patterns. It is not a universal dedupe system for arbitrary multi-message streams.
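A sketch of the keying idea, in which the first result per agent_id wins and a replayed result is dropped. The real aggregator's state handling may differ; this shows only the keying:
defmodule DedupeSketch do
  # First result per agent_id wins; a replayed result for the same agent is ignored.
  def accept(acc, %{"agent_id" => agent_id} = payload) do
    if Map.has_key?(acc, agent_id) do
      {:duplicate, acc}
    else
      {:new, Map.put(acc, agent_id, payload)}
    end
  end

  # Without an agent_id there is nothing to key on, so the result is always kept.
  def accept(acc, payload), do: {:new, Map.put(acc, make_ref(), payload)}
end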

Executor retry and backoff

Before cross-node recovery is needed, executors retry transient sandbox failures automatically with bounded backoff. Covered failures include OpenShell transport errors, connection reset or close, and transient sandbox startup failures.
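The retry loop has the usual shape: retry on error with exponentially growing delays, then give up after a bounded number of attempts. The attempt limit and delays below are illustrative, not the runtime's configured values:
defmodule RetrySketch do
  # Illustrative bounded retry with exponential backoff for transient failures.
  # The attempt limit and base delay are assumptions, not the runtime's settings.
  def with_retries(fun, attempt \\ 1, max_attempts \\ 4) do
    case fun.() do
      {:ok, result} ->
        {:ok, result}
      {:error, _reason} when attempt < max_attempts ->
        Process.sleep(200 * Integer.pow(2, attempt - 1))   # 200ms, 400ms, 800ms ...
        with_retries(fun, attempt + 1, max_attempts)
      {:error, reason} ->
        {:error, {:gave_up_after, max_attempts, reason}}
    end
  end
end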

What to expect when a node fails

When an executor node dies during a job:
  • Jobs continue running if the coordinator is on a healthy node.
  • Some work may be replayed (at-least-once model).
  • Throughput drops proportionally to the lost capacity.
  • Completion takes longer but the result should still converge if replayable state exists.
This is degraded service, not zero-impact failover. Plan your workflow shard sizes and executor pool capacity accordingly.

Recovery policies

MirrorNeuron supports multiple recovery modes for agents. The local_restart policy restarts a failed agent on the same node, which minimizes coordination overhead for transient process failures. For cross-node recovery after a node loss, the coordinator uses the persisted snapshot to restart the agent on whichever node is available.
Design your executor tasks to be deterministic and idempotent. This makes replay safe and removes the need for complex exactly-once logic in your worker code.
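As a generic illustration of that advice (not MirrorNeuron API), a shard task whose output is a pure function of its input range can be replayed any number of times and produce the same result:
defmodule PrimeShardSketch do
  # Deterministic, idempotent shard task: the output depends only on the input
  # range, so an at-least-once replay produces the identical result.
  def run(%{"start" => lo, "end" => hi}) do
    primes = Enum.filter(lo..hi, &prime?/1)
    %{"range" => [lo, hi], "count" => length(primes)}
  end

  defp prime?(n) when n < 2, do: false
  defp prime?(2), do: true
  defp prime?(n) when rem(n, 2) == 0, do: false
  defp prime?(n) do
    limit = n |> :math.sqrt() |> floor()
    not Enum.any?(3..limit//2, fn d -> rem(n, d) == 0 end)
  end
end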

Current limitations

These are real constraints you should understand before relying on MirrorNeuron in production.
Redis is a single point of failure

All durable state — job records, events, agent snapshots, leases — lives in Redis. If Redis is unavailable or its data is corrupted:
  • Job state persistence stops working.
  • Recovery data is inaccessible.
  • Event history is lost.
Connection handling is more resilient than it used to be (the runtime reconnects and falls back to one-shot connections when needed), but Redis itself has no built-in redundancy in the current design. Monitor Redis health actively and treat it as critical infrastructure.

At-least-once, not exactly-once

Recovery can replay work or results. The aggregator deduplication helps for common patterns, but it is not a universal guarantee. If your workflow logic requires exactly-once semantics, the current runtime does not support that yet.

Failure scenarios not yet covered

The current reliability work validates executor-node loss during active work. The following scenarios do not yet have comparable HA mechanisms:
  • Redis unavailability or data loss.
  • Loss of the seed/control node before or during job submission.
  • Multi-box network partitions with competing split-brain handling.

Connection warnings under load

The runtime recovers from broken Redis connections gracefully, but under high load you may see warning logs about closed connections. Jobs continue completing successfully via the fallback path — the noise is a logging issue, not a data loss issue.

Practical guidance for reliable workflows

To get the best results with the current runtime:
  • Keep work in bounded shards. Small units of work mean recovery replays one shard, not an entire workflow. The prime sweep examples demonstrate this pattern.
  • Write deterministic executor tasks. Deterministic tasks make replay safe without needing exactly-once guarantees.
  • Design aggregators to tolerate replay. Use the agent_id field in executor output payloads to benefit from built-in deduplication.
  • Keep Redis healthy. Redis is not optional. Monitor it, back it up, and ensure all nodes can reach it.
  • Treat box loss as capacity loss. When a node dies, the workflow degrades in throughput but should still complete. Do not restart the whole workflow unless the coordinator itself is also unrecoverable.

Running the failover test harness

You can verify the recovery path end-to-end with the included harness. It starts a two-box cluster, submits a prime fan-out job, kills one node mid-execution, and verifies the job completes on the surviving node:
bash scripts/test_cluster_prime_failover_e2e.sh \
  --box1-ip 192.168.4.29 \
  --box2-ip 192.168.4.35 \
  --start 1000003 \
  --end 1006002
A successful run completes the job and emits recovery events in the event log. Inspect them with:
./mirror_neuron events <job_id>

What comes next

If reliability becomes the next major investment area, the most valuable additions would be:
  • Stale lease reclaim tied to node liveness signals.
  • Event-driven job completion instead of coordinator polling.
  • Stronger durable mailbox semantics for critical messages.
  • Deterministic coordinator ownership recorded in durable state.
  • HA Redis or an equivalent replicated metadata store.