What the reliability model provides
MirrorNeuron’s current design targets four goals:
- No single executor node causes total workflow failure by default.
- Job shards are small enough to replay individually rather than restarting an entire workflow.
- Enough durable state is persisted to restart work after process or node loss.
- Recovery is at-least-once, with lightweight deduplication for common patterns.
Reliability mechanisms
Redis-backed job persistence
MirrorNeuron writes all durable workflow state to Redis. This includes:
- Job records and current status.
- Job events and history.
- Agent snapshots (assigned node, processed message count, in-flight and pending messages, local agent state, heartbeat timestamp).
- Cluster and job leases.
Redis is the single source of truth for all durable state. Keep it healthy and reachable from all nodes. See current limitations for what happens when Redis itself is unavailable.
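As an illustration, appending to a job's durable event history might look like the following sketch. The key naming (`job:<id>:events`) and record shape are assumptions, not MirrorNeuron's actual schema, and a plain dict stands in for the Redis client:

```python
import json
import time

def persist_job_event(store, job_id, event):
    """Append an event to a job's durable history.

    `store` is an in-memory stand-in for Redis; a real deployment would use
    a Redis list (RPUSH) under a key like job:<id>:events.
    """
    record = {"at": time.time(), **event}
    store.setdefault(f"job:{job_id}:events", []).append(json.dumps(record))

store = {}
persist_job_event(store, "42", {"type": "started"})
persist_job_event(store, "42", {"type": "shard_done", "shard": 0})
assert len(store["job:42:events"]) == 2
```

Because every event is persisted before work proceeds, a restarted coordinator can reconstruct job history from Redis alone.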
Agent heartbeats and health checks
Agents periodically write a fresh snapshot to Redis, including a last_heartbeat_at timestamp. The job coordinator polls these snapshots on an interval and treats a missing or stale heartbeat as a recovery signal. This lets the coordinator detect agent failures and initiate recovery without relying on Erlang process monitors alone.
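The staleness check the coordinator performs can be sketched as follows. Only the last_heartbeat_at field comes from the text above; the threshold value and snapshot access pattern are illustrative assumptions:

```python
import time

STALE_AFTER = 10.0  # assumed threshold: seconds without a heartbeat before recovery

def is_stale(snapshot, now=None):
    """Treat a missing or old last_heartbeat_at as a recovery signal."""
    now = now if now is not None else time.time()
    hb = snapshot.get("last_heartbeat_at")
    return hb is None or (now - hb) > STALE_AFTER

now = 1000.0
assert is_stale({}, now=now)                                  # never heartbeated
assert is_stale({"last_heartbeat_at": now - 60}, now=now)     # stale snapshot
assert not is_stale({"last_heartbeat_at": now - 1}, now=now)  # healthy agent
```

Polling snapshots this way means failure detection works even when the failed agent's node can no longer respond to Erlang monitors.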
Dynamic leader election
One node in the cluster holds the cluster:leader lease in Redis at any given time. The leader is responsible for cluster-wide health checks such as sweeping and recovering orphaned jobs — jobs whose coordinator crashed before Horde could reschedule them.
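An exclusive lease like cluster:leader is commonly built on Redis's atomic SET with the NX and PX options. A minimal sketch, using an in-memory stand-in for the Redis client (the FakeRedis class and the TTL value are illustrative assumptions):

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for the one Redis command used below."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, px=None):
        now = time.monotonic()
        current = self.store.get(key)
        if current and current[1] > now and nx:
            return None  # key still held and NX requested: acquisition fails
        expires_at = now + (px / 1000.0) if px else float("inf")
        self.store[key] = (value, expires_at)
        return True

def acquire_leader_lease(r, node_id, ttl_ms=5000):
    # SET cluster:leader <node> NX PX <ttl> succeeds for exactly one node.
    return r.set("cluster:leader", node_id, nx=True, px=ttl_ms) is True

r = FakeRedis()
assert acquire_leader_lease(r, "node_a")      # first node wins the lease
assert not acquire_leader_lease(r, "node_b")  # second node is rejected
```

Because the lease carries a TTL, a crashed leader's claim expires on its own and another node can take over without coordination.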
Leader election works as follows: a node acquires the cluster:leader key in Redis. Acquisition is exclusive, so only one node holds the lease at a time; when the holder fails, the lease eventually expires and another node can acquire it.

Horde-based job coordinator failover
Job runners and coordinators are managed dynamically by Horde across the peer cluster. When a job is submitted, the coordinator acquires a job:<job_id> lease in Redis.
If the node running that coordinator dies:
- Horde detects the failure and reschedules the coordinator on another available node.
- The new coordinator reads the existing job state from Redis.
- It acquires the lease (waiting for the previous lease to expire if necessary).
- Work resumes from the last durable checkpoint.
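The takeover step can be sketched as follows. The `leases` dict stands in for the Redis lease state, and the TTL value is an assumption; the function returns False while the previous lease is still live (the caller retries), and True once the job has been claimed:

```python
def take_over_job(leases, job_id, node_id, now):
    """Claim a job lease for a rescheduled coordinator.

    `leases` maps job keys to (holder, expires_at), standing in for Redis.
    """
    key = f"job:{job_id}"
    holder, expires_at = leases.get(key, (None, 0.0))
    if holder is not None and expires_at > now:
        return False  # previous lease still live: keep waiting
    leases[key] = (node_id, now + 5.0)  # claim with a fresh TTL (assumed 5s)
    return True

leases = {"job:42": ("node_a", 100.0)}
assert not take_over_job(leases, "42", "node_b", now=99.0)  # old lease valid
assert take_over_job(leases, "42", "node_b", now=101.0)     # expired: takeover
assert leases["job:42"][0] == "node_b"
```

Waiting for expiry rather than force-stealing the lease avoids two coordinators briefly driving the same job.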
Agent recovery from persisted snapshots
When an agent disappears (process crash, node loss), the coordinator can restart it from its last snapshot. Recovery restores:
- Local agent state.
- Pending messages.
- The in-flight message (replayed from the snapshot).
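Restoration from a snapshot can be sketched like this. The field names (local_state, in_flight, pending) are assumed stand-ins for the snapshot contents listed earlier:

```python
def restore_agent(snapshot):
    """Rebuild an agent's runtime state from its persisted snapshot."""
    state = dict(snapshot.get("local_state", {}))
    # Replay order: the in-flight message first, then anything still pending.
    mailbox = []
    if snapshot.get("in_flight") is not None:
        mailbox.append(snapshot["in_flight"])
    mailbox.extend(snapshot.get("pending", []))
    return state, mailbox

snap = {"local_state": {"count": 3}, "in_flight": "m4", "pending": ["m5", "m6"]}
state, mailbox = restore_agent(snap)
assert state == {"count": 3}
assert mailbox == ["m4", "m5", "m6"]
```

Replaying the in-flight message is what makes recovery at-least-once: the message may have been partially processed before the crash, so downstream consumers must tolerate seeing it again.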
Replay of completed executor outputs
Executors persist their last emitted output payload. If an executor had already finished its sandbox work before a node died, recovery re-emits that logical result rather than losing it silently. This closes the gap where sandbox execution completed but the downstream collector had not yet durably observed the result.

Aggregator deduplication
The built-in aggregator deduplicates replayed results by agent_id when that field is present in the payload. This makes replay safe for common fan-out/fan-in patterns such as prime sweep workers or single-result executor shards.
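The dedup behavior can be sketched as follows; the payload shapes are illustrative, not the aggregator's actual wire format:

```python
def aggregate(results):
    """Collect executor outputs, keeping one result per agent_id.

    Payloads without an agent_id are passed through undeduplicated.
    """
    seen = set()
    kept = []
    for payload in results:
        agent_id = payload.get("agent_id")
        if agent_id is not None:
            if agent_id in seen:
                continue  # replayed duplicate from recovery: drop it
            seen.add(agent_id)
        kept.append(payload)
    return kept

replayed = [
    {"agent_id": "w1", "primes": 25},
    {"agent_id": "w1", "primes": 25},  # same shard re-emitted after node loss
    {"agent_id": "w2", "primes": 21},
]
assert aggregate(replayed) == [
    {"agent_id": "w1", "primes": 25},
    {"agent_id": "w2", "primes": 21},
]
```

Note that this keys on the worker, not the message, which is why it only covers one-result-per-worker patterns.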
Deduplication is keyed on agent_id. It works well for one-result-per-worker patterns. It is not a universal dedupe system for arbitrary multi-message streams.

Executor retry and backoff
Before cross-node recovery is needed, executors retry transient sandbox failures automatically with bounded backoff. Covered failures include OpenShell transport errors, connection reset or close, and transient sandbox startup failures.

What to expect when a node fails
When an executor node dies during a job:
- Jobs continue running if the coordinator is on a healthy node.
- Some work may be replayed (at-least-once model).
- Throughput drops proportionally to the lost capacity.
- Completion takes longer but the result should still converge if replayable state exists.
Recovery policies
MirrorNeuron supports multiple recovery modes for agents. The local_restart policy restarts a failed agent on the same node, which minimizes coordination overhead for transient process failures. For cross-node recovery after a node loss, the coordinator uses the persisted snapshot to restart the agent on whichever node is available.
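The policy distinction can be sketched as follows. Only the local_restart name comes from the text; the selection logic here is an illustrative assumption:

```python
def choose_recovery(policy, failed_node, live_nodes):
    """Pick a target node for restarting a failed agent."""
    if policy == "local_restart" and failed_node in live_nodes:
        return failed_node  # transient process crash: restart in place
    # Node loss: fall back to any available node and restore from the
    # persisted snapshot (min() just makes the choice deterministic here).
    return min(live_nodes)

# Process crashed but its node is healthy: restart locally.
assert choose_recovery("local_restart", "node_a", {"node_a", "node_b"}) == "node_a"
# The whole node is gone: recover onto a surviving node.
assert choose_recovery("local_restart", "node_a", {"node_b"}) == "node_b"
```

Restarting in place skips the lease handoff and snapshot transfer that cross-node recovery requires, which is why it is preferred for transient failures.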
Current limitations
These are real constraints you should understand before relying on MirrorNeuron in production.

Redis is a single point of failure
All durable state — job records, events, agent snapshots, leases — lives in Redis. If Redis is unavailable or its data is corrupted:
- Job state persistence stops working.
- Recovery data is inaccessible.
- Event history is lost.
At-least-once, not exactly-once
Recovery can replay work or results. The aggregator deduplication helps for common patterns, but it is not a universal guarantee. If your workflow logic requires exactly-once semantics, the current runtime does not support that yet.
Node loss is covered; full platform loss is not
The current reliability work validates executor-node loss during active work. The following scenarios do not yet have comparable HA mechanisms:
- Redis unavailability or data loss.
- Loss of the seed/control node before or during job submission.
- Multi-box network partitions with competing split-brain handling.
Redis client logs can be noisy under stress
The runtime recovers from broken Redis connections gracefully, but under high load you may see warning logs about closed connections. Jobs continue completing successfully via the fallback path — the noise is a logging issue, not a data loss issue.
Practical guidance for reliable workflows
To get the best results with the current runtime:
- Keep work in bounded shards. Small units of work mean recovery replays one shard, not an entire workflow. The prime sweep examples demonstrate this pattern.
- Write deterministic executor tasks. Deterministic tasks make replay safe without needing exactly-once guarantees.
- Design aggregators to tolerate replay. Use the agent_id field in executor output payloads to benefit from built-in deduplication.
- Keep Redis healthy. Redis is not optional. Monitor it, back it up, and ensure all nodes can reach it.
- Treat box loss as capacity loss. When a node dies, the workflow degrades in throughput but should still complete. Do not restart the whole workflow unless the coordinator itself is also unrecoverable.
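The first three guidelines combine naturally: bounded, deterministic shards whose payloads carry an agent_id. A sketch modeled on the prime sweep pattern (the shard shape is illustrative, not a MirrorNeuron API):

```python
def make_shards(lo, hi, shard_size):
    """Split a sweep range into bounded, deterministic shards.

    Each payload carries an agent_id so the built-in aggregator can
    dedupe replayed results after a node loss.
    """
    shards = []
    for i, start in enumerate(range(lo, hi, shard_size)):
        shards.append({
            "agent_id": f"sweep-{i}",              # stable dedup key
            "range": (start, min(start + shard_size, hi)),
        })
    return shards

shards = make_shards(0, 100, 40)
assert [s["range"] for s in shards] == [(0, 40), (40, 80), (80, 100)]
assert shards[0]["agent_id"] == "sweep-0"
```

Because shard boundaries depend only on the inputs, replaying shard `sweep-1` after a crash recomputes exactly the same range, and its stable agent_id lets the aggregator drop the duplicate result.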
Running the failover test harness
You can verify the recovery path end-to-end with the included harness. It starts a two-box cluster, submits a prime fan-out job, kills one node mid-execution, and verifies the job completes on the surviving node.

What comes next
If reliability becomes the next major investment area, the most valuable additions would be:
- Stale lease reclaim tied to node liveness signals.
- Event-driven job completion instead of coordinator polling.
- Stronger durable mailbox semantics for critical messages.
- Deterministic coordinator ownership recorded in durable state.
- HA Redis or an equivalent replicated metadata store.