Nomad-Inspired Runtime Features

MirrorNeuron now has a small-lab orchestration layer inspired by Nomad. It is not a Kubernetes clone and it is not a full Nomad replacement. The goal is narrower: run agent workloads across a few machines, make placement decisions from real resources, recover safely when nodes fail, and expose enough operator control for AI lab maintenance.

Feature Status

Feature	Status	Start here
Reconciliation and automatic rescheduling	Implemented	Reliability Guide and Cluster Guide
Full job type behavior	Implemented	Reliability Guide
Restart and reschedule policies	Implemented	Reliability Guide
Node drain and maintenance	Implemented	Cluster Guide
Service registry and health checks	Implemented	Services and Health Checks
Stronger resources and devices	Implemented	Resources and Devices
Deployment and update strategy	Implemented	Deployments
Periodic, delayed, and event-triggered jobs	Implemented	Schedules and Events

Design Concept

The runtime uses a desired-vs-actual model:

the manifest describes desired agents, services, resources, policies, and schedules
Redis stores durable job state, node state, snapshots, leases, schedules, services, and deployments
the scheduler chooses eligible target nodes and concrete allocations
job coordinators keep agents running and apply lifecycle policy
the leader sweeps recovery evals, orphaned jobs, due drains, and due schedules
the reconciler moves only the affected agents when it can, and restarts a whole job only when the coordinator lease is gone

That gives MirrorNeuron the Nomad-like heart for small clusters without requiring a heavy control plane.

How To Use The Features Together

Use cluster_recover for work that can be replayed safely:

{
  "type": "service",
  "policies": {
    "recovery_mode": "cluster_recover",
    "restart": {
      "attempts": 3,
      "interval_ms": 600000,
      "delay_ms": 1000,
      "delay_function": "exponential",
      "max_delay_ms": 30000,
      "mode": "fail"
    },
    "reschedule": {
      "unlimited": true,
      "delay_ms": 5000,
      "delay_function": "exponential",
      "max_delay_ms": 300000
    }
  }
}

Use system for one copy per eligible machine:

{
  "policies": {
    "scheduler": {
      "job_type": "system"
    }
  }
}

Use resource and service requirements together when a worker needs a healthy local service:

{
  "nodes": [
    {
      "node_id": "embedding_worker",
      "agent_type": "executor",
      "resources": {
        "devices": [
          {
            "kind": "gpu",
            "driver": "cuda",
            "min_memory_mb": 16000,
            "count": 1
          }
        ]
      },
      "requires_services": [
        {
          "name": "vllm",
          "tags": ["embedding"],
          "required": true
        }
      ]
    }
  ]
}

Drain a node before rebooting it:

mn node drain mirror_neuron@192.168.4.20 --reason "driver update" --deadline 30m --dry-run
mn node drain mirror_neuron@192.168.4.20 --reason "driver update" --deadline 30m --wait
mn node undrain mirror_neuron@192.168.4.20 --mark-eligible --reason "ready"

Deploy a long-running service under a stable key:

mn deployment deploy /path/to/bundle --key agent-api --strategy rolling --max-parallel 1
mn deployment status agent-api

Schedule a batch job:

mn schedule create /path/to/bundle --cron "0 2 * * *" --timezone America/New_York

Create an event trigger:

mn trigger create /path/to/bundle --event file_uploaded --filter-json '{"path":{"prefix":"/datasets/"}}'
mn event emit file_uploaded --payload-json '{"path":"/datasets/eval.jsonl"}'

Important Code

Area	Important files
Manifest fields and validation	`MirrorNeuron/lib/mirror_neuron/manifest.ex`
Placement scheduler	`MirrorNeuron/lib/mirror_neuron/scheduler.ex`
Resource spec and allocation env	`MirrorNeuron/lib/mirror_neuron/resource_spec.ex`
Node inventory and resource API	`MirrorNeuron/lib/mirror_neuron/resource.ex`
Job coordinator lifecycle	`MirrorNeuron/lib/mirror_neuron/runtime/job_coordinator.ex`
Restart and reschedule policy	`MirrorNeuron/lib/mirror_neuron/runtime/lifecycle_policy.ex`
Reconciliation	`MirrorNeuron/lib/mirror_neuron/cluster/reconciler.ex`
Node monitor and leader sweeps	`MirrorNeuron/lib/mirror_neuron/cluster/node_monitor.ex`, `MirrorNeuron/lib/mirror_neuron/cluster/leader.ex`
Drain and maintenance	`MirrorNeuron/lib/mirror_neuron/cluster/node_drainer.ex`
Service declarations	`MirrorNeuron/lib/mirror_neuron/service_spec.ex`
Service checks and registry	`MirrorNeuron/lib/mirror_neuron/service_check.ex`, `MirrorNeuron/lib/mirror_neuron/service_registry.ex`, `MirrorNeuron/lib/mirror_neuron/service_monitor.ex`, `MirrorNeuron/lib/mirror_neuron/service_preflight.ex`
Deployments	`MirrorNeuron/lib/mirror_neuron/runtime/deployment_policy.ex`, `MirrorNeuron/lib/mirror_neuron/runtime/deployment_controller.ex`
Schedules and triggers	`MirrorNeuron/lib/mirror_neuron/runtime/schedule_policy.ex`, `MirrorNeuron/lib/mirror_neuron/runtime/schedule_dispatcher.ex`
Durable Redis state	`MirrorNeuron/lib/mirror_neuron/persistence/redis_store.ex`
gRPC server surfaces	`MirrorNeuron/lib/mirror_neuron_grpc/server.ex`
CLI commands	`mn-cli/mn_cli/main.py`, `mn-cli/mn_cli/libs/*.py`
Python SDK	`mn-python-sdk/mn_sdk/client.py`, `mn-python-sdk/mn_sdk/bundle.py`, `mn-python-sdk/mn_sdk/decorators.py`
Blueprint validation	`mn-python-sdk/mn_sdk/blueprint_validation.py`

What Is Still Intentionally V1

resources and devices drive scheduling and runtime environment hints, not OS isolation
volumes are validated and advertised, but not mounted automatically by core
ports are explicit only, with no dynamic port allocator yet
service health affects discovery and placement, but does not automatically restart jobs yet
deployment rollout behavior is focused on service and system jobs
scheduled child jobs are ordinary jobs, so the normal reliability model still applies

Nomad-Inspired Runtime Features

Nomad-Inspired Runtime Features

Feature Status

Design Concept

How To Use The Features Together

Important Code

What Is Still Intentionally V1

On this page