Mirror Neuron Documents

Nomad-Inspired Runtime Features

Documentation for Nomad-Inspired Runtime Features.

Nomad-Inspired Runtime Features

MirrorNeuron now has a small-lab orchestration layer inspired by Nomad. It is not a Kubernetes clone and it is not a full Nomad replacement. The goal is narrower: run agent workloads across a few machines, make placement decisions from real resources, recover safely when nodes fail, and expose enough operator control for AI lab maintenance.

Feature Status

FeatureStatusStart here
Reconciliation and automatic reschedulingImplementedReliability Guide and Cluster Guide
Full job type behaviorImplementedReliability Guide
Restart and reschedule policiesImplementedReliability Guide
Node drain and maintenanceImplementedCluster Guide
Service registry and health checksImplementedServices and Health Checks
Stronger resources and devicesImplementedResources and Devices
Deployment and update strategyImplementedDeployments
Periodic, delayed, and event-triggered jobsImplementedSchedules and Events

Design Concept

The runtime uses a desired-vs-actual model:

  • the manifest describes desired agents, services, resources, policies, and schedules
  • Redis stores durable job state, node state, snapshots, leases, schedules, services, and deployments
  • the scheduler chooses eligible target nodes and concrete allocations
  • job coordinators keep agents running and apply lifecycle policy
  • the leader sweeps recovery evals, orphaned jobs, due drains, and due schedules
  • the reconciler moves only the affected agents when it can, and restarts a whole job only when the coordinator lease is gone

That gives MirrorNeuron the Nomad-like heart for small clusters without requiring a heavy control plane.

How To Use The Features Together

Use cluster_recover for work that can be replayed safely:

{
  "type": "service",
  "policies": {
    "recovery_mode": "cluster_recover",
    "restart": {
      "attempts": 3,
      "interval_ms": 600000,
      "delay_ms": 1000,
      "delay_function": "exponential",
      "max_delay_ms": 30000,
      "mode": "fail"
    },
    "reschedule": {
      "unlimited": true,
      "delay_ms": 5000,
      "delay_function": "exponential",
      "max_delay_ms": 300000
    }
  }
}

Use system for one copy per eligible machine:

{
  "policies": {
    "scheduler": {
      "job_type": "system"
    }
  }
}

Use resource and service requirements together when a worker needs a healthy local service:

{
  "nodes": [
    {
      "node_id": "embedding_worker",
      "agent_type": "executor",
      "resources": {
        "devices": [
          {
            "kind": "gpu",
            "driver": "cuda",
            "min_memory_mb": 16000,
            "count": 1
          }
        ]
      },
      "requires_services": [
        {
          "name": "vllm",
          "tags": ["embedding"],
          "required": true
        }
      ]
    }
  ]
}

Drain a node before rebooting it:

mn node drain mirror_neuron@192.168.4.20 --reason "driver update" --deadline 30m --dry-run
mn node drain mirror_neuron@192.168.4.20 --reason "driver update" --deadline 30m --wait
mn node undrain mirror_neuron@192.168.4.20 --mark-eligible --reason "ready"

Deploy a long-running service under a stable key:

mn deployment deploy /path/to/bundle --key agent-api --strategy rolling --max-parallel 1
mn deployment status agent-api

Schedule a batch job:

mn schedule create /path/to/bundle --cron "0 2 * * *" --timezone America/New_York

Create an event trigger:

mn trigger create /path/to/bundle --event file_uploaded --filter-json '{"path":{"prefix":"/datasets/"}}'
mn event emit file_uploaded --payload-json '{"path":"/datasets/eval.jsonl"}'

Important Code

AreaImportant files
Manifest fields and validationMirrorNeuron/lib/mirror_neuron/manifest.ex
Placement schedulerMirrorNeuron/lib/mirror_neuron/scheduler.ex
Resource spec and allocation envMirrorNeuron/lib/mirror_neuron/resource_spec.ex
Node inventory and resource APIMirrorNeuron/lib/mirror_neuron/resource.ex
Job coordinator lifecycleMirrorNeuron/lib/mirror_neuron/runtime/job_coordinator.ex
Restart and reschedule policyMirrorNeuron/lib/mirror_neuron/runtime/lifecycle_policy.ex
ReconciliationMirrorNeuron/lib/mirror_neuron/cluster/reconciler.ex
Node monitor and leader sweepsMirrorNeuron/lib/mirror_neuron/cluster/node_monitor.ex, MirrorNeuron/lib/mirror_neuron/cluster/leader.ex
Drain and maintenanceMirrorNeuron/lib/mirror_neuron/cluster/node_drainer.ex
Service declarationsMirrorNeuron/lib/mirror_neuron/service_spec.ex
Service checks and registryMirrorNeuron/lib/mirror_neuron/service_check.ex, MirrorNeuron/lib/mirror_neuron/service_registry.ex, MirrorNeuron/lib/mirror_neuron/service_monitor.ex, MirrorNeuron/lib/mirror_neuron/service_preflight.ex
DeploymentsMirrorNeuron/lib/mirror_neuron/runtime/deployment_policy.ex, MirrorNeuron/lib/mirror_neuron/runtime/deployment_controller.ex
Schedules and triggersMirrorNeuron/lib/mirror_neuron/runtime/schedule_policy.ex, MirrorNeuron/lib/mirror_neuron/runtime/schedule_dispatcher.ex
Durable Redis stateMirrorNeuron/lib/mirror_neuron/persistence/redis_store.ex
gRPC server surfacesMirrorNeuron/lib/mirror_neuron_grpc/server.ex
CLI commandsmn-cli/mn_cli/main.py, mn-cli/mn_cli/libs/*.py
Python SDKmn-python-sdk/mn_sdk/client.py, mn-python-sdk/mn_sdk/bundle.py, mn-python-sdk/mn_sdk/decorators.py
Blueprint validationmn-python-sdk/mn_sdk/blueprint_validation.py

What Is Still Intentionally V1

  • resources and devices drive scheduling and runtime environment hints, not OS isolation
  • volumes are validated and advertised, but not mounted automatically by core
  • ports are explicit only, with no dynamic port allocator yet
  • service health affects discovery and placement, but does not automatically restart jobs yet
  • deployment rollout behavior is focused on service and system jobs
  • scheduled child jobs are ordinary jobs, so the normal reliability model still applies

On this page