Mirror Neuron Documents

System Benchmarks

Overview of the performance and accuracy metrics for MirrorNeuron system benchmarks.

MirrorNeuron system benchmarks provide deterministic fixtures and report generation for runtime evaluation.

Metrics

Metric                          What it measures
Workflow Completion Rate        Whether a full multi-step workflow finishes the intended task.
Fault Recovery Rate             Whether the workflow resumes after worker, tool, loop, or approval failures.
Tool Execution Accuracy         Whether agents choose expected tools, parameters, and side-effect boundaries.
Cost per Successful Workflow    Estimated runtime cost for successful workflow execution.
Human Intervention Rate         How often a person is needed at review or repair checkpoints.
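
As a rough illustration of how these five metrics could be derived from per-workflow observations, the sketch below aggregates a list of run records. The `RunRecord` fields, the `summarize` function, and the exact formulas (for example, dividing total cost by the number of successful workflows) are assumptions made for this example, not the definitions used by the MirrorNeuron benchmark code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    # Hypothetical per-workflow observation; field names are illustrative only.
    completed: bool           # did the workflow finish the intended task?
    faults_injected: int      # worker, tool, loop, or approval failures seen
    faults_recovered: int     # failures after which the workflow resumed
    tool_calls: int           # tool calls made by agents
    tool_calls_correct: int   # calls matching the expected tool and parameters
    cost_usd: float           # estimated runtime cost of this run
    human_interventions: int  # review or repair checkpoints needing a person


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def summarize(runs: list[RunRecord]) -> dict[str, Optional[float]]:
    """Aggregate run records into the five metrics (assumed formulas)."""
    successes = [r for r in runs if r.completed]
    return {
        "workflow_completion_rate": safe_ratio(len(successes), len(runs)),
        "fault_recovery_rate": safe_ratio(
            sum(r.faults_recovered for r in runs),
            sum(r.faults_injected for r in runs),
        ),
        "tool_execution_accuracy": safe_ratio(
            sum(r.tool_calls_correct for r in runs),
            sum(r.tool_calls for r in runs),
        ),
        # Assumption: cost of all runs is spread over the successful ones.
        "cost_per_successful_workflow": safe_ratio(
            sum(r.cost_usd for r in runs), len(successes)
        ),
        # Assumption: share of runs that needed at least one person.
        "human_intervention_rate": safe_ratio(
            sum(1 for r in runs if r.human_interventions > 0), len(runs)
        ),
    }
```

Returning `None` for an empty denominator keeps "no data" distinct from a genuine rate of zero, which matters when a fixture set injects no faults at all.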

Run Tests

cd mn-system-tests
python3 -m pytest benchmarks -q

Generate A Report

cd mn-system-tests
python3 -m benchmarks.agent_runtime_benchmark --output-dir /tmp/mn-agent-runtime-benchmark

The report covers completion, recovery, tool/action correctness, cost, and human intervention, and also includes a compact scorecard JSON object for CI trend checks.
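
One way a CI trend check could consume the scorecard is to compare it against a stored baseline and flag metrics that moved in the wrong direction. The sketch below assumes the scorecard is a flat JSON object mapping metric names to numbers; the key names, the `find_regressions` helper, and the tolerance value are hypothetical and should be adapted to the scorecard the report actually emits.

```python
import json

# Assumed metric keys and whether a higher value is an improvement.
# These names are illustrative; check them against the real scorecard.
HIGHER_IS_BETTER = {
    "workflow_completion_rate": True,
    "fault_recovery_rate": True,
    "tool_execution_accuracy": True,
    "cost_per_successful_workflow": False,
    "human_intervention_rate": False,
}


def load_scorecard(path: str) -> dict:
    """Read a scorecard JSON file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> list[str]:
    """Return metric names that got worse by more than `tolerance`.

    The tolerance is an absolute difference, so for the cost metric it is
    expressed in currency units rather than as a fraction.
    """
    regressions = []
    for key, higher_is_better in HIGHER_IS_BETTER.items():
        if key not in baseline or key not in current:
            continue  # skip metrics missing from either scorecard
        delta = current[key] - baseline[key]
        worsened_by = -delta if higher_is_better else delta
        if worsened_by > tolerance:
            regressions.append(key)
    return regressions
```

A CI job could call `find_regressions(load_scorecard(old_path), load_scorecard(new_path))` and fail the build when the returned list is non-empty.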

Publishing Notes

  • Replace fixture observations with fresh artifacts from live MirrorNeuron runs before sharing external benchmark claims.
  • Refresh provider pricing data before publishing cost estimates.
  • Document any human-review rubric used for quality grading.
  • Add a repository-level license before distributing benchmark fixtures or reports outside the project.
