System Benchmarks
Overview of the performance and accuracy metrics for MirrorNeuron system benchmarks.
MirrorNeuron system benchmarks provide deterministic fixtures and report generation for runtime evaluation.
Metrics
| Metric | What it measures |
|---|---|
| Workflow Completion Rate | Whether a full multi-step workflow finishes the intended task. |
| Fault Recovery Rate | Whether the workflow resumes after worker, tool, loop, or approval failures. |
| Tool Execution Accuracy | Whether agents choose expected tools, parameters, and side-effect boundaries. |
| Cost per Successful Workflow | Estimated runtime cost per workflow that completes successfully. |
| Human Intervention Rate | How often a person is needed at review or repair checkpoints. |
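
As a rough illustration of how the metrics in the table could be aggregated, the sketch below computes all five from a list of per-run records. The `WorkflowRun` record, its field names, and the `summarize` helper are hypothetical and are not part of the MirrorNeuron benchmark code. It also assumes that cost per successful workflow means total spend (failed runs included) divided by the number of successful runs, and that human intervention rate means the share of runs needing at least one intervention; the real definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRun:
    """One benchmark observation. All field names are hypothetical."""
    completed: bool            # workflow finished the intended task
    faults_injected: int       # worker, tool, loop, or approval failures encountered
    faults_recovered: int      # failures after which the workflow resumed
    tool_calls: int            # tool invocations made by agents
    tool_calls_correct: int    # invocations matching the expected tool and parameters
    cost_usd: float            # estimated runtime cost of this run
    human_interventions: int   # review or repair checkpoints that needed a person


def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def summarize(runs):
    """Aggregate WorkflowRun records into the five metrics from the table."""
    successful = [r for r in runs if r.completed]
    return {
        "workflow_completion_rate": _ratio(len(successful), len(runs)),
        "fault_recovery_rate": _ratio(
            sum(r.faults_recovered for r in runs),
            sum(r.faults_injected for r in runs),
        ),
        "tool_execution_accuracy": _ratio(
            sum(r.tool_calls_correct for r in runs),
            sum(r.tool_calls for r in runs),
        ),
        # Assumption: total spend, including failed runs, per successful run.
        "cost_per_successful_workflow": _ratio(
            sum(r.cost_usd for r in runs), len(successful)
        ),
        # Assumption: share of runs that needed at least one intervention.
        "human_intervention_rate": _ratio(
            sum(1 for r in runs if r.human_interventions > 0), len(runs)
        ),
    }
```

Returning `None` for an undefined ratio (for example, no faults were injected) keeps "not measured" distinct from a genuine score of zero.
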
Run Tests
cd mn-system-tests
python3 -m pytest benchmarks -q
Generate A Report
cd mn-system-tests
python3 -m benchmarks.agent_runtime_benchmark --output-dir /tmp/mn-agent-runtime-benchmark
The report includes completion, recovery, tool/action correctness, cost, human intervention, and a compact scorecard JSON object for CI trend checks.
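
A CI trend check on the scorecard might look something like the sketch below, which compares a current scorecard against a stored baseline and lists the metrics that got worse. The scorecard schema is not documented here, so the key names in `LOWER_IS_BETTER`, the flat `{metric: number}` shape, and the `find_regressions` and `load_scorecard` helpers are all assumptions made for illustration.

```python
import json

# Hypothetical key names; the real scorecard schema may differ.
LOWER_IS_BETTER = {"cost_per_successful_workflow", "human_intervention_rate"}


def load_scorecard(path):
    """Read a scorecard JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (baseline_value, current_value)} for metrics that worsened.

    A metric counts as worse when it moves in the wrong direction by more than
    `tolerance` (an absolute difference). Missing or non-numeric entries are skipped.
    """
    regressions = {}
    for key, old in baseline.items():
        new = current.get(key)
        if not isinstance(old, (int, float)) or not isinstance(new, (int, float)):
            continue
        delta = new - old
        if key in LOWER_IS_BETTER:
            worse = delta > tolerance
        else:
            worse = -delta > tolerance
        if worse:
            regressions[key] = (old, new)
    return regressions
```

A CI job could fail the build whenever `find_regressions` returns a non-empty dict. Because the tolerance is absolute, a single value suits rate metrics in the 0 to 1 range better than dollar costs, which may need their own threshold.
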
Publishing Notes
- Replace fixture observations with fresh artifacts from live MirrorNeuron runs before sharing external benchmark claims.
- Refresh provider pricing data before publishing cost estimates.
- Document any human-review rubric used for quality grading.
- Add a repository-level license before distributing benchmark fixtures or reports outside the project.
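
The pricing note above can be enforced mechanically. The following is a minimal sketch, assuming usage is recorded as billed quantities per unit name and that pricing data carries the date it was retrieved. The `estimate_cost_usd` function, both dict shapes, and the 30-day default are hypothetical and do not reflect how MirrorNeuron stores pricing.

```python
from datetime import date


def estimate_cost_usd(usage, unit_prices, retrieved_on, today, max_age_days=30):
    """Price a usage record, refusing pricing data older than max_age_days.

    usage:        {unit_name: quantity}      (hypothetical shape)
    unit_prices:  {unit_name: USD per unit}  (hypothetical shape)
    retrieved_on: date the pricing data was fetched
    today:        date of the estimate, passed in to keep the function deterministic
    """
    age_days = (today - retrieved_on).days
    if age_days > max_age_days:
        raise ValueError(
            f"pricing data is {age_days} days old; refresh it before publishing"
        )
    missing = sorted(set(usage) - set(unit_prices))
    if missing:
        raise KeyError(f"no price for: {', '.join(missing)}")
    return sum(quantity * unit_prices[name] for name, quantity in usage.items())
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible in fixture-based tests.
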