Evals Suite

The evalsSuite/ directory is the evaluation harness used to test probabilistic runtime behavior in AchillesAgentLib. Its role is different from that of ordinary deterministic unit tests. Instead of checking only fixed software outputs, it evaluates whether LLM-mediated planning, routing, generation, and session behavior remain sufficiently reliable across defined case sets and model configurations.

Evaluation Rather Than Ordinary Testing

The distinction matters because the target of evaluation is not a purely deterministic function. In this repository, the evaluated behaviors often involve model-mediated decisions such as intent classification, step planning, code generation, or comparative performance across different models. The purpose of evalsSuite is therefore not to impose one universal reproducibility rule. It is to establish disciplined confidence about recurring behavior under controlled case sets, prompts, and runtimes.

This is also why the suite should not be described through one global success threshold. Different scripts report success in different ways. Some count passing cases directly. Some compare expected variables against observed session state. Some generate benchmark tables across many models. Some support repeated runs, semantic comparison, or filtered reruns of failed cases. The common principle is methodological, not the enforcement of one fixed pass-rate number across the whole directory.
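To make the "count passing cases directly" style of reporting concrete, the following is a minimal sketch of a per-suite summarizer in plain Node.js. The function name and the result shape ({ name, passed }) are illustrative assumptions, not code from any evalsSuite script.

```javascript
// Hypothetical sketch: aggregating per-case results into a suite summary.
// The { name, passed } record shape is an assumption for illustration.
function summarizeResults(results) {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  const passRate =
    results.length === 0 ? 0 : (results.length - failed.length) / results.length;
  return { total: results.length, failed, passRate };
}

const summary = summarizeResults([
  { name: "case-001", passed: true },
  { name: "case-002", passed: false },
  { name: "case-003", passed: true },
]);
console.log(summary.passRate.toFixed(2)); // → "0.67"
```

A benchmark-style script would instead keep the per-case records and report comparative metrics across models rather than collapsing them into one number.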

Current Evaluation Surfaces

The current evaluation tree covers several distinct runtime surfaces. The planning/ directory contains full session-level evaluations for both startLoopAgentSession and startSOPLangAgentSession. The detectIntents evaluator measures skill detection and semantic parameter accuracy against JSON case files and a declared skill-description space. The mirror-code-gen evaluator generates code for benchmark skills and then executes behavior-oriented checks against the generated artifacts. The modelBenchmark scripts compare configured models under different scenarios, including fast-model benchmarks, deep-model benchmarks, SOP-versus-loop sweeps, ad hoc orchestrator evaluation, and code-generation benchmarks. The tree also includes the Anthropic Skill evaluations under anthropic-skills/ and a separate agentic-performance evaluator that compares loop and SOP behavior over the same performance cases.

Current evaluation tree

  • evalsSuite/
    • planning/
    • detectIntents/
    • mirror-code-gen/
    • modelBenchmark/
    • anthropic-skills/
    • performanceCases/
    • evalDetectIntents.mjs
    • evalAgenticPerformance.mjs
    • runMainSuites.js

Note

runMainSuites.js is not a universal runner for the entire directory. In the current implementation it runs the two planning-related suites only: startSOPLangAgentSession and startLoopAgentSession.

How Cases Are Represented

Most evaluators operate over explicit case files or explicit benchmark configurations. The detectIntents evaluator reads JSON cases from evalsSuite/detectIntents/ together with a shared skillsDescription.json. The planning evaluators load JSON cases from evalsSuite/planning/startLoopAgentSession/ and evalsSuite/planning/startSOPLangAgentSession/. The agentic-performance evaluator reads structured performance cases from evalsSuite/performanceCases/. The modelBenchmark scripts combine case collections, model configuration, CLI options, and reporting logic to compare families of providers or named models under the same evaluation program.

The scoring logic is correspondingly local to each script. For example, evalDetectIntents.mjs distinguishes between key detection and semantic parameter matching. evalSOPLangPlanning.mjs validates expected output against session variables and lastAnswer. evalAgenticPerformance.mjs measures repeated loop and SOP execution over the same cases. The benchmark scripts in modelBenchmark/ emphasize comparative metrics such as latency, token usage, pass rate, and model-to-model ranking rather than a single boolean pass condition.
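The two-level check that evalDetectIntents.mjs is described as making — first whether the right skill key was detected, then whether the parameters match — can be sketched as follows. The comparison rules here are assumptions for illustration; the script's actual semantic parameter matching would typically involve a model-mediated comparison rather than the simple string normalization shown.

```javascript
// Illustrative sketch of a two-level score: key detection, then parameter
// matching. Real semantic matching is more permissive than this
// normalize-and-compare stand-in.
function scoreCase(expected, observed) {
  const keyDetected = expected.skill === observed.skill;
  const paramsMatch =
    keyDetected &&
    Object.entries(expected.params).every(
      ([k, v]) =>
        String(observed.params?.[k] ?? "").trim().toLowerCase() ===
        String(v).trim().toLowerCase()
    );
  return { keyDetected, paramsMatch };
}

const score = scoreCase(
  { skill: "bookFlight", params: { destination: "Paris" } },
  { skill: "bookFlight", params: { destination: " paris " } }
);
console.log(score); // → { keyDetected: true, paramsMatch: true }
```

Separating the two levels lets a report distinguish "wrong skill entirely" from "right skill, wrong parameters," which is more informative than a single boolean per case.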

Operational Entry Points

The most direct planning-oriented entry point is node evalsSuite/runMainSuites.js, which launches the two full session suites and summarizes failed cases. For intent evaluation, the relevant entry point is node evalsSuite/evalDetectIntents.mjs, which also supports rerunning only previously failed cases. For direct loop-versus-SOP comparison across the performance cases, the entry point is node evalsSuite/evalAgenticPerformance.mjs. For benchmark-oriented work, the entry points live under evalsSuite/modelBenchmark/ and expose their own CLI options for model selection, difficulty filtering, run count, output files, and related controls.

The suite therefore behaves less like one monolithic application and more like an evaluation workspace with several specialized runners. This is consistent with the role described in the architectural material: evaluation is treated as a separate methodological layer around the runtime, not as a thin wrapper around ordinary unit tests.