Working definition: “agentic code generation” is a workflow where an AI agent iteratively proposes code changes (or new code), validates them against tests/evals, and keeps artifacts (traces, failing cases, hypotheses) so the process is reproducible and improves over time.
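A minimal sketch of that loop, assuming the proposal, validation, and persistence steps are supplied as callables (none of these names are AGISystem2 APIs):

```python
from typing import Any, Callable

def agentic_loop(
    propose: Callable[[list], Any],        # hypothetical: asks the model for a candidate change
    validate: Callable[[Any], dict],       # hypothetical: runs deterministic tests/evals on it
    persist: Callable[[list], None],       # hypothetical: stores traces, failing cases, hypotheses
    max_iterations: int = 5,
) -> list:
    """Propose -> validate -> keep artifacts, until the evals pass or the budget runs out."""
    history: list = []                     # the kept artifacts that make the run reproducible
    for i in range(max_iterations):
        candidate = propose(history)       # proposal is conditioned on earlier failures
        result = validate(candidate)       # pass/fail comes from deterministic evals, not the model
        history.append({"iteration": i, "candidate": candidate, "result": result})
        if result.get("passed"):
            break
    persist(history)
    return history
```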

Why this matters for AGISystem2

AGISystem2 is built around explicit theories (DSL), deterministic execution, and evaluation suites. That makes it a strong platform for “learning by synthesis”: instead of only prompting a model for answers, we can use models to propose executable artifacts (patches, tests, clarified specs, and DSL theory fragments) that the system can then validate deterministically.

General pattern (LLM + data → programs + theories)

| Stage | Input | Agent output | Validation signal |
| --- | --- | --- | --- |
| 1) Observe | Failures, slow paths, ambiguous proofs, missing coverage | Hypotheses about root cause + candidate fixes | Reproducible minimal failing case |
| 2) Synthesize | Specs + codebase context + failing case | Patch (code/tests) or DSL theory snippet | Static checks + targeted tests |
| 3) Evaluate | Local test suite + eval runners | Run results + diffs in behavior | Pass/fail + performance deltas |
| 4) Curate | New learnings | Quarantine bad hypotheses; keep minimal repro + explanation | Regression tests prevent reintroduction |
| 5) Formalize | Stable patterns in failures and fixes | Promote to DS/Docs, Core theories, or reusable modules | Traceability in specs + repeatable evals |
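Read one way, the table is a staged pipeline in which each stage hands off to the next only when its validation signal holds. A sketch under that reading; the stage names mirror the table, everything else is illustrative:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    name: str                          # "Observe", "Synthesize", "Evaluate", "Curate", "Formalize"
    run: Callable[[Any], Any]          # produces the agent output for this stage
    validate: Callable[[Any], bool]    # the table's "validation signal" for that output

def run_pipeline(stages: list[Stage], state: Any) -> Any:
    """Advance through the five stages, stopping at the first one whose
    validation signal does not hold (nothing gets promoted past that point)."""
    for stage in stages:
        state = stage.run(state)
        if not stage.validate(state):
            raise RuntimeError(f"{stage.name}: validation signal failed")
    return state
```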

AutoDiscovery as a concrete instance

AutoDiscovery is AGISystem2’s internal “agentic debugging” and regression discovery workflow. It runs evaluation suites, collects failures and traces, and uses structured analysis to identify minimal bug cases and propose fixes.
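The minimal-bug-case step can be pictured as a greedy shrinking loop: keep dropping parts of a failing input while the deterministic failure still reproduces. A sketch of that idea, not the actual AutoDiscovery implementation:

```python
from typing import Callable, Sequence

def minimize_failing_case(case: Sequence, still_fails: Callable[[Sequence], bool]) -> list:
    """Greedy shrink: repeatedly drop elements while the failure still reproduces.
    `still_fails` re-runs the deterministic evaluation on a candidate case."""
    case = list(case)
    changed = True
    while changed:
        changed = False
        for i in range(len(case)):
            candidate = case[:i] + case[i + 1:]   # try the case without element i
            if candidate and still_fails(candidate):
                case = candidate                  # smaller case still reproduces the bug
                changed = True
                break
    return case
```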

What is “learned”

Not model weights. AGISystem2 “learns” by adding stable artifacts: regression tests, repaired reasoning rules, clarified specs, and (optionally) new DSL theory fragments.
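As data, one such artifact might look like this (field names are illustrative, not an AGISystem2 schema):

```python
from dataclasses import dataclass, field

@dataclass
class RegressionArtifact:
    """One 'learned' item: a minimal repro plus the explanation and fix that resolved it."""
    repro: str                    # minimal failing input or DSL snippet
    expected: str                 # the behavior the fix must preserve
    explanation: str              # root-cause hypothesis that survived falsification
    fixed_by: str                 # identifier of the patch or theory change
    suites: list[str] = field(default_factory=list)   # eval suites that now cover this case
```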

Why it scales

The evaluation suite is the judge. As coverage grows, the agent’s search space becomes safer: a candidate patch is accepted only if it preserves semantics across many suites, so incorrect changes are increasingly filtered out.
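A sketch of the suite acting as judge, assuming a hypothetical run_suite(name, patch) runner that returns pass/fail:

```python
from typing import Callable

def accept_patch(patch: str, suites: list[str],
                 run_suite: Callable[[str, str], bool]) -> bool:
    """A patch is accepted only if every suite passes; growing coverage
    shrinks the space of incorrect patches that can slip through."""
    return all(run_suite(name, patch) for name in suites)
```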

Why it stays scientific

The process produces inspectable evidence: minimal repro cases, proof traces, and performance counters. Hypotheses are falsified by deterministic runs, not by narrative.

From code synthesis to theory synthesis

UTE-oriented research (see Universal Theory Engine (UTE)) needs a pathway from data and experiments to new theory fragments. Agentic workflows can help by generating candidate DSL theories and checking them against observed data.
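A minimal sketch of that check, assuming a candidate theory can be evaluated as a predicate over observations (the LLM proposal step itself is elided):

```python
from typing import Callable, Iterable

Observation = dict
Theory = Callable[[Observation], bool]   # a candidate theory as a checkable predicate

def surviving_theories(candidates: Iterable[Theory],
                       observations: list[Observation]) -> list[Theory]:
    """Keep only candidates consistent with every observed data point.
    The model proposes; deterministic checking decides what survives."""
    return [t for t in candidates if all(t(obs) for obs in observations)]
```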

Important guardrail: LLMs are used as proposal engines, not as truth engines. Anything promoted to “theory” (DSL) or “algorithm” (code) must be validated by deterministic evaluation and, when applicable, proof/evidence objects.
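The guardrail expressed as a promotion check; all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    kind: str                       # "theory" (DSL) or "algorithm" (code)
    payload: str                    # the DSL fragment or patch text
    eval_passed: bool               # result of the deterministic evaluation run
    evidence: Optional[str] = None  # proof/evidence object, when applicable

def may_promote(c: Candidate, evidence_required: bool) -> bool:
    """A proposal is promoted only on deterministic evidence, never on model say-so."""
    return c.eval_passed and (not evidence_required or c.evidence is not None)
```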

Where this goes next (research questions)