Working definition: “agentic code generation” is a workflow where an AI agent iteratively proposes code changes (or new code), validates them against tests/evals, and keeps artifacts (traces, failing cases, hypotheses) so the process is reproducible and improves over time.
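A minimal sketch of that loop, assuming the proposal, validation, and persistence steps are supplied as callables (none of these names are AGISystem2 APIs):

```python
from typing import Any, Callable

def agentic_loop(
    propose: Callable[[list], Any],        # hypothetical: asks the model for a candidate change
    validate: Callable[[Any], dict],       # hypothetical: runs deterministic tests/evals on it
    persist: Callable[[list], None],       # hypothetical: stores traces, failing cases, hypotheses
    max_iterations: int = 5,
) -> list:
    """Propose -> validate -> keep artifacts, until the evals pass or the budget runs out."""
    history: list = []                     # the kept artifacts that make the run reproducible
    for i in range(max_iterations):
        candidate = propose(history)       # proposal is conditioned on earlier failures
        result = validate(candidate)       # pass/fail comes from deterministic evals, not the model
        history.append({"iteration": i, "candidate": candidate, "result": result})
        if result.get("passed"):
            break
    persist(history)
    return history
```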

Why this matters for AGISystem2

AGISystem2 is built around explicit theories (DSL), deterministic execution, and evaluation suites. That makes it a strong platform for “learning by synthesis”: instead of only prompting a model for answers, we can use models to propose executable artifacts (patches, tests, clarified specs, and DSL theory fragments) that the system can then validate deterministically.

General pattern (LLM + data → programs + theories)

| Stage | Input | Agent output | Validation signal |
| --- | --- | --- | --- |
| 1) Observe | Failures, slow paths, ambiguous proofs, missing coverage | Hypotheses about root cause + candidate fixes | Reproducible minimal failing case |
| 2) Synthesize | Specs + codebase context + failing case | Patch (code/tests) or DSL theory snippet | Static checks + targeted tests |
| 3) Evaluate | Local test suite + eval runners | Run results + diffs in behavior | Pass/fail + performance deltas |
| 4) Curate | New learnings | Quarantine bad hypotheses; keep minimal repro + explanation | Regression tests prevent reintroduction |
| 5) Formalize | Stable patterns in failures and fixes | Promote to DS/Docs, Core theories, or reusable modules | Traceability in specs + repeatable evals |
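Read one way, the table is a staged pipeline in which each stage hands off to the next only when its validation signal holds. A sketch under that reading; the stage names mirror the table, everything else is illustrative:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    name: str                          # "Observe", "Synthesize", "Evaluate", "Curate", "Formalize"
    run: Callable[[Any], Any]          # produces the agent output for this stage
    validate: Callable[[Any], bool]    # the table's "validation signal" for that output

def run_pipeline(stages: list[Stage], state: Any) -> Any:
    """Advance through the five stages, stopping at the first one whose
    validation signal does not hold (nothing gets promoted past that point)."""
    for stage in stages:
        state = stage.run(state)
        if not stage.validate(state):
            raise RuntimeError(f"{stage.name}: validation signal failed")
    return state
```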

AutoDiscovery as a concrete instance

AutoDiscovery is AGISystem2’s internal “agentic debugging” and regression discovery workflow. It runs evaluation suites, collects failures and traces, and uses structured analysis to identify minimal bug cases and propose fixes.
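The minimal-bug-case step can be pictured as a greedy shrinking loop: keep dropping parts of a failing input while the deterministic failure still reproduces. A sketch of that idea, not the actual AutoDiscovery implementation:

```python
from typing import Callable, Sequence

def minimize_failing_case(case: Sequence, still_fails: Callable[[Sequence], bool]) -> list:
    """Greedy shrink: repeatedly drop elements while the failure still reproduces.
    `still_fails` re-runs the deterministic evaluation on a candidate case."""
    case = list(case)
    changed = True
    while changed:
        changed = False
        for i in range(len(case)):
            candidate = case[:i] + case[i + 1:]   # try the case without element i
            if candidate and still_fails(candidate):
                case = candidate                  # smaller case still reproduces the bug
                changed = True
                break
    return case
```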

What is “learned”

Not model weights. AGISystem2 “learns” by adding stable artifacts: regression tests, repaired reasoning rules, clarified specs, and (optionally) new DSL theory fragments.
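As data, one such artifact might look like this (field names are illustrative, not an AGISystem2 schema):

```python
from dataclasses import dataclass, field

@dataclass
class RegressionArtifact:
    """One 'learned' item: a minimal repro plus the explanation and fix that resolved it."""
    repro: str                    # minimal failing input or DSL snippet
    expected: str                 # the behavior the fix must preserve
    explanation: str              # root-cause hypothesis that survived falsification
    fixed_by: str                 # identifier of the patch or theory change
    suites: list[str] = field(default_factory=list)   # eval suites that now cover this case
```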

Why it scales

The evaluation suite is the judge. As coverage grows, the agent’s search space becomes safer: a candidate patch is accepted only if it preserves semantics across many suites, so incorrect changes are increasingly filtered out.
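A sketch of the suite acting as judge, assuming a hypothetical run_suite(name, patch) runner that returns pass/fail:

```python
from typing import Callable

def accept_patch(patch: str, suites: list[str],
                 run_suite: Callable[[str, str], bool]) -> bool:
    """A patch is accepted only if every suite passes; growing coverage
    shrinks the space of incorrect patches that can slip through."""
    return all(run_suite(name, patch) for name in suites)
```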

Why it stays scientific

The process produces inspectable evidence: minimal repro cases, proof traces, and performance counters. Hypotheses are falsified by deterministic runs, not by narrative.

From code synthesis to theory synthesis

UTE-oriented research (see Universal Theory Engine (UTE)) needs a pathway from data and experiments to new theory fragments. Agentic workflows can help by generating candidate DSL theories and checking them against observed data.
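A minimal sketch of that check, assuming a candidate theory can be evaluated as a predicate over observations (the LLM proposal step itself is elided):

```python
from typing import Callable, Iterable

Observation = dict
Theory = Callable[[Observation], bool]   # a candidate theory as a checkable predicate

def surviving_theories(candidates: Iterable[Theory],
                       observations: list[Observation]) -> list[Theory]:
    """Keep only candidates consistent with every observed data point.
    The model proposes; deterministic checking decides what survives."""
    return [t for t in candidates if all(t(obs) for obs in observations)]
```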

Important guardrail: LLMs are used as proposal engines, not as truth engines. Anything promoted to “theory” (DSL) or “algorithm” (code) must be validated by deterministic evaluation and, when applicable, proof/evidence objects.
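The guardrail expressed as a promotion check; all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    kind: str                       # "theory" (DSL) or "algorithm" (code)
    payload: str                    # the DSL fragment or patch text
    eval_passed: bool               # result of the deterministic evaluation run
    evidence: Optional[str] = None  # proof/evidence object, when applicable

def may_promote(c: Candidate, evidence_required: bool) -> bool:
    """A proposal is promoted only on deterministic evidence, never on model say-so."""
    return c.eval_passed and (not evidence_required or c.evidence is not None)
```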

Where this goes next (research questions)