RL as shaping for stable choices
This page is a theory note. It expands the topic in short sections and defines terminology without duplicating the formal specification documents.
Related wiki pages: VM, event stream, VSA, bounded closure, consistency contract.
Related specs: DS005.
Overview
VSAVM uses RL as shaping rather than as a replacement for language training. The system often faces multiple plausible candidate programs or response modes. A learned preference can bias selection toward candidates that have historically remained consistent under closure.
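A minimal sketch of that bias, under assumed names (PreferenceTable, survived_closure) that do not come from the specs: per-kind survival statistics from past closure checks weight a softmax over the current candidates, so the preference nudges selection without hard-filtering anything.

    # Hypothetical sketch: bias candidate selection by historical closure survival.
    import math
    import random
    from collections import defaultdict

    class PreferenceTable:
        def __init__(self, temperature: float = 0.5):
            self.temperature = temperature
            # Per candidate kind: how often it survived bounded closure.
            self.stats = defaultdict(lambda: {"survived": 0, "total": 0})

        def update(self, kind: str, survived_closure: bool) -> None:
            # Closure outcomes are the only training signal in this sketch.
            s = self.stats[kind]
            s["total"] += 1
            s["survived"] += int(survived_closure)

        def score(self, kind: str) -> float:
            # Laplace-smoothed survival rate so unseen kinds are not ruled out.
            s = self.stats[kind]
            return (s["survived"] + 1) / (s["total"] + 2)

        def choose(self, candidate_kinds: list[str]) -> str:
            # Softmax over survival rates: a soft bias, not a hard filter.
            weights = [math.exp(self.score(k) / self.temperature) for k in candidate_kinds]
            return random.choices(candidate_kinds, weights=weights, k=1)[0]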
What is optimized
The action space is intentionally small: selecting among candidate programs, schemas, or response modes. This avoids token-level RL, which is expensive and difficult to audit. Each action corresponds to a semantic decision that can be logged and evaluated.
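One illustrative way to encode those decisions (ActionKind, Decision, and the JSONL log are assumptions, not spec names): each selection records what was chosen and what the alternatives were, so the decision can be audited and later scored against the closure outcome.

    # Hypothetical sketch: a small, discrete, auditable action space.
    from dataclasses import dataclass, field, asdict
    from enum import Enum
    from typing import Optional
    import json
    import time

    class ActionKind(Enum):
        SELECT_PROGRAM = "select_program"
        SELECT_SCHEMA = "select_schema"
        SELECT_RESPONSE_MODE = "select_response_mode"

    @dataclass
    class Decision:
        kind: ActionKind
        candidate_id: str                       # which program / schema / mode was picked
        alternatives: list[str]                 # what else was on the table
        timestamp: float = field(default_factory=time.time)
        closure_outcome: Optional[str] = None   # filled in once closure has run

    def log_decision(decision: Decision, path: str = "decisions.jsonl") -> None:
        # Append-only JSONL log so every semantic decision stays reviewable.
        record = asdict(decision)
        record["kind"] = decision.kind.value
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")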
Signals and discipline
Bounded closure naturally provides negative feedback when contradictions are detected. Additional shaping can penalize branching blow-ups and reward compact programs. The resulting preferences steer search toward stable solutions without overriding the explicit consistency gate.
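A hedged sketch of that shaping (field names and penalty weights are assumptions): a detected contradiction yields a fixed negative signal, branching and program size are lightly penalized, and the acceptance gate ignores the reward entirely.

    # Hypothetical sketch: shaped reward plus a separate, non-negotiable gate.
    from dataclasses import dataclass

    @dataclass
    class ClosureResult:
        contradiction_found: bool
        branches_explored: int
        program_size: int   # e.g. number of operations in the candidate program

    def shaped_reward(result: ClosureResult,
                      branch_penalty: float = 0.01,
                      size_penalty: float = 0.005) -> float:
        if result.contradiction_found:
            return -1.0  # negative feedback straight from bounded closure
        reward = 1.0
        reward -= branch_penalty * result.branches_explored   # discourage branching blow-ups
        reward -= size_penalty * result.program_size          # prefer compact programs
        return reward

    def accept(result: ClosureResult) -> bool:
        # The explicit consistency gate: shaping never overrides it.
        return not result.contradiction_found

Keeping accept separate from shaped_reward mirrors the point above: the learned preference influences which candidate is tried next, not whether an inconsistent candidate can pass.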
Trade-offs
Shaping can overfit to a narrow verifier if that verifier does not reflect the real failure modes. The safe approach is to keep RL as a stability prior and to leave the correctness guarantee with the explicit closure checks and deterministic boundary behavior.
References
Reinforcement learning (Wikipedia); Sutton & Barto, Reinforcement Learning: An Introduction (book); Multi-armed bandit (Wikipedia)