RL as shaping for stable choices
This page is a theory note. It expands the topic in short sections and defines terminology without duplicating the formal specification documents.
Related wiki pages: VM, event stream, VSA, bounded closure, consistency contract.
Related specs: DS005.
Overview
VSAVM uses RL as shaping rather than as a replacement for language training. The system often faces multiple plausible candidate programs or response modes. A learned preference can bias selection toward candidates that have historically remained consistent under closure.
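A minimal sketch of that bias, under assumed names (PreferenceTable, survived_closure) that do not come from the specs: per-kind survival statistics from past closure checks weight a softmax over the current candidates, so the preference nudges selection without hard-filtering anything.

    # Hypothetical sketch: bias candidate selection by historical closure survival.
    import math
    import random
    from collections import defaultdict

    class PreferenceTable:
        def __init__(self, temperature: float = 0.5):
            self.temperature = temperature
            # Per candidate kind: how often it survived bounded closure.
            self.stats = defaultdict(lambda: {"survived": 0, "total": 0})

        def update(self, kind: str, survived_closure: bool) -> None:
            # Closure outcomes are the only training signal in this sketch.
            s = self.stats[kind]
            s["total"] += 1
            s["survived"] += int(survived_closure)

        def score(self, kind: str) -> float:
            # Laplace-smoothed survival rate so unseen kinds are not ruled out.
            s = self.stats[kind]
            return (s["survived"] + 1) / (s["total"] + 2)

        def choose(self, candidate_kinds: list[str]) -> str:
            # Softmax over survival rates: a soft bias, not a hard filter.
            weights = [math.exp(self.score(k) / self.temperature) for k in candidate_kinds]
            return random.choices(candidate_kinds, weights=weights, k=1)[0]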
What is optimized
The action space is intentionally small: selecting among candidate programs, schemas, or response modes. This avoids token-level RL, which is expensive and difficult to audit. Each action corresponds to a semantic decision that can be logged and evaluated.
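One illustrative way to encode those decisions (ActionKind, Decision, and the JSONL log are assumptions, not spec names): each selection records what was chosen and what the alternatives were, so the decision can be audited and later scored against the closure outcome.

    # Hypothetical sketch: a small, discrete, auditable action space.
    from dataclasses import dataclass, field, asdict
    from enum import Enum
    from typing import Optional
    import json
    import time

    class ActionKind(Enum):
        SELECT_PROGRAM = "select_program"
        SELECT_SCHEMA = "select_schema"
        SELECT_RESPONSE_MODE = "select_response_mode"

    @dataclass
    class Decision:
        kind: ActionKind
        candidate_id: str                       # which program / schema / mode was picked
        alternatives: list[str]                 # what else was on the table
        timestamp: float = field(default_factory=time.time)
        closure_outcome: Optional[str] = None   # filled in once closure has run

    def log_decision(decision: Decision, path: str = "decisions.jsonl") -> None:
        # Append-only JSONL log so every semantic decision stays reviewable.
        record = asdict(decision)
        record["kind"] = decision.kind.value
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")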
Signals and discipline
Bounded closure naturally provides negative feedback when contradictions are detected. Additional shaping can penalize branching blow-ups and reward compact programs. The resulting preferences steer search toward stable solutions without overriding the explicit consistency gate.
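A hedged sketch of that shaping (field names and penalty weights are assumptions): a detected contradiction yields a fixed negative signal, branching and program size are lightly penalized, and the acceptance gate ignores the reward entirely.

    # Hypothetical sketch: shaped reward plus a separate, non-negotiable gate.
    from dataclasses import dataclass

    @dataclass
    class ClosureResult:
        contradiction_found: bool
        branches_explored: int
        program_size: int   # e.g. number of operations in the candidate program

    def shaped_reward(result: ClosureResult,
                      branch_penalty: float = 0.01,
                      size_penalty: float = 0.005) -> float:
        if result.contradiction_found:
            return -1.0  # negative feedback straight from bounded closure
        reward = 1.0
        reward -= branch_penalty * result.branches_explored   # discourage branching blow-ups
        reward -= size_penalty * result.program_size          # prefer compact programs
        return reward

    def accept(result: ClosureResult) -> bool:
        # The explicit consistency gate: shaping never overrides it.
        return not result.contradiction_found

Keeping accept separate from shaped_reward mirrors the point above: the learned preference influences which candidate is tried next, not whether an inconsistent candidate can pass.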
Trade-offs
Shaping can overfit to a narrow verifier if that verifier does not reflect the real failure modes. The safe approach is to keep RL as a stability prior and to leave the correctness guarantee with the explicit closure checks and deterministic boundary behavior.
References
Reinforcement learning (Wikipedia); Sutton & Barto, Reinforcement Learning: An Introduction (book); Multi-armed bandit (Wikipedia)