Macro-unit (macro token)

This wiki entry defines a term used across VSAVM and explains why it matters in the architecture.

The diagram has a transparent background and highlights the operational meaning of the term inside VSAVM.

Related wiki pages: VM, event stream, VSA, bounded closure, consistency contract.

Definition

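A macro-unit is a sequence of tokens (bytes, in the current training harness) that is promoted to a single reusable unit because it improves compression under MDL (minimum description length) and is useful for continuation prediction (DS011). Promotion is reversible: each unit expands deterministically back to its original sequence.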
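The MDL criterion in the definition can be illustrated with a deliberately simplified cost model. The sketch below is not the DS011 scoring function: the names mdlGainBits and shouldPromote are hypothetical, and it assumes every symbol (raw byte or unit ID) costs the same fixed number of bits.

```javascript
// Hypothetical, simplified MDL estimate; NOT the DS011 scoring function.
// Assumption: every symbol (raw byte or unit ID) costs `bitsPerSymbol` bits.
function mdlGainBits(sequenceLength, occurrences, bitsPerSymbol = 8) {
  // Each occurrence shrinks from `sequenceLength` symbols to a single unit ID.
  const savedBits = occurrences * (sequenceLength - 1) * bitsPerSymbol;
  // The dictionary stores the sequence once so the unit can be expanded again.
  const dictionaryBits = sequenceLength * bitsPerSymbol;
  return savedBits - dictionaryBits;
}

// Promote only when the total description length goes down.
function shouldPromote(sequenceLength, occurrences, bitsPerSymbol = 8) {
  return mdlGainBits(sequenceLength, occurrences, bitsPerSymbol) > 0;
}

console.log(mdlGainBits(4, 10)); // 208
console.log(shouldPromote(4, 1)); // false
```

Under this model a 4-byte sequence seen once is not worth promoting, because the dictionary entry costs more than the single occurrence saves. A real scorer would typically use frequency-dependent code lengths rather than a fixed cost per symbol.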

Role in VSAVM

Macro-units provide a “larger than token” unit for the DS011 outer loop:

Fewer steps per continuation: predicting a frequent multi-byte unit reduces decoding iterations.
Compression pressure: frequent sequences become reusable units that reduce description length.
Stable handles: unit IDs can be counted, cached, versioned, and compared across runs.

Mechanics and implications

Reversibility is mandatory. If expansion is ambiguous, scoring becomes inconsistent and the system cannot maintain traceability. VSAVM treats deterministic expansion as a hard constraint.
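The reversibility constraint can be checked as a round trip: encoding a byte sequence and expanding it again must return exactly the original bytes. The following is a minimal sketch under stated assumptions, not the VSAVM implementation: the dictionary layout, the convention that unit IDs start at 256, and the function names encode, decode, and roundTrips are all hypothetical.

```javascript
// Hypothetical dictionary: unit IDs start at 256 so raw bytes 0..255 stay literal.
const exampleUnits = new Map([
  [256, [116, 104, 101]], // "the"
  [257, [116, 104]],      // "th"
]);

// Greedy longest-match encoding (linear scan over the dictionary for clarity).
function encode(bytes, units) {
  const out = [];
  let i = 0;
  while (i < bytes.length) {
    let bestId = -1;
    let bestLen = 0;
    for (const [id, seq] of units) {
      if (seq.length > bestLen && seq.every((b, k) => bytes[i + k] === b)) {
        bestId = id;
        bestLen = seq.length;
      }
    }
    if (bestId >= 0) {
      out.push(bestId);
      i += bestLen;
    } else {
      out.push(bytes[i]); // no unit matches here: emit the raw byte
      i += 1;
    }
  }
  return out;
}

// Deterministic expansion: each unit ID maps to exactly one byte sequence.
function decode(symbols, units) {
  const out = [];
  for (const s of symbols) {
    if (s < 256) {
      out.push(s);
    } else {
      const seq = units.get(s);
      if (!seq) throw new Error(`unknown unit ID ${s}`);
      out.push(...seq);
    }
  }
  return out;
}

// True when decode(encode(bytes)) reproduces the input exactly.
function roundTrips(bytes, units) {
  const back = decode(encode(bytes, units), units);
  return back.length === bytes.length && back.every((b, k) => b === bytes[k]);
}

const sample = [116, 104, 101, 32, 116, 104, 97, 119]; // "the thaw"
console.log(encode(sample, exampleUnits)); // [ 256, 32, 257, 97, 119 ]
console.log(roundTrips(sample, exampleUnits)); // true
```

Here 8 bytes become 5 symbols, and because every unit ID has a single expansion, the decoded output is identical no matter which units the encoder happened to choose.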

Macro-units are not the same thing as structural separators:

Separators (DS010) split the stream into structural regions (paragraphs, scenes, functions).
Macro-units (DS011) compress recurring content inside (or across) those regions.

Implementation notes (current code)

The concrete macro-unit model is implemented in src/training/outer-loop/macro-unit-model.mjs. It supports streaming training (trainStream), bounded n-gram orders, pruning, and a trie for fast encoding/decoding.
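To make the combination of streaming input, bounded n-gram orders, and pruning concrete, here is a small sketch of chunked n-gram counting. It does not reproduce the API of macro-unit-model.mjs: the function countNgrams and the options minOrder, maxOrder, and minCount are hypothetical names chosen for illustration, and pruning is reduced to a single frequency threshold applied at the end.

```javascript
// Hypothetical sketch; NOT the API of macro-unit-model.mjs.
// `chunks` is any iterable of byte arrays (plain arrays or Uint8Array).
function countNgrams(chunks, { minOrder = 2, maxOrder = 4, minCount = 2 } = {}) {
  const counts = new Map();
  let tail = []; // last (maxOrder - 1) bytes seen so far, carried across chunks
  for (const chunk of chunks) {
    const buf = tail.concat(Array.from(chunk));
    // Count only n-grams that END inside the new chunk, so nothing is counted twice.
    for (let end = tail.length + 1; end <= buf.length; end++) {
      for (let n = minOrder; n <= maxOrder && n <= end; n++) {
        const key = buf.slice(end - n, end).join(",");
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
    // Keep just enough context for n-grams that span the next chunk boundary.
    tail = buf.slice(Math.max(0, buf.length - (maxOrder - 1)));
  }
  // Pruning: drop n-grams that are too rare to be promotion candidates.
  for (const [key, count] of counts) {
    if (count < minCount) counts.delete(key);
  }
  return counts;
}

// "ababab" delivered as two chunks; bigrams only.
const bigrams = countNgrams([[97, 98, 97], [98, 97, 98]], { minOrder: 2, maxOrder: 2, minCount: 1 });
console.log(bigrams.get("97,98")); // 3
console.log(bigrams.get("98,97")); // 2
```

Memory use in this sketch is bounded by the number of distinct n-grams up to maxOrder rather than by the stream length. A production implementation would likely prune periodically during training instead of only once at the end.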

In eval_tinyLLM, trained macro-unit models are cached under eval_tinyLLM/cache/models/vsavm/<datasetId>/<modelId>/ so multiple dataset sizes and multiple model variants can coexist.
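The cache layout above can be expressed as a small path builder. This is an illustrative helper only: the function modelCacheDir and the example IDs are hypothetical, and the actual eval_tinyLLM code may construct or validate the path differently.

```javascript
// Hypothetical helper; the real eval_tinyLLM code may build this path differently.
function modelCacheDir(datasetId, modelId) {
  for (const part of [datasetId, modelId]) {
    // Reject empty IDs, dot segments, and path separators so that one model
    // cannot resolve into another model's directory.
    if (typeof part !== "string" || part === "" || part === "." || part === ".." || /[\/\\]/.test(part)) {
      throw new Error(`invalid cache path segment: ${String(part)}`);
    }
  }
  return `eval_tinyLLM/cache/models/vsavm/${datasetId}/${modelId}/`;
}

// Example IDs are made up for illustration.
console.log(modelCacheDir("ds-1k", "model-a"));
// eval_tinyLLM/cache/models/vsavm/ds-1k/model-a/
```

Keying the directory on both IDs is what lets several dataset sizes and model variants sit side by side without overwriting each other.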

References