Training and emergent compilation

This page is a theory note. It expands the topic in short chapters and defines terminology without duplicating the formal specification documents.

The diagram has a transparent background and is intended to be read together with the caption and the sections below.

Related wiki pages: VM, event stream, context scope, MDL, RL, LLM, macro-unit.

Related specs: DS005, DS010, DS011, DS012.

Overview

VSAVM treats “compilation” as a learned capability driven by compression pressure. Repeated patterns create incentives to represent intent as reusable executable programs (inner loop, DS005) and as reusable surface continuations (outer loop, DS011). Crucially, scope boundaries must emerge from structure (DS010 / NFS11), so learning does not rely on hardcoded topical partitions.

The two loops (what exists today)

Inner loop (DS005): pattern mining, schema proposal, consolidation, and rule learning. The repository contains the inner-loop building blocks (PatternMiner / SchemaProposer / Consolidator) and a coordinator (TrainingService).
Outer loop (DS011): a macro-unit language model trained to continue byte sequences under budgets, optionally conditioned on VM state. The current concrete implementation is MacroUnitModel (src/training/outer-loop/macro-unit-model.mjs).

The loops are compatible by design: inner-loop consolidation can produce more reusable units and programs; outer-loop continuation benefits from stable reversible macro-units. Not every integration point is wired into every harness yet, so the practical pipeline is documented below.

Practical training pipeline (eval_tinyLLM)

eval_tinyLLM is the “ground truth” harness for today’s reproducible training and comparisons. It trains a small TensorFlow byte-level Transformer and VSAVM’s macro-unit model on the same prepared dataset and writes timestamped reports.

Step-by-step: from raw text to a comparison report

Fetch data: download a raw dataset into eval_tinyLLM/cache/.
Prepare a split: create train.txt and valid.txt under a deterministic datasetId (keyed by maxBytes/trainRatio/textField).
Train VSAVM macro-units: stream bytes from train.txt and train MacroUnitModel.trainStream. Optionally ingest facts into the VM, but large runs typically use --skip-ingest to focus on the language-model comparison.
Train TF baseline: train a minimal byte-level Transformer (kept small on purpose so training stays feasible).
Evaluate: compute perplexity and auxiliary metrics for both engines.
Compare: run a budgeted prompt suite and write an HTML+JSON report to eval_tinyLLM/results/<timestamp>_results.html.

Artifacts are versioned by dataset size and model configuration

Prepared datasets and trained models are stored under eval_tinyLLM/cache/ so multiple sizes and multiple model variants can coexist without overwriting:

Datasets: cache/datasets/<datasetId>/train.txt, valid.txt, meta.json + a latest.json pointer.
Models: cache/models/{vsavm,tf}/<datasetId>/<modelId>/ with meta.json and per-engine artifacts + latest.json pointers.
Results: timestamped results/<timestamp>_results.html and .json reports for comparisons.

This is what makes size-based comparisons realistic: you can train multiple variants on different byte budgets and compare them without manually moving files.

Scaling guidance (larger datasets, RAM constraints)

The outer-loop macro-unit model is designed to stream training data, but it still maintains in-memory n-gram maps and subsequence counters. Large datasets are therefore feasible only with explicit caps and pruning.

Use streaming training: MacroUnitModel.trainStream processes one record at a time (no full corpus in RAM).
Cap n-gram order: keep maxNgramOrder small (default in the harness is 8) to prevent combinatorial growth.
Prune aggressively: increase minFrequency / pruneThreshold on larger runs.
Sample subsequences: set subsequenceSampleRate to trade accuracy for memory/time.
Guard the heap: use eval_tinyLLM/tools/run-with-ram.mjs and/or train-large.mjs to pick a safe --max-old-space-size automatically.

Disk-backed fact storage (DS012) reduces RAM pressure when ingesting and persisting facts. It does not currently move language-model n-gram state to disk.

Risks and mitigations

Compression can consolidate spurious patterns if prediction quality is the only criterion. VSAVM mitigates this by (a) scoping via DS010 so unstable patterns do not contaminate unrelated regions and (b) correctness checks (DS004) when translating learned structure into executable commitments.

training-and-emergence diagram — Compilation emerges when prediction pressure makes compact representations the cheapest explanation for recurring patterns. Inner-loop consolidation targets executable programs; outer-loop consolidation targets reversible macro-units for continuation.

References

Minimum description length (Wikipedia) The MDL Book (Grunwald) Program synthesis (Wikipedia) Reinforcement learning (Wikipedia)