BSP Optimization Journey

Last Updated: 2026-01-16

Overview

This page documents the optimization experiments conducted on BSP, tracking what worked, what didn't, and the measurable impact of each change.

Initial State (Baseline)

BPC: 4.93

vs Gzip: -104%

Fixed universe (100k), no compression machine

Current State (2026-01-16)

BPC: 2.20

vs Gzip: +8.6%

Adaptive universe + compression machine + vocab fix

Optimization Timeline

DS-020: Adaptive Universe [SUCCESS]

Date: 2026-01-16 (early)

Problem: A fixed universe size of 100,000 meant every surprise cost log₂(100k) ≈ 16.6 bits, even when only ~1,000 tokens had actually been seen.

Solution: Dynamic universe sizing based on observed vocabulary: effectiveUniverse = max(1000, vocabSize × 2)
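A minimal sketch of the sizing rule (illustrative names, not the actual BSPEngine API; the vocabulary sizes in the example are assumptions that roughly match the 1k- and 5k-line runs):

```js
// Sketch of DS-020 adaptive universe sizing. Illustrative only.
const MIN_UNIVERSE = 1000;

function effectiveUniverse(vocabSize) {
  // Universe tracks observed vocabulary instead of a fixed 100k ceiling.
  return Math.max(MIN_UNIVERSE, vocabSize * 2);
}

function surpriseCostBits(vocabSize) {
  // Cost of encoding one surprising token against the current universe.
  return Math.log2(effectiveUniverse(vocabSize));
}

console.log(surpriseCostBits(1100).toFixed(1)); // ≈ 11.1 bits (was 16.6 fixed)
console.log(surpriseCostBits(4400).toFixed(1)); // ≈ 13.1 bits
```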

Results:

  • 1k lines: 16.6 → 11.1 bits/surprise (33% reduction)
  • 5k lines: 16.6 → 13.1 bits/surprise (21% reduction)
  • BPC improved but still had scaling issues

Spec: DS-020

DS-021: Compression Machine [SUCCESS]

Date: 2026-01-16 (mid)

Problem: Group-based compression alone couldn't exploit structural patterns like repetition and copying.

Solution: Procedural encoding layer with operators (cost comparison sketched after the list):

  • COPY: LZ77-style back-references (cost: ~14 bits vs 66 bits for 6 tokens)
  • REPEAT: Run-length encoding for patterns
  • TEMPLATE: Structure ready, learning pending
  • LITERAL: Fallback direct encoding
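
A simplified sketch of how these operator costs compete (hypothetical helpers, not the actual CompressionMachine.mjs API). It assumes a ~2k-word vocabulary (≈11 bits per literal token) and the 256-token recent context mentioned in the suffix-array experiment below:

```js
// Simplified operator cost competition, in bits. Illustrative only.
function literalCost(tokenCount, wordVocabSize) {
  // LITERAL fallback: each token pays log2(vocab) bits.
  return tokenCount * Math.log2(wordVocabSize);
}

function copyCost(contextSize, maxCopyLen) {
  // COPY: LZ77-style back-reference encoded as (offset, length).
  return Math.ceil(Math.log2(contextSize)) + Math.ceil(Math.log2(maxCopyLen));
}

// Copying 6 tokens from a 256-token context: ~14 bits vs ~66 bits literal.
const literal = literalCost(6, 2048); // 6 × 11 = 66 bits
const copy = copyCost(256, 64);       // 8 + 6 = 14 bits
console.log(`LITERAL ${literal} bits vs COPY ${copy} bits`);
```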

Results:

  • Program win rate: 85% on 5k lines
  • COPY operations: 3,637 uses, 26 bits savings each
  • Combined BPC: 2.98 → 2.79 (6.3% improvement)
  • But still had vocab explosion issue...

Spec: DS-021

Vocabulary Decoupling Fix [BREAKTHROUGH]

Date: 2026-01-16 (late)

Problem: N-gram tokenization (1-3 grams) caused vocabulary explosion:

  • Input: "the cat sat" → 6 tokens: [the, cat, sat, the_cat, cat_sat, the_cat_sat]
  • 5k lines → 4,483 n-gram tokens vs ~1,200 actual words
  • CompressionMachine used full n-gram vocab for cost calculation
  • Penalty: ~2.2 bits/token overpayment

Solution: Decoupled vocabularies (see the sketch after this list):

  • BSPEngine.vocabTracker: All tokens (n-grams) for universe sizing
  • CompressionMachine.wordVocab: Only unigrams for cost calculation
  • Updated all operators to use effectiveVocab consistently
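
A minimal sketch of the decoupling under the n-gram scheme described above (class and variable names here are illustrative, not the real BSPEngine/CompressionMachine internals):

```js
// Sketch of decoupled vocabularies. Illustrative only.
class VocabTracker {
  constructor() { this.tokens = new Set(); }
  add(token) { this.tokens.add(token); }
  get size() { return this.tokens.size; }
}

const vocabTracker = new VocabTracker(); // all tokens incl. n-grams -> universe sizing
const wordVocab = new VocabTracker();    // unigrams only -> compression costs

function observe(words) {
  // Unigrams feed both vocabularies; 2-/3-grams only feed the universe tracker.
  for (const w of words) { vocabTracker.add(w); wordVocab.add(w); }
  for (let i = 0; i + 1 < words.length; i++) vocabTracker.add(words[i] + '_' + words[i + 1]);
  for (let i = 0; i + 2 < words.length; i++) vocabTracker.add(words.slice(i, i + 3).join('_'));
}

observe('the cat sat'.split(' '));
// Literal word costs use the unigram vocab, not the inflated n-gram vocab.
const literalBits = Math.log2(Math.max(2, wordVocab.size));
console.log(vocabTracker.size, wordVocab.size, literalBits.toFixed(2)); // 6 3 1.58
```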

Results:

| Run | Before Fix | After Fix |
|-----|------------|-----------|
| 1k lines | 2.27 BPC | 2.04 BPC |
| 5k lines | 2.79 BPC | 2.20 BPC |
| Scaling | Degraded with more data ❌ | Scales correctly ✅ |

  • 1k lines: 10.1% improvement (2.27 → 2.04)
  • 5k lines: 21.1% improvement (2.79 → 2.20)
  • Program win rate: 48% → 85% as data scales
  • Both configurations now beat Gzip (2.41 BPC)

Files Modified:

  • CompressionMachine.mjs: _tryRepeatEncoding, _tryTemplateEncoding, _matchTemplate

Performance Metrics

Compression Quality (5k lines training)

| Metric | Value | Notes |
|--------|-------|-------|
| BPC (Combined) | 2.20 | Beats Gzip (2.41) by 8.6% |
| BPC (Groups Only) | 2.98 | 26% worse than combined |
| Shannon Entropy | 4.38 | Character-level baseline |
| Program Win Rate | 85.0% | Machine dominates |
| COPY Operations | 3,637 | 26 bits savings/op |

System Performance

| Metric | Value | Status |
|--------|-------|--------|
| Throughput | 338 lines/sec | ⚠️ Needs faster COPY search (see Rolling Hash Map plan) |
| Groups Learned | 1,144 | ✅ Linear scaling (~0.23/line) |
| Vocab Size | 4,483 (n-grams) | ✅ For semantic grouping |
| Word Vocab | ~1,200 (unigrams) | ✅ For compression costs |

BLiMP Grammatical Competence

| Task | Accuracy |
|------|----------|
| Anaphor Agreement | 46.0% |
| Determiner-Noun 1 | 17.2% |
| Determiner-Noun 2 | 41.3% |
| Average | 34.8% |

Key Insights

What Works

  • LZ77-style COPY: Dominates with 85% win rate on narrative text
  • Adaptive sizing: MDL principle reduces cost by 21-33%
  • Vocabulary decoupling: Critical for correct cost calculation
  • Hybrid architecture: Groups for semantics, programs for structure

What Doesn't Work Yet

  • REPEAT operations: Only 1 use in 5k lines (needs fuzzy matching)
  • Template learning: First implementation tested and disabled; exact-match templates don't generalize (see below)
  • Group-only compression: 26% worse than combined (high surprise count)

Scaling Behavior

  • BPC increases slightly (2.04 → 2.20) but stays below Gzip
  • Program win rate increases dramatically (48% → 85%)
  • More training data → more context → better COPY matches
  • Throughput decreases due to O(N×M) COPY search

Future Optimizations

Priority 1: Template Learning [TESTED - FAILED]

Date Tested: 2026-01-16

Expected Impact: 2.20 → ~1.80 BPC (18% improvement)

Actual Results:

  • Templates learned: 15
  • Templates used: 0
  • BPC: 2.21 (+0.01, worse)
  • Throughput: 221 l/s (-35%) ❌

Why It Failed:

  • Train/test mismatch - templates don't generalize
  • Too specific - exact phrase matching
  • Pure overhead - 28,905 failed match attempts

Decision: Feature disabled. See EXPERIMENT_TEMPLATE_LEARNING.md

Lesson: COPY works because it matches recent context (same document). Templates tried to match training data (different documents).

Suffix Array for COPY [TESTED - PARTIAL]

Date Tested: 2026-01-16

Expected Impact: 305 → 500+ lines/sec

Actual Results:

  • Quick test (1k): 458 l/s (+50%) ✅
  • Full test (5k): 232 l/s (-24%) ❌
  • BPC: No change

Why It Failed: With a small context window (256 tokens) and frequent rebuilds, the O(N log N) build cost outweighed the O(N×M) search savings.

Decision: Disabled by default. See EXPERIMENT_SUFFIX_ARRAY.md

Lesson: Simple O(N) can beat complex O(log N) for small N

Priority 1 (NEW): Rolling Hash Map [PLANNED]

Expected Impact: 301 → 400+ lines/sec (~33% throughput gain)

Why this approach: O(1) average-case lookup with negligible build overhead (a single O(N) pass, no rebuild penalty)

Solution:

  • Hash first 3 tokens at each context position
  • Lookup candidates in O(1) average case
  • Extend matches linearly

Complexity: Build O(N), Query O(1), Memory O(N)
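
A sketch of the planned index (an assumed design based on the bullets above; none of these names exist in the codebase yet):

```js
// Sketch of a rolling-hash candidate index for COPY matching. Illustrative only.
const K = 3; // index on the first 3 tokens of each candidate position

function keyAt(tokens, i) {
  // A simple string key stands in for a rolling hash; Map lookup is O(1) average.
  return tokens.slice(i, i + K).join('\u0000');
}

function buildIndex(context) {
  const index = new Map(); // key -> list of positions, built in one O(N) pass
  for (let i = 0; i + K <= context.length; i++) {
    const key = keyAt(context, i);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(i);
  }
  return index;
}

function longestCopy(context, index, target) {
  // O(1) average lookup of candidate start positions, then linear extension.
  let best = { offset: -1, length: 0 };
  for (const pos of index.get(keyAt(target, 0)) ?? []) {
    let len = 0;
    while (len < target.length && context[pos + len] === target[len]) len++;
    if (len > best.length) best = { offset: pos, length: len };
  }
  return best;
}
```

Keying a plain Map on the first K tokens keeps the build at a single O(N) pass with no rebuild penalty, which is exactly what hurt the suffix-array variant.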

Priority 3: Frequency-Weighted Coding [PLANNED]

Expected Impact: 2.20 → ~1.90 BPC (14% improvement)

Concept: High-frequency words should cost less than rare words

Current: All words cost log₂(vocab) ≈ 11 bits

Proposed: Huffman-style encoding (sketched after the list below)

  • Top 100 words (50% of text): ~7 bits
  • Common words (30% of text): ~10 bits
  • Rare words (20% of text): ~14 bits
  • Average: ~9 bits (18% reduction)
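
A rough sketch of the idea using ideal code lengths (-log₂ p) rather than an explicit Huffman tree (illustrative names; nothing here is implemented yet):

```js
// Sketch of frequency-weighted token costs (planned, not implemented).
// Ideal code length for a token with probability p is -log2(p) bits,
// so frequent words automatically pay fewer bits than rare ones.
function buildCosts(freq) {
  const total = [...freq.values()].reduce((a, b) => a + b, 0);
  return new Map([...freq].map(([w, c]) => [w, -Math.log2(c / total)]));
}

const costs = buildCosts(new Map([['the', 900], ['cat', 90], ['xylem', 10]]));
console.log([...costs].map(([w, b]) => `${w}: ${b.toFixed(1)} bits`));
// [ 'the: 0.2 bits', 'cat: 3.5 bits', 'xylem: 6.6 bits' ]

// Sanity check on the tier averages quoted above:
// 0.5 × 7 + 0.3 × 10 + 0.2 × 14 ≈ 9.3 bits vs a flat ~11 bits today.
```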

Theoretical Limits

Shannon Entropy: 4.38 BPC (character-level)

Current: 2.20 BPC (50% of entropy)

Estimated Ceiling: ~1.50 BPC with all optimizations

  • Template learning (would require generalizable templates; the first attempt above failed): -0.30 BPC → 1.90
  • Frequency coding: -0.20 BPC → 1.70
  • Better grouping: -0.20 BPC → 1.50
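
For reference, the character-level Shannon entropy used as the baseline is just the empirical entropy of the character distribution. A minimal standalone sketch (hypothetical helper, not part of the eval harness):

```js
// Empirical character-level Shannon entropy, in bits per character (BPC).
function shannonBpc(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {               // iterate by code point
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
    total++;
  }
  let bits = 0;
  for (const c of counts.values()) {
    const p = c / total;
    bits -= p * Math.log2(p);            // H = -Σ p·log2(p)
  }
  return bits;
}
// Assuming the 4.38 figure above is the empirical character entropy of the
// 5k-line training text, shannonBpc(trainingText) should land near it.
```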

Resources

Quick Commands

```bash
# Quick benchmark (1k lines, ~17s)
node evals/runLM_Comp.mjs --quick --retrain

# Full benchmark (5k lines, ~1.3min)
node evals/runLM_Comp.mjs --retrain

# View latest results
cat evals/lm_comparative/results/latest.json | jq
```