BSP Optimization Journey

Last Updated: 2026-01-16

Overview

This page documents the optimization experiments conducted on BSP, tracking what worked, what didn't, and the measurable impact of each change.

Initial State (Baseline)

BPC: 4.93

vs Gzip: -104%

Fixed universe (100k), no compression machine

Current State (2026-01-16)

BPC: 2.20

vs Gzip: +8.6%

Adaptive universe + compression machine + vocab fix

Optimization Timeline

DS-020: Adaptive Universe [SUCCESS]

Date: 2026-01-16 (early)

Problem: A fixed universe size of 100,000 meant every surprise cost log₂(100k) ≈ 16.6 bits, even when only ~1,000 tokens had actually been seen.

Solution: Dynamic universe sizing based on observed vocabulary: effectiveUniverse = max(1000, vocabSize × 2)
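A minimal sketch of the sizing rule (illustrative names, not the actual BSPEngine API; the vocabulary sizes in the example are assumptions that roughly match the 1k- and 5k-line runs):

```js
// Sketch of DS-020 adaptive universe sizing. Illustrative only.
const MIN_UNIVERSE = 1000;

function effectiveUniverse(vocabSize) {
  // Universe tracks observed vocabulary instead of a fixed 100k ceiling.
  return Math.max(MIN_UNIVERSE, vocabSize * 2);
}

function surpriseCostBits(vocabSize) {
  // Cost of encoding one surprising token against the current universe.
  return Math.log2(effectiveUniverse(vocabSize));
}

console.log(surpriseCostBits(1100).toFixed(1)); // ≈ 11.1 bits (was 16.6 fixed)
console.log(surpriseCostBits(4400).toFixed(1)); // ≈ 13.1 bits
```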

Results:

  • 1k lines: 16.6 → 11.1 bits/surprise (33% reduction)
  • 5k lines: 16.6 → 13.1 bits/surprise (21% reduction)
  • BPC improved but still had scaling issues

Spec: DS-020

DS-021: Compression Machine [SUCCESS]

Date: 2026-01-16 (mid)

Problem: Group-based compression alone couldn't exploit structural patterns like repetition and copying.

Solution: Procedural encoding layer with operators (cost comparison sketched after the list):

  • COPY: LZ77-style back-references (cost: ~14 bits vs 66 bits for 6 tokens)
  • REPEAT: Run-length encoding for patterns
  • TEMPLATE: Structure ready, learning pending
  • LITERAL: Fallback direct encoding
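
A simplified sketch of how these operator costs compete (hypothetical helpers, not the actual CompressionMachine.mjs API). It assumes a ~2k-word vocabulary (≈11 bits per literal token) and the 256-token recent context mentioned in the suffix-array experiment below:

```js
// Simplified operator cost competition, in bits. Illustrative only.
function literalCost(tokenCount, wordVocabSize) {
  // LITERAL fallback: each token pays log2(vocab) bits.
  return tokenCount * Math.log2(wordVocabSize);
}

function copyCost(contextSize, maxCopyLen) {
  // COPY: LZ77-style back-reference encoded as (offset, length).
  return Math.ceil(Math.log2(contextSize)) + Math.ceil(Math.log2(maxCopyLen));
}

// Copying 6 tokens from a 256-token context: ~14 bits vs ~66 bits literal.
const literal = literalCost(6, 2048); // 6 × 11 = 66 bits
const copy = copyCost(256, 64);       // 8 + 6 = 14 bits
console.log(`LITERAL ${literal} bits vs COPY ${copy} bits`);
```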

Results:

  • Program win rate: 85% on 5k lines
  • COPY operations: 3,637 uses, 26 bits savings each
  • Combined BPC: 2.98 → 2.79 (6.3% improvement)
  • But still had vocab explosion issue...

Spec: DS-021

Vocabulary Decoupling Fix [BREAKTHROUGH]

Date: 2026-01-16 (late)

Problem: N-gram tokenization (1-3 grams) caused vocabulary explosion:

  • Input: "the cat sat" → 6 tokens: [the, cat, sat, the_cat, cat_sat, the_cat_sat]
  • 5k lines → 4,483 n-gram tokens vs ~1,200 actual words
  • CompressionMachine used full n-gram vocab for cost calculation
  • Penalty: ~2.2 bits/token overpayment

Solution: Decoupled vocabularies (see the sketch after this list):

  • BSPEngine.vocabTracker: All tokens (n-grams) for universe sizing
  • CompressionMachine.wordVocab: Only unigrams for cost calculation
  • Updated all operators to use effectiveVocab consistently
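
A minimal sketch of the decoupling under the n-gram scheme described above (class and variable names here are illustrative, not the real BSPEngine/CompressionMachine internals):

```js
// Sketch of decoupled vocabularies. Illustrative only.
class VocabTracker {
  constructor() { this.tokens = new Set(); }
  add(token) { this.tokens.add(token); }
  get size() { return this.tokens.size; }
}

const vocabTracker = new VocabTracker(); // all tokens incl. n-grams -> universe sizing
const wordVocab = new VocabTracker();    // unigrams only -> compression costs

function observe(words) {
  // Unigrams feed both vocabularies; 2-/3-grams only feed the universe tracker.
  for (const w of words) { vocabTracker.add(w); wordVocab.add(w); }
  for (let i = 0; i + 1 < words.length; i++) vocabTracker.add(words[i] + '_' + words[i + 1]);
  for (let i = 0; i + 2 < words.length; i++) vocabTracker.add(words.slice(i, i + 3).join('_'));
}

observe('the cat sat'.split(' '));
// Literal word costs use the unigram vocab, not the inflated n-gram vocab.
const literalBits = Math.log2(Math.max(2, wordVocab.size));
console.log(vocabTracker.size, wordVocab.size, literalBits.toFixed(2)); // 6 3 1.58
```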

Results:

| Run | Before Fix | After Fix |
|-----|------------|-----------|
| 1k lines | 2.27 BPC | 2.04 BPC |
| 5k lines | 2.79 BPC | 2.20 BPC |
| Scaling | Degraded with more data ❌ | Scales correctly ✅ |

  • 1k lines: 10.1% improvement (2.27 → 2.04)
  • 5k lines: 21.1% improvement (2.79 → 2.20)
  • Program win rate: 48% → 85% as data scales
  • Both configurations now beat Gzip (2.41 BPC)

Files Modified:

  • CompressionMachine.mjs: _tryRepeatEncoding, _tryTemplateEncoding, _matchTemplate

Performance Metrics

Compression Quality (5k lines training)

| Metric | Value | Notes |
|--------|-------|-------|
| BPC (Combined) | 2.20 | Beats Gzip (2.41) by 8.6% |
| BPC (Groups Only) | 2.98 | 26% worse than combined |
| Shannon Entropy | 4.38 | Character-level baseline |
| Program Win Rate | 85.0% | Machine dominates |
| COPY Operations | 3,637 | 26 bits savings/op |

System Performance

| Metric | Value | Status |
|--------|-------|--------|
| Throughput | 338 lines/sec | ⚠️ Needs faster COPY search (see Rolling Hash Map plan) |
| Groups Learned | 1,144 | ✅ Linear scaling (~0.23/line) |
| Vocab Size | 4,483 (n-grams) | ✅ For semantic grouping |
| Word Vocab | ~1,200 (unigrams) | ✅ For compression costs |

BLiMP Grammatical Competence

| Task | Accuracy |
|------|----------|
| Anaphor Agreement | 46.0% |
| Determiner-Noun 1 | 17.2% |
| Determiner-Noun 2 | 41.3% |
| Average | 34.8% |

Key Insights

What Works

  • LZ77-style COPY: Dominates with 85% win rate on narrative text
  • Adaptive sizing: MDL principle reduces cost by 21-33%
  • Vocabulary decoupling: Critical for correct cost calculation
  • Hybrid architecture: Groups for semantics, programs for structure

What Doesn't Work Yet

  • REPEAT operations: Only 1 use in 5k lines (needs fuzzy matching)
  • Template learning: First implementation tested and disabled; exact-match templates don't generalize (see below)
  • Group-only compression: 26% worse than combined (high surprise count)

Scaling Behavior

  • BPC increases slightly (2.04 → 2.20) but stays below Gzip
  • Program win rate increases dramatically (48% → 85%)
  • More training data → more context → better COPY matches
  • Throughput decreases due to O(N×M) COPY search

Future Optimizations

Priority 1: Template Learning [TESTED - FAILED]

Date Tested: 2026-01-16

Expected Impact: 2.20 → ~1.80 BPC (18% improvement)

Actual Results:

  • Templates learned: 15
  • Templates used: 0
  • BPC: 2.21 (+0.01, worse)
  • Throughput: 221 l/s (-35%) ❌

Why It Failed:

  • Train/test mismatch - templates don't generalize
  • Too specific - exact phrase matching
  • Pure overhead - 28,905 failed match attempts

Decision: Feature disabled. See EXPERIMENT_TEMPLATE_LEARNING.md

Lesson: COPY works because it matches recent context (same document). Templates tried to match training data (different documents).

Suffix Array for COPY [TESTED - PARTIAL]

Date Tested: 2026-01-16

Expected Impact: 305 → 500+ lines/sec

Actual Results:

  • Quick test (1k): 458 l/s (+50%) ✅
  • Full test (5k): 232 l/s (-24%) ❌
  • BPC: No change

Why It Failed: With a small context window (256 tokens) and frequent rebuilds, the O(N log N) build cost outweighed the O(N×M) search savings.

Decision: Disabled by default. See EXPERIMENT_SUFFIX_ARRAY.md

Lesson: Simple O(N) can beat complex O(log N) for small N

Priority 1 (NEW): Rolling Hash Map [PLANNED]

Expected Impact: 301 → 400+ lines/sec (~33% throughput gain)

Why this approach: O(1) average-case lookup with negligible build overhead (a single O(N) pass, no rebuild penalty)

Solution:

  • Hash first 3 tokens at each context position
  • Lookup candidates in O(1) average case
  • Extend matches linearly

Complexity: Build O(N), Query O(1), Memory O(N)
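
A sketch of the planned index (an assumed design based on the bullets above; none of these names exist in the codebase yet):

```js
// Sketch of a rolling-hash candidate index for COPY matching. Illustrative only.
const K = 3; // index on the first 3 tokens of each candidate position

function keyAt(tokens, i) {
  // A simple string key stands in for a rolling hash; Map lookup is O(1) average.
  return tokens.slice(i, i + K).join('\u0000');
}

function buildIndex(context) {
  const index = new Map(); // key -> list of positions, built in one O(N) pass
  for (let i = 0; i + K <= context.length; i++) {
    const key = keyAt(context, i);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(i);
  }
  return index;
}

function longestCopy(context, index, target) {
  // O(1) average lookup of candidate start positions, then linear extension.
  let best = { offset: -1, length: 0 };
  for (const pos of index.get(keyAt(target, 0)) ?? []) {
    let len = 0;
    while (len < target.length && context[pos + len] === target[len]) len++;
    if (len > best.length) best = { offset: pos, length: len };
  }
  return best;
}
```

Keying a plain Map on the first K tokens keeps the build at a single O(N) pass with no rebuild penalty, which is exactly what hurt the suffix-array variant.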

Priority 3: Frequency-Weighted Coding [PLANNED]

Expected Impact: 2.20 → ~1.90 BPC (14% improvement)

Concept: High-frequency words should cost less than rare words

Current: All words cost log₂(vocab) ≈ 11 bits

Proposed: Huffman-style encoding (sketched after the list below)

  • Top 100 words (50% of text): ~7 bits
  • Common words (30% of text): ~10 bits
  • Rare words (20% of text): ~14 bits
  • Average: ~9 bits (18% reduction)
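
A rough sketch of the idea using ideal code lengths (-log₂ p) rather than an explicit Huffman tree (illustrative names; nothing here is implemented yet):

```js
// Sketch of frequency-weighted token costs (planned, not implemented).
// Ideal code length for a token with probability p is -log2(p) bits,
// so frequent words automatically pay fewer bits than rare ones.
function buildCosts(freq) {
  const total = [...freq.values()].reduce((a, b) => a + b, 0);
  return new Map([...freq].map(([w, c]) => [w, -Math.log2(c / total)]));
}

const costs = buildCosts(new Map([['the', 900], ['cat', 90], ['xylem', 10]]));
console.log([...costs].map(([w, b]) => `${w}: ${b.toFixed(1)} bits`));
// [ 'the: 0.2 bits', 'cat: 3.5 bits', 'xylem: 6.6 bits' ]

// Sanity check on the tier averages quoted above:
// 0.5 × 7 + 0.3 × 10 + 0.2 × 14 ≈ 9.3 bits vs a flat ~11 bits today.
```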

Theoretical Limits

Shannon Entropy: 4.38 BPC (character-level)

Current: 2.20 BPC (50% of entropy)

Estimated Ceiling: ~1.50 BPC with all optimizations

  • Template learning (would require generalizable templates; the first attempt above failed): -0.30 BPC → 1.90
  • Frequency coding: -0.20 BPC → 1.70
  • Better grouping: -0.20 BPC → 1.50
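
For reference, the character-level Shannon entropy used as the baseline is just the empirical entropy of the character distribution. A minimal standalone sketch (hypothetical helper, not part of the eval harness):

```js
// Empirical character-level Shannon entropy, in bits per character (BPC).
function shannonBpc(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {               // iterate by code point
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
    total++;
  }
  let bits = 0;
  for (const c of counts.values()) {
    const p = c / total;
    bits -= p * Math.log2(p);            // H = -Σ p·log2(p)
  }
  return bits;
}
// Assuming the 4.38 figure above is the empirical character entropy of the
// 5k-line training text, shannonBpc(trainingText) should land near it.
```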

Resources

Quick Commands

```bash
# Quick benchmark (1k lines, ~17s)
node evals/runLM_Comp.mjs --quick --retrain

# Full benchmark (5k lines, ~1.3min)
node evals/runLM_Comp.mjs --retrain

# View latest results
cat evals/lm_comparative/results/latest.json | jq
```