The Anti-Transformer Manifesto

Why Bitsets? Why CPU? Understanding the philosophy behind the architecture.

The Elephant in the Room

Modern AI is dominated by the Transformer architecture. Transformers are designed for GPUs and rely on massive dense matrix multiplications, with attention that scales as O(N²) in sequence length. This makes them heavy, static, and opaque.

The Core Bet: Sparsity by Design

BSP starts from a simple premise: Real-world concepts are sparse.

Out of 100,000 possible words, a sentence only uses ~10. Out of 1,000,000 visual features, a scene only contains ~50.

Transformers represent this as a 100,000-dimensional vector that is zero everywhere except in 10 slots. Then they multiply this mostly-zero vector by a dense matrix. This is mathematically correct but computationally wasteful.

BSP represents this as a Set (Bitset).
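
As a rough sketch of the difference (plain Python with made-up indices; the real BSP encoding is not shown here), the same ten active concepts can live in a bitset whose operations touch only the set bits:

```python
# A sentence activates ~10 concepts out of a 100,000-entry vocabulary.
VOCAB_SIZE = 100_000
active = {17, 402, 9_311, 23_057, 40_190, 55_555, 61_020, 77_421, 88_008, 99_731}

# Dense view: 100,000 floats, almost all of them zero.
dense = [0.0] * VOCAB_SIZE
for i in active:
    dense[i] = 1.0

# Sparse view: a bitset. Python ints are arbitrary-precision, so one int can
# stand in for a 100,000-bit bitset; real implementations use arrays of 64-bit words.
bitset = 0
for i in active:
    bitset |= 1 << i

other = 0
for i in {402, 23_057, 12_345}:
    other |= 1 << i

# Intersection is a single bitwise AND over the set bits --
# no multiplication across 100,000 mostly-zero slots.
shared = bitset & other
print(bin(shared).count("1"))  # -> 2 concepts in common
```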

Visual Comparison

```mermaid
graph TD
    subgraph Transformer ["Transformer World (Dense)"]
        T1[Input Vector] -->|MatMul| T2[Dense Layer]
        T2 -->|Attention| T3[Context Matrix]
        style T1 fill:#f9f,stroke:#333
        style T2 fill:#f9f,stroke:#333
        style T3 fill:#f9f,stroke:#333
    end
    subgraph BSP ["BSP World (Sparse)"]
        B1[Input Bitset] -->|Lookup| B2[Inverted Index]
        B2 -->|Intersection| B3[Active Groups]
        style B1 fill:#9f9,stroke:#333
        style B2 fill:#9f9,stroke:#333
        style B3 fill:#9f9,stroke:#333
    end
```

Figure 1: Dense Matrix Multiplication vs. Sparse Set Intersection
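
A minimal sketch of the sparse path in Figure 1, assuming a toy inverted index and hypothetical group names (not the project's actual API): finding candidate Groups becomes a lookup plus an overlap count, which is just the size of each per-group intersection, rather than a matrix multiply:

```python
from collections import defaultdict

# Groups are sets of feature ids (toy data).
groups = {
    "g_weather": {10, 11, 12},
    "g_travel":  {11, 20, 21},
    "g_food":    {30, 31},
}

# Inverted index: feature id -> names of the Groups that contain it.
index = defaultdict(set)
for name, features in groups.items():
    for f in features:
        index[f].add(name)

def active_groups(input_features, min_overlap=2):
    """Return Groups sharing at least `min_overlap` features with the input."""
    hits = defaultdict(int)
    for f in input_features:
        for name in index.get(f, ()):   # touch only the Groups that mention f
            hits[name] += 1
    return {name for name, n in hits.items() if n >= min_overlap}

print(active_groups({10, 11, 20, 21}))  # -> {'g_weather', 'g_travel'} (set order may vary)
```

Only the Groups that actually share a bit with the input are ever visited; everything else costs nothing.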

The "Learner" Philosophy (MDL)

How does it "learn" without backpropagation? BSP relies on the Minimum Description Length (MDL) principle.

The brain is essentially a compression engine: the best model of an input stream is the one that describes it in the fewest bits.
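
One way to make that concrete is a two-part code: the bits needed to describe the Groups plus the bits needed to describe whatever they fail to predict. The toy scoring function below uses invented sizes and costs purely for illustration:

```python
import math

def description_length(num_groups, num_surprises, vocab_size=100_000):
    """Toy two-part MDL score: bits to encode the Groups
    plus bits to spell out every element the Groups failed to predict."""
    bits_per_element = math.log2(vocab_size)        # cost of naming one element
    model_bits = num_groups * 3 * bits_per_element  # assume ~3 elements per Group
    data_bits = num_surprises * bits_per_element    # each surprise is written out in full
    return model_bits + data_bits

# Without a "coffee + cup + morning" Group: 3 surprises on each of 100 occurrences.
print(description_length(num_groups=0, num_surprises=300))  # ~4,983 bits
# With that Group learned: pay for the Group once, surprises drop to ~0.
print(description_length(num_groups=1, num_surprises=0))    # ~50 bits
```

A Group is worth keeping exactly when the bits it costs are smaller than the surprise bits it saves.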

The Online Learning Loop

  1. Predict what comes next based on current Groups.
  2. Measure Surprise: the set difference Input \ Predicted (elements that arrived but were not predicted).
  3. Minimize Future Surprise:
    • If the pattern repeats, create a new Group combining these elements.
    • If a Group predicted wrongly, weaken its link.

This is Online Learning. There is no "Training Run". Every interaction updates the model instantly.
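
A rough Python sketch of that loop, with hypothetical thresholds and group naming just to show the shape of the update (not the actual BSP implementation):

```python
from collections import defaultdict
from itertools import combinations

groups = {}                     # name -> set of elements
strength = defaultdict(float)   # name -> link strength
cooccur = defaultdict(int)      # candidate pattern -> times seen as a surprise

def step(input_set, group_threshold=3):
    # 1. Predict: union of every Group that fires on this input.
    predicted = set()
    for members in groups.values():
        if members & input_set:
            predicted |= members

    # 2. Measure surprise: Input \ Predicted.
    surprise = input_set - predicted

    # 3a. If a surprising pair keeps repeating, promote it to a new Group.
    for pair in combinations(sorted(surprise), 2):
        cooccur[pair] += 1
        if cooccur[pair] >= group_threshold:
            groups[f"g{len(groups)}"] = set(pair)

    # 3b. If a Group predicted elements that never arrived, weaken its link.
    for name, members in groups.items():
        if members & input_set:
            wrong = members - input_set
            strength[name] += 1.0 if not wrong else -0.5

    return surprise

# Every call updates the model immediately -- there is no separate training run.
for _ in range(3):
    step({"coffee", "cup", "morning"})
print(groups)  # pairwise Groups appear after the pattern repeats three times
```

Here a surprising pair is promoted after three repeats; in an MDL-driven learner the real criterion would be whether the new Group shortens the overall description.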

Why This Matters

By shifting from Dense/Float/GPU to Sparse/Int/CPU, we unlock: