The Anti-Transformer Manifesto

Why Bitsets? Why CPU? Understanding the philosophy behind the architecture.

The Elephant in the Room

Modern AI is dominated by the Transformer architecture. Transformers are designed for GPUs and rely on massive dense matrix multiplications, with attention that scales as O(N²) in sequence length. This makes them heavy, static, and opaque.

The Core Bet: Sparsity by Design

BSP starts from a simple premise: Real-world concepts are sparse.

Out of 100,000 possible words, a sentence only uses ~10. Out of 1,000,000 visual features, a scene only contains ~50.

Transformers represent this as a 100,000-dimensional vector that is zero everywhere except in 10 slots. Then they multiply this mostly-zero vector by a dense matrix. This is mathematically correct but computationally wasteful.

BSP represents this as a Set (Bitset).
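
As a rough sketch of the difference (plain Python with made-up indices; the real BSP encoding is not shown here), the same ten active concepts can live in a bitset whose operations touch only the set bits:

```python
# A sentence activates ~10 concepts out of a 100,000-entry vocabulary.
VOCAB_SIZE = 100_000
active = {17, 402, 9_311, 23_057, 40_190, 55_555, 61_020, 77_421, 88_008, 99_731}

# Dense view: 100,000 floats, almost all of them zero.
dense = [0.0] * VOCAB_SIZE
for i in active:
    dense[i] = 1.0

# Sparse view: a bitset. Python ints are arbitrary-precision, so one int can
# stand in for a 100,000-bit bitset; real implementations use arrays of 64-bit words.
bitset = 0
for i in active:
    bitset |= 1 << i

other = 0
for i in {402, 23_057, 12_345}:
    other |= 1 << i

# Intersection is a single bitwise AND over the set bits --
# no multiplication across 100,000 mostly-zero slots.
shared = bitset & other
print(bin(shared).count("1"))  # -> 2 concepts in common
```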

Visual Comparison

```mermaid
graph TD
    subgraph Transformer ["Transformer World (Dense)"]
        T1[Input Vector] -->|MatMul| T2[Dense Layer]
        T2 -->|Attention| T3[Context Matrix]
        style T1 fill:#f9f,stroke:#333
        style T2 fill:#f9f,stroke:#333
        style T3 fill:#f9f,stroke:#333
    end
    subgraph BSP ["BSP World (Sparse)"]
        B1[Input Bitset] -->|Lookup| B2[Inverted Index]
        B2 -->|Intersection| B3[Active Groups]
        style B1 fill:#9f9,stroke:#333
        style B2 fill:#9f9,stroke:#333
        style B3 fill:#9f9,stroke:#333
    end
```

Figure 1: Dense Matrix Multiplication vs. Sparse Set Intersection
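
A minimal sketch of the sparse path in Figure 1, assuming a toy inverted index and hypothetical group names (not the project's actual API): finding candidate Groups becomes a lookup plus an overlap count, which is just the size of each per-group intersection, rather than a matrix multiply:

```python
from collections import defaultdict

# Groups are sets of feature ids (toy data).
groups = {
    "g_weather": {10, 11, 12},
    "g_travel":  {11, 20, 21},
    "g_food":    {30, 31},
}

# Inverted index: feature id -> names of the Groups that contain it.
index = defaultdict(set)
for name, features in groups.items():
    for f in features:
        index[f].add(name)

def active_groups(input_features, min_overlap=2):
    """Return Groups sharing at least `min_overlap` features with the input."""
    hits = defaultdict(int)
    for f in input_features:
        for name in index.get(f, ()):   # touch only the Groups that mention f
            hits[name] += 1
    return {name for name, n in hits.items() if n >= min_overlap}

print(active_groups({10, 11, 20, 21}))  # -> {'g_weather', 'g_travel'} (set order may vary)
```

Only the Groups that actually share a bit with the input are ever visited; everything else costs nothing.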

The "Learner" Philosophy (MDL)

How does it "learn" without backpropagation? BSP relies on the Minimum Description Length (MDL) principle.

The brain is essentially a compression engine: the best model of an input stream is the one that describes it in the fewest bits.
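
One way to make that concrete is a two-part code: the bits needed to describe the Groups plus the bits needed to describe whatever they fail to predict. The toy scoring function below uses invented sizes and costs purely for illustration:

```python
import math

def description_length(num_groups, num_surprises, vocab_size=100_000):
    """Toy two-part MDL score: bits to encode the Groups
    plus bits to spell out every element the Groups failed to predict."""
    bits_per_element = math.log2(vocab_size)        # cost of naming one element
    model_bits = num_groups * 3 * bits_per_element  # assume ~3 elements per Group
    data_bits = num_surprises * bits_per_element    # each surprise is written out in full
    return model_bits + data_bits

# Without a "coffee + cup + morning" Group: 3 surprises on each of 100 occurrences.
print(description_length(num_groups=0, num_surprises=300))  # ~4,983 bits
# With that Group learned: pay for the Group once, surprises drop to ~0.
print(description_length(num_groups=1, num_surprises=0))    # ~50 bits
```

A Group is worth keeping exactly when the bits it costs are smaller than the surprise bits it saves.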

The Online Learning Loop

  1. Predict what comes next based on current Groups.
  2. Measure Surprise: the set difference Input \ Predicted (elements that arrived but were not predicted).
  3. Minimize Future Surprise:
    • If the pattern repeats, create a new Group combining these elements.
    • If a Group predicted wrongly, weaken its link.

This is Online Learning. There is no "Training Run". Every interaction updates the model instantly.
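
A rough Python sketch of that loop, with hypothetical thresholds and group naming just to show the shape of the update (not the actual BSP implementation):

```python
from collections import defaultdict
from itertools import combinations

groups = {}                     # name -> set of elements
strength = defaultdict(float)   # name -> link strength
cooccur = defaultdict(int)      # candidate pattern -> times seen as a surprise

def step(input_set, group_threshold=3):
    # 1. Predict: union of every Group that fires on this input.
    predicted = set()
    for members in groups.values():
        if members & input_set:
            predicted |= members

    # 2. Measure surprise: Input \ Predicted.
    surprise = input_set - predicted

    # 3a. If a surprising pair keeps repeating, promote it to a new Group.
    for pair in combinations(sorted(surprise), 2):
        cooccur[pair] += 1
        if cooccur[pair] >= group_threshold:
            groups[f"g{len(groups)}"] = set(pair)

    # 3b. If a Group predicted elements that never arrived, weaken its link.
    for name, members in groups.items():
        if members & input_set:
            wrong = members - input_set
            strength[name] += 1.0 if not wrong else -0.5

    return surprise

# Every call updates the model immediately -- there is no separate training run.
for _ in range(3):
    step({"coffee", "cup", "morning"})
print(groups)  # pairwise Groups appear after the pattern repeats three times
```

Here a surprising pair is promoted after three repeats; in an MDL-driven learner the real criterion would be whether the new Group shortens the overall description.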

Why This Matters

By shifting from Dense/Float/GPU to Sparse/Int/CPU, we unlock: