Unified input representation
This page is a theory note. It expands the topic in short chapters and defines terminology without duplicating the formal specification documents.
The diagram has a transparent background and is intended to be read together with the caption and the sections below.
Related wiki pages: VM, event stream, VSA, bounded closure, consistency contract.
Related specs: DS001, DS010, DS011.
Overview
Multimodality becomes tractable when all inputs are mapped into a single canonical representation. VSAVM uses an event stream where each event is discrete and typed and carries an explicit structural context. This creates a shared substrate so that execution, closure, and auditing do not fragment across modality-specific pipelines.
Event stream structure
Each event within the stream carries three essential components:
- Type identifier: Specifies the nature of information (for example text_token, separator, timestamp, per DS007).
- Discrete payload: Contains actual data in standardized format that the VM can process directly.
- Structural context path: A hierarchical path (contextPath) used to derive scope. The exact labels are produced by the ingest pipeline and must correspond to structural separators (DS010 / NFS11).
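The three components can be sketched as a small record type. This is a minimal illustration, not the DS007 schema: the field names, the payload type, and the helpers scope_of and same_scope are assumptions introduced here, as is the idea that a scope is obtained by truncating the context path to a given depth.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical payload type: the note only says payloads are discrete
# and standardized, so str/int is an assumption for illustration.
Payload = Union[str, int]


@dataclass(frozen=True)
class Event:
    type: str                      # e.g. "text_token", "separator", "timestamp"
    payload: Payload               # discrete data the VM can process directly
    context_path: Tuple[str, ...]  # hierarchical path, outermost label first


def scope_of(event: Event, depth: int) -> Tuple[str, ...]:
    """Derive a scope by truncating the context path to `depth` levels (assumed rule)."""
    return event.context_path[:depth]


def same_scope(a: Event, b: Event, depth: int) -> bool:
    """Two events share a scope when their truncated paths are equal."""
    return scope_of(a, depth) == scope_of(b, depth)


# Illustrative labels only; real labels come from the ingest pipeline.
events = [
    Event("text_token", "hello", ("doc", "section_1", "paragraph_1")),
    Event("text_token", "world", ("doc", "section_1", "paragraph_2")),
    Event("text_token", "other", ("doc", "section_2", "paragraph_1")),
]
```

With these sample events, the first two share a scope at section depth (2) but not at paragraph depth (3), and the third shares only the document-level scope with the others.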
Modality-specific processing
- Text: token events (text_token) plus structural boundaries (separator, header, list_item, quote, code_block).
- Audio: token events (audio_token) plus timestamps (timestamp) and speaker/segment separators.
- Visual: token events (visual_token) plus explicit separators for scenes/shots/regions when available.
- Video: a mixture of visual tokens and timestamps, segmented by structural cuts (scene/shot/speaker changes).
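For the text case, the mapping from raw input to events might look like the sketch below. It is an assumption-laden illustration: ingest_text, the blank-line paragraph rule, the whitespace tokenizer, the "paragraph_break" payload, and the path labels are all hypothetical stand-ins for whatever the real ingest pipeline does.

```python
from typing import Iterator, NamedTuple, Tuple


class Event(NamedTuple):
    type: str
    payload: str
    context_path: Tuple[str, ...]


def ingest_text(text: str, doc_id: str = "doc") -> Iterator[Event]:
    """Emit text_token events per whitespace token and a separator between paragraphs."""
    # Assumed segmentation rule: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs):
        if i > 0:
            # The boundary is emitted at document level, between the two paragraphs.
            yield Event("separator", "paragraph_break", (doc_id,))
        path = (doc_id, f"paragraph_{i}")
        for token in paragraph.split():
            yield Event("text_token", token, path)


stream = list(ingest_text("Alpha beta.\n\nGamma."))
```

The same input always yields the same event sequence, which is the property later stages rely on. Audio, visual, and video ingest would follow the same shape with their own token types and separators.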
Two granularities
The system operates on two granularities:
- Lexical layer: Stable, reversible tokens.
- Macro-unit layer (DS011): Reversible macro-units discovered by compression (MDL). Macro-units expand deterministically back into the lexical layer for scoring and audit.
Reversibility is essential: every macro-unit must expand deterministically into elementary units.
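The reversibility requirement can be illustrated with a toy macro table. This sketch does not model how macro-units are discovered (the MDL step in DS011); it only shows deterministic expansion and a round-trip check. The names expand and compress, the "M:" identifier prefix, and the table layout are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical table: macro-unit id -> the units it stands for.
# Assumption: macro ids use an "M:" prefix that never occurs as a lexical token,
# so a unit is unambiguously either a token or a macro-unit.
MacroTable = Dict[str, Tuple[str, ...]]


def expand(units: Sequence[str], table: MacroTable) -> List[str]:
    """Expand macro-units (including nested ones) back into lexical tokens."""
    out: List[str] = []
    for unit in units:
        if unit in table:
            out.extend(expand(table[unit], table))
        else:
            out.append(unit)
    return out


def compress(tokens: Sequence[str], macro_id: str, pattern: Tuple[str, ...]) -> List[str]:
    """Replace left-to-right, non-overlapping occurrences of `pattern` with `macro_id`."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    out: List[str] = []
    i, n = 0, len(pattern)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == pattern:
            out.append(macro_id)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


table: MacroTable = {"M:0": ("event", "stream"), "M:1": ("the", "M:0")}
tokens = ["the", "event", "stream", "feeds", "the", "event", "stream"]
compressed = compress(tokens, "M:0", table["M:0"])
restored = expand(compressed, table)
```

Here restored equals tokens, so scoring and audit can always operate on the lexical layer regardless of which macro-units were introduced.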
VSA attachment
VSA attaches in parallel to each unit. Tokens and macro-units have deterministic hypervectors derived from stable hashes. Spans combine these through bundling with role and position signals. The resulting hypervectors serve as an associative index for fast retrieval and paraphrase clustering, not as a direct representation of truth.
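A minimal sketch of this attachment, using only the Python standard library, is shown below. The specific choices are assumptions rather than the VSAVM design: bipolar (+1/-1) vectors, a dimension of 1024, SHAKE-256 as the stable hash, cyclic permutation as the position signal, and majority-vote bundling. Role signals are omitted for brevity.

```python
import hashlib
from typing import List, Sequence

DIM = 1024  # assumed dimension; must be a multiple of 8 in this sketch


def hypervector(unit: str, dim: int = DIM) -> List[int]:
    """Deterministic bipolar hypervector derived from a stable hash of the unit."""
    raw = hashlib.shake_256(unit.encode("utf-8")).digest(dim // 8)
    return [1 if (raw[i // 8] >> (i % 8)) & 1 else -1 for i in range(dim)]


def permute(vec: Sequence[int], shift: int) -> List[int]:
    """Cyclic rotation, used here as the position signal."""
    shift %= len(vec)
    return list(vec[-shift:]) + list(vec[:-shift]) if shift else list(vec)


def bundle(vectors: Sequence[Sequence[int]]) -> List[int]:
    """Elementwise majority vote; ties resolve to +1."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


def span_vector(units: Sequence[str], dim: int = DIM) -> List[int]:
    """Bundle the position-permuted hypervectors of the units in a span."""
    return bundle([permute(hypervector(u, dim), pos) for pos, u in enumerate(units)])


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


s1 = span_vector(["event", "stream", "structure"])
s2 = span_vector(["event", "stream", "processing"])
s3 = span_vector(["audio", "token", "timestamp"])
```

Spans that share units at the same positions (s1 and s2) score noticeably higher than unrelated spans (s1 and s3), which is what makes the vectors usable as a retrieval index. A high score indicates overlap only; it says nothing about whether the content is true.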
Implementation considerations
Representation fails when boundaries are ambiguous or when compression cannot expand deterministically. VSAVM therefore prioritizes deterministic segmentation and deterministic expansion. This makes later stages predictable and keeps the correctness contract enforceable.