Question compilation pipeline
This page is a theory note. It expands the topic in short chapters and defines terminology without duplicating the formal specification documents.
The diagram has a transparent background and is intended to be read together with the caption and the sections below.
Related wiki pages: VM, event stream, VSA, bounded closure, consistency contract, query compiler, schema.
Related specs: DS003.
Overview
A question is treated as a request to produce an executable query program. The pipeline is explicit to support audit and control: normalization creates a structured span, retrieval proposes candidate schemas, slot filling binds discrete values, and compilation emits a program in the VM instruction set. This transformation operates through a learned pipeline rather than hand-coded rules.
Natural language to query compilation
The compilation process follows explicit stages:
- Query normalization: Convert input text to the standard event stream representation. Identify interrogative markers, entity references, relationship indicators, and logical connectives.
- Entity identification: Disambiguate entity mentions (specific individuals, general categories, abstract concepts) using local query context and global knowledge base context. Maintain coreference tracking for pronouns and definite descriptions.
- Schema retrieval: Use VSA similarity measures to identify candidate schemas from the library. Hypervector comparison produces a ranked list of potential matches and tolerates variation in surface wording.
- Schema matching: Evaluate structural compatibility beyond surface similarity. Maintain multiple candidate schemas rather than committing to a single interpretation early.
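The retrieval stage above can be illustrated with a minimal sketch. It assumes bipolar hypervectors, majority-vote bundling, and cosine similarity; the dimensionality, the cue words, and the schema names are hypothetical and are not taken from DS003.

```python
import random

DIM = 2048  # hypervector dimensionality (an assumed value)

def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bundle(vectors):
    """Superpose hypervectors by elementwise majority vote (ties go to +1)."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]

def similarity(a, b):
    """Cosine similarity; every bipolar vector here has norm sqrt(DIM)."""
    return sum(x * y for x, y in zip(a, b)) / DIM

def retrieve(query_hv, library, k=2):
    """Return the k schemas most similar to the query, best first."""
    scored = [(name, similarity(query_hv, hv)) for name, hv in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

rng = random.Random(0)
# Hypothetical schemas, each described by a few cue words.
cues = {
    "agent_of_event": ["who", "founded", "created"],
    "location_of": ["where", "located", "place"],
    "time_of": ["when", "date", "year"],
}
# One random hypervector per cue word (a toy stand-in for a learned codebook).
words = {w: random_hv(rng) for ws in cues.values() for w in ws}
library = {name: bundle([words[w] for w in ws]) for name, ws in cues.items()}

# A query mentioning "who" and "created" shares two cue words with agent_of_event.
query = bundle([words["who"], words["created"]])
ranked = retrieve(query, library)
print(ranked)
```

Because the function returns a ranked list rather than a single winner, several candidate schemas can be kept for the structural matching step.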
Slot filling and program instantiation
Slot filling binds entities, roles, and references using discrete matching and coreference heuristics, augmented by associative retrieval:
- Direct matching: Bind query elements that correspond exactly to schema slots.
- Type-based inference: Use slot type constraints to identify appropriate elements even with different surface forms.
- Semantic association: VSA similarity measures identify related elements when direct matching fails.
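The three strategies above can be read as a cascade. The following is a minimal sketch under that assumption; the `Slot` and `Element` types, the threshold, and the lookup-table similarity (standing in for a VSA similarity measure) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Slot:
    name: str
    type: str

@dataclass(frozen=True)
class Element:
    text: str
    type: str
    role: Optional[str] = None  # explicit role label, if normalization found one

def fill_slots(slots, elements, similarity, threshold=0.5):
    """Bind elements to slots: direct match, then type, then association."""
    bindings = {}
    unused = list(elements)

    def bind(slot, element, how):
        bindings[slot.name] = (element.text, how)
        unused.remove(element)

    # Pass 1: an element whose role label equals the slot name.
    for slot in slots:
        match = next((e for e in unused if e.role == slot.name), None)
        if match is not None:
            bind(slot, match, "direct")
    # Pass 2: exactly one remaining element has the slot's type.
    for slot in slots:
        if slot.name not in bindings:
            typed = [e for e in unused if e.type == slot.type]
            if len(typed) == 1:
                bind(slot, typed[0], "type")
    # Pass 3: fall back to the most similar remaining element.
    for slot in slots:
        if slot.name not in bindings and unused:
            best = max(unused, key=lambda e: similarity(slot.name, e.text))
            if similarity(slot.name, best.text) >= threshold:
                bind(slot, best, "association")
    return bindings

# Toy similarity table; a real system would compare hypervectors instead.
TABLE = {("time", "last spring"): 0.8}
def toy_similarity(slot_name, text):
    return TABLE.get((slot_name, text), 0.0)

slots = [Slot("agent", "person"), Slot("object", "organization"), Slot("time", "date")]
elements = [
    Element("Ada", "person", role="agent"),
    Element("Acme", "organization"),
    Element("last spring", "phrase"),
]
bindings = fill_slots(slots, elements, toy_similarity)
print(bindings)
```

Recording which strategy produced each binding keeps the slot-filling step auditable; slots left unbound simply do not appear in the result.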
Complex queries may require combining or nesting multiple schemas. The composition system maintains explicit data flow graphs that track how information moves through the composite reasoning.
Program instantiation translates filled schemas into executable VM instruction sequences, including optimization steps: common subexpression identification, redundant operation elimination, and operation reordering for cache locality.
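As an illustration of one of these optimization steps, here is a minimal sketch of common subexpression elimination. The instruction format `(dest, op, args)` and the opcode names are hypothetical and do not describe the actual VM instruction set; the sketch also assumes that each register is written once and that operations have no side effects.

```python
def eliminate_common_subexpressions(program):
    """Drop instructions that recompute a value already held in a register."""
    seen = {}   # (op, args) -> register that already holds the result
    alias = {}  # dropped register -> surviving register
    optimized = []
    for dest, op, args in program:
        # Rewrite arguments that refer to registers dropped earlier.
        args = tuple(alias.get(a, a) for a in args)
        key = (op, args)
        if key in seen:
            alias[dest] = seen[key]
        else:
            seen[key] = dest
            optimized.append((dest, op, args))
    return optimized, alias

program = [
    ("r1", "LOOKUP", ("founder_of", "Acme")),
    ("r2", "LOOKUP", ("founder_of", "Acme")),  # duplicate of r1
    ("r3", "LOOKUP", ("born_in", "r1")),
    ("r4", "LOOKUP", ("born_in", "r2")),       # duplicate of r3 once r2 -> r1
    ("r5", "INTERSECT", ("r3", "r4")),
]
optimized, alias = eliminate_common_subexpressions(program)
for instruction in optimized:
    print(instruction)
```

The sketch treats every argument string as a possible register name, so it relies on constants never sharing a name with a register.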
Program search and selection
The search process explores the space of possible reasoning strategies:
- Candidate generation: Modify existing programs by changing parameters, reordering operations, or substituting alternative sub-programs. Learned heuristics guide exploration toward promising directions.
- Population management: Maintain candidate diversity through mutation and recombination. Use fitness-based selection while preserving potentially valuable less-fit candidates.
- MDL-based scoring: Minimum Description Length balances performance and simplicity. Score components include complexity (program length/intricacy), accuracy (correct results on test cases), and generality (performance on unseen examples). Computational efficiency is also weighted.
- Consistency checking: Each candidate is evaluated via bounded closure analysis so that candidates leading to logical contradictions can be rejected.
- Beam pruning: Retain only the most promising candidates at each stage while maintaining diversity to avoid premature convergence.
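Scoring, consistency filtering, and pruning can be sketched together. This is a simplified illustration, not the scoring defined in the specification: it uses a two-part cost (program length plus a penalty per wrong test case), ignores generality and efficiency, and approximates diversity with a per-family quota. The `Candidate` fields and the bit costs are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    instructions: int        # program length
    errors: int              # wrong results on test cases
    family: str              # coarse strategy label used for diversity
    consistent: bool = True  # outcome of the closure check

def mdl_cost(c, bits_per_instruction=8.0, bits_per_error=20.0):
    """Two-part cost: bits for the program plus bits for its mistakes."""
    return c.instructions * bits_per_instruction + c.errors * bits_per_error

def prune(candidates, beam_width, per_family=1):
    """Keep the cheapest consistent candidates, limited per family."""
    kept, counts = [], {}
    for c in sorted((c for c in candidates if c.consistent), key=mdl_cost):
        if counts.get(c.family, 0) >= per_family:
            continue  # quota reached: leave room for other strategies
        kept.append(c)
        counts[c.family] = counts.get(c.family, 0) + 1
        if len(kept) == beam_width:
            break
    return kept

candidates = [
    Candidate("A", instructions=5, errors=0, family="join"),
    Candidate("B", instructions=6, errors=0, family="join"),
    Candidate("C", instructions=3, errors=2, family="lookup"),
    Candidate("D", instructions=2, errors=0, family="lookup", consistent=False),
    Candidate("E", instructions=12, errors=0, family="scan"),
]
beam = prune(candidates, beam_width=2)
print([c.name for c in beam])
```

In this example B is cheaper than C but is skipped because A already represents its family, which is one simple way to trade raw score for diversity.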
Schema learning and consolidation
The schema learning process discovers recurring patterns in query-program relationships:
- Pattern recognition: Statistical analysis of compilation logs identifies correlations between linguistic patterns and reasoning strategies. Significance testing filters out correlations that are likely due to chance.
- Compression-driven emergence: Schemas providing significant MDL compression are promoted. The analysis evaluates both individual schema benefits and interaction effects with other schemas.
- Schema abstraction: Hierarchical clustering of similar query-program pairs creates general patterns. Common structure is preserved while varying aspects are parameterized.
- Consolidation triggers: Conservative criteria require substantial evidence before creating or modifying schemas. Validation on held-out examples ensures generalization beyond training data.
- Schema generalization: Existing schemas can be extended for new query types through careful analysis of differences from existing patterns.
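The compression criterion and the conservative triggers can be sketched as a simple test. The sketch assumes that description length is approximated by instruction counts and that every use of a schema costs a fixed number of units for its parameters; the function names, thresholds, and numbers are hypothetical.

```python
def compression_gain(program_lengths, schema_length, params_length):
    """Description length saved by encoding the programs through one schema."""
    without_schema = sum(program_lengths)
    with_schema = schema_length + params_length * len(program_lengths)
    return without_schema - with_schema

def should_promote(train, heldout, schema_length, params_length,
                   min_uses=5, min_gain=10):
    """Promote only with enough evidence, enough gain, and held-out support."""
    if len(train) < min_uses:
        return False
    if compression_gain(train, schema_length, params_length) < min_gain:
        return False
    # The schema itself is already paid for; held-out programs must still shrink.
    return compression_gain(heldout, 0, params_length) > 0

train = [12, 11, 13, 12, 12, 10]   # lengths of similar programs in the logs
heldout = [12, 11]                 # lengths of programs kept aside for validation
gain = compression_gain(train, schema_length=15, params_length=3)
promote = should_promote(train, heldout, schema_length=15, params_length=3)
print(gain, promote)
```

This covers only the individual benefit of one schema; interaction effects between schemas would need a joint evaluation that the sketch does not attempt.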
Multimodal query processing
Queries spanning multiple input modalities require coordination across modalities at several points:
- Cross-modal reference resolution: Determine when entities in different modalities refer to the same real-world objects. Combine explicit linking (demonstratives, temporal synchronization) with implicit similarity-based association. Maintain uncertainty estimates for correspondence hypotheses.
- Temporal and spatial slot filling: Resolve absolute and relative temporal references against audio/video timestamps. Align spatial references with coordinate systems and object locations in visual inputs.
- Unified execution: The VM operates seamlessly across different symbolic representations through the canonical fact format. Cross-modal consistency checking accounts for modality-specific error patterns and uncertainty characteristics.
- Modality-specific adaptations: Learned associations between reasoning strategies and modality characteristics enable optimized strategy selection.
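As one illustration of temporal synchronization in cross-modal reference resolution, the sketch below scores how well the time span of a spoken mention overlaps the time span of each visual track. Using temporal intersection-over-union as the correspondence score is an assumption made for this example, as are the track names and the threshold.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0

def link_hypotheses(mention, tracks, threshold=0.05):
    """Rank visual tracks as referents for a spoken mention, best first."""
    scored = [(name, temporal_iou(mention, span)) for name, span in tracks.items()]
    plausible = [pair for pair in scored if pair[1] >= threshold]
    return sorted(plausible, key=lambda pair: pair[1], reverse=True)

mention = (12.0, 14.0)  # when "that person" is spoken in the audio
tracks = {
    "person_1": (10.0, 16.0),
    "car_3": (13.5, 20.0),
    "dog_2": (30.0, 35.0),
}
hypotheses = link_hypotheses(mention, tracks)
print(hypotheses)
```

Returning every plausible track together with its score, instead of a single link, keeps the uncertainty of each correspondence hypothesis visible to later stages.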
Managing ambiguity
Instead of forcing a single interpretation, VSAVM carries multiple candidate programs in a beam. Candidates are evaluated by explanatory fit and by early closure checks that detect contradictions. This makes uncertainty explicit and supports conditional outputs when necessary.
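An early closure check of this kind can be sketched as bounded forward chaining. The fact and rule encoding below is hypothetical: facts are strings, a negated fact is the tuple `("not", fact)`, and a rule is a pair of a body tuple and a head.

```python
def bounded_closure(facts, rules, max_steps):
    """Apply rules forward for at most max_steps rounds."""
    known = set(facts)
    for _ in range(max_steps):
        derived = {head for body, head in rules if set(body) <= known} - known
        if not derived:
            break  # fixed point reached before the bound
        known |= derived
    return known

def contradicts(known):
    """True if some fact and its negation are both present."""
    return any(("not", fact) in known for fact in known)

rules = [
    (("penguin(tweety)",), "bird(tweety)"),
    (("bird(tweety)",), "flies(tweety)"),
    (("penguin(tweety)",), ("not", "flies(tweety)")),
]
facts = {"penguin(tweety)"}

shallow = contradicts(bounded_closure(facts, rules, max_steps=1))
deeper = contradicts(bounded_closure(facts, rules, max_steps=2))
print(shallow, deeper)
```

The example shows the trade-off in the bound: with one step the contradiction is not yet derivable, with two steps it is found. A candidate that passes a shallow check is therefore consistent only up to that depth.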
Engineering implications
Because compilation is explicit, it is testable. You can measure how often a schema is retrieved, how often slot filling is ambiguous, and how often a candidate fails under closure. These metrics can guide consolidation and improve robustness over time.
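A minimal sketch of how such measurements might be collected is shown below. The class, its method names, and the recorded fields are hypothetical; they only illustrate that each pipeline stage exposes a countable outcome.

```python
from collections import Counter

class CompilationMetrics:
    """Accumulates per-query outcomes of the compilation pipeline."""

    def __init__(self):
        self.queries = 0
        self.schema_hits = Counter()
        self.ambiguous_fills = 0
        self.candidates = 0
        self.closure_failures = 0

    def record(self, retrieved, ambiguous, candidates, closure_failures):
        """Log one compiled query."""
        self.queries += 1
        self.schema_hits.update(retrieved)
        self.ambiguous_fills += int(ambiguous)
        self.candidates += candidates
        self.closure_failures += closure_failures

    def summary(self):
        """Return rates; denominators are clamped to avoid division by zero."""
        queries = max(self.queries, 1)
        candidates = max(self.candidates, 1)
        return {
            "retrieval_rate": {s: n / queries for s, n in self.schema_hits.items()},
            "ambiguous_fill_rate": self.ambiguous_fills / queries,
            "closure_failure_rate": self.closure_failures / candidates,
        }

metrics = CompilationMetrics()
metrics.record(["agent_of_event", "time_of"], ambiguous=False, candidates=4, closure_failures=1)
metrics.record(["agent_of_event"], ambiguous=True, candidates=6, closure_failures=2)
report = metrics.summary()
print(report)
```

A schema with a high retrieval rate but a high closure failure rate among its candidates would be a natural target for revision during consolidation.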
References
- Program synthesis (Wikipedia)
- Beam search (Wikipedia)
- Information retrieval (Wikipedia)
- Minimum description length (Wikipedia)