VSAVM Evaluation Framework

Rapid assessment of VM correctness plus budgeted continuation comparisons

Evaluation System Overview

The VSAVM evaluation framework provides two complementary tracks: (1) deterministic, synthetic capability checks for the VM core (FastEval) and (2) a reproducible continuation benchmark comparing VSAVM macro-units against a small TensorFlow baseline (eval_tinyLLM).

Key Objectives

Demonstrate that VSAVM can learn rules, compress data, perform reasoning, learn through reinforcement learning, and respond to queries, all with measurable performance indicators and established thresholds.

🧠 Rule Learning

System learns rules from different families of deterministically generated synthetic data with >90% extraction accuracy

🗜️ Data Compression

System compresses learned patterns effectively, with >50% size reduction and <5% information loss

🔍 Reasoning

System performs logical inference and maintains >95% consistency under bounded closure

🎯 RL Prediction

System learns through reinforcement learning, converging in <1000 episodes on simple patterns

❓ Query Response

System answers basic factual queries with <100 ms response time

🧪 Continuation Comparison

Budgeted byte-level continuation metrics (VSAVM macro-units vs. TensorFlow baseline)

FastEval - Rapid Evaluation Suite

FastEval provides lightweight, deterministic tests to quickly assess VSAVM's core capabilities without extensive training or computational resources.
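
To make the shape of such a test concrete, here is a minimal, self-contained sketch in the spirit of FastEval: a seeded generator produces arithmetic sequences deterministically, a stand-in extractor proposes a rule for each sequence, and the extraction rate is scored against the >90% threshold. The generator and extractor below are illustrative stand-ins only, not the FastEval harness or the VSAVM API.

// Illustrative only: a deterministic rule-learning check in the spirit of FastEval.
// The extractRule stand-in is NOT the VSAVM API; it exists so the sketch runs.

// Tiny seeded PRNG (mulberry32) so every run produces identical test data.
function mulberry32(seed) {
  return function () {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Deterministically generate arithmetic sequences: value[i] = start + i * step.
function makeSequenceCases(count, seed) {
  const rand = mulberry32(seed);
  return Array.from({ length: count }, () => {
    const start = Math.floor(rand() * 20);
    const step = 1 + Math.floor(rand() * 9);
    const values = Array.from({ length: 8 }, (_, i) => start + i * step);
    return { values, rule: { kind: 'arithmetic', step } };
  });
}

// Stand-in extractor: infer the step from first differences.
function extractRule(values) {
  const step = values[1] - values[0];
  const ok = values.every((v, i) => i === 0 || v - values[i - 1] === step);
  return ok ? { kind: 'arithmetic', step } : { kind: 'unknown' };
}

const cases = makeSequenceCases(100, 42);
const correct = cases.filter((c) => {
  const r = extractRule(c.values);
  return r.kind === c.rule.kind && r.step === c.rule.step;
}).length;
const rate = correct / cases.length;
console.log(`rule extraction rate: ${(rate * 100).toFixed(1)}% (threshold: >90%)`);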

Design Principles

Test Categories

Category | Coverage | Status
Rule Learning Tests | Arithmetic sequences, logical implications, temporal patterns | In Progress
Compression Tests | Pattern consolidation, schema reuse, MDL optimization | In Progress
Reasoning Tests | Deductive reasoning, consistency checking, bounded closure | In Progress
RL Prediction Tests | Shape learning, transfer learning, policy optimization | Planned
Query Response Tests | Factual queries, inferential queries, natural language compilation | Planned

Running FastEval

# Run full evaluation suite (from repo root)
node evals/run.mjs

# Run individual test categories
node evals/run.mjs --category rule-learning
node evals/run.mjs --category compression
node evals/run.mjs --category reasoning
node evals/run.mjs --category query-response

# Emit JSON-only output
node evals/run.mjs --json

eval_tinyLLM - Budgeted continuation comparison

eval_tinyLLM is a reproducible training + comparison sandbox. It measures byte-level continuation quality under strict time/token budgets and compares VSAVM's DS011 MacroUnitModel against a small TensorFlow baseline on the same prepared dataset.
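
The specific metrics reported by compare.mjs are defined by that tool. Purely as an illustration of budgeted byte-level scoring, the sketch below truncates a generated continuation to a fixed byte budget and measures per-byte agreement with the reference continuation; the function name and the agreement metric are assumptions, not the actual comparison code.

// Illustrative byte-level continuation scoring (not the metric set used by compare.mjs).
// Both models would be given the same prompt and the same byte budget.

import { Buffer } from 'node:buffer';

// Fraction of byte positions (up to the budget) where the generated continuation
// matches the reference exactly.
function byteAgreement(generated, reference, budgetBytes) {
  const gen = Buffer.from(generated, 'utf8').subarray(0, budgetBytes);
  const ref = Buffer.from(reference, 'utf8').subarray(0, budgetBytes);
  const len = Math.min(gen.length, ref.length);
  if (len === 0) return 0;
  let matches = 0;
  for (let i = 0; i < len; i++) if (gen[i] === ref[i]) matches++;
  return matches / len;
}

// Example: two hypothetical continuations compared under a 32-byte budget.
const reference = 'the quick brown fox jumps over the lazy dog';
console.log(byteAgreement('the quick brown fox jumps over it', reference, 32)); // high
console.log(byteAgreement('a completely different sentence !', reference, 32)); // low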

What it is (and what it is not)

Running eval_tinyLLM

# Prepare dataset split (writes eval_tinyLLM/cache/datasets/<datasetId>/...)
node eval_tinyLLM/tools/fetch-and-prepare.mjs --max-bytes 50000000

# Train VSAVM macro-unit model (DS011)
node eval_tinyLLM/tools/train-vsavm.mjs --max-bytes 50000000 --skip-ingest --tag large

# Train TensorFlow baseline
node eval_tinyLLM/tools/train-tf.mjs --max-bytes 50000000 --epochs 2 --steps 2000 --tag large

# Compare under identical budgets and write a timestamped report
node eval_tinyLLM/tools/compare.mjs --max-bytes 50000000 --reference

Outputs

eval_tinyLLM writes timestamped HTML and JSON comparison reports under eval_tinyLLM/results/.

Evaluation Metrics

Comprehensive metrics system tracking learning, compression, reasoning, and technical performance indicators.

Learning Metrics

Metric | Description | Threshold | Literature Basis
Rule Extraction Rate | Percentage of underlying rules correctly identified | >90% | Symbolic AI benchmarks
Convergence Speed | Training episodes required to reach threshold performance | <1000 episodes | RL convergence studies
Generalization | Performance on unseen but similar patterns | >85% | Transfer learning research

Compression Metrics

Metric | Description | Threshold | Literature Basis
Compression Ratio | Size reduction relative to the original representation (1 - compressed size / original size) | >50% | Data compression standards
MDL Score | Compliance with the Minimum Description Length principle | Minimize | MDL theory (Rissanen)
Schema Efficiency | Reusability of learned schemas across contexts | >70% | Schema learning studies
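
As a concrete reading of the first two rows, the sketch below computes size reduction from raw byte counts and a two-part MDL-style total (description length of the model plus the data encoded given the model). Using byte counts as the description-length proxy is an assumption for illustration, not VSAVM's internal accounting.

// Illustrative metric arithmetic for the compression table (byte lengths as a proxy
// for description length; not VSAVM's internal accounting).

// Size reduction: 1 - compressed/original. The >50% threshold means the compressed
// representation must be at most half the size of the original.
function sizeReduction(originalBytes, compressedBytes) {
  return 1 - compressedBytes / originalBytes;
}

// Two-part MDL score: L(model) + L(data | model). Lower is better; the "Minimize"
// threshold means runs are compared by this total rather than a fixed cutoff.
function mdlScore(modelBytes, residualBytes) {
  return modelBytes + residualBytes;
}

console.log(sizeReduction(10_000, 4_200)); // 0.58 -> passes the >50% threshold
console.log(mdlScore(1_200, 3_000));       // 4200 bytes of total description length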

Reasoning Metrics

Metric | Description | Threshold | Literature Basis
Inference Accuracy | Correctness of logical deductions | >95% | Logic programming benchmarks
Consistency Score | Absence of contradictions in derived facts | >95% | Knowledge base consistency
Bounded Closure | Completeness within computational budget | >90% | Anytime algorithms
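
These three metrics are easiest to see on a toy forward-chaining loop: derive facts from Horn-style rules until a step budget is exhausted (bounded closure), then check the derived set for direct contradictions (consistency). This is a conceptual sketch only, not VSAVM's inference engine.

// Conceptual sketch of bounded closure + consistency checking (not VSAVM's engine).
// Facts are strings; "!p" denotes the negation of "p"; rules are Horn-style.

const rules = [
  { if: ['bird', '!penguin'], then: 'flies' },
  { if: ['penguin'], then: '!flies' },
];

// Forward-chain for at most `budget` rule applications (an anytime, bounded closure).
function boundedClosure(facts, rules, budget) {
  const known = new Set(facts);
  let steps = 0;
  let changed = true;
  while (changed && steps < budget) {
    changed = false;
    for (const r of rules) {
      if (steps >= budget) break;
      if (r.if.every((f) => known.has(f)) && !known.has(r.then)) {
        known.add(r.then);
        changed = true;
        steps++;
      }
    }
  }
  return known;
}

// Consistency: no fact may appear together with its negation.
function isConsistent(facts) {
  for (const f of facts) {
    const neg = f.startsWith('!') ? f.slice(1) : `!${f}`;
    if (facts.has(neg)) return false;
  }
  return true;
}

const closure = boundedClosure(['bird', '!penguin'], rules, 10);
console.log([...closure], 'consistent:', isConsistent(closure));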

Technical Metrics

Metric | Description | Threshold | Literature Basis
Memory Usage | Peak and average memory consumption | <500 MB | Embedded AI constraints
Execution Time | Processing speed for different operations | <100 ms for queries | Real-time system requirements
Scalability | Performance degradation with data size | Sub-linear | Algorithm complexity theory
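
Node's built-in process.memoryUsage() and performance.now() are sufficient to check a single operation against the <500 MB and <100 ms thresholds. The wrapper below is a minimal instrumentation sketch, not part of the shipped eval tooling.

// Minimal instrumentation sketch for the technical metrics (not part of evals/ or
// eval_tinyLLM/): wall-clock latency per operation and resident memory after it.

import { performance } from 'node:perf_hooks';

const MAX_QUERY_MS = 100;                     // "<100 ms for queries" threshold
const MAX_MEMORY_BYTES = 500 * 1024 * 1024;   // "<500 MB" threshold

async function measure(label, fn) {
  const t0 = performance.now();
  const result = await fn();
  const elapsedMs = performance.now() - t0;
  const rss = process.memoryUsage().rss;
  console.log(
    `${label}: ${elapsedMs.toFixed(2)} ms (limit ${MAX_QUERY_MS} ms), ` +
    `rss ${(rss / 1048576).toFixed(1)} MB (limit ${MAX_MEMORY_BYTES / 1048576} MB)`
  );
  return { result, elapsedMs, withinBudget: elapsedMs < MAX_QUERY_MS && rss < MAX_MEMORY_BYTES };
}

// Usage with any query function, here a placeholder that answers a factual query.
await measure('basic factual query', async () => 'Paris');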

Benchmarks and Thresholds

Established performance baselines derived from literature review and system requirements analysis.

Threshold Justification

Rule Learning: >90%

Based on symbolic AI benchmarks where deterministic rule extraction should achieve near-perfect accuracy on synthetic data

Compression: >50%

Derived from information theory bounds and practical compression algorithms on structured data

Reasoning: >95%

Logic programming systems achieve near-perfect consistency on well-formed knowledge bases

RL Convergence: <1000 episodes

Standard benchmark for simple pattern learning tasks in reinforcement learning literature

Regression Detection

The system tracks performance over time and alerts on regressions using statistical analysis; one possible check is sketched below.
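
The exact statistical test is not specified here. One simple option, assumed purely for illustration, is to compare the latest value of each metric against the mean and standard deviation of recent runs and flag values that fall several standard deviations below the historical mean (for metrics where higher is better).

// One possible regression check (an assumption, not the shipped implementation):
// flag a metric whose latest value falls more than `k` standard deviations below
// the mean of the previous runs.

function detectRegression(history, latest, k = 3) {
  if (history.length < 5) return { regression: false, reason: 'not enough history' };
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance = history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const std = Math.sqrt(variance);
  const threshold = mean - k * std;
  return { regression: latest < threshold, mean, std, threshold, latest };
}

// Example: rule extraction rate over previous runs, then a sudden drop.
const history = [0.94, 0.95, 0.93, 0.96, 0.95, 0.94];
console.log(detectRegression(history, 0.95)); // regression: false
console.log(detectRegression(history, 0.78)); // regression: true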

Results and Analysis

FastEval prints results to stdout and can emit JSON with --json. eval_tinyLLM writes timestamped HTML and JSON comparison reports under eval_tinyLLM/results/.

Implementation Status

FastEval and eval_tinyLLM are both usable today. FastEval focuses on deterministic VM properties; eval_tinyLLM focuses on continuation-quality metrics under identical budgets.

Planned Results Visualization

📊 Performance Dashboards

Real-time metrics tracking and historical trend analysis

📈 Regression Reports

Automated detection and reporting of performance degradations

🔍 Detailed Analysis

Deep-dive analysis of specific test failures and performance bottlenecks

📋 Comparative Studies

Benchmarking against established AI systems and theoretical bounds

Next Steps

  1. Complete the implementation of the RL prediction and query response tests
  2. Integrate the tests with the actual VSAVM implementation
  3. Establish baseline performance measurements
  4. Implement continuous integration and automated reporting (see the sketch below)
  5. Conduct comparative analysis with existing systems
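
As a starting point for item 4, the sketch below runs FastEval with the documented --json flag and fails the build if any category does not pass. The JSON shape it reads ({ categories: [{ name, passed }] }) is a hypothetical placeholder; adapt the field access to whatever evals/run.mjs actually emits.

// CI gating sketch. The --json flag comes from the FastEval usage above; the
// report shape below is HYPOTHETICAL and must be matched to the real output.

import { execFileSync } from 'node:child_process';

const stdout = execFileSync('node', ['evals/run.mjs', '--json'], { encoding: 'utf8' });
const report = JSON.parse(stdout);

const failed = (report.categories ?? []).filter((c) => !c.passed);
if (failed.length > 0) {
  console.error('FastEval regressions:', failed.map((c) => c.name).join(', '));
  process.exit(1);
}
console.log('FastEval: all categories passed');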