Evaluation System Overview
The VSAVM evaluation framework provides two complementary tracks: (1) deterministic, synthetic capability checks for the VM core (FastEval) and (2) a reproducible continuation benchmark comparing VSAVM macro-units against a small TensorFlow baseline (eval_tinyLLM).
Key Objectives
Demonstrate that VSAVM can learn rules, compress data, perform reasoning, learn through RL, and respond to queries, all with measurable performance indicators and established thresholds.
- Rule Learning: learns rules on different types of deterministically generated synthetic data with >90% accuracy
- Data Compression: compresses learned patterns effectively, with >50% size reduction and <5% information loss
- Reasoning: performs logical inference and maintains >95% consistency in bounded closure
- RL Prediction: learns through reinforcement learning, converging in <1000 episodes on simple patterns
- Query Response: answers basic factual queries with <100ms response time
- Continuation Comparison: budgeted byte-level continuation metrics (VSAVM macro-units vs. TensorFlow baseline)
FastEval - Rapid Evaluation Suite
FastEval provides lightweight, deterministic tests to quickly assess VSAVM's core capabilities without extensive training or computational resources.
Design Principles
- Minimal Computation: Tests run in seconds to minutes, not hours
- Deterministic: Reproducible results across runs and environments
- Synthetic Data: Controlled, generated datasets with known ground truth
- Progressive Complexity: Tests start simple and increase in difficulty
- Clear Metrics: Quantitative measures with established thresholds
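As a concrete illustration of these principles, the sketch below builds a seeded synthetic dataset with a known ground-truth rule and applies the >90% threshold. The `toy_rule_learner` is a hypothetical stand-in used only to show the shape of such a test; it is not the VSAVM learner.

```python
import random

def generate_arithmetic_sequence(seed: int, length: int = 20):
    """Deterministically generate an arithmetic sequence with a known ground-truth rule."""
    rng = random.Random(seed)                       # fixed seed -> reproducible across runs
    start, step = rng.randint(0, 9), rng.randint(1, 5)
    sequence = [start + i * step for i in range(length)]
    return sequence, {"start": start, "step": step}  # data plus ground truth

def toy_rule_learner(sequence):
    """Hypothetical stand-in learner: infer the rule from first differences."""
    return {"start": sequence[0], "step": sequence[1] - sequence[0]}

def rule_learning_test(num_cases: int = 100, threshold: float = 0.90) -> bool:
    """Run seeded cases, compare learned rules to ground truth, apply the >90% threshold."""
    correct = 0
    for seed in range(num_cases):
        sequence, truth = generate_arithmetic_sequence(seed)
        correct += toy_rule_learner(sequence) == truth
    accuracy = correct / num_cases
    print(f"rule extraction accuracy: {accuracy:.2%}")
    return accuracy >= threshold

if __name__ == "__main__":
    assert rule_learning_test(), "rule learning below threshold"
```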
Test Categories
- Rule Learning Tests (In Progress): arithmetic sequences, logical implications, temporal patterns
- Compression Tests (In Progress): pattern consolidation, schema reuse, MDL optimization
- Reasoning Tests (In Progress): deductive reasoning, consistency checking, bounded closure
- RL Prediction Tests (Planned): shape learning, transfer learning, policy optimization
- Query Response Tests (Planned): factual queries, inferential queries, natural language compilation
Running FastEval
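The exact entry point depends on the repository layout, so the following is only a minimal sketch of a FastEval-style runner: it aggregates per-category scores against the documented thresholds and can emit JSON to stdout, consistent with the Results and Analysis section below. The `check_rule_learning` helper and the `--json` flag are illustrative assumptions, not the actual FastEval interface.

```python
import json
import sys

def check_rule_learning() -> float:
    """Tiny deterministic stand-in check: recover the step of an arithmetic sequence."""
    sequence = [3 + 4 * i for i in range(10)]
    return 1.0 if sequence[1] - sequence[0] == 4 else 0.0

# Category -> (check function, threshold); thresholds mirror the metric tables below.
CHECKS = {"rule_learning": (check_rule_learning, 0.90)}

def main() -> int:
    results = {}
    for name, (check, threshold) in CHECKS.items():
        score = check()
        results[name] = {"score": score, "threshold": threshold,
                         "passed": score >= threshold}
    if "--json" in sys.argv:
        print(json.dumps(results, indent=2))   # machine-readable report on stdout
    else:
        for name, r in results.items():
            print(f"{name}: {r['score']:.2f} "
                  f"({'PASS' if r['passed'] else 'FAIL'})")
    return 0 if all(r["passed"] for r in results.values()) else 1

if __name__ == "__main__":
    sys.exit(main())
```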
eval_tinyLLM - Budgeted continuation comparison
eval_tinyLLM is a reproducible training + comparison sandbox. It measures byte-level continuation quality under strict time/token budgets and compares VSAVM's DS011 MacroUnitModel against a small TensorFlow baseline on the same prepared dataset.
What it is (and what it is not)
- Measures: perplexity, throughput, repetition, distinct n-grams, reference match, and VSAVM macro-unit compression.
- Does not measure: VM/closure correctness of factual claims (that remains the VM + DS004 bounded closure responsibility).
- Artifact discipline: datasets and trained models are cached under `eval_tinyLLM/cache/`, keyed by `datasetId` and `modelId`, so multiple dataset sizes and training variants can coexist without overwriting.
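For reference, here is a minimal sketch of how the continuation metrics listed above can be computed over byte sequences; the exact formulas used by eval_tinyLLM may differ, and the helper names are illustrative.

```python
import math
from collections import Counter

def distinct_ngrams(data: bytes, n: int = 2) -> float:
    """Fraction of n-grams that are unique; higher means less repetitive output."""
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

def repetition_rate(data: bytes, n: int = 4) -> float:
    """Fraction of n-gram occurrences beyond the first: a simple repetition measure."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return (total - len(counts)) / max(total, 1)

def byte_perplexity(negative_log_likelihoods: list[float]) -> float:
    """Perplexity from per-byte negative log-likelihoods (natural log)."""
    return math.exp(sum(negative_log_likelihoods) / max(len(negative_log_likelihoods), 1))

if __name__ == "__main__":
    sample = b"the cat sat on the mat, the cat sat on the mat"
    print(f"distinct 2-grams: {distinct_ngrams(sample):.2f}")
    print(f"repetition rate:  {repetition_rate(sample):.2f}")
    print(f"perplexity:       {byte_perplexity([2.1, 1.7, 2.4]):.2f}")
```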
Running eval_tinyLLM
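The concrete invocation is repository-specific; the sketch below only illustrates the strict time/token budget discipline described above, with a hypothetical `train_step` callable standing in for one training update of either model.

```python
import time

def run_with_budget(train_step, max_seconds: float, max_tokens: int) -> dict:
    """Train until either the wall-clock or the token budget is exhausted.

    `train_step` is a hypothetical callable that performs one update and returns
    how many tokens (bytes) it consumed; both models in the comparison would be
    driven through the same budget check.
    """
    start = time.monotonic()
    tokens_used, steps = 0, 0
    while time.monotonic() - start < max_seconds and tokens_used < max_tokens:
        tokens_used += train_step()
        steps += 1
    return {"steps": steps, "tokens": tokens_used,
            "seconds": round(time.monotonic() - start, 2)}

if __name__ == "__main__":
    def toy_step() -> int:
        time.sleep(0.01)   # simulate work
        return 256         # bytes consumed by this step
    print(run_with_budget(toy_step, max_seconds=0.1, max_tokens=4096))
```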
Outputs
- Reports: written to `eval_tinyLLM/results/<timestamp>_results.html` and `.json`.
- Artifacts: cached under `eval_tinyLLM/cache/models/{vsavm,tf}/<datasetId>/<modelId>/`.
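A small sketch of how the documented cache layout can be constructed programmatically; the `datasetId` and `modelId` values in the example are made up for illustration.

```python
from pathlib import Path

def model_cache_dir(root: Path, backend: str, dataset_id: str, model_id: str) -> Path:
    """Build the documented cache path: cache/models/{vsavm,tf}/<datasetId>/<modelId>/."""
    assert backend in ("vsavm", "tf"), "the comparison covers the VSAVM and TensorFlow backends"
    return root / "cache" / "models" / backend / dataset_id / model_id

if __name__ == "__main__":
    # Illustrative identifiers only; real IDs are assigned by the dataset/training pipeline.
    path = model_cache_dir(Path("eval_tinyLLM"), "vsavm", "bytes_1MB", "ds011_default")
    print(path)  # e.g. eval_tinyLLM/cache/models/vsavm/bytes_1MB/ds011_default
    # Keyed directories for different dataset sizes and training variants coexist side by side.
```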
Evaluation Metrics
Comprehensive metrics system tracking learning, compression, reasoning, and technical performance indicators.
Learning Metrics
| Metric | Description | Threshold | Literature Basis |
|---|---|---|---|
| Rule Extraction Rate | Percentage of underlying rules correctly identified | >90% | Symbolic AI benchmarks |
| Convergence Speed | Training steps required to reach threshold performance | <1000 episodes | RL convergence studies |
| Generalization | Performance on unseen but similar patterns | >85% | Transfer learning research |
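As a concrete reading of the first two rows, here is a minimal sketch of how rule extraction rate and convergence speed could be computed; these are illustrative definitions, not the production metric code.

```python
def rule_extraction_rate(true_rules: set[str], extracted_rules: set[str]) -> float:
    """Fraction of ground-truth rules correctly identified (compare against the >90% threshold)."""
    if not true_rules:
        return 1.0
    return len(true_rules & extracted_rules) / len(true_rules)

def convergence_episode(episode_scores: list[float], target: float = 0.9) -> int | None:
    """First episode at which performance reaches the target (None if it never converges)."""
    for episode, score in enumerate(episode_scores, start=1):
        if score >= target:
            return episode
    return None

if __name__ == "__main__":
    truth = {"a(n) = a(n-1) + 3", "p -> q"}
    found = {"a(n) = a(n-1) + 3", "p -> q", "spurious"}
    print(f"rule extraction rate: {rule_extraction_rate(truth, found):.0%}")        # 100%
    print(f"converged at episode: {convergence_episode([0.4, 0.7, 0.93, 0.95])}")   # 3
```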
Compression Metrics
| Metric | Description | Threshold | Literature Basis |
|---|---|---|---|
| Compression Ratio | Size reduction of the compressed representation relative to the original data | >50% reduction | Data compression standards |
| MDL Score | Minimum Description Length principle compliance | Minimize | MDL theory (Rissanen) |
| Schema Efficiency | Reusability of learned schemas across contexts | >70% | Schema learning studies |
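A short sketch of the first two compression metrics under one plausible reading: size reduction relative to the original data, and a two-part MDL code length. The numbers in the example are made up.

```python
def size_reduction(original_bytes: int, compressed_bytes: int) -> float:
    """Size reduction achieved by the compressed representation (threshold: >50%)."""
    return 1.0 - compressed_bytes / original_bytes

def mdl_score(model_description_bits: float, data_given_model_bits: float) -> float:
    """Two-part MDL code length L(M) + L(D|M); lower is better (Rissanen)."""
    return model_description_bits + data_given_model_bits

if __name__ == "__main__":
    print(f"reduction: {size_reduction(10_000, 4_200):.0%}")        # 58%
    print(f"MDL score: {mdl_score(1_500, 2_700):.0f} bits")          # 4200 bits
```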
Reasoning Metrics
| Metric | Description | Threshold | Literature Basis |
|---|---|---|---|
| Inference Accuracy | Correctness of logical deductions | >95% | Logic programming benchmarks |
| Consistency Score | Absence of contradictions in derived facts | >95% | Knowledge base consistency |
| Bounded Closure | Completeness within computational budget | >90% | Anytime algorithms |
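A minimal sketch of a consistency check under a simple propositional reading (a fact counts as inconsistent if its negation is also derived); the actual consistency check in VSAVM may be richer than this illustration.

```python
def consistency_score(derived_facts: set[str]) -> float:
    """Fraction of derived facts whose negation is not also derived (threshold: >95%)."""
    def negate(fact: str) -> str:
        return fact[4:] if fact.startswith("not ") else f"not {fact}"
    if not derived_facts:
        return 1.0
    consistent = sum(1 for fact in derived_facts if negate(fact) not in derived_facts)
    return consistent / len(derived_facts)

if __name__ == "__main__":
    facts = {"wet(road)", "raining", "not raining"}        # one contradictory pair
    print(f"consistency: {consistency_score(facts):.2f}")   # 0.33
```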
Technical Metrics
| Metric | Description | Threshold | Literature Basis |
|---|---|---|---|
| Memory Usage | Peak and average memory consumption | <500MB | Embedded AI constraints |
| Execution Time | Processing speed for different operations | <100ms queries | Real-time system requirements |
| Scalability | Performance degradation with data size | Sub-linear | Algorithm complexity theory |
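A sketch of how the latency and memory thresholds can be measured in Python; note that `tracemalloc` only tracks Python-level allocations, so it approximates rather than equals total process memory.

```python
import time
import tracemalloc

def measure_query(run_query, *args) -> dict:
    """Measure wall-clock latency and peak Python memory for a single query call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = run_query(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"result": result,
            "latency_ms": round(elapsed_ms, 2),              # compare against the <100ms target
            "peak_memory_mb": round(peak_bytes / 2**20, 2)}  # compare against the <500MB budget

if __name__ == "__main__":
    # Toy stand-in query; the real measurement would wrap a VSAVM factual query.
    print(measure_query(lambda n: sum(range(n)), 100_000))
```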
Benchmarks and Thresholds
Established performance baselines derived from literature review and system requirements analysis.
Threshold Justification
Rule Learning: >90%
Based on symbolic AI benchmarks where deterministic rule extraction should achieve near-perfect accuracy on synthetic data
Compression: >50%
Derived from information theory bounds and practical compression algorithms on structured data
Reasoning: >95%
Logic programming systems achieve near-perfect consistency on well-formed knowledge bases
RL Convergence: <1000 episodes
Standard benchmark for simple pattern learning tasks in reinforcement learning literature
Regression Detection
The system tracks performance over time and alerts on regressions using statistical analysis:
- Performance Baselines: Established benchmarks for each test category
- Trend Analysis: Statistical analysis of performance changes over time
- Alert Thresholds: Configurable sensitivity for regression detection (default: 5% degradation)
- Historical Comparison: Performance comparison across versions and builds
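A minimal sketch of the default 5% degradation rule described above, applied to a single metric.

```python
def detect_regression(baseline: float, current: float,
                      tolerance: float = 0.05, higher_is_better: bool = True) -> bool:
    """Flag a regression when the current value degrades by more than `tolerance` (default 5%)."""
    if higher_is_better:
        return current < baseline * (1.0 - tolerance)
    return current > baseline * (1.0 + tolerance)   # e.g. latency: lower is better

if __name__ == "__main__":
    print(detect_regression(baseline=0.96, current=0.93))                            # False (within 5%)
    print(detect_regression(baseline=0.96, current=0.90))                            # True  (>5% drop)
    print(detect_regression(baseline=80.0, current=92.0, higher_is_better=False))    # True  (latency up >5%)
```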
Results and Analysis
FastEval prints results to stdout (and can emit JSON). eval_tinyLLM writes timestamped HTML+JSON comparison reports under eval_tinyLLM/results/.
Implementation Status
FastEval and eval_tinyLLM are both usable today. FastEval focuses on deterministic VM properties; eval_tinyLLM focuses on continuation-quality metrics under identical budgets.
Planned Results Visualization
- Performance Dashboards: Real-time metrics tracking and historical trend analysis
- Regression Reports: Automated detection and reporting of performance degradations
- Detailed Analysis: Deep-dive analysis of specific test failures and performance bottlenecks
- Comparative Studies: Benchmarking against established AI systems and theoretical bounds
Next Steps
- Complete implementation of RL prediction and query response tests
- Integrate with actual VSAVM implementation
- Establish baseline performance measurements
- Implement continuous integration and automated reporting
- Conduct comparative analysis with existing systems