Can you trust your AI's context?
ContextBenchmark is the open standard for measuring the reliability, reproducibility, and stability of AI context systems.
Compare context engines — retrieval indexes, RAG pipelines, agent memory, code-context systems — with deterministic metrics, transparent methodology, and reproducible, fingerprint-verified results.
AI is only as reliable as the context it receives.
Modern AI systems depend on context layers to retrieve knowledge, maintain memory, and ground decisions. Yet there is no standard way to measure whether those context systems are stable, reproducible, or trustworthy. ContextBenchmark fills that gap.
The benchmark intentionally measures the context layer, not the language model. LLM inference nondeterminism is a separate, explicitly out-of-scope problem.
Why AI needs a context benchmark
AI benchmarks measure model capability. They rarely measure the quality of the context supplied to those models. Yet context determines what an AI knows, remembers, retrieves — and ultimately how reliably it behaves. ContextBenchmark establishes the first open standard for evaluating context infrastructure independently of the language model itself.
| Existing AI benchmarks | ContextBenchmark |
|---|---|
| Measure the model | Measures the context layer |
| Focus on reasoning and generation | Focuses on reproducibility and reliability |
| Vary with every model update | Designed to remain model-agnostic |
| Evaluate intelligence | Evaluates trust in context |
What ContextBenchmark measures
Four test families, each answering one question a production team actually asks.
Rebuild Identity
Can the system recreate byte-identical artifacts from identical input? The test runs independent fresh builds and compares artifact hashes.
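A rebuild-identity check can be sketched as follows. This is an illustration, not the benchmark's implementation: the `Map` of artifact name to bytes and the function names `sameBytes` and `diffBuilds` are hypothetical, and the sketch compares bytes directly rather than hashes so that it needs no imports.

```javascript
// Hypothetical shape: a build is a Map from artifact name to its bytes (Uint8Array).
function sameBytes(a, b) {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Returns the sorted names of artifacts that are missing from one build
// or whose contents differ between the two builds.
function diffBuilds(buildA, buildB) {
  const names = new Set([...buildA.keys(), ...buildB.keys()]);
  const mismatched = [];
  for (const name of [...names].sort()) {
    const a = buildA.get(name);
    const b = buildB.get(name);
    if (a === undefined || b === undefined || !sameBytes(a, b)) {
      mismatched.push(name);
    }
  }
  return mismatched;
}

// Two fresh builds of the same input should produce no mismatches.
const encoder = new TextEncoder();
const firstBuild = new Map([["index.bin", encoder.encode("postings-v1")]]);
const secondBuild = new Map([["index.bin", encoder.encode("postings-v1")]]);
console.log(diffBuilds(firstBuild, secondBuild).length === 0 ? "identical" : "differs");
```

Returning the list of mismatched names, rather than a single boolean, makes it easier to see which artifact broke reproducibility.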
Query Stability
Does the same question return the same context every time — same files, same order?
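One way to score this question is an exact match rate over repeated trials of the same query. The sketch below assumes a simple definition (the fraction of trials whose ranked list equals the first trial's); the benchmark's own definition lives in its methodology, and the function names here are hypothetical.

```javascript
// Two result lists match only if they hold the same items in the same order.
function sameRanking(a, b) {
  return a.length === b.length && a.every((item, i) => item === b[i]);
}

// trials: ranked result lists returned for one query across repeated runs.
// Returns the fraction of trials whose ranking equals the first trial's ranking.
function exactMatchRate(trials) {
  if (trials.length === 0) throw new Error("need at least one trial");
  const reference = trials[0];
  const matches = trials.filter((t) => sameRanking(t, reference)).length;
  return matches / trials.length;
}

const stable = [["a.js", "b.js"], ["a.js", "b.js"], ["a.js", "b.js"]];
const reordered = [["a.js", "b.js"], ["b.js", "a.js"]];
console.log(exactMatchRate(stable));    // 1
console.log(exactMatchRate(reordered)); // 0.5
```

Note that a reordered list counts as a mismatch even when it contains the same files, which matches the "same files, same order" wording above.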
Drift Under Noise
Does adding one unrelated file change answers to unrelated questions?
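As a rough illustration, drift can be expressed as one minus the mean Jaccard overlap between the top-k results before and after the noise file is added. That formula is an assumption made for this sketch, not a definition quoted from the methodology, and `jaccardAtK` and `driftUnderNoise` are hypothetical names.

```javascript
// Jaccard overlap of the top-k items of two ranked lists.
function jaccardAtK(a, b, k) {
  const setA = new Set(a.slice(0, k));
  const setB = new Set(b.slice(0, k));
  const union = new Set([...setA, ...setB]);
  if (union.size === 0) return 1; // two empty results are treated as identical
  let shared = 0;
  for (const item of setA) {
    if (setB.has(item)) shared++;
  }
  return shared / union.size;
}

// baseline / noisy: one ranked list per query, in the same query order.
// Returns the mean of (1 - Jaccard@k) across queries; 0 means no drift.
function driftUnderNoise(baseline, noisy, k) {
  if (baseline.length === 0 || baseline.length !== noisy.length) {
    throw new Error("need the same non-empty set of queries in both runs");
  }
  let total = 0;
  baseline.forEach((list, i) => {
    total += 1 - jaccardAtK(list, noisy[i], k);
  });
  return total / baseline.length;
}

// Ten queries; in two of them the noise file displaces one of the top 10.
const top10 = Array.from({ length: 10 }, (_, i) => `file${i}.js`);
const displaced = [...top10.slice(0, 9), "noise.txt"];
const baselineRuns = Array.from({ length: 10 }, () => top10);
const noisyRuns = baselineRuns.map((list, q) => (q < 2 ? displaced : list));
console.log(driftUnderNoise(baselineRuns, noisyRuns, 10).toFixed(3)); // 0.036
```

In the example, each affected query shares 9 of 11 distinct items, so its drift is 2/11, and averaging over ten queries gives about 0.036.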
Cross-Machine Identity
Do Windows, macOS, and Linux produce identical context? Verified by fingerprint exchange, not by trust.
Future families under specification: agent safety, governance, memory integrity, determinism-under-incremental-update.
Context Trust Levels
Every implementation receives a Context Trust Level based on published benchmark results.
| Level | Name | Requirement |
|---|---|---|
| CTL 4 | Cross-machine deterministic | CTL 3, plus identical artifacts and query results across operating systems, verified by fingerprint exchange |
| CTL 3 | Machine-deterministic | Byte-identical artifacts across rebuilds and exact-match query results across trials |
| CTL 2 | Stable retrieval | Artifact bytes differ, but ranked query results are identical every time |
| CTL 1 | Repeatable locally | Results not identical but rank-stable (Jaccard@k ≥ 0.9, τ ≥ 0.9) |
| CTL 0 | Non-repeatable | Below CTL 1 |
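The table above can be read as a decision procedure. The following sketch encodes it under stated assumptions: the input field names are hypothetical, and how per-query scores are aggregated into a single Jaccard@k or τ value (mean, minimum, or otherwise) is left to the methodology.

```javascript
// r is a hypothetical summary of one engine's benchmark run:
//   rebuildIdentical      - artifacts byte-identical across fresh rebuilds
//   exactMatchRate        - 1 means every trial returned the exact same ranking
//   jaccardAtK, kendallTau - aggregated rank-stability scores
//   crossPlatformVerified - fingerprints matched across operating systems
function contextTrustLevel(r) {
  const exactQueries = r.exactMatchRate === 1;
  if (r.rebuildIdentical && exactQueries) {
    return r.crossPlatformVerified ? 4 : 3;
  }
  if (exactQueries) return 2; // artifact bytes differ, ranked results do not
  if (r.jaccardAtK >= 0.9 && r.kendallTau >= 0.9) return 1;
  return 0;
}

const deterministicEverywhere = {
  rebuildIdentical: true, exactMatchRate: 1,
  jaccardAtK: 1, kendallTau: 1, crossPlatformVerified: true,
};
const rankStableOnly = {
  rebuildIdentical: false, exactMatchRate: 0.8,
  jaccardAtK: 0.95, kendallTau: 0.92, crossPlatformVerified: false,
};
console.log(contextTrustLevel(deterministicEverywhere)); // 4
console.log(contextTrustLevel(rankStableOnly));          // 1
```

Checking the strictest level first keeps each level a strict superset of the one below it, which is how the requirements in the table are phrased.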
No vendor can claim a level without publishing benchmark artifacts. Every claim is backed by a fingerprint anyone can independently verify — including ours.
Current results
Results below use the micro-app reference corpus. Fingerprints are published in the repository, and cross-platform verification runs publicly in CI on every push.
| Engine | CTL | Rebuild | Query stability | Drift under noise | Cross-platform |
|---|---|---|---|---|---|
| BM25 (lexical reference) | CTL 4 | ✅ identical | ✅ EMR 1.0 | ⚠️ 0.04 — noise reached top-10 in 2/10 queries | ✅ verified in CI: ubuntu · windows · macos |
| Spiderbrain (structural code-context engine) | CTL 3 | ✅ identical | ✅ EMR 1.0 | ✅ 0.00 — noise never surfaced | Pending CI-runnable packaging (fingerprints published) |
| MiniLM embeddings (RAG reference) | — | Run pending — adapter shipped in the repository | |||
| Mem0 · Zep · Supermemory · LlamaIndex · vector stores | — | Awaiting vendor adapters — contribute one | |||
The reference baseline reaching CTL 4 is the point: the bar is achievable with plain engineering. A system scoring below the free baseline has made a design choice, not hit a law of nature.
Designed for fair comparison
ContextBenchmark evaluates context infrastructure — not language models, and not marketing.
- Adapter API (~40 lines per system)
- Open datasets, committed and license-clean
- Public methodology and metrics
- Fingerprint verification for every claim
- No benchmark-specific optimizations allowed
- Reproducible runs on commodity hardware
Benchmark architecture
    git clone https://github.com/aabhisrv/contextbenchmark && cd contextbenchmark
    node contextbenchmark.mjs run --adapters bm25               # dependency-free baseline
    node contextbenchmark.mjs run --adapters bm25,emb-minilm    # + typical-RAG reference
    node contextbenchmark.mjs compare A.fingerprint.json B.fingerprint.json
Built for vendors
Implement a lightweight adapter and benchmark your context system against the same transparent methodology used by every participant.
Honesty rules apply to everyone: production configuration only, no benchmark-only determinism flags, and results disclosed with fingerprints. See the adapter contract for details.
Research & methodology
ContextBenchmark builds on reproducible-systems research, retrieval evaluation, and software reproducibility practice — and introduces standardized measurements for deterministic AI context.
Methodology
Test-family definitions, trial counts, pass criteria, and level assignment — versioned in the repository.
Metrics
Exact Match Rate, Jaccard@k, and Kendall τ follow the conventions established for RAG reproducibility measurement (ReproRAG, arXiv:2509.18869).
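To make the rank-correlation metric concrete, here is a minimal sketch of Kendall τ between two rankings. It assumes the tau-a variant, computed only over items that appear in both lists, with no duplicate items; the cited conventions may handle ties and non-shared items differently, and `kendallTau` is a hypothetical name.

```javascript
// Kendall tau-a over the items shared by two rankings (assumes unique items).
// Returns 1 for identical order, -1 for fully reversed order.
function kendallTau(rankA, rankB) {
  const positionInB = new Map(rankB.map((item, i) => [item, i]));
  // Positions in B of the shared items, listed in A's order.
  const seq = rankA
    .filter((item) => positionInB.has(item))
    .map((item) => positionInB.get(item));
  const n = seq.length;
  if (n < 2) return 1; // fewer than two shared items: nothing can be out of order
  let concordant = 0;
  let discordant = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      if (seq[i] < seq[j]) concordant++;
      else discordant++;
    }
  }
  return (concordant - discordant) / ((n * (n - 1)) / 2);
}

const ranking = ["a", "b", "c", "d"];
console.log(kendallTau(ranking, ["a", "b", "c", "d"]));            // 1
console.log(kendallTau(ranking, ["d", "c", "b", "a"]));            // -1
console.log(kendallTau(ranking, ["a", "b", "d", "c"]).toFixed(2)); // 0.67
```

In the last example one of the six item pairs is swapped, giving (5 − 1) / 6 ≈ 0.67, below the 0.9 threshold used for rank stability.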
Datasets
Committed, deterministic corpora with fixed query sets; pinned real-repository tiers planned.
Disclosure
A publishable result includes the report, fingerprints, versions, corpus identity, and machine spec. No fingerprint, no claim.
Versioning
Metric or family changes version the benchmark; levels are always cited with the benchmark version that produced them.
Related work
ContextBenchmark is distinct from model benchmarks and from academic context-accuracy evaluation (e.g. ContextBench, which is unaffiliated). Its focus on reliability is complementary to both.
Open source
ContextBenchmark is community-driven. Every benchmark, adapter, dataset, metric, and result is publicly inspectable and reproducible.