Benchmark Results

All results use the default embedding model (all-MiniLM-L6-v2, 384 dimensions) unless otherwise noted. AUROC (area under the ROC curve) is the primary metric.
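
Every AUROC figure below is computed in the standard way from detector scores and binary hallucination labels. A minimal sketch with scikit-learn (one of the benchmark dependencies listed at the bottom of this page), using illustrative data:

```python
# Minimal AUROC computation, as used throughout this page.
# Labels and scores are illustrative, not benchmark data.
from sklearn.metrics import roc_auc_score

labels = [1, 1, 0, 0, 1, 0]                # 1 = hallucinated, 0 = grounded
scores = [0.9, 0.7, 0.2, 0.4, 0.8, 0.1]    # detector scores (higher = more likely hallucinated)

print(roc_auc_score(labels, scores))       # 1.0 here, since the classes separate perfectly
```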

Headline Results

| Method | Calibration | AUROC | Dataset |
|---|---|---|---|
| DGI | Domain-specific | 0.958 | Confabulation benchmark (LLM-generated) |
| DGI | Generic (bundled) | 0.76 | Mixed QA |
| SGI | N/A | 0.88 | RAG verification (context-grounded) |

DGI: Generic vs. Domain-Specific Calibration

The single biggest improvement comes from domain-specific calibration:

| Calibration | AUROC | \(\kappa\) | Pairs | Notes |
|---|---|---|---|---|
| Generic (bundled) | 0.76 | 3.2 | ~500 | Multi-domain, diverse |
| Legal | 0.94 | 11.4 | 47 | Contract/regulatory QA |
| Medical | 0.97 | 14.2 | 63 | Clinical pharmacy QA |
| Financial | 0.92 | 8.7 | 38 | Compliance/regulatory QA |
| Technical docs | 0.95 | 12.1 | 55 | Software documentation QA |
| Customer support | 0.91 | 7.9 | 42 | Product support QA |

Key insight

Domain calibration with just 40--60 pairs typically improves AUROC by 0.15--0.21 over the generic baseline. The concentration parameter \(\kappa\) predicts calibration quality: \(\kappa > 10\) consistently produces AUROC > 0.93.
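
The latency table below notes that DGI scoring is a dot product with a cached mean direction (mu_hat), which suggests \(\kappa\) is a von Mises-Fisher-style concentration of the calibration embeddings. Assuming that reading (it is not confirmed on this page), the standard Banerjee et al. (2005) approximation gives a quick way to estimate \(\kappa\) from your own pairs:

```python
# Sketch: vMF concentration estimate via the Banerjee et al. (2005)
# approximation kappa ~= R(d - R^2) / (1 - R^2), where R is the mean
# resultant length of the unit-normalized embeddings. Assumption: this is
# one plausible reading of groundlens's kappa, which is not documented here.
import numpy as np

def estimate_kappa(embeddings: np.ndarray) -> float:
    """embeddings: (n, d) array of calibration embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    r_bar = float(np.linalg.norm(unit.mean(axis=0)))  # mean resultant length, in [0, 1]
    d = unit.shape[1]
    return r_bar * (d - r_bar**2) / (1 - r_bar**2)

# Caveat: this raw estimate is biased upward for small n in high dimension,
# so absolute values need not match the table above.
```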

DGI by Hallucination Type

The confabulation benchmark breaks down performance by hallucination type (arXiv:2603.13259):

| Hallucination type | DGI AUROC (domain) | DGI AUROC (generic) |
|---|---|---|
| Divergent (topic drift) | 0.98 | 0.85 |
| Fabrication (invented facts) | 0.96 | 0.78 |
| Tangential (partial grounding) | 0.89 | 0.71 |
| Template confabulation | 0.62 | 0.54 |
| Expert-crafted confabulation | 0.51 | 0.50 |

The results clearly show the confabulation boundary: performance degrades as hallucinations become more distributionally similar to grounded responses. See Confabulation Boundary for the theoretical analysis.

SGI Results

SGI is evaluated on datasets where context is available:

| Scenario | AUROC | Notes |
|---|---|---|
| RAG verification (context used vs. ignored) | 0.88 | Standard RAG setup |
| Document QA (answer from doc vs. parametric) | 0.91 | Long-document QA |
| Summarization (faithful vs. hallucinated) | 0.84 | Summary grounding check |

SGI does not require calibration --- it uses the geometric structure of question/context/response directly.
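
The latency table below characterizes an SGI score as three embeddings plus a distance computation. The exact formula is not reproduced on this page, so the scoring rule in this sketch is an illustrative assumption; only the overall shape of the computation comes from the docs:

```python
# Illustrative SGI-style scorer: embed question, context, and response,
# then combine distances. The combination rule below is an assumption;
# only the "3 embeddings + distance computation" shape is taken from the
# latency table on this page.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # default model, 384 dims

def sgi_like_score(question: str, context: str, response: str) -> float:
    q, c, r = model.encode([question, context, response], normalize_embeddings=True)
    # Reward responses that sit near the provided context rather than
    # merely restating the question.
    return float(r @ c) - float(r @ q)
```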

Embedding Model Comparison

Different embedding models produce different AUROC values:

| Model | Dimensions | DGI AUROC (generic) | DGI AUROC (domain) | Inference time |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 0.76 | 0.958 | ~5 ms |
| all-mpnet-base-v2 | 768 | 0.79 | 0.964 | ~12 ms |
| bge-small-en-v1.5 | 384 | 0.74 | 0.951 | ~5 ms |
| e5-small-v2 | 384 | 0.75 | 0.953 | ~5 ms |

Model recommendation

all-MiniLM-L6-v2 provides the best tradeoff between accuracy and speed. The larger all-mpnet-base-v2 offers marginal improvement (+0.006 AUROC) at 2.4x the inference cost.
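
Inference times depend heavily on hardware; a rough way to check the inference-time column on your own machine, loading the benchmarked models directly via sentence-transformers:

```python
# Rough per-embedding latency check for two of the benchmarked models.
# Numbers will vary with hardware and threading.
import time
from sentence_transformers import SentenceTransformer

for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    model.encode("warm-up call")                 # exclude the one-time load cost
    t0 = time.perf_counter()
    for _ in range(100):
        model.encode("a single response to score")
    per_call_ms = (time.perf_counter() - t0) * 1000 / 100
    print(f"{name}: ~{per_call_ms:.1f} ms per embedding")
```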

Calibration Size Sensitivity

How many calibration pairs are needed for good domain calibration?

| Pairs | DGI AUROC | \(\kappa\) |
|---|---|---|
| 5 (minimum) | 0.82 | 4.1 |
| 10 | 0.88 | 6.8 |
| 20 | 0.93 | 10.2 |
| 50 | 0.96 | 13.5 |
| 100 | 0.97 | 14.8 |
| 200 | 0.97 | 15.1 |

Diminishing returns set in around 50 pairs. The jump from 5 to 20 pairs provides the most value.
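
A curve like this can be traced with a simple subsampling experiment. Per the latency table, DGI scoring reduces to a dot product with a mean direction mu_hat fit on the calibration pairs; the scaffold below assumes that scoring rule and uses synthetic stand-in embeddings purely so it runs end to end. Substitute your own embedded pairs and labeled test set:

```python
# Scaffold: AUROC as a function of calibration size for a DGI-like scorer
# (dot product with a cached mu_hat, per the latency table). The synthetic
# arrays are placeholders to make this runnable; with real domain data,
# expect the rise-then-plateau pattern shown in the table above.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)

center = rng.normal(size=384)                        # stand-in domain direction
calib = unit(center + rng.normal(size=(200, 384)))   # grounded calibration pool
test = unit(np.vstack([center + rng.normal(size=(300, 384)),  # grounded
                       rng.normal(size=(300, 384))]))         # hallucinated
labels = np.r_[np.ones(300), np.zeros(300)]

for n in [5, 10, 20, 50, 100, 200]:
    sample = calib[rng.choice(len(calib), size=n, replace=False)]
    mu_hat = unit(sample.mean(axis=0))               # cached mean direction
    print(n, round(roc_auc_score(labels, test @ mu_hat), 3))
```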

Latency

Scoring latency on CPU (Intel Xeon, single thread):

| Operation | Time | Notes |
|---|---|---|
| Model loading (first call) | ~1.5 s | One-time cost, cached thereafter |
| Single SGI score | ~15 ms | 3 embeddings + distance computation |
| Single DGI score | ~12 ms | 2 embeddings + dot product (mu_hat cached) |
| Batch of 100 | ~0.8 s | Amortized ~8 ms per item |
| Batch of 1000 | ~6 s | Amortized ~6 ms per item |
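
Embedding dominates per-score cost (a single DGI score is ~12 ms, mostly its two ~5 ms embeddings), so the batch amortization comes largely from batched encoding. A rough, hardware-dependent harness for the batch rows:

```python
# Rough harness for the batch rows: one batched encode over many texts,
# then amortized per-item time. Numbers depend on hardware and batch size.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["an example response to score"] * 1000
model.encode("warm-up call")                  # exclude the one-time load cost

t0 = time.perf_counter()
model.encode(texts, batch_size=64)
total = time.perf_counter() - t0
print(f"{total:.2f} s total, ~{total / len(texts) * 1e3:.1f} ms per item")
```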

Comparison with LLM-as-Judge

| Method | AUROC | Latency | Deterministic | Cost |
|---|---|---|---|---|
| groundlens DGI (domain) | 0.958 | ~12 ms | Yes | $0 (local) |
| groundlens SGI | 0.88 | ~15 ms | Yes | $0 (local) |
| GPT-4o as judge | 0.91 | ~2 s | No | ~$0.01/eval |
| Claude as judge | 0.89 | ~3 s | No | ~$0.01/eval |
| Llama-3 as judge (local) | 0.82 | ~5 s | Approx. | $0 (GPU required) |

Key tradeoff

LLM judges (0.82--0.91) clearly beat generically calibrated DGI (0.76) and roughly match SGI (0.88), but domain-calibrated DGI outperforms all of them while being 100--200x faster, deterministic, and free of per-evaluation cost. The tradeoff is that domain calibration takes effort (roughly 40--60 labeled pairs).

Reproducing These Results

```bash
# Install benchmark dependencies
pip install groundlens datasets scikit-learn

# Run the confabulation benchmark
groundlens benchmark

# With a custom embedding model
groundlens benchmark --model all-mpnet-base-v2
```

All numbers on this page were produced with groundlens version 2026.4.x and, because scoring is deterministic, can be reproduced exactly using the published datasets and the default configuration.