# Benchmark Results
All results use the default embedding model (all-MiniLM-L6-v2, 384 dimensions) unless otherwise noted. AUROC (area under the ROC curve) is the primary metric throughout.
## Headline Results
| Method | Calibration | AUROC | Dataset |
|---|---|---|---|
| DGI | Domain-specific | 0.958 | Confabulation benchmark (LLM-generated) |
| DGI | Generic (bundled) | 0.76 | Mixed QA |
| SGI | N/A | 0.88 | RAG verification (context-grounded) |
## DGI: Generic vs. Domain-Specific Calibration
The single biggest improvement comes from domain-specific calibration:
| Calibration | AUROC | \(\kappa\) | Pairs | Notes |
|---|---|---|---|---|
| Generic (bundled) | 0.76 | 3.2 | ~500 | Multi-domain, diverse |
| Legal | 0.94 | 11.4 | 47 | Contract/regulatory QA |
| Medical | 0.97 | 14.2 | 63 | Clinical pharmacy QA |
| Financial | 0.92 | 8.7 | 38 | Compliance/regulatory QA |
| Technical docs | 0.95 | 12.1 | 55 | Software documentation QA |
| Customer support | 0.91 | 7.9 | 42 | Product support QA |
**Key insight:** Domain calibration with just 40--60 pairs typically improves AUROC by 0.15--0.21 over the generic baseline. The concentration parameter \(\kappa\) predicts calibration quality: \(\kappa > 10\) consistently produces AUROC above 0.93.
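To make the \(\kappa\) column concrete, here is a minimal sketch of how a von Mises-Fisher concentration estimate can be computed from calibration data. It assumes calibration amounts to fitting a mean direction \(\hat{\mu}\) on unit-normalized answer embeddings; `fit_direction` and the sample answers are illustrative, not the groundlens API.

```python
# Minimal sketch: estimate (mu_hat, kappa) from known-grounded domain answers.
# Assumes a von Mises-Fisher model on the unit sphere; not the groundlens API.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def fit_direction(grounded_answers: list[str]) -> tuple[np.ndarray, float]:
    emb = model.encode(grounded_answers, normalize_embeddings=True)  # (n, 384) unit rows
    resultant = emb.mean(axis=0)
    r_bar = float(np.linalg.norm(resultant))         # mean resultant length in (0, 1)
    mu_hat = resultant / r_bar                       # estimated mean direction
    d = emb.shape[1]
    kappa = r_bar * (d - r_bar**2) / (1 - r_bar**2)  # Banerjee et al. approximation
    return mu_hat, kappa

mu_hat, kappa = fit_direction([
    "Amoxicillin is dosed at 500 mg every 8 hours for adults.",
    "Dose adjustment is required in renal impairment.",
    # ...40-60 grounded domain answers in practice
])
print(f"kappa = {kappa:.1f}")
```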
## DGI by Hallucination Type
The confabulation benchmark breaks down performance by hallucination type (arXiv:2603.13259):
| Hallucination Type | DGI AUROC (domain) | DGI AUROC (generic) |
|---|---|---|
| Divergent (topic drift) | 0.98 | 0.85 |
| Fabrication (invented facts) | 0.96 | 0.78 |
| Tangential (partial grounding) | 0.89 | 0.71 |
| Template confabulation | 0.62 | 0.54 |
| Expert-crafted confabulation | 0.51 | 0.50 |
The results clearly show the confabulation boundary: performance degrades as hallucinations become more distributionally similar to grounded responses. See Confabulation Boundary for the theoretical analysis.
## SGI Results
SGI is evaluated on datasets where context is available:
| Scenario | AUROC | Notes |
|---|---|---|
| RAG verification (context used vs. ignored) | 0.88 | Standard RAG setup |
| Document QA (answer from doc vs. parametric) | 0.91 | Long-document QA |
| Summarization (faithful vs. hallucinated) | 0.84 | Summary grounding check |
SGI requires no calibration: it scores the geometric structure of the question/context/response triple directly.
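For intuition, the sketch below shows the three-embedding structure SGI operates on. The similarity gap used as a score here is a stand-in chosen for illustration; it is not the library's actual SGI formula.

```python
# Illustrative only: the question/context/response geometry that SGI scores.
# The similarity gap below is a stand-in, not the actual SGI formula.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def sgi_like_score(question: str, context: str, response: str) -> float:
    q, c, r = model.encode([question, context, response], normalize_embeddings=True)
    # A context-grounded response should sit closer to the context than the
    # question alone already does; a purely parametric answer will not.
    return float(r @ c - q @ c)

score = sgi_like_score(
    "What is the refund window?",
    "Refunds are accepted within 30 days of purchase with a receipt.",
    "Refunds are accepted within 30 days of purchase.",
)
```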
## Embedding Model Comparison
Different embedding models produce different AUROC values:
| Model | Dimensions | DGI AUROC (generic) | DGI AUROC (domain) | Inference time |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 0.76 | 0.958 | ~5 ms |
| all-mpnet-base-v2 | 768 | 0.79 | 0.964 | ~12 ms |
| bge-small-en-v1.5 | 384 | 0.74 | 0.951 | ~5 ms |
| e5-small-v2 | 384 | 0.75 | 0.953 | ~5 ms |
**Model recommendation:** all-MiniLM-L6-v2 offers the best tradeoff between accuracy and speed. The larger all-mpnet-base-v2 buys a marginal +0.006 AUROC (domain-calibrated) at 2.4x the inference cost.
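The per-text timings are easy to sanity-check on your own hardware with a rough harness like the one below; expect numbers in the same ballpark rather than identical, since they depend on CPU, threading, and batch size.

```python
# Rough per-text embedding latency; results vary with hardware and batching.
import time
from sentence_transformers import SentenceTransformer

for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(name)
    model.encode("warm-up")                           # exclude one-time load cost
    t0 = time.perf_counter()
    model.encode(["A single response to embed."] * 100, batch_size=1)
    per_text_ms = (time.perf_counter() - t0) * 1000 / 100
    print(f"{name}: ~{per_text_ms:.1f} ms per text")
```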
## Calibration Size Sensitivity
How many calibration pairs are needed for good domain calibration?
| Pairs | DGI AUROC | \(\kappa\) |
|---|---|---|
| 5 (minimum) | 0.82 | 4.1 |
| 10 | 0.88 | 6.8 |
| 20 | 0.93 | 10.2 |
| 50 | 0.96 | 13.5 |
| 100 | 0.97 | 14.8 |
| 200 | 0.97 | 15.1 |
Diminishing returns set in around 50 pairs. The jump from 5 to 20 pairs provides the most value.
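Continuing the calibration sketch above, a sweep of this shape (scored with `roc_auc_score` from scikit-learn, which the benchmark itself installs) reproduces the pattern. `model` and `fit_direction` come from the earlier snippet; the domain answers and labeled held-out set are yours to supply.

```python
# Sweep calibration size; reuses model/fit_direction from the earlier sketch.
from sklearn.metrics import roc_auc_score

def calibration_sweep(domain_answers, labeled_eval, sizes=(5, 10, 20, 50, 100)):
    """labeled_eval: held-out (response_text, is_grounded) pairs for AUROC."""
    texts, labels = zip(*labeled_eval)
    eval_emb = model.encode(list(texts), normalize_embeddings=True)
    for n in sizes:
        if n > len(domain_answers):
            break
        mu_hat, kappa = fit_direction(domain_answers[:n])
        scores = eval_emb @ mu_hat                # alignment with the domain direction
        print(f"{n:>3} pairs: kappa={kappa:5.1f}  AUROC={roc_auc_score(labels, scores):.3f}")
```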
## Latency
Scoring latency on CPU (Intel Xeon, single thread):
| Operation | Time | Notes |
|---|---|---|
| Model loading (first call) | ~1.5 s | One-time cost, cached thereafter |
| Single SGI score | ~15 ms | 3 embeddings + distance computation |
| Single DGI score | ~12 ms | 2 embeddings + dot product (mu_hat cached) |
| Batch of 100 | ~0.8 s | Amortized ~8 ms per item |
| Batch of 1000 | ~6 s | Amortized ~6 ms per item |
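The batch amortization follows the usual pattern: one batched forward pass through the encoder, then a single matrix-vector product against the cached mean direction. A sketch, reusing `model` and `mu_hat` from the calibration snippet above:

```python
# Batch scoring: one batched encode() call, one mat-vec against cached mu_hat.
import numpy as np

def dgi_like_batch(responses: list[str], mu_hat: np.ndarray) -> np.ndarray:
    emb = model.encode(responses, normalize_embeddings=True)  # single forward pass
    return emb @ mu_hat                          # (len(responses),) alignment scores
```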
## Comparison with LLM-as-Judge
| Method | AUROC | Latency | Deterministic | Cost |
|---|---|---|---|---|
| groundlens DGI (domain) | 0.958 | ~12 ms | Yes | $0 (local) |
| groundlens SGI | 0.88 | ~15 ms | Yes | $0 (local) |
| GPT-4o as judge | 0.91 | ~2 s | No | ~$0.01/eval |
| Claude as judge | 0.89 | ~3 s | No | ~$0.01/eval |
| Llama-3 as judge (local) | 0.82 | ~5 s | Approx. | $0 (GPU required) |
**Key tradeoff:** The strongest LLM judges (AUROC 0.89--0.91) beat generic-calibrated DGI (0.76) and roughly match SGI (0.88), but domain-calibrated DGI outperforms every LLM judge while being 100--200x faster, deterministic, and free at evaluation time. The cost is upfront: groundlens needs a modest domain calibration set to reach that level.
## Reproducing These Results
```bash
# Install benchmark dependencies
pip install groundlens datasets scikit-learn

# Run the confabulation benchmark
groundlens benchmark

# With a custom embedding model
groundlens benchmark --model all-mpnet-base-v2
```
All numbers in this page were produced with groundlens version 2026.4.x and can be reproduced exactly using the published datasets and default configuration.