Benchmarks Overview¶
groundlens benchmarks measure how well SGI and DGI discriminate between grounded and hallucinated responses. The primary metric is AUROC (Area Under the Receiver Operating Characteristic curve), which measures the probability that a randomly chosen grounded response scores higher than a randomly chosen hallucinated response.
What We Measure¶
AUROC (Area Under ROC)¶
AUROC ranges from 0.0 to 1.0:
| AUROC | Interpretation |
|---|---|
| 1.00 | Perfect discrimination |
| 0.90--0.99 | Excellent --- suitable for production |
| 0.80--0.90 | Good --- useful for triage with some noise |
| 0.70--0.80 | Fair --- informative but not reliable alone |
| 0.50 | Random chance --- no discrimination |
Why AUROC?¶
AUROC is threshold-independent: it evaluates the scoring function's ability to rank grounded responses above hallucinated ones, regardless of where you set the decision threshold. This is important because different deployments may use different thresholds based on their risk tolerance.
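The pairwise-ranking interpretation above can be checked directly. The toy scores below are illustrative, not benchmark data: AUROC from scikit-learn matches the fraction of (grounded, hallucinated) pairs ranked correctly, with ties counted as half.

```python
from sklearn.metrics import roc_auc_score

# Toy data: 1 = grounded, 0 = hallucinated.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.92, 0.85, 0.40, 0.55, 0.30, 0.10]

# AUROC via sklearn.
auroc = roc_auc_score(labels, scores)

# Same quantity computed as the probability that a randomly chosen
# grounded response outscores a randomly chosen hallucinated one.
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
pairwise = sum(pairs) / len(pairs)

# One of nine pairs is misranked (0.40 vs 0.55), so both equal 8/9.
assert abs(auroc - pairwise) < 1e-12
```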
Benchmark Datasets¶
Confabulation Benchmark¶
The primary benchmark dataset, published alongside arXiv:2603.13259. It contains:
- Verified grounded (question, response) pairs
- LLM-generated hallucinations (produced by instructing models to answer without access to correct information)
- Template-based confabulations (factual substitutions in correct response templates)
- Context-annotated examples (for SGI evaluation)
Dataset: cert-framework/human-confabulation-benchmark on HuggingFace.
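If you export the dataset locally (for example as JSONL), a minimal loader is easy to sketch. The field names below (`question`, `response`, `context`, `label`) are assumptions based on the Python API example later on this page, not a guaranteed schema; adapt them to the actual dataset columns.

```python
import json

def load_benchmark_jsonl(path):
    """Load benchmark records from a local JSONL file.

    Assumed schema (illustrative, not authoritative): each line is a JSON
    object with "question", "response", optional "context", and "label"
    (1 = grounded, 0 = hallucinated).
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Tiny demo file using the assumed schema.
demo = [
    {"question": "Q1", "response": "R1", "context": "C1", "label": 1},
    {"question": "Q2", "response": "R2", "label": 0},
]
with open("demo_benchmark.jsonl", "w", encoding="utf-8") as f:
    for record in demo:
        f.write(json.dumps(record) + "\n")

dataset = load_benchmark_jsonl("demo_benchmark.jsonl")
```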
Domain-Specific Benchmarks¶
Additional benchmark datasets for specific verticals:
| Domain | Pairs | Grounded | Hallucinated | Available |
|---|---|---|---|---|
| General | 200 | 100 | 100 | Bundled |
| Legal | 150 | 75 | 75 | On request |
| Medical | 180 | 90 | 90 | On request |
| Financial | 120 | 60 | 60 | On request |
How to Run Benchmarks¶
CLI¶
```bash
# Default benchmark (confabulation benchmark)
groundlens benchmark

# Custom dataset
groundlens benchmark --dataset cert-framework/human-confabulation-benchmark

# Custom model
groundlens benchmark --model all-mpnet-base-v2
```
Python API¶
```python
from groundlens import compute_sgi, compute_dgi
from sklearn.metrics import roc_auc_score

# Load your benchmark dataset
dataset = load_benchmark()  # Your loading logic

sgi_scores, sgi_labels = [], []
dgi_scores, dgi_labels = [], []

for item in dataset:
    question = item["question"]
    response = item["response"]
    context = item.get("context")
    label = item["label"]  # 1 = grounded, 0 = hallucinated

    # SGI (when context is available)
    if context:
        sgi_result = compute_sgi(question=question, context=context, response=response)
        sgi_scores.append(sgi_result.value)
        sgi_labels.append(label)

    # DGI (always)
    dgi_result = compute_dgi(question=question, response=response)
    dgi_scores.append(dgi_result.value)
    dgi_labels.append(label)

# Compute AUROC
print(f"SGI AUROC: {roc_auc_score(sgi_labels, sgi_scores):.4f}")
print(f"DGI AUROC: {roc_auc_score(dgi_labels, dgi_scores):.4f}")
```
Evaluation Protocol¶
To ensure fair comparison, all benchmarks follow the same protocol:
- Fixed embedding model: all-MiniLM-L6-v2 by default, unless stated otherwise.
- No threshold tuning on test data: Thresholds are fixed before evaluation.
- Separate calibration and test sets: For DGI, calibration pairs are never in the test set.
- Stratified evaluation: AUROC is computed separately for each hallucination type (divergent, tangential, confabulation).
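The stratified step can be sketched as a small helper that scores grounded responses against each hallucination type separately. The item schema here (`score`, `label`, `type` fields) is an illustrative assumption, not the dataset's actual column names.

```python
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def stratified_auroc(items):
    """Compute AUROC per hallucination type.

    Assumed schema (illustrative): each item has "score", "label"
    (1 = grounded, 0 = hallucinated), and hallucinated items carry a
    "type" such as "divergent", "tangential", or "confabulation".
    """
    grounded = [it for it in items if it["label"] == 1]
    by_type = defaultdict(list)
    for it in items:
        if it["label"] == 0:
            by_type[it["type"]].append(it)

    results = {}
    for htype, negatives in by_type.items():
        # All grounded examples vs. one hallucination type at a time.
        subset = grounded + negatives
        labels = [it["label"] for it in subset]
        scores = [it["score"] for it in subset]
        results[htype] = roc_auc_score(labels, scores)
    return results

# Synthetic scores: divergent is easy to separate, tangential is not.
items = [
    {"score": 0.90, "label": 1},
    {"score": 0.80, "label": 1},
    {"score": 0.10, "label": 0, "type": "divergent"},
    {"score": 0.85, "label": 0, "type": "tangential"},
]
results = stratified_auroc(items)
```

Reporting per-type AUROC like this surfaces weaknesses that an aggregate number hides, e.g. a scorer that separates divergent answers perfectly but struggles on tangential ones.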
Reproducing Results¶
All reported results can be reproduced exactly because:
- groundlens scoring is deterministic (no sampling)
- Benchmark datasets are versioned and publicly available
- The embedding model is fixed and downloadable
```bash
# Reproduce the headline DGI AUROC 0.958 result
groundlens benchmark --dataset cert-framework/human-confabulation-benchmark
```
See Results for the full numbers.