Benchmarks Overview¶
groundlens benchmarks measure how well SGI and DGI discriminate between grounded and hallucinated responses. The primary metric is AUROC (Area Under the Receiver Operating Characteristic curve), which measures the probability that a randomly chosen grounded response scores higher than a randomly chosen hallucinated response.
What We Measure¶
AUROC (Area Under ROC)¶
AUROC ranges from 0.0 to 1.0:
| AUROC | Interpretation |
|---|---|
| 1.00 | Perfect discrimination |
| 0.90--0.99 | Excellent --- suitable for production |
| 0.80--0.90 | Good --- useful for triage with some noise |
| 0.70--0.80 | Fair --- informative but not reliable alone |
| 0.50 | Random chance --- no discrimination |
Why AUROC?¶
AUROC is threshold-independent: it evaluates the scoring function's ability to rank grounded responses above hallucinated ones, regardless of where you set the decision threshold. This is important because different deployments may use different thresholds based on their risk tolerance.
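The pairwise-ranking interpretation above can be checked directly. The toy scores below are illustrative, not benchmark data: AUROC from scikit-learn matches the fraction of (grounded, hallucinated) pairs ranked correctly, with ties counted as half.

```python
from sklearn.metrics import roc_auc_score

# Toy data: 1 = grounded, 0 = hallucinated.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.92, 0.85, 0.40, 0.55, 0.30, 0.10]

# AUROC via sklearn.
auroc = roc_auc_score(labels, scores)

# Same quantity computed as the probability that a randomly chosen
# grounded response outscores a randomly chosen hallucinated one.
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
pairwise = sum(pairs) / len(pairs)

# One of nine pairs is misranked (0.40 vs 0.55), so both equal 8/9.
assert abs(auroc - pairwise) < 1e-12
```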
Benchmark Datasets¶
Confabulation Benchmark¶
The primary benchmark dataset, published alongside arXiv:2603.13259. It contains:
- Verified grounded (question, response) pairs
- LLM-generated hallucinations (produced by instructing models to answer without access to correct information)
- Template-based confabulations (factual substitutions in correct response templates)
- Context-annotated examples (for SGI evaluation)
Dataset: cert-framework/human-confabulation-benchmark on HuggingFace.
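If you export the dataset locally (for example as JSONL), a minimal loader is easy to sketch. The field names below (`question`, `response`, `context`, `label`) are assumptions based on the Python API example later on this page, not a guaranteed schema; adapt them to the actual dataset columns.

```python
import json

def load_benchmark_jsonl(path):
    """Load benchmark records from a local JSONL file.

    Assumed schema (illustrative, not authoritative): each line is a JSON
    object with "question", "response", optional "context", and "label"
    (1 = grounded, 0 = hallucinated).
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Tiny demo file using the assumed schema.
demo = [
    {"question": "Q1", "response": "R1", "context": "C1", "label": 1},
    {"question": "Q2", "response": "R2", "label": 0},
]
with open("demo_benchmark.jsonl", "w", encoding="utf-8") as f:
    for record in demo:
        f.write(json.dumps(record) + "\n")

dataset = load_benchmark_jsonl("demo_benchmark.jsonl")
```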
Domain-Specific Benchmarks¶
Additional benchmark datasets for specific verticals:
| Domain | Pairs | Grounded | Hallucinated | Available |
|---|---|---|---|---|
| General | 200 | 100 | 100 | Bundled |
| Legal | 150 | 75 | 75 | On request |
| Medical | 180 | 90 | 90 | On request |
| Financial | 120 | 60 | 60 | On request |
How to Run Benchmarks¶
CLI¶
```bash
# Default benchmark (confabulation benchmark)
groundlens benchmark

# Custom dataset
groundlens benchmark --dataset cert-framework/human-confabulation-benchmark

# Custom model
groundlens benchmark --model all-mpnet-base-v2
```
Python API¶
```python
from groundlens import compute_sgi, compute_dgi
from sklearn.metrics import roc_auc_score

# Load your benchmark dataset
dataset = load_benchmark()  # Your loading logic

sgi_scores, sgi_labels = [], []
dgi_scores, dgi_labels = [], []

for item in dataset:
    question = item["question"]
    response = item["response"]
    context = item.get("context")
    label = item["label"]  # 1 = grounded, 0 = hallucinated

    # SGI (when context is available)
    if context:
        sgi_result = compute_sgi(question=question, context=context, response=response)
        sgi_scores.append(sgi_result.value)
        sgi_labels.append(label)

    # DGI (always)
    dgi_result = compute_dgi(question=question, response=response)
    dgi_scores.append(dgi_result.value)
    dgi_labels.append(label)

# Compute AUROC
print(f"SGI AUROC: {roc_auc_score(sgi_labels, sgi_scores):.4f}")
print(f"DGI AUROC: {roc_auc_score(dgi_labels, dgi_scores):.4f}")
```
Evaluation Protocol¶
To ensure fair comparison, all benchmarks follow the same protocol:
- Fixed embedding model: all-MiniLM-L6-v2 by default, unless stated otherwise.
- No threshold tuning on test data: Thresholds are fixed before evaluation.
- Separate calibration and test sets: For DGI, calibration pairs are never in the test set.
- Stratified evaluation: AUROC is computed separately for each hallucination type (divergent, tangential, confabulation).
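The stratified step can be sketched as a small helper that scores grounded responses against each hallucination type separately. The item schema here (`score`, `label`, `type` fields) is an illustrative assumption, not the dataset's actual column names.

```python
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def stratified_auroc(items):
    """Compute AUROC per hallucination type.

    Assumed schema (illustrative): each item has "score", "label"
    (1 = grounded, 0 = hallucinated), and hallucinated items carry a
    "type" such as "divergent", "tangential", or "confabulation".
    """
    grounded = [it for it in items if it["label"] == 1]
    by_type = defaultdict(list)
    for it in items:
        if it["label"] == 0:
            by_type[it["type"]].append(it)

    results = {}
    for htype, negatives in by_type.items():
        # All grounded examples vs. one hallucination type at a time.
        subset = grounded + negatives
        labels = [it["label"] for it in subset]
        scores = [it["score"] for it in subset]
        results[htype] = roc_auc_score(labels, scores)
    return results

# Synthetic scores: divergent is easy to separate, tangential is not.
items = [
    {"score": 0.90, "label": 1},
    {"score": 0.80, "label": 1},
    {"score": 0.10, "label": 0, "type": "divergent"},
    {"score": 0.85, "label": 0, "type": "tangential"},
]
results = stratified_auroc(items)
```

Reporting per-type AUROC like this surfaces weaknesses that an aggregate number hides, e.g. a scorer that separates divergent answers perfectly but struggles on tangential ones.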
Reproducing Results¶
All reported results can be reproduced exactly because:
- groundlens scoring is deterministic (no sampling)
- Benchmark datasets are versioned and publicly available
- The embedding model is fixed and downloadable
```bash
# Reproduce the headline DGI AUROC 0.958 result
groundlens benchmark --dataset cert-framework/human-confabulation-benchmark
```
See Results for the full numbers.