
Calibration

DGI accuracy depends on the quality of the reference direction \(\hat{\mu}\). Domain-specific calibration is the single most impactful step you can take to improve groundlens accuracy in production.

Why Calibrate?

The bundled generic reference direction is trained on diverse question-answer pairs across many domains. It captures a "universal" grounded displacement direction that achieves an AUROC of about 0.76, which is useful for prototyping but insufficient for production.

Different domains have different displacement patterns:

  • Legal: Questions about statutes produce responses with specific citation patterns
  • Medical: Clinical questions produce responses with diagnostic terminology shifts
  • Financial: Regulatory questions produce responses with compliance-specific elaboration

Domain-specific calibration captures these patterns, typically improving AUROC to 0.90--0.99.

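| Domain         | Generic AUROC | Calibrated AUROC | Improvement |
|----------------|---------------|------------------|-------------|
| Generic        | 0.76          | -                | Baseline    |
| Legal          | 0.76          | 0.94             | +0.18       |
| Medical        | 0.76          | 0.97             | +0.21       |
| Financial      | 0.76          | 0.92             | +0.16       |
| Technical docs | 0.76          | 0.95             | +0.19       |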

How to Collect Calibration Pairs

You need 20--100 verified (question, response) pairs where the response is known to be factually grounded. Sources:

  1. Existing QA datasets: If you have a validated QA dataset for your domain, use it directly.
  2. Human-verified LLM outputs: Run your LLM on representative questions and have a subject-matter expert verify the answers.
  3. Documentation extraction: Extract question-answer pairs from official documentation, FAQs, or knowledge bases.

Quality over quantity

20 high-quality pairs outperform 200 noisy pairs. Each pair should represent a genuine question and a verified correct response from your target domain.

CSV Format

question,response
"What is the recommended dosage for ibuprofen?","The recommended dosage is 200-400mg every 4-6 hours for adults."
"What are the contraindications for aspirin?","Aspirin is contraindicated in patients with aspirin allergy, active bleeding, or hemophilia."

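Before calibrating, it can help to check that the file really has the two-column layout shown above. The following is a minimal sketch using only the standard library; `dump_pairs` and `load_pairs` are hypothetical helpers for illustration, not part of groundlens.

```python
import csv
import io

EXPECTED_HEADER = ["question", "response"]


def dump_pairs(pairs):
    """Serialize (question, response) tuples to CSV text with the expected header."""
    buffer = io.StringIO()
    # Default quoting wraps any field containing commas or quotes automatically.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(EXPECTED_HEADER)
    writer.writerows(pairs)
    return buffer.getvalue()


def load_pairs(text):
    """Parse CSV text and reject a wrong header or empty fields."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    pairs = []
    for row_no, row in enumerate(reader, start=1):
        question = (row["question"] or "").strip()
        response = (row["response"] or "").strip()
        if not question or not response:
            raise ValueError(f"empty field in data row {row_no}")
        pairs.append((question, response))
    return pairs
```

Writing the file with the `csv` module rather than by string concatenation avoids broken rows when a response contains commas or quotation marks.
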
The calibrate() API

from groundlens import calibrate

# From a CSV file
result = calibrate(csv_path="my_domain_pairs.csv")

# From pairs directly
result = calibrate(
    pairs=[
        ("What is the dosage for X?", "The recommended dosage is Y."),
        ("What are the side effects?", "Common side effects include Z."),
        # ... at least 5 pairs, ideally 20-100
    ],
    metadata={"domain": "pharmacy", "date": "2026-04-22"},
)

print(f"Pairs:         {result.n_pairs}")
print(f"Embedding dim: {result.embedding_dim}")
print(f"Concentration: {result.concentration:.2f}")

Understanding the Result

The CalibrationResult contains:

| Field         | Type    | Description                                                          |
|---------------|---------|----------------------------------------------------------------------|
| model         | str     | Sentence-transformer model used                                      |
| n_pairs       | int     | Number of calibration pairs                                          |
| embedding_dim | int     | Dimensionality of the embedding space                                |
| mu_hat        | ndarray | The computed reference direction vector                              |
| concentration | float   | Estimated \(\kappa\) parameter of the von Mises-Fisher distribution |
| metadata      | dict    | User-attached metadata                                               |

The Concentration Parameter (\(\kappa\))

The concentration parameter indicates how consistent the displacement directions are in your calibration data:

  • \(\kappa\) > 10: Highly consistent. Your domain has a strong, clear grounded direction. Expect good discrimination.
  • \(\kappa\) 5--10: Moderately consistent. Calibration quality is reasonable.
  • \(\kappa\) < 5: Low consistency. The calibration pairs may be too diverse, noisy, or from mixed domains. Consider filtering.
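
To make the idea of "consistency" concrete, here is a small pure-Python sketch of how a mean direction and a concentration estimate can be derived from a set of direction vectors. It uses a common closed-form approximation for the von Mises-Fisher \(\kappa\) (Banerjee et al., 2005); this page does not say which estimator groundlens uses internally, so treat the numbers as illustrative. The tiers above apply to `result.concentration`, not necessarily to values from this sketch. All function names here are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def mean_direction_and_kappa(vectors):
    """Return (mu_hat, kappa) for a list of equal-length direction vectors.

    kappa uses the approximation r * (d - r^2) / (1 - r^2), where r is the
    length of the mean of the unit vectors and d is the dimensionality.
    This estimator is an assumption, not the confirmed groundlens method.
    """
    units = [normalize(v) for v in vectors]
    n, dim = len(units), len(units[0])
    mean = [sum(u[i] for u in units) / n for i in range(dim)]
    r_bar = math.sqrt(sum(x * x for x in mean))
    if r_bar == 0.0:
        raise ValueError("directions cancel out; no mean direction exists")
    mu_hat = [x / r_bar for x in mean]
    denominator = 1.0 - r_bar ** 2
    if denominator <= 0.0:
        return mu_hat, math.inf  # all directions identical
    return mu_hat, r_bar * (dim - r_bar ** 2) / denominator


def consistency_tier(kappa):
    """Map kappa to the qualitative tiers listed above."""
    if kappa > 10:
        return "high"
    if kappa >= 5:
        return "moderate"
    return "low"
```

Tightly clustered directions push the mean resultant length toward 1 and \(\kappa\) upward; scattered directions shrink both.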

Saving and Loading

# Save calibration for production use
result.save("calibration_pharmacy.json")

# Load in production
from groundlens.calibrate import CalibrationResult
loaded = CalibrationResult.load("calibration_pharmacy.json")
print(loaded.concentration)

The saved JSON contains all fields needed to reconstruct the reference direction without recomputing from pairs.

Using Calibration in Production

# Option 1: compute_dgi function
from groundlens import compute_dgi

result = compute_dgi(
    question="What is the dosage for X?",
    response="The recommended dosage is Y.",
    reference_csv="my_domain_pairs.csv",
)

# Option 2: DGI class
from groundlens import DGI

dgi = DGI(reference_csv="my_domain_pairs.csv")
result = dgi.score(question="...", response="...")

# Option 3: evaluate function
from groundlens import evaluate

score = evaluate(
    question="...",
    response="...",
    reference_csv="my_domain_pairs.csv",
)

Model consistency

The calibration must use the same embedding model as the scoring. If you calibrate with all-MiniLM-L6-v2, you must score with all-MiniLM-L6-v2. Mixing models produces undefined behavior because the embedding spaces are geometrically different.
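
One way to enforce this rule is to fail fast at startup. The sketch below is a hypothetical guard, not a groundlens function; it assumes you compare the `model` field of a loaded CalibrationResult against the model name configured for scoring.

```python
def check_model_consistency(calibration_model: str, scoring_model: str) -> None:
    """Raise if the calibration and scoring embedding models differ.

    Uses an exact string comparison, so aliases of the same model
    (for example with and without an organization prefix) are rejected too.
    """
    if calibration_model != scoring_model:
        raise ValueError(
            f"calibration used {calibration_model!r} but scoring uses "
            f"{scoring_model!r}; recalibrate with the scoring model"
        )


# Typical call, assuming `loaded` is a CalibrationResult:
#     check_model_consistency(loaded.model, "all-MiniLM-L6-v2")
```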

Fitting decision thresholds (fit_thresholds)

Calibrating mu_hat sets the reference direction. Choosing the cutoff at which a score flags for review is a separate decision. fit_thresholds fits both the SGI review threshold and the DGI pass threshold from a small labeled set by maximizing Youden's J for the rule "value >= threshold implies grounded".

Each example is a mapping with question, response, and label (1 = ungrounded / hallucinated, 0 = grounded). Add context to also fit an SGI threshold:

from groundlens import fit_thresholds

examples = [
    {"question": "Q?", "context": "C.", "response": "grounded A.", "label": 0},
    {"question": "Q?", "context": "C.", "response": "off-topic A.", "label": 1},
    # ... ideally 20+ examples spanning both classes
]

fit = fit_thresholds(examples)
fit.dgi_pass     # fitted DGI cutoff
fit.sgi_review   # fitted SGI cutoff (None if no contexts were supplied)
fit.n            # number of examples used
fit.metric       # "youden_j"

fit_thresholds accepts the same model=, reference_csv=, and encoder= arguments as the scoring functions, so you can fit thresholds with the exact encoder and reference direction you score with. It raises ValueError unless the examples include both classes (at least one grounded and one ungrounded).
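
For intuition about what maximizing Youden's J means, here is a minimal standalone sketch of the selection rule under the labeling convention above (1 = ungrounded, 0 = grounded; a score at or above the threshold counts as grounded). It is an illustration of the statistic, not the groundlens implementation; `fit_threshold_youden` is a hypothetical name and the candidate-threshold strategy is an assumption.

```python
def fit_threshold_youden(scores, labels):
    """Return (threshold, j) maximizing J = sensitivity + specificity - 1.

    Sensitivity: share of grounded examples (label 0) with score >= threshold.
    Specificity: share of ungrounded examples (label 1) with score < threshold.
    Candidates are the observed scores; ties keep the lowest threshold.
    """
    grounded = [s for s, y in zip(scores, labels) if y == 0]
    ungrounded = [s for s, y in zip(scores, labels) if y == 1]
    if not grounded or not ungrounded:
        raise ValueError("both classes must be present")

    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(scores)):
        sensitivity = sum(s >= threshold for s in grounded) / len(grounded)
        specificity = sum(s < threshold for s in ungrounded) / len(ungrounded)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j
```

With perfectly separated classes, J reaches 1.0 at the lowest grounded score; with overlapping classes, J drops and the chosen cutoff trades missed hallucinations against false alarms.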

Thresholds and mu_hat are encoder-specific

The bundled SGI/DGI thresholds and the bundled DGI mu_hat are calibrated for the default encoder. If you change the encoder or model, you must re-fit: pass your encoder to both calibrate(...) (for mu_hat) and fit_thresholds(...) (for the cutoffs). groundlens emits a one-time UserWarning when you score with a non-default encoder/model against the bundled constants. See the Custom Encoders guide.

Next Steps