Skip to content

Domain Calibration Guide

This step-by-step guide walks through the complete process of calibrating DGI for a specific domain, evaluating the improvement, and deploying the calibration to production.

Overview

Domain calibration replaces the generic reference direction \(\hat{\boldsymbol{\mu}}\) with one computed from verified (question, response) pairs from your specific domain. This typically improves AUROC from ~0.76 to 0.90--0.99.

Step 1: Collect Calibration Pairs

You need 20--100 verified (question, response) pairs where the response is known to be factually correct and grounded. Quality matters more than quantity.

Sources of Calibration Pairs

Source Pros Cons
Existing QA datasets Pre-verified, diverse May not match your domain exactly
Human-verified LLM outputs Domain-matched Requires expert review effort
Documentation FAQs High quality, authoritative Limited to documented topics
Support ticket resolutions Real-world domain coverage May need cleaning

Prepare the CSV

question,response
"What is the recommended dosage for metformin?","The initial dose is 500mg twice daily or 850mg once daily, titrated up to 2000mg/day."
"What are the contraindications for ACE inhibitors?","ACE inhibitors are contraindicated in bilateral renal artery stenosis, pregnancy, and history of angioedema."
"How should warfarin therapy be monitored?","INR should be monitored at least weekly during initiation and monthly once stable."

Pair quality checklist

  • [ ] Each response is factually correct (verified by a domain expert)
  • [ ] Questions are representative of real usage in your domain
  • [ ] Responses are in the style your LLM produces (not copy-pasted from textbooks unless that is your use case)
  • [ ] No duplicate or near-duplicate pairs
  • [ ] At least 20 pairs (50+ recommended)

Step 2: Run Calibration

from groundlens import calibrate

result = calibrate(
    csv_path="medical_pairs.csv",
    metadata={
        "domain": "clinical-pharmacy",
        "source": "verified-qa-dataset-v2",
        "date": "2026-04-22",
    },
)

print(f"Pairs:         {result.n_pairs}")
print(f"Embedding dim: {result.embedding_dim}")
print(f"Concentration: {result.concentration:.2f}")
groundlens calibrate \
    --pairs medical_pairs.csv \
    --output calibration_medical.json

Evaluate the Concentration Parameter

The concentration \(\kappa\) tells you how consistent your calibration data is:

\(\kappa\) Quality Action
> 10 Excellent Proceed to evaluation
5--10 Good Proceed, but consider adding more pairs
1--5 Weak Review pairs for noise or mixed domains
< 1 Poor Calibration data is too diverse; split into sub-domains

Step 3: Evaluate Improvement

Compare generic vs. calibrated DGI on a held-out test set.

from groundlens import compute_dgi
from sklearn.metrics import roc_auc_score

# Load test data: list of (question, response, is_grounded) triples
test_data = load_test_set("medical_test.csv")

# Score with generic calibration
generic_scores = []
for q, r, label in test_data:
    result = compute_dgi(question=q, response=r)
    generic_scores.append((result.value, label))

# Score with domain calibration
calibrated_scores = []
for q, r, label in test_data:
    result = compute_dgi(
        question=q,
        response=r,
        reference_csv="medical_pairs.csv",
    )
    calibrated_scores.append((result.value, label))

# Compare AUROC
generic_auroc = roc_auc_score(
    [s[1] for s in generic_scores],
    [s[0] for s in generic_scores],
)
calibrated_auroc = roc_auc_score(
    [s[1] for s in calibrated_scores],
    [s[0] for s in calibrated_scores],
)

print(f"Generic AUROC:    {generic_auroc:.4f}")
print(f"Calibrated AUROC: {calibrated_auroc:.4f}")
print(f"Improvement:      +{calibrated_auroc - generic_auroc:.4f}")

Use a separate test set

Never evaluate on the same data you used for calibration. The calibration pairs define the reference direction; evaluating on them would be circular.

Step 4: Save for Production

# Save the calibration result
result.save("calibration_medical.json")

# Verify it loads correctly
from groundlens.calibrate import CalibrationResult
loaded = CalibrationResult.load("calibration_medical.json")
print(f"Loaded: {loaded.n_pairs} pairs, kappa={loaded.concentration:.2f}")

Step 5: Deploy

Use the calibration CSV in production scoring:

from groundlens import evaluate

score = evaluate(
    question=user_question,
    response=llm_response,
    reference_csv="medical_pairs.csv",
)

Or with the class API:

from groundlens import DGI

# Initialize once at startup
dgi = DGI(reference_csv="medical_pairs.csv")

# Score each response
result = dgi.score(question=q, response=r)

Recalibration Schedule

Recalibrate when:

  • Domain shifts: Your domain evolves (new regulations, new terminology)
  • Model changes: You switch to a different sentence-transformer model
  • Performance degradation: Monitoring shows declining discrimination
  • Quarterly: As a general best practice, recalibrate every 3 months

Multi-Domain Calibration

For systems that serve multiple domains, maintain separate calibration files:

from groundlens import DGI

# Initialize domain-specific scorers
dgi_medical = DGI(reference_csv="calibration_medical.csv")
dgi_legal = DGI(reference_csv="calibration_legal.csv")
dgi_finance = DGI(reference_csv="calibration_finance.csv")

# Route based on domain detection
def score_by_domain(question, response, domain):
    scorers = {
        "medical": dgi_medical,
        "legal": dgi_legal,
        "finance": dgi_finance,
    }
    scorer = scorers.get(domain, DGI())  # fall back to generic
    return scorer.score(question=question, response=response)

Troubleshooting

Low \(\kappa\) after calibration

Cause: Calibration pairs span multiple topics or have inconsistent quality.

Fix: Filter pairs to a narrower domain, remove outliers, or split into sub-domains.

No AUROC improvement

Cause: The test set may not match the calibration domain, or the generic direction already captures the relevant pattern.

Fix: Verify the test set is from the same domain as the calibration data. Check that test set labels are accurate.

DGI scores cluster near zero

Cause: The displacement vectors are nearly orthogonal to the reference direction.

Fix: This usually indicates a calibration/scoring domain mismatch. Verify you are using the correct calibration file.