API Reference

This page provides the complete API reference for groundlens. All public classes and functions are documented with their signatures, parameters, return types, and examples.

For auto-generated documentation from source docstrings, ensure mkdocstrings is configured in your MkDocs build.

Core Functions

compute_sgi

groundlens.sgi.compute_sgi(question: str, context: str, response: str, *, model: str = DEFAULT_MODEL) -> SGIResult

Compute the Semantic Grounding Index for a response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question | str | The input query. | required |
| context | str | Source document, retrieved chunks, or reference text. | required |
| response | str | The LLM output to evaluate. | required |
| model | str | Sentence transformer model name. Default all-MiniLM-L6-v2. | DEFAULT_MODEL |

Returns:

| Type | Description |
| --- | --- |
| SGIResult | SGIResult with raw score, normalized score, and flag status. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If any input string is empty. |

Example

>>> from groundlens import compute_sgi
>>> result = compute_sgi(
...     question="What is the capital of France?",
...     context="France is in Western Europe. Its capital is Paris.",
...     response="The capital of France is Paris.",
... )
>>> result.flagged
False

Source code in src/groundlens/sgi.py
def compute_sgi(
    question: str,
    context: str,
    response: str,
    *,
    model: str = DEFAULT_MODEL,
) -> SGIResult:
    """Compute the Semantic Grounding Index for a response.

    Args:
        question: The input query.
        context: Source document, retrieved chunks, or reference text.
        response: The LLM output to evaluate.
        model: Sentence transformer model name. Default ``all-MiniLM-L6-v2``.

    Returns:
        SGIResult with raw score, normalized score, and flag status.

    Raises:
        ValueError: If any input string is empty.

    Example:
        >>> from groundlens import compute_sgi
        >>> result = compute_sgi(
        ...     question="What is the capital of France?",
        ...     context="France is in Western Europe. Its capital is Paris.",
        ...     response="The capital of France is Paris.",
        ... )
        >>> result.flagged
        False
    """
    if not question.strip():
        msg = "question must be a non-empty string."
        raise ValueError(msg)
    if not context.strip():
        msg = "context must be a non-empty string."
        raise ValueError(msg)
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    embeddings = encode_texts([question, context, response], model_name=model)
    q_emb, ctx_emb, resp_emb = embeddings[0], embeddings[1], embeddings[2]

    q_dist = euclidean_distance(resp_emb, q_emb)
    ctx_dist = euclidean_distance(resp_emb, ctx_emb)

    # Degenerate case: response identical to context.
    if ctx_dist < 1e-8:
        return SGIResult(
            value=10.0,
            normalized=1.0,
            flagged=False,
            q_dist=round(q_dist, 4),
            ctx_dist=round(ctx_dist, 4),
        )

    # Degenerate case: response identical to question.
    if q_dist < 1e-8:
        return SGIResult(
            value=0.0,
            normalized=0.0,
            flagged=True,
            q_dist=round(q_dist, 4),
            ctx_dist=round(ctx_dist, 4),
        )

    raw = q_dist / ctx_dist
    normalized = normalize_sgi(raw)

    return SGIResult(
        value=round(raw, 4),
        normalized=round(normalized, 4),
        flagged=raw < SGI_REVIEW,
        q_dist=round(q_dist, 4),
        ctx_dist=round(ctx_dist, 4),
    )

compute_dgi

groundlens.dgi.compute_dgi(question: str, response: str, *, model: str = DEFAULT_MODEL, reference_csv: str | None = None) -> DGIResult

Compute the Directional Grounding Index for a response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question | str | The input query. | required |
| response | str | The LLM output to evaluate. | required |
| model | str | Sentence transformer model name. | DEFAULT_MODEL |
| reference_csv | str \| None | Path to domain-specific calibration CSV. If None, uses the bundled dataset. | None |

Returns:

| Type | Description |
| --- | --- |
| DGIResult | DGIResult with raw score, normalized score, and flag status. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If question or response is empty. |

Example

>>> from groundlens import compute_dgi
>>> result = compute_dgi(
...     question="What causes seasons on Earth?",
...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
... )
>>> result.flagged
False

Source code in src/groundlens/dgi.py
def compute_dgi(
    question: str,
    response: str,
    *,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
) -> DGIResult:
    """Compute the Directional Grounding Index for a response.

    Args:
        question: The input query.
        response: The LLM output to evaluate.
        model: Sentence transformer model name.
        reference_csv: Path to domain-specific calibration CSV.
            If ``None``, uses the bundled dataset.

    Returns:
        DGIResult with raw score, normalized score, and flag status.

    Raises:
        ValueError: If question or response is empty.

    Example:
        >>> from groundlens import compute_dgi
        >>> result = compute_dgi(
        ...     question="What causes seasons on Earth?",
        ...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
        ... )
        >>> result.flagged
        False
    """
    if not question.strip():
        msg = "question must be a non-empty string."
        raise ValueError(msg)
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    mu_hat = _get_mu_hat(model, reference_csv)
    embeddings = encode_texts([question, response], model_name=model)
    q_emb, r_emb = embeddings[0], embeddings[1]

    delta = displacement_vector(q_emb, r_emb)
    magnitude = float(np.linalg.norm(delta))

    # Degenerate case: response identical to question.
    if magnitude < 1e-8:
        return DGIResult(value=0.0, normalized=0.0, flagged=True)

    delta_hat = delta / magnitude
    gamma = float(np.dot(delta_hat, mu_hat))

    if math.isnan(gamma):
        logger.warning("DGI produced NaN — check embedding dimensions.")
        return DGIResult(value=0.0, normalized=0.0, flagged=True)

    normalized = round(normalize_dgi(gamma), 4)

    return DGIResult(
        value=round(gamma, 4),
        normalized=normalized,
        flagged=gamma < DGI_PASS,
    )

evaluate

groundlens.evaluate.evaluate(question: str, response: str, context: str | None = None, *, model: str = DEFAULT_MODEL, reference_csv: str | None = None) -> GroundlensScore

Evaluate a single LLM response for hallucination risk.

Auto-selects the scoring method:
  • SGI when context is provided (grounded verification).
  • DGI when context is None (context-free verification).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question | str | The input query. | required |
| response | str | The LLM output to evaluate. | required |
| context | str \| None | Source document or retrieved text. If provided, SGI is used. | None |
| model | str | Sentence transformer model name. | DEFAULT_MODEL |
| reference_csv | str \| None | DGI calibration CSV path (only used when context is None). | None |

Returns:

| Type | Description |
| --- | --- |
| GroundlensScore | GroundlensScore with method, value, flag, and explanation. |

Example

>>> from groundlens import evaluate
>>> # With context → SGI
>>> score = evaluate("Q?", "A.", context="Source text.")
>>> score.method
'sgi'
>>> # Without context → DGI
>>> score = evaluate("Q?", "A.")
>>> score.method
'dgi'

Source code in src/groundlens/evaluate.py
def evaluate(
    question: str,
    response: str,
    context: str | None = None,
    *,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
) -> GroundlensScore:
    """Evaluate a single LLM response for hallucination risk.

    Auto-selects scoring method:
        - **SGI** when ``context`` is provided (grounded verification).
        - **DGI** when ``context`` is ``None`` (context-free verification).

    Args:
        question: The input query.
        response: The LLM output to evaluate.
        context: Source document or retrieved text. If provided, SGI is used.
        model: Sentence transformer model name.
        reference_csv: DGI calibration CSV path (only used when context is None).

    Returns:
        GroundlensScore with method, value, flag, and explanation.

    Example:
        >>> from groundlens import evaluate
        >>> # With context → SGI
        >>> score = evaluate("Q?", "A.", context="Source text.")
        >>> score.method
        'sgi'
        >>> # Without context → DGI
        >>> score = evaluate("Q?", "A.")
        >>> score.method
        'dgi'
    """
    result: SGIResult | DGIResult
    if context is not None and context.strip():
        result = compute_sgi(
            question=question,
            context=context,
            response=response,
            model=model,
        )
    else:
        result = compute_dgi(
            question=question,
            response=response,
            model=model,
            reference_csv=reference_csv,
        )

    return GroundlensScore(
        value=result.value,
        normalized=result.normalized,
        flagged=result.flagged,
        method=result.method,
        explanation=result.explanation,
        detail=result,
    )

evaluate_batch

groundlens.evaluate.evaluate_batch(items: list[dict[str, str]], *, model: str = DEFAULT_MODEL, reference_csv: str | None = None) -> list[GroundlensScore]

Evaluate a batch of LLM responses.

Each item in the list is a dict with keys:
  • question (required)
  • response (required)
  • context (optional — triggers SGI when present)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| items | list[dict[str, str]] | List of dicts, each containing question, response, and optionally context. | required |
| model | str | Sentence transformer model name. | DEFAULT_MODEL |
| reference_csv | str \| None | DGI calibration CSV path. | None |

Returns:

| Type | Description |
| --- | --- |
| list[GroundlensScore] | List of GroundlensScore results, one per input item. |

Raises:

| Type | Description |
| --- | --- |
| KeyError | If any item is missing question or response. |

Example

>>> from groundlens import evaluate_batch
>>> items = [
...     {"question": "Q1?", "response": "A1.", "context": "C1."},
...     {"question": "Q2?", "response": "A2."},
... ]
>>> results = evaluate_batch(items)
>>> len(results)
2

Source code in src/groundlens/evaluate.py
def evaluate_batch(
    items: list[dict[str, str]],
    *,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
) -> list[GroundlensScore]:
    """Evaluate a batch of LLM responses.

    Each item in the list is a dict with keys:
        - ``question`` (required)
        - ``response`` (required)
        - ``context`` (optional — triggers SGI when present)

    Args:
        items: List of dicts, each containing question, response, and
            optionally context.
        model: Sentence transformer model name.
        reference_csv: DGI calibration CSV path.

    Returns:
        List of GroundlensScore results, one per input item.

    Raises:
        KeyError: If any item is missing ``question`` or ``response``.

    Example:
        >>> from groundlens import evaluate_batch
        >>> items = [
        ...     {"question": "Q1?", "response": "A1.", "context": "C1."},
        ...     {"question": "Q2?", "response": "A2."},
        ... ]
        >>> results = evaluate_batch(items)
        >>> len(results)
        2
    """
    results: list[GroundlensScore] = []

    for i, item in enumerate(items):
        if "question" not in item:
            msg = f"Item {i} missing required key 'question'."
            raise KeyError(msg)
        if "response" not in item:
            msg = f"Item {i} missing required key 'response'."
            raise KeyError(msg)

        score = evaluate(
            question=item["question"],
            response=item["response"],
            context=item.get("context"),
            model=model,
            reference_csv=reference_csv,
        )
        results.append(score)

    logger.info(
        "Evaluated %d items (%d flagged).", len(results), sum(1 for r in results if r.flagged)
    )

    return results
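
A minimal triage sketch built on evaluate_batch; the print-based routing here is illustrative, not part of the library:

from groundlens import evaluate_batch

items = [
    {"question": "Q1?", "response": "A1.", "context": "C1."},
    {"question": "Q2?", "response": "A2."},
]

# Pair each score with its originating item and route flagged
# responses to human review.
for item, score in zip(items, evaluate_batch(items)):
    if score.flagged:
        print(f"REVIEW [{score.method}] {item['question']}: {score.explanation}")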

calibrate

groundlens.calibrate.calibrate(pairs: list[tuple[str, str]] | None = None, csv_path: str | None = None, *, model: str = DEFAULT_MODEL, metadata: dict[str, str] | None = None) -> CalibrationResult

Compute a DGI reference direction from calibration data.

Provide either pairs directly or a csv_path to a file with verified grounded (question, response) pairs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pairs | list[tuple[str, str]] \| None | List of (question, response) tuples. | None |
| csv_path | str \| None | Path to a CSV file with question and response columns. | None |
| model | str | Sentence transformer model to use for embedding. | DEFAULT_MODEL |
| metadata | dict[str, str] \| None | Optional metadata to attach (domain name, date, notes). | None |

Returns:

| Type | Description |
| --- | --- |
| CalibrationResult | CalibrationResult with computed reference direction and statistics. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If neither pairs nor csv_path is provided, or if the data contains fewer than 5 pairs. |

Example

>>> result = calibrate(pairs=[("Q?", "A.") for _ in range(20)])
>>> result.n_pairs
20

Source code in src/groundlens/calibrate.py
def calibrate(
    pairs: list[tuple[str, str]] | None = None,
    csv_path: str | None = None,
    *,
    model: str = DEFAULT_MODEL,
    metadata: dict[str, str] | None = None,
) -> CalibrationResult:
    """Compute a DGI reference direction from calibration data.

    Provide either ``pairs`` directly or a ``csv_path`` to a file
    with verified grounded (question, response) pairs.

    Args:
        pairs: List of (question, response) tuples.
        csv_path: Path to a CSV file with ``question`` and ``response`` columns.
        model: Sentence transformer model to use for embedding.
        metadata: Optional metadata to attach (domain name, date, notes).

    Returns:
        CalibrationResult with computed reference direction and statistics.

    Raises:
        ValueError: If neither ``pairs`` nor ``csv_path`` is provided,
            or if the data contains fewer than 5 pairs.

    Example:
        >>> result = calibrate(pairs=[("Q?", "A.") for _ in range(20)])
        >>> result.n_pairs
        20
    """
    if csv_path is not None:
        from groundlens._internal.csv_loader import load_reference_pairs

        pairs = load_reference_pairs(csv_path)
    elif pairs is None:
        msg = "Provide either 'pairs' or 'csv_path'."
        raise ValueError(msg)

    if len(pairs) < 5:
        msg = (
            f"Calibration requires at least 5 pairs, got {len(pairs)}. "
            "More pairs (20-100) produce better reference directions."
        )
        raise ValueError(msg)

    logger.info("Calibrating DGI with %d pairs using model %s.", len(pairs), model)

    mu_hat = _compute_reference_direction(pairs, model)

    # Estimate concentration parameter (kappa) from resultant length.
    # This is a rough estimate — the true MLE for von Mises-Fisher is
    # more complex, but the resultant length R-bar is a sufficient
    # indicator of calibration quality.
    from groundlens._internal.embeddings import encode_texts
    from groundlens._internal.geometry import displacement_vector, unit_normalize

    texts: list[str] = []
    for q, r in pairs:
        texts.extend([q, r])
    embeddings = encode_texts(texts, model_name=model)

    unit_displacements = []
    for i in range(len(pairs)):
        delta = displacement_vector(embeddings[i * 2], embeddings[i * 2 + 1])
        norm = float(np.linalg.norm(delta))
        if norm > 1e-8:
            unit_displacements.append(unit_normalize(delta))

    if unit_displacements:
        r_bar = float(np.linalg.norm(np.mean(np.stack(unit_displacements), axis=0)))
    else:
        r_bar = 0.0

    # Approximate kappa from R-bar (Sra, 2012).
    d = mu_hat.shape[0]
    kappa = r_bar * (d - r_bar**2) / (1 - r_bar**2) if r_bar < 0.99 else 100.0

    return CalibrationResult(
        model=model,
        n_pairs=len(pairs),
        embedding_dim=int(mu_hat.shape[0]),
        mu_hat=mu_hat,
        concentration=round(kappa, 2),
        metadata=metadata or {},
    )
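
The csv_path loader expects the documented question and response columns. A minimal sketch of producing such a file with the standard library; how the internal loader treats extra columns or unusual delimiters is not documented here, so keep to this shape:

import csv

from groundlens.calibrate import calibrate

# Placeholder verified pairs; real calibration data should contain
# 20-100 domain-specific (question, response) pairs.
rows = [(f"Question {i}?", f"Grounded answer {i}.") for i in range(20)]

with open("my_domain_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "response"])  # documented column names
    writer.writerows(rows)

result = calibrate(csv_path="my_domain_pairs.csv")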

Core Classes

SGI

groundlens.sgi.SGI(model: str = DEFAULT_MODEL)

Reusable SGI scorer with a pre-configured embedding model.

Use this class when evaluating multiple responses with the same model to avoid repeating the model parameter.

Example

>>> sgi = SGI(model="all-MiniLM-L6-v2")
>>> result = sgi.score(
...     question="What is X?",
...     context="X is Y.",
...     response="X is Y.",
... )
>>> result.flagged
False

Initialize SGI scorer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Sentence transformer model name or path. | DEFAULT_MODEL |

Source code in src/groundlens/sgi.py
def __init__(self, model: str = DEFAULT_MODEL) -> None:
    """Initialize SGI scorer.

    Args:
        model: Sentence transformer model name or path.
    """
    self.model = model

Functions

score(question: str, context: str, response: str) -> SGIResult

Compute SGI for a single response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question | str | The input query. | required |
| context | str | Source document or reference text. | required |
| response | str | The LLM output to evaluate. | required |

Returns:

| Type | Description |
| --- | --- |
| SGIResult | SGIResult with score and flag status. |

Source code in src/groundlens/sgi.py
def score(
    self,
    question: str,
    context: str,
    response: str,
) -> SGIResult:
    """Compute SGI for a single response.

    Args:
        question: The input query.
        context: Source document or reference text.
        response: The LLM output to evaluate.

    Returns:
        SGIResult with score and flag status.
    """
    return compute_sgi(
        question=question,
        context=context,
        response=response,
        model=self.model,
    )

DGI

groundlens.dgi.DGI(model: str = DEFAULT_MODEL, reference_csv: str | None = None)

Reusable DGI scorer with pre-configured model and calibration.

Use this class when evaluating multiple responses against the same reference direction. Supports both bundled and custom calibration.

Example

>>> dgi = DGI()
>>> result = dgi.score(
...     question="What is ML?",
...     response="ML is a branch of AI.",
... )
>>> result.flagged
False

>>> dgi = DGI(reference_csv="my_domain_pairs.csv")
>>> result = dgi.score(question="...", response="...")

Initialize DGI scorer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Sentence transformer model name. | DEFAULT_MODEL |
| reference_csv | str \| None | Path to domain-specific calibration CSV. | None |

Source code in src/groundlens/dgi.py
def __init__(
    self,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
) -> None:
    """Initialize DGI scorer.

    Args:
        model: Sentence transformer model name.
        reference_csv: Path to domain-specific calibration CSV.
    """
    self.model = model
    self.reference_csv = reference_csv

Functions

calibrate(pairs: list[tuple[str, str]] | None = None, csv_path: str | None = None) -> None

Set custom calibration data.

Either provide pairs directly or a path to a CSV file. This replaces any previously cached reference direction.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pairs | list[tuple[str, str]] \| None | List of verified (question, response) tuples. | None |
| csv_path | str \| None | Path to a calibration CSV file. | None |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If neither pairs nor csv_path is provided. |

Source code in src/groundlens/dgi.py
def calibrate(
    self,
    pairs: list[tuple[str, str]] | None = None,
    csv_path: str | None = None,
) -> None:
    """Set custom calibration data.

    Either provide pairs directly or a path to a CSV file.
    This replaces any previously cached reference direction.

    Args:
        pairs: List of verified (question, response) tuples.
        csv_path: Path to a calibration CSV file.

    Raises:
        ValueError: If neither ``pairs`` nor ``csv_path`` is provided.
    """
    if csv_path is not None:
        self.reference_csv = csv_path
        # Force recomputation on next score() call.
        cache_key = (self.model, csv_path)
        _mu_hat_cache.pop(cache_key, None)
        return

    if pairs is not None:
        # Compute and cache the reference direction directly.
        mu = _compute_reference_direction(pairs, self.model)
        cache_key = (self.model, "__inline__")
        _mu_hat_cache[cache_key] = mu
        self.reference_csv = "__inline__"
        return

    msg = "Provide either 'pairs' or 'csv_path' for calibration."
    raise ValueError(msg)
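
A short usage sketch of inline calibration followed by scoring; the pair values are placeholders:

dgi = DGI()
dgi.calibrate(pairs=[(f"Question {i}?", f"Grounded answer {i}.") for i in range(20)])
result = dgi.score(question="What is ML?", response="ML is a branch of AI.")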

score(question: str, response: str) -> DGIResult

Compute DGI for a single response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question | str | The input query. | required |
| response | str | The LLM output to evaluate. | required |

Returns:

| Type | Description |
| --- | --- |
| DGIResult | DGIResult with score and flag status. |

Source code in src/groundlens/dgi.py
def score(self, question: str, response: str) -> DGIResult:
    """Compute DGI for a single response.

    Args:
        question: The input query.
        response: The LLM output to evaluate.

    Returns:
        DGIResult with score and flag status.
    """
    ref = self.reference_csv if self.reference_csv != "__inline__" else None
    if self.reference_csv == "__inline__":
        # Use the inline-calibrated mu_hat.
        cache_key = (self.model, "__inline__")
        if cache_key not in _mu_hat_cache:
            msg = "Call calibrate() before score() when using inline pairs."
            raise RuntimeError(msg)

    return compute_dgi(
        question=question,
        response=response,
        model=self.model,
        reference_csv=ref,
    )

Result Types

SGIResult

groundlens.score.SGIResult(value: float, normalized: float, flagged: bool, q_dist: float, ctx_dist: float, method: str = 'sgi', explanation: str = '') dataclass

Result of Semantic Grounding Index computation.

SGI measures whether a response engaged with the provided context or stayed anchored to the question. Higher values indicate stronger context engagement (grounded).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| value | float | Raw SGI score = dist(response, question) / dist(response, context). |
| normalized | float | Score mapped to [0, 1] via tanh normalization. |
| flagged | bool | True if the score is below the review threshold. |
| q_dist | float | Euclidean distance from response to question embedding. |
| ctx_dist | float | Euclidean distance from response to context embedding. |
| method | str | Always "sgi". |
| explanation | str | Human-readable interpretation of the score. |
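
The raw value is exactly the ratio in the table above. A minimal sketch of the geometry, using toy vectors in place of real sentence embeddings:

import numpy as np

# Toy stand-ins for question, context, and response embeddings.
q = np.array([1.0, 0.0])
ctx = np.array([0.0, 1.0])
resp = np.array([0.1, 0.9])  # response sits near the context

q_dist = float(np.linalg.norm(resp - q))
ctx_dist = float(np.linalg.norm(resp - ctx))
raw_sgi = q_dist / ctx_dist  # > 1 here: stronger context engagement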

Functions

__post_init__() -> None

Generate explanation from score if not provided.

Source code in src/groundlens/score.py
def __post_init__(self) -> None:
    """Generate explanation from score if not provided."""
    if not self.explanation:
        if self.value >= SGI_STRONG_PASS:
            expl = f"SGI={self.value:.3f} — strong context engagement (pass)"
        elif self.value >= SGI_REVIEW:
            expl = f"SGI={self.value:.3f} — partial engagement (review recommended)"
        else:
            expl = f"SGI={self.value:.3f} — weak context engagement (flagged)"
        object.__setattr__(self, "explanation", expl)

DGIResult

groundlens.score.DGIResult(value: float, normalized: float, flagged: bool, method: str = 'dgi', explanation: str = '') dataclass

Result of Directional Grounding Index computation.

DGI measures whether the question-to-response displacement vector aligns with the mean displacement of verified grounded pairs. Higher values indicate alignment with grounded patterns.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| value | float | Raw DGI score = cosine similarity to reference direction. Range: [-1, 1]. |
| normalized | float | Score mapped to [0, 1] via linear normalization. |
| flagged | bool | True if the score is below the pass threshold. |
| method | str | Always "dgi". |
| explanation | str | Human-readable interpretation of the score. |
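
The raw value is the cosine alignment described above. A minimal sketch with toy vectors standing in for embeddings and for a calibrated reference direction:

import numpy as np

# Toy stand-ins: question/response embeddings and a unit reference
# direction mu_hat obtained from calibration.
q = np.array([0.0, 0.0, 1.0])
r = np.array([0.6, 0.0, 1.8])
mu_hat = np.array([1.0, 0.0, 1.0]) / np.sqrt(2.0)

delta = r - q                            # question-to-response displacement
delta_hat = delta / np.linalg.norm(delta)
gamma = float(delta_hat @ mu_hat)        # raw DGI in [-1, 1]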

Functions

__post_init__() -> None

Generate explanation from score if not provided.

Source code in src/groundlens/score.py
def __post_init__(self) -> None:
    """Generate explanation from score if not provided."""
    if not self.explanation:
        if self.value >= DGI_PASS:
            expl = f"DGI={self.value:.3f} — aligns with grounded patterns (pass)"
        elif self.value >= 0.0:
            expl = f"DGI={self.value:.3f} — weak alignment (flagged)"
        else:
            expl = f"DGI={self.value:.3f} — opposes grounded direction (high risk)"
        object.__setattr__(self, "explanation", expl)

GroundlensScore

groundlens.score.GroundlensScore(value: float, normalized: float, flagged: bool, method: str, explanation: str, detail: SGIResult | DGIResult) dataclass

Unified score container returned by high-level evaluate() calls.

Wraps either an SGIResult or DGIResult with additional metadata.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| value | float | Raw score from the underlying method. |
| normalized | float | Score in [0, 1]. |
| flagged | bool | Whether human review is recommended. |
| method | str | "sgi" or "dgi". |
| explanation | str | Human-readable interpretation. |
| detail | SGIResult \| DGIResult | The full SGIResult or DGIResult for method-specific fields. |

CalibrationResult

groundlens.calibrate.CalibrationResult(model: str, n_pairs: int, embedding_dim: int, mu_hat: NDArray[np.float32], concentration: float, metadata: dict[str, str] = dict()) dataclass

Result of DGI calibration.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| model | str | Sentence transformer model used for calibration. |
| n_pairs | int | Number of (question, response) pairs used. |
| embedding_dim | int | Dimensionality of the embedding space. |
| mu_hat | NDArray[float32] | The computed reference direction vector. |
| concentration | float | Estimated concentration parameter (kappa) of the von Mises-Fisher distribution. Higher values indicate more consistent displacement directions in the reference data. |

Functions

save(path: str | Path) -> None

Save calibration result to JSON.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str \| Path | Output file path. The mu_hat vector is stored as a list. | required |

Source code in src/groundlens/calibrate.py
def save(self, path: str | Path) -> None:
    """Save calibration result to JSON.

    Args:
        path: Output file path. The mu_hat vector is stored as a list.
    """
    data = {
        "model": self.model,
        "n_pairs": self.n_pairs,
        "embedding_dim": self.embedding_dim,
        "mu_hat": self.mu_hat.tolist(),
        "concentration": self.concentration,
        "metadata": self.metadata,
    }
    Path(path).write_text(json.dumps(data, indent=2), encoding="utf-8")
    logger.info("Calibration saved to %s.", path)

load(path: str | Path) -> CalibrationResult classmethod

Load a saved calibration result.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str \| Path | Path to JSON calibration file. | required |

Returns:

| Type | Description |
| --- | --- |
| CalibrationResult | CalibrationResult instance with restored mu_hat vector. |

Source code in src/groundlens/calibrate.py
@classmethod
def load(cls, path: str | Path) -> CalibrationResult:
    """Load a saved calibration result.

    Args:
        path: Path to JSON calibration file.

    Returns:
        CalibrationResult instance with restored mu_hat vector.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return cls(
        model=data["model"],
        n_pairs=data["n_pairs"],
        embedding_dim=data["embedding_dim"],
        mu_hat=np.array(data["mu_hat"], dtype=np.float32),
        concentration=data["concentration"],
        metadata=data.get("metadata", {}),
    )
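
A save/load round-trip sketch; the file name is arbitrary:

from groundlens.calibrate import CalibrationResult, calibrate

result = calibrate(pairs=[(f"Question {i}?", f"Grounded answer {i}.") for i in range(20)])
result.save("calibration.json")

restored = CalibrationResult.load("calibration.json")
assert restored.n_pairs == result.n_pairs
assert restored.embedding_dim == result.embedding_dim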

Providers

GroundlensOpenAI

groundlens.providers.openai.GroundlensOpenAI(api_key: str, model: str = 'gpt-4o', groundlens_model: str = 'all-MiniLM-L6-v2', groundlens_threshold: float = 0.45)

OpenAI LLM provider with built-in groundlens scoring.

Wraps the OpenAI chat completions API and automatically evaluates each response for hallucination risk.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| api_key | str | OpenAI API key. | required |
| model | str | Chat model to use for generation. Defaults to "gpt-4o". | 'gpt-4o' |
| groundlens_model | str | Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2". | 'all-MiniLM-L6-v2' |
| groundlens_threshold | float | Score threshold override (reserved for future use). Defaults to 0.45. | 0.45 |

Example

>>> llm = GroundlensOpenAI(api_key="sk-...")
>>> resp = llm.chat("Summarize this document.", context="The document text.")
>>> print(resp.groundlens_score.explanation)

Source code in src/groundlens/providers/openai.py
def __init__(
    self,
    api_key: str,
    model: str = "gpt-4o",
    groundlens_model: str = "all-MiniLM-L6-v2",
    groundlens_threshold: float = 0.45,
) -> None:
    self._client = _get_openai_client(api_key)
    self._model = model
    self._groundlens_model = groundlens_model
    self._groundlens_threshold = groundlens_threshold

Functions

chat(prompt: str, context: str | None = None, **kwargs: Any) -> LLMResponse

Send a chat completion request and score the response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The user message content. | required |
| context | str \| None | Optional source document. When provided, SGI scoring is used; otherwise DGI scoring is applied. | None |
| **kwargs | Any | Additional keyword arguments forwarded to the OpenAI chat.completions.create call. | {} |

Returns:

| Type | Description |
| --- | --- |
| LLMResponse | LLMResponse containing the generated text, model identifier, usage metadata, and a groundlens hallucination score. |

Raises:

| Type | Description |
| --- | --- |
| OpenAIError | If the API call fails. |

Example

>>> llm = GroundlensOpenAI(api_key="sk-...")
>>> resp = llm.chat("What causes tides?")
>>> resp.text
'Tides are primarily caused by...'

Source code in src/groundlens/providers/openai.py
def chat(
    self,
    prompt: str,
    context: str | None = None,
    **kwargs: Any,
) -> LLMResponse:
    """Send a chat completion request and score the response.

    Args:
        prompt: The user message content.
        context: Optional source document. When provided, SGI scoring
            is used; otherwise DGI scoring is applied.
        **kwargs: Additional keyword arguments forwarded to the
            OpenAI ``chat.completions.create`` call.

    Returns:
        LLMResponse containing the generated text, model identifier,
        usage metadata, and a groundlens hallucination score.

    Raises:
        openai.OpenAIError: If the API call fails.

    Example:
        >>> llm = GroundlensOpenAI(api_key="sk-...")
        >>> resp = llm.chat("What causes tides?")
        >>> resp.text
        'Tides are primarily caused by...'
    """
    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]

    logger.debug("Calling OpenAI model=%s prompt_len=%d", self._model, len(prompt))

    completion = self._client.chat.completions.create(
        model=self._model,
        messages=messages,
        **kwargs,
    )

    choice = completion.choices[0]
    text = choice.message.content or ""

    usage: dict[str, Any] = {}
    if completion.usage is not None:
        usage = {
            "prompt_tokens": completion.usage.prompt_tokens,
            "completion_tokens": completion.usage.completion_tokens,
            "total_tokens": completion.usage.total_tokens,
        }

    score = evaluate(
        question=prompt,
        response=text,
        context=context,
        model=self._groundlens_model,
    )

    logger.info(
        "OpenAI response scored: method=%s value=%.3f flagged=%s",
        score.method,
        score.value,
        score.flagged,
    )

    return LLMResponse(
        text=text,
        model=self._model,
        usage=usage,
        groundlens_score=score,
    )

complete(prompt: str, context: str | None = None) -> LLMResponse

Generate a completion for the given prompt.

Convenience method that delegates to chat().

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The user prompt or instruction. | required |
| context | str \| None | Optional source document for grounded evaluation. | None |

Returns:

| Type | Description |
| --- | --- |
| LLMResponse | LLMResponse with generated text and groundlens score. |

Source code in src/groundlens/providers/openai.py
def complete(
    self,
    prompt: str,
    context: str | None = None,
) -> LLMResponse:
    """Generate a completion for the given prompt.

    Convenience method that delegates to :meth:`chat`.

    Args:
        prompt: The user prompt or instruction.
        context: Optional source document for grounded evaluation.

    Returns:
        LLMResponse with generated text and groundlens score.
    """
    return self.chat(prompt, context=context)
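
A gating sketch built on the attached score; document_text, send_to_review, and publish are hypothetical placeholders, not part of the library:

llm = GroundlensOpenAI(api_key="sk-...")
resp = llm.chat("Summarize the report.", context=document_text)

# Route flagged output to a reviewer instead of publishing it directly.
if resp.groundlens_score.flagged:
    send_to_review(resp.text, resp.groundlens_score.explanation)
else:
    publish(resp.text)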

GroundlensAnthropic

groundlens.providers.anthropic.GroundlensAnthropic(api_key: str, model: str = 'claude-sonnet-4-20250514', groundlens_model: str = 'all-MiniLM-L6-v2', groundlens_threshold: float = 0.45)

Anthropic Claude provider with built-in groundlens scoring.

Wraps the Anthropic messages API and automatically evaluates each response for hallucination risk.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| api_key | str | Anthropic API key. | required |
| model | str | Claude model to use for generation. Defaults to "claude-sonnet-4-20250514". | 'claude-sonnet-4-20250514' |
| groundlens_model | str | Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2". | 'all-MiniLM-L6-v2' |
| groundlens_threshold | float | Score threshold override (reserved for future use). Defaults to 0.45. | 0.45 |

Example

>>> llm = GroundlensAnthropic(api_key="sk-ant-...")
>>> resp = llm.chat("Summarize this.", context="Source text here.")
>>> print(resp.groundlens_score.explanation)

Source code in src/groundlens/providers/anthropic.py
def __init__(
    self,
    api_key: str,
    model: str = "claude-sonnet-4-20250514",
    groundlens_model: str = "all-MiniLM-L6-v2",
    groundlens_threshold: float = 0.45,
) -> None:
    self._client = _get_anthropic_client(api_key)
    self._model = model
    self._groundlens_model = groundlens_model
    self._groundlens_threshold = groundlens_threshold

Functions

chat(prompt: str, context: str | None = None, **kwargs: Any) -> LLMResponse

Send a message to Claude and score the response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The user message content. | required |
| context | str \| None | Optional source document. When provided, SGI scoring is used; otherwise DGI scoring is applied. | None |
| **kwargs | Any | Additional keyword arguments forwarded to the Anthropic messages.create call. | {} |

Returns:

| Type | Description |
| --- | --- |
| LLMResponse | LLMResponse containing the generated text, model identifier, usage metadata, and a groundlens hallucination score. |

Raises:

| Type | Description |
| --- | --- |
| APIError | If the API call fails. |

Example

>>> llm = GroundlensAnthropic(api_key="sk-ant-...")
>>> resp = llm.chat("Explain photosynthesis.")
>>> resp.text
'Photosynthesis is the process by which...'

Source code in src/groundlens/providers/anthropic.py
def chat(
    self,
    prompt: str,
    context: str | None = None,
    **kwargs: Any,
) -> LLMResponse:
    """Send a message to Claude and score the response.

    Args:
        prompt: The user message content.
        context: Optional source document. When provided, SGI scoring
            is used; otherwise DGI scoring is applied.
        **kwargs: Additional keyword arguments forwarded to the
            Anthropic ``messages.create`` call.

    Returns:
        LLMResponse containing the generated text, model identifier,
        usage metadata, and a groundlens hallucination score.

    Raises:
        anthropic.APIError: If the API call fails.

    Example:
        >>> llm = GroundlensAnthropic(api_key="sk-ant-...")
        >>> resp = llm.chat("Explain photosynthesis.")
        >>> resp.text
        'Photosynthesis is the process by which...'
    """
    messages: list[dict[str, str]] = [{"role": "user", "content": prompt}]

    logger.debug("Calling Anthropic model=%s prompt_len=%d", self._model, len(prompt))

    max_tokens = kwargs.pop("max_tokens", 4096)

    message = self._client.messages.create(
        model=self._model,
        max_tokens=max_tokens,
        messages=messages,
        **kwargs,
    )

    text = ""
    for block in message.content:
        if hasattr(block, "text"):
            text += block.text

    usage: dict[str, Any] = {
        "input_tokens": message.usage.input_tokens,
        "output_tokens": message.usage.output_tokens,
    }

    score = evaluate(
        question=prompt,
        response=text,
        context=context,
        model=self._groundlens_model,
    )

    logger.info(
        "Anthropic response scored: method=%s value=%.3f flagged=%s",
        score.method,
        score.value,
        score.flagged,
    )

    return LLMResponse(
        text=text,
        model=self._model,
        usage=usage,
        groundlens_score=score,
    )

complete(prompt: str, context: str | None = None) -> LLMResponse

Generate a completion for the given prompt.

Convenience method that delegates to chat().

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The user prompt or instruction. | required |
| context | str \| None | Optional source document for grounded evaluation. | None |

Returns:

| Type | Description |
| --- | --- |
| LLMResponse | LLMResponse with generated text and groundlens score. |

Source code in src/groundlens/providers/anthropic.py
def complete(
    self,
    prompt: str,
    context: str | None = None,
) -> LLMResponse:
    """Generate a completion for the given prompt.

    Convenience method that delegates to :meth:`chat`.

    Args:
        prompt: The user prompt or instruction.
        context: Optional source document for grounded evaluation.

    Returns:
        LLMResponse with generated text and groundlens score.
    """
    return self.chat(prompt, context=context)

GroundlensGemini

groundlens.providers.google.GroundlensGemini(api_key: str, model: str = 'gemini-2.0-flash', groundlens_model: str = 'all-MiniLM-L6-v2', groundlens_threshold: float = 0.45)

Google Gemini provider with built-in groundlens scoring.

Wraps the Google Generative AI SDK and automatically evaluates each response for hallucination risk.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| api_key | str | Google AI API key. | required |
| model | str | Gemini model to use for generation. Defaults to "gemini-2.0-flash". | 'gemini-2.0-flash' |
| groundlens_model | str | Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2". | 'all-MiniLM-L6-v2' |
| groundlens_threshold | float | Score threshold override (reserved for future use). Defaults to 0.45. | 0.45 |

Example

>>> llm = GroundlensGemini(api_key="AI...")
>>> resp = llm.chat("Summarize this.", context="Source text here.")
>>> print(resp.groundlens_score.explanation)

Source code in src/groundlens/providers/google.py
def __init__(
    self,
    api_key: str,
    model: str = "gemini-2.0-flash",
    groundlens_model: str = "all-MiniLM-L6-v2",
    groundlens_threshold: float = 0.45,
) -> None:
    self._genai = _configure_genai(api_key)
    self._model_name = model
    self._generative_model = self._genai.GenerativeModel(model)
    self._groundlens_model = groundlens_model
    self._groundlens_threshold = groundlens_threshold

Functions

chat(prompt: str, context: str | None = None, **kwargs: Any) -> LLMResponse

Send a prompt to Gemini and score the response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The user message content. | required |
| context | str \| None | Optional source document. When provided, SGI scoring is used; otherwise DGI scoring is applied. | None |
| **kwargs | Any | Additional keyword arguments forwarded to the Gemini generate_content call. | {} |

Returns:

| Type | Description |
| --- | --- |
| LLMResponse | LLMResponse containing the generated text, model identifier, usage metadata, and a groundlens hallucination score. |

Raises:

| Type | Description |
| --- | --- |
| GoogleAPIError | If the API call fails. |

Example

>>> llm = GroundlensGemini(api_key="AI...")
>>> resp = llm.chat("Explain gravity.")
>>> resp.text
'Gravity is a fundamental force...'

Source code in src/groundlens/providers/google.py
def chat(
    self,
    prompt: str,
    context: str | None = None,
    **kwargs: Any,
) -> LLMResponse:
    """Send a prompt to Gemini and score the response.

    Args:
        prompt: The user message content.
        context: Optional source document. When provided, SGI scoring
            is used; otherwise DGI scoring is applied.
        **kwargs: Additional keyword arguments forwarded to the
            Gemini ``generate_content`` call.

    Returns:
        LLMResponse containing the generated text, model identifier,
        usage metadata, and a groundlens hallucination score.

    Raises:
        google.api_core.exceptions.GoogleAPIError: If the API call fails.

    Example:
        >>> llm = GroundlensGemini(api_key="AI...")
        >>> resp = llm.chat("Explain gravity.")
        >>> resp.text
        'Gravity is a fundamental force...'
    """
    logger.debug("Calling Gemini model=%s prompt_len=%d", self._model_name, len(prompt))

    response = self._generative_model.generate_content(prompt, **kwargs)

    text = response.text or ""

    usage: dict[str, Any] = {}
    if hasattr(response, "usage_metadata") and response.usage_metadata is not None:
        usage = {
            "prompt_token_count": response.usage_metadata.prompt_token_count,
            "candidates_token_count": response.usage_metadata.candidates_token_count,
            "total_token_count": response.usage_metadata.total_token_count,
        }

    score = evaluate(
        question=prompt,
        response=text,
        context=context,
        model=self._groundlens_model,
    )

    logger.info(
        "Gemini response scored: method=%s value=%.3f flagged=%s",
        score.method,
        score.value,
        score.flagged,
    )

    return LLMResponse(
        text=text,
        model=self._model_name,
        usage=usage,
        groundlens_score=score,
    )

complete(prompt: str, context: str | None = None) -> LLMResponse

Generate a completion for the given prompt.

Convenience method that delegates to chat().

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | The user prompt or instruction. | required |
| context | str \| None | Optional source document for grounded evaluation. | None |

Returns:

| Type | Description |
| --- | --- |
| LLMResponse | LLMResponse with generated text and groundlens score. |

Source code in src/groundlens/providers/google.py
def complete(
    self,
    prompt: str,
    context: str | None = None,
) -> LLMResponse:
    """Generate a completion for the given prompt.

    Convenience method that delegates to :meth:`chat`.

    Args:
        prompt: The user prompt or instruction.
        context: Optional source document for grounded evaluation.

    Returns:
        LLMResponse with generated text and groundlens score.
    """
    return self.chat(prompt, context=context)

Integrations

GroundlensEvaluator (LangChain)

groundlens.integrations.langchain.evaluator.GroundlensEvaluator(groundlens_model: str = 'all-MiniLM-L6-v2', input_key: str = 'question', output_key: str = 'output', context_key: str = 'context')

LangSmith run evaluator that scores outputs with groundlens.

Extracts input, output, and optional context from LangSmith runs and examples, then computes SGI (when context is available) or DGI (context-free) scores.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| groundlens_model | str | Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2". | 'all-MiniLM-L6-v2' |
| input_key | str | Key to extract the question from run inputs. Defaults to "question". | 'question' |
| output_key | str | Key to extract the response from run outputs. Defaults to "output". | 'output' |
| context_key | str | Key to extract context from example inputs. Defaults to "context". | 'context' |

Example

>>> evaluator = GroundlensEvaluator()
>>> # Typically used with LangSmith evaluate():
>>> from langsmith import evaluate
>>> evaluate(chain, data="dataset", evaluators=[evaluator])

Source code in src/groundlens/integrations/langchain/evaluator.py
def __init__(
    self,
    groundlens_model: str = "all-MiniLM-L6-v2",
    input_key: str = "question",
    output_key: str = "output",
    context_key: str = "context",
) -> None:
    self._groundlens_model = groundlens_model
    self._input_key = input_key
    self._output_key = output_key
    self._context_key = context_key

Functions

evaluate_run(run: Run, example: Example | None = None) -> Any

Evaluate a LangSmith run for hallucination risk.

Extracts the question from run inputs, the response from run outputs, and optionally context from the example inputs. Returns a LangSmith EvaluationResult with the groundlens score.

Parameters:

Name Type Description Default
run Run

The LangSmith run to evaluate. Must have inputs and outputs dicts.

required
example Example | None

Optional LangSmith example providing ground truth or context for SGI evaluation.

None

Returns:

| Type | Description |
| --- | --- |
| Any | An EvaluationResult with key "groundlens", the normalized score, and a comment containing the explanation. |

Example

>>> evaluator = GroundlensEvaluator()
>>> result = evaluator.evaluate_run(run, example)
>>> result.key
'groundlens'

Source code in src/groundlens/integrations/langchain/evaluator.py
def evaluate_run(
    self,
    run: Run,
    example: Example | None = None,
) -> Any:
    """Evaluate a LangSmith run for hallucination risk.

    Extracts the question from run inputs, the response from run
    outputs, and optionally context from the example inputs. Returns
    a LangSmith ``EvaluationResult`` with the groundlens score.

    Args:
        run: The LangSmith run to evaluate. Must have ``inputs``
            and ``outputs`` dicts.
        example: Optional LangSmith example providing ground truth
            or context for SGI evaluation.

    Returns:
        An ``EvaluationResult`` with key ``"groundlens"``, the normalized
        score, and a comment containing the explanation.

    Example:
        >>> evaluator = GroundlensEvaluator()
        >>> result = evaluator.evaluate_run(run, example)
        >>> result.key
        'groundlens'
    """
    (evaluation_result_cls,) = _import_langsmith_types()

    inputs = run.inputs or {}
    outputs = run.outputs or {}

    question = inputs.get(self._input_key, "")
    response = outputs.get(self._output_key, "")

    if not question:
        for key in ("input", "query", "prompt"):
            question = inputs.get(key, "")
            if question:
                break

    if not response:
        for key in ("answer", "result", "text", "response"):
            response = outputs.get(key, "")
            if response:
                break

    context: str | None = None
    if example is not None and example.inputs:
        context = example.inputs.get(self._context_key)

    if not question or not response:
        logger.warning(
            "GroundlensEvaluator: missing question or response for run %s",
            run.id,
        )
        return evaluation_result_cls(
            key="groundlens",
            score=None,
            comment="Missing question or response — could not evaluate.",
        )

    score: GroundlensScore = evaluate(
        question=str(question),
        response=str(response),
        context=str(context) if context else None,
        model=self._groundlens_model,
    )

    logger.info(
        "GroundlensEvaluator run=%s method=%s value=%.3f flagged=%s",
        run.id,
        score.method,
        score.value,
        score.flagged,
    )

    return evaluation_result_cls(
        key="groundlens",
        score=score.normalized,
        comment=score.explanation,
    )

GroundlensCallback (LangChain)

groundlens.integrations.langchain.callback.GroundlensCallback(groundlens_model: str = 'all-MiniLM-L6-v2', context_key: str = 'context')

LangChain callback handler that scores every LLM response with groundlens.

Stores prompts on on_llm_start and evaluates responses on on_llm_end. Flagged results are logged as warnings. Scores are accumulated in the scores attribute for later inspection.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| groundlens_model | str | Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2". | 'all-MiniLM-L6-v2' |
| context_key | str | Metadata key to look for context in kwargs. Defaults to "context". | 'context' |

Example

>>> cb = GroundlensCallback()
>>> # Use as a LangChain callback
>>> from langchain_openai import ChatOpenAI
>>> llm = ChatOpenAI(callbacks=[cb])
>>> result = llm.invoke("Summarize the document.")
>>> # Inspect scores after execution
>>> for run_id, score in cb.scores.items():
...     print(f"{run_id}: {score.explanation}")

Source code in src/groundlens/integrations/langchain/callback.py
def __init__(
    self,
    groundlens_model: str = "all-MiniLM-L6-v2",
    context_key: str = "context",
) -> None:
    self._groundlens_model = groundlens_model
    self._context_key = context_key
    self._prompts: dict[UUID, list[str]] = {}
    self._contexts: dict[UUID, str | None] = {}
    self.scores: dict[UUID, GroundlensScore] = {}

Functions

on_llm_start(serialized: dict[str, Any], prompts: list[str], *, run_id: UUID, **kwargs: Any) -> None

Store prompts when an LLM call begins.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| serialized | dict[str, Any] | Serialized LLM configuration. | required |
| prompts | list[str] | List of prompt strings sent to the LLM. | required |
| run_id | UUID | Unique identifier for this LLM run. | required |
| **kwargs | Any | Additional keyword arguments from LangChain. | {} |

Source code in src/groundlens/integrations/langchain/callback.py
def on_llm_start(
    self,
    serialized: dict[str, Any],
    prompts: list[str],
    *,
    run_id: UUID,
    **kwargs: Any,
) -> None:
    """Store prompts when an LLM call begins.

    Args:
        serialized: Serialized LLM configuration.
        prompts: List of prompt strings sent to the LLM.
        run_id: Unique identifier for this LLM run.
        **kwargs: Additional keyword arguments from LangChain.
    """
    self._prompts[run_id] = prompts
    metadata = kwargs.get("metadata") or {}
    self._contexts[run_id] = metadata.get(self._context_key)
    logger.debug("on_llm_start run_id=%s prompts=%d", run_id, len(prompts))

on_llm_end(response: LLMResult, *, run_id: UUID, **kwargs: Any) -> None

Evaluate the LLM response for hallucination risk.

Parameters:

Name Type Description Default
response LLMResult

The LLM result containing generated text.

required
run_id UUID

Unique identifier for this LLM run.

required
**kwargs Any

Additional keyword arguments from LangChain.

{}
Source code in src/groundlens/integrations/langchain/callback.py
def on_llm_end(
    self,
    response: LLMResult,
    *,
    run_id: UUID,
    **kwargs: Any,
) -> None:
    """Evaluate the LLM response for hallucination risk.

    Args:
        response: The LLM result containing generated text.
        run_id: Unique identifier for this LLM run.
        **kwargs: Additional keyword arguments from LangChain.
    """
    prompts = self._prompts.pop(run_id, [])
    context = self._contexts.pop(run_id, None)

    if not prompts or not response.generations:
        logger.debug("on_llm_end run_id=%s — no prompts or generations", run_id)
        return

    prompt = prompts[0]
    generation = response.generations[0]
    if not generation:
        return

    text = generation[0].text

    score = evaluate(
        question=prompt,
        response=text,
        context=context,
        model=self._groundlens_model,
    )

    self.scores[run_id] = score

    if score.flagged:
        logger.warning(
            "Groundlens FLAGGED run_id=%s method=%s value=%.3f%s",
            run_id,
            score.method,
            score.value,
            score.explanation,
        )
    else:
        logger.info(
            "Groundlens OK run_id=%s method=%s value=%.3f",
            run_id,
            score.method,
            score.value,
        )

on_llm_error(error: BaseException, *, run_id: UUID, **kwargs: Any) -> None

Clean up state when an LLM call fails.

Parameters:

Name Type Description Default
error BaseException

The exception that caused the LLM call to fail.

required
run_id UUID

Unique identifier for this LLM run.

required
**kwargs Any

Additional keyword arguments from LangChain.

{}
Source code in src/groundlens/integrations/langchain/callback.py
def on_llm_error(
    self,
    error: BaseException,
    *,
    run_id: UUID,
    **kwargs: Any,
) -> None:
    """Clean up state when an LLM call fails.

    Args:
        error: The exception that caused the LLM call to fail.
        run_id: Unique identifier for this LLM run.
        **kwargs: Additional keyword arguments from LangChain.
    """
    self._prompts.pop(run_id, None)
    self._contexts.pop(run_id, None)
    logger.error("on_llm_error run_id=%s error=%s", run_id, error)

GroundlensTool (CrewAI)

groundlens.integrations.crewai.tool.GroundlensTool(name: str = 'groundlens_verify', description: str | None = None, groundlens_model: str = 'all-MiniLM-L6-v2')

CrewAI tool for verifying LLM outputs using groundlens.

Extends the CrewAI tool pattern to let agents self-verify their outputs. The tool evaluates a question-response pair (with optional context) and returns a human-readable verification summary.

Parameters:

Name Type Description Default
name str

Tool name visible to the agent. Defaults to "groundlens_verify".

'groundlens_verify'
description str | None

Tool description for agent tool selection.

None
groundlens_model str

Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2".

'all-MiniLM-L6-v2'
Example

>>> from groundlens.integrations.crewai import GroundlensTool
>>> tool = GroundlensTool()
>>> # Agent uses the tool to verify its own output
>>> result = tool._run(
...     question="What causes rain?",
...     response="Rain is caused by condensation.",
...     context="Water cycle: evaporation, condensation, precipitation.",
... )
>>> "PASS" in result or "FLAGGED" in result
True

Source code in src/groundlens/integrations/crewai/tool.py
def __init__(
    self,
    name: str = "groundlens_verify",
    description: str | None = None,
    groundlens_model: str = "all-MiniLM-L6-v2",
) -> None:
    self.name = name
    if description is not None:
        self.description = description
    self._groundlens_model = groundlens_model

GroundlensFilter (Semantic Kernel)

groundlens.integrations.semantic_kernel.filter.GroundlensFilter(groundlens_model: str = 'all-MiniLM-L6-v2', input_key: str = 'input', context_key: str = 'context')

Semantic Kernel function invocation filter with groundlens scoring.

Intercepts function invocation results and evaluates them for hallucination risk. Scores are attached to the invocation context metadata under the "groundlens_score" key and stored in the scores attribute for later inspection.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| groundlens_model | str | Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2". | 'all-MiniLM-L6-v2' |
| input_key | str | Key to extract the question from function arguments. Defaults to "input". | 'input' |
| context_key | str | Key to extract context from function arguments. Defaults to "context". | 'context' |

Example

>>> filt = GroundlensFilter()
>>> # Register with a Semantic Kernel instance
>>> kernel.add_filter("function_invocation", filt)
>>> # After invocation, inspect scores:
>>> for fn_name, score in filt.scores:
...     print(f"{fn_name}: {score.explanation}")

Source code in src/groundlens/integrations/semantic_kernel/filter.py
def __init__(
    self,
    groundlens_model: str = "all-MiniLM-L6-v2",
    input_key: str = "input",
    context_key: str = "context",
) -> None:
    self._groundlens_model = groundlens_model
    self._input_key = input_key
    self._context_key = context_key
    self.scores: list[tuple[str, GroundlensScore]] = []

Functions

on_function_invocation(context: Any, next_handler: Callable[..., Awaitable[None]]) -> None async

Intercept a function invocation and evaluate the result.

Calls the next filter/function in the pipeline, then evaluates the result with groundlens. Attaches the score to the context metadata.

Parameters:

Name Type Description Default
context Any

The Semantic Kernel FunctionInvocationContext containing function arguments and result.

required
next_handler Callable[..., Awaitable[None]]

The next handler in the filter pipeline.

required
Example
This method is called automatically by Semantic Kernel
when registered as a function invocation filter.
Source code in src/groundlens/integrations/semantic_kernel/filter.py
async def on_function_invocation(
    self,
    context: Any,
    next_handler: Callable[..., Awaitable[None]],
) -> None:
    """Intercept a function invocation and evaluate the result.

    Calls the next filter/function in the pipeline, then evaluates
    the result with groundlens. Attaches the score to the context
    metadata.

    Args:
        context: The Semantic Kernel ``FunctionInvocationContext``
            containing function arguments and result.
        next_handler: The next handler in the filter pipeline.

    Example:
        >>> # This method is called automatically by Semantic Kernel
        >>> # when registered as a function invocation filter.
    """
    await next_handler(context)

    function_name = getattr(context, "function_name", "unknown")
    arguments = getattr(context, "arguments", {}) or {}
    result = getattr(context, "result", None)

    if result is None:
        logger.debug("GroundlensFilter: no result for function %s", function_name)
        return

    result_value = getattr(result, "value", None)
    result_value = str(result) if result_value is None else str(result_value)

    question = str(arguments.get(self._input_key, ""))
    context_text: str | None = arguments.get(self._context_key)
    if context_text is not None:
        context_text = str(context_text)

    if not question:
        logger.debug(
            "GroundlensFilter: no input found for function %s, skipping",
            function_name,
        )
        return

    score: GroundlensScore = evaluate(
        question=question,
        response=result_value,
        context=context_text,
        model=self._groundlens_model,
    )

    self.scores.append((function_name, score))

    metadata = getattr(context, "metadata", None)
    if metadata is not None and isinstance(metadata, dict):
        metadata["groundlens_score"] = score

    if score.flagged:
        logger.warning(
            "GroundlensFilter FLAGGED function=%s method=%s value=%.3f%s",
            function_name,
            score.method,
            score.value,
            score.explanation,
        )
    else:
        logger.info(
            "GroundlensFilter OK function=%s method=%s value=%.3f",
            function_name,
            score.method,
            score.value,
        )
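
Because scores accumulate in filt.scores across invocations, a simple post-run triage loop can surface only the flagged results. A minimal sketch, assuming the filter was registered and the kernel has since invoked one or more functions:

# Assumes `filt` was registered via kernel.add_filter("function_invocation", filt).
flagged = [(name, score) for name, score in filt.scores if score.flagged]
for name, score in flagged:
    print(f"review: {name} method={score.method} value={score.value:.3f}")
    print(f"  {score.explanation}")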

GroundlensChecker (AutoGen)

groundlens.integrations.autogen.checker.GroundlensChecker(groundlens_model: str = 'all-MiniLM-L6-v2', context_key: str = 'context')

AutoGen reply checker that evaluates messages with groundlens.

Designed to be used as a reply validation step in AutoGen agent conversations. Evaluates the last assistant message against the preceding user message for hallucination risk.

Parameters:

groundlens_model (str): Sentence-transformer model for groundlens scoring. Defaults to "all-MiniLM-L6-v2".
context_key (str): Key to look for context in message metadata. Defaults to "context".
Example

>>> checker = GroundlensChecker()
>>> messages = [
...     {"role": "user", "content": "Summarize this document."},
...     {"role": "assistant", "content": "The document discusses..."},
... ]
>>> result = checker.check(messages, sender=None)
>>> result["method"]
'dgi'
>>> result["flagged"]
False

Source code in src/groundlens/integrations/autogen/checker.py
def __init__(
    self,
    groundlens_model: str = "all-MiniLM-L6-v2",
    context_key: str = "context",
) -> None:
    self._groundlens_model = groundlens_model
    self._context_key = context_key

Functions

check(messages: list[dict[str, Any]], sender: Any, **kwargs: Any) -> dict[str, Any]

Evaluate the last message in the conversation.

Extracts the last assistant message as the response and the most recent preceding user message as the question. If context is found in message metadata, SGI scoring is used; otherwise DGI is applied.

Parameters:

messages (list[dict[str, Any]]): List of conversation message dicts. Each dict should have "role" and "content" keys. Required.
sender (Any): The AutoGen agent that sent the last message. Used for logging; can be None. Required.
**kwargs (Any): Additional keyword arguments. If a "context" key is present, it is used for SGI evaluation.

Returns:

dict[str, Any]: A dict containing:

  • "score": The raw groundlens score value.
  • "normalized": Score mapped to [0, 1].
  • "flagged": Whether human review is recommended.
  • "method": Scoring method used ("sgi" or "dgi").
  • "explanation": Human-readable interpretation.

Example

>>> checker = GroundlensChecker()
>>> result = checker.check(
...     messages=[
...         {"role": "user", "content": "What is 2+2?"},
...         {"role": "assistant", "content": "2+2 equals 4."},
...     ],
...     sender=None,
... )
>>> isinstance(result["score"], float)
True

Source code in src/groundlens/integrations/autogen/checker.py
def check(
    self,
    messages: list[dict[str, Any]],
    sender: Any,
    **kwargs: Any,
) -> dict[str, Any]:
    """Evaluate the last message in the conversation.

    Extracts the last assistant message as the response and the
    most recent preceding user message as the question. If context
    is found in message metadata, SGI scoring is used; otherwise
    DGI is applied.

    Args:
        messages: List of conversation message dicts. Each dict should
            have ``"role"`` and ``"content"`` keys.
        sender: The AutoGen agent that sent the last message.
            Used for logging; can be ``None``.
        **kwargs: Additional keyword arguments. If a ``"context"``
            key is present, it is used for SGI evaluation.

    Returns:
        A dict containing:
            - ``"score"``: The raw groundlens score value.
            - ``"normalized"``: Score mapped to [0, 1].
            - ``"flagged"``: Whether human review is recommended.
            - ``"method"``: Scoring method used (``"sgi"`` or ``"dgi"``).
            - ``"explanation"``: Human-readable interpretation.

    Example:
        >>> checker = GroundlensChecker()
        >>> result = checker.check(
        ...     messages=[
        ...         {"role": "user", "content": "What is 2+2?"},
        ...         {"role": "assistant", "content": "2+2 equals 4."},
        ...     ],
        ...     sender=None,
        ... )
        >>> isinstance(result["score"], float)
        True
    """
    if not messages:
        logger.warning("GroundlensChecker.check called with empty messages")
        return {
            "score": None,
            "normalized": None,
            "flagged": None,
            "method": None,
            "explanation": "No messages to evaluate.",
        }

    last_message = messages[-1]
    response = str(last_message.get("content", ""))

    question = ""
    for msg in reversed(messages[:-1]):
        if msg.get("role") == "user":
            question = str(msg.get("content", ""))
            break

    if not question:
        question = response

    context: str | None = kwargs.get(self._context_key)

    if context is None:
        for msg in reversed(messages):
            msg_metadata = msg.get("metadata", {})
            if isinstance(msg_metadata, dict) and self._context_key in msg_metadata:
                context = str(msg_metadata[self._context_key])
                break

    sender_name = getattr(sender, "name", str(sender)) if sender else "unknown"
    logger.debug(
        "GroundlensChecker.check sender=%s messages=%d context=%s",
        sender_name,
        len(messages),
        "provided" if context else "none",
    )

    score: GroundlensScore = evaluate(
        question=question,
        response=response,
        context=context,
        model=self._groundlens_model,
    )

    if score.flagged:
        logger.warning(
            "GroundlensChecker FLAGGED sender=%s method=%s value=%.3f%s",
            sender_name,
            score.method,
            score.value,
            score.explanation,
        )
    else:
        logger.info(
            "GroundlensChecker OK sender=%s method=%s value=%.3f",
            sender_name,
            score.method,
            score.value,
        )

    return {
        "score": score.value,
        "normalized": score.normalized,
        "flagged": score.flagged,
        "method": score.method,
        "explanation": score.explanation,
    }
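
A common pattern is to gate replies on the "flagged" field before accepting them. A minimal sketch; the import path mirrors the CrewAI integration above (adjust if your install exposes the checker elsewhere), and the escalation step is a hypothetical placeholder:

from groundlens.integrations.autogen import GroundlensChecker  # assumed path

checker = GroundlensChecker()

def accept_reply(messages, sender, context=None):
    """Return the last reply if it passes, else None to signal a retry."""
    kwargs = {"context": context} if context is not None else {}
    result = checker.check(messages, sender=sender, **kwargs)
    if result["flagged"]:
        # Hypothetical escalation: regenerate, or route to a human reviewer.
        return None
    return messages[-1]["content"]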

Internal Modules

Internal API

The following modules are internal implementation details. They are documented here for completeness but are not part of the public API and may change without notice.

Geometry Primitives

groundlens._internal.geometry

Geometric primitives for embedding space operations.

This module provides the mathematical building blocks used by SGI and DGI. All operations are on vectors in R^n (the embedding space of a sentence transformer), which can be understood geometrically on the unit hypersphere S^(n-1) when vectors are L2-normalized.

Key concepts:

  • Euclidean distance in R^n is used by SGI to compare how far the response embedding is from the question vs. the context.

  • Displacement vectors (r - q) capture the semantic "movement" from question to response. DGI projects these onto a reference direction.

  • Unit normalization maps vectors to S^(n-1). On the unit hypersphere, dot product equals cosine similarity, and Euclidean distance is a monotonic function of angular distance.
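
These identities are easy to verify numerically. A self-contained numpy sketch (independent of groundlens internals) showing that, for unit vectors, the dot product equals cosine similarity and the squared chord distance is a monotonic function of the angle:

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)  # 384 matches all-MiniLM-L6-v2's embedding width
b = rng.normal(size=384)

# Unit normalization: project onto S^(n-1).
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cos_theta = float(a_hat @ b_hat)               # dot product == cosine similarity
chord = float(np.linalg.norm(a_hat - b_hat))   # Euclidean distance on the sphere

# ||a_hat - b_hat||^2 = 2 - 2*cos(theta): distance grows monotonically with angle.
assert np.isclose(chord**2, 2.0 - 2.0 * cos_theta)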

References

Marin (2025). Semantic Grounding Index. arXiv:2512.13771.
Marin (2026). A Geometric Taxonomy of Hallucinations. arXiv:2602.13224v3.

Functions

euclidean_distance(a: EmbeddingVector, b: EmbeddingVector) -> float

Compute Euclidean distance between two embedding vectors.

Parameters:

a (EmbeddingVector): First embedding vector, shape (d,). Required.
b (EmbeddingVector): Second embedding vector, shape (d,). Required.

Returns:

float: Non-negative scalar distance.

Source code in src/groundlens/_internal/geometry.py
def euclidean_distance(a: EmbeddingVector, b: EmbeddingVector) -> float:
    """Compute Euclidean distance between two embedding vectors.

    Args:
        a: First embedding vector, shape (d,).
        b: Second embedding vector, shape (d,).

    Returns:
        Non-negative scalar distance.
    """
    return float(np.linalg.norm(a - b))

unit_normalize(v: EmbeddingVector) -> EmbeddingVector

Project vector onto the unit hypersphere S^(n-1).

Parameters:

v (EmbeddingVector): Input vector, shape (d,). Required.

Returns:

EmbeddingVector: Unit vector v / ||v||, or the zero vector if ||v|| < epsilon.

Source code in src/groundlens/_internal/geometry.py
def unit_normalize(v: EmbeddingVector) -> EmbeddingVector:
    """Project vector onto the unit hypersphere S^(n-1).

    Args:
        v: Input vector, shape (d,).

    Returns:
        Unit vector v / ||v||, or the zero vector if ||v|| < epsilon.
    """
    norm = float(np.linalg.norm(v))
    if norm < _EPSILON:
        return v
    return v / norm

displacement_vector(question_emb: EmbeddingVector, response_emb: EmbeddingVector) -> EmbeddingVector

Compute the displacement from question to response in embedding space.

The displacement delta = phi(response) - phi(question) captures the semantic transformation applied by the LLM when generating a response. In grounded responses, this displacement aligns with a characteristic reference direction.

Parameters:

question_emb (EmbeddingVector): Question embedding, shape (d,). Required.
response_emb (EmbeddingVector): Response embedding, shape (d,). Required.

Returns:

EmbeddingVector: Displacement vector, shape (d,).

Source code in src/groundlens/_internal/geometry.py
def displacement_vector(
    question_emb: EmbeddingVector,
    response_emb: EmbeddingVector,
) -> EmbeddingVector:
    """Compute the displacement from question to response in embedding space.

    The displacement delta = phi(response) - phi(question) captures the
    semantic transformation applied by the LLM when generating a response.
    In grounded responses, this displacement aligns with a characteristic
    reference direction.

    Args:
        question_emb: Question embedding, shape (d,).
        response_emb: Response embedding, shape (d,).

    Returns:
        Displacement vector, shape (d,).
    """
    return response_emb - question_emb

cosine_similarity(a: EmbeddingVector, b: EmbeddingVector) -> float

Compute cosine similarity between two vectors.

Parameters:

a (EmbeddingVector): First vector, shape (d,). Required.
b (EmbeddingVector): Second vector, shape (d,). Required.

Returns:

float: Cosine similarity in [-1, 1]. Returns 0.0 if either vector has near-zero norm.

Source code in src/groundlens/_internal/geometry.py
def cosine_similarity(a: EmbeddingVector, b: EmbeddingVector) -> float:
    """Compute cosine similarity between two vectors.

    Args:
        a: First vector, shape (d,).
        b: Second vector, shape (d,).

    Returns:
        Cosine similarity in [-1, 1]. Returns 0.0 if either vector
        has near-zero norm.
    """
    norm_a = float(np.linalg.norm(a))
    norm_b = float(np.linalg.norm(b))
    if norm_a < _EPSILON or norm_b < _EPSILON:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))

mean_direction(vectors: list[EmbeddingVector]) -> EmbeddingVector

Compute the mean direction of a set of unit vectors.

This is the maximum-likelihood estimate of the mean direction parameter mu of a von Mises-Fisher distribution on S^(n-1).

Parameters:

vectors (list[EmbeddingVector]): List of unit-normalized vectors, each shape (d,). Required.

Returns:

EmbeddingVector: Unit-normalized mean direction, shape (d,). Zero vector if the input vectors cancel out.

Raises:

ValueError: If the input list is empty.

Source code in src/groundlens/_internal/geometry.py
def mean_direction(vectors: list[EmbeddingVector]) -> EmbeddingVector:
    """Compute the mean direction of a set of unit vectors.

    This is the maximum-likelihood estimate of the mean direction
    parameter mu of a von Mises-Fisher distribution on S^(n-1).

    Args:
        vectors: List of unit-normalized vectors, each shape (d,).

    Returns:
        Unit-normalized mean direction, shape (d,). Zero vector if
        the input vectors cancel out.

    Raises:
        ValueError: If the input list is empty.
    """
    if not vectors:
        msg = "Cannot compute mean direction of empty vector list."
        raise ValueError(msg)

    mu: EmbeddingVector = np.mean(np.stack(vectors), axis=0)
    return unit_normalize(mu)
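
For intuition, the computation reduces to averaging and renormalizing. A standalone numpy check (not importing groundlens) on two orthogonal unit vectors, whose mean direction bisects them:

import numpy as np

vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
mu = np.mean(np.stack(vectors), axis=0)   # (0.5, 0.5)
mu_hat = mu / np.linalg.norm(mu)          # renormalize to the unit circle

assert np.allclose(mu_hat, [1 / np.sqrt(2), 1 / np.sqrt(2)])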

Thresholds

groundlens._internal.thresholds

Threshold constants and normalization functions.

All thresholds are derived empirically from the experiments reported in arXiv:2512.13771 (SGI) and arXiv:2602.13224v3 (DGI).

These constants define the decision boundaries for flagging LLM outputs as potential hallucinations. They are intentionally conservative: the default behavior is to flag for human review rather than silently pass.

Attributes

SGI_STRONG_PASS: float = 1.2 module-attribute

SGI score indicating strong context engagement. Green zone.

SGI_REVIEW: float = 0.95 module-attribute

SGI score below which output is flagged for human review. Red zone.

DGI_PASS: float = 0.3 module-attribute

DGI score indicating alignment with grounded reference direction. Green zone.

Functions

normalize_sgi(raw_sgi: float) -> float

Normalize raw SGI score to [0, 1] range.

Uses a tanh mapping with offset to produce a smooth sigmoid curve:

normalized = tanh(max(0, raw - 0.3))

This maps the raw SGI range (~0.5 to ~2.0) into a [0, 1] range suitable for dashboards and threshold comparison.

Mapping reference points

SGI 0.30 → 0.000 (floor)
SGI 0.95 → 0.572 (review threshold)
SGI 1.20 → 0.716 (strong pass)
SGI 2.00 → 0.935 (very strong)

Parameters:

raw_sgi (float): The raw SGI ratio (q_dist / ctx_dist). Required.

Returns:

float: Score in [0.0, 1.0].

Source code in src/groundlens/_internal/thresholds.py
def normalize_sgi(raw_sgi: float) -> float:
    """Normalize raw SGI score to [0, 1] range.

    Uses tanh mapping with offset to produce a smooth sigmoid curve:
        normalized = tanh(max(0, raw - 0.3))

    This maps the raw SGI range (~0.5 to ~2.0) into a [0, 1] range
    suitable for dashboards and threshold comparison.

    Mapping reference points:
        SGI 0.30 → 0.000 (floor)
        SGI 0.95 → 0.572 (review threshold)
        SGI 1.20 → 0.716 (strong pass)
        SGI 2.00 → 0.935 (very strong)

    Args:
        raw_sgi: The raw SGI ratio (q_dist / ctx_dist).

    Returns:
        Score in [0.0, 1.0].
    """
    shifted = max(0.0, raw_sgi - 0.3)
    return min(1.0, max(0.0, math.tanh(shifted)))

normalize_dgi(raw_dgi: float) -> float

Normalize raw DGI score from [-1, 1] to [0, 1] range.

Simple linear mapping: normalized = (raw + 1) / 2.

Mapping reference points

DGI -1.0 → 0.000 (opposite to grounded direction)
DGI  0.0 → 0.500 (orthogonal)
DGI  0.3 → 0.650 (pass threshold)
DGI  1.0 → 1.000 (perfectly aligned)

Parameters:

raw_dgi (float): The raw DGI cosine similarity to reference direction. Required.

Returns:

float: Score in [0.0, 1.0].

Source code in src/groundlens/_internal/thresholds.py
def normalize_dgi(raw_dgi: float) -> float:
    """Normalize raw DGI score from [-1, 1] to [0, 1] range.

    Simple linear mapping: normalized = (raw + 1) / 2.

    Mapping reference points:
        DGI -1.0 → 0.000 (opposite to grounded direction)
        DGI  0.0 → 0.500 (orthogonal)
        DGI  0.3 → 0.650 (pass threshold)
        DGI  1.0 → 1.000 (perfectly aligned)

    Args:
        raw_dgi: The raw DGI cosine similarity to reference direction.

    Returns:
        Score in [0.0, 1.0].
    """
    return min(1.0, max(0.0, (raw_dgi + 1.0) / 2.0))
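
The reference points in both tables follow directly from the formulas above. A standalone sketch that re-implements the two normalizers verbatim for checking:

import math

def normalize_sgi(raw: float) -> float:
    # tanh mapping with a 0.3 offset, clamped to [0, 1].
    return min(1.0, max(0.0, math.tanh(max(0.0, raw - 0.3))))

def normalize_dgi(raw: float) -> float:
    # Linear mapping from [-1, 1] to [0, 1], clamped.
    return min(1.0, max(0.0, (raw + 1.0) / 2.0))

for raw in (0.30, 0.95, 1.20, 2.00):
    print(f"SGI {raw:.2f} -> {normalize_sgi(raw):.3f}")  # 0.000, 0.572, 0.716, 0.935
for raw in (-1.0, 0.0, 0.3, 1.0):
    print(f"DGI {raw:+.1f} -> {normalize_dgi(raw):.3f}")  # 0.000, 0.500, 0.650, 1.000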

Constants

Constant          Value                Module                           Description
SGI_STRONG_PASS   1.20                 groundlens._internal.thresholds  SGI strong pass threshold
SGI_REVIEW        0.95                 groundlens._internal.thresholds  SGI review/flag threshold
DGI_PASS          0.30                 groundlens._internal.thresholds  DGI pass threshold
DEFAULT_MODEL     "all-MiniLM-L6-v2"   groundlens._internal.embeddings  Default sentence-transformer model

Type Summary

Type                Description                 Key fields
SGIResult           SGI computation result      value, normalized, flagged, q_dist, ctx_dist
DGIResult           DGI computation result      value, normalized, flagged
GroundlensScore     Unified evaluation result   value, normalized, flagged, method, explanation, detail
CalibrationResult   DGI calibration output      model, n_pairs, embedding_dim, mu_hat, concentration
LLMResponse         Provider response wrapper   text, model, usage, groundlens_score