Index

groundlens

Groundlens — Verifiable agent triage.

Deterministic. Auditable. No second LLM in the loop.

Groundlens triages outputs from individual LLMs and from multi-agent pipelines (routing, RAG, specialized / tool-using agents). Two layers:

  • Geometric layer. SGI and DGI score grounding via embedding geometry, sub-second and deterministic. Apply to any agent's natural-language output.
  • Rule-based layer. Domain-specific rule sets with per-rule citations to academic, industrial, and regulatory sources. Per-agent factories live in groundlens.agents: routing_rules, rag_rules, and specialized_agent_rules.

Quick start:

>>> from groundlens import compute_sgi, compute_dgi, evaluate
>>>
>>> # With context (RAG verification) — uses SGI
>>> result = compute_sgi(
...     question="What is the capital of France?",
...     context="France is in Western Europe. Its capital is Paris.",
...     response="The capital of France is Paris.",
... )
>>> result.flagged
False
>>>
>>> # Without context — uses DGI
>>> result = compute_dgi(
...     question="What causes seasons?",
...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
... )
>>> result.flagged
False
>>>
>>> # Auto-select method
>>> score = evaluate(question="Q?", response="A.", context="Source.")
>>> score.method
'sgi'
>>>
>>> # Agent-specific rule triage
>>> from groundlens.agents import routing_rules, rag_rules, specialized_agent_rules
>>> rag = rag_rules()
>>> rag.name
'groundlens_banking_v1'

References

Marin (2025). Semantic Grounding Index. arXiv:2512.13771.

Marin (2026). A Geometric Taxonomy of Hallucinations. arXiv:2602.13224v3.

Marin (2026). Rotational Dynamics of Factual Constraint Processing. arXiv:2603.13259.

Marin (2026). Defendable Rules for LLM Rationale Evaluation in Banking Governance: A Multi-Source Provenance Framework.

Attributes

DEFAULT_MODEL: str = 'Snowflake/snowflake-arctic-embed-l-v2.0' module-attribute

Default sentence transformer model.

Snowflake Arctic Embed L v2.0 — 1024 dims, 568M params, multilingual (100+ languages including Spanish/Catalan/Galician/English/Portuguese), 8192 token context window. Requires trust_remote_code=True on load (the model ships custom pooling code).

Why this is the default:

  • Verified on RAGTruth (n=2,700) and RAGBench (n=8,838) with consistent SGI/DGI behavior; calibrations in cookbooks ship against this encoder.
  • L2-normalizes embeddings naturally (contrastive training), which keeps the canonical angular SGI formulation numerically stable.
  • Multilingual out-of-the-box — relevant for European bank deployments.

When to override:

  • Lightweight deployment (CPU-only, latency-critical): use LIGHTWEIGHT_MINILM = "all-MiniLM-L6-v2" (22M params, 384 dims). This was the default through groundlens 2026.6.17.
  • Spanish/multilingual smaller footprint: use MULTILINGUAL_MINI (118M params, 384 dims).
  • Higher quality multilingual at higher cost: use MULTILINGUAL_E5 (560M params, 1024 dims) with required "query: "/"passage: " prefixes.

To override the default, pass model="..." to compute_sgi, compute_dgi, or the corresponding scorer classes.

LIGHTWEIGHT_MINILM: str = 'all-MiniLM-L6-v2' module-attribute

Lightweight English-only encoder (22M params, 384 dims). Was the default through groundlens 2026.6.17. Use for latency-critical CPU-only deployments where the trade-off in grounding signal quality is acceptable.

MULTILINGUAL_E5: str = 'intfloat/multilingual-e5-large' module-attribute

Multilingual E5 (560M params, 1024 dims, 100+ languages). Higher quality than MULTILINGUAL_MINI at ~5x the inference cost. Choose when latency budget allows it (e.g. batch evaluation, audit replay) and the deployment domain has shown weak separation under MiniLM. Requires prefixing queries with "query: " and passages with "passage: " to match the encoder's training recipe; see model card on HuggingFace.
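A minimal sketch of the prefixing convention described above. The helper name e5_prefix is hypothetical and not part of groundlens. This page does not say whether groundlens adds the prefixes itself when MULTILINGUAL_E5 is selected, so the helper skips texts that already carry one.

```python
# Hypothetical helper (not a groundlens API): add the E5 role prefix to raw texts.
def e5_prefix(texts: list[str], role: str) -> list[str]:
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    prefix = f"{role}: "
    # Leave already-prefixed texts untouched so the helper is idempotent.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = e5_prefix(["What is the capital of France?"], "query")
passages = e5_prefix(["France is in Western Europe. Its capital is Paris."], "passage")
print(queries[0])
print(passages[0])
```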

MULTILINGUAL_MINI: str = 'paraphrase-multilingual-MiniLM-L12-v2' module-attribute

Multilingual MiniLM (118M params, 384 dims, 50+ languages including Spanish, Catalan, Galician, English). Sub-second on CPU. Recommended default for European-bank customer-support deployments where the WhatsApp / app channel receives queries across the bank's operating languages. Calibrate mu_hat and SGI threshold on a multilingual verified-grounded corpus for the expected query distribution.

Classes

CalibrationResult(model: str, n_pairs: int, embedding_dim: int, mu_hat: NDArray[np.float32], concentration: float, metadata: dict[str, str] = dict()) dataclass

Result of DGI calibration.

Attributes:

Name Type Description
model str

Sentence transformer model used for calibration.

n_pairs int

Number of (question, response) pairs used.

embedding_dim int

Dimensionality of the embedding space.

mu_hat NDArray[float32]

The computed reference direction vector.

concentration float

Estimated concentration parameter (kappa) of the von Mises-Fisher distribution. Higher values indicate more consistent displacement directions in the reference data.
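To make mu_hat and concentration concrete, here is a sketch that computes a mean direction and a kappa estimate from a handful of displacement vectors. It rests on two assumptions: that each displacement is normalized to unit length before averaging, and that kappa is estimated with the closed-form approximation of Banerjee et al. (2005). groundlens may use a different estimator. The function names are illustrative.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def mean_direction_and_kappa(displacements):
    """Return (mu_hat, kappa) for a list of displacement vectors.

    Assumption: kappa uses the Banerjee et al. approximation
    kappa ~= r_bar * (d - r_bar^2) / (1 - r_bar^2),
    where r_bar is the length of the mean of the unit vectors.
    """
    units = [unit(d) for d in displacements]
    n, dim = len(units), len(units[0])
    mean = [sum(u[i] for u in units) / n for i in range(dim)]
    r_bar = math.sqrt(sum(x * x for x in mean))
    if r_bar == 0.0:
        raise ValueError("directions cancel out; no mean direction")
    mu_hat = [x / r_bar for x in mean]
    denom = 1.0 - r_bar * r_bar
    # Perfectly aligned directions give an unbounded concentration.
    kappa = math.inf if denom <= 1e-12 else r_bar * (dim - r_bar * r_bar) / denom
    return mu_hat, kappa


# Made-up 3-d displacements: one tightly clustered set, one spread out.
tight = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.99, -0.1, 0.0]]
loose = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mu_tight, kappa_tight = mean_direction_and_kappa(tight)
mu_loose, kappa_loose = mean_direction_and_kappa(loose)
print(kappa_tight > kappa_loose)
```

The tightly clustered set yields the larger kappa, which matches the reading given above: higher values indicate more consistent displacement directions.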

Methods:
save(path: str | Path) -> None

Save calibration result to JSON.

Parameters:

Name Type Description Default
path str | Path

Output file path. The mu_hat vector is stored as a list.

required
Source code in src/groundlens/calibrate.py
def save(self, path: str | Path) -> None:
    """Save calibration result to JSON.

    Args:
        path: Output file path. The mu_hat vector is stored as a list.
    """
    data = {
        "model": self.model,
        "n_pairs": self.n_pairs,
        "embedding_dim": self.embedding_dim,
        "mu_hat": self.mu_hat.tolist(),
        "concentration": self.concentration,
        "metadata": self.metadata,
    }
    Path(path).write_text(json.dumps(data, indent=2), encoding="utf-8")
    logger.info("Calibration saved to %s.", path)
load(path: str | Path) -> CalibrationResult classmethod

Load a saved calibration result.

Parameters:

Name Type Description Default
path str | Path

Path to JSON calibration file.

required

Returns:

Type Description
CalibrationResult

CalibrationResult instance with restored mu_hat vector.

Source code in src/groundlens/calibrate.py
@classmethod
def load(cls, path: str | Path) -> CalibrationResult:
    """Load a saved calibration result.

    Args:
        path: Path to JSON calibration file.

    Returns:
        CalibrationResult instance with restored mu_hat vector.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return cls(
        model=data["model"],
        n_pairs=data["n_pairs"],
        embedding_dim=data["embedding_dim"],
        mu_hat=np.array(data["mu_hat"], dtype=np.float32),
        concentration=data["concentration"],
        metadata=data.get("metadata", {}),
    )
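
Because the file is plain JSON, it can be sanity-checked before it is handed to load. The sketch below uses only the key names written by save above. The helper check_calibration_file and the payload values are illustrative assumptions, not part of groundlens.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("model", "n_pairs", "embedding_dim", "mu_hat", "concentration")


# Hypothetical pre-load check (not a groundlens API).
def check_calibration_file(path, expected_model):
    """Read a calibration JSON file and verify it matches the encoder in use."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"calibration file is missing keys: {missing}")
    if data["model"] != expected_model:
        raise ValueError(
            f"calibrated with {data['model']!r}, but scoring uses {expected_model!r}"
        )
    if len(data["mu_hat"]) != data["embedding_dim"]:
        raise ValueError("mu_hat length does not match embedding_dim")
    return data


# Illustrative payload with made-up values, in the layout save() writes.
payload = {
    "model": "all-MiniLM-L6-v2",
    "n_pairs": 2,
    "embedding_dim": 3,
    "mu_hat": [0.6, 0.8, 0.0],
    "concentration": 12.5,
    "metadata": {"domain": "demo"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "calibration.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    checked = check_calibration_file(path, "all-MiniLM-L6-v2")

print(checked["n_pairs"])
```

Checking the model name matters because a reference direction is only meaningful in the embedding space of the encoder that produced it.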

ThresholdFit(sgi_review: float | None, dgi_pass: float | None, n: int, model: str, metric: str = 'youden_j') dataclass

Fitted decision thresholds for SGI and DGI on a labeled set.

Thresholds are chosen by maximizing Youden's J for the rule "value >= threshold implies grounded" over the supplied examples.

Attributes:

Name Type Description
sgi_review float | None

Fitted SGI review threshold, or None if no contexts were supplied (SGI requires context).

dgi_pass float | None

Fitted DGI pass threshold, or None if it could not be estimated.

n int

Number of examples used for fitting.

model str

Sentence transformer model the scores were computed with.

metric str

Name of the criterion used to pick thresholds.
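
The youden_j criterion can be sketched in a few lines. Youden's J is sensitivity plus specificity minus one, which equals the true-positive rate minus the false-positive rate. The function below is an illustration under stated assumptions, not the groundlens fitting routine: it tries every observed value as a threshold and resolves ties in favour of the lowest one.

```python
# Illustrative sketch; fit_threshold_youden is a hypothetical name.
def fit_threshold_youden(values, grounded):
    """Pick the threshold maximizing J for 'value >= threshold implies grounded'.

    Returns (threshold, J), or (None, 0.0) when one class is absent.
    """
    n_pos = sum(1 for g in grounded if g)
    n_neg = len(grounded) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None, 0.0  # J is undefined without both classes
    best_t, best_j = None, -1.0
    for t in sorted(set(values)):
        tp = sum(1 for v, g in zip(values, grounded) if g and v >= t)
        fp = sum(1 for v, g in zip(values, grounded) if not g and v >= t)
        j = tp / n_pos - fp / n_neg  # TPR - FPR
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Made-up scores and labels for illustration.
scores = [0.2, 0.4, 0.5, 0.7, 0.9]
labels = [False, False, True, True, True]
threshold, j = fit_threshold_youden(scores, labels)
print(threshold, j)  # 0.5 1.0
```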

DGI(model: str = DEFAULT_MODEL, reference_csv: str | None = None, encoder: EmbeddingFn | None = None)

Reusable DGI scorer with pre-configured model and calibration.

Use this class when evaluating multiple responses against the same reference direction. Supports both bundled and custom calibration.

Example

>>> dgi = DGI()
>>> result = dgi.score(
...     question="What is ML?",
...     response="ML is a branch of AI.",
... )
>>> result.flagged
False
>>>
>>> dgi = DGI(reference_csv="my_domain_pairs.csv")
>>> result = dgi.score(question="...", response="...")

Initialize DGI scorer.

Parameters:

Name Type Description Default
model str

Sentence transformer model name.

DEFAULT_MODEL
reference_csv str | None

Path to domain-specific calibration CSV.

None
encoder EmbeddingFn | None

Optional bring-your-own-embeddings callable. When set, both calibration and scoring bypass sentence-transformers (no torch required).

None
Source code in src/groundlens/dgi.py
def __init__(
    self,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
    encoder: EmbeddingFn | None = None,
) -> None:
    """Initialize DGI scorer.

    Args:
        model: Sentence transformer model name.
        reference_csv: Path to domain-specific calibration CSV.
        encoder: Optional bring-your-own-embeddings callable. When set,
            both calibration and scoring bypass sentence-transformers
            (no torch required).
    """
    self.model = model
    self.reference_csv = reference_csv
    self.encoder = encoder
Methods:
calibrate(pairs: list[tuple[str, str]] | None = None, csv_path: str | None = None) -> None

Set custom calibration data.

Either provide pairs directly or a path to a CSV file. This replaces any previously cached reference direction.

Parameters:

Name Type Description Default
pairs list[tuple[str, str]] | None

List of verified (question, response) tuples.

None
csv_path str | None

Path to a calibration CSV file.

None

Raises:

Type Description
ValueError

If neither pairs nor csv_path is provided.

Source code in src/groundlens/dgi.py
def calibrate(
    self,
    pairs: list[tuple[str, str]] | None = None,
    csv_path: str | None = None,
) -> None:
    """Set custom calibration data.

    Either provide pairs directly or a path to a CSV file.
    This replaces any previously cached reference direction.

    Args:
        pairs: List of verified (question, response) tuples.
        csv_path: Path to a calibration CSV file.

    Raises:
        ValueError: If neither ``pairs`` nor ``csv_path`` is provided.
    """
    enc_id = id(self.encoder) if self.encoder is not None else None

    if csv_path is not None:
        self.reference_csv = csv_path
        # Force recomputation on next score() call.
        cache_key = (self.model, csv_path, enc_id)
        _mu_hat_cache.pop(cache_key, None)
        return

    if pairs is not None:
        # Compute and cache the reference direction directly.
        mu = _compute_reference_direction(pairs, self.model, encoder=self.encoder)
        cache_key = (self.model, "__inline__", enc_id)
        _mu_hat_cache[cache_key] = mu
        self.reference_csv = "__inline__"
        return

    msg = "Provide either 'pairs' or 'csv_path' for calibration."
    raise ValueError(msg)
score(question: str, response: str) -> DGIResult

Compute DGI for a single response.

Parameters:

Name Type Description Default
question str

The input query.

required
response str

The LLM output to evaluate.

required

Returns:

Type Description
DGIResult

DGIResult with score and flag status.

Raises:

Type Description
RuntimeError

If calibrate(pairs=...) has not been called yet on this instance and reference_csv is the inline sentinel.

Source code in src/groundlens/dgi.py
def score(self, question: str, response: str) -> DGIResult:
    """Compute DGI for a single response.

    Args:
        question: The input query.
        response: The LLM output to evaluate.

    Returns:
        DGIResult with score and flag status.

    Raises:
        RuntimeError: If ``calibrate(pairs=...)`` has not been called
            yet on this instance and ``reference_csv`` is the inline
            sentinel.
    """
    if self.reference_csv == "__inline__":
        # Guard: the inline mu_hat must already be in the cache, since
        # there is no on-disk CSV to fall back to.
        enc_id = id(self.encoder) if self.encoder is not None else None
        cache_key = (self.model, "__inline__", enc_id)
        if cache_key not in _mu_hat_cache:
            msg = "Call calibrate() before score() when using inline pairs."
            raise RuntimeError(msg)

    # Pass reference_csv through unchanged. ``_get_mu_hat`` resolves:
    #   None         -> bundled mu_hat
    #   real path    -> load CSV, compute mu_hat
    #   "__inline__" -> hit the cache populated by calibrate(pairs=...)
    return compute_dgi(
        question=question,
        response=response,
        model=self.model,
        reference_csv=self.reference_csv,
        encoder=self.encoder,
    )
propose_labels(*, seeds: list[SeedExample], llm_generate: Callable[[str], str], n_candidates: int = 50, n_to_label: int = 10, strategies: str | tuple[str | tuple[str, str], ...] = 'default', diverse_fraction: float = 0.3, seed: int = 42) -> PropositionBatch

Active-learning bootstrap of a verified-grounded calibration set.

Given 10-50 verified-grounded SeedExample triples and a text-generation callable, this method:

  1. Picks a seed at random for each candidate and rewrites its grounded response under one of the named confabulation strategies, using the seed's own context as the source of truth in the prompt. Coherence is preserved by design: the prompt never sees a mismatched context+question pair.
  2. Scores each generated candidate with this DGI.
  3. Ranks candidates by acquisition score (70% uncertainty / 30% strategy diversity at the default diverse_fraction=0.3) and returns the top n_to_label for a human reviewer.

The method DOES NOT label and DOES NOT calibrate. The human reviewer assigns the labels; the caller then passes the labelled grounded subset to calibrate. The loop is non-circular by design.

Parameters:

Name Type Description Default
seeds list[SeedExample]

10-50 verified-grounded SeedExample triples. Each carries its own context, question and grounded response, so the generation prompt is always coherent.

required
llm_generate Callable[[str], str]

A callable (prompt: str) -> str that the user provides (an OpenAI / Anthropic / local LLM wrapper). groundlens does not embed an LLM.

required
n_candidates int

Total candidates to generate across all strategies. Default 50 (≈200 s of generation at 4 s/call).

50
n_to_label int

How many candidates the batch should contain. Default 10. All generated candidates, including these, are returned in batch.all_candidates for audit.

10
strategies str | tuple[str | tuple[str, str], ...]

"default" (all five strategies from groundlens-dev/grounding-benchmark), or a tuple of strategy names, or a tuple of (name, prompt_template) custom pairs. Templates accept the slots {context}, {question}, {grounded}.

'default'
diverse_fraction float

Fraction of the batch reserved for strategy diversity (the rest is filled by uncertainty). Default 0.3.

0.3
seed int

Random seed for sampling seeds across rounds. Determinism is required for reproducible audits.

42

Returns:

Type Description
PropositionBatch

A PropositionBatch ready for human review.

Raises:

Type Description
ValueError

If seeds is empty or n_candidates < 1.

TypeError

If llm_generate is not callable, or any element of seeds is not a SeedExample.

Source code in src/groundlens/dgi.py
def propose_labels(
    self,
    *,
    seeds: list[SeedExample],
    llm_generate: Callable[[str], str],
    n_candidates: int = 50,
    n_to_label: int = 10,
    strategies: str | tuple[str | tuple[str, str], ...] = "default",
    diverse_fraction: float = 0.3,
    seed: int = 42,
) -> PropositionBatch:
    """Active-learning bootstrap of a verified-grounded calibration set.

    Given 10-50 verified-grounded :class:`SeedExample` triples and a
    text-generation callable, this method:

    1. Picks a seed at random for each candidate and rewrites its
       ``grounded`` response under one of the named confabulation
       strategies, using the seed's own ``context`` as the source
       of truth in the prompt. Coherence is preserved by design --
       the prompt never sees a mismatched context+question pair.
    2. Scores each generated candidate with this DGI.
    3. Ranks candidates by acquisition score (70% uncertainty /
       30% strategy diversity) and returns the top ``n_to_label``
       for a human reviewer.

    The method DOES NOT label and DOES NOT calibrate. The human
    reviewer assigns the labels; the caller then passes the labelled
    grounded subset to :meth:`calibrate`. The loop is non-circular
    by design.

    Args:
        seeds: 10-50 verified-grounded :class:`SeedExample` triples.
            Each carries its own ``context``, ``question`` and
            ``grounded`` response, so the generation prompt is
            always coherent.
        llm_generate: A callable ``(prompt: str) -> str`` that the
            user provides (an OpenAI / Anthropic / local LLM
            wrapper). groundlens does not embed an LLM.
        n_candidates: Total candidates to generate across all
            strategies. Default 50 (≈200 s of generation at 4 s/call).
        n_to_label: How many candidates the batch should contain.
            Default 10. All generated candidates, including these,
            are returned in ``batch.all_candidates`` for audit.
        strategies: ``"default"`` (all five strategies from
            ``groundlens-dev/grounding-benchmark``), or a tuple of
            strategy names, or a tuple of ``(name, prompt_template)``
            custom pairs. Templates accept the slots ``{context}``,
            ``{question}``, ``{grounded}``.
        diverse_fraction: Fraction of the batch reserved for
            strategy diversity (the rest is filled by uncertainty).
            Default 0.3.
        seed: Random seed for sampling seeds across rounds.
            Determinism is required for reproducible audits.

    Returns:
        A :class:`groundlens.PropositionBatch` ready for human review.

    Raises:
        ValueError: If ``seeds`` is empty or ``n_candidates`` < 1.
        TypeError: If ``llm_generate`` is not callable, or any
            element of ``seeds`` is not a ``SeedExample``.
    """
    import random
    import warnings

    from groundlens._internal.strategies import resolve_strategies
    from groundlens.propose import (
        ProposedLabel,
        PropositionBatch,
        SeedExample,
        _uncertainty,
        build_review_template,
        rank_for_labelling,
    )

    if not seeds:
        msg = "seeds must contain at least one SeedExample."
        raise ValueError(msg)
    if not all(isinstance(s, SeedExample) for s in seeds):
        msg = (
            "Every item in seeds must be a SeedExample(context=..., "
            "question=..., grounded=...) instance."
        )
        raise TypeError(msg)
    if n_candidates < 1:
        msg = "n_candidates must be >= 1."
        raise ValueError(msg)
    if not callable(llm_generate):
        msg = "llm_generate must be a callable (prompt: str) -> str."
        raise TypeError(msg)

    resolved_strategies = resolve_strategies(strategies)
    if not resolved_strategies:
        msg = "At least one strategy must be specified."
        raise ValueError(msg)

    # Threshold: median DGI score on the seed grounded pairs. This is
    # a reasonable proxy for the boundary between grounded and
    # ungrounded when no calibrated threshold is available yet.
    seed_scores = [self.score(s.question, s.grounded).normalized for s in seeds]
    sorted_scores = sorted(seed_scores)
    n = len(sorted_scores)
    median = (
        sorted_scores[n // 2]
        if n % 2 == 1
        else 0.5 * (sorted_scores[n // 2 - 1] + sorted_scores[n // 2])
    )
    threshold = float(median)

    rng = random.Random(seed)

    # Round-robin across strategies. For each candidate, sample ONE
    # seed and pass its OWN (context, question, grounded) through
    # the strategy template. No more mismatched context/seed pairs.
    candidates: list[ProposedLabel] = []
    per_strategy = max(1, n_candidates // len(resolved_strategies))
    for strat_name, template in resolved_strategies:
        for _ in range(per_strategy):
            if len(candidates) >= n_candidates:
                break
            anchor = rng.choice(seeds)
            prompt = template.format(
                context=anchor.context,
                question=anchor.question,
                grounded=anchor.grounded,
            )
            try:
                candidate_resp = llm_generate(prompt)
            except Exception as exc:
                msg = (
                    f"llm_generate raised {type(exc).__name__}: {exc}. "
                    "Skipping this candidate."
                )
                warnings.warn(msg, RuntimeWarning, stacklevel=2)
                continue

            if not isinstance(candidate_resp, str) or not candidate_resp.strip():
                continue

            score = self.score(anchor.question, candidate_resp).normalized
            candidates.append(
                ProposedLabel(
                    question=anchor.question,
                    candidate_response=candidate_resp.strip(),
                    dgi_score=float(score),
                    strategy=strat_name,
                    context_excerpt=anchor.context,
                    uncertainty=_uncertainty(float(score), threshold),
                )
            )

    # Rank for labelling (uncertainty + diversity).
    ranked = rank_for_labelling(
        candidates,
        n_to_label=n_to_label,
        diverse_fraction=diverse_fraction,
    )

    # Audit: keep all candidates, ordered by uncertainty.
    all_ordered = sorted(candidates, key=lambda c: c.uncertainty)

    return PropositionBatch(
        items=tuple(ranked),
        review_template=build_review_template(ranked),
        all_candidates=tuple(all_ordered),
        strategies_used=tuple(name for name, _ in resolved_strategies),
    )
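
The threshold and ranking steps can be illustrated without an LLM or an encoder. The helpers _uncertainty and rank_for_labelling are not shown on this page, so everything below is an assumption: uncertainty is taken as the absolute distance from the seed median, and rank_sketch is a hypothetical stand-in that fills most slots by uncertainty and reserves diverse_fraction of them for strategies not yet represented.

```python
from statistics import median


# Assumption: uncertainty is the absolute distance from the threshold.
def uncertainty(score: float, threshold: float) -> float:
    return abs(score - threshold)


# Hypothetical stand-in for rank_for_labelling; the real ranking may differ.
def rank_sketch(cands, n_to_label, diverse_fraction):
    """cands: list of (strategy, uncertainty) tuples. Returns the picked subset."""
    n_diverse = round(n_to_label * diverse_fraction)
    n_uncertain = n_to_label - n_diverse
    # Indices sorted so the most uncertain (smallest distance) come first.
    order = sorted(range(len(cands)), key=lambda i: cands[i][1])
    picked = order[:n_uncertain]
    seen = {cands[i][0] for i in picked}
    # Diversity slots: most uncertain candidate of each unseen strategy.
    for i in order[n_uncertain:]:
        if len(picked) >= n_to_label:
            break
        if cands[i][0] not in seen:
            picked.append(i)
            seen.add(cands[i][0])
    # Backfill by uncertainty if there were too few distinct strategies.
    for i in order:
        if len(picked) >= n_to_label:
            break
        if i not in picked:
            picked.append(i)
    return [cands[i] for i in picked]


# Made-up DGI scores for the seeds and for five generated candidates.
seed_scores = [0.2, 0.4, 0.6, 0.8]
threshold = median(seed_scores)  # same median rule as the source above
scored = [("a", 0.48), ("a", 0.55), ("a", 0.45), ("b", 0.9), ("c", 0.05)]
cands = [(name, uncertainty(score, threshold)) for name, score in scored]

batch = rank_sketch(cands, n_to_label=4, diverse_fraction=0.5)
print([name for name, _ in batch])  # ['a', 'a', 'b', 'c']
```

Note also that the generation loop above draws max(1, n_candidates // number_of_strategies) candidates per strategy, so the total can fall below n_candidates when the division is not exact or when llm_generate fails or returns an empty string.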

ProposedLabel(question: str, candidate_response: str, dgi_score: float, strategy: str, context_excerpt: str, uncertainty: float) dataclass

One candidate (question, response) pair ready for human review.

Attributes:

Name Type Description
question str

A question grounded in one of the FAQ-corpus entries.

candidate_response str

A confabulated response written by the generation LLM under the named strategy.

dgi_score float

The DGI normalized score of the candidate against the current mu_hat. Lower scores mean stronger deferral signal.

strategy str

The name of the confabulation strategy that produced this candidate (e.g. "redefinition").

context_excerpt str

The FAQ excerpt the question was anchored to.

uncertainty float

Distance of dgi_score from the threshold used for ranking. Smaller = more uncertain = higher priority.

PropositionBatch(items: tuple[ProposedLabel, ...], review_template: str, all_candidates: tuple[ProposedLabel, ...] = tuple(), strategies_used: tuple[str, ...] = tuple()) dataclass

A batch of candidates returned by groundlens.DGI.propose_labels.

Attributes:

Name Type Description
items tuple[ProposedLabel, ...]

Candidates ordered by acquisition score (most useful to label first). Length up to n_to_label.

review_template str

A Markdown template instructing the human reviewer how to label the items in the batch.

all_candidates tuple[ProposedLabel, ...]

Every candidate generated in the round, ordered by uncertainty (most uncertain first). Useful for audit and debugging.

strategies_used tuple[str, ...]

The tuple of strategy names actually used.

SeedExample(context: str, question: str, grounded: str) dataclass

One verified-grounded triple you supply to DGI.propose_labels.

A SeedExample binds a FAQ paragraph (context) to a question that paragraph answers (question) and the verified-grounded response to that question (grounded). Bundling the three together is what keeps the candidate generation coherent: the confabulation prompt receives the same context, question and grounded answer rather than randomly-paired pieces.

Attributes:

Name Type Description
context str

A paragraph from the deployment's FAQ corpus that supports the grounded response.

question str

A question whose answer is contained in context.

grounded str

The verified-grounded response to question given context. The confabulation strategies rewrite this response under specific failure modes.

Raises:

Type Description
ValueError

If any field is empty or whitespace-only.

Methods:
__post_init__() -> None

Validate that every field is a non-empty, non-whitespace string.

Source code in src/groundlens/propose.py
def __post_init__(self) -> None:
    """Validate that every field is a non-empty, non-whitespace string."""
    for name in ("context", "question", "grounded"):
        value = getattr(self, name)
        if not isinstance(value, str) or not value.strip():
            msg = f"SeedExample.{name} must be a non-empty string."
            raise ValueError(msg)
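
Since construction raises on blank fields, raw rows are best screened before they reach propose_labels. The sketch below uses a stand-in dataclass with the same validation so that it runs without groundlens. SeedExampleSketch and split_rows are hypothetical names, and the rows are made up.

```python
from dataclasses import dataclass


@dataclass
class SeedExampleSketch:
    """Stand-in that mirrors the validation shown above (not the real class)."""

    context: str
    question: str
    grounded: str

    def __post_init__(self) -> None:
        for name in ("context", "question", "grounded"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"SeedExample.{name} must be a non-empty string.")


def split_rows(rows):
    """Return (valid seeds, rejected rows as (index, reason) pairs)."""
    valid, rejected = [], []
    for index, row in enumerate(rows):
        try:
            valid.append(SeedExampleSketch(**row))
        except (ValueError, TypeError) as exc:  # TypeError covers missing keys
            rejected.append((index, str(exc)))
    return valid, rejected


rows = [
    {
        "context": "Transfers cost 2 EUR each.",
        "question": "What is the transfer fee?",
        "grounded": "Each transfer costs 2 EUR.",
    },
    {"context": "Cards are free.", "question": "Is the card free?", "grounded": "   "},
]
valid, rejected = split_rows(rows)
print(len(valid), rejected)
```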

ChecklistRule(id: str, description: str, weight: float, sub_score: str, check: Callable[[str, str, str | None, dict[str, Any]], RuleEvidence], citation: str = '') dataclass

A single rule with an id, a pattern check, and a weight.

Rules are designed to be readable: id and description are surfaced verbatim in the audit explanation. The check callable returns a RuleEvidence so the audit trail records why the rule fired, not just that it did.

Attributes:

Name Type Description
id str

Stable identifier (e.g. "spec.reg_flag"). Used in audit logs.

description str

One-line human-readable description of the rule.

weight float

Contribution to the parent sub-score when matched, in [0, 1]. Sub-scores are capped at 1.0 even when weights sum higher.

sub_score str

Which sub-score this rule contributes to. For the legacy banking_rules() set: "spec", "expl", or "bshift". For the current groundlens_banking_rules() set: "groundedness", "completeness", "calibration", "traceability", or "robustness". Custom rule sets may define additional categories.

check Callable[[str, str, str | None, dict[str, Any]], RuleEvidence]

Pure function (question, response, context, metadata) -> RuleEvidence. Must be deterministic.

citation str

Free-text academic / industry / regulatory provenance for the rule, suitable for inclusion in an audit explanation or a regulatory submission. Empty string when no citation is provided. Example: "RAGAs (Es et al., EACL 2024) §3 Faithfulness".

RuleEvidence(matched: bool, span: str, explanation: str) dataclass

A single piece of evidence supporting a rule's pass/fail decision.

Attributes:

Name Type Description
matched bool

Whether the rule pattern matched the input text.

span str

The substring (lowercased) that triggered the match, or "" if no match was found.

explanation str

Short human-readable note describing what was checked.

RuleResult(rule_id: str, sub_score: str, weight: float, matched: bool, evidence_span: str, explanation: str) dataclass

Outcome of evaluating a single rule.

Attributes:

Name Type Description
rule_id str

The ChecklistRule.id that produced this result.

sub_score str

Which sub-score this rule contributes to.

weight float

The weight of the rule (echo of ChecklistRule.weight).

matched bool

Whether the rule fired.

evidence_span str

The substring that triggered the match, if any.

explanation str

The rule's human-readable explanation.

RuleSet(name: str, rules: tuple[ChecklistRule, ...], sub_scores: tuple[str, ...] = ('spec', 'expl', 'bshift'), quality_floor: float = _DEFAULT_QUALITY_FLOOR, flag_predicate: Callable[[dict[str, float]], bool] | None = None) dataclass

A collection of rules evaluated together against a (q, r, ctx) triple.

Use groundlens_banking_rules() for the current canonical five-category ruleset, banking_rules() for the legacy three-category ruleset, or construct your own by passing a tuple of ChecklistRule instances along with the tuple of sub-score categories the rules contribute to.

Attributes:

Name Type Description
name str

Identifier (e.g. "groundlens_banking_v1"). Surfaced in audit logs.

rules tuple[ChecklistRule, ...]

The rules to evaluate.

sub_scores tuple[str, ...]

Ordered tuple of sub-score category names this ruleset produces. Rules whose sub_score field is not in this tuple are ignored at aggregation time (their evidence is still recorded in RuleSetResult.rule_results). Default ("spec", "expl", "bshift") preserves legacy behavior.

quality_floor float

Default flag-predicate threshold below which a sub-score triggers the audit-deficiency flag. Used only when flag_predicate is None, in which case it is applied to spec and expl.

flag_predicate Callable[[dict[str, float]], bool] | None

Optional pure function dict[str, float] -> bool that decides whether the aggregated result is flagged. When None, the default legacy predicate is used: flagged iff spec < quality_floor or expl < quality_floor.
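
A custom flag_predicate only needs to map the sub-score dict to a bool. The factory below is a sketch: make_floor_predicate and the floor values are illustrative assumptions. It flags a result when any listed sub-score falls below its own floor, which is one way to flag on the five-category names instead of the legacy spec and expl.

```python
# Hypothetical factory (not a groundlens API): per-category floors.
def make_floor_predicate(floors: dict[str, float]):
    def predicate(sub_scores: dict[str, float]) -> bool:
        # A missing sub-score counts as 0.0, so it trips its floor.
        return any(
            sub_scores.get(name, 0.0) < floor for name, floor in floors.items()
        )

    return predicate


# Made-up floors for two of the five categories.
flag = make_floor_predicate({"groundedness": 0.5, "traceability": 0.3})

print(flag({"groundedness": 0.8, "traceability": 0.2}))  # True
print(flag({"groundedness": 0.8, "traceability": 0.4}))  # False
print(flag({"groundedness": 0.8}))                       # True
```

The returned function is pure and deterministic, so the same sub-scores always give the same flag.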

Methods:
evaluate(*, question: str, response: str, context: str | None = None, metadata: dict[str, Any] | None = None) -> RuleSetResult

Evaluate the ruleset against a single (question, response) pair.

Parameters:

Name Type Description Default
question str

The user query / prompt the LLM received.

required
response str

The LLM's rationale text being audited.

required
context str | None

Optional retrieved context (RAG-style). May be None when no retrieval was performed.

None
metadata dict[str, Any] | None

Optional dict carrying domain-specific structured data that some rules may consult (e.g. the case parameters in a banking decision: risk score, flags, amount, etc.).

None

Returns:

Type Description
RuleSetResult

A RuleSetResult with all sub-scores, the aggregated quality, and a full audit explanation.

Raises:

Type Description
ValueError

If response is empty or whitespace-only.

Source code in src/groundlens/rules.py
def evaluate(
    self,
    *,
    question: str,
    response: str,
    context: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> RuleSetResult:
    """Evaluate the ruleset against a single (question, response) pair.

    Args:
        question: The user query / prompt the LLM received.
        response: The LLM's rationale text being audited.
        context: Optional retrieved context (RAG-style). May be ``None``
            when no retrieval was performed.
        metadata: Optional dict carrying domain-specific structured data
            that some rules may consult (e.g. the case parameters in a
            banking decision: risk score, flags, amount, etc.).

    Returns:
        A :class:`RuleSetResult` with all sub-scores, the aggregated
        quality, and a full audit explanation.

    Raises:
        ValueError: If ``response`` is empty.
    """
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    meta = metadata or {}

    results: list[RuleResult] = []
    weights_by_sub: dict[str, float] = dict.fromkeys(self.sub_scores, 0.0)

    for rule in self.rules:
        evidence = rule.check(question, response, context, meta)
        results.append(
            RuleResult(
                rule_id=rule.id,
                sub_score=rule.sub_score,
                weight=rule.weight,
                matched=evidence.matched,
                evidence_span=evidence.span,
                explanation=evidence.explanation,
            )
        )
        if evidence.matched and rule.sub_score in weights_by_sub:
            weights_by_sub[rule.sub_score] += rule.weight

    sub_scores: dict[str, float] = {
        name: round(min(1.0, weights_by_sub[name]), 4) for name in self.sub_scores
    }

    product = 1.0
    for value in sub_scores.values():
        product *= value
    n = len(sub_scores)
    quality = round(product ** (1.0 / n), 4) if product > 0 and n > 0 else 0.0

    if self.flag_predicate is not None:
        flagged = bool(self.flag_predicate(sub_scores))
    else:
        # Legacy default: flagged iff spec or expl below quality_floor.
        flagged = (sub_scores.get("spec", 0.0) < self.quality_floor) or (
            sub_scores.get("expl", 0.0) < self.quality_floor
        )

    audit = _format_audit_explanation(
        ruleset_name=self.name,
        sub_scores=sub_scores,
        quality=quality,
        flagged=flagged,
        quality_floor=self.quality_floor,
        results=results,
    )

    return RuleSetResult(
        sub_scores=sub_scores,
        quality=quality,
        flagged=flagged,
        rule_results=tuple(results),
        audit_explanation=audit,
    )

RuleSetResult(sub_scores: dict[str, float], quality: float, flagged: bool, rule_results: tuple[RuleResult, ...], audit_explanation: str) dataclass

Aggregated result of evaluating a :class:RuleSet against a response.

Each sub-score is a capped weight sum of matched rules in that category, stored in the :attr:sub_scores mapping. quality is the geometric mean of all sub-score values: any zero sub-score yields quality = 0.0, reflecting that a rationale missing any audited dimension is structurally incomplete for human review.
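The aggregation described above can be sketched in plain Python. This is a minimal illustration rather than the library's code path: `aggregate` and the sample `(sub_score, weight)` tuples are hypothetical names, and only the capping, rounding, and geometric-mean steps are taken from the documented behaviour.

```python
# Illustrative sketch only: `aggregate` is a hypothetical helper, not part of groundlens.
def aggregate(matched_rules, sub_score_names):
    """Cap each sub-score's matched weight at 1.0, then take the geometric mean."""
    totals = dict.fromkeys(sub_score_names, 0.0)
    for sub_score, weight in matched_rules:
        if sub_score in totals:
            totals[sub_score] += weight
    sub_scores = {name: round(min(1.0, totals[name]), 4) for name in sub_score_names}

    product = 1.0
    for value in sub_scores.values():
        product *= value
    n = len(sub_scores)
    quality = round(product ** (1.0 / n), 4) if product > 0 and n > 0 else 0.0
    return sub_scores, quality


# Both sub-scores partially satisfied: quality is sqrt(0.8 * 0.5).
full = aggregate([("spec", 0.5), ("spec", 0.3), ("expl", 0.5)], ("spec", "expl"))
# One sub-score has no matched rule, so the geometric mean collapses to zero.
missing = aggregate([("spec", 1.0)], ("spec", "expl"))
```

Unlike an arithmetic mean, the geometric mean cannot be rescued by a strong score elsewhere, which is why a single empty category is enough to zero the quality.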

Backward-compatible read accessors are exposed for the legacy De-La-Chica style sub-scores (spec, expl, bshift) and for the current GroundLens five-category skeleton (groundedness, completeness, calibration, traceability, robustness). Accessors return 0.0 when the underlying ruleset did not define the requested sub-score.

Attributes:

Name Type Description
sub_scores dict[str, float]

Mapping from sub-score name to its capped value in [0, 1]. Treat this mapping as read-only by convention.

quality float

Geometric mean of all sub-score values in :attr:sub_scores.

flagged bool

True when the ruleset's flag predicate is triggered.

rule_results tuple[RuleResult, ...]

One :class:RuleResult per rule that was evaluated.

audit_explanation str

Multi-line human-readable summary suitable for inclusion in an audit log.

Attributes
spec: float property

Legacy specificity sub-score. Returns 0.0 if not defined by ruleset.

expl: float property

Legacy explanatory-linkage sub-score. Returns 0.0 if not defined by ruleset.

bshift: float property

Legacy boundary-shift sub-score. Returns 0.0 if not defined by ruleset.

groundedness: float property

Groundedness sub-score. Returns 0.0 if not defined by ruleset.

completeness: float property

Completeness sub-score. Returns 0.0 if not defined by ruleset.

calibration: float property

Calibration sub-score. Returns 0.0 if not defined by ruleset.

traceability: float property

Traceability sub-score. Returns 0.0 if not defined by ruleset.

robustness: float property

Robustness sub-score. Returns 0.0 if not defined by ruleset.
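The fallback behaviour of these accessors can be pictured with a small stand-in class. `ResultSketch` is hypothetical and mirrors only the documented "return 0.0 when undefined" behaviour; the real dataclass carries more fields and may be implemented differently.

```python
from dataclasses import dataclass, field


# Illustrative stand-in: `ResultSketch` is hypothetical, not the groundlens class.
@dataclass(frozen=True)
class ResultSketch:
    sub_scores: dict[str, float] = field(default_factory=dict)

    @property
    def groundedness(self) -> float:
        # Falls back to 0.0 when the ruleset did not define this sub-score.
        return self.sub_scores.get("groundedness", 0.0)

    @property
    def spec(self) -> float:
        # Legacy accessor with the same fallback.
        return self.sub_scores.get("spec", 0.0)


result = ResultSketch(sub_scores={"groundedness": 0.8, "completeness": 1.0})
defined = result.groundedness      # sub-score present in the mapping
undefined = result.spec            # sub-score absent, so the accessor yields 0.0
is_defined = "spec" in result.sub_scores
```

Because an accessor returns 0.0 both for an undefined sub-score and for a defined one that scored zero, checking membership in `sub_scores` is the way to tell the two cases apart.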

DGIResult(value: float, normalized: float, flagged: bool, method: str = 'dgi', explanation: str = '') dataclass

Result of Directional Grounding Index computation.

DGI measures whether the question-to-response displacement vector aligns with the mean displacement of verified grounded pairs. Higher values indicate alignment with grounded patterns.
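As a rough sketch of the geometry described above, the following computes a displacement vector and its cosine similarity to a mean reference direction. The toy 2-D vectors and the helper names (`cosine`, `displacement`, `dgi_sketch`) are assumptions made for illustration; how the library builds its reference direction (for example, whether displacements are normalized before averaging) is not shown in this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def displacement(question_vec, response_vec):
    """Question-to-response displacement vector."""
    return [r - q for q, r in zip(question_vec, response_vec)]


def dgi_sketch(question_vec, response_vec, reference_direction):
    """Hypothetical helper: cosine of the displacement against a reference direction."""
    return cosine(displacement(question_vec, response_vec), reference_direction)


# Toy 2-D "embeddings" (hypothetical values chosen so the arithmetic is easy to follow).
grounded_pairs = [([0.0, 0.0], [1.0, 0.0]), ([1.0, 1.0], [3.0, 1.0])]
# Reference direction: mean displacement over the verified grounded pairs.
deltas = [displacement(q, r) for q, r in grounded_pairs]
reference = [sum(col) / len(deltas) for col in zip(*deltas)]

aligned = dgi_sketch([0.0, 0.0], [2.0, 0.0], reference)
opposed = dgi_sketch([0.0, 0.0], [-2.0, 0.0], reference)
```

A displacement pointing the same way as the reference scores near 1, while one pointing the opposite way scores near -1, matching the documented [-1, 1] range of the raw value.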

Attributes:

Name Type Description
value float

Raw DGI score = cosine similarity to reference direction. Range: [-1, 1].

normalized float

Score mapped to [0, 1] via linear normalization.

flagged bool

True if the score is below the pass threshold.

method str

Always "dgi".

explanation str

Human-readable interpretation of the score.

Methods:
__post_init__() -> None

Generate explanation from score if not provided.

Source code in src/groundlens/score.py
def __post_init__(self) -> None:
    """Generate explanation from score if not provided."""
    if not self.explanation:
        if self.value >= DGI_PASS:
            expl = f"DGI={self.value:.3f} — aligns with grounded patterns (pass)"
        elif self.value >= 0.0:
            expl = f"DGI={self.value:.3f} — weak alignment (flagged)"
        else:
            expl = f"DGI={self.value:.3f} — opposes grounded direction (high risk)"
        object.__setattr__(self, "explanation", expl)

GroundlensScore(value: float, normalized: float, flagged: bool, method: str, explanation: str, detail: SGIResult | DGIResult) dataclass

Unified score container returned by high-level evaluate() calls.

Wraps either an SGIResult or DGIResult with additional metadata.

Attributes:

Name Type Description
value float

Raw score from the underlying method.

normalized float

Score in [0, 1].

flagged bool

Whether human review is recommended.

method str

"sgi" or "dgi".

explanation str

Human-readable interpretation.

detail SGIResult | DGIResult

The full SGIResult or DGIResult for method-specific fields.

SGIResult(value: float, normalized: float, flagged: bool, q_dist: float, ctx_dist: float, method: str = 'sgi', explanation: str = '') dataclass

Result of Semantic Grounding Index computation.

SGI measures whether a response engaged with the provided context or stayed anchored to the question. Higher values indicate stronger context engagement (grounded).
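The raw score is documented as dist(response, question) / dist(response, context) with Euclidean distances. A minimal sketch of that ratio on toy vectors follows; `sgi_sketch`, the 2-D values, and the zero-distance guard are assumptions for illustration, and the tanh normalization is left out because its exact form is not given here.

```python
import math


def sgi_sketch(question_vec, context_vec, response_vec):
    """Hypothetical helper returning (ratio, q_dist, ctx_dist)."""
    q_dist = math.dist(response_vec, question_vec)
    ctx_dist = math.dist(response_vec, context_vec)
    if ctx_dist == 0.0:
        # Guard chosen for this sketch only; the library's handling is not documented here.
        return math.inf, q_dist, ctx_dist
    return q_dist / ctx_dist, q_dist, ctx_dist


question = [0.0, 0.0]
context = [4.0, 0.0]

# Response sits close to the context and far from the question.
engaged, q_dist, ctx_dist = sgi_sketch(question, context, [3.0, 0.0])
# Response stays near the question and far from the context.
anchored, _, _ = sgi_sketch(question, context, [1.0, 0.0])
```

The ratio exceeds 1 when the response has moved toward the context and falls below 1 when it stays anchored to the question, which is the direction of "higher is more grounded" stated above.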

Attributes:

Name Type Description
value float

Raw SGI score = dist(response, question) / dist(response, context).

normalized float

Score mapped to [0, 1] via tanh normalization.

flagged bool

True if the score is below the review threshold.

q_dist float

Euclidean distance from response to question embedding.

ctx_dist float

Euclidean distance from response to context embedding.

method str

Always "sgi".

explanation str

Human-readable interpretation of the score.

Methods:
__post_init__() -> None

Generate explanation from score if not provided.

Source code in src/groundlens/score.py
def __post_init__(self) -> None:
    """Generate explanation from score if not provided."""
    if not self.explanation:
        if self.value >= SGI_STRONG_PASS:
            expl = f"SGI={self.value:.3f} — strong context engagement (pass)"
        elif self.value >= SGI_REVIEW:
            expl = f"SGI={self.value:.3f} — partial engagement (review recommended)"
        else:
            expl = f"SGI={self.value:.3f} — weak context engagement (flagged)"
        object.__setattr__(self, "explanation", expl)

SGI(model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None)

Reusable SGI scorer with a pre-configured embedding model.

Use this class when evaluating multiple responses with the same model to avoid repeating the model parameter.

Example

>>> sgi = SGI(model="all-MiniLM-L6-v2")
>>> result = sgi.score(
...     question="What is X?",
...     context="X is Y.",
...     response="X is Y.",
... )
>>> result.flagged
False

Initialize SGI scorer.

Parameters:

Name Type Description Default
model str

Sentence transformer model name or path.

DEFAULT_MODEL
encoder EmbeddingFn | None

Optional bring-your-own-embeddings callable. When set, scoring bypasses sentence-transformers (no torch required).

None
Source code in src/groundlens/sgi.py
def __init__(
    self,
    model: str = DEFAULT_MODEL,
    encoder: EmbeddingFn | None = None,
) -> None:
    """Initialize SGI scorer.

    Args:
        model: Sentence transformer model name or path.
        encoder: Optional bring-your-own-embeddings callable. When set,
            scoring bypasses sentence-transformers (no torch required).
    """
    self.model = model
    self.encoder = encoder
Methods:
score(question: str, context: str, response: str) -> SGIResult

Compute SGI for a single response.

Parameters:

Name Type Description Default
question str

The input query.

required
context str

Source document or reference text.

required
response str

The LLM output to evaluate.

required

Returns:

Type Description
SGIResult

SGIResult with score and flag status.

Source code in src/groundlens/sgi.py
def score(
    self,
    question: str,
    context: str,
    response: str,
) -> SGIResult:
    """Compute SGI for a single response.

    Args:
        question: The input query.
        context: Source document or reference text.
        response: The LLM output to evaluate.

    Returns:
        SGIResult with score and flag status.
    """
    return compute_sgi(
        question=question,
        context=context,
        response=response,
        model=self.model,
        encoder=self.encoder,
    )

Functions:

get_default_encoder() -> EmbeddingFn | None

Return the process-global embedding callable, or None if unset.

Returns:

Type Description
EmbeddingFn | None

The encoder previously set via :func:set_default_encoder, or None.

Source code in src/groundlens/_internal/embeddings.py
def get_default_encoder() -> EmbeddingFn | None:
    """Return the process-global embedding callable, or ``None`` if unset.

    Returns:
        The encoder previously set via :func:`set_default_encoder`, or ``None``.
    """
    return _custom_encoder

set_default_encoder(encoder: EmbeddingFn | None) -> None

Set (or clear) the process-global embedding callable.

When a default encoder is set, every encode_texts call that does not receive an explicit encoder= argument routes through it, bypassing sentence-transformers entirely (so no torch import is triggered). Pass None to clear and restore the sentence-transformers path.

Parameters:

Name Type Description Default
encoder EmbeddingFn | None

A callable taking list[str] and returning an (n, d) array-like of float embeddings, or None to clear.

required
Source code in src/groundlens/_internal/embeddings.py
def set_default_encoder(encoder: EmbeddingFn | None) -> None:
    """Set (or clear) the process-global embedding callable.

    When a default encoder is set, every ``encode_texts`` call that does not
    receive an explicit ``encoder=`` argument routes through it, bypassing
    sentence-transformers entirely (so no torch import is triggered). Pass
    ``None`` to clear and restore the sentence-transformers path.

    Args:
        encoder: A callable taking ``list[str]`` and returning an ``(n, d)``
            array-like of float embeddings, or ``None`` to clear.
    """
    global _custom_encoder
    _custom_encoder = encoder
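To make the encoder contract concrete, here is a sketch of a dependency-free callable that takes `list[str]` and returns an `(n, d)` nested list of floats. `toy_hash_encoder` is a hypothetical name and a deliberately crude bag-of-tokens embedding, useful only for smoke tests; it also assumes nested lists are accepted as "array-like" (wrap the result in an array type if they are not).

```python
import hashlib
import math


def toy_hash_encoder(texts, dim=16):
    """Map each text to a unit-length bag-of-tokens vector of size `dim`."""
    rows = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # Stable hash so the same token always lands in the same bucket.
            bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        rows.append([x / norm for x in vec])
    return rows


embeddings = toy_hash_encoder(["What is X?", "X is Y."])
# With groundlens installed, the callable would be registered with
# set_default_encoder(toy_hash_encoder) and cleared with set_default_encoder(None).
```

Using a stable hash rather than Python's built-in `hash` keeps the vectors identical across processes, which matters if scores are meant to be reproducible.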

customer_support_rag_rules() -> RuleSet

Deprecated alias — use :func:customer_support_rules (with rag=True).

Preserved for one or more releases for backwards compatibility with code written against groundlens 2026.6.11 / 2026.6.12. The returned rule set is byte-for-byte identical to customer_support_rules(rag=True, domain="general", language="en") except for the RuleSet.name field, which keeps the legacy "customer_support_rag_v1" value so existing audit logs continue to match.

Deprecated since version 2026.6.13: Use :func:customer_support_rules instead.

Source code in src/groundlens/agents/customer_support.py
def customer_support_rag_rules() -> RuleSet:
    """Deprecated alias — use :func:`customer_support_rules` (with ``rag=True``).

    Preserved for one or more releases for backwards compatibility with
    code written against groundlens 2026.6.11 / 2026.6.12. The returned
    rule set is byte-for-byte identical to
    ``customer_support_rules(rag=True, domain="general", language="en")``
    except for the ``RuleSet.name`` field, which keeps the legacy
    ``"customer_support_rag_v1"`` value so existing audit logs continue to
    match.

    .. deprecated:: 2026.6.13
        Use :func:`customer_support_rules` instead.
    """
    warnings.warn(
        "customer_support_rag_rules() is deprecated; "
        "use customer_support_rules(rag=True) instead. "
        "The legacy alias will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    rs = customer_support_rules(rag=True, domain="general", language="en")
    # Preserve the legacy name so downstream code that asserts on rs.name
    # (e.g. the cookbook notebook's `ruleset.name` check) does not break.
    object.__setattr__(rs, "name", "customer_support_rag_v1")
    return rs

customer_support_rules(rag: bool = True, domain: str = 'general', language: str = 'en') -> RuleSet

Rule set for customer-support informational agents.

Designed for informational customer-facing assistants. Selects between the RAG and no-RAG sub-score taxonomies and adjusts the stopword / speculative-marker vocabulary to the deployment domain and language.

Parameters:

Name Type Description Default
rag bool

Whether the agent retrieves context (FAQ) before answering.

  • True (default) — full 7-rule, 3-sub-score set (groundedness, completeness, no_overreach).
  • False — 4-rule, 2-sub-score set (completeness, no_overreach). The three groundedness rules are omitted because there is no context to compare against. The flag predicate adapts.
True
domain str

Deployment domain. Affects stopwords and speculative-procedure markers; does not add or remove rules.

One of: "general" (default), "finance", "healthcare", "legal".

'general'
language str

Deployment language. Affects stopwords, speculative-procedure markers, and the legal-reference regular expression.

One of: "en" (default), "es", "multi".

'en'

Returns:

Type Description
RuleSet

A :class:RuleSet whose name encodes the active configuration: customer_support_v2_{domain}_{language}_{rag|norag}.

Raises:

Type Description
ValueError

If domain is not in :data:_VALID_DOMAINS or language is not in :data:_VALID_LANGUAGES.

Examples:

Default — FAQ-RAG, general domain, English::

from groundlens.agents import customer_support_rules

rs = customer_support_rules()
result = rs.evaluate(
    question="What is the Bizum daily limit?",
    response="The Bizum daily limit is 1,000 EUR per transaction.",
    context=(
        "The daily Bizum transfer limit is 1,000 EUR per "
        "transaction and 2,000 EUR per day in total."
    ),
)
assert not result.flagged

No-RAG chat in Spanish finance vocabulary::

rs = customer_support_rules(rag=False, domain="finance", language="es")
assert "completeness" in rs.sub_scores
assert "groundedness" not in rs.sub_scores
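The groundedness rule described as "every number in response appears in FAQ or query" can be approximated with a short regular-expression check. This is a sketch under stated assumptions, not the bundled implementation: `invented_numbers` and `NUMBER_RE` are hypothetical names, and numbers are compared as literal strings without any unit or locale normalization.

```python
import re

# Matches integers and digit groups joined by "." or "," (e.g. "1,000", "23.5").
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def invented_numbers(response, context, question=""):
    """Return numbers in `response` that appear in neither `context` nor `question`."""
    allowed = set(NUMBER_RE.findall(context)) | set(NUMBER_RE.findall(question))
    return [n for n in NUMBER_RE.findall(response) if n not in allowed]


faq = (
    "The daily Bizum transfer limit is 1,000 EUR per "
    "transaction and 2,000 EUR per day in total."
)
grounded = invented_numbers("The Bizum daily limit is 1,000 EUR per transaction.", faq)
ungrounded = invented_numbers("The Bizum daily limit is 5,000 EUR per transaction.", faq)
```

A literal comparison like this treats "1,000" and "1000" as different numbers, so a production check would likely normalize formats first.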
Source code in src/groundlens/agents/customer_support.py
def customer_support_rules(
    rag: bool = True,
    domain: str = "general",
    language: str = "en",
) -> RuleSet:
    """Rule set for customer-support informational agents.

    Designed for informational customer-facing assistants. Selects between
    the RAG and no-RAG sub-score taxonomies and adjusts the
    stopword / speculative-marker vocabulary to the deployment domain and
    language.

    Args:
        rag: Whether the agent retrieves context (FAQ) before answering.

            - ``True`` (default) — full 7-rule, 3-sub-score set
              (``groundedness``, ``completeness``, ``no_overreach``).
            - ``False`` — 4-rule, 2-sub-score set (``completeness``,
              ``no_overreach``). The three groundedness rules are omitted
              because there is no context to compare against. The flag
              predicate adapts.
        domain: Deployment domain. Affects stopwords and
            speculative-procedure markers; does not add or remove rules.

            One of: ``"general"`` (default), ``"finance"``,
            ``"healthcare"``, ``"legal"``.
        language: Deployment language. Affects stopwords,
            speculative-procedure markers, and the legal-reference
            regular expression.

            One of: ``"en"`` (default), ``"es"``, ``"multi"``.

    Returns:
        A :class:`RuleSet` whose name encodes the active configuration:
        ``customer_support_v2_{domain}_{language}_{rag|norag}``.

    Raises:
        ValueError: If ``domain`` is not in :data:`_VALID_DOMAINS` or
            ``language`` is not in :data:`_VALID_LANGUAGES`.

    Examples:
        Default — FAQ-RAG, general domain, English::

            from groundlens.agents import customer_support_rules

            rs = customer_support_rules()
            result = rs.evaluate(
                question="What is the Bizum daily limit?",
                response="The Bizum daily limit is 1,000 EUR per transaction.",
                context=(
                    "The daily Bizum transfer limit is 1,000 EUR per "
                    "transaction and 2,000 EUR per day in total."
                ),
            )
            assert not result.flagged

        No-RAG chat in Spanish finance vocabulary::

            rs = customer_support_rules(rag=False, domain="finance", language="es")
            assert "completeness" in rs.sub_scores
            assert "groundedness" not in rs.sub_scores
    """
    if domain not in _VALID_DOMAINS:
        msg = (
            f"customer_support_rules(domain={domain!r}) — supported domains are {_VALID_DOMAINS}."
        )
        raise ValueError(msg)
    if language not in _VALID_LANGUAGES:
        msg = (
            f"customer_support_rules(language={language!r}) — supported languages are "
            f"{_VALID_LANGUAGES}."
        )
        raise ValueError(msg)

    stopwords = _build_stopwords(domain=domain, language=language)
    markers = _build_speculative_markers(domain=domain, language=language)
    legal_ref_re = _legal_ref_re(language=language)

    # Bind the domain/language-specific knobs into the check callables.
    proper_nouns_check = partial(_check_no_invented_proper_nouns_impl, stopwords=stopwords)
    legal_refs_check = partial(_check_no_unrequested_legal_refs_impl, legal_ref_re=legal_ref_re)
    speculative_check = partial(_check_no_speculative_procedure_impl, speculative_markers=markers)

    grounded_rules: tuple[ChecklistRule, ...] = (
        ChecklistRule(
            id="csr.no_invented_numbers",
            description="every number in response appears in FAQ or query",
            weight=0.50,
            sub_score="groundedness",
            check=_check_no_invented_numbers,
            citation="Es et al. RAGAs (EACL 2024) §3 Faithfulness — atomic claim verification",
        ),
        ChecklistRule(
            id="csr.no_invented_proper_nouns",
            description="every proper noun in response appears in FAQ",
            weight=0.30,
            sub_score="groundedness",
            check=proper_nouns_check,
            citation="Min et al. FActScore (EMNLP 2023) — atomic factual precision",
        ),
        ChecklistRule(
            id="csr.content_overlaps_faq",
            description="response content overlaps FAQ above threshold",
            weight=0.20,
            sub_score="groundedness",
            check=_check_content_overlaps_faq,
            citation="Marin (2025) SGI arXiv:2512.13771 — surface grounding signal",
        ),
    )
    completeness_rules: tuple[ChecklistRule, ...] = (
        ChecklistRule(
            id="csr.addresses_query_topic",
            description="response addresses the query topic",
            weight=0.70,
            sub_score="completeness",
            check=_check_addresses_query_topic,
            citation="Industry banking RAG evaluation framework — relevance check",
        ),
        ChecklistRule(
            id="csr.uses_concrete_values",
            description="response uses concrete values from FAQ",
            weight=0.30,
            sub_score="completeness",
            check=_check_uses_concrete_values,
            citation="Industry banking RAG evaluation framework — usefulness check",
        ),
    )
    overreach_rules: tuple[ChecklistRule, ...] = (
        ChecklistRule(
            id="csr.no_unrequested_legal_refs",
            description="no legal references in response that are not in FAQ",
            weight=0.60,
            sub_score="no_overreach",
            check=legal_refs_check,
            citation="EU AI Act 2024/1689 Art. 13 — transparency on capabilities and limits",
        ),
        ChecklistRule(
            id="csr.no_speculative_procedure",
            description="no procedural additions not present in FAQ",
            weight=0.40,
            sub_score="no_overreach",
            check=speculative_check,
            citation="Federal Reserve SR 26-2 (Apr 2026) §model output controls",
        ),
    )

    rag_tag = "rag" if rag else "norag"
    name = f"customer_support_v2_{domain}_{language}_{rag_tag}"

    if rag:
        rules = grounded_rules + completeness_rules + overreach_rules
        return RuleSet(
            name=name,
            rules=rules,
            sub_scores=("groundedness", "completeness", "no_overreach"),
            flag_predicate=customer_support_flag_predicate,
        )
    rules = completeness_rules + overreach_rules
    return RuleSet(
        name=name,
        rules=rules,
        sub_scores=("completeness", "no_overreach"),
        flag_predicate=_customer_support_no_rag_flag_predicate,
    )

rag_rules(domain: str = 'banking') -> RuleSet

Deprecated dispatcher — use the archetype-named factories directly.

Parameters:

Name Type Description Default
domain str

"banking" (default) returns :func:groundlens.rules.decision_rationale_rules (the 20-rule decision-rationale set). "customer_support" returns :func:groundlens.agents.customer_support_rules with rag=True (the 7-rule informational-agent set).

'banking'

Returns:

Type Description
RuleSet

The selected :class:RuleSet.

Raises:

Type Description
ValueError

If domain is not in :data:_SUPPORTED_DOMAINS.

Deprecated since version 2026.6.13: Call the canonical factory directly: :func:groundlens.rules.decision_rationale_rules for credit / AML / KYC decision rationales, or :func:groundlens.agents.customer_support_rules for informational FAQ-RAG agents. The :func:rag_rules dispatcher will be removed in a future release.

Source code in src/groundlens/agents/rag.py
def rag_rules(domain: str = "banking") -> RuleSet:
    """Deprecated dispatcher — use the archetype-named factories directly.

    Args:
        domain: ``"banking"`` (default) returns
            :func:`groundlens.rules.decision_rationale_rules` (the 20-rule
            decision-rationale set). ``"customer_support"`` returns
            :func:`groundlens.agents.customer_support_rules` with ``rag=True``
            (the 7-rule informational-agent set).

    Returns:
        The selected :class:`RuleSet`.

    Raises:
        ValueError: If ``domain`` is not in :data:`_SUPPORTED_DOMAINS`.

    .. deprecated:: 2026.6.13
        Call the canonical factory directly:
        :func:`groundlens.rules.decision_rationale_rules` for credit / AML /
        KYC decision rationales, or
        :func:`groundlens.agents.customer_support_rules` for informational
        FAQ-RAG agents. The :func:`rag_rules` dispatcher will be removed in a
        future release.
    """
    if domain not in _SUPPORTED_DOMAINS:
        msg = (
            f"rag_rules(domain={domain!r}) — supported domains are "
            f"{_SUPPORTED_DOMAINS}. The dispatcher is also deprecated; "
            "prefer decision_rationale_rules() or customer_support_rules() "
            "directly."
        )
        raise ValueError(msg)

    if domain == "banking":
        warnings.warn(
            'rag_rules(domain="banking") is deprecated; use '
            'decision_rationale_rules(domain="finance") from groundlens.rules instead. '
            "The dispatcher will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Return the legacy-named ruleset for backwards compatibility with
        # downstream code that asserts on `rs.name`.
        return groundlens_banking_rules()

    # domain == "customer_support"
    warnings.warn(
        'rag_rules(domain="customer_support") is deprecated; use '
        "customer_support_rules(rag=True) from groundlens.agents instead. "
        "The dispatcher will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Use the legacy alias so the returned RuleSet keeps its legacy name
    # ("customer_support_rag_v1") for backwards compatibility.
    with warnings.catch_warnings():
        # Suppress the inner DeprecationWarning emitted by the legacy alias —
        # the outer one above is the one the caller should see.
        warnings.simplefilter("ignore", DeprecationWarning)
        return customer_support_rag_rules()
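The nested-warning handling above can be checked in isolation with stand-in functions. `legacy_alias` and `dispatcher` below are hypothetical placeholders for the two deprecated callables; only the standard-library `warnings` calls are real.

```python
import warnings


def legacy_alias():
    """Stand-in for a deprecated alias that emits its own warning."""
    warnings.warn("legacy_alias() is deprecated", DeprecationWarning, stacklevel=2)
    return "legacy-ruleset"


def dispatcher():
    """Stand-in for a deprecated dispatcher that delegates to the alias."""
    warnings.warn("dispatcher() is deprecated", DeprecationWarning, stacklevel=2)
    with warnings.catch_warnings():
        # Silence the inner warning so the caller sees only the outer one.
        warnings.simplefilter("ignore", DeprecationWarning)
        return legacy_alias()


# Record every warning that reaches the caller.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = dispatcher()

messages = [str(w.message) for w in caught]
```

Only the dispatcher's message is recorded, so callers get a single actionable warning instead of two overlapping ones.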

routing_rules(domain: str = 'general') -> RuleSet

Rule set for routing / intent classification agents.

Returns a 10-rule set across 4 sub-scores: intent_clarity, classification_confidence, fallback_appropriateness, disambiguation_quality. Each rule carries a citation to its academic, industrial, or regulatory source.

Parameters:

Name Type Description Default
domain str

Deployment domain. Currently the routing rule set is domain-agnostic by design — the rules check structural properties of routing decisions (single intent, top-1 margin, fallback appropriateness, clarification quality) that hold across verticals. The kwarg is accepted for API symmetry with the other archetype factories and to leave a slot for domain-specific routing extensions in a future release.

One of: "general" (default), "finance", "healthcare", "legal".

'general'

Returns:

Type Description
RuleSet

A :class:RuleSet named "groundlens_routing_v1".

Raises:

Type Description
ValueError

If domain is not in :data:_VALID_ROUTING_DOMAINS.

Example::

from groundlens.agents import routing_rules

rs = routing_rules()
result = rs.evaluate(
    question="transfer 500 to my brother and check my balance",
    response="I will transfer 500 EUR.",
    metadata={
        "predicted_intent": "transfer",
        "top1_score": 0.62,
        "margin": 0.08,
        "fallback_fired": False,
        "query_in_scope": True,
    },
)
assert result.flagged  # low confidence + multi-intent
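The two confidence rules consulted in this example can be sketched as simple threshold checks on the metadata. The defaults (0.7 for top-1 confidence, 0.15 for the margin) come from the rule descriptions in the source listing; the function name, the inclusive comparison, and the treatment of missing keys as failures are assumptions made for this sketch.

```python
def confidence_checks(metadata, top1_threshold=0.7, margin_floor=0.15):
    """Hypothetical helper mirroring two classification_confidence rules."""
    # Assumption: a missing field counts as a failed check in this sketch.
    top1_ok = metadata.get("top1_score", 0.0) >= top1_threshold
    margin_ok = metadata.get("margin", 0.0) >= margin_floor
    return {
        "top1_confidence_above_threshold": top1_ok,
        "margin_to_runner_up": margin_ok,
    }


# Same values as the example above: both checks fail.
low = confidence_checks({"top1_score": 0.62, "margin": 0.08})
# A confident, well-separated prediction passes both.
high = confidence_checks({"top1_score": 0.91, "margin": 0.30})
```

With neither rule matched, two of the three weights feeding classification_confidence are lost, which is consistent with the example being flagged for low confidence.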
Source code in src/groundlens/agents/routing.py
def routing_rules(domain: str = "general") -> RuleSet:
    """Rule set for routing / intent classification agents.

    Returns a 10-rule set across 4 sub-scores: intent_clarity,
    classification_confidence, fallback_appropriateness,
    disambiguation_quality. Each rule carries a citation to its
    academic, industrial, or regulatory source.

    Args:
        domain: Deployment domain. Currently the routing rule set is
            domain-agnostic by design — the rules check structural
            properties of routing decisions (single intent, top-1 margin,
            fallback appropriateness, clarification quality) that hold
            across verticals. The kwarg is accepted for API symmetry with
            the other archetype factories and to leave a slot for
            domain-specific routing extensions in a future release.

            One of: ``"general"`` (default), ``"finance"``,
            ``"healthcare"``, ``"legal"``.

    Returns:
        A :class:`RuleSet` named ``"groundlens_routing_v1"``.

    Raises:
        ValueError: If ``domain`` is not in :data:`_VALID_ROUTING_DOMAINS`.

    Example::

        from groundlens.agents import routing_rules

        rs = routing_rules()
        result = rs.evaluate(
            question="transfer 500 to my brother and check my balance",
            response="I will transfer 500 EUR.",
            metadata={
                "predicted_intent": "transfer",
                "top1_score": 0.62,
                "margin": 0.08,
                "fallback_fired": False,
                "query_in_scope": True,
            },
        )
        assert result.flagged  # low confidence + multi-intent
    """
    if domain not in _VALID_ROUTING_DOMAINS:
        msg = f"routing_rules(domain={domain!r}) — supported domains are {_VALID_ROUTING_DOMAINS}."
        raise ValueError(msg)
    rules = (
        # intent_clarity (3 rules, weights 0.4 + 0.3 + 0.3 = 1.0)
        ChecklistRule(
            id="routing.single_intent_signal",
            description="query carries a single intent, not multiple chained operations",
            weight=0.40,
            sub_score="intent_clarity",
            check=check_single_intent_signal,
            citation="Sarikaya et al. (IEEE TASLP 2014) — intent detection in spoken NLU",
        ),
        ChecklistRule(
            id="routing.no_ambiguous_pronoun_lead",
            description="query does not start with a bare pronoun without antecedent",
            weight=0.30,
            sub_score="intent_clarity",
            check=check_no_ambiguous_pronoun_lead,
            citation=(
                "Industry banking routing-agent design pattern (production deployments, 2025)"
            ),
        ),
        ChecklistRule(
            id="routing.intent_shares_query_tokens",
            description="predicted intent shares at least one content token with the query",
            weight=0.30,
            sub_score="intent_clarity",
            check=check_intent_shares_query_tokens,
            citation="Wang et al. (ACL 2020) — intent-slot consistency for joint NLU",
        ),
        # classification_confidence (3 rules, weights 0.4 + 0.3 + 0.3 = 1.0)
        ChecklistRule(
            id="routing.top1_confidence_above_threshold",
            description="top-1 confidence above operational threshold (default 0.7)",
            weight=0.40,
            sub_score="classification_confidence",
            check=check_top1_confidence_above_threshold,
            citation="Guo et al. (ICML 2017) — on calibration of modern neural networks",
        ),
        ChecklistRule(
            id="routing.margin_to_runner_up",
            description="margin between top-1 and top-2 above floor (default 0.15)",
            weight=0.30,
            sub_score="classification_confidence",
            check=check_margin_to_runner_up,
            citation="Industry banking routing-agent evaluation — top-1 to top-2 margin metric",
        ),
        ChecklistRule(
            id="routing.intent_in_allowed_set",
            description="predicted intent belongs to the configured allowed set",
            weight=0.30,
            sub_score="classification_confidence",
            check=check_intent_in_allowed_set,
            citation="Hendrycks & Gimpel (ICLR 2017) — out-of-distribution detection",
        ),
        # fallback_appropriateness (2 rules, weights 0.6 + 0.4 = 1.0)
        ChecklistRule(
            id="routing.fallback_when_out_of_scope",
            description="if fallback fired, the query is actually out of scope",
            weight=0.60,
            sub_score="fallback_appropriateness",
            check=check_fallback_when_out_of_scope,
            citation="Industry banking RAG evaluation framework — fallback necessity check",
        ),
        ChecklistRule(
            id="routing.no_silent_fallback",
            description="fallback responses explain the limit instead of being silent",
            weight=0.40,
            sub_score="fallback_appropriateness",
            check=check_no_silent_fallback,
            citation="NIST AI RMF 1.0 (2023) §Govern 5 — transparency to affected parties",
        ),
        # disambiguation_quality (2 rules, weights 0.6 + 0.4 = 1.0)
        ChecklistRule(
            id="routing.clarify_when_ambiguous",
            description="low-margin cases trigger clarification rather than silent routing",
            weight=0.60,
            sub_score="disambiguation_quality",
            check=check_clarify_when_ambiguous,
            citation="Rao & Daumé III (ACL 2018) — learning to ask good questions",
        ),
        ChecklistRule(
            id="routing.specific_clarify_question",
            description="clarify question references the two candidate intents specifically",
            weight=0.40,
            sub_score="disambiguation_quality",
            check=check_specific_clarify_question,
            citation="De Vries et al. (ACL 2018) — task-oriented dialogue clarification",
        ),
    )

    return RuleSet(
        name="groundlens_routing_v1",
        rules=rules,
        sub_scores=(
            "intent_clarity",
            "classification_confidence",
            "fallback_appropriateness",
            "disambiguation_quality",
        ),
        flag_predicate=routing_flag_predicate,
    )

specialized_agent_rules(domain: str = 'general', tools: tuple[str, ...] = ()) -> RuleSet

Rule set for specialized / tool-using agents.

Returns a 10-rule set across 4 sub-scores: entity_groundedness, entity_completeness, entity_calibration, execution_readiness.

The flag predicate is stricter than for RAG agents because specialized agents execute irreversible operations (move money, open accounts, send messages on behalf of the customer).

Parameters:

Name Type Description Default
domain str

Deployment domain. Today this kwarg is accepted for API symmetry with the other archetype factories; the bundled rules check structural properties (entity groundedness, schema completeness, execution readiness) that hold across verticals. Reserved for domain-specific entity validators in a future release.

One of: "general" (default), "finance", "healthcare", "legal".

'general'
tools tuple[str, ...]

Optional tuple of validator keys. Today the bundled rule set ships IBAN, amount, and card-number checks unconditionally — they abstain when the corresponding metadata field is absent. The kwarg is reserved for future releases that will let deployments opt in to additional domain-specific validators (e.g. NPI for healthcare, DNI/NIE for Spain). Currently a non-empty value is validated against :data:_VALID_SPECIALIZED_TOOLS but has no behavioural effect.

()

Returns:

Type Description
RuleSet

A RuleSet named "groundlens_specialized_v1".

Raises:

Type Description
ValueError

If domain is not in :data:_VALID_SPECIALIZED_DOMAINS or any of tools is not in :data:_VALID_SPECIALIZED_TOOLS.

Example::

from groundlens.agents import specialized_agent_rules

rs = specialized_agent_rules()
result = rs.evaluate(
    question="send 500 to my brother",
    response="OK, I'll send 500 EUR to IBAN ES12...",
    metadata={
        "dialog": "send 500 to my brother. yes go ahead.",
        "entities": {"amount": 500, "iban": "ES1234567890123456789012"},
        "required_entities": ["amount", "iban"],
        "confirmed": True,
        "operation": "wire_transfer",
    },
)
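The metadata keys in the example map onto the entity_groundedness and entity_completeness descriptions ("each captured entity appears verbatim in the dialogue", "all entities required by the operation schema are captured"). The sketch below is one naive reading of those two descriptions, not the library's check functions: both helper names are hypothetical, and the case-insensitive substring match is an assumption.

```python
def entities_missing_from_dialog(dialog: str, entities: dict) -> list[str]:
    """Names of captured entities whose value is not a verbatim substring of the dialog."""
    lowered = dialog.lower()
    return [name for name, value in entities.items() if str(value).lower() not in lowered]


def missing_required(entities: dict, required: list[str]) -> list[str]:
    """Names of required entities that were not captured (absent or empty)."""
    return [name for name in required if entities.get(name) in (None, "")]


metadata = {
    "dialog": "send 500 to my brother. yes go ahead.",
    "entities": {"amount": 500, "iban": "ES1234567890123456789012"},
    "required_entities": ["amount", "iban"],
}

ungrounded = entities_missing_from_dialog(metadata["dialog"], metadata["entities"])
absent = missing_required(metadata["entities"], metadata["required_entities"])
print(ungrounded)  # ['iban']: the IBAN never appears in the dialog text
print(absent)      # []: both required entities were captured
```

Under this reading, the IBAN in the example metadata would count as ungrounded because the user never typed it. Whether the bundled check behaves the same way depends on its actual matching logic.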
Source code in src/groundlens/agents/specialized.py
def specialized_agent_rules(
    domain: str = "general",
    tools: tuple[str, ...] = (),
) -> RuleSet:
    """Rule set for specialized / tool-using agents.

    Returns a 9-rule set across 4 sub-scores: entity_groundedness,
    entity_completeness, entity_calibration, execution_readiness.

    The flag predicate is stricter than for RAG agents because
    specialized agents execute irreversible operations (move money,
    open accounts, send messages on behalf of the customer).

    Args:
        domain: Deployment domain. Today this kwarg is accepted for API
            symmetry with the other archetype factories; the bundled
            rules check structural properties (entity groundedness,
            schema completeness, execution readiness) that hold across
            verticals. Reserved for domain-specific entity validators in
            a future release.

            One of: ``"general"`` (default), ``"finance"``,
            ``"healthcare"``, ``"legal"``.
        tools: Optional tuple of validator keys. Today the bundled rule
            set ships IBAN, amount, and card-number checks
            unconditionally — they abstain when the corresponding
            metadata field is absent. The kwarg is reserved for future
            releases that will let deployments opt in to additional
            domain-specific validators (e.g. NPI for healthcare,
            DNI/NIE for Spain). Currently a non-empty value is validated
            against :data:`_VALID_SPECIALIZED_TOOLS` but has no
            behavioural effect.

    Returns:
        A :class:`RuleSet` named ``"groundlens_specialized_v1"``.

    Raises:
        ValueError: If ``domain`` is not in
            :data:`_VALID_SPECIALIZED_DOMAINS` or any of ``tools`` is not
            in :data:`_VALID_SPECIALIZED_TOOLS`.

    Example::

        from groundlens.agents import specialized_agent_rules

        rs = specialized_agent_rules()
        result = rs.evaluate(
            question="send 500 to my brother",
            response="OK, I'll send 500 EUR to IBAN ES12...",
            metadata={
                "dialog": "send 500 to my brother. yes go ahead.",
                "entities": {"amount": 500, "iban": "ES1234567890123456789012"},
                "required_entities": ["amount", "iban"],
                "confirmed": True,
                "operation": "wire_transfer",
            },
        )
    """
    if domain not in _VALID_SPECIALIZED_DOMAINS:
        msg = (
            f"specialized_agent_rules(domain={domain!r}) — supported domains are "
            f"{_VALID_SPECIALIZED_DOMAINS}."
        )
        raise ValueError(msg)
    unknown_tools = tuple(t for t in tools if t not in _VALID_SPECIALIZED_TOOLS)
    if unknown_tools:
        msg = (
            f"specialized_agent_rules(tools={tools!r}) — unknown tools "
            f"{unknown_tools}. Known tools: {_VALID_SPECIALIZED_TOOLS}."
        )
        raise ValueError(msg)
    rules = (
        # entity_groundedness (3 rules, weights 0.5 + 0.3 + 0.2 = 1.0)
        ChecklistRule(
            id="specialized.entities_in_dialog",
            description="each captured entity appears verbatim in the dialogue",
            weight=0.50,
            sub_score="entity_groundedness",
            check=check_entities_in_dialog,
            citation="Industry banking conversational-AI evaluation — entity hallucination metric",
        ),
        ChecklistRule(
            id="specialized.iban_format_valid",
            description="captured IBANs pass ISO 13616 mod-97 verification",
            weight=0.30,
            sub_score="entity_groundedness",
            check=check_iban_format_valid,
            citation="ISO 13616:2020 — International Bank Account Number (IBAN)",
        ),
        ChecklistRule(
            id="specialized.amounts_parseable",
            description="captured amount entities parse as numbers",
            weight=0.20,
            sub_score="entity_groundedness",
            check=check_amounts_parseable,
            citation=(
                "EBA Guidelines on the security of internet payments (2019) "
                "§Transaction Authentication — exact-amount confirmation"
            ),
        ),
        # entity_completeness (2 rules, weights 0.6 + 0.4 = 1.0)
        ChecklistRule(
            id="specialized.required_entities_present",
            description="all entities required by the operation schema are captured",
            weight=0.60,
            sub_score="entity_completeness",
            check=check_required_entities_present,
            citation="Evans (2003) Domain-Driven Design — aggregate root invariants",
        ),
        ChecklistRule(
            id="specialized.no_partial_fields",
            description="no required entity is partially filled or truncated",
            weight=0.40,
            sub_score="entity_completeness",
            check=check_no_partial_fields,
            citation="Wang & Strong (1996) — beyond accuracy: data quality dimensions",
        ),
        # entity_calibration (1 rule, weight 1.0)
        ChecklistRule(
            id="specialized.no_phantom_entities",
            description="no captured entity is outside the operation schema",
            weight=1.00,
            sub_score="entity_calibration",
            check=check_no_phantom_entities,
            citation="Industry banking conversational-AI evaluation — precision of empty entities",
        ),
        # execution_readiness (3 rules, weights 0.4 + 0.3 + 0.3 = 1.0)
        ChecklistRule(
            id="specialized.explicit_confirmation",
            description="dialogue contains an explicit user confirmation before execution",
            weight=0.40,
            sub_score="execution_readiness",
            check=check_explicit_confirmation,
            citation=(
                "EBA Guidelines on the security of internet payments (2019) §27 "
                "— Transaction Authentication"
            ),
        ),
        ChecklistRule(
            id="specialized.eoc_when_complete",
            description="EOC signaled only after the operation is complete",
            weight=0.30,
            sub_score="execution_readiness",
            check=check_eoc_when_complete,
            citation=(
                "Industry banking conversational-AI evaluation — "
                "end-of-conversation detection rate"
            ),
        ),
        ChecklistRule(
            id="specialized.no_pre_execution_claim",
            description="response does not claim execution before user confirmation",
            weight=0.30,
            sub_score="execution_readiness",
            check=check_no_pre_execution_claim,
            citation=(
                "Federal Reserve SR 26-2 (Apr 2026) — Model Risk Management; model output controls"
            ),
        ),
    )

    return RuleSet(
        name="groundlens_specialized_v1",
        rules=rules,
        sub_scores=(
            "entity_groundedness",
            "entity_completeness",
            "entity_calibration",
            "execution_readiness",
        ),
        flag_predicate=specialized_flag_predicate,
    )
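The specialized.iban_format_valid rule refers to ISO 13616 mod-97 verification. As a rough illustration of that checksum (not the body of check_iban_format_valid, which is not shown here), the hypothetical helper below moves the first four characters to the end, maps letters to numbers (A=10 ... Z=35), and checks that the resulting integer leaves remainder 1 when divided by 97. The length bounds of 15 to 34 characters are an assumption of this sketch. Country-specific length and structure rules are out of scope.

```python
def iban_mod97_ok(iban: str) -> bool:
    """Return True when the string passes the ISO 13616 mod-97 checksum."""
    s = iban.replace(" ", "").upper()
    # Reject anything that is not plain ASCII letters and digits of plausible length.
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_mod97_ok("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_mod97_ok("GB82 WEST 1234 5698 7654 33"))  # False
```

Because 97 is prime and does not divide any power of 10, a single wrong digit always changes the remainder, which is why the second call fails.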

fit_thresholds(examples: list[Mapping[str, object]], *, model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None, reference_csv: str | None = None) -> ThresholdFit

Fit SGI/DGI decision thresholds on a labeled set via Youden's J.

For each example this computes DGI (and SGI when a context is present), then picks each threshold by maximizing Youden's J for the rule "value >= threshold implies grounded".

Parameters:

Name Type Description Default
examples list[Mapping[str, object]]

A list of mappings, each with keys question (str), response (str), label (int: 1 = ungrounded / hallucinated, 0 = grounded), and optional context (str).

required
model str

Sentence transformer model name.

DEFAULT_MODEL
encoder EmbeddingFn | None

Optional bring-your-own-embeddings callable. Passed through to compute_dgi / compute_sgi so fitting works without torch.

None
reference_csv str | None

Optional DGI calibration CSV passed to compute_dgi.

None

Returns:

Type Description
ThresholdFit

A ThresholdFit with the fitted dgi_pass and (when any contexts were supplied) sgi_review thresholds.

Raises:

Type Description
ValueError

If examples is empty, or if either class (grounded or ungrounded) is missing from the labels.

Example

>>> fit = fit_thresholds(
...     [
...         {"question": "Q1?", "response": "A1.", "label": 0},
...         {"question": "Q2?", "response": "off-topic", "label": 1},
...     ]
... )
>>> fit.metric
'youden_j'

Source code in src/groundlens/calibrate.py
def fit_thresholds(
    examples: list[Mapping[str, object]],
    *,
    model: str = DEFAULT_MODEL,
    encoder: EmbeddingFn | None = None,
    reference_csv: str | None = None,
) -> ThresholdFit:
    """Fit SGI/DGI decision thresholds on a labeled set via Youden's J.

    For each example this computes DGI (and SGI when a ``context`` is
    present), then picks each threshold by maximizing Youden's J for the
    rule "value >= threshold implies grounded".

    Args:
        examples: A list of mappings, each with keys ``question`` (str),
            ``response`` (str), ``label`` (int: ``1`` = ungrounded /
            hallucinated, ``0`` = grounded), and optional ``context`` (str).
        model: Sentence transformer model name.
        encoder: Optional bring-your-own-embeddings callable. Passed through
            to ``compute_dgi`` / ``compute_sgi`` so fitting works without
            torch.
        reference_csv: Optional DGI calibration CSV passed to ``compute_dgi``.

    Returns:
        A :class:`ThresholdFit` with the fitted ``dgi_pass`` and (when any
        contexts were supplied) ``sgi_review`` thresholds.

    Raises:
        ValueError: If ``examples`` is empty, or if either class (grounded
            or ungrounded) is missing from the labels.

    Example:
        >>> fit = fit_thresholds(
        ...     [
        ...         {"question": "Q1?", "response": "A1.", "label": 0},
        ...         {"question": "Q2?", "response": "off-topic", "label": 1},
        ...     ]
        ... )
        >>> fit.metric
        'youden_j'
    """
    from groundlens.dgi import compute_dgi
    from groundlens.sgi import compute_sgi

    if not examples:
        msg = "examples must contain at least one item."
        raise ValueError(msg)

    labels = [int(ex["label"]) for ex in examples]  # type: ignore[call-overload]
    if 0 not in labels or 1 not in labels:
        msg = (
            "fit_thresholds requires both classes present: at least one "
            "grounded (label=0) and one ungrounded (label=1) example."
        )
        raise ValueError(msg)

    dgi_grounded: list[float] = []
    dgi_hallucinated: list[float] = []
    sgi_grounded: list[float] = []
    sgi_hallucinated: list[float] = []

    for ex in examples:
        question = str(ex["question"])
        response = str(ex["response"])
        label = int(ex["label"])  # type: ignore[call-overload]

        dgi = compute_dgi(
            question,
            response,
            model=model,
            reference_csv=reference_csv,
            encoder=encoder,
        )
        (dgi_hallucinated if label == 1 else dgi_grounded).append(dgi.value)

        context = ex.get("context")
        if context:
            sgi = compute_sgi(
                question,
                str(context),
                response,
                model=model,
                encoder=encoder,
            )
            (sgi_hallucinated if label == 1 else sgi_grounded).append(sgi.value)

    dgi_pass: float | None = None
    if dgi_grounded and dgi_hallucinated:
        dgi_pass = _youden_threshold(dgi_grounded, dgi_hallucinated)

    sgi_review: float | None = None
    if sgi_grounded and sgi_hallucinated:
        sgi_review = _youden_threshold(sgi_grounded, sgi_hallucinated)

    return ThresholdFit(
        sgi_review=sgi_review,
        dgi_pass=dgi_pass,
        n=len(examples),
        model=model,
    )
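The helper _youden_threshold is not shown on this page. As a minimal sketch of the technique it is named after, the function below scans candidate thresholds and keeps the one maximizing Youden's J = sensitivity + specificity - 1 under the rule "value >= threshold implies grounded". The function name, the candidate set (all observed values), and the tie-breaking rule (lowest threshold wins) are assumptions of this sketch.

```python
def youden_threshold_sketch(
    grounded: list[float], hallucinated: list[float]
) -> tuple[float, float]:
    """Return (threshold, J) maximizing Youden's J over the observed values."""
    candidates = sorted(set(grounded) | set(hallucinated))
    best_t, best_j = candidates[0], float("-inf")
    for t in candidates:
        # Sensitivity: grounded examples correctly kept at or above the threshold.
        sensitivity = sum(v >= t for v in grounded) / len(grounded)
        # Specificity: hallucinated examples correctly falling below the threshold.
        specificity = sum(v < t for v in hallucinated) / len(hallucinated)
        j = sensitivity + specificity - 1.0
        if j > best_j:  # strict comparison, so the lowest tied threshold wins
            best_t, best_j = t, j
    return best_t, best_j


threshold, j_stat = youden_threshold_sketch([0.5, 0.6], [0.1, 0.2])
print(threshold, j_stat)  # 0.5 1.0
```

With perfectly separated classes, J reaches 1.0 at the smallest grounded value. With overlapping classes, J stays below 1 and the chosen threshold trades missed hallucinations against false flags.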

compute_dgi(question: str, response: str, *, model: str = DEFAULT_MODEL, reference_csv: str | None = None, encoder: EmbeddingFn | None = None) -> DGIResult

Compute the Directional Grounding Index for a response.

Parameters:

Name Type Description Default
question str

The input query.

required
response str

The LLM output to evaluate.

required
model str

Sentence transformer model name.

DEFAULT_MODEL
reference_csv str | None

Path to domain-specific calibration CSV. If None, uses the bundled dataset.

None
encoder EmbeddingFn | None

Optional bring-your-own-embeddings callable taking list[str] and returning an (n, d) array. Bypasses sentence-transformers (no torch required) when provided.

None

Returns:

Type Description
DGIResult

DGIResult with raw score, normalized score, and flag status.

Raises:

Type Description
ValueError

If question or response is empty.

Example

>>> from groundlens import compute_dgi
>>> result = compute_dgi(
...     question="What causes seasons on Earth?",
...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
... )
>>> result.flagged
False

Source code in src/groundlens/dgi.py
def compute_dgi(
    question: str,
    response: str,
    *,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
    encoder: EmbeddingFn | None = None,
) -> DGIResult:
    """Compute the Directional Grounding Index for a response.

    Args:
        question: The input query.
        response: The LLM output to evaluate.
        model: Sentence transformer model name.
        reference_csv: Path to domain-specific calibration CSV.
            If ``None``, uses the bundled dataset.
        encoder: Optional bring-your-own-embeddings callable taking
            ``list[str]`` and returning an ``(n, d)`` array. Bypasses
            sentence-transformers (no torch required) when provided.

    Returns:
        DGIResult with raw score, normalized score, and flag status.

    Raises:
        ValueError: If question or response is empty.

    Example:
        >>> from groundlens import compute_dgi
        >>> result = compute_dgi(
        ...     question="What causes seasons on Earth?",
        ...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
        ... )
        >>> result.flagged
        False
    """
    if not question.strip():
        msg = "question must be a non-empty string."
        raise ValueError(msg)
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    if (encoder is not None or model != DEFAULT_MODEL) and reference_csv is None:
        _warn_default_thresholds_with_custom_encoder("compute_dgi", model, encoder is not None)

    mu_hat = _get_mu_hat(model, reference_csv, encoder=encoder)
    embeddings = encode_texts([question, response], model_name=model, encoder=encoder)
    q_emb, r_emb = embeddings[0], embeddings[1]

    delta = displacement_vector(q_emb, r_emb)
    magnitude = float(np.linalg.norm(delta))

    # Degenerate case: response identical to question.
    if magnitude < 1e-8:
        return DGIResult(value=0.0, normalized=0.0, flagged=True)

    delta_hat = delta / magnitude
    gamma = float(np.dot(delta_hat, mu_hat))

    if math.isnan(gamma):
        logger.warning("DGI produced NaN — check embedding dimensions.")
        return DGIResult(value=0.0, normalized=0.0, flagged=True)

    normalized = round(normalize_dgi(gamma), 4)

    return DGIResult(
        value=round(gamma, 4),
        normalized=normalized,
        flagged=gamma < DGI_PASS,
    )

evaluate_batch(items: list[dict[str, str]], *, model: str = DEFAULT_MODEL, reference_csv: str | None = None) -> list[GroundlensScore]

Evaluate a batch of LLM responses.

Each item in the list is a dict with keys:
  • question (required)
  • response (required)
  • context (optional — triggers SGI when present)

Parameters:

Name Type Description Default
items list[dict[str, str]]

List of dicts, each containing question, response, and optionally context.

required
model str

Sentence transformer model name.

DEFAULT_MODEL
reference_csv str | None

DGI calibration CSV path.

None

Returns:

Type Description
list[GroundlensScore]

List of GroundlensScore results, one per input item.

Raises:

Type Description
KeyError

If any item is missing question or response.

Example

>>> from groundlens import evaluate_batch
>>> items = [
...     {"question": "Q1?", "response": "A1.", "context": "C1."},
...     {"question": "Q2?", "response": "A2."},
... ]
>>> results = evaluate_batch(items)
>>> len(results)
2

Source code in src/groundlens/evaluate.py
def evaluate_batch(
    items: list[dict[str, str]],
    *,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
) -> list[GroundlensScore]:
    """Evaluate a batch of LLM responses.

    Each item in the list is a dict with keys:
        - ``question`` (required)
        - ``response`` (required)
        - ``context`` (optional — triggers SGI when present)

    Args:
        items: List of dicts, each containing question, response, and
            optionally context.
        model: Sentence transformer model name.
        reference_csv: DGI calibration CSV path.

    Returns:
        List of GroundlensScore results, one per input item.

    Raises:
        KeyError: If any item is missing ``question`` or ``response``.

    Example:
        >>> from groundlens import evaluate_batch
        >>> items = [
        ...     {"question": "Q1?", "response": "A1.", "context": "C1."},
        ...     {"question": "Q2?", "response": "A2."},
        ... ]
        >>> results = evaluate_batch(items)
        >>> len(results)
        2
    """
    results: list[GroundlensScore] = []

    for i, item in enumerate(items):
        if "question" not in item:
            msg = f"Item {i} missing required key 'question'."
            raise KeyError(msg)
        if "response" not in item:
            msg = f"Item {i} missing required key 'response'."
            raise KeyError(msg)

        score = evaluate(
            question=item["question"],
            response=item["response"],
            context=item.get("context"),
            model=model,
            reference_csv=reference_csv,
        )
        results.append(score)

    logger.info(
        "Evaluated %d items (%d flagged).", len(results), sum(1 for r in results if r.flagged)
    )

    return results
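In the source above, key validation happens inside the loop, so items that precede a malformed one are scored before the KeyError is raised. One way to fail before spending any embedding time is to pre-validate the batch. The helper below is a hypothetical caller-side sketch and not part of groundlens.

```python
REQUIRED_KEYS = ("question", "response")


def find_invalid_items(items: list[dict[str, str]]) -> list[tuple[int, str]]:
    """Return (index, missing_key) pairs for every item lacking a required key."""
    problems = []
    for i, item in enumerate(items):
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append((i, key))
    return problems


batch = [
    {"question": "Q1?", "response": "A1."},
    {"question": "Q2?"},   # missing "response"
    {"response": "A3."},   # missing "question"
]
problems = find_invalid_items(batch)
print(problems)  # [(1, 'response'), (2, 'question')]
```

A caller could raise or log on a non-empty result and only then hand the batch to evaluate_batch.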

banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet

Curated ruleset for regulated banking governance decisions.

The rules cover the three sub-scores that an auditor or compliance officer typically inspects in a deferral or escalation rationale:

  • Specificity (spec): does the rationale cite the case parameters that triggered the decision? Flags, risk score, numeric thresholds, gates, completeness, jurisdictional details, sufficient length, and specificity-marking language.
  • Explanatory linkage (expl): does the rationale link the case facts to the decision? Conditional structure, pending actions, causal connectives, epistemic limits, domain references, modal verbs, length, and temporal ordering.
  • Boundary shift (bshift): does the rationale state what would change the decision? Conditional approval pathways, information requests, risk-reduction proposals, alternative framings, threshold references, and length.

The default quality_floor=0.3 follows the cosmetic-deadlock threshold introduced in the financial-decisions governance literature. A response that falls below this floor on either spec or expl is flagged as audit-deficient even if the geometric SGI/DGI score looks acceptable in isolation — a structurally typical "false negative" of embedding-based detection.

Parameters:

Name Type Description Default
quality_floor float

Threshold below which a sub-score triggers the cosmetic-deadlock flag. Tune per deployment risk tolerance.

_DEFAULT_QUALITY_FLOOR

Returns:

Type Description
RuleSet

A RuleSet named "banking_v1".

Source code in src/groundlens/rules.py
def banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet:
    """Curated ruleset for regulated banking governance decisions.

    The rules cover the three sub-scores that an auditor or compliance
    officer typically inspects in a deferral or escalation rationale:

    - **Specificity (spec):** does the rationale cite the case parameters
      that triggered the decision? Flags, risk score, numeric thresholds,
      gates, completeness, jurisdictional details, sufficient length, and
      specificity-marking language.
    - **Explanatory linkage (expl):** does the rationale link the case
      facts to the decision? Conditional structure, pending actions, causal
      connectives, epistemic limits, domain references, modal verbs,
      length, and temporal ordering.
    - **Boundary shift (bshift):** does the rationale state what would
      change the decision? Conditional approval pathways, information
      requests, risk-reduction proposals, alternative framings, threshold
      references, and length.

    The default ``quality_floor=0.3`` follows the cosmetic-deadlock
    threshold introduced in the financial-decisions governance literature.
    A response that falls below this floor on either ``spec`` or ``expl``
    is flagged as audit-deficient even if the geometric SGI/DGI score
    looks acceptable in isolation — a structurally typical "false
    negative" of embedding-based detection.

    Args:
        quality_floor: Threshold below which a sub-score triggers the
            cosmetic-deadlock flag. Tune per deployment risk tolerance.

    Returns:
        A :class:`RuleSet` named ``"banking_v1"``.
    """
    rules: tuple[ChecklistRule, ...] = (
        # Specificity sub-rules
        ChecklistRule("spec.reg_flag", "regulatory flag", 0.20, "spec", _check_regulatory_flag),
        ChecklistRule("spec.risk_ref", "risk reference", 0.15, "spec", _check_risk_reference),
        ChecklistRule("spec.numeric", "numeric value", 0.10, "spec", _check_numeric_value),
        ChecklistRule("spec.gate", "gate / threshold", 0.10, "spec", _check_gate_name),
        ChecklistRule("spec.info_gap", "information gap", 0.15, "spec", _check_information_gap),
        ChecklistRule(
            "spec.case_detail", "case-specific detail", 0.10, "spec", _check_case_specific_detail
        ),
        ChecklistRule(
            "spec.length", "substantive length", 0.10, "spec", _check_substantive_length
        ),
        ChecklistRule(
            "spec.spec_language",
            "specificity language",
            0.10,
            "spec",
            _check_specificity_language,
        ),
        # Explanatory linkage sub-rules
        ChecklistRule(
            "expl.conditional", "conditional structure", 0.20, "expl", _check_conditional_structure
        ),
        ChecklistRule("expl.pending", "pending action", 0.15, "expl", _check_pending_action),
        ChecklistRule("expl.causal", "causal connective", 0.15, "expl", _check_causal_connective),
        ChecklistRule(
            "expl.epistemic", "epistemic limitation", 0.15, "expl", _check_epistemic_limit
        ),
        ChecklistRule("expl.domain", "domain reference", 0.10, "expl", _check_domain_reference),
        ChecklistRule("expl.modal", "modal verb", 0.10, "expl", _check_modal_verb),
        ChecklistRule("expl.length", "minimum length", 0.10, "expl", _check_minimum_length),
        ChecklistRule(
            "expl.temporal", "temporal ordering", 0.05, "expl", _check_temporal_ordering
        ),
        # Boundary shift sub-rules
        ChecklistRule(
            "bshift.cond_approval",
            "conditional approval",
            0.25,
            "bshift",
            _check_conditional_approval,
        ),
        ChecklistRule(
            "bshift.info_request",
            "information request",
            0.20,
            "bshift",
            _check_information_request,
        ),
        ChecklistRule(
            "bshift.risk_reduction", "risk reduction", 0.15, "bshift", _check_risk_reduction
        ),
        ChecklistRule(
            "bshift.alternative", "alternative framing", 0.10, "bshift", _check_alternative_framing
        ),
        ChecklistRule(
            "bshift.threshold_ref",
            "threshold reference",
            0.10,
            "bshift",
            _check_threshold_reference,
        ),
        ChecklistRule(
            "bshift.length", "resolution-path length", 0.05, "bshift", _check_resolution_length
        ),
    )
    return RuleSet(name="banking_v1", rules=rules, quality_floor=quality_floor)
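How RuleSet turns rule outcomes into sub-scores is not shown on this page. The sketch below assumes the simplest aggregation, where a sub-score is the sum of the weights of the rules that pass, and applies the quality-floor test described in the docstring (flag when spec or expl falls below the floor). The function names are hypothetical; the weights are copied from the spec rules above.

```python
SPEC_WEIGHTS = {
    "spec.reg_flag": 0.20,
    "spec.risk_ref": 0.15,
    "spec.numeric": 0.10,
    "spec.gate": 0.10,
    "spec.info_gap": 0.15,
    "spec.case_detail": 0.10,
    "spec.length": 0.10,
    "spec.spec_language": 0.10,
}


def sub_score(weights: dict[str, float], passed: set[str]) -> float:
    """Assumed aggregation: sum of the weights of the rules that passed."""
    return sum(w for rule_id, w in weights.items() if rule_id in passed)


def cosmetic_deadlock(spec: float, expl: float, quality_floor: float = 0.3) -> bool:
    """Flag when either spec or expl falls below the quality floor."""
    return spec < quality_floor or expl < quality_floor


spec = sub_score(SPEC_WEIGHTS, {"spec.reg_flag"})  # only one rule passes
print(round(spec, 2), cosmetic_deadlock(spec, expl=0.9))  # 0.2 True
```

A rationale that only mentions a regulatory flag scores 0.2 on spec under this assumption, so it is flagged at the default floor of 0.3 however strong its explanatory linkage is.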

decision_rationale_rules(domain: str = 'finance', regulations: tuple[str, ...] = (), quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet

Rule set for decision-rationale agents (credit / AML / KYC / sanctions).

Canonical factory for the 20-rule, 5-sub-score decision-rationale rule set. Replaces :func:groundlens_banking_rules under the archetype-as-function naming convention introduced in ADR 0001 (release 2026.6.13).

Parameters:

Name Type Description Default
domain str

Deployment domain. Currently only "finance" (default) is supported; calling with any other value raises ValueError so the caller knows the verticalization is not yet shipped. Insurance, healthcare, and legal vertical decision-rationale sets are on the roadmap.

'finance'
regulations tuple[str, ...]

Optional tuple of regulation keys. When non-empty, audit_explanation lines whose rule citation does not mention any of the requested regulations are suppressed from the rendered audit text. Does not add or remove rules. Valid keys include: "eu_ai_act", "sr_26_2", "sr_11_7", "nist_ai_600_1", "nist_ai_rmf", "iso_42001", "ecb_internal_models", "eba_gl_2020_06", "pra_ss1_23", "hipaa", "gdpr".

Implementation note (2026.6.13): the kwarg is accepted and validated, but provenance-filtered rendering of audit_explanation will land in a follow-up release. For now the audit text is unmodified; the rule set is returned unchanged. A UserWarning is emitted when the kwarg is non-empty so the caller is aware the filter is not yet active.

()
quality_floor float

Threshold below which a sub-score triggers the cosmetic-deadlock flag. Kept for compatibility with the legacy banking_rules() signature.

_DEFAULT_QUALITY_FLOOR

Returns:

Type Description
RuleSet

A RuleSet named "decision_rationale_v1_finance" with five sub-scores and 20 rules. The rules and weights are identical to those of groundlens_banking_rules; only the rule-set name is updated.

Raises:

Type Description
ValueError

If domain is not in :data:_VALID_DECISION_RATIONALE_DOMAINS.

Example::

from groundlens import decision_rationale_rules

rs = decision_rationale_rules(
    domain="finance",
    regulations=("eu_ai_act", "sr_26_2"),
)
result = rs.evaluate(question=q, response=r, context=ctx)
Source code in src/groundlens/rules.py
def decision_rationale_rules(
    domain: str = "finance",
    regulations: tuple[str, ...] = (),
    quality_floor: float = _DEFAULT_QUALITY_FLOOR,
) -> RuleSet:
    """Rule set for decision-rationale agents (credit / AML / KYC / sanctions).

    Canonical factory for the 20-rule, 5-sub-score decision-rationale
    rule set. Replaces :func:`groundlens_banking_rules` under the
    archetype-as-function naming convention introduced in ADR 0001
    (release 2026.6.13).

    Args:
        domain: Deployment domain. Currently only ``"finance"`` (default)
            is supported; calling with any other value raises
            ``ValueError`` so the caller knows the verticalization is not
            yet shipped. Insurance, healthcare, and legal vertical
            decision-rationale sets are on the roadmap.
        regulations: Optional tuple of regulation keys. When non-empty,
            ``audit_explanation`` lines whose rule citation does not
            mention any of the requested regulations are suppressed from
            the rendered audit text. Does not add or remove rules. Valid
            keys include: ``"eu_ai_act"``, ``"sr_26_2"``, ``"sr_11_7"``,
            ``"nist_ai_600_1"``, ``"nist_ai_rmf"``, ``"iso_42001"``,
            ``"ecb_internal_models"``, ``"eba_gl_2020_06"``,
            ``"pra_ss1_23"``, ``"hipaa"``, ``"gdpr"``.

            *Implementation note (2026.6.13):* the kwarg is accepted and
            validated, but provenance-filtered rendering of
            ``audit_explanation`` will land in a follow-up release. For
            now the audit text is unmodified; the rule set is returned
            unchanged. A ``UserWarning`` is emitted when the kwarg is
            non-empty so the caller is aware the filter is not yet active.
        quality_floor: Threshold below which a sub-score triggers the
            cosmetic-deadlock flag. Kept for compatibility with the
            legacy ``banking_rules()`` signature.

    Returns:
        A :class:`RuleSet` named ``"decision_rationale_v1_finance"`` with
        five sub-scores and 20 rules. The rules and weights are identical
        to those of :func:`groundlens_banking_rules`; only the rule-set
        name is updated.

    Raises:
        ValueError: If ``domain`` is not in
            :data:`_VALID_DECISION_RATIONALE_DOMAINS`.

    Example::

        from groundlens import decision_rationale_rules

        rs = decision_rationale_rules(
            domain="finance",
            regulations=("eu_ai_act", "sr_26_2"),
        )
        result = rs.evaluate(question=q, response=r, context=ctx)
    """
    if domain not in _VALID_DECISION_RATIONALE_DOMAINS:
        msg = (
            f"decision_rationale_rules(domain={domain!r}) — supported domains "
            f"are {_VALID_DECISION_RATIONALE_DOMAINS}. Other verticalizations "
            "are on the roadmap; open an issue at "
            "https://github.com/groundlens-dev/groundlens/issues to request "
            "one."
        )
        raise ValueError(msg)

    unknown = tuple(r for r in regulations if r not in _REGULATION_CITATION_KEYS)
    if unknown:
        msg = (
            f"decision_rationale_rules(regulations={regulations!r}) — unknown "
            f"keys {unknown}. Known keys: "
            f"{tuple(_REGULATION_CITATION_KEYS.keys())}."
        )
        raise ValueError(msg)
    if regulations:
        warnings.warn(
            "decision_rationale_rules(regulations=...) is accepted but the "
            "provenance-filtered audit_explanation rendering is not yet "
            "active (slated for a follow-up release). The returned RuleSet "
            "is unchanged.",
            UserWarning,
            stacklevel=2,
        )

    base = groundlens_banking_rules(quality_floor=quality_floor)
    # Replace the legacy name with the archetype-aware canonical name.
    object.__setattr__(base, "name", f"decision_rationale_v1_{domain}")
    return base
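The check order in the source above (domain first, then unknown regulation keys, then a `UserWarning` for any non-empty `regulations`) can be exercised without the library. The sketch below is a stand-in, not groundlens code: `VALID_DOMAINS`, `KNOWN_REGULATIONS`, `validate_args`, and `call_and_collect` are hypothetical names, and the key sets contain only values that appear on this page.

```python
from __future__ import annotations

import warnings

# Hypothetical stand-ins for the library's private constants.
VALID_DOMAINS = ("finance",)
KNOWN_REGULATIONS = ("eu_ai_act", "sr_26_2", "pra_ss1_23", "hipaa", "gdpr")


def validate_args(domain: str, regulations: tuple[str, ...] = ()) -> None:
    """Mirror the check order: domain, then unknown keys, then the warning."""
    if domain not in VALID_DOMAINS:
        raise ValueError(f"unsupported domain {domain!r}")
    unknown = tuple(r for r in regulations if r not in KNOWN_REGULATIONS)
    if unknown:
        raise ValueError(f"unknown regulation keys {unknown}")
    if regulations:
        # The filter is accepted but inactive, so the caller is only warned.
        warnings.warn("regulations filter is not yet active", UserWarning, stacklevel=2)


def call_and_collect(domain: str, regulations: tuple[str, ...] = ()) -> list[str]:
    """Run the validation and return the text of any warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let the default filter dedupe
        validate_args(domain, regulations)
    return [str(w.message) for w in caught]
```

Recording warnings this way lets a caller log the "filter not yet active" notice in an audit trail instead of letting it scroll past on stderr.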

groundlens_banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet

Canonical rule set for LLM rationale evaluation in banking governance.

Returns the 20-rule reference set whose provenance is triangulated across five independent research tracks: peer-reviewed NLP literature, tier-1 bank public reports, banking regulator whitepapers, cross-industry frameworks, and financial-domain NLP benchmarks. The rules are organized into five empirically emergent sub-score categories:

  • groundedness (5 rules): claims linked to and supported by source.
  • completeness (3 rules): coverage of the governance question.
  • calibration (4 rules): uncertainty expression and abstention.
  • traceability (5 rules): citation, audit trail, validation references.
  • robustness (3 rules): resistance to noise, conflict, injection.

Each rule carries a citation field pointing to at least one of its academic, industrial, or regulatory provenance sources. The companion paper (Marin, 2026) documents the full per-rule provenance.

The default flag predicate :func:_groundlens_banking_flag_predicate triggers when any regulator-non-negotiable sub-score falls below its threshold (groundedness < 0.5, calibration < 0.3, or traceability < 0.4).
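The threshold logic described above can be sketched in a few lines. This is an illustration only: the real `_groundlens_banking_flag_predicate` is not shown on this page, so the function name `is_flagged`, its dict-based input, and the constant name are assumptions. Only the three threshold values and the strict "below" comparison come from the description.

```python
from __future__ import annotations

# Thresholds quoted in the description above; the constant name is hypothetical.
NON_NEGOTIABLE_THRESHOLDS = {
    "groundedness": 0.5,
    "calibration": 0.3,
    "traceability": 0.4,
}


def is_flagged(sub_scores: dict[str, float]) -> bool:
    """Return True if any non-negotiable sub-score is strictly below its threshold."""
    return any(
        sub_scores[name] < floor
        for name, floor in NON_NEGOTIABLE_THRESHOLDS.items()
    )
```

Under this reading, completeness and robustness never trigger the flag on their own, and a score exactly equal to its threshold does not flag.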

Parameters:

Name Type Description Default
quality_floor float

Legacy floor exposed for users who want a uniform threshold across sub-scores. Not used by the default flag predicate; kept for compatibility with the legacy banking_rules() signature so deployers can A/B both rulesets with one parameter.

_DEFAULT_QUALITY_FLOOR

Returns:

Type Description
RuleSet

A RuleSet named "groundlens_banking_v1" with five sub-scores and 20 rules.

Source code in src/groundlens/rules.py
def groundlens_banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet:
    """Canonical rule set for LLM rationale evaluation in banking governance.

    Returns the 20-rule reference set whose provenance is triangulated across
    five independent research tracks: peer-reviewed NLP literature, tier-1
    bank public reports, banking regulator whitepapers, cross-industry
    frameworks, and financial-domain NLP benchmarks. The rules are organized
    into five empirically emergent sub-score categories:

    - **groundedness** (5 rules): claims linked to and supported by source.
    - **completeness** (3 rules): coverage of the governance question.
    - **calibration** (4 rules): uncertainty expression and abstention.
    - **traceability** (5 rules): citation, audit trail, validation references.
    - **robustness** (3 rules): resistance to noise, conflict, injection.

    Each rule carries a ``citation`` field pointing to at least one of its
    academic, industrial, or regulatory provenance sources. The companion
    paper (Marin, 2026) documents the full per-rule provenance.

    The default flag predicate :func:`_groundlens_banking_flag_predicate`
    triggers when any regulator-non-negotiable sub-score falls below its
    threshold (groundedness < 0.5, calibration < 0.3, or traceability < 0.4).

    Args:
        quality_floor: Legacy floor exposed for users who want a uniform
            threshold across sub-scores. Not used by the default flag
            predicate; kept for compatibility with the legacy ``banking_rules()``
            signature so deployers can A/B both rulesets with one parameter.

    Returns:
        A :class:`RuleSet` named ``"groundlens_banking_v1"`` with five
        sub-scores and 20 rules.
    """
    rules: tuple[ChecklistRule, ...] = (
        # ── Groundedness (5 rules) ──────────────────────────────────────────
        ChecklistRule(
            id="grnd.claim_supported_by_context",
            description="every claim inferable from context",
            weight=0.25,
            sub_score="groundedness",
            check=_check_grounded_in_context,
            citation="RAGAs (Es et al., EACL 2024) §3; NIST AI 600-1 (2024) §2.2 Confabulation",
        ),
        ChecklistRule(
            id="grnd.atomic_decomposition",
            description="rationale decomposable into atomic claims",
            weight=0.20,
            sub_score="groundedness",
            check=_check_atomic_decomposable,
            citation="FactScore (Min et al., EMNLP 2023) §3; RAGAs (Es et al., EACL 2024) §3",
        ),
        ChecklistRule(
            id="grnd.no_unsupported_extensions",
            description="no claims beyond what context supports",
            weight=0.20,
            sub_score="groundedness",
            check=_check_no_unsupported_extensions,
            citation=(
                "HaluEval (Li et al., EMNLP 2023); Ji et al. ACM CSUR 2023; NIST AI 600-1 (2024)"
            ),
        ),
        ChecklistRule(
            id="grnd.regulatory_flag",
            description="names a specific regulatory flag or policy clause",
            weight=0.20,
            sub_score="groundedness",
            check=_check_regulatory_flag,
            citation="REV (Chen et al., ACL 2023); SR 26-2 (Fed/OCC/FDIC 2026) §VI Documentation",
        ),
        ChecklistRule(
            id="grnd.counterfactual_robust",
            description="screened against wrong-retrieval scenarios",
            weight=0.15,
            sub_score="groundedness",
            check=_check_counterfactual_robustness,
            citation="RGB (Chen et al., AAAI 2024); EU AI Act 2024/1689 Art. 15(4)",
        ),
        # ── Completeness (3 rules) ──────────────────────────────────────────
        ChecklistRule(
            id="comp.addresses_all_parts",
            description="response length scales with question parts",
            weight=0.40,
            sub_score="completeness",
            check=_check_addresses_all_parts,
            citation="RAGAs (Es et al., EACL 2024) §3; EU AI Act 2024/1689 Art. 13(2)",
        ),
        ChecklistRule(
            id="comp.governance_dimensions",
            description="references multiple governance dimensions",
            weight=0.35,
            sub_score="completeness",
            check=_check_governance_dimensions,
            citation="EBA GL/2020/06 §4.3.3; SR 26-2 (Fed/OCC/FDIC 2026) §IV Model Development",
        ),
        ChecklistRule(
            id="comp.information_integration",
            description="integrates multiple sources",
            weight=0.25,
            sub_score="completeness",
            check=_check_information_integration,
            citation="RGB (Chen et al., AAAI 2024); TRUE (Honovich et al., NAACL 2022)",
        ),
        # ── Calibration (4 rules) ───────────────────────────────────────────
        ChecklistRule(
            id="cal.abstains_when_insufficient",
            description="explicitly abstains when evidence is insufficient",
            weight=0.35,
            sub_score="calibration",
            check=_check_abstains_when_insufficient,
            citation=(
                "RAGAs (Es et al., EACL 2024) §3; FinanceBench (Islam et al., 2023); "
                "SR 26-2 §V Model Validation"
            ),
        ),
        ChecklistRule(
            id="cal.explicit_hedging",
            description="uses hedging language for uncertain claims",
            weight=0.30,
            sub_score="calibration",
            check=_check_explicit_hedging,
            citation=(
                "TruthfulQA (Lin et al., ACL 2022); Hyland (1998) hedging taxonomy; "
                "SR 26-2 §IV Model Use"
            ),
        ),
        ChecklistRule(
            id="cal.confidence_score",
            description="includes a numeric confidence or probability",
            weight=0.20,
            sub_score="calibration",
            check=_check_confidence_score,
            citation="G-Eval (Liu et al., EMNLP 2023); EU AI Act Art. 13(3)(b)(ii)",
        ),
        ChecklistRule(
            id="cal.self_consistency",
            description="pipeline screened for self-consistency",
            weight=0.15,
            sub_score="calibration",
            check=_check_self_consistency,
            citation="SelfCheckGPT (Manakul et al., EMNLP 2023); Morgan Stanley + OpenAI (2024)",
        ),
        # ── Traceability (5 rules) ──────────────────────────────────────────
        ChecklistRule(
            id="trace.specific_source_span",
            description="cites a specific page / section / paragraph",
            weight=0.25,
            sub_score="traceability",
            check=_check_specific_source_span,
            citation=(
                "e-SNLI (Camburu et al., NeurIPS 2018); EU AI Act Art. 13(3)(b)(iv); "
                "FinanceBench (Islam et al., 2023)"
            ),
        ),
        ChecklistRule(
            id="trace.natural_language_rationale",
            description="provides a substantive natural-language rationale",
            weight=0.20,
            sub_score="traceability",
            check=_check_substantive_length,
            citation=(
                "e-SNLI (Camburu et al., NeurIPS 2018); EU AI Act Art. 13(3)(b)(iv); "
                "PRA SS1/23 Principle 3"
            ),
        ),
        ChecklistRule(
            id="trace.falsifiable_actionable",
            description="couples numeric claim with causal mechanism",
            weight=0.20,
            sub_score="traceability",
            check=_check_falsifiable_actionable,
            citation="REV (Chen et al., ACL 2023); SR 26-2 §V Conceptual Soundness",
        ),
        ChecklistRule(
            id="trace.numeric_value",
            description="includes a numeric value or metric",
            weight=0.15,
            sub_score="traceability",
            check=_check_numeric_value,
            citation=(
                "FinQA (Chen et al., EMNLP 2021); EU AI Act Art. 13(3)(b)(ii); "
                "SR 26-2 §V Outcomes Analysis"
            ),
        ),
        ChecklistRule(
            id="trace.audit_logged",
            description="rationale persisted to audit log",
            weight=0.20,
            sub_score="traceability",
            check=_check_audit_logged,
            citation=(
                "EU AI Act Art. 12 Record-Keeping; SR 26-2 §VI Documentation; "
                "ISO/IEC 42001:2023 §8.2"
            ),
        ),
        # ── Robustness (3 rules) ────────────────────────────────────────────
        ChecklistRule(
            id="rob.independent_validation",
            description="references independent validation / effective challenge",
            weight=0.40,
            sub_score="robustness",
            check=_check_independent_validation,
            citation=(
                "SR 26-2 §III Effective Challenge; PRA SS1/23 Principle 4; "
                "ECB Guide to Internal Models §9.3 ¶43(a)"
            ),
        ),
        ChecklistRule(
            id="rob.prompt_injection_robust",
            description="pipeline screened for prompt-injection robustness",
            weight=0.35,
            sub_score="robustness",
            check=_check_prompt_injection_robust,
            citation="RGB (Chen et al., AAAI 2024); EU AI Act Art. 15; MAS MindForge (2024)",
        ),
        ChecklistRule(
            id="rob.cross_source_conflict",
            description="acknowledges cross-source conflicts",
            weight=0.25,
            sub_score="robustness",
            check=_check_cross_source_conflict,
            citation=(
                "ConflictBank (Su et al., 2024); EU AI Act Art. 15(4); RGB (Chen et al., 2024)"
            ),
        ),
    )

    return RuleSet(
        name="groundlens_banking_v1",
        rules=rules,
        sub_scores=("groundedness", "completeness", "calibration", "traceability", "robustness"),
        quality_floor=quality_floor,
        flag_predicate=_groundlens_banking_flag_predicate,
    )
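The weights in the rule definitions above sum to 1.0 within each sub-score, which the sketch below checks from values copied off this page. The `weighted_pass_rate` helper is an assumption added for illustration: the page does not show how `RuleSet` combines rule results, so treat a weighted pass rate as one plausible reading rather than the library's method.

```python
from __future__ import annotations

import math

# (sub_score, weight) pairs copied from the ChecklistRule definitions above.
RULE_WEIGHTS = [
    ("groundedness", 0.25), ("groundedness", 0.20), ("groundedness", 0.20),
    ("groundedness", 0.20), ("groundedness", 0.15),
    ("completeness", 0.40), ("completeness", 0.35), ("completeness", 0.25),
    ("calibration", 0.35), ("calibration", 0.30), ("calibration", 0.20),
    ("calibration", 0.15),
    ("traceability", 0.25), ("traceability", 0.20), ("traceability", 0.20),
    ("traceability", 0.15), ("traceability", 0.20),
    ("robustness", 0.40), ("robustness", 0.35), ("robustness", 0.25),
]


def weight_totals(pairs: list[tuple[str, float]]) -> dict[str, float]:
    """Sum the rule weights per sub-score."""
    totals: dict[str, float] = {}
    for sub_score, weight in pairs:
        totals[sub_score] = totals.get(sub_score, 0.0) + weight
    return totals


def weighted_pass_rate(results: list[tuple[float, bool]]) -> float:
    """Hypothetical aggregation: weight of passed rules over total weight."""
    total = sum(weight for weight, _ in results)
    passed = sum(weight for weight, ok in results if ok)
    return passed / total


totals = weight_totals(RULE_WEIGHTS)
# Compare with a tolerance, since the float sums are not exactly 1.0.
all_normalized = all(math.isclose(t, 1.0) for t in totals.values())
```

Because each sub-score's weights are normalized, a weighted pass rate would fall in [0, 1] and be directly comparable to thresholds such as groundedness < 0.5.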

compute_sgi(question: str, context: str, response: str, *, model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None) -> SGIResult

Compute the Semantic Grounding Index for a response.

Parameters:

Name Type Description Default
question str

The input query.

required
context str

Source document, retrieved chunks, or reference text.

required
response str

The LLM output to evaluate.

required
model str

Sentence transformer model name. Defaults to DEFAULT_MODEL ('Snowflake/snowflake-arctic-embed-l-v2.0').

DEFAULT_MODEL
encoder EmbeddingFn | None

Optional bring-your-own-embeddings callable taking list[str] and returning an (n, d) array. Bypasses sentence-transformers (no torch required) when provided.

None

Returns:

Type Description
SGIResult

SGIResult with raw score, normalized score, and flag status.

Raises:

Type Description
ValueError

If any input string is empty.

Example

>>> from groundlens import compute_sgi
>>> result = compute_sgi(
...     question="What is the capital of France?",
...     context="France is in Western Europe. Its capital is Paris.",
...     response="The capital of France is Paris.",
... )
>>> result.flagged
False
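The `encoder` parameter accepts any callable that maps a list of strings to one embedding row per string. A minimal sketch of such a callable follows, using a deterministic hashed bag-of-words. The name `toy_encoder` and the `dim` parameter are hypothetical, the vectors carry no real semantics, and the function returns nested lists to stay dependency-free. Since the documentation asks for an `(n, d)` array, wrapping the result in an array type is probably needed before passing it as `encoder=`.

```python
from __future__ import annotations

import hashlib


def toy_encoder(texts: list[str], dim: int = 64) -> list[list[float]]:
    """Hash each lowercase token into one of `dim` buckets and count occurrences."""
    rows: list[list[float]] = []
    for text in texts:
        row = [0.0] * dim
        for token in text.lower().split():
            # sha256 is stable across processes, unlike the built-in hash().
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % dim
            row[bucket] += 1.0
        rows.append(row)
    return rows
```

A real deployment would substitute a production embedding client here. The source for `compute_sgi` emits a warning whenever a custom encoder or non-default model is used, because the default thresholds may not transfer to a different embedding space.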

Source code in src/groundlens/sgi.py
def compute_sgi(
    question: str,
    context: str,
    response: str,
    *,
    model: str = DEFAULT_MODEL,
    encoder: EmbeddingFn | None = None,
) -> SGIResult:
    """Compute the Semantic Grounding Index for a response.

    Args:
        question: The input query.
        context: Source document, retrieved chunks, or reference text.
        response: The LLM output to evaluate.
        model: Sentence transformer model name. Defaults to ``DEFAULT_MODEL``
            (``Snowflake/snowflake-arctic-embed-l-v2.0``).
        encoder: Optional bring-your-own-embeddings callable taking
            ``list[str]`` and returning an ``(n, d)`` array. Bypasses
            sentence-transformers (no torch required) when provided.

    Returns:
        SGIResult with raw score, normalized score, and flag status.

    Raises:
        ValueError: If any input string is empty.

    Example:
        >>> from groundlens import compute_sgi
        >>> result = compute_sgi(
        ...     question="What is the capital of France?",
        ...     context="France is in Western Europe. Its capital is Paris.",
        ...     response="The capital of France is Paris.",
        ... )
        >>> result.flagged
        False
    """
    if not question.strip():
        msg = "question must be a non-empty string."
        raise ValueError(msg)
    if not context.strip():
        msg = "context must be a non-empty string."
        raise ValueError(msg)
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    if encoder is not None or model != DEFAULT_MODEL:
        _warn_default_thresholds_with_custom_encoder("compute_sgi", model, encoder is not None)

    embeddings = encode_texts([question, context, response], model_name=model, encoder=encoder)
    q_emb, ctx_emb, resp_emb = embeddings[0], embeddings[1], embeddings[2]

    # L2-normalize to project onto the unit hypersphere (paper Algorithm 1).
    q_hat = _l2_normalize(q_emb)
    c_hat = _l2_normalize(ctx_emb)
    r_hat = _l2_normalize(resp_emb)

    # Angular (geodesic) distances on S^(d-1).
    q_dist = _angular_distance(r_hat, q_hat)
    ctx_dist = _angular_distance(r_hat, c_hat)

    # Degenerate case: response identical to context (theta(r, c) ≈ 0).
    if ctx_dist < _EPS:
        return SGIResult(
            value=10.0,
            normalized=1.0,
            flagged=False,
            q_dist=round(q_dist, 4),
            ctx_dist=round(ctx_dist, 4),
        )

    # Degenerate case: response identical to question (theta(r, q) ≈ 0).
    if q_dist < _EPS:
        return SGIResult(
            value=0.0,
            normalized=0.0,
            flagged=True,
            q_dist=round(q_dist, 4),
            ctx_dist=round(ctx_dist, 4),
        )

    raw = q_dist / ctx_dist
    normalized = normalize_sgi(raw)

    return SGIResult(
        value=round(raw, 4),
        normalized=round(normalized, 4),
        flagged=raw < SGI_REVIEW,
        q_dist=round(q_dist, 4),
        ctx_dist=round(ctx_dist, 4),
    )
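The geometry in the source above (L2-normalize, take angular distances, divide) can be reproduced on toy vectors with the standard library alone. This sketch makes several assumptions: the helpers `l2_normalize` and `angular_distance` are re-implementations, since the library's private versions are not shown; `EPS` is a guessed tolerance; and the flag threshold `SGI_REVIEW` is omitted because its value does not appear on this page.

```python
from __future__ import annotations

import math

EPS = 1e-6  # assumed tolerance; the library's _EPS value is not shown here


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_distance(a: list[float], b: list[float]) -> float:
    """Geodesic distance between unit vectors: arccos of the clipped dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))


def sgi_ratio(q: list[float], c: list[float], r: list[float]) -> float:
    """Ratio theta(r, q) / theta(r, c), with the source's degenerate cases."""
    q_hat, c_hat, r_hat = l2_normalize(q), l2_normalize(c), l2_normalize(r)
    q_dist = angular_distance(r_hat, q_hat)
    ctx_dist = angular_distance(r_hat, c_hat)
    if ctx_dist < EPS:
        return 10.0  # response coincides with context (sentinel from the source)
    if q_dist < EPS:
        return 0.0  # response coincides with question
    return q_dist / ctx_dist


# Response sits 60 degrees from the question and 30 degrees from the context.
question, context = [1.0, 0.0], [0.0, 1.0]
response = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
ratio = sgi_ratio(question, context, response)  # approximately 2.0
```

A ratio above 1 means the response is angularly closer to the context than to the question; the source flags a response when the raw ratio falls below `SGI_REVIEW`.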