
`groundlens` ¶

Groundlens — Verifiable agent triage.

Deterministic. Auditable. No second LLM in the loop.

Groundlens triages outputs from individual LLMs and from multi-agent pipelines (routing, RAG, specialized / tool-using agents). Two layers:

- Geometric layer. SGI and DGI score grounding via embedding geometry, sub-second and deterministic. Apply to any agent's natural-language output.
- Rule-based layer. Domain-specific rule sets with per-rule citations to academic, industrial, and regulatory sources. Per-agent factories live in `groundlens.agents`: `groundlens.agents.routing_rules`, `groundlens.agents.rag_rules`, `groundlens.agents.specialized_agent_rules`.

Quick start::

>>> from groundlens import compute_sgi, compute_dgi, evaluate
>>>
>>> # With context (RAG verification) — uses SGI
>>> result = compute_sgi(
...     question="What is the capital of France?",
...     context="France is in Western Europe. Its capital is Paris.",
...     response="The capital of France is Paris.",
... )
>>> result.flagged
False
>>>
>>> # Without context — uses DGI
>>> result = compute_dgi(
...     question="What causes seasons?",
...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
... )
>>> result.flagged
False
>>>
>>> # Auto-select method
>>> score = evaluate(question="Q?", response="A.", context="Source.")
>>> score.method
'sgi'
>>>
>>> # Agent-specific rule triage
>>> from groundlens.agents import routing_rules, rag_rules, specialized_agent_rules
>>> rag = rag_rules()
>>> rag.name
'groundlens_banking_v1'

References

Marin (2025). Semantic Grounding Index. arXiv:2512.13771.
Marin (2026). A Geometric Taxonomy of Hallucinations. arXiv:2602.13224v3.
Marin (2026). Rotational Dynamics of Factual Constraint Processing. arXiv:2603.13259.
Marin (2026). Defendable Rules for LLM Rationale Evaluation in Banking Governance: A Multi-Source Provenance Framework.

Attributes¶

`DEFAULT_MODEL: str = 'Snowflake/snowflake-arctic-embed-l-v2.0'` `module-attribute` ¶

Default sentence transformer model.

Snowflake Arctic Embed L v2.0 — 1024 dims, 568M params, multilingual (100+ languages including Spanish/Catalan/Galician/English/Portuguese), 8192 token context window. Requires trust_remote_code=True on load (the model ships custom pooling code).

Why this is the default:

- Verified on RAGTruth (n=2,700) and RAGBench (n=8,838) with consistent SGI/DGI behavior; calibrations in cookbooks ship against this encoder.
- L2-normalizes embeddings naturally (contrastive training), which keeps the canonical angular SGI formulation numerically stable.
- Multilingual out of the box, which is relevant for European bank deployments.

When to override:

- Lightweight deployment (CPU-only, latency-critical): use LIGHTWEIGHT_MINILM = "all-MiniLM-L6-v2" (22M params, 384 dims). This was the default through 2026.6.17.
- Spanish/multilingual with a smaller footprint: use MULTILINGUAL_MINI (118M params, 384 dims).
- Higher-quality multilingual at higher cost: use MULTILINGUAL_E5 (560M params, 1024 dims) with the required "query: "/"passage: " prefixes.

To override the default, pass model="..." to compute_sgi, compute_dgi, or the corresponding scorer classes.

`LIGHTWEIGHT_MINILM: str = 'all-MiniLM-L6-v2'` `module-attribute` ¶

Lightweight English-only encoder (22M params, 384 dims). Was the default through groundlens 2026.6.17. Use for latency-critical CPU-only deployments where the trade-off in grounding signal quality is acceptable.

`MULTILINGUAL_E5: str = 'intfloat/multilingual-e5-large'` `module-attribute` ¶

Multilingual E5 (560M params, 1024 dims, 100+ languages). Higher quality than MULTILINGUAL_MINI at ~5x the inference cost. Choose when latency budget allows it (e.g. batch evaluation, audit replay) and the deployment domain has shown weak separation under MiniLM. Requires prefixing queries with "query: " and passages with "passage: " to match the encoder's training recipe; see model card on HuggingFace.
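
The prefix requirement can be handled with a small wrapper applied to text before it reaches the encoder. This is a minimal sketch: `with_e5_prefix` is a hypothetical helper, not part of groundlens, and treating questions as queries and responses as passages is an assumption to check against the model card.

```python
# Hypothetical helper, not part of groundlens: adds the prefixes that
# multilingual-e5 models expect before text is passed to the encoder.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def with_e5_prefix(text: str, kind: str) -> str:
    """Return text carrying the E5 prefix for kind, without doubling it."""
    prefix = E5_PREFIXES[kind]  # KeyError for anything but "query"/"passage"
    return text if text.startswith(prefix) else prefix + text


# Assumption: the question plays the "query" role, the response the "passage" role.
question = with_e5_prefix("What causes seasons?", "query")
answer = with_e5_prefix("Seasons are caused by Earth's axial tilt.", "passage")
print(question)
print(answer)
```

Applying the helper twice leaves the text unchanged, so it is safe to call at the boundary of a custom `encoder` callable.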

`MULTILINGUAL_MINI: str = 'paraphrase-multilingual-MiniLM-L12-v2'` `module-attribute` ¶

Multilingual MiniLM (118M params, 384 dims, 50+ languages including Spanish, Catalan, Galician, English). Sub-second on CPU. Recommended default for European-bank customer-support deployments where the WhatsApp / app channel receives queries across the bank's operating languages. Calibrate mu_hat and SGI threshold on a multilingual verified-grounded corpus for the expected query distribution.

Classes¶

`CalibrationResult(model: str, n_pairs: int, embedding_dim: int, mu_hat: NDArray[np.float32], concentration: float, metadata: dict[str, str] = dict())` `dataclass` ¶

Result of DGI calibration.

Attributes:

Name	Type	Description
`model`	`str`	Sentence transformer model used for calibration.
`n_pairs`	`int`	Number of (question, response) pairs used.
`embedding_dim`	`int`	Dimensionality of the embedding space.
`mu_hat`	`NDArray[float32]`	The computed reference direction vector.
`concentration`	`float`	Estimated concentration parameter (kappa) of the von Mises-Fisher distribution. Higher values indicate more consistent displacement directions in the reference data.
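
To make `mu_hat` and `concentration` concrete, the sketch below estimates a mean direction and a von Mises-Fisher kappa from a handful of unit vectors. It assumes the inputs are already unit-length displacement directions and uses the closed-form approximation of Banerjee et al. (2005); whether groundlens uses this exact estimator is not stated here, and `mean_direction_and_kappa` is a hypothetical name.

```python
import math


def mean_direction_and_kappa(unit_vectors):
    """Estimate a vMF mean direction and concentration from unit vectors.

    Assumption: kappa ~= r_bar * (d - r_bar**2) / (1 - r_bar**2), the
    approximation of Banerjee et al. (2005); groundlens may differ.
    """
    n = len(unit_vectors)
    d = len(unit_vectors[0])
    mean = [sum(v[i] for v in unit_vectors) / n for i in range(d)]
    r_bar = math.sqrt(sum(x * x for x in mean))  # mean resultant length in [0, 1]
    if r_bar == 0.0:
        raise ValueError("directions cancel out; the mean direction is undefined")
    mu_hat = [x / r_bar for x in mean]
    if r_bar >= 1.0:
        return mu_hat, math.inf  # all directions identical
    kappa = r_bar * (d - r_bar**2) / (1.0 - r_bar**2)
    return mu_hat, kappa


# Two orthogonal directions: low agreement, so kappa stays small.
mu_hat, kappa = mean_direction_and_kappa([[1.0, 0.0], [0.0, 1.0]])
print([round(x, 4) for x in mu_hat], round(kappa, 4))
```

The closer the reference directions agree, the closer `r_bar` gets to 1 and the larger kappa becomes, which matches the reading of `concentration` in the table above.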

Methods:¶

`save(path: str | Path) -> None` ¶

Save calibration result to JSON.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Output file path. The mu_hat vector is stored as a list.	required

Source code in src/groundlens/calibrate.py

def save(self, path: str | Path) -> None:
    """Save calibration result to JSON.

    Args:
        path: Output file path. The mu_hat vector is stored as a list.
    """
    data = {
        "model": self.model,
        "n_pairs": self.n_pairs,
        "embedding_dim": self.embedding_dim,
        "mu_hat": self.mu_hat.tolist(),
        "concentration": self.concentration,
        "metadata": self.metadata,
    }
    Path(path).write_text(json.dumps(data, indent=2), encoding="utf-8")
    logger.info("Calibration saved to %s.", path)

`load(path: str | Path) -> CalibrationResult` `classmethod` ¶

Load a saved calibration result.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to JSON calibration file.	required

Returns:

Type	Description
`CalibrationResult`	CalibrationResult instance with restored mu_hat vector.

Source code in src/groundlens/calibrate.py

@classmethod
def load(cls, path: str | Path) -> CalibrationResult:
    """Load a saved calibration result.

    Args:
        path: Path to JSON calibration file.

    Returns:
        CalibrationResult instance with restored mu_hat vector.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return cls(
        model=data["model"],
        n_pairs=data["n_pairs"],
        embedding_dim=data["embedding_dim"],
        mu_hat=np.array(data["mu_hat"], dtype=np.float32),
        concentration=data["concentration"],
        metadata=data.get("metadata", {}),
    )
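
The on-disk format written by `save` and read by `load` can be illustrated without numpy. The sketch below uses `CalibrationRecord`, a hypothetical stand-in that keeps `mu_hat` as a plain list; it is not the library class, and the field values are made up.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class CalibrationRecord:
    """Stand-in for CalibrationResult that keeps mu_hat as a plain list."""

    model: str
    n_pairs: int
    embedding_dim: int
    mu_hat: list
    concentration: float
    metadata: dict = field(default_factory=dict)


def from_json(data: dict) -> CalibrationRecord:
    """Rebuild a record the way load() does, tolerating a missing metadata key."""
    return CalibrationRecord(
        model=data["model"],
        n_pairs=data["n_pairs"],
        embedding_dim=data["embedding_dim"],
        mu_hat=list(data["mu_hat"]),
        concentration=data["concentration"],
        metadata=data.get("metadata", {}),
    )


record = CalibrationRecord("all-MiniLM-L6-v2", 2, 3, [0.0, 0.6, 0.8], 12.5, {"domain": "demo"})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "calibration.json"
    # Same shape as save(): one JSON object with mu_hat stored as a list.
    path.write_text(json.dumps(asdict(record), indent=2), encoding="utf-8")
    data = json.loads(path.read_text(encoding="utf-8"))

restored = from_json(data)
data.pop("metadata")  # simulate a file written without metadata
restored_without_metadata = from_json(data)
print(restored == record, restored_without_metadata.metadata)
```

In the real class, `mu_hat` is converted back to a `float32` array on load, so values may differ from the saved JSON numbers only by `float32` rounding.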

`ThresholdFit(sgi_review: float | None, dgi_pass: float | None, n: int, model: str, metric: str = 'youden_j')` `dataclass` ¶

Fitted decision thresholds for SGI and DGI on a labeled set.

Thresholds are chosen by maximizing Youden's J for the rule "value >= threshold implies grounded" over the supplied examples.

Attributes:

Name	Type	Description
`sgi_review`	`float \| None`	Fitted SGI review threshold, or `None` if no contexts were supplied (SGI requires context).
`dgi_pass`	`float \| None`	Fitted DGI pass threshold, or `None` if it could not be estimated.
`n`	`int`	Number of examples used for fitting.
`model`	`str`	Sentence transformer model the scores were computed with.
`metric`	`str`	Name of the criterion used to pick thresholds.
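
The selection criterion can be sketched in a few lines. Youden's J is the true-positive rate minus the false-positive rate; the sketch below scans the observed scores as candidate thresholds for the rule "value >= threshold implies grounded". The function name, the choice of candidate thresholds, and the tie-breaking are assumptions, not the library's implementation.

```python
def fit_threshold_youden(scores, grounded):
    """Return the threshold t maximizing J = TPR - FPR for 'score >= t => grounded'.

    Returns None when only one class is present, since J is then undefined.
    """
    n_pos = sum(1 for g in grounded if g)
    n_neg = len(grounded) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):  # candidate thresholds: the observed scores
        tp = sum(1 for s, g in zip(scores, grounded) if g and s >= t)
        fp = sum(1 for s, g in zip(scores, grounded) if not g and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t


# Hypothetical labeled scores: three ungrounded, three grounded.
scores = [0.10, 0.20, 0.35, 0.60, 0.70, 0.90]
labels = [False, False, False, True, True, True]
threshold = fit_threshold_youden(scores, labels)
print(threshold)
```

On this perfectly separable toy set the lowest grounded score is selected, because it is the first threshold at which every grounded example passes and no ungrounded one does.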

`DGI(model: str = DEFAULT_MODEL, reference_csv: str | None = None, encoder: EmbeddingFn | None = None)` ¶

Reusable DGI scorer with pre-configured model and calibration.

Use this class when evaluating multiple responses against the same reference direction. Supports both bundled and custom calibration.

Example

>>> dgi = DGI()
>>> result = dgi.score(
...     question="What is ML?",
...     response="ML is a branch of AI.",
... )
>>> result.flagged
False
>>>
>>> dgi = DGI(reference_csv="my_domain_pairs.csv")
>>> result = dgi.score(question="...", response="...")

Initialize DGI scorer.

Parameters:

Name	Type	Description	Default
`model`	`str`	Sentence transformer model name.	`DEFAULT_MODEL`
`reference_csv`	`str \| None`	Path to domain-specific calibration CSV.	`None`
`encoder`	`EmbeddingFn \| None`	Optional bring-your-own-embeddings callable. When set, both calibration and scoring bypass sentence-transformers (no torch required).	`None`

Source code in src/groundlens/dgi.py

def __init__(
    self,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
    encoder: EmbeddingFn | None = None,
) -> None:
    """Initialize DGI scorer.

    Args:
        model: Sentence transformer model name.
        reference_csv: Path to domain-specific calibration CSV.
        encoder: Optional bring-your-own-embeddings callable. When set,
            both calibration and scoring bypass sentence-transformers
            (no torch required).
    """
    self.model = model
    self.reference_csv = reference_csv
    self.encoder = encoder

Methods:¶

`calibrate(pairs: list[tuple[str, str]] | None = None, csv_path: str | None = None) -> None` ¶

Set custom calibration data.

Either provide pairs directly or a path to a CSV file. This replaces any previously cached reference direction.

Parameters:

Name	Type	Description	Default
`pairs`	`list[tuple[str, str]] \| None`	List of verified (question, response) tuples.	`None`
`csv_path`	`str \| None`	Path to a calibration CSV file.	`None`

Raises:

Type	Description
`ValueError`	If neither `pairs` nor `csv_path` is provided.

Source code in src/groundlens/dgi.py

def calibrate(
    self,
    pairs: list[tuple[str, str]] | None = None,
    csv_path: str | None = None,
) -> None:
    """Set custom calibration data.

    Either provide pairs directly or a path to a CSV file.
    This replaces any previously cached reference direction.

    Args:
        pairs: List of verified (question, response) tuples.
        csv_path: Path to a calibration CSV file.

    Raises:
        ValueError: If neither ``pairs`` nor ``csv_path`` is provided.
    """
    enc_id = id(self.encoder) if self.encoder is not None else None

    if csv_path is not None:
        self.reference_csv = csv_path
        # Force recomputation on next score() call.
        cache_key = (self.model, csv_path, enc_id)
        _mu_hat_cache.pop(cache_key, None)
        return

    if pairs is not None:
        # Compute and cache the reference direction directly.
        mu = _compute_reference_direction(pairs, self.model, encoder=self.encoder)
        cache_key = (self.model, "__inline__", enc_id)
        _mu_hat_cache[cache_key] = mu
        self.reference_csv = "__inline__"
        return

    msg = "Provide either 'pairs' or 'csv_path' for calibration."
    raise ValueError(msg)

`score(question: str, response: str) -> DGIResult` ¶

Compute DGI for a single response.

Parameters:

Name	Type	Description	Default
`question`	`str`	The input query.	required
`response`	`str`	The LLM output to evaluate.	required

Returns:

Type	Description
`DGIResult`	DGIResult with score and flag status.

Raises:

Type	Description
`RuntimeError`	If `calibrate(pairs=...)` has not been called yet on this instance and `reference_csv` is the inline sentinel.

Source code in src/groundlens/dgi.py

def score(self, question: str, response: str) -> DGIResult:
    """Compute DGI for a single response.

    Args:
        question: The input query.
        response: The LLM output to evaluate.

    Returns:
        DGIResult with score and flag status.

    Raises:
        RuntimeError: If ``calibrate(pairs=...)`` has not been called
            yet on this instance and ``reference_csv`` is the inline
            sentinel.
    """
    if self.reference_csv == "__inline__":
        # Guard: the inline mu_hat must already be in the cache, since
        # there is no on-disk CSV to fall back to.
        enc_id = id(self.encoder) if self.encoder is not None else None
        cache_key = (self.model, "__inline__", enc_id)
        if cache_key not in _mu_hat_cache:
            msg = "Call calibrate() before score() when using inline pairs."
            raise RuntimeError(msg)

    # Pass reference_csv through unchanged. ``_get_mu_hat`` resolves:
    #   None         -> bundled mu_hat
    #   real path    -> load CSV, compute mu_hat
    #   "__inline__" -> hit the cache populated by calibrate(pairs=...)
    return compute_dgi(
        question=question,
        response=response,
        model=self.model,
        reference_csv=self.reference_csv,
        encoder=self.encoder,
    )

`propose_labels(*, seeds: list[SeedExample], llm_generate: Callable[[str], str], n_candidates: int = 50, n_to_label: int = 10, strategies: str | tuple[str | tuple[str, str], ...] = 'default', diverse_fraction: float = 0.3, seed: int = 42) -> PropositionBatch` ¶

Active-learning bootstrap of a verified-grounded calibration set.

Given 10-50 verified-grounded `SeedExample` triples and a text-generation callable, this method:

1. Picks a seed at random for each candidate and rewrites its `grounded` response under one of the named confabulation strategies, using the seed's own `context` as the source of truth in the prompt. Coherence is preserved by design: the prompt never sees a mismatched context+question pair.
2. Scores each generated candidate with this DGI.
3. Ranks candidates by acquisition score (70% uncertainty / 30% strategy diversity) and returns the top `n_to_label` for a human reviewer.

The method DOES NOT label and DOES NOT calibrate. The human reviewer assigns the labels; the caller then passes the labelled grounded subset to `calibrate`. The loop is non-circular by design.

Parameters:

Name	Type	Description	Default
`seeds`	`list[SeedExample]`	10-50 verified-grounded :class:`SeedExample` triples. Each carries its own `context`, `question` and `grounded` response, so the generation prompt is always coherent.	required
`llm_generate`	`Callable[[str], str]`	A callable `(prompt: str) -> str` that the user provides (an OpenAI / Anthropic / local LLM wrapper). groundlens does not embed an LLM.	required
`n_candidates`	`int`	Total candidates to generate across all strategies. Default 50 (about 200 s of generation at 4 s/call).	`50`
`n_to_label`	`int`	How many candidates the batch should contain. Default 10. Every generated candidate, selected or not, is also kept in `batch.all_candidates` for audit.	`10`
`strategies`	`str \| tuple[str \| tuple[str, str], ...]`	`"default"` (all five strategies from `groundlens-dev/grounding-benchmark`), or a tuple of strategy names, or a tuple of `(name, prompt_template)` custom pairs. Templates accept the slots `{context}`, `{question}`, `{grounded}`.	`'default'`
`diverse_fraction`	`float`	Fraction of the batch reserved for strategy diversity (the rest is filled by uncertainty). Default 0.3.	`0.3`
`seed`	`int`	Random seed for sampling seeds across rounds. Determinism is required for reproducible audits.	`42`

Returns:

Type	Description
`PropositionBatch`	A `groundlens.PropositionBatch` ready for human review.

Raises:

Type	Description
`ValueError`	If `seeds` is empty or `n_candidates` < 1.
`TypeError`	If `llm_generate` is not callable, or any element of `seeds` is not a `SeedExample`.
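
A custom strategy is a `(name, prompt_template)` pair whose template is filled with `str.format`, as the source below shows. The sketch illustrates the shape of such a pair and of an `llm_generate` callable; the strategy name, the prompt wording, and `fake_llm` are hypothetical, and only the three slot names come from the parameter table above.

```python
# Hypothetical strategy name and prompt text; only the slot names
# {context}, {question} and {grounded} come from the documentation.
custom_strategy = (
    "numeric_drift",
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Correct answer: {grounded}\n\n"
    "Rewrite the answer so that one number no longer matches the context.",
)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM wrapper with the (prompt: str) -> str shape."""
    return "Transfers settle within 9 business days."


name, template = custom_strategy
prompt = template.format(
    context="Transfers settle within 2 business days.",
    question="How long do transfers take?",
    grounded="Transfers settle within 2 business days.",
)
candidate = fake_llm(prompt)
print(prompt)
print(candidate)
```

Such a pair would be passed as `strategies=(custom_strategy,)`. Because the template goes through `str.format`, any literal brace in the prompt text has to be doubled (`{{` and `}}`).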

Source code in src/groundlens/dgi.py

def propose_labels(
    self,
    *,
    seeds: list[SeedExample],
    llm_generate: Callable[[str], str],
    n_candidates: int = 50,
    n_to_label: int = 10,
    strategies: str | tuple[str | tuple[str, str], ...] = "default",
    diverse_fraction: float = 0.3,
    seed: int = 42,
) -> PropositionBatch:
    """Active-learning bootstrap of a verified-grounded calibration set.

    Given 10-50 verified-grounded :class:`SeedExample` triples and a
    text-generation callable, this method:

    1. Picks a seed at random for each candidate and rewrites its
       ``grounded`` response under one of the named confabulation
       strategies, using the seed's own ``context`` as the source
       of truth in the prompt. Coherence is preserved by design --
       the prompt never sees a mismatched context+question pair.
    2. Scores each generated candidate with this DGI.
    3. Ranks candidates by acquisition score (70% uncertainty /
       30% strategy diversity) and returns the top ``n_to_label``
       for a human reviewer.

    The method DOES NOT label and DOES NOT calibrate. The human
    reviewer assigns the labels; the caller then passes the labelled
    grounded subset to :meth:`calibrate`. The loop is non-circular
    by design.

    Args:
        seeds: 10-50 verified-grounded :class:`SeedExample` triples.
            Each carries its own ``context``, ``question`` and
            ``grounded`` response, so the generation prompt is
            always coherent.
        llm_generate: A callable ``(prompt: str) -> str`` that the
            user provides (an OpenAI / Anthropic / local LLM
            wrapper). groundlens does not embed an LLM.
        n_candidates: Total candidates to generate across all
            strategies. Default 50 (about 200 s of generation at
            4 s/call).
        n_to_label: How many candidates the batch should contain.
            Default 10. Every generated candidate, selected or not,
            is also kept in ``batch.all_candidates`` for audit.
        strategies: ``"default"`` (all five strategies from
            ``groundlens-dev/grounding-benchmark``), or a tuple of
            strategy names, or a tuple of ``(name, prompt_template)``
            custom pairs. Templates accept the slots ``{context}``,
            ``{question}``, ``{grounded}``.
        diverse_fraction: Fraction of the batch reserved for
            strategy diversity (the rest is filled by uncertainty).
            Default 0.3.
        seed: Random seed for sampling seeds across rounds.
            Determinism is required for reproducible audits.

    Returns:
        A :class:`groundlens.PropositionBatch` ready for human review.

    Raises:
        ValueError: If ``seeds`` is empty or ``n_candidates`` < 1.
        TypeError: If ``llm_generate`` is not callable, or any
            element of ``seeds`` is not a ``SeedExample``.
    """
    import random
    import warnings

    from groundlens._internal.strategies import resolve_strategies
    from groundlens.propose import (
        ProposedLabel,
        PropositionBatch,
        SeedExample,
        _uncertainty,
        build_review_template,
        rank_for_labelling,
    )

    if not seeds:
        msg = "seeds must contain at least one SeedExample."
        raise ValueError(msg)
    if not all(isinstance(s, SeedExample) for s in seeds):
        msg = (
            "Every item in seeds must be a SeedExample(context=..., "
            "question=..., grounded=...) instance."
        )
        raise TypeError(msg)
    if n_candidates < 1:
        msg = "n_candidates must be >= 1."
        raise ValueError(msg)
    if not callable(llm_generate):
        msg = "llm_generate must be a callable (prompt: str) -> str."
        raise TypeError(msg)

    resolved_strategies = resolve_strategies(strategies)
    if not resolved_strategies:
        msg = "At least one strategy must be specified."
        raise ValueError(msg)

    # Threshold: median DGI score on the seed grounded pairs. This is
    # a reasonable proxy for the boundary between grounded and
    # ungrounded when no calibrated threshold is available yet.
    seed_scores = [self.score(s.question, s.grounded).normalized for s in seeds]
    sorted_scores = sorted(seed_scores)
    n = len(sorted_scores)
    median = (
        sorted_scores[n // 2]
        if n % 2 == 1
        else 0.5 * (sorted_scores[n // 2 - 1] + sorted_scores[n // 2])
    )
    threshold = float(median)

    rng = random.Random(seed)

    # Split the budget evenly across strategies, processed in order
    # (per_strategy candidates each). For each candidate, sample ONE
    # seed and pass its OWN (context, question, grounded) through
    # the strategy template, so context and seed never mismatch.
    candidates: list[ProposedLabel] = []
    per_strategy = max(1, n_candidates // len(resolved_strategies))
    for strat_name, template in resolved_strategies:
        for _ in range(per_strategy):
            if len(candidates) >= n_candidates:
                break
            anchor = rng.choice(seeds)
            prompt = template.format(
                context=anchor.context,
                question=anchor.question,
                grounded=anchor.grounded,
            )
            try:
                candidate_resp = llm_generate(prompt)
            except Exception as exc:
                msg = (
                    f"llm_generate raised {type(exc).__name__}: {exc}. "
                    "Skipping this candidate."
                )
                warnings.warn(msg, RuntimeWarning, stacklevel=2)
                continue

            if not isinstance(candidate_resp, str) or not candidate_resp.strip():
                continue

            score = self.score(anchor.question, candidate_resp).normalized
            candidates.append(
                ProposedLabel(
                    question=anchor.question,
                    candidate_response=candidate_resp.strip(),
                    dgi_score=float(score),
                    strategy=strat_name,
                    context_excerpt=anchor.context,
                    uncertainty=_uncertainty(float(score), threshold),
                )
            )

    # Rank for labelling (uncertainty + diversity).
    ranked = rank_for_labelling(
        candidates,
        n_to_label=n_to_label,
        diverse_fraction=diverse_fraction,
    )

    # Audit: keep all candidates, ordered by uncertainty.
    all_ordered = sorted(candidates, key=lambda c: c.uncertainty)

    return PropositionBatch(
        items=tuple(ranked),
        review_template=build_review_template(ranked),
        all_candidates=tuple(all_ordered),
        strategies_used=tuple(name for name, _ in resolved_strategies),
    )

`ProposedLabel(question: str, candidate_response: str, dgi_score: float, strategy: str, context_excerpt: str, uncertainty: float)` `dataclass` ¶

One candidate (question, response) pair ready for human review.

Attributes:

Name	Type	Description
`question`	`str`	A question grounded in one of the FAQ-corpus entries.
`candidate_response`	`str`	A confabulated response written by the generation LLM under the named `strategy`.
`dgi_score`	`float`	The DGI normalized score of the candidate against the current `mu_hat`. Lower scores mean stronger deferral signal.
`strategy`	`str`	The name of the confabulation strategy that produced this candidate (e.g. `"redefinition"`).
`context_excerpt`	`str`	The FAQ excerpt the question was anchored to.
`uncertainty`	`float`	Distance of `dgi_score` from the threshold used for ranking. Smaller = more uncertain = higher priority.
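
The relationship between `dgi_score`, the ranking threshold, and `uncertainty` can be sketched with made-up numbers. The threshold is the median of the seed scores, as in `propose_labels`; treating uncertainty as the absolute distance to that threshold is an assumption based on the description above, since the private helper that computes it is not shown.

```python
from statistics import median

# Hypothetical normalized DGI scores for the verified-grounded seeds.
seed_scores = [0.62, 0.70, 0.55, 0.81]
threshold = median(seed_scores)  # mean of the two middle values for even n

# Hypothetical candidate scores keyed by an illustrative label.
candidate_scores = {"a": 0.30, "b": 0.64, "c": 0.71, "d": 0.90}

# Assumption: uncertainty is the absolute distance to the threshold.
uncertainty = {key: abs(score - threshold) for key, score in candidate_scores.items()}

# Smaller distance = more uncertain = reviewed first.
review_order = sorted(uncertainty, key=uncertainty.get)
print(threshold, review_order)
```

Candidates far from the threshold on either side are the ones DGI is already confident about, so they add the least information when labelled.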

`PropositionBatch(items: tuple[ProposedLabel, ...], review_template: str, all_candidates: tuple[ProposedLabel, ...] = tuple(), strategies_used: tuple[str, ...] = tuple())` `dataclass` ¶

A batch of candidates returned by `groundlens.DGI.propose_labels`.

Attributes:

Name	Type	Description
`items`	`tuple[ProposedLabel, ...]`	Candidates ordered by acquisition score (most useful to label first). Length up to `n_to_label`.
`review_template`	`str`	A Markdown template instructing the human reviewer how to label the items in the batch.
`all_candidates`	`tuple[ProposedLabel, ...]`	Every candidate generated in the round, ordered by ascending `uncertainty` (most uncertain first). Useful for audit and debugging.
`strategies_used`	`tuple[str, ...]`	The tuple of strategy names actually used.
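
One way to read "acquisition score" is as a two-stage selection: most slots go to the most uncertain candidates, and a reserved fraction goes to strategies not yet represented. The sketch below is a hypothetical interpretation of that idea; the library's own ranking function is not shown on this page, and `rank_sketch` and its tuple layout are illustrative.

```python
def rank_sketch(candidates, n_to_label=10, diverse_fraction=0.3):
    """Pick n_to_label items from (strategy, uncertainty) tuples.

    Assumption: (1 - diverse_fraction) of the slots are filled by lowest
    uncertainty; the rest prefer strategies not yet in the batch.
    """
    by_uncertainty = sorted(candidates, key=lambda c: c[1])
    n_diverse = int(round(n_to_label * diverse_fraction))
    n_uncertain = n_to_label - n_diverse
    picked = by_uncertainty[:n_uncertain]
    rest = by_uncertainty[n_uncertain:]
    seen = {strategy for strategy, _ in picked}

    # Diversity slots: the most uncertain candidate of each unseen strategy.
    for cand in list(rest):
        if len(picked) >= n_to_label:
            break
        if cand[0] not in seen:
            picked.append(cand)
            seen.add(cand[0])
            rest.remove(cand)

    # Any slots still open fall back to plain uncertainty order.
    picked.extend(rest[: max(0, n_to_label - len(picked))])
    return picked


pool = [("a", 0.01), ("a", 0.02), ("a", 0.03), ("b", 0.50), ("c", 0.60)]
batch = rank_sketch(pool, n_to_label=4, diverse_fraction=0.5)
print(batch)
```

Without the diversity slots, the batch in this toy pool would be dominated by strategy `"a"`; reserving half the slots brings in `"b"` and `"c"` even though their scores are further from the threshold.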

`SeedExample(context: str, question: str, grounded: str)` `dataclass` ¶

One verified-grounded triple you supply to DGI.propose_labels.

A SeedExample binds a FAQ paragraph (context) to a question that paragraph answers (question) and the verified-grounded response to that question (grounded). Bundling the three together is what keeps the candidate generation coherent: the confabulation prompt receives the same context, question and grounded answer rather than randomly-paired pieces.

Attributes:

Name	Type	Description
`context`	`str`	A paragraph from the deployment's FAQ corpus that supports the grounded response.
`question`	`str`	A question whose answer is contained in `context`.
`grounded`	`str`	The verified-grounded response to `question` given `context`. The confabulation strategies rewrite this response under specific failure modes.

Raises:

Type	Description
`ValueError`	If any field is empty or whitespace-only.

Methods:¶

`__post_init__() -> None` ¶

Validate that every field is a non-empty, non-whitespace string.

Source code in src/groundlens/propose.py

def __post_init__(self) -> None:
    """Validate that every field is a non-empty, non-whitespace string."""
    for name in ("context", "question", "grounded"):
        value = getattr(self, name)
        if not isinstance(value, str) or not value.strip():
            msg = f"SeedExample.{name} must be a non-empty string."
            raise ValueError(msg)
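
Because validation runs at construction time, malformed rows can be filtered while the seed list is being built. The sketch below uses `SeedSketch`, a stand-in that copies the validation shown above, and a hypothetical `build_seeds` helper; neither is part of groundlens, and the rows are made up.

```python
from dataclasses import dataclass


@dataclass
class SeedSketch:
    """Stand-in with the same validation as the __post_init__ shown above."""

    context: str
    question: str
    grounded: str

    def __post_init__(self) -> None:
        for name in ("context", "question", "grounded"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"SeedSketch.{name} must be a non-empty string.")


def build_seeds(rows):
    """Split raw rows into valid seeds and (row index, error) rejections."""
    seeds, rejected = [], []
    for index, row in enumerate(rows):
        try:
            seeds.append(SeedSketch(**row))
        except ValueError as exc:
            rejected.append((index, str(exc)))
    return seeds, rejected


rows = [
    {
        "context": "Cards can be frozen in the app.",
        "question": "Can I freeze my card?",
        "grounded": "Yes, you can freeze it in the app.",
    },
    {"context": "Fees are listed in the tariff.", "question": "  ", "grounded": "See the tariff."},
]
seeds, rejected = build_seeds(rows)
print(len(seeds), rejected)
```

Collecting rejections rather than stopping at the first error makes it easier to report every bad row of a seed file in one pass.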

`ChecklistRule(id: str, description: str, weight: float, sub_score: str, check: Callable[[str, str, str | None, dict[str, Any]], RuleEvidence], citation: str = '')` `dataclass` ¶

A single rule with an id, a pattern check, and a weight.

Rules are designed to be readable: `id` and `description` are surfaced verbatim in the audit explanation. The `check` callable returns a `RuleEvidence` so the audit trail records why the rule fired, not just that it did.

Attributes:

Name	Type	Description
`id`	`str`	Stable identifier (e.g. `"spec.reg_flag"`). Used in audit logs.
`description`	`str`	One-line human-readable description of the rule.
`weight`	`float`	Contribution to the parent sub-score when matched, in [0, 1]. Sub-scores are capped at 1.0 even when weights sum higher.
`sub_score`	`str`	Which sub-score this rule contributes to. For the legacy `banking_rules()` set: `"spec"`, `"expl"`, or `"bshift"`. For the current `groundlens_banking_rules()` set: `"groundedness"`, `"completeness"`, `"calibration"`, `"traceability"`, or `"robustness"`. Custom rule sets may define additional categories.
`check`	`Callable[[str, str, str \| None, dict[str, Any]], RuleEvidence]`	Pure function `(question, response, context, metadata) -> RuleEvidence`. Must be deterministic.
`citation`	`str`	Free-text academic / industry / regulatory provenance for the rule, suitable for inclusion in an audit explanation or a regulatory submission. Empty string when no citation is provided. Example: `"RAGAs (Es et al., EACL 2024) §3 Faithfulness"`.

`RuleEvidence(matched: bool, span: str, explanation: str)` `dataclass` ¶

A single piece of evidence supporting a rule's pass/fail decision.

Attributes:

Name	Type	Description
`matched`	`bool`	Whether the rule pattern matched the input text.
`span`	`str`	The substring (lowercased) that triggered the match, or `""` if no match was found.
`explanation`	`str`	Short human-readable note describing what was checked.
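
A `check` callable is a pure function of `(question, response, context, metadata)` that returns evidence. The sketch below shows one possible check; `Evidence` is a stand-in with the three documented fields rather than the library class, and the rule itself (`cites_percentage`) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Evidence:
    """Stand-in with the three fields documented for RuleEvidence."""

    matched: bool
    span: str
    explanation: str


_PERCENT = re.compile(r"\b\d+(?:\.\d+)?\s?%")


def cites_percentage(
    question: str,
    response: str,
    context: Optional[str],
    metadata: dict[str, Any],
) -> Evidence:
    """Hypothetical rule: the response quotes at least one percentage."""
    match = _PERCENT.search(response.lower())  # span is reported lowercased
    return Evidence(
        matched=match is not None,
        span=match.group(0) if match else "",
        explanation="Looked for a numeric percentage in the response.",
    )


hit = cites_percentage("What is the rate?", "The rate is 4.5% per year.", None, {})
miss = cites_percentage("What is the rate?", "It depends on the product.", None, {})
print(hit)
print(miss)
```

The function reads only its arguments and keeps no state, so repeated calls on the same input return the same evidence, which is what the determinism requirement on `check` asks for.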

`RuleResult(rule_id: str, sub_score: str, weight: float, matched: bool, evidence_span: str, explanation: str)` `dataclass` ¶

Outcome of evaluating a single rule.

Attributes:

Name	Type	Description
`rule_id`	`str`	The :attr:`ChecklistRule.id` that produced this result.
`sub_score`	`str`	Which sub-score this rule contributes to.
`weight`	`float`	The weight of the rule (echo of :attr:`ChecklistRule.weight`).
`matched`	`bool`	Whether the rule fired.
`evidence_span`	`str`	The substring that triggered the match, if any.
`explanation`	`str`	The rule's human-readable explanation.

`RuleSet(name: str, rules: tuple[ChecklistRule, ...], sub_scores: tuple[str, ...] = ('spec', 'expl', 'bshift'), quality_floor: float = _DEFAULT_QUALITY_FLOOR, flag_predicate: Callable[[dict[str, float]], bool] | None = None)` `dataclass` ¶

A collection of rules evaluated together against a (q, r, ctx) triple.

Use `groundlens_banking_rules` for the current canonical five-category ruleset, `banking_rules` for the legacy three-category ruleset, or construct your own by passing a sequence of `ChecklistRule` along with the list of sub-score categories the rules contribute to.

Attributes:

Name	Type	Description
`name`	`str`	Identifier (e.g. `"groundlens_banking_v1"`). Surfaced in audit logs.
`rules`	`tuple[ChecklistRule, ...]`	The rules to evaluate.
`sub_scores`	`tuple[str, ...]`	Ordered tuple of sub-score category names this ruleset produces. Rules whose `sub_score` field is not in this tuple are ignored at aggregation time (their evidence is still recorded in :attr:`RuleSetResult.rule_results`). Default `("spec", "expl", "bshift")` preserves legacy behavior.
`quality_floor`	`float`	Threshold below which a sub-score triggers the audit-deficiency flag under the default predicate. It is applied only to `spec` and `expl`, and only when :attr:`flag_predicate` is `None`.
`flag_predicate`	`Callable[[dict[str, float]], bool] \| None`	Optional pure function `dict[str, float] -> bool` that decides whether the aggregated result is flagged. When `None`, the default legacy predicate is used: flagged iff `spec < quality_floor or expl < quality_floor`.
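
The two flagging paths can be compared side by side. The sketch below mirrors the documented legacy default and adds a hypothetical predicate for the five-category skeleton; the floor value `0.5` is illustrative, since the value of `_DEFAULT_QUALITY_FLOOR` is not shown on this page.

```python
ILLUSTRATIVE_FLOOR = 0.5  # assumption: the real default floor is not shown here

FIVE_CATEGORIES = (
    "groundedness",
    "completeness",
    "calibration",
    "traceability",
    "robustness",
)


def legacy_default_flag(sub_scores: dict[str, float], quality_floor: float) -> bool:
    """Mirror of the documented default: spec or expl below the floor."""
    return (sub_scores.get("spec", 0.0) < quality_floor) or (
        sub_scores.get("expl", 0.0) < quality_floor
    )


def five_category_flag(sub_scores: dict[str, float]) -> bool:
    """Hypothetical predicate: flag when any of the five categories is low."""
    return any(sub_scores.get(name, 0.0) < ILLUSTRATIVE_FLOOR for name in FIVE_CATEGORIES)


strong = dict.fromkeys(FIVE_CATEGORIES, 0.9)
print(legacy_default_flag(strong, ILLUSTRATIVE_FLOOR))
print(five_category_flag(strong))
```

The legacy default reads missing `spec` and `expl` entries as `0.0`, so with any positive floor it flags every result from a ruleset that does not define those two categories. A ruleset with other category names therefore needs its own `flag_predicate`.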

Methods:¶

`evaluate(*, question: str, response: str, context: str | None = None, metadata: dict[str, Any] | None = None) -> RuleSetResult` ¶

Evaluate the ruleset against a single (question, response) pair.

Parameters:

Name	Type	Description	Default
`question`	`str`	The user query / prompt the LLM received.	required
`response`	`str`	The LLM's rationale text being audited.	required
`context`	`str \| None`	Optional retrieved context (RAG-style). May be `None` when no retrieval was performed.	`None`
`metadata`	`dict[str, Any] \| None`	Optional dict carrying domain-specific structured data that some rules may consult (e.g. the case parameters in a banking decision: risk score, flags, amount, etc.).	`None`

Returns:

Type	Description
`RuleSetResult`	A `RuleSetResult` with all sub-scores, the aggregated quality, and a full audit explanation.

Raises:

Type	Description
`ValueError`	If `response` is empty or whitespace-only.

Source code in src/groundlens/rules.py

def evaluate(
    self,
    *,
    question: str,
    response: str,
    context: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> RuleSetResult:
    """Evaluate the ruleset against a single (question, response) pair.

    Args:
        question: The user query / prompt the LLM received.
        response: The LLM's rationale text being audited.
        context: Optional retrieved context (RAG-style). May be ``None``
            when no retrieval was performed.
        metadata: Optional dict carrying domain-specific structured data
            that some rules may consult (e.g. the case parameters in a
            banking decision: risk score, flags, amount, etc.).

    Returns:
        A :class:`RuleSetResult` with all sub-scores, the aggregated
        quality, and a full audit explanation.

    Raises:
        ValueError: If ``response`` is empty.
    """
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    meta = metadata or {}

    results: list[RuleResult] = []
    weights_by_sub: dict[str, float] = dict.fromkeys(self.sub_scores, 0.0)

    for rule in self.rules:
        evidence = rule.check(question, response, context, meta)
        results.append(
            RuleResult(
                rule_id=rule.id,
                sub_score=rule.sub_score,
                weight=rule.weight,
                matched=evidence.matched,
                evidence_span=evidence.span,
                explanation=evidence.explanation,
            )
        )
        if evidence.matched and rule.sub_score in weights_by_sub:
            weights_by_sub[rule.sub_score] += rule.weight

    sub_scores: dict[str, float] = {
        name: round(min(1.0, weights_by_sub[name]), 4) for name in self.sub_scores
    }

    product = 1.0
    for value in sub_scores.values():
        product *= value
    n = len(sub_scores)
    quality = round(product ** (1.0 / n), 4) if product > 0 and n > 0 else 0.0

    if self.flag_predicate is not None:
        flagged = bool(self.flag_predicate(sub_scores))
    else:
        # Legacy default: flagged iff spec or expl below quality_floor.
        flagged = (sub_scores.get("spec", 0.0) < self.quality_floor) or (
            sub_scores.get("expl", 0.0) < self.quality_floor
        )

    audit = _format_audit_explanation(
        ruleset_name=self.name,
        sub_scores=sub_scores,
        quality=quality,
        flagged=flagged,
        quality_floor=self.quality_floor,
        results=results,
    )

    return RuleSetResult(
        sub_scores=sub_scores,
        quality=quality,
        flagged=flagged,
        rule_results=tuple(results),
        audit_explanation=audit,
    )
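
The flagging branch at the end of `evaluate` can be sketched in isolation. This is a minimal illustration, assuming only what the source above shows: a predicate receives the sub-score mapping and its result is coerced with `bool`. The names `any_below_floor` and `decide_flag`, and the 0.5 floor, are hypothetical and not part of groundlens.

```python
from typing import Callable, Mapping, Optional

FlagPredicate = Callable[[Mapping[str, float]], bool]


def any_below_floor(floor: float) -> FlagPredicate:
    """Hypothetical predicate factory: flag when any sub-score is under `floor`."""

    def predicate(sub_scores: Mapping[str, float]) -> bool:
        return any(value < floor for value in sub_scores.values())

    return predicate


def decide_flag(
    sub_scores: Mapping[str, float],
    flag_predicate: Optional[FlagPredicate],
    quality_floor: float,
) -> bool:
    """Mirror of the two branches shown in the source above."""
    if flag_predicate is not None:
        return bool(flag_predicate(sub_scores))
    # Legacy default: only the "spec" and "expl" sub-scores are consulted.
    return (sub_scores.get("spec", 0.0) < quality_floor) or (
        sub_scores.get("expl", 0.0) < quality_floor
    )


scores = {"groundedness": 0.8, "completeness": 0.3}
custom = decide_flag(scores, any_below_floor(0.5), quality_floor=0.5)
legacy = decide_flag(scores, None, quality_floor=0.5)
```

Both calls flag this input, but for different reasons: the custom predicate reacts to `completeness`, while the legacy branch reads `spec` and `expl` as 0.0 because the mapping does not define them. A rule set that uses other sub-score names therefore appears to need its own `flag_predicate` whenever the floor is positive.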

`RuleSetResult(sub_scores: dict[str, float], quality: float, flagged: bool, rule_results: tuple[RuleResult, ...], audit_explanation: str)` `dataclass` ¶

Aggregated result of evaluating a :class:`RuleSet` against a response.

Each sub-score is a capped weight sum of matched rules in that category, stored in the :attr:`sub_scores` mapping. `quality` is the geometric mean of all sub-score values: any zero sub-score yields `quality = 0.0`, reflecting that a rationale missing any audited dimension is structurally incomplete for human review.

Backward-compatible read accessors are exposed for the legacy De-La-Chica style sub-scores (`spec`, `expl`, `bshift`) and for the current GroundLens five-category skeleton (`groundedness`, `completeness`, `calibration`, `traceability`, `robustness`). Accessors return `0.0` when the underlying ruleset did not define the requested sub-score.
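
The aggregation described above (capped weight sums, then a geometric mean) can be reproduced with the standard library. This sketch follows the arithmetic in the `evaluate` source; the helper name `aggregate` and the sample weights are illustrative assumptions.

```python
import math


def aggregate(matched_weights: dict[str, list[float]]) -> tuple[dict[str, float], float]:
    """Cap each sub-score at 1.0, then take the geometric mean of all sub-scores."""
    sub_scores = {
        name: round(min(1.0, sum(weights)), 4)
        for name, weights in matched_weights.items()
    }
    values = list(sub_scores.values())
    n = len(values)
    product = math.prod(values)
    quality = round(product ** (1.0 / n), 4) if product > 0 and n > 0 else 0.0
    return sub_scores, quality


# Weights of the rules that matched, grouped by sub-score (sample values).
sub_scores, quality = aggregate(
    {
        "groundedness": [0.5, 0.3],
        "completeness": [0.7],
        "no_overreach": [0.6, 0.4],
    }
)

# A sub-score with no matched rules collapses the overall quality to zero.
_, zeroed_quality = aggregate({"groundedness": [0.5], "completeness": []})
```

The geometric mean is what makes a single empty dimension decisive: an arithmetic mean would still report a positive quality for the second call.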

Attributes:

Name	Type	Description
`sub_scores`	`dict[str, float]`	Mapping from sub-score name to its capped value in [0, 1]. By convention, do not mutate.
`quality`	`float`	Geometric mean of all sub-score values in :attr:`sub_scores`.
`flagged`	`bool`	`True` when the ruleset's flag predicate is triggered.
`rule_results`	`tuple[RuleResult, ...]`	One :class:`RuleResult` per rule that was evaluated.
`audit_explanation`	`str`	Multi-line human-readable summary suitable for inclusion in an audit log.

Attributes¶

`spec: float` `property` ¶

Legacy specificity sub-score. Returns 0.0 if not defined by ruleset.

`expl: float` `property` ¶

Legacy explanatory-linkage sub-score. Returns 0.0 if not defined by ruleset.

`bshift: float` `property` ¶

Legacy boundary-shift sub-score. Returns 0.0 if not defined by ruleset.

`groundedness: float` `property` ¶

Groundedness sub-score. Returns 0.0 if not defined by ruleset.

`completeness: float` `property` ¶

Completeness sub-score. Returns 0.0 if not defined by ruleset.

`calibration: float` `property` ¶

Calibration sub-score. Returns 0.0 if not defined by ruleset.

`traceability: float` `property` ¶

Traceability sub-score. Returns 0.0 if not defined by ruleset.

`robustness: float` `property` ¶

Robustness sub-score. Returns 0.0 if not defined by ruleset.
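
The accessor behaviour documented above amounts to a dictionary lookup with a 0.0 default. The following is a reduced sketch of that pattern; `MiniResult` is a hypothetical stand-in, not the library's `RuleSetResult`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MiniResult:
    """Hypothetical stand-in reduced to the accessor pattern."""

    sub_scores: dict[str, float] = field(default_factory=dict)

    @property
    def groundedness(self) -> float:
        # Falls back to 0.0 when the rule set never defined this sub-score.
        return self.sub_scores.get("groundedness", 0.0)

    @property
    def spec(self) -> float:
        return self.sub_scores.get("spec", 0.0)


result = MiniResult(sub_scores={"groundedness": 0.8})
defined = "spec" in result.sub_scores
```

Because an accessor returns 0.0 both when a sub-score is undefined and when no rule in it matched, a membership test on `sub_scores` is the way to tell the two cases apart.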

`DGIResult(value: float, normalized: float, flagged: bool, method: str = 'dgi', explanation: str = '')` `dataclass` ¶

Result of Directional Grounding Index computation.

DGI measures whether the question-to-response displacement vector aligns with the mean displacement of verified grounded pairs. Higher values indicate alignment with grounded patterns.

Attributes:

Name	Type	Description
`value`	`float`	Raw DGI score = cosine similarity to reference direction. Range: [-1, 1].
`normalized`	`float`	Score mapped to [0, 1] via linear normalization.
`flagged`	`bool`	`True` if the score is below the pass threshold.
`method`	`str`	Always `"dgi"`.
`explanation`	`str`	Human-readable interpretation of the score.

Methods:¶

`__post_init__() -> None` ¶

Generate explanation from score if not provided.

Source code in src/groundlens/score.py

def __post_init__(self) -> None:
    """Generate explanation from score if not provided."""
    if not self.explanation:
        if self.value >= DGI_PASS:
            expl = f"DGI={self.value:.3f} — aligns with grounded patterns (pass)"
        elif self.value >= 0.0:
            expl = f"DGI={self.value:.3f} — weak alignment (flagged)"
        else:
            expl = f"DGI={self.value:.3f} — opposes grounded direction (high risk)"
        object.__setattr__(self, "explanation", expl)
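
The `value` and `normalized` fields described above can be illustrated with plain Python vectors. This is a sketch under stated assumptions: the reference direction is supplied by the caller, and the linear normalization is taken to be `(value + 1) / 2`, which maps [-1, 1] onto [0, 1] but is not confirmed by this page. The names `cosine` and `dgi_sketch` are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def dgi_sketch(
    question_emb: Sequence[float],
    response_emb: Sequence[float],
    reference_direction: Sequence[float],
) -> tuple[float, float]:
    """Return (value, normalized) for one question/response pair."""
    displacement = [r - q for q, r in zip(question_emb, response_emb)]
    value = cosine(displacement, reference_direction)
    normalized = (value + 1.0) / 2.0  # assumed form of the linear mapping
    return value, normalized


aligned = dgi_sketch([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
opposed = dgi_sketch([0.0, 0.0], [-1.0, 0.0], [1.0, 0.0])
```

In the library the reference direction is described as the mean displacement of verified grounded pairs; here it is just a fixed vector so the two extremes are easy to see.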

`GroundlensScore(value: float, normalized: float, flagged: bool, method: str, explanation: str, detail: SGIResult | DGIResult)` `dataclass` ¶

Unified score container returned by high-level evaluate() calls.

Wraps either an SGIResult or DGIResult with additional metadata.

Attributes:

Name	Type	Description
`value`	`float`	Raw score from the underlying method.
`normalized`	`float`	Score in [0, 1].
`flagged`	`bool`	Whether human review is recommended.
`method`	`str`	`"sgi"` or `"dgi"`.
`explanation`	`str`	Human-readable interpretation.
`detail`	`SGIResult \| DGIResult`	The full SGIResult or DGIResult for method-specific fields.

`SGIResult(value: float, normalized: float, flagged: bool, q_dist: float, ctx_dist: float, method: str = 'sgi', explanation: str = '')` `dataclass` ¶

Result of Semantic Grounding Index computation.

SGI measures whether a response engaged with the provided context or stayed anchored to the question. Higher values indicate stronger context engagement (grounded).

Attributes:

Name	Type	Description
`value`	`float`	Raw SGI score = dist(response, question) / dist(response, context).
`normalized`	`float`	Score mapped to [0, 1] via tanh normalization.
`flagged`	`bool`	`True` if the score is below the review threshold.
`q_dist`	`float`	Euclidean distance from response to question embedding.
`ctx_dist`	`float`	Euclidean distance from response to context embedding.
`method`	`str`	Always `"sgi"`.
`explanation`	`str`	Human-readable interpretation of the score.

Methods:¶

`__post_init__() -> None` ¶

Generate explanation from score if not provided.

Source code in src/groundlens/score.py

def __post_init__(self) -> None:
    """Generate explanation from score if not provided."""
    if not self.explanation:
        if self.value >= SGI_STRONG_PASS:
            expl = f"SGI={self.value:.3f} — strong context engagement (pass)"
        elif self.value >= SGI_REVIEW:
            expl = f"SGI={self.value:.3f} — partial engagement (review recommended)"
        else:
            expl = f"SGI={self.value:.3f} — weak context engagement (flagged)"
        object.__setattr__(self, "explanation", expl)
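
The ratio behind `value`, `q_dist`, and `ctx_dist` can be worked through by hand with small vectors. This sketch assumes a plain `tanh` for the normalization (the page names tanh but not any scaling) and adds a small `eps` guard against a zero denominator, which is an addition of this example rather than documented library behaviour.

```python
import math
from typing import Sequence


def sgi_sketch(
    question_emb: Sequence[float],
    context_emb: Sequence[float],
    response_emb: Sequence[float],
    eps: float = 1e-12,
) -> tuple[float, float, float, float]:
    """Return (value, normalized, q_dist, ctx_dist) for one triple of embeddings."""
    q_dist = math.dist(response_emb, question_emb)
    ctx_dist = math.dist(response_emb, context_emb)
    value = q_dist / max(ctx_dist, eps)  # eps guard is an assumption of this sketch
    normalized = math.tanh(value)  # assumed unscaled tanh
    return value, normalized, q_dist, ctx_dist


# The response sits 5 units from the question and 1 unit from the context.
value, normalized, q_dist, ctx_dist = sgi_sketch(
    question_emb=(0.0, 0.0),
    context_emb=(3.0, 3.0),
    response_emb=(3.0, 4.0),
)
```

A response that moves away from the question and toward the context produces a ratio above 1, which is the direction the documentation labels as stronger context engagement.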

`SGI(model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None)` ¶

Reusable SGI scorer with a pre-configured embedding model.

Use this class when evaluating multiple responses with the same model to avoid repeating the model parameter.

Example

>>> sgi = SGI(model="all-MiniLM-L6-v2")
>>> result = sgi.score(
...     question="What is X?",
...     context="X is Y.",
...     response="X is Y.",
... )
>>> result.flagged
False

Initialize SGI scorer.

Parameters:

Name	Type	Description	Default
`model`	`str`	Sentence transformer model name or path.	`DEFAULT_MODEL`
`encoder`	`EmbeddingFn \| None`	Optional bring-your-own-embeddings callable. When set, scoring bypasses sentence-transformers (no torch required).	`None`

Source code in src/groundlens/sgi.py

def __init__(
    self,
    model: str = DEFAULT_MODEL,
    encoder: EmbeddingFn | None = None,
) -> None:
    """Initialize SGI scorer.

    Args:
        model: Sentence transformer model name or path.
        encoder: Optional bring-your-own-embeddings callable. When set,
            scoring bypasses sentence-transformers (no torch required).
    """
    self.model = model
    self.encoder = encoder

Methods:¶

`score(question: str, context: str, response: str) -> SGIResult` ¶

Compute SGI for a single response.

Parameters:

Name	Type	Description	Default
`question`	`str`	The input query.	required
`context`	`str`	Source document or reference text.	required
`response`	`str`	The LLM output to evaluate.	required

Returns:

Type	Description
`SGIResult`	SGIResult with score and flag status.

Source code in src/groundlens/sgi.py

def score(
    self,
    question: str,
    context: str,
    response: str,
) -> SGIResult:
    """Compute SGI for a single response.

    Args:
        question: The input query.
        context: Source document or reference text.
        response: The LLM output to evaluate.

    Returns:
        SGIResult with score and flag status.
    """
    return compute_sgi(
        question=question,
        context=context,
        response=response,
        model=self.model,
        encoder=self.encoder,
    )

Functions:¶

`get_default_encoder() -> EmbeddingFn | None` ¶

Return the process-global embedding callable, or None if unset.

Returns:

Type	Description
`EmbeddingFn \| None`	The encoder previously set via :func:`set_default_encoder`, or `None`.

Source code in src/groundlens/_internal/embeddings.py

def get_default_encoder() -> EmbeddingFn | None:
    """Return the process-global embedding callable, or ``None`` if unset.

    Returns:
        The encoder previously set via :func:`set_default_encoder`, or ``None``.
    """
    return _custom_encoder

`set_default_encoder(encoder: EmbeddingFn | None) -> None` ¶

Set (or clear) the process-global embedding callable.

When a default encoder is set, every `encode_texts` call that does not receive an explicit `encoder=` argument routes through it, bypassing sentence-transformers entirely (so no torch import is triggered). Pass `None` to clear and restore the sentence-transformers path.

Parameters:

Name	Type	Description	Default
`encoder`	`EmbeddingFn \| None`	A callable taking `list[str]` and returning an `(n, d)` array-like of float embeddings, or `None` to clear.	required

Source code in src/groundlens/_internal/embeddings.py

def set_default_encoder(encoder: EmbeddingFn | None) -> None:
    """Set (or clear) the process-global embedding callable.

    When a default encoder is set, every ``encode_texts`` call that does not
    receive an explicit ``encoder=`` argument routes through it, bypassing
    sentence-transformers entirely (so no torch import is triggered). Pass
    ``None`` to clear and restore the sentence-transformers path.

    Args:
        encoder: A callable taking ``list[str]`` and returning an ``(n, d)``
            array-like of float embeddings, or ``None`` to clear.
    """
    global _custom_encoder
    _custom_encoder = encoder
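
A callable that satisfies the documented contract (a `list[str]` in, an `(n, d)` array-like of floats out) can be written without any ML dependency. The sketch below is a toy: `toy_encoder` is hypothetical, its hash-derived vectors carry no semantic meaning, and whether groundlens accepts nested Python lists as the array-like is an assumption. The import is guarded so the block also runs where groundlens is not installed, and it assumes both functions are importable from the top-level package.

```python
import hashlib


def toy_encoder(texts: list[str], dim: int = 8) -> list[list[float]]:
    """Deterministic placeholder embeddings: one row of `dim` floats per text."""
    rows = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        rows.append([byte / 255.0 for byte in digest[:dim]])
    return rows


try:
    from groundlens import get_default_encoder, set_default_encoder
except ImportError:
    get_default_encoder = set_default_encoder = None

if set_default_encoder is not None:
    previous = get_default_encoder()
    set_default_encoder(toy_encoder)  # route unqualified encode calls through the toy
    try:
        assert get_default_encoder() is toy_encoder
    finally:
        set_default_encoder(previous)  # restore whatever was configured before
```

Since the setting is process-global, restoring the previous value in a `finally` block keeps a test or notebook cell from leaking its encoder into unrelated code.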

`customer_support_rag_rules() -> RuleSet` ¶

Deprecated alias: use :func:`customer_support_rules` (with `rag=True`).

Preserved for one or more releases for backwards compatibility with code written against groundlens 2026.6.11 / 2026.6.12. The returned rule set is byte-for-byte identical to `customer_support_rules(rag=True, domain="general", language="en")` except for the `RuleSet.name` field, which keeps the legacy `"customer_support_rag_v1"` value so existing audit logs continue to match.

Deprecated since version 2026.6.13. Use :func:`customer_support_rules` instead.

Source code in src/groundlens/agents/customer_support.py

def customer_support_rag_rules() -> RuleSet:
    """Deprecated alias — use :func:`customer_support_rules` (with ``rag=True``).

    Preserved for one or more releases for backwards compatibility with
    code written against groundlens 2026.6.11 / 2026.6.12. The returned
    rule set is byte-for-byte identical to
    ``customer_support_rules(rag=True, domain="general", language="en")``
    except for the ``RuleSet.name`` field, which keeps the legacy
    ``"customer_support_rag_v1"`` value so existing audit logs continue to
    match.

    .. deprecated:: 2026.6.13
        Use :func:`customer_support_rules` instead.
    """
    warnings.warn(
        "customer_support_rag_rules() is deprecated; "
        "use customer_support_rules(rag=True) instead. "
        "The legacy alias will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    rs = customer_support_rules(rag=True, domain="general", language="en")
    # Preserve the legacy name so downstream code that asserts on rs.name
    # (e.g. the cookbook notebook's `ruleset.name` check) does not break.
    object.__setattr__(rs, "name", "customer_support_rag_v1")
    return rs
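
The alias pattern above (warn, delegate, then restore the legacy name) can be tried out with stand-in types. In this sketch `MiniRuleSet`, `new_factory`, and `legacy_factory` are hypothetical, and treating the rule set as a frozen dataclass is an assumption inferred from the use of `object.__setattr__` in the source.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniRuleSet:
    """Hypothetical frozen stand-in for a rule set."""

    name: str


def new_factory() -> MiniRuleSet:
    return MiniRuleSet(name="support_v2_general_en_rag")


def legacy_factory() -> MiniRuleSet:
    # stacklevel=2 attributes the warning to the caller, not to this function.
    warnings.warn(
        "legacy_factory() is deprecated; use new_factory() instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    rs = new_factory()
    # A frozen dataclass rejects normal assignment, so bypass it explicitly.
    object.__setattr__(rs, "name", "support_v1")
    return rs


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    legacy = legacy_factory()
```

Callers that want to fail fast on deprecated usage can run their test suite with `-W error::DeprecationWarning`, a standard Python interpreter option.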

`customer_support_rules(rag: bool = True, domain: str = 'general', language: str = 'en') -> RuleSet` ¶

Rule set for customer-support informational agents.

Designed for informational customer-facing assistants. Selects between the RAG and no-RAG sub-score taxonomies and adjusts the stopword / speculative-marker vocabulary to the deployment domain and language.

Parameters:

Name	Type	Description	Default
`rag`	`bool`	Whether the agent retrieves context (FAQ) before answering. `True` (default) — full 7-rule, 3-sub-score set (`groundedness`, `completeness`, `no_overreach`). `False` — 4-rule, 2-sub-score set (`completeness`, `no_overreach`). The three groundedness rules are omitted because there is no context to compare against. The flag predicate adapts.	`True`
`domain`	`str`	Deployment domain. Affects stopwords and speculative-procedure markers; does not add or remove rules. One of: `"general"` (default), `"finance"`, `"healthcare"`, `"legal"`.	`'general'`
`language`	`str`	Deployment language. Affects stopwords, speculative-procedure markers, and the legal-reference regular expression. One of: `"en"` (default), `"es"`, `"multi"`.	`'en'`

Returns:

Type	Description
`RuleSet`	A :class:`RuleSet` whose name encodes the active configuration: `customer_support_v2_{domain}_{language}_{rag\|norag}`.

Raises:

Type	Description
`ValueError`	If `domain` is not in :data:`_VALID_DOMAINS` or `language` is not in :data:`_VALID_LANGUAGES`.

Examples:

Default — FAQ-RAG, general domain, English::

from groundlens.agents import customer_support_rules

rs = customer_support_rules()
result = rs.evaluate(
    question="What is the Bizum daily limit?",
    response="The Bizum daily limit is 1,000 EUR per transaction.",
    context=(
        "The daily Bizum transfer limit is 1,000 EUR per "
        "transaction and 2,000 EUR per day in total."
    ),
)
assert not result.flagged

No-RAG chat in Spanish finance vocabulary::

rs = customer_support_rules(rag=False, domain="finance", language="es")
assert "completeness" in rs.sub_scores
assert "groundedness" not in rs.sub_scores

Source code in src/groundlens/agents/customer_support.py

def customer_support_rules(
    rag: bool = True,
    domain: str = "general",
    language: str = "en",
) -> RuleSet:
    """Rule set for customer-support informational agents.

    Designed for informational customer-facing assistants. Selects between
    the RAG and no-RAG sub-score taxonomies and adjusts the
    stopword / speculative-marker vocabulary to the deployment domain and
    language.

    Args:
        rag: Whether the agent retrieves context (FAQ) before answering.

            - ``True`` (default) — full 7-rule, 3-sub-score set
              (``groundedness``, ``completeness``, ``no_overreach``).
            - ``False`` — 4-rule, 2-sub-score set (``completeness``,
              ``no_overreach``). The three groundedness rules are omitted
              because there is no context to compare against. The flag
              predicate adapts.
        domain: Deployment domain. Affects stopwords and
            speculative-procedure markers; does not add or remove rules.

            One of: ``"general"`` (default), ``"finance"``,
            ``"healthcare"``, ``"legal"``.
        language: Deployment language. Affects stopwords,
            speculative-procedure markers, and the legal-reference
            regular expression.

            One of: ``"en"`` (default), ``"es"``, ``"multi"``.

    Returns:
        A :class:`RuleSet` whose name encodes the active configuration:
        ``customer_support_v2_{domain}_{language}_{rag|norag}``.

    Raises:
        ValueError: If ``domain`` is not in :data:`_VALID_DOMAINS` or
            ``language`` is not in :data:`_VALID_LANGUAGES`.

    Examples:
        Default — FAQ-RAG, general domain, English::

            from groundlens.agents import customer_support_rules

            rs = customer_support_rules()
            result = rs.evaluate(
                question="What is the Bizum daily limit?",
                response="The Bizum daily limit is 1,000 EUR per transaction.",
                context=(
                    "The daily Bizum transfer limit is 1,000 EUR per "
                    "transaction and 2,000 EUR per day in total."
                ),
            )
            assert not result.flagged

        No-RAG chat in Spanish finance vocabulary::

            rs = customer_support_rules(rag=False, domain="finance", language="es")
            assert "completeness" in rs.sub_scores
            assert "groundedness" not in rs.sub_scores
    """
    if domain not in _VALID_DOMAINS:
        msg = (
            f"customer_support_rules(domain={domain!r}) — supported domains are {_VALID_DOMAINS}."
        )
        raise ValueError(msg)
    if language not in _VALID_LANGUAGES:
        msg = (
            f"customer_support_rules(language={language!r}) — supported languages are "
            f"{_VALID_LANGUAGES}."
        )
        raise ValueError(msg)

    stopwords = _build_stopwords(domain=domain, language=language)
    markers = _build_speculative_markers(domain=domain, language=language)
    legal_ref_re = _legal_ref_re(language=language)

    # Bind the domain/language-specific knobs into the check callables.
    proper_nouns_check = partial(_check_no_invented_proper_nouns_impl, stopwords=stopwords)
    legal_refs_check = partial(_check_no_unrequested_legal_refs_impl, legal_ref_re=legal_ref_re)
    speculative_check = partial(_check_no_speculative_procedure_impl, speculative_markers=markers)

    grounded_rules: tuple[ChecklistRule, ...] = (
        ChecklistRule(
            id="csr.no_invented_numbers",
            description="every number in response appears in FAQ or query",
            weight=0.50,
            sub_score="groundedness",
            check=_check_no_invented_numbers,
            citation="Es et al. RAGAs (EACL 2024) §3 Faithfulness — atomic claim verification",
        ),
        ChecklistRule(
            id="csr.no_invented_proper_nouns",
            description="every proper noun in response appears in FAQ",
            weight=0.30,
            sub_score="groundedness",
            check=proper_nouns_check,
            citation="Min et al. FActScore (EMNLP 2023) — atomic factual precision",
        ),
        ChecklistRule(
            id="csr.content_overlaps_faq",
            description="response content overlaps FAQ above threshold",
            weight=0.20,
            sub_score="groundedness",
            check=_check_content_overlaps_faq,
            citation="Marin (2025) SGI arXiv:2512.13771 — surface grounding signal",
        ),
    )
    completeness_rules: tuple[ChecklistRule, ...] = (
        ChecklistRule(
            id="csr.addresses_query_topic",
            description="response addresses the query topic",
            weight=0.70,
            sub_score="completeness",
            check=_check_addresses_query_topic,
            citation="Industry banking RAG evaluation framework — relevance check",
        ),
        ChecklistRule(
            id="csr.uses_concrete_values",
            description="response uses concrete values from FAQ",
            weight=0.30,
            sub_score="completeness",
            check=_check_uses_concrete_values,
            citation="Industry banking RAG evaluation framework — usefulness check",
        ),
    )
    overreach_rules: tuple[ChecklistRule, ...] = (
        ChecklistRule(
            id="csr.no_unrequested_legal_refs",
            description="no legal references in response that are not in FAQ",
            weight=0.60,
            sub_score="no_overreach",
            check=legal_refs_check,
            citation="EU AI Act 2024/1689 Art. 13 — transparency on capabilities and limits",
        ),
        ChecklistRule(
            id="csr.no_speculative_procedure",
            description="no procedural additions not present in FAQ",
            weight=0.40,
            sub_score="no_overreach",
            check=speculative_check,
            citation="Federal Reserve SR 26-2 (Apr 2026) §model output controls",
        ),
    )

    rag_tag = "rag" if rag else "norag"
    name = f"customer_support_v2_{domain}_{language}_{rag_tag}"

    if rag:
        rules = grounded_rules + completeness_rules + overreach_rules
        return RuleSet(
            name=name,
            rules=rules,
            sub_scores=("groundedness", "completeness", "no_overreach"),
            flag_predicate=customer_support_flag_predicate,
        )
    rules = completeness_rules + overreach_rules
    return RuleSet(
        name=name,
        rules=rules,
        sub_scores=("completeness", "no_overreach"),
        flag_predicate=_customer_support_no_rag_flag_predicate,
    )
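
The factory binds domain and language vocabulary into its check callables with `functools.partial`. The following sketch shows that binding step on its own. The four positional arguments follow the `rule.check(question, response, context, meta)` call seen in `RuleSet.evaluate`; the `Evidence` tuple, the marker lists, and the function names are hypothetical, as is the convention that `matched=True` means the rule passed.

```python
from functools import partial
from typing import Any, NamedTuple, Optional


class Evidence(NamedTuple):
    """Hypothetical stand-in for the evidence object a check returns."""

    matched: bool
    span: str
    explanation: str


def _check_no_speculative_impl(
    question: str,
    response: str,
    context: Optional[str],
    meta: dict[str, Any],
    *,
    markers: tuple[str, ...],
) -> Evidence:
    lowered = response.lower()
    for marker in markers:
        if marker in lowered:
            return Evidence(False, marker, f"speculative marker {marker!r} found")
    return Evidence(True, "", "no speculative markers found")


# Illustrative vocabularies, not the library's real marker lists.
MARKERS_BY_LANGUAGE = {
    "en": ("probably", "you might try"),
    "es": ("probablemente", "quizás"),
}


def build_check(language: str):
    """Bind the language-specific markers so the result has the 4-argument shape."""
    return partial(_check_no_speculative_impl, markers=MARKERS_BY_LANGUAGE[language])


check_en = build_check("en")
ok = check_en("How do I reset my PIN?", "Use the app's security menu.", None, {})
bad = check_en("How do I reset my PIN?", "You should probably call the branch.", None, {})
```

Binding the vocabulary once at construction time keeps every rule's `check` signature uniform, so the evaluation loop does not need to know which rules are language-sensitive.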

`rag_rules(domain: str = 'banking') -> RuleSet` ¶

Deprecated dispatcher — use the archetype-named factories directly.

Parameters:

Name	Type	Description	Default
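`domain`	`str`	`"banking"` (default) returns the 20-rule decision-rationale set under its legacy name, via `groundlens_banking_rules()`; the canonical replacement is :func:`groundlens.rules.decision_rationale_rules`. `"customer_support"` returns the 7-rule informational-agent set under its legacy name `"customer_support_rag_v1"`; the canonical replacement is :func:`groundlens.agents.customer_support_rules` with `rag=True`.	`'banking'`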

Returns:

Type	Description
`RuleSet`	The selected :class:`RuleSet`.

Raises:

Type	Description
`ValueError`	If `domain` is not in :data:`_SUPPORTED_DOMAINS`.

Deprecated since version 2026.6.13. Call the canonical factory directly: :func:`groundlens.rules.decision_rationale_rules` for credit / AML / KYC decision rationales, or :func:`groundlens.agents.customer_support_rules` for informational FAQ-RAG agents. The :func:`rag_rules` dispatcher will be removed in a future release.

Source code in src/groundlens/agents/rag.py

def rag_rules(domain: str = "banking") -> RuleSet:
    """Deprecated dispatcher — use the archetype-named factories directly.

    Args:
        domain: ``"banking"`` (default) returns
            :func:`groundlens.rules.decision_rationale_rules` (the 20-rule
            decision-rationale set). ``"customer_support"`` returns
            :func:`groundlens.agents.customer_support_rules` with ``rag=True``
            (the 7-rule informational-agent set).

    Returns:
        The selected :class:`RuleSet`.

    Raises:
        ValueError: If ``domain`` is not in :data:`_SUPPORTED_DOMAINS`.

    .. deprecated:: 2026.6.13
        Call the canonical factory directly:
        :func:`groundlens.rules.decision_rationale_rules` for credit / AML /
        KYC decision rationales, or
        :func:`groundlens.agents.customer_support_rules` for informational
        FAQ-RAG agents. The :func:`rag_rules` dispatcher will be removed in a
        future release.
    """
    if domain not in _SUPPORTED_DOMAINS:
        msg = (
            f"rag_rules(domain={domain!r}) — supported domains are "
            f"{_SUPPORTED_DOMAINS}. The dispatcher is also deprecated; "
            "prefer decision_rationale_rules() or customer_support_rules() "
            "directly."
        )
        raise ValueError(msg)

    if domain == "banking":
        warnings.warn(
            'rag_rules(domain="banking") is deprecated; use '
            'decision_rationale_rules(domain="finance") from groundlens.rules instead. '
            "The dispatcher will be removed in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Return the legacy-named ruleset for backwards compatibility with
        # downstream code that asserts on `rs.name`.
        return groundlens_banking_rules()

    # domain == "customer_support"
    warnings.warn(
        'rag_rules(domain="customer_support") is deprecated; use '
        "customer_support_rules(rag=True) from groundlens.agents instead. "
        "The dispatcher will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Use the legacy alias so the returned RuleSet keeps its legacy name
    # ("customer_support_rag_v1") for backwards compatibility.
    with warnings.catch_warnings():
        # Suppress the inner DeprecationWarning emitted by the legacy alias —
        # the outer one above is the one the caller should see.
        warnings.simplefilter("ignore", DeprecationWarning)
        return customer_support_rag_rules()
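
The dispatcher emits its own warning and then silences the one raised by the inner alias, so the caller sees a single message. That nesting can be checked with the standard `warnings` module alone; `inner_alias` and `dispatcher` below are hypothetical stand-ins for the two deprecated functions.

```python
import warnings


def inner_alias() -> str:
    warnings.warn("inner_alias() is deprecated.", DeprecationWarning, stacklevel=2)
    return "legacy_ruleset"


def dispatcher() -> str:
    warnings.warn("dispatcher() is deprecated.", DeprecationWarning, stacklevel=2)
    with warnings.catch_warnings():
        # Filters changed here are restored when the block exits.
        warnings.simplefilter("ignore", DeprecationWarning)
        return inner_alias()


with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter("always")
    value = dispatcher()

messages = [str(w.message) for w in seen]
```

Only the dispatcher's message is recorded. Note that `warnings.catch_warnings` modifies process-wide state and is documented as not thread-safe, which matters if such a dispatcher is called concurrently.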

`routing_rules(domain: str = 'general') -> RuleSet` ¶

Rule set for routing / intent classification agents.

Returns a 10-rule set across 4 sub-scores: intent_clarity, classification_confidence, fallback_appropriateness, disambiguation_quality. Each rule carries a citation to its academic, industrial, or regulatory source.

Parameters:

Name	Type	Description	Default
`domain`	`str`	Deployment domain. Currently the routing rule set is domain-agnostic by design — the rules check structural properties of routing decisions (single intent, top-1 margin, fallback appropriateness, clarification quality) that hold across verticals. The kwarg is accepted for API symmetry with the other archetype factories and to leave a slot for domain-specific routing extensions in a future release. One of: `"general"` (default), `"finance"`, `"healthcare"`, `"legal"`.	`'general'`

Returns:

Type	Description
`RuleSet`	A :class:`RuleSet` named `"groundlens_routing_v1"`.

Raises:

Type	Description
`ValueError`	If `domain` is not in :data:`_VALID_ROUTING_DOMAINS`.

Example::

from groundlens.agents import routing_rules

rs = routing_rules()
result = rs.evaluate(
    question="transfer 500 to my brother and check my balance",
    response="I will transfer 500 EUR.",
    metadata={
        "predicted_intent": "transfer",
        "top1_score": 0.62,
        "margin": 0.08,
        "fallback_fired": False,
        "query_in_scope": True,
    },
)
assert result.flagged  # low confidence + multi-intent

Source code in src/groundlens/agents/routing.py

def routing_rules(domain: str = "general") -> RuleSet:
    """Rule set for routing / intent classification agents.

    Returns a 10-rule set across 4 sub-scores: intent_clarity,
    classification_confidence, fallback_appropriateness,
    disambiguation_quality. Each rule carries a citation to its
    academic, industrial, or regulatory source.

    Args:
        domain: Deployment domain. Currently the routing rule set is
            domain-agnostic by design — the rules check structural
            properties of routing decisions (single intent, top-1 margin,
            fallback appropriateness, clarification quality) that hold
            across verticals. The kwarg is accepted for API symmetry with
            the other archetype factories and to leave a slot for
            domain-specific routing extensions in a future release.

            One of: ``"general"`` (default), ``"finance"``,
            ``"healthcare"``, ``"legal"``.

    Returns:
        A :class:`RuleSet` named ``"groundlens_routing_v1"``.

    Raises:
        ValueError: If ``domain`` is not in :data:`_VALID_ROUTING_DOMAINS`.

    Example::

        from groundlens.agents import routing_rules

        rs = routing_rules()
        result = rs.evaluate(
            question="transfer 500 to my brother and check my balance",
            response="I will transfer 500 EUR.",
            metadata={
                "predicted_intent": "transfer",
                "top1_score": 0.62,
                "margin": 0.08,
                "fallback_fired": False,
                "query_in_scope": True,
            },
        )
        assert result.flagged  # low confidence + multi-intent
    """
    if domain not in _VALID_ROUTING_DOMAINS:
        msg = f"routing_rules(domain={domain!r}) — supported domains are {_VALID_ROUTING_DOMAINS}."
        raise ValueError(msg)
    rules = (
        # intent_clarity (3 rules, weights 0.4 + 0.3 + 0.3 = 1.0)
        ChecklistRule(
            id="routing.single_intent_signal",
            description="query carries a single intent, not multiple chained operations",
            weight=0.40,
            sub_score="intent_clarity",
            check=check_single_intent_signal,
            citation="Sarikaya et al. (IEEE TASLP 2014) — intent detection in spoken NLU",
        ),
        ChecklistRule(
            id="routing.no_ambiguous_pronoun_lead",
            description="query does not start with a bare pronoun without antecedent",
            weight=0.30,
            sub_score="intent_clarity",
            check=check_no_ambiguous_pronoun_lead,
            citation=(
                "Industry banking routing-agent design pattern (production deployments, 2025)"
            ),
        ),
        ChecklistRule(
            id="routing.intent_shares_query_tokens",
            description="predicted intent shares at least one content token with the query",
            weight=0.30,
            sub_score="intent_clarity",
            check=check_intent_shares_query_tokens,
            citation="Wang et al. (ACL 2020) — intent-slot consistency for joint NLU",
        ),
        # classification_confidence (3 rules, weights 0.4 + 0.3 + 0.3 = 1.0)
        ChecklistRule(
            id="routing.top1_confidence_above_threshold",
            description="top-1 confidence above operational threshold (default 0.7)",
            weight=0.40,
            sub_score="classification_confidence",
            check=check_top1_confidence_above_threshold,
            citation="Guo et al. (ICML 2017) — on calibration of modern neural networks",
        ),
        ChecklistRule(
            id="routing.margin_to_runner_up",
            description="margin between top-1 and top-2 above floor (default 0.15)",
            weight=0.30,
            sub_score="classification_confidence",
            check=check_margin_to_runner_up,
            citation="Industry banking routing-agent evaluation — top-1 to top-2 margin metric",
        ),
        ChecklistRule(
            id="routing.intent_in_allowed_set",
            description="predicted intent belongs to the configured allowed set",
            weight=0.30,
            sub_score="classification_confidence",
            check=check_intent_in_allowed_set,
            citation="Hendrycks & Gimpel (ICLR 2017) — out-of-distribution detection",
        ),
        # fallback_appropriateness (2 rules, weights 0.6 + 0.4 = 1.0)
        ChecklistRule(
            id="routing.fallback_when_out_of_scope",
            description="if fallback fired, the query is actually out of scope",
            weight=0.60,
            sub_score="fallback_appropriateness",
            check=check_fallback_when_out_of_scope,
            citation="Industry banking RAG evaluation framework — fallback necessity check",
        ),
        ChecklistRule(
            id="routing.no_silent_fallback",
            description="fallback responses explain the limit instead of being silent",
            weight=0.40,
            sub_score="fallback_appropriateness",
            check=check_no_silent_fallback,
            citation="NIST AI RMF 1.0 (2023) §Govern 5 — transparency to affected parties",
        ),
        # disambiguation_quality (2 rules, weights 0.6 + 0.4 = 1.0)
        ChecklistRule(
            id="routing.clarify_when_ambiguous",
            description="low-margin cases trigger clarification rather than silent routing",
            weight=0.60,
            sub_score="disambiguation_quality",
            check=check_clarify_when_ambiguous,
            citation="Rao & Daumé III (ACL 2018) — learning to ask good questions",
        ),
        ChecklistRule(
            id="routing.specific_clarify_question",
            description="clarify question references the two candidate intents specifically",
            weight=0.40,
            sub_score="disambiguation_quality",
            check=check_specific_clarify_question,
            citation="De Vries et al. (ACL 2018) — task-oriented dialogue clarification",
        ),
    )

    return RuleSet(
        name="groundlens_routing_v1",
        rules=rules,
        sub_scores=(
            "intent_clarity",
            "classification_confidence",
            "fallback_appropriateness",
            "disambiguation_quality",
        ),
        flag_predicate=routing_flag_predicate,
    )

`specialized_agent_rules(domain: str = 'general', tools: tuple[str, ...] = ()) -> RuleSet` ¶

Rule set for specialized / tool-using agents.

Returns a 9-rule set across 4 sub-scores: entity_groundedness (3 rules), entity_completeness (2), entity_calibration (1), execution_readiness (3).

The flag predicate is stricter than for RAG agents because specialized agents execute irreversible operations (move money, open accounts, send messages on behalf of the customer).

Parameters:

Name	Type	Description	Default
`domain`	`str`	Deployment domain. Today this kwarg is accepted for API symmetry with the other archetype factories; the bundled rules check structural properties (entity groundedness, schema completeness, execution readiness) that hold across verticals. Reserved for domain-specific entity validators in a future release. One of: `"general"` (default), `"finance"`, `"healthcare"`, `"legal"`.	`'general'`
`tools`	`tuple[str, ...]`	Optional tuple of validator keys. Today the bundled rule set ships IBAN, amount, and card-number checks unconditionally — they abstain when the corresponding metadata field is absent. The kwarg is reserved for future releases that will let deployments opt in to additional domain-specific validators (e.g. NPI for healthcare, DNI/NIE for Spain). Currently a non-empty value is validated against :data:`_VALID_SPECIALIZED_TOOLS` but has no behavioural effect.	`()`

Returns:

Type	Description
`RuleSet`	A `RuleSet` named `"groundlens_specialized_v1"`.

Raises:

Type	Description
`ValueError`	If `domain` is not in :data:`_VALID_SPECIALIZED_DOMAINS` or any of `tools` is not in :data:`_VALID_SPECIALIZED_TOOLS`.

Example::

from groundlens.agents import specialized_agent_rules

rs = specialized_agent_rules()
result = rs.evaluate(
    question="send 500 to my brother",
    response="OK, I'll send 500 EUR to IBAN ES12...",
    metadata={
        "dialog": "send 500 to my brother. yes go ahead.",
        "entities": {"amount": 500, "iban": "ES1234567890123456789012"},
        "required_entities": ["amount", "iban"],
        "confirmed": True,
        "operation": "wire_transfer",
    },
)

Source code in src/groundlens/agents/specialized.py

def specialized_agent_rules(
    domain: str = "general",
    tools: tuple[str, ...] = (),
) -> RuleSet:
    """Rule set for specialized / tool-using agents.

    Returns a 9-rule set across 4 sub-scores: entity_groundedness,
    entity_completeness, entity_calibration, execution_readiness.

    The flag predicate is stricter than for RAG agents because
    specialized agents execute irreversible operations (move money,
    open accounts, send messages on behalf of the customer).

    Args:
        domain: Deployment domain. Today this kwarg is accepted for API
            symmetry with the other archetype factories; the bundled
            rules check structural properties (entity groundedness,
            schema completeness, execution readiness) that hold across
            verticals. Reserved for domain-specific entity validators in
            a future release.

            One of: ``"general"`` (default), ``"finance"``,
            ``"healthcare"``, ``"legal"``.
        tools: Optional tuple of validator keys. Today the bundled rule
            set ships IBAN, amount, and card-number checks
            unconditionally — they abstain when the corresponding
            metadata field is absent. The kwarg is reserved for future
            releases that will let deployments opt in to additional
            domain-specific validators (e.g. NPI for healthcare,
            DNI/NIE for Spain). Currently a non-empty value is validated
            against :data:`_VALID_SPECIALIZED_TOOLS` but has no
            behavioural effect.

    Returns:
        A :class:`RuleSet` named ``"groundlens_specialized_v1"``.

    Raises:
        ValueError: If ``domain`` is not in
            :data:`_VALID_SPECIALIZED_DOMAINS` or any of ``tools`` is not
            in :data:`_VALID_SPECIALIZED_TOOLS`.

    Example::

        from groundlens.agents import specialized_agent_rules

        rs = specialized_agent_rules()
        result = rs.evaluate(
            question="send 500 to my brother",
            response="OK, I'll send 500 EUR to IBAN ES12...",
            metadata={
                "dialog": "send 500 to my brother. yes go ahead.",
                "entities": {"amount": 500, "iban": "ES1234567890123456789012"},
                "required_entities": ["amount", "iban"],
                "confirmed": True,
                "operation": "wire_transfer",
            },
        )
    """
    if domain not in _VALID_SPECIALIZED_DOMAINS:
        msg = (
            f"specialized_agent_rules(domain={domain!r}) — supported domains are "
            f"{_VALID_SPECIALIZED_DOMAINS}."
        )
        raise ValueError(msg)
    unknown_tools = tuple(t for t in tools if t not in _VALID_SPECIALIZED_TOOLS)
    if unknown_tools:
        msg = (
            f"specialized_agent_rules(tools={tools!r}) — unknown tools "
            f"{unknown_tools}. Known tools: {_VALID_SPECIALIZED_TOOLS}."
        )
        raise ValueError(msg)
    rules = (
        # entity_groundedness (3 rules, weights 0.5 + 0.3 + 0.2 = 1.0)
        ChecklistRule(
            id="specialized.entities_in_dialog",
            description="each captured entity appears verbatim in the dialogue",
            weight=0.50,
            sub_score="entity_groundedness",
            check=check_entities_in_dialog,
            citation="Industry banking conversational-AI evaluation — entity hallucination metric",
        ),
        ChecklistRule(
            id="specialized.iban_format_valid",
            description="captured IBANs pass ISO 13616 mod-97 verification",
            weight=0.30,
            sub_score="entity_groundedness",
            check=check_iban_format_valid,
            citation="ISO 13616:2020 — International Bank Account Number (IBAN)",
        ),
        ChecklistRule(
            id="specialized.amounts_parseable",
            description="captured amount entities parse as numbers",
            weight=0.20,
            sub_score="entity_groundedness",
            check=check_amounts_parseable,
            citation=(
                "EBA Guidelines on the security of internet payments (2019) "
                "§Transaction Authentication — exact-amount confirmation"
            ),
        ),
        # entity_completeness (2 rules, weights 0.6 + 0.4 = 1.0)
        ChecklistRule(
            id="specialized.required_entities_present",
            description="all entities required by the operation schema are captured",
            weight=0.60,
            sub_score="entity_completeness",
            check=check_required_entities_present,
            citation="Evans (2003) Domain-Driven Design — aggregate root invariants",
        ),
        ChecklistRule(
            id="specialized.no_partial_fields",
            description="no required entity is partially filled or truncated",
            weight=0.40,
            sub_score="entity_completeness",
            check=check_no_partial_fields,
            citation="Wang & Strong (1996) — beyond accuracy: data quality dimensions",
        ),
        # entity_calibration (1 rule, weight 1.0)
        ChecklistRule(
            id="specialized.no_phantom_entities",
            description="no captured entity is outside the operation schema",
            weight=1.00,
            sub_score="entity_calibration",
            check=check_no_phantom_entities,
            citation="Industry banking conversational-AI evaluation — precision of empty entities",
        ),
        # execution_readiness (3 rules, weights 0.4 + 0.3 + 0.3 = 1.0)
        ChecklistRule(
            id="specialized.explicit_confirmation",
            description="dialogue contains an explicit user confirmation before execution",
            weight=0.40,
            sub_score="execution_readiness",
            check=check_explicit_confirmation,
            citation=(
                "EBA Guidelines on the security of internet payments (2019) §27 "
                "— Transaction Authentication"
            ),
        ),
        ChecklistRule(
            id="specialized.eoc_when_complete",
            description="EOC signaled only after the operation is complete",
            weight=0.30,
            sub_score="execution_readiness",
            check=check_eoc_when_complete,
            citation=(
                "Industry banking conversational-AI evaluation — "
                "end-of-conversation detection rate"
            ),
        ),
        ChecklistRule(
            id="specialized.no_pre_execution_claim",
            description="response does not claim execution before user confirmation",
            weight=0.30,
            sub_score="execution_readiness",
            check=check_no_pre_execution_claim,
            citation=(
                "Federal Reserve SR 26-2 (Apr 2026) — Model Risk Management; model output controls"
            ),
        ),
    )

    return RuleSet(
        name="groundlens_specialized_v1",
        rules=rules,
        sub_scores=(
            "entity_groundedness",
            "entity_completeness",
            "entity_calibration",
            "execution_readiness",
        ),
        flag_predicate=specialized_flag_predicate,
    )
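The `specialized.iban_format_valid` rule above is described as an ISO 13616 mod-97 verification. The body of `check_iban_format_valid` is not shown on this page, so the following is only a minimal sketch of the standard mod-97 procedure; the function name `iban_mod97_ok` is hypothetical and not part of groundlens.

```python
def iban_mod97_ok(iban: str) -> bool:
    """Return True when `iban` passes the mod-97 check (illustrative helper)."""
    compact = iban.replace(" ", "").upper()
    # Only ASCII letters and digits are allowed; the standard caps length at 34.
    if not (compact.isascii() and compact.isalnum()) or not 5 <= len(compact) <= 34:
        return False
    # Two-letter country code followed by two check digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A well-formed IBAN leaves remainder 1 when divided by 97.
    return int(digits) % 97 == 1


print(iban_mod97_ok("GB82 WEST 1234 5698 7654 32"))
print(iban_mod97_ok("GB82 WEST 1234 5698 7654 33"))
```

The sketch checks only the checksum and basic shape. Per-country length and BBAN layout rules are out of scope here.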

`fit_thresholds(examples: list[Mapping[str, object]], *, model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None, reference_csv: str | None = None) -> ThresholdFit` ¶

Fit SGI/DGI decision thresholds on a labeled set via Youden's J.

For each example this computes DGI (and SGI when a context is present), then picks each threshold by maximizing Youden's J for the rule "value >= threshold implies grounded".
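The selection rule described above can be sketched in a few lines. The private helper `_youden_threshold` is not shown on this page, so the function below (`youden_threshold_sketch`, a hypothetical name) is an assumption about how such a search could work: it tries every observed value as a candidate threshold and keeps the one that maximizes J = TPR − FPR, where "positive" means "predicted grounded because value >= threshold".

```python
def youden_threshold_sketch(grounded, hallucinated):
    """Pick the threshold maximizing Youden's J for 'value >= t implies grounded'.

    Illustrative only; tie-breaking and candidate selection in groundlens
    may differ.
    """
    candidates = sorted(set(grounded) | set(hallucinated))
    best_t, best_j = candidates[0], float("-inf")
    for t in candidates:
        # True-positive rate: grounded examples at or above the threshold.
        tpr = sum(v >= t for v in grounded) / len(grounded)
        # False-positive rate: hallucinated examples at or above the threshold.
        fpr = sum(v >= t for v in hallucinated) / len(hallucinated)
        j = tpr - fpr
        if j > best_j:  # the first (lowest) maximizer wins on ties
            best_t, best_j = t, j
    return best_t, best_j


threshold, j = youden_threshold_sketch(
    grounded=[0.6, 0.7, 0.8],
    hallucinated=[0.1, 0.2, 0.3],
)
print(threshold, j)  # 0.6 1.0
```

With perfectly separated scores, as in this toy input, J reaches 1.0 at the lowest grounded value.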

Parameters:

Name	Type	Description	Default
`examples`	`list[Mapping[str, object]]`	A list of mappings, each with keys `question` (str), `response` (str), `label` (int: `1` = ungrounded / hallucinated, `0` = grounded), and optional `context` (str).	required
`model`	`str`	Sentence transformer model name.	`DEFAULT_MODEL`
`encoder`	`EmbeddingFn \| None`	Optional bring-your-own-embeddings callable. Passed through to `compute_dgi` / `compute_sgi` so fitting works without torch.	`None`
`reference_csv`	`str \| None`	Optional DGI calibration CSV passed to `compute_dgi`.	`None`

Returns:

Type	Description
`ThresholdFit`	A `ThresholdFit` with the fitted `dgi_pass` and, when both grounded and ungrounded examples carry a `context`, the fitted `sgi_review` threshold (otherwise `None`).

Raises:

Type	Description
`ValueError`	If `examples` is empty, or if either class (grounded or ungrounded) is missing from the labels.

Example

>>> fit = fit_thresholds(
...     [
...         {"question": "Q1?", "response": "A1.", "label": 0},
...         {"question": "Q2?", "response": "off-topic", "label": 1},
...     ]
... )
>>> fit.metric
'youden_j'

Source code in src/groundlens/calibrate.py

def fit_thresholds(
    examples: list[Mapping[str, object]],
    *,
    model: str = DEFAULT_MODEL,
    encoder: EmbeddingFn | None = None,
    reference_csv: str | None = None,
) -> ThresholdFit:
    """Fit SGI/DGI decision thresholds on a labeled set via Youden's J.

    For each example this computes DGI (and SGI when a ``context`` is
    present), then picks each threshold by maximizing Youden's J for the
    rule "value >= threshold implies grounded".

    Args:
        examples: A list of mappings, each with keys ``question`` (str),
            ``response`` (str), ``label`` (int: ``1`` = ungrounded /
            hallucinated, ``0`` = grounded), and optional ``context`` (str).
        model: Sentence transformer model name.
        encoder: Optional bring-your-own-embeddings callable. Passed through
            to ``compute_dgi`` / ``compute_sgi`` so fitting works without
            torch.
        reference_csv: Optional DGI calibration CSV passed to ``compute_dgi``.

    Returns:
        A :class:`ThresholdFit` with the fitted ``dgi_pass`` and, when both
        grounded and ungrounded examples carry a ``context``, the fitted
        ``sgi_review`` threshold (otherwise ``None``).

    Raises:
        ValueError: If ``examples`` is empty, or if either class (grounded
            or ungrounded) is missing from the labels.

    Example:
        >>> fit = fit_thresholds(
        ...     [
        ...         {"question": "Q1?", "response": "A1.", "label": 0},
        ...         {"question": "Q2?", "response": "off-topic", "label": 1},
        ...     ]
        ... )
        >>> fit.metric
        'youden_j'
    """
    from groundlens.dgi import compute_dgi
    from groundlens.sgi import compute_sgi

    if not examples:
        msg = "examples must contain at least one item."
        raise ValueError(msg)

    labels = [int(ex["label"]) for ex in examples]  # type: ignore[call-overload]
    if 0 not in labels or 1 not in labels:
        msg = (
            "fit_thresholds requires both classes present: at least one "
            "grounded (label=0) and one ungrounded (label=1) example."
        )
        raise ValueError(msg)

    dgi_grounded: list[float] = []
    dgi_hallucinated: list[float] = []
    sgi_grounded: list[float] = []
    sgi_hallucinated: list[float] = []

    for ex in examples:
        question = str(ex["question"])
        response = str(ex["response"])
        label = int(ex["label"])  # type: ignore[call-overload]

        dgi = compute_dgi(
            question,
            response,
            model=model,
            reference_csv=reference_csv,
            encoder=encoder,
        )
        (dgi_hallucinated if label == 1 else dgi_grounded).append(dgi.value)

        context = ex.get("context")
        if context:
            sgi = compute_sgi(
                question,
                str(context),
                response,
                model=model,
                encoder=encoder,
            )
            (sgi_hallucinated if label == 1 else sgi_grounded).append(sgi.value)

    dgi_pass: float | None = None
    if dgi_grounded and dgi_hallucinated:
        dgi_pass = _youden_threshold(dgi_grounded, dgi_hallucinated)

    sgi_review: float | None = None
    if sgi_grounded and sgi_hallucinated:
        sgi_review = _youden_threshold(sgi_grounded, sgi_hallucinated)

    return ThresholdFit(
        sgi_review=sgi_review,
        dgi_pass=dgi_pass,
        n=len(examples),
        model=model,
    )

`compute_dgi(question: str, response: str, *, model: str = DEFAULT_MODEL, reference_csv: str | None = None, encoder: EmbeddingFn | None = None) -> DGIResult` ¶

Compute the Directional Grounding Index for a response.

Parameters:

Name	Type	Description	Default
`question`	`str`	The input query.	required
`response`	`str`	The LLM output to evaluate.	required
`model`	`str`	Sentence transformer model name.	`DEFAULT_MODEL`
`reference_csv`	`str \| None`	Path to domain-specific calibration CSV. If `None`, uses the bundled dataset.	`None`
`encoder`	`EmbeddingFn \| None`	Optional bring-your-own-embeddings callable taking `list[str]` and returning an `(n, d)` array. Bypasses sentence-transformers (no torch required) when provided.	`None`

Returns:

Type	Description
`DGIResult`	DGIResult with raw score, normalized score, and flag status.

Raises:

Type	Description
`ValueError`	If question or response is empty.

Example

>>> from groundlens import compute_dgi
>>> result = compute_dgi(
...     question="What causes seasons on Earth?",
...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
... )
>>> result.flagged
False

Source code in src/groundlens/dgi.py

def compute_dgi(
    question: str,
    response: str,
    *,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
    encoder: EmbeddingFn | None = None,
) -> DGIResult:
    """Compute the Directional Grounding Index for a response.

    Args:
        question: The input query.
        response: The LLM output to evaluate.
        model: Sentence transformer model name.
        reference_csv: Path to domain-specific calibration CSV.
            If ``None``, uses the bundled dataset.
        encoder: Optional bring-your-own-embeddings callable taking
            ``list[str]`` and returning an ``(n, d)`` array. Bypasses
            sentence-transformers (no torch required) when provided.

    Returns:
        DGIResult with raw score, normalized score, and flag status.

    Raises:
        ValueError: If question or response is empty.

    Example:
        >>> from groundlens import compute_dgi
        >>> result = compute_dgi(
        ...     question="What causes seasons on Earth?",
        ...     response="Seasons are caused by Earth's 23.5-degree axial tilt.",
        ... )
        >>> result.flagged
        False
    """
    if not question.strip():
        msg = "question must be a non-empty string."
        raise ValueError(msg)
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    if (encoder is not None or model != DEFAULT_MODEL) and reference_csv is None:
        _warn_default_thresholds_with_custom_encoder("compute_dgi", model, encoder is not None)

    mu_hat = _get_mu_hat(model, reference_csv, encoder=encoder)
    embeddings = encode_texts([question, response], model_name=model, encoder=encoder)
    q_emb, r_emb = embeddings[0], embeddings[1]

    delta = displacement_vector(q_emb, r_emb)
    magnitude = float(np.linalg.norm(delta))

    # Degenerate case: response identical to question.
    if magnitude < 1e-8:
        return DGIResult(value=0.0, normalized=0.0, flagged=True)

    delta_hat = delta / magnitude
    gamma = float(np.dot(delta_hat, mu_hat))

    if math.isnan(gamma):
        logger.warning("DGI produced NaN — check embedding dimensions.")
        return DGIResult(value=0.0, normalized=0.0, flagged=True)

    normalized = round(normalize_dgi(gamma), 4)

    return DGIResult(
        value=round(gamma, 4),
        normalized=normalized,
        flagged=gamma < DGI_PASS,
    )
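The geometric core of the listing above is short: a unit-normalized displacement from question to response, projected onto a reference direction. The sketch below reproduces that step in pure Python on toy 2-D vectors. It assumes `displacement_vector` is simply "response minus question" and that `mu_hat` is already unit-length; neither helper is shown on this page, and the name `directional_score` is hypothetical.

```python
import math


def directional_score(q_emb, r_emb, mu_hat, eps=1e-8):
    """Cosine between the question-to-response displacement and mu_hat.

    Returns None in the degenerate case where the two embeddings coincide
    (the library listing returns a flagged result there instead).
    """
    # Assumed displacement: response embedding minus question embedding.
    delta = [r - q for q, r in zip(q_emb, r_emb)]
    magnitude = math.sqrt(sum(d * d for d in delta))
    if magnitude < eps:
        return None
    # Project the unit displacement onto the reference direction.
    return sum((d / magnitude) * m for d, m in zip(delta, mu_hat))


mu_hat = [0.0, 1.0]  # toy reference direction, already unit-length
print(directional_score([1.0, 0.0], [1.0, 1.0], mu_hat))   # aligned with mu_hat
print(directional_score([1.0, 0.0], [1.0, -1.0], mu_hat))  # opposed to mu_hat
print(directional_score([1.0, 0.0], [1.0, 0.0], mu_hat))   # degenerate
```

A displacement aligned with the reference direction scores near 1 and an opposed one near -1. In the listing, the response is flagged when this value falls below `DGI_PASS`.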

`evaluate_batch(items: list[dict[str, str]], *, model: str = DEFAULT_MODEL, reference_csv: str | None = None) -> list[GroundlensScore]` ¶

Evaluate a batch of LLM responses.

Each item in the list is a dict with keys:

question (required)
response (required)
context (optional — triggers SGI when present)

Parameters:

Name	Type	Description	Default
`items`	`list[dict[str, str]]`	List of dicts, each containing question, response, and optionally context.	required
`model`	`str`	Sentence transformer model name.	`DEFAULT_MODEL`
`reference_csv`	`str \| None`	DGI calibration CSV path.	`None`

Returns:

Type	Description
`list[GroundlensScore]`	List of GroundlensScore results, one per input item.

Raises:

Type	Description
`KeyError`	If any item is missing `question` or `response`.

Example

>>> from groundlens import evaluate_batch
>>> items = [
...     {"question": "Q1?", "response": "A1.", "context": "C1."},
...     {"question": "Q2?", "response": "A2."},
... ]
>>> results = evaluate_batch(items)
>>> len(results)
2

Source code in src/groundlens/evaluate.py

def evaluate_batch(
    items: list[dict[str, str]],
    *,
    model: str = DEFAULT_MODEL,
    reference_csv: str | None = None,
) -> list[GroundlensScore]:
    """Evaluate a batch of LLM responses.

    Each item in the list is a dict with keys:
        - ``question`` (required)
        - ``response`` (required)
        - ``context`` (optional — triggers SGI when present)

    Args:
        items: List of dicts, each containing question, response, and
            optionally context.
        model: Sentence transformer model name.
        reference_csv: DGI calibration CSV path.

    Returns:
        List of GroundlensScore results, one per input item.

    Raises:
        KeyError: If any item is missing ``question`` or ``response``.

    Example:
        >>> from groundlens import evaluate_batch
        >>> items = [
        ...     {"question": "Q1?", "response": "A1.", "context": "C1."},
        ...     {"question": "Q2?", "response": "A2."},
        ... ]
        >>> results = evaluate_batch(items)
        >>> len(results)
        2
    """
    results: list[GroundlensScore] = []

    for i, item in enumerate(items):
        if "question" not in item:
            msg = f"Item {i} missing required key 'question'."
            raise KeyError(msg)
        if "response" not in item:
            msg = f"Item {i} missing required key 'response'."
            raise KeyError(msg)

        score = evaluate(
            question=item["question"],
            response=item["response"],
            context=item.get("context"),
            model=model,
            reference_csv=reference_csv,
        )
        results.append(score)

    logger.info(
        "Evaluated %d items (%d flagged).", len(results), sum(1 for r in results if r.flagged)
    )

    return results
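In the listing above, key validation happens inside the scoring loop, so a malformed item at position *i* raises `KeyError` only after items before it have already been scored. A caller who wants to see every problem up front could run a cheap pre-check first. The helper below is a hypothetical sketch, not part of groundlens.

```python
def find_incomplete_items(items):
    """Return (index, missing_keys) pairs for items lacking required keys."""
    required = ("question", "response")
    problems = []
    for i, item in enumerate(items):
        missing = [key for key in required if key not in item]
        if missing:
            problems.append((i, missing))
    return problems


batch = [
    {"question": "Q1?", "response": "A1.", "context": "C1."},
    {"question": "Q2?"},  # no response
    {},                   # nothing at all
]
print(find_incomplete_items(batch))
# [(1, ['response']), (2, ['question', 'response'])]
```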

`banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet` ¶

Curated ruleset for regulated banking governance decisions.

The rules cover the three sub-scores that an auditor or compliance officer typically inspects in a deferral or escalation rationale:

Specificity (spec): does the rationale cite the case parameters that triggered the decision? Flags, risk score, numeric thresholds, gates, completeness, jurisdictional details, sufficient length, and specificity-marking language.
Explanatory linkage (expl): does the rationale link the case facts to the decision? Conditional structure, pending actions, causal connectives, epistemic limits, domain references, modal verbs, length, and temporal ordering.
Boundary shift (bshift): does the rationale state what would change the decision? Conditional approval pathways, information requests, risk-reduction proposals, alternative framings, threshold references, and length.

The default quality_floor=0.3 follows the cosmetic-deadlock threshold introduced in the financial-decisions governance literature. A response that falls below this floor on either spec or expl is flagged as audit-deficient even if the geometric SGI/DGI score looks acceptable in isolation — a structurally typical "false negative" of embedding-based detection.

Parameters:

Name	Type	Description	Default
`quality_floor`	`float`	Threshold below which a sub-score triggers the cosmetic-deadlock flag. Tune per deployment risk tolerance.	`_DEFAULT_QUALITY_FLOOR`

Returns:

Type	Description
`RuleSet`	A `RuleSet` named `"banking_v1"`.

Source code in src/groundlens/rules.py

def banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet:
    """Curated ruleset for regulated banking governance decisions.

    The rules cover the three sub-scores that an auditor or compliance
    officer typically inspects in a deferral or escalation rationale:

    - **Specificity (spec):** does the rationale cite the case parameters
      that triggered the decision? Flags, risk score, numeric thresholds,
      gates, completeness, jurisdictional details, sufficient length, and
      specificity-marking language.
    - **Explanatory linkage (expl):** does the rationale link the case
      facts to the decision? Conditional structure, pending actions, causal
      connectives, epistemic limits, domain references, modal verbs,
      length, and temporal ordering.
    - **Boundary shift (bshift):** does the rationale state what would
      change the decision? Conditional approval pathways, information
      requests, risk-reduction proposals, alternative framings, threshold
      references, and length.

    The default ``quality_floor=0.3`` follows the cosmetic-deadlock
    threshold introduced in the financial-decisions governance literature.
    A response that falls below this floor on either ``spec`` or ``expl``
    is flagged as audit-deficient even if the geometric SGI/DGI score
    looks acceptable in isolation — a structurally typical "false
    negative" of embedding-based detection.

    Args:
        quality_floor: Threshold below which a sub-score triggers the
            cosmetic-deadlock flag. Tune per deployment risk tolerance.

    Returns:
        A :class:`RuleSet` named ``"banking_v1"``.
    """
    rules: tuple[ChecklistRule, ...] = (
        # Specificity sub-rules
        ChecklistRule("spec.reg_flag", "regulatory flag", 0.20, "spec", _check_regulatory_flag),
        ChecklistRule("spec.risk_ref", "risk reference", 0.15, "spec", _check_risk_reference),
        ChecklistRule("spec.numeric", "numeric value", 0.10, "spec", _check_numeric_value),
        ChecklistRule("spec.gate", "gate / threshold", 0.10, "spec", _check_gate_name),
        ChecklistRule("spec.info_gap", "information gap", 0.15, "spec", _check_information_gap),
        ChecklistRule(
            "spec.case_detail", "case-specific detail", 0.10, "spec", _check_case_specific_detail
        ),
        ChecklistRule(
            "spec.length", "substantive length", 0.10, "spec", _check_substantive_length
        ),
        ChecklistRule(
            "spec.spec_language",
            "specificity language",
            0.10,
            "spec",
            _check_specificity_language,
        ),
        # Explanatory linkage sub-rules
        ChecklistRule(
            "expl.conditional", "conditional structure", 0.20, "expl", _check_conditional_structure
        ),
        ChecklistRule("expl.pending", "pending action", 0.15, "expl", _check_pending_action),
        ChecklistRule("expl.causal", "causal connective", 0.15, "expl", _check_causal_connective),
        ChecklistRule(
            "expl.epistemic", "epistemic limitation", 0.15, "expl", _check_epistemic_limit
        ),
        ChecklistRule("expl.domain", "domain reference", 0.10, "expl", _check_domain_reference),
        ChecklistRule("expl.modal", "modal verb", 0.10, "expl", _check_modal_verb),
        ChecklistRule("expl.length", "minimum length", 0.10, "expl", _check_minimum_length),
        ChecklistRule(
            "expl.temporal", "temporal ordering", 0.05, "expl", _check_temporal_ordering
        ),
        # Boundary shift sub-rules
        ChecklistRule(
            "bshift.cond_approval",
            "conditional approval",
            0.25,
            "bshift",
            _check_conditional_approval,
        ),
        ChecklistRule(
            "bshift.info_request",
            "information request",
            0.20,
            "bshift",
            _check_information_request,
        ),
        ChecklistRule(
            "bshift.risk_reduction", "risk reduction", 0.15, "bshift", _check_risk_reduction
        ),
        ChecklistRule(
            "bshift.alternative", "alternative framing", 0.10, "bshift", _check_alternative_framing
        ),
        ChecklistRule(
            "bshift.threshold_ref",
            "threshold reference",
            0.10,
            "bshift",
            _check_threshold_reference,
        ),
        ChecklistRule(
            "bshift.length", "resolution-path length", 0.05, "bshift", _check_resolution_length
        ),
    )
    return RuleSet(name="banking_v1", rules=rules, quality_floor=quality_floor)
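A quick way to audit a rule table like the one above is to total the weights per sub-score. The sketch below copies the rule ids and weights from the listing into plain tuples (it does not import groundlens) and derives the sub-score from the id prefix. How `RuleSet` aggregates or normalizes weights is not shown on this page, so the totals are reported without assuming what they ought to be.

```python
from collections import defaultdict

# (rule id, weight) pairs copied from the banking_rules listing.
BANKING_V1_WEIGHTS = [
    ("spec.reg_flag", 0.20), ("spec.risk_ref", 0.15), ("spec.numeric", 0.10),
    ("spec.gate", 0.10), ("spec.info_gap", 0.15), ("spec.case_detail", 0.10),
    ("spec.length", 0.10), ("spec.spec_language", 0.10),
    ("expl.conditional", 0.20), ("expl.pending", 0.15), ("expl.causal", 0.15),
    ("expl.epistemic", 0.15), ("expl.domain", 0.10), ("expl.modal", 0.10),
    ("expl.length", 0.10), ("expl.temporal", 0.05),
    ("bshift.cond_approval", 0.25), ("bshift.info_request", 0.20),
    ("bshift.risk_reduction", 0.15), ("bshift.alternative", 0.10),
    ("bshift.threshold_ref", 0.10), ("bshift.length", 0.05),
]


def weight_totals(pairs):
    """Sum weights per sub-score, taking the sub-score from the id prefix."""
    totals = defaultdict(float)
    for rule_id, weight in pairs:
        totals[rule_id.split(".", 1)[0]] += weight
    # Round to two decimals to hide floating-point noise.
    return {name: round(total, 2) for name, total in totals.items()}


print(weight_totals(BANKING_V1_WEIGHTS))
# {'spec': 1.0, 'expl': 1.0, 'bshift': 0.85}
```

As listed, `spec` and `expl` each total 1.0 while `bshift` totals 0.85, which is worth keeping in mind when comparing sub-scores against a shared `quality_floor`.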

`decision_rationale_rules(domain: str = 'finance', regulations: tuple[str, ...] = (), quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet` ¶

Rule set for decision-rationale agents (credit / AML / KYC / sanctions).

Canonical factory for the 20-rule, 5-sub-score decision-rationale rule set. Replaces `groundlens_banking_rules` under the archetype-as-function naming convention introduced in ADR 0001 (release 2026.6.13).

Parameters:

Name	Type	Description	Default
`domain`	`str`	Deployment domain. Currently only `"finance"` (default) is supported; calling with any other value raises `ValueError` so the caller knows the verticalization is not yet shipped. Insurance, healthcare, and legal vertical decision-rationale sets are on the roadmap.	`'finance'`
`regulations`	`tuple[str, ...]`	Optional tuple of regulation keys. Intended behaviour: when non-empty, `audit_explanation` lines whose rule citation does not mention any of the requested regulations are suppressed from the rendered audit text; rules are never added or removed. Valid keys include: `"eu_ai_act"`, `"sr_26_2"`, `"sr_11_7"`, `"nist_ai_600_1"`, `"nist_ai_rmf"`, `"iso_42001"`, `"ecb_internal_models"`, `"eba_gl_2020_06"`, `"pra_ss1_23"`, `"hipaa"`, `"gdpr"`. Implementation note (2026.6.13): the kwarg is accepted and validated (unknown keys raise `ValueError`), but provenance-filtered rendering of `audit_explanation` is not yet active and will land in a follow-up release. Until then the audit text and the rule set are returned unchanged, and a `UserWarning` is emitted when the kwarg is non-empty.	`()`
`quality_floor`	`float`	Threshold below which a sub-score triggers the cosmetic-deadlock flag. Kept for compatibility with the legacy `banking_rules()` signature.	`_DEFAULT_QUALITY_FLOOR`

Returns:

Type	Description
`RuleSet`	A `RuleSet` named `"decision_rationale_v1_finance"` with five sub-scores and 20 rules. The rules and weights are identical to those of `groundlens_banking_rules`; only the rule-set name is updated.

Raises:

Type	Description
`ValueError`	If `domain` is not in :data:`_VALID_DECISION_RATIONALE_DOMAINS`, or if any key in `regulations` is not a known regulation key.

Example::

from groundlens import decision_rationale_rules

rs = decision_rationale_rules(
    domain="finance",
    regulations=("eu_ai_act", "sr_26_2"),
)
result = rs.evaluate(question=q, response=r, context=ctx)

Source code in src/groundlens/rules.py

def decision_rationale_rules(
    domain: str = "finance",
    regulations: tuple[str, ...] = (),
    quality_floor: float = _DEFAULT_QUALITY_FLOOR,
) -> RuleSet:
    """Rule set for decision-rationale agents (credit / AML / KYC / sanctions).

    Canonical factory for the 20-rule, 5-sub-score decision-rationale
    rule set. Replaces :func:`groundlens_banking_rules` under the
    archetype-as-function naming convention introduced in ADR 0001
    (release 2026.6.13).

    Args:
        domain: Deployment domain. Currently only ``"finance"`` (default)
            is supported; calling with any other value raises
            ``ValueError`` so the caller knows the verticalization is not
            yet shipped. Insurance, healthcare, and legal vertical
            decision-rationale sets are on the roadmap.
        regulations: Optional tuple of regulation keys. When non-empty,
            ``audit_explanation`` lines whose rule citation does not
            mention any of the requested regulations are suppressed from
            the rendered audit text. Does not add or remove rules. Valid
            keys include: ``"eu_ai_act"``, ``"sr_26_2"``, ``"sr_11_7"``,
            ``"nist_ai_600_1"``, ``"nist_ai_rmf"``, ``"iso_42001"``,
            ``"ecb_internal_models"``, ``"eba_gl_2020_06"``,
            ``"pra_ss1_23"``, ``"hipaa"``, ``"gdpr"``.

            *Implementation note (2026.6.13):* the kwarg is accepted and
            validated, but provenance-filtered rendering of
            ``audit_explanation`` will land in a follow-up release. For
            now the audit text is unmodified; the rule set is returned
            unchanged. A ``UserWarning`` is emitted when the kwarg is
            non-empty so the caller is aware the filter is not yet active.
        quality_floor: Threshold below which a sub-score triggers the
            cosmetic-deadlock flag. Kept for compatibility with the
            legacy ``banking_rules()`` signature.

    Returns:
        A :class:`RuleSet` named ``"decision_rationale_v1_finance"`` with
        five sub-scores and 20 rules. The rules and weights are identical
        to those of :func:`groundlens_banking_rules`; only the rule-set
        name is updated.

    Raises:
        ValueError: If ``domain`` is not in
            :data:`_VALID_DECISION_RATIONALE_DOMAINS`.

    Example::

        from groundlens import decision_rationale_rules

        rs = decision_rationale_rules(
            domain="finance",
            regulations=("eu_ai_act", "sr_26_2"),
        )
        result = rs.evaluate(question=q, response=r, context=ctx)
    """
    if domain not in _VALID_DECISION_RATIONALE_DOMAINS:
        msg = (
            f"decision_rationale_rules(domain={domain!r}) — supported domains "
            f"are {_VALID_DECISION_RATIONALE_DOMAINS}. Other verticalizations "
            "are on the roadmap; open an issue at "
            "https://github.com/groundlens-dev/groundlens/issues to request "
            "one."
        )
        raise ValueError(msg)

    unknown = tuple(r for r in regulations if r not in _REGULATION_CITATION_KEYS)
    if unknown:
        msg = (
            f"decision_rationale_rules(regulations={regulations!r}) — unknown "
            f"keys {unknown}. Known keys: "
            f"{tuple(_REGULATION_CITATION_KEYS.keys())}."
        )
        raise ValueError(msg)
    if regulations:
        warnings.warn(
            "decision_rationale_rules(regulations=...) is accepted but the "
            "provenance-filtered audit_explanation rendering is not yet "
            "active (slated for a follow-up release). The returned RuleSet "
            "is unchanged.",
            UserWarning,
            stacklevel=2,
        )

    base = groundlens_banking_rules(quality_floor=quality_floor)
    # Replace the legacy name with the archetype-aware canonical name.
    object.__setattr__(base, "name", f"decision_rationale_v1_{domain}")
    return base
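
Because a non-empty `regulations` argument currently emits a `UserWarning` rather than filtering anything, a caller may want to capture that warning explicitly. The following is a minimal sketch of that caller-side pattern. `rules_stub`, `call_and_collect`, and `KNOWN_KEYS` are hypothetical stand-ins that mirror only the validation and warning flow in the listing above; they are not part of groundlens.

```python
import warnings

# Illustrative subset of regulation keys; the real key table is
# _REGULATION_CITATION_KEYS, whose contents are not shown here.
KNOWN_KEYS = ("eu_ai_act", "sr_26_2", "gdpr")


def rules_stub(regulations: tuple[str, ...] = ()) -> str:
    """Stand-in that validates keys, warns, and returns a rule-set name."""
    unknown = tuple(r for r in regulations if r not in KNOWN_KEYS)
    if unknown:
        raise ValueError(f"unknown keys {unknown}")
    if regulations:
        warnings.warn(
            "regulations filter is accepted but not yet active",
            UserWarning,
            stacklevel=2,
        )
    return "decision_rationale_v1_finance"


def call_and_collect(regulations: tuple[str, ...]) -> tuple[str, list[str]]:
    """Call the stub and return its result plus any warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        name = rules_stub(regulations)
    return name, [str(w.message) for w in caught]


name, messages = call_and_collect(("eu_ai_act",))
```

The same `warnings.catch_warnings` block should work around the real factory call; switching the filter to `"error"` would instead turn the inactive-filter warning into an exception, which may suit strict CI setups.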

`groundlens_banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet` ¶

Canonical rule set for LLM rationale evaluation in banking governance.

Returns the 20-rule reference set whose provenance is triangulated across five independent research tracks: peer-reviewed NLP literature, tier-1 bank public reports, banking regulator whitepapers, cross-industry frameworks, and financial-domain NLP benchmarks. The rules are organized into five empirically-emergent sub-score categories:

groundedness (5 rules): claims linked to and supported by source.
completeness (3 rules): coverage of the governance question.
calibration (4 rules): uncertainty expression and abstention.
traceability (5 rules): citation, audit trail, validation references.
robustness (3 rules): resistance to noise, conflict, injection.

Each rule carries a citation field pointing to at least one of its academic, industrial, or regulatory provenance sources. The companion paper (Marin, 2026) documents the full per-rule provenance.

The default flag predicate `_groundlens_banking_flag_predicate` triggers when any regulator-non-negotiable sub-score falls below its threshold (groundedness < 0.5, calibration < 0.3, or traceability < 0.4).
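
The threshold logic described above can be sketched as a plain function with the same shape as the `flag_predicate` argument of `RuleSet` (a callable from `dict[str, float]` to `bool`). This is an illustration under stated assumptions, not the library's `_groundlens_banking_flag_predicate`: the name `flag_predicate_sketch` is hypothetical, and treating a missing sub-score as 0.0 is an assumption.

```python
# Thresholds as quoted in the documentation above.
THRESHOLDS = {"groundedness": 0.5, "calibration": 0.3, "traceability": 0.4}


def flag_predicate_sketch(sub_scores: dict[str, float]) -> bool:
    """Return True when any regulator-non-negotiable sub-score is too low.

    Assumption: a sub-score missing from the mapping counts as 0.0 and
    therefore triggers the flag.
    """
    return any(
        sub_scores.get(name, 0.0) < floor for name, floor in THRESHOLDS.items()
    )


borderline = {"groundedness": 0.5, "calibration": 0.3, "traceability": 0.4}
weak_calibration = {"groundedness": 0.9, "calibration": 0.29, "traceability": 0.8}

borderline_flagged = flag_predicate_sketch(borderline)
weak_flagged = flag_predicate_sketch(weak_calibration)
```

Note that the comparison is strict, so a score exactly at its threshold does not flag, and that completeness and robustness play no part in this predicate.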

Parameters:

Name	Type	Description	Default
`quality_floor`	`float`	Legacy floor exposed for users who want a uniform threshold across sub-scores. Not used by the default flag predicate; kept for compatibility with the legacy `banking_rules()` signature so deployers can A/B both rulesets with one parameter.	`_DEFAULT_QUALITY_FLOOR`

Returns:

Type	Description
`RuleSet`	A `RuleSet` named `"groundlens_banking_v1"` with five sub-scores and 20 rules.

Source code in src/groundlens/rules.py

def groundlens_banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet:
    """Canonical rule set for LLM rationale evaluation in banking governance.

    Returns the 20-rule reference set whose provenance is triangulated across
    five independent research tracks: peer-reviewed NLP literature, tier-1
    bank public reports, banking regulator whitepapers, cross-industry
    frameworks, and financial-domain NLP benchmarks. The rules are organized
    into five empirically-emergent sub-score categories:

    - **groundedness** (5 rules): claims linked to and supported by source.
    - **completeness** (3 rules): coverage of the governance question.
    - **calibration** (4 rules): uncertainty expression and abstention.
    - **traceability** (5 rules): citation, audit trail, validation references.
    - **robustness** (3 rules): resistance to noise, conflict, injection.

    Each rule carries a ``citation`` field pointing to at least one of its
    academic, industrial, or regulatory provenance sources. The companion
    paper (Marin, 2026) documents the full per-rule provenance.

    The default flag predicate :func:`_groundlens_banking_flag_predicate`
    triggers when any regulator-non-negotiable sub-score falls below its
    threshold (groundedness < 0.5, calibration < 0.3, or traceability < 0.4).

    Args:
        quality_floor: Legacy floor exposed for users who want a uniform
            threshold across sub-scores. Not used by the default flag
            predicate; kept for compatibility with the legacy ``banking_rules()``
            signature so deployers can A/B both rulesets with one parameter.

    Returns:
        A :class:`RuleSet` named ``"groundlens_banking_v1"`` with five
        sub-scores and 20 rules.
    """
    rules: tuple[ChecklistRule, ...] = (
        # ── Groundedness (5 rules) ──────────────────────────────────────────
        ChecklistRule(
            id="grnd.claim_supported_by_context",
            description="every claim inferable from context",
            weight=0.25,
            sub_score="groundedness",
            check=_check_grounded_in_context,
            citation="RAGAs (Es et al., EACL 2024) §3; NIST AI 600-1 (2024) §2.2 Confabulation",
        ),
        ChecklistRule(
            id="grnd.atomic_decomposition",
            description="rationale decomposable into atomic claims",
            weight=0.20,
            sub_score="groundedness",
            check=_check_atomic_decomposable,
            citation="FactScore (Min et al., EMNLP 2023) §3; RAGAs (Es et al., EACL 2024) §3",
        ),
        ChecklistRule(
            id="grnd.no_unsupported_extensions",
            description="no claims beyond what context supports",
            weight=0.20,
            sub_score="groundedness",
            check=_check_no_unsupported_extensions,
            citation=(
                "HaluEval (Li et al., EMNLP 2023); Ji et al. ACM CSUR 2023; NIST AI 600-1 (2024)"
            ),
        ),
        ChecklistRule(
            id="grnd.regulatory_flag",
            description="names a specific regulatory flag or policy clause",
            weight=0.20,
            sub_score="groundedness",
            check=_check_regulatory_flag,
            citation="REV (Chen et al., ACL 2023); SR 26-2 (Fed/OCC/FDIC 2026) §VI Documentation",
        ),
        ChecklistRule(
            id="grnd.counterfactual_robust",
            description="screened against wrong-retrieval scenarios",
            weight=0.15,
            sub_score="groundedness",
            check=_check_counterfactual_robustness,
            citation="RGB (Chen et al., AAAI 2024); EU AI Act 2024/1689 Art. 15(4)",
        ),
        # ── Completeness (3 rules) ──────────────────────────────────────────
        ChecklistRule(
            id="comp.addresses_all_parts",
            description="response length scales with question parts",
            weight=0.40,
            sub_score="completeness",
            check=_check_addresses_all_parts,
            citation="RAGAs (Es et al., EACL 2024) §3; EU AI Act 2024/1689 Art. 13(2)",
        ),
        ChecklistRule(
            id="comp.governance_dimensions",
            description="references multiple governance dimensions",
            weight=0.35,
            sub_score="completeness",
            check=_check_governance_dimensions,
            citation="EBA GL/2020/06 §4.3.3; SR 26-2 (Fed/OCC/FDIC 2026) §IV Model Development",
        ),
        ChecklistRule(
            id="comp.information_integration",
            description="integrates multiple sources",
            weight=0.25,
            sub_score="completeness",
            check=_check_information_integration,
            citation="RGB (Chen et al., AAAI 2024); TRUE (Honovich et al., NAACL 2022)",
        ),
        # ── Calibration (4 rules) ───────────────────────────────────────────
        ChecklistRule(
            id="cal.abstains_when_insufficient",
            description="explicitly abstains when evidence is insufficient",
            weight=0.35,
            sub_score="calibration",
            check=_check_abstains_when_insufficient,
            citation=(
                "RAGAs (Es et al., EACL 2024) §3; FinanceBench (Islam et al., 2023); "
                "SR 26-2 §V Model Validation"
            ),
        ),
        ChecklistRule(
            id="cal.explicit_hedging",
            description="uses hedging language for uncertain claims",
            weight=0.30,
            sub_score="calibration",
            check=_check_explicit_hedging,
            citation=(
                "TruthfulQA (Lin et al., ACL 2022); Hyland (1998) hedging taxonomy; "
                "SR 26-2 §IV Model Use"
            ),
        ),
        ChecklistRule(
            id="cal.confidence_score",
            description="includes a numeric confidence or probability",
            weight=0.20,
            sub_score="calibration",
            check=_check_confidence_score,
            citation="G-Eval (Liu et al., EMNLP 2023); EU AI Act Art. 13(3)(b)(ii)",
        ),
        ChecklistRule(
            id="cal.self_consistency",
            description="pipeline screened for self-consistency",
            weight=0.15,
            sub_score="calibration",
            check=_check_self_consistency,
            citation="SelfCheckGPT (Manakul et al., EMNLP 2023); Morgan Stanley + OpenAI (2024)",
        ),
        # ── Traceability (5 rules) ──────────────────────────────────────────
        ChecklistRule(
            id="trace.specific_source_span",
            description="cites a specific page / section / paragraph",
            weight=0.25,
            sub_score="traceability",
            check=_check_specific_source_span,
            citation=(
                "e-SNLI (Camburu et al., NeurIPS 2018); EU AI Act Art. 13(3)(b)(iv); "
                "FinanceBench (Islam et al., 2023)"
            ),
        ),
        ChecklistRule(
            id="trace.natural_language_rationale",
            description="provides a substantive natural-language rationale",
            weight=0.20,
            sub_score="traceability",
            check=_check_substantive_length,
            citation=(
                "e-SNLI (Camburu et al., NeurIPS 2018); EU AI Act Art. 13(3)(b)(iv); "
                "PRA SS1/23 Principle 3"
            ),
        ),
        ChecklistRule(
            id="trace.falsifiable_actionable",
            description="couples numeric claim with causal mechanism",
            weight=0.20,
            sub_score="traceability",
            check=_check_falsifiable_actionable,
            citation="REV (Chen et al., ACL 2023); SR 26-2 §V Conceptual Soundness",
        ),
        ChecklistRule(
            id="trace.numeric_value",
            description="includes a numeric value or metric",
            weight=0.15,
            sub_score="traceability",
            check=_check_numeric_value,
            citation=(
                "FinQA (Chen et al., EMNLP 2021); EU AI Act Art. 13(3)(b)(ii); "
                "SR 26-2 §V Outcomes Analysis"
            ),
        ),
        ChecklistRule(
            id="trace.audit_logged",
            description="rationale persisted to audit log",
            weight=0.20,
            sub_score="traceability",
            check=_check_audit_logged,
            citation=(
                "EU AI Act Art. 12 Record-Keeping; SR 26-2 §VI Documentation; "
                "ISO/IEC 42001:2023 §8.2"
            ),
        ),
        # ── Robustness (3 rules) ────────────────────────────────────────────
        ChecklistRule(
            id="rob.independent_validation",
            description="references independent validation / effective challenge",
            weight=0.40,
            sub_score="robustness",
            check=_check_independent_validation,
            citation=(
                "SR 26-2 §III Effective Challenge; PRA SS1/23 Principle 4; "
                "ECB Guide to Internal Models §9.3 ¶43(a)"
            ),
        ),
        ChecklistRule(
            id="rob.prompt_injection_robust",
            description="pipeline screened for prompt-injection robustness",
            weight=0.35,
            sub_score="robustness",
            check=_check_prompt_injection_robust,
            citation="RGB (Chen et al., AAAI 2024); EU AI Act Art. 15; MAS MindForge (2024)",
        ),
        ChecklistRule(
            id="rob.cross_source_conflict",
            description="acknowledges cross-source conflicts",
            weight=0.25,
            sub_score="robustness",
            check=_check_cross_source_conflict,
            citation=(
                "ConflictBank (Su et al., 2024); EU AI Act Art. 15(4); RGB (Chen et al., 2024)"
            ),
        ),
    )

    return RuleSet(
        name="groundlens_banking_v1",
        rules=rules,
        sub_scores=("groundedness", "completeness", "calibration", "traceability", "robustness"),
        quality_floor=quality_floor,
        flag_predicate=_groundlens_banking_flag_predicate,
    )
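
One property of the listing above that is easy to check is that the rule weights within each sub-score sum to 1.0. The sketch below copies the (id, sub-score, weight) triples from the source and verifies that, then shows one plausible aggregation: a sub-score as the weight share of matched rules. That aggregation is an assumption for illustration; how `RuleSet.evaluate` actually combines rule results is not shown in this section, and `weight_totals` and `sub_scores_sketch` are hypothetical names.

```python
from collections import defaultdict

# (rule id, sub-score, weight) copied from groundlens_banking_rules above.
RULES: list[tuple[str, str, float]] = [
    ("grnd.claim_supported_by_context", "groundedness", 0.25),
    ("grnd.atomic_decomposition", "groundedness", 0.20),
    ("grnd.no_unsupported_extensions", "groundedness", 0.20),
    ("grnd.regulatory_flag", "groundedness", 0.20),
    ("grnd.counterfactual_robust", "groundedness", 0.15),
    ("comp.addresses_all_parts", "completeness", 0.40),
    ("comp.governance_dimensions", "completeness", 0.35),
    ("comp.information_integration", "completeness", 0.25),
    ("cal.abstains_when_insufficient", "calibration", 0.35),
    ("cal.explicit_hedging", "calibration", 0.30),
    ("cal.confidence_score", "calibration", 0.20),
    ("cal.self_consistency", "calibration", 0.15),
    ("trace.specific_source_span", "traceability", 0.25),
    ("trace.natural_language_rationale", "traceability", 0.20),
    ("trace.falsifiable_actionable", "traceability", 0.20),
    ("trace.numeric_value", "traceability", 0.15),
    ("trace.audit_logged", "traceability", 0.20),
    ("rob.independent_validation", "robustness", 0.40),
    ("rob.prompt_injection_robust", "robustness", 0.35),
    ("rob.cross_source_conflict", "robustness", 0.25),
]


def weight_totals(rules: list[tuple[str, str, float]]) -> dict[str, float]:
    """Sum the rule weights within each sub-score."""
    totals: dict[str, float] = defaultdict(float)
    for _rule_id, sub_score, weight in rules:
        totals[sub_score] += weight
    return dict(totals)


def sub_scores_sketch(
    rules: list[tuple[str, str, float]], matched_ids: set[str]
) -> dict[str, float]:
    """Assumed aggregation: matched weight divided by total weight."""
    totals = weight_totals(rules)
    earned: dict[str, float] = defaultdict(float)
    for rule_id, sub_score, weight in rules:
        if rule_id in matched_ids:
            earned[sub_score] += weight
    return {name: earned[name] / total for name, total in totals.items()}


totals = weight_totals(RULES)
scores = sub_scores_sketch(
    RULES, {"grnd.claim_supported_by_context", "grnd.regulatory_flag"}
)
```

Under this assumed aggregation, a response matching only those two groundedness rules would score about 0.45 on groundedness, just under the 0.5 threshold named for the default flag predicate.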

`compute_sgi(question: str, context: str, response: str, *, model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None) -> SGIResult` ¶

Compute the Semantic Grounding Index for a response.

Parameters:

Name	Type	Description	Default
`question`	`str`	The input query.	required
`context`	`str`	Source document, retrieved chunks, or reference text.	required
`response`	`str`	The LLM output to evaluate.	required
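`model`	`str`	Sentence transformer model name. Defaults to `DEFAULT_MODEL` (`Snowflake/snowflake-arctic-embed-l-v2.0`); `all-MiniLM-L6-v2` is available as `LIGHTWEIGHT_MINILM`.	`DEFAULT_MODEL`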
`encoder`	`EmbeddingFn \| None`	Optional bring-your-own-embeddings callable taking `list[str]` and returning an `(n, d)` array. Bypasses sentence-transformers (no torch required) when provided.	`None`

Returns:

Type	Description
`SGIResult`	SGIResult with raw score, normalized score, and flag status.

Raises:

Type	Description
`ValueError`	If any input string is empty or contains only whitespace.

Example

>>> from groundlens import compute_sgi
>>> result = compute_sgi(
...     question="What is the capital of France?",
...     context="France is in Western Europe. Its capital is Paris.",
...     response="The capital of France is Paris.",
... )
>>> result.flagged
False
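
The `encoder` parameter accepts any callable that maps a `list[str]` to `n` rows of `d` floats. As a rough illustration of that contract, here is a toy, dependency-free encoder based on hashed token counts. `toy_hash_encoder` and its `dim` parameter are hypothetical, the embedding is far too crude for real grounding checks, and it returns nested lists rather than an array; wrapping the result with `numpy.asarray` before handing it to groundlens is an assumption about what the library expects.

```python
import hashlib
import math


def toy_hash_encoder(texts: list[str], dim: int = 64) -> list[list[float]]:
    """Map each text to a unit-length vector of hashed token counts."""
    rows: list[list[float]] = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # A stable hash keeps the encoder deterministic across runs.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard empty text
        rows.append([v / norm for v in vec])
    return rows


rows = toy_hash_encoder(
    ["The capital of France is Paris.", "The capital of France is Paris.", "Seasons"]
)
```

Per the source listing for `compute_sgi`, passing a custom encoder (or a non-default `model`) triggers a warning about default thresholds, so the stock flag cut-offs may not transfer to a different embedding space.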

Source code in src/groundlens/sgi.py

def compute_sgi(
    question: str,
    context: str,
    response: str,
    *,
    model: str = DEFAULT_MODEL,
    encoder: EmbeddingFn | None = None,
) -> SGIResult:
    """Compute the Semantic Grounding Index for a response.

    Args:
        question: The input query.
        context: Source document, retrieved chunks, or reference text.
        response: The LLM output to evaluate.
        model: Sentence transformer model name. Defaults to
            ``DEFAULT_MODEL`` (``Snowflake/snowflake-arctic-embed-l-v2.0``).
        encoder: Optional bring-your-own-embeddings callable taking
            ``list[str]`` and returning an ``(n, d)`` array. Bypasses
            sentence-transformers (no torch required) when provided.

    Returns:
        SGIResult with raw score, normalized score, and flag status.

    Raises:
        ValueError: If any input string is empty or whitespace-only.

    Example:
        >>> from groundlens import compute_sgi
        >>> result = compute_sgi(
        ...     question="What is the capital of France?",
        ...     context="France is in Western Europe. Its capital is Paris.",
        ...     response="The capital of France is Paris.",
        ... )
        >>> result.flagged
        False
    """
    if not question.strip():
        msg = "question must be a non-empty string."
        raise ValueError(msg)
    if not context.strip():
        msg = "context must be a non-empty string."
        raise ValueError(msg)
    if not response.strip():
        msg = "response must be a non-empty string."
        raise ValueError(msg)

    if encoder is not None or model != DEFAULT_MODEL:
        _warn_default_thresholds_with_custom_encoder("compute_sgi", model, encoder is not None)

    embeddings = encode_texts([question, context, response], model_name=model, encoder=encoder)
    q_emb, ctx_emb, resp_emb = embeddings[0], embeddings[1], embeddings[2]

    # L2-normalize to project onto the unit hypersphere (paper Algorithm 1).
    q_hat = _l2_normalize(q_emb)
    c_hat = _l2_normalize(ctx_emb)
    r_hat = _l2_normalize(resp_emb)

    # Angular (geodesic) distances on S^(d-1).
    q_dist = _angular_distance(r_hat, q_hat)
    ctx_dist = _angular_distance(r_hat, c_hat)

    # Degenerate case: response identical to context (theta(r, c) ≈ 0).
    if ctx_dist < _EPS:
        return SGIResult(
            value=10.0,
            normalized=1.0,
            flagged=False,
            q_dist=round(q_dist, 4),
            ctx_dist=round(ctx_dist, 4),
        )

    # Degenerate case: response identical to question (theta(r, q) ≈ 0).
    if q_dist < _EPS:
        return SGIResult(
            value=0.0,
            normalized=0.0,
            flagged=True,
            q_dist=round(q_dist, 4),
            ctx_dist=round(ctx_dist, 4),
        )

    raw = q_dist / ctx_dist
    normalized = normalize_sgi(raw)

    return SGIResult(
        value=round(raw, 4),
        normalized=round(normalized, 4),
        flagged=raw < SGI_REVIEW,
        q_dist=round(q_dist, 4),
        ctx_dist=round(ctx_dist, 4),
    )
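
The geometry in the listing above reduces to a ratio of two angles: the raw score is the angular distance from response to question divided by the angular distance from response to context. The sketch below reproduces that on toy 2-D vectors using only the standard library. It assumes `_angular_distance` is the arccosine of the dot product of unit vectors and uses a hypothetical `eps` in place of `_EPS`; neither helper is shown in this section, and `normalize_sgi` and the `SGI_REVIEW` threshold are left out because their definitions are not given here.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_distance(a: list[float], b: list[float]) -> float:
    """Angle in radians between two unit vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding drift


def sgi_sketch(
    q: list[float], c: list[float], r: list[float], eps: float = 1e-8
) -> float:
    """Raw SGI as theta(r, q) / theta(r, c), with the two degenerate cases."""
    q_hat, c_hat, r_hat = l2_normalize(q), l2_normalize(c), l2_normalize(r)
    q_dist = angular_distance(r_hat, q_hat)
    ctx_dist = angular_distance(r_hat, c_hat)
    if ctx_dist < eps:  # response coincides with context
        return 10.0
    if q_dist < eps:  # response coincides with question
        return 0.0
    return q_dist / ctx_dist


# Response sits 60 degrees from the question and 30 degrees from the context.
question_vec = [1.0, 0.0]
context_vec = [0.0, 1.0]
response_vec = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
raw = sgi_sketch(question_vec, context_vec, response_vec)
```

A larger raw value means the response lies closer to the context than to the question, which is why `compute_sgi` flags a result when the raw score falls below `SGI_REVIEW`.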

Index

groundlens ¶

Attributes¶

DEFAULT_MODEL: str = 'Snowflake/snowflake-arctic-embed-l-v2.0' module-attribute ¶

LIGHTWEIGHT_MINILM: str = 'all-MiniLM-L6-v2' module-attribute ¶

MULTILINGUAL_E5: str = 'intfloat/multilingual-e5-large' module-attribute ¶

MULTILINGUAL_MINI: str = 'paraphrase-multilingual-MiniLM-L12-v2' module-attribute ¶

Classes¶

CalibrationResult(model: str, n_pairs: int, embedding_dim: int, mu_hat: NDArray[np.float32], concentration: float, metadata: dict[str, str] = dict()) dataclass ¶

Methods:¶

save(path: str | Path) -> None ¶

load(path: str | Path) -> CalibrationResult classmethod ¶

ThresholdFit(sgi_review: float | None, dgi_pass: float | None, n: int, model: str, metric: str = 'youden_j') dataclass ¶

DGI(model: str = DEFAULT_MODEL, reference_csv: str | None = None, encoder: EmbeddingFn | None = None) ¶

Methods:¶

calibrate(pairs: list[tuple[str, str]] | None = None, csv_path: str | None = None) -> None ¶

score(question: str, response: str) -> DGIResult ¶

propose_labels(*, seeds: list[SeedExample], llm_generate: Callable[[str], str], n_candidates: int = 50, n_to_label: int = 10, strategies: str | tuple[str | tuple[str, str], ...] = 'default', diverse_fraction: float = 0.3, seed: int = 42) -> PropositionBatch ¶

ProposedLabel(question: str, candidate_response: str, dgi_score: float, strategy: str, context_excerpt: str, uncertainty: float) dataclass ¶

PropositionBatch(items: tuple[ProposedLabel, ...], review_template: str, all_candidates: tuple[ProposedLabel, ...] = tuple(), strategies_used: tuple[str, ...] = tuple()) dataclass ¶

SeedExample(context: str, question: str, grounded: str) dataclass ¶

Methods:¶

__post_init__() -> None ¶

ChecklistRule(id: str, description: str, weight: float, sub_score: str, check: Callable[[str, str, str | None, dict[str, Any]], RuleEvidence], citation: str = '') dataclass ¶

RuleEvidence(matched: bool, span: str, explanation: str) dataclass ¶

RuleResult(rule_id: str, sub_score: str, weight: float, matched: bool, evidence_span: str, explanation: str) dataclass ¶

RuleSet(name: str, rules: tuple[ChecklistRule, ...], sub_scores: tuple[str, ...] = ('spec', 'expl', 'bshift'), quality_floor: float = _DEFAULT_QUALITY_FLOOR, flag_predicate: Callable[[dict[str, float]], bool] | None = None) dataclass ¶

Methods:¶

evaluate(*, question: str, response: str, context: str | None = None, metadata: dict[str, Any] | None = None) -> RuleSetResult ¶

RuleSetResult(sub_scores: dict[str, float], quality: float, flagged: bool, rule_results: tuple[RuleResult, ...], audit_explanation: str) dataclass ¶

Attributes¶

spec: float property ¶

expl: float property ¶

bshift: float property ¶

groundedness: float property ¶

completeness: float property ¶

calibration: float property ¶

traceability: float property ¶

robustness: float property ¶

DGIResult(value: float, normalized: float, flagged: bool, method: str = 'dgi', explanation: str = '') dataclass ¶

Methods:¶

__post_init__() -> None ¶

GroundlensScore(value: float, normalized: float, flagged: bool, method: str, explanation: str, detail: SGIResult | DGIResult) dataclass ¶

SGIResult(value: float, normalized: float, flagged: bool, q_dist: float, ctx_dist: float, method: str = 'sgi', explanation: str = '') dataclass ¶

Methods:¶

__post_init__() -> None ¶

SGI(model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None) ¶

Methods:¶

score(question: str, context: str, response: str) -> SGIResult ¶

Functions:¶

get_default_encoder() -> EmbeddingFn | None ¶

set_default_encoder(encoder: EmbeddingFn | None) -> None ¶

customer_support_rag_rules() -> RuleSet ¶

customer_support_rules(rag: bool = True, domain: str = 'general', language: str = 'en') -> RuleSet ¶

rag_rules(domain: str = 'banking') -> RuleSet ¶

routing_rules(domain: str = 'general') -> RuleSet ¶

specialized_agent_rules(domain: str = 'general', tools: tuple[str, ...] = ()) -> RuleSet ¶

fit_thresholds(examples: list[Mapping[str, object]], *, model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None, reference_csv: str | None = None) -> ThresholdFit ¶

compute_dgi(question: str, response: str, *, model: str = DEFAULT_MODEL, reference_csv: str | None = None, encoder: EmbeddingFn | None = None) -> DGIResult ¶

evaluate_batch(items: list[dict[str, str]], *, model: str = DEFAULT_MODEL, reference_csv: str | None = None) -> list[GroundlensScore] ¶

banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet ¶

decision_rationale_rules(domain: str = 'finance', regulations: tuple[str, ...] = (), quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet ¶

groundlens_banking_rules(quality_floor: float = _DEFAULT_QUALITY_FLOOR) -> RuleSet ¶

compute_sgi(question: str, context: str, response: str, *, model: str = DEFAULT_MODEL, encoder: EmbeddingFn | None = None) -> SGIResult ¶
