Production Deployment in Banking¶
This guide is for teams deploying groundlens as part of an AI governance layer in a regulated banking environment. It covers the end-to-end pipeline: triage of LLM outputs, audit log, compliance mapping, calibration, and operational considerations specific to US / EU bank deployments.
The patterns below are the deployment shape groundlens is designed for. Each component (rules, audit, calibration, compliance) can be adopted independently; they compose well when used together.
Architectural overview¶
┌─────────────────────────────────────────────────────────────────┐
│ LLM-powered banking workflow │
│ (credit decision, AML review, KYC, fraud, ...) │
└────────────────────────────┬────────────────────────────────────┘
│ (question, response, context)
▼
┌─────────────────────────────────────────────────────────────────┐
│ groundlens triage layer │
│ ┌────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ SGI/DGI │ │ Rule-based │ │ Hybrid quality score │ │
│ │ scoring │ │ sub-scores │ │ (geometric × rules) │ │
│ └────────────┘ └──────────────┘ └────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────┴──────────────────┐
▼ ▼
┌────────────────────────┐ ┌────────────────────────────┐
│ Flagged → human │ │ AuditLog (sqlite + │
│ review queue │ │ hash chain) │
└────────────────────────┘ └────────────────────────────┘
│
▼
┌────────────────────────┐
│ Examiner exports + │
│ compliance reports │
└────────────────────────┘
Minimum viable pipeline¶
The shortest production pipeline that covers triage + audit looks like:
from groundlens import compute_sgi
from groundlens.rules import banking_rules
from groundlens.audit import AuditLog
audit = AuditLog(db_path="/var/lib/groundlens/audit.sqlite")
rules = banking_rules(quality_floor=0.3)
def process_decision(case):
"""Process one LLM-produced banking decision."""
# 1. Geometric grounding score
sgi = compute_sgi(
question=case["question"],
context=case["context"],
response=case["response"],
)
# 2. Rule-based interpretable sub-scores
rs = rules.evaluate(
question=case["question"],
response=case["response"],
context=case["context"],
metadata={
"flags_present": case.get("flags", []),
"jurisdiction": case.get("jurisdiction"),
"transaction_type": case.get("transaction_type"),
},
)
# 3. Hybrid flagging: flag if either signal flags
flagged = sgi.flagged or rs.flagged
# 4. Persist to audit log (hash chain)
audit.record(
identifier=case["case_id"],
method="hybrid",
score=sgi.value,
flagged=flagged,
inputs={
"question": case["question"],
"context": case["context"],
"response": case["response"],
},
metadata={
"operator": case.get("operator"),
"model_version": case.get("model_version"),
"jurisdiction": case.get("jurisdiction"),
},
rule_results=[
{
"rule_id": r.rule_id,
"sub_score": r.sub_score,
"matched": r.matched,
"evidence_span": r.evidence_span,
}
for r in rs.rule_results
],
)
return {
"case_id": case["case_id"],
"flagged": flagged,
"sgi_value": sgi.value,
"spec": rs.spec,
"expl": rs.expl,
"bshift": rs.bshift,
"audit_explanation": rs.audit_explanation,
}
Why hybrid (geometric + rules)?¶
Geometric methods (SGI, DGI) and rule-based methods detect different failure modes:
| Failure mode | Detected by SGI/DGI | Detected by rules |
|---|---|---|
| Response ignores context (RAG miss) | yes (Type I) | partially (rules need response to be specific to case) |
| Response invents content outside the topic | yes (Type II) | sometimes |
| Rationale lacks specificity (no case detail) | no | yes (spec sub-score) |
| Rationale lacks explanatory linkage | no | yes (expl sub-score) |
| Rationale lacks resolution path | no | yes (bshift sub-score) |
| Response cites fabricated parameters that look plausible | no | no (this is a Type III error — neither method catches it) |
The geometric methods alone can miss cases where the rationale is plausibly written but uninformative to a human reviewer. The rule-based sub-scores alone can miss cases where the response is structurally rich but ignores the actual context provided. Together they cover more of the failure surface.
For Type III errors, complement groundlens with claim-level fact- checking on the outputs that pass both signals.
Calibration for banking domain¶
The bundled banking calibration corpus covers seven sub-domains:
from groundlens import compute_dgi
from groundlens.data import banking_reference_pairs_path
result = compute_dgi(
question="What triggers a SAR filing under the Bank Secrecy Act?",
response=llm_output,
reference_csv=str(banking_reference_pairs_path()),
)
For deployment-specific calibration, extend the bundled corpus with verified pairs from the deployment's own workflow. A starting target is 100-200 pairs per sub-domain to reach AUROC > 0.95. See the domain calibration guide for the calibration procedure.
Calibration data sourcing: for banking deployments, sources include internal model validation test cases, compliance committee post-mortems with the final-decision rationale, and synthetic pairs reviewed by a compliance officer. Do not source from the deployment's own LLM outputs unless they have been independently verified — calibrating against unverified pairs causes the reference direction to drift toward the LLM's own biases.
Threshold tuning¶
The defaults are derived from research benchmarks. For banking deployments, the appropriate tuning depends on risk tier:
from groundlens.rules import banking_rules
# Conservative — high-stakes paths (credit decisioning, sanctions screening)
strict_rules = banking_rules(quality_floor=0.4)
# Standard — medium-risk paths (KYC review, fraud triage)
standard_rules = banking_rules(quality_floor=0.3)
# Permissive — low-stakes paths (customer service, content moderation)
relaxed_rules = banking_rules(quality_floor=0.2)
Document the chosen threshold and the rationale in the deployment's model inventory entry, per SR 11-7 §7.
Deployment-specific rule extensions¶
Beyond the bundled banking_rules(), deployments often add a small
number of rules specific to their internal policy:
from groundlens.rules import banking_rules, ChecklistRule, RuleEvidence
# Start from the bundled banking ruleset
base = banking_rules()
# Add an internal-policy rule, e.g. "rationale must cite the policy clause"
def _check_policy_clause(question, response, context, metadata):
matched = "policy clause" in response.lower() or "internal policy" in response.lower()
return RuleEvidence(matched=matched, span="policy clause", explanation="cites internal policy")
internal_rule = ChecklistRule(
id="spec.internal_policy",
description="cites internal policy clause",
weight=0.15,
sub_score="spec",
check=_check_policy_clause,
)
# Compose a new ruleset
from groundlens.rules import RuleSet
deployment_rules = RuleSet(
name="banking_internal_v1",
rules=base.rules + (internal_rule,),
quality_floor=base.quality_floor,
)
Version the ruleset (here banking_internal_v1) and tag the audit
log entries with the ruleset version in metadata. Any change to the
rule set is then visible in the audit trail.
Self-hosted deployment¶
Banking environments typically require self-hosted deployment, not SaaS. groundlens has no required external services:
- The embedding model (
all-MiniLM-L6-v2, 80MB) ships withsentence-transformersand can be cached locally or in an internal artifact registry. - The audit log uses SQLite, which is single-file and embeddable.
- No outbound network calls during scoring.
A minimum container build looks like:
FROM python:3.12-slim
# Pre-fetch the embedding model into the image
RUN pip install --no-cache-dir groundlens \
&& python -c "from sentence_transformers import SentenceTransformer; \
SentenceTransformer('all-MiniLM-L6-v2')"
# Audit log volume mount
VOLUME ["/var/lib/groundlens"]
ENV GROUNDLENS_AUDIT_DB=/var/lib/groundlens/audit.sqlite
Examiner readiness checklist¶
When an examiner requests evidence for a banking AI governance review, the following should be exportable in minutes:
- [x] Audit log integrity —
log.verify_chain()reports valid. - [x] Period export —
log.export_jsonl()produces a JSONL file covering the examined period, with every input, output, configuration, and hash for every evaluation. - [x] Compliance mapping —
from groundlens.compliance import all_mappings; all_mappings()returns the documented design intent against SR 11-7 / EU AI Act / NIST AI RMF. - [x] Compliance report —
ComplianceReport.to_markdown()generates an examiner-ready report combining summary statistics with explicit clause references. - [x] Source code — groundlens is MIT licensed and the full source is available at the published version of the deployment.
- [x] Calibration corpus — the reference_csv used for DGI is stored alongside the deployment and version-controlled.
Run this checklist before any scheduled examination. If any item is not green, the deployment is not examiner-ready.
Performance and scale¶
Per-evaluation latency on a single CPU core:
| Method | Latency | Memory |
|---|---|---|
| SGI (with cached model) | 5-15 ms | ~150 MB resident |
| DGI (with cached model) | 5-15 ms | ~150 MB resident |
| Rules only | <1 ms | <1 MB additional |
| Audit log record | 0.5-2 ms | per-write |
For batch throughput: 100-200 evaluations per second per worker on commodity hardware. Horizontal scaling is trivial because the scorer is stateless; the audit log is the only shared component and SQLite supports the concurrent reader patterns common in monitoring.
Known limitations for banking deployment¶
Document the following in the deployment's model validation report:
- Type III errors — embedding-based methods do not detect within-frame factual errors. Pair with claim-level fact-checking for high-stakes paths.
- Calibration scope — the DGI calibration reflects the corpus used. Outside the calibrated domain, DGI degrades to general- purpose detection.
- Single-writer audit log — SQLite-backed audit log supports single-writer single-process semantics. For multi-process deployments, split into per-process logs and reconcile downstream.
- Determinism contract — determinism holds against a fixed embedding model version. Embedding model upgrades require re- calibration and a documented re-validation per SR 11-7 §3.
References¶
- Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771.
- Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224v3.
- Board of Governors of the Federal Reserve System and Office of the Comptroller of the Currency. Supervisory Guidance on Model Risk Management (SR 11-7). 2011.
- European Parliament and Council. Regulation (EU) 2024/1689 on Artificial Intelligence (EU AI Act). 2024.
- NIST. AI Risk Management Framework 1.0. 2023.