SR 11-7 Compliance
The Federal Reserve / OCC Supervisory Guidance on Model Risk Management (SR 11-7, April 2011) is the dominant US standard for managing risks introduced by quantitative models used in banking decisions. It applies to any model whose output affects business decisions — and that includes LLM-based scoring and triage pipelines.
groundlens is designed to support SR 11-7 compliance through deterministic scoring, a full audit trail, and explicit documentation of design intent against the standard's clauses.
Why groundlens fits SR 11-7
SR 11-7 sets expectations in three areas: model development, implementation, and use; validation, which includes ongoing monitoring; and governance, policies, and controls. groundlens contributes to each:
| SR 11-7 expectation | groundlens contribution |
|---|---|
| Conceptual soundness (§3) | SGI and DGI are published research with explicit mathematical formulations and known limitations |
| Outcomes analysis (§3) | Deterministic scoring means every output is reproducible byte-for-byte for retrospective review |
| Process verification (§3) | The rules module exposes audit-friendly sub-scores (spec / expl / bshift) with per-rule evidence |
| Documentation (§5) | Full audit log via groundlens.audit.AuditLog, including inputs, outputs, configuration, and hash-chained integrity |
| Governance, policies, controls (§7) | Thresholds are explicit constants in groundlens._internal.thresholds; rule weights configurable per deployment |
| Ongoing monitoring (§4) | Batch evaluation supports tracking flagged rate and score distributions over time |
| Effective challenge (§3) | The independent validation function can replay any past decision from the audit log and verify the chain has not been altered |
Replacing LLM-as-judge under SR 11-7
The most common pattern for hallucination detection — a second LLM acting as judge — is hard to defend under SR 11-7. groundlens removes the second LLM entirely.
| SR 11-7 concern | LLM-as-judge | groundlens |
|---|---|---|
| Conceptual soundness | The judge LLM shares failure modes with the model under evaluation; circular trust | Geometric scorer in embedding space; deterministic mathematical operations |
| Reproducibility | Non-deterministic sampling; same input may produce different verdicts | Same inputs always produce the same score, byte-for-byte |
| Outcomes analysis | Outcomes shift when the judge LLM is updated or its prompts retuned | Method does not change with LLM upgrades; historical comparisons stay valid |
| Documentation | The judge LLM's reasoning is opaque and version-dependent | Distance ratios and angle alignments; every score decomposable into inspectable components |
| Effective challenge | Hard to mount an independent challenge against an opaque LLM judge | An auditor can replay any score from logged inputs and verify the result |
§3 Model Validation
SR 11-7 §3 requires three components of independent model validation: conceptual soundness, ongoing monitoring, and outcomes analysis.
Conceptual soundness
groundlens implements three published methods with explicit mathematical definitions:
- SGI (arXiv:2512.13771): distance ratio dist(response, question) / dist(response, context) in sentence-transformer embedding space.
- DGI (arXiv:2602.13224v3): directional alignment dot(unit(phi(r) - phi(q)), mu_hat) with a calibrated reference direction.
- Rule sub-scores: checklist evaluation producing specificity, explanatory linkage, and boundary shift signals in [0, 1].
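The two geometric formulas above can be sketched in a few lines of plain Python. This is an illustration only, not the groundlens implementation: the helper names (euclidean, unit, sgi_ratio, dgi_alignment) are hypothetical, Euclidean distance is an assumed choice for dist, and the 2-D vectors are toy values rather than real sentence-transformer embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unit(v: Vector) -> List[float]:
    """Scale v to unit length; raises on the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def sgi_ratio(response: Vector, question: Vector, context: Vector) -> float:
    """dist(response, question) / dist(response, context)."""
    denom = euclidean(response, context)
    if denom == 0.0:
        raise ValueError("response and context embeddings coincide")
    return euclidean(response, question) / denom


def dgi_alignment(response: Vector, question: Vector, mu_hat: Vector) -> float:
    """dot(unit(phi(r) - phi(q)), mu_hat) for a unit-length reference direction."""
    displacement = unit([r - q for r, q in zip(response, question)])
    return sum(d * m for d, m in zip(displacement, mu_hat))


# Toy 2-D "embeddings" chosen for easy arithmetic, not real model output.
question = [0.0, 0.0]
context = [3.0, 0.0]
response = [3.0, 4.0]
mu_hat = [0.6, 0.8]  # already unit length

sgi = sgi_ratio(response, question, context)    # 5 / 4
dgi = dgi_alignment(response, question, mu_hat)  # displacement is parallel to mu_hat
```

Both functions are pure arithmetic on their inputs, which is what makes each score decomposable into components a validator can inspect by hand.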
Limitations are documented explicitly in Confabulation Boundary: Type III within-frame errors (right vocabulary, wrong facts) are not detectable by embedding geometry. This is documented as a property, not a defect.
Ongoing monitoring
from groundlens import evaluate_batch
from groundlens.audit import AuditLog

# batch_inputs (a list of dicts, each with a "case_id" key) and batch_id
# are supplied by the calling pipeline.
log = AuditLog(db_path="production_audit.sqlite")

# Per-batch summary fed to the monitoring dashboard
results = evaluate_batch(batch_inputs)
flagged_rate = sum(r.flagged for r in results) / len(results) if results else 0.0

# Record every evaluation so it can be replayed later
for inputs, result in zip(batch_inputs, results):
    log.record(
        identifier=inputs["case_id"],
        method=result.method,
        score=result.value,
        flagged=result.flagged,
        inputs=inputs,
        metadata={"batch_id": batch_id, "model_version": "2026.6.7"},
    )
Define performance triggers (e.g. flagged rate > 15% for three consecutive days) that initiate investigation per SR 11-7 §4.
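A trigger of this kind is simple to encode. The sketch below assumes the monitoring job already produces one flagged rate per day; the function name trigger_fired and its parameters are hypothetical, and the 15% / three-day defaults simply mirror the example above rather than any recommended setting.

```python
from typing import Sequence


def trigger_fired(
    daily_flagged_rates: Sequence[float],
    threshold: float = 0.15,
    consecutive_days: int = 3,
) -> bool:
    """Return True if the rate exceeded threshold on N consecutive days."""
    streak = 0
    for rate in daily_flagged_rates:
        # Extend the streak on a breach, reset it on any day at or below threshold.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive_days:
            return True
    return False


# Illustrative daily rates (made-up values, oldest first).
quiet_week = [0.08, 0.16, 0.17, 0.12, 0.18, 0.16, 0.09]
drifting_week = [0.08, 0.12, 0.16, 0.19, 0.21, 0.14, 0.10]

quiet_result = trigger_fired(quiet_week)        # never three breaches in a row
drifting_result = trigger_fired(drifting_week)  # days 3 to 5 all exceed 15%
```

Keeping the threshold and window as explicit parameters makes the trigger definition easy to record in the model inventory alongside its rationale.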
Outcomes analysis
Because every evaluation is deterministic, retrospective analysis is straightforward: replaying any historical decision from the audit log reproduces the originally logged score byte-for-byte. Sampling-based judges cannot offer this property, which directly supports the outcomes analysis SR 11-7 §3 calls for.
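A replay check along these lines might look like the following sketch. It does not use the groundlens API: the record layout (identifier, inputs, score) only mirrors the keyword arguments passed to log.record in this guide, and replay_mismatches, float_bytes, and toy_scorer are hypothetical names, with toy_scorer standing in for whichever deterministic scoring function is being replayed.

```python
import struct
from typing import Callable, Iterable, List, Mapping


def float_bytes(value: float) -> bytes:
    """IEEE 754 double representation, used for an exact comparison."""
    return struct.pack(">d", value)


def replay_mismatches(
    records: Iterable[Mapping],
    scorer: Callable[[Mapping], float],
) -> List[str]:
    """Re-score each logged record and return identifiers whose score changed."""
    mismatched = []
    for record in records:
        replayed = scorer(record["inputs"])
        if float_bytes(replayed) != float_bytes(record["score"]):
            mismatched.append(record["identifier"])
    return mismatched


def toy_scorer(inputs: Mapping) -> float:
    """Stand-in deterministic scorer: ratio of response to context length."""
    return len(inputs["response"]) / len(inputs["context"])


# Two made-up log entries; the second carries a score the scorer cannot reproduce.
logged = [
    {"identifier": "case-1", "inputs": {"response": "abcd", "context": "ab"}, "score": 2.0},
    {"identifier": "case-2", "inputs": {"response": "abc", "context": "abcdef"}, "score": 0.75},
]
mismatches = replay_mismatches(logged, toy_scorer)
```

An empty result means every replayed score matched its logged value exactly; any identifier returned is a candidate for investigation (changed inputs, changed model version, or a modified log entry).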
§5 Documentation
SR 11-7 §5 calls for documentation that allows third parties to
understand and replicate the model's behavior. The
groundlens.audit.AuditLog captures the necessary
material per evaluation:
from groundlens import compute_sgi
from groundlens.audit import AuditLog
from groundlens.compliance import sgi_compliance_mapping

# case is a dict with "case_id", "question", "context", and "response" keys,
# supplied by the calling pipeline.
log = AuditLog(db_path="audit.sqlite")

result = compute_sgi(
    question=case["question"],
    context=case["context"],
    response=case["response"],
)

# Record inputs, outputs, and configuration for later review
log.record(
    identifier=case["case_id"],
    method=result.method,
    score=result.value,
    flagged=result.flagged,
    inputs={
        "question": case["question"],
        "context": case["context"],
        "response": case["response"],
    },
    metadata={
        "model": "all-MiniLM-L6-v2",
        "groundlens_version": "2026.6.7",
        "operator": "model_validation_unit",
    },
    compliance_mapping={
        "standards": list(sgi_compliance_mapping().standards()),
    },
)
The hash chain in AuditLog provides cryptographic evidence the log
has not been altered between evaluation and review.
§6 Vendor Models
SR 11-7 §6 explicitly extends model risk management to vendor models. groundlens is open source under the MIT license: the full method, implementation, and validation procedures are public and inspectable. The acquiring bank does not need to take vendor documentation on faith, because every line of code is available for independent review at github.com/groundlens-dev/groundlens.
The bundled calibration corpus is also open: see
groundlens.data.banking_reference_pairs_path
for the banking-specific corpus shipped with the library, and the
calibration guide for extending it with
deployment-specific pairs while keeping documentation transparent.
§7 Governance, Policies, and Controls
Thresholds and policy are explicit and version-controlled in the codebase:
from groundlens._internal.thresholds import (
    SGI_REVIEW,       # 0.95: flagged below this
    SGI_STRONG_PASS,  # 1.20: strong pass above this
    DGI_PASS,         # 0.30: DGI pass threshold
)
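To make the policy concrete, the sketch below shows how the two SGI thresholds could partition a score into review bands. It is an assumption-laden illustration: the constants are redefined locally with the documented values instead of being imported, the function sgi_band and the band names are hypothetical, and the handling of scores exactly at a threshold is a guess based on the "below" and "above" wording in the comments.

```python
SGI_REVIEW = 0.95       # documented value: flagged below this
SGI_STRONG_PASS = 1.20  # documented value: strong pass above this


def sgi_band(score: float) -> str:
    """Map an SGI score to a review band (band names are illustrative)."""
    if score < SGI_REVIEW:
        return "flagged"
    if score > SGI_STRONG_PASS:
        return "strong_pass"
    # Scores from SGI_REVIEW up to and including SGI_STRONG_PASS.
    return "pass"


bands = [sgi_band(s) for s in (0.80, 0.95, 1.10, 1.20, 1.25)]
```

Writing the boundary behavior down explicitly, whichever way the library resolves it, is worth doing in the validation report, since examiners tend to ask what happens at exactly the threshold.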
For domain-specific risk tolerance, the rules module accepts a
quality_floor parameter:
from groundlens.rules import banking_rules
# Higher floor = more conservative flagging — useful for high-stakes paths
strict_rules = banking_rules(quality_floor=0.4)
Any deviation from defaults should be documented in the model inventory entry for the deployment, with the rationale (risk tier, historical false-positive / false-negative rates, business justification).
Audit Trail
The audit log is the single source of truth for examiner review:
from groundlens.audit import AuditLog

log = AuditLog(db_path="audit.sqlite")

# Quick chain integrity check before producing an examiner export.
# Raise explicitly rather than assert, since asserts are skipped under python -O.
verification = log.verify_chain()
if not verification.valid:
    raise RuntimeError(
        f"Audit chain broken at entry {verification.broken_at_entry_id}: "
        f"{verification.reason}"
    )

# Export the period requested by the examiner
log.export_jsonl("examiner_export_2026Q2.jsonl")
A broken chain shows that an entry changed after it was written: the recomputed hash will not match the stored value. This tamper evidence is one of the strongest documentation guarantees available under SR 11-7 §5.
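The mechanism behind this guarantee can be illustrated with a generic hash chain. The sketch below is not the AuditLog implementation, whose storage format and hashing scheme are not described here; it assumes SHA-256 over the previous hash plus a canonical JSON payload, and the names GENESIS, entry_hash, append, and first_broken_index are hypothetical.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append(chain: List[dict], payload: dict) -> None:
    """Append a payload, linking it to the hash of the last entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "hash": entry_hash(prev_hash, payload)})


def first_broken_index(chain: List[dict]) -> Optional[int]:
    """Recompute every hash; return the index of the first mismatch, if any."""
    prev_hash = GENESIS
    for index, entry in enumerate(chain):
        if entry_hash(prev_hash, entry["payload"]) != entry["hash"]:
            return index
        prev_hash = entry["hash"]
    return None


# Build a two-entry chain from made-up records and verify it.
chain: List[dict] = []
append(chain, {"identifier": "case-1", "score": 1.31, "flagged": False})
append(chain, {"identifier": "case-2", "score": 0.82, "flagged": True})
intact = first_broken_index(chain)

# Simulate a post-hoc edit to the first entry, then verify again.
chain[0]["payload"]["score"] = 0.99
tampered = first_broken_index(chain)
```

Because each hash depends on the one before it, editing an old entry without rewriting every later hash is detected at the edited position. A chain like this does not by itself stop someone with full write access from rebuilding all hashes, so storing the latest hash somewhere outside the log is a common complement.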
Known limitations for SR 11-7 compliance
Document the following explicitly in any SR 11-7 model validation report covering groundlens:
- Within-frame errors not detectable. Type III hallucinations (correct frame, wrong facts) are documented as a property of embedding-based methods, not a defect. Complement with claim-level fact-checking for high-stakes domains.
- Calibration drift. The DGI reference direction shifts as the underlying language patterns shift. Schedule calibration refresh with a frequency proportional to deployment volume and risk appetite.
- Threshold sensitivity. Defaults are derived empirically from research benchmarks; tune to the specific risk tolerance of the deployment and document the tuning process in the model inventory.
Documentation requirement
The SR 11-7 model validation report should include these
limitations explicitly, alongside the documented design intent of
each scoring path via groundlens.compliance.get_mapping(). SR
11-7 prioritizes honest disclosure of model boundaries over claims
of universal capability.
References
- Board of Governors of the Federal Reserve System and Office of the Comptroller of the Currency. Supervisory Guidance on Model Risk Management (SR 11-7 / OCC 2011-12). April 4, 2011. federalreserve.gov/supervisionreg/srletters/sr1107.htm
- Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771.
- Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224v3.