SR 11-7 Compliance

The Federal Reserve / OCC Supervisory Guidance on Model Risk Management (SR 11-7, April 2011) is the dominant US standard for managing risks introduced by quantitative models used in banking decisions. It applies to any model whose output affects business decisions — and that includes LLM-based scoring and triage pipelines.

groundlens is designed to support SR 11-7 compliance through deterministic scoring, a full audit trail, and explicit documentation of design intent against the standard's clauses.

Why groundlens fits SR 11-7

SR 11-7 sets expectations around four pillars: model development, implementation, and use; validation; governance, policies, and controls; and ongoing monitoring. groundlens contributes to all four:

| SR 11-7 expectation | groundlens contribution |
| --- | --- |
| Conceptual soundness (§3) | SGI and DGI are published research with explicit mathematical formulations and known limitations |
| Outcomes analysis (§3) | Deterministic scoring means every output is reproducible byte-for-byte for retrospective review |
| Process verification (§3) | The `rules` module exposes audit-friendly sub-scores (spec / expl / bshift) with per-rule evidence |
| Documentation (§5) | Full audit log via `groundlens.audit.AuditLog`, including inputs, outputs, configuration, and hash-chained integrity |
| Governance, policies, controls (§7) | Thresholds are explicit constants in `groundlens._internal.thresholds`; rule weights configurable per deployment |
| Ongoing monitoring (§4) | Batch evaluation supports tracking flagged rate and score distributions over time |
| Effective challenge (§3) | The independent validation function can replay any past decision from the audit log and verify the chain has not been altered |

Making LLM-as-judge defensible under SR 11-7

A second LLM acting as judge is hard to defend under SR 11-7 on its own: it is non-deterministic, unreproducible, and unable to explain itself. But you cannot remove it either. A plausible but wrong figure stated in the right frame is invisible to embedding geometry, and it is exactly the kind of error a model risk function cares about.

groundlens makes the judge defensible by moving it downstream. Stage 1 is deterministic, reproducible and cheap, and it runs on everything. Stage 2, the judge or the human, runs only on what Stage 1 escalates, and the audit log records which cases were escalated and why. That log, not the score, is the SR 11-7 artifact.
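The routing described above can be sketched as follows. This is a minimal illustration rather than groundlens code: `triage`, `Decision`, the `REVIEW_THRESHOLD` value, and the toy scorer and judge are hypothetical stand-ins for the deployment's deterministic scorer, its Stage 2 reviewer (LLM or human), and its audit schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REVIEW_THRESHOLD = 0.95  # assumed cut-off; scores below it are escalated


@dataclass
class Decision:
    case_id: str
    stage1_score: float
    escalated: bool
    reason: str
    judge_verdict: Optional[str] = None


def triage(
    case_id: str,
    text: str,
    stage1_score: Callable[[str], float],
    judge: Callable[[str], str],
) -> Decision:
    """Run the deterministic stage on every case; call the judge only on escalations."""
    score = stage1_score(text)
    if score >= REVIEW_THRESHOLD:
        return Decision(case_id, score, escalated=False, reason="stage1 pass")
    reason = f"stage1 score {score:.2f} below {REVIEW_THRESHOLD:.2f}"
    return Decision(case_id, score, True, reason, judge_verdict=judge(text))


# Toy stand-ins so the sketch runs end to end.
def toy_score(text: str) -> float:
    return 1.2 if "grounded" in text else 0.5


judge_calls = []


def toy_judge(text: str) -> str:
    judge_calls.append(text)  # record which cases actually reached Stage 2
    return "needs human review"


decisions = [
    triage("c1", "grounded answer", toy_score, toy_judge),
    triage("c2", "unsupported answer", toy_score, toy_judge),
]
```

Each `Decision` carries the escalation flag and its reason, which is the information the text above says the audit log should retain.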

SR 11-7 concern	LLM-as-judge	groundlens
Conceptual soundness	The judge LLM shares failure modes with the model under evaluation; circular trust	Geometric scorer in embedding space; deterministic mathematical operations
Reproducibility	Non-deterministic sampling; same input may produce different results	Same inputs always produce the same score, byte-for-byte
Outcomes analysis	Outcomes shift when the judge LLM is updated or its prompts retuned	Method does not change with LLM upgrades; historical comparisons stay valid
Documentation	The judge LLM's reasoning is opaque and version-dependent	Distance ratios and angle alignments; every score decomposable into inspectable components
Effective challenge	Hard to mount an independent challenge against an opaque LLM judge	An auditor can replay any score from logged inputs and verify the result

§3 Model Validation

SR 11-7 §3 requires three components of independent model validation: conceptual soundness, ongoing monitoring, and outcomes analysis.

Conceptual soundness

groundlens implements three published methods with explicit mathematical definitions:

SGI (arXiv:2512.13771) — distance ratio dist(response, question) / dist(response, context) in sentence-transformer embedding space.
DGI (arXiv:2602.13224v3) — directional alignment dot(unit(phi(r) - phi(q)), mu_hat) with calibrated reference direction.
Rule sub-scores — checklist evaluation producing specificity, explanatory linkage, and boundary shift signals in [0, 1].
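As a rough illustration of the two geometric formulas, the sketch below evaluates them on toy vectors in plain Python. It assumes Euclidean distance for `dist` and takes the embeddings and the reference direction `mu_hat` as given. The metric, embedding model, and calibration that groundlens actually uses are defined by the cited papers and the library, not by this sketch.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dist(a: Vector, b: Vector) -> float:
    """Euclidean distance (an assumption; the papers define the actual metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unit(v: Vector) -> list[float]:
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def sgi(question: Vector, context: Vector, response: Vector) -> float:
    """dist(response, question) / dist(response, context)."""
    return dist(response, question) / dist(response, context)


def dgi(question: Vector, response: Vector, mu_hat: Vector) -> float:
    """dot(unit(phi(r) - phi(q)), mu_hat) for already-embedded inputs."""
    displacement = unit([r - q for r, q in zip(response, question)])
    return sum(d * m for d, m in zip(displacement, mu_hat))


# Toy 2-D "embeddings": the response sits close to the context.
q = [0.0, 0.0]
c = [4.0, 0.0]
r = [3.0, 0.0]
mu_hat = [1.0, 0.0]  # assumed unit-length reference direction

sgi_value = sgi(q, c, r)       # 3.0 / 1.0
dgi_value = dgi(q, r, mu_hat)  # displacement points exactly along mu_hat
```

In this toy case the response lies three times farther from the question than from the context, so the ratio is above 1, and the displacement is perfectly aligned with the reference direction.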

Limitations are documented explicitly in Confabulation Boundary: Type III within-frame errors (right vocabulary, wrong facts) are not detectable by embedding geometry. This is documented as a property, not a defect.

Ongoing monitoring

from groundlens import evaluate_batch
from groundlens.audit import AuditLog

log = AuditLog(db_path="production_audit.sqlite")

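# batch_inputs (a list of dicts, each with a "case_id") and batch_id are
# supplied by the calling pipeline.
# Per-batch summary fed to the monitoring dashboard
results = evaluate_batch(batch_inputs)
flagged_rate = sum(r.flagged for r in results) / len(results) if results else 0.0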

for inputs, result in zip(batch_inputs, results):
    log.record(
        identifier=inputs["case_id"],
        method=result.method,
        score=result.value,
        flagged=result.flagged,
        inputs=inputs,
        metadata={"batch_id": batch_id, "model_version": "2026.6.7"},
    )

Define performance triggers (e.g. flagged rate > 15% for three consecutive days) that initiate investigation per SR 11-7 §4.
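A trigger of this kind can be checked with a few lines of code. The sketch below is one possible reading of "flagged rate > 15% for three consecutive days"; the function name, the default values, and the sample rates are illustrative assumptions, not part of groundlens.

```python
from typing import Sequence


def trigger_fired(
    daily_flagged_rates: Sequence[float],
    threshold: float = 0.15,
    consecutive_days: int = 3,
) -> bool:
    """Return True if the rate exceeded `threshold` on `consecutive_days` days in a row."""
    streak = 0
    for rate in daily_flagged_rates:
        # A day at or below the threshold resets the streak.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive_days:
            return True
    return False


# Made-up daily rates, oldest first.
quiet_week = [0.08, 0.16, 0.17, 0.12, 0.18, 0.19, 0.10]  # never three in a row
noisy_week = [0.08, 0.16, 0.17, 0.18, 0.12, 0.09, 0.10]  # days 2 to 4 all exceed 15%
```

The comparison is strict, so a rate of exactly 15% does not count toward the streak. Whichever convention a deployment picks should be recorded alongside the trigger definition.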

Outcomes analysis

Because every evaluation is deterministic, retrospective analysis is straightforward: replay any historical decision from the audit log, and the produced score matches the originally logged score byte-for-byte. This is the property SR 11-7 §3 calls for and that sampling-based judges cannot provide.
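A replay check along these lines might look like the sketch below. `LoggedEvaluation` is a hypothetical record shape and `toy_scorer` a stand-in for the real deterministic scorer; neither reflects the actual `AuditLog` schema or API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LoggedEvaluation:
    """Hypothetical audit entry: the inputs plus the score logged at the time."""
    case_id: str
    question: str
    context: str
    response: str
    score: float


def replay_mismatches(
    entries: Iterable[LoggedEvaluation],
    scorer: Callable[[str, str, str], float],
) -> list[str]:
    """Re-score each logged case and return the ids whose score no longer matches exactly."""
    return [
        e.case_id
        for e in entries
        if scorer(e.question, e.context, e.response) != e.score
    ]


# Toy deterministic scorer: share of response words that also occur in the context.
def toy_scorer(question: str, context: str, response: str) -> float:
    response_words = set(response.split())
    shared = response_words & set(context.split())
    return len(shared) / max(len(response_words), 1)


history = [
    LoggedEvaluation("a", "q", "rates rose in May", "rates rose", 1.0),
    LoggedEvaluation("b", "q", "rates rose in May", "rates fell", 0.9),  # logged score is wrong
]
mismatches = replay_mismatches(history, toy_scorer)
```

Exact equality is used on purpose: if scoring is deterministic, any difference between the replayed and logged score is a finding to investigate rather than noise to tolerate.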

§5 Documentation

SR 11-7 §5 calls for documentation that allows third parties to understand and replicate the model's behavior. The groundlens.audit.AuditLog captures the necessary material per evaluation:

from groundlens import compute_sgi
from groundlens.audit import AuditLog
from groundlens.compliance import sgi_compliance_mapping

log = AuditLog(db_path="audit.sqlite")

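# `case` is one record from the deployment's case store, holding a "case_id"
# plus the question, context, and response to score.
result = compute_sgi(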
    question=case["question"],
    context=case["context"],
    response=case["response"],
)

log.record(
    identifier=case["case_id"],
    method=result.method,
    score=result.value,
    flagged=result.flagged,
    inputs={
        "question": case["question"],
        "context": case["context"],
        "response": case["response"],
    },
    metadata={
        "model": "sentence-transformers/sentence-t5-large",
        "groundlens_version": "2026.6.7",
        "operator": "model_validation_unit",
    },
    compliance_mapping={
        "standards": list(sgi_compliance_mapping().standards()),
    },
)

The hash chain in AuditLog provides cryptographic evidence that the log has not been altered between evaluation and review.

§6 Vendor Models

SR 11-7 §6 explicitly extends model risk management to vendor models. groundlens is open source under the Apache-2.0 license: the full method, implementation, and validation procedures are public and inspectable. The acquiring bank does not need to take vendor documentation on faith — every line of code is available for independent review at github.com/groundlens-dev/groundlens.

The bundled calibration corpus is open: see groundlens.data.reference_pairs_path for the cross-domain corpus shipped with the library. It contains 212 verified (question, grounded_response, fabricated_response) triples across nine domains, sourced from the open groundlens-dev/grounding-benchmark repository under CC BY 4.0. A regulated deployment extends it with deployment-specific verified pairs; see the calibration guide for the procedure.

§7 Governance, Policies, and Controls

Thresholds and policy are explicit and version-controlled in the codebase:

from groundlens._internal.thresholds import (
    SGI_REVIEW,        # 0.95 — flagged below this
    SGI_STRONG_PASS,   # 1.20 — strong pass above this
    DGI_PASS,          # 0.30 — DGI pass threshold
)

For domain-specific risk tolerance, the rules module accepts a quality_floor parameter:

from groundlens.rules import banking_rules

# Higher floor = more conservative flagging — useful for high-stakes paths
strict_rules = banking_rules(quality_floor=0.4)

Any deviation from defaults should be documented in the model inventory entry for the deployment, with the rationale (risk tier, historical false-positive / false-negative rates, business justification).
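One way to keep such deviations reviewable is to record them as structured data next to the inventory entry. The sketch below is an assumption about what that record could look like: the `ThresholdDeviation` fields and the deployed value are illustrative, and only the 0.95 default for `SGI_REVIEW` comes from the constants shown above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ThresholdDeviation:
    parameter: str
    default_value: float
    deployed_value: float
    risk_tier: str
    rationale: str


def inventory_fragment(deviations: list[ThresholdDeviation]) -> str:
    """Serialise the deviations deterministically for the model inventory entry."""
    for d in deviations:
        if not d.rationale.strip():
            # Refuse to record a deviation that has no written justification.
            raise ValueError(f"{d.parameter}: a rationale is required")
    return json.dumps([asdict(d) for d in deviations], indent=2, sort_keys=True)


entry = inventory_fragment([
    ThresholdDeviation(
        parameter="SGI_REVIEW",
        default_value=0.95,
        deployed_value=1.00,  # illustrative value only
        risk_tier="high",
        rationale="Illustrative: tighter review band for credit-decision paths",
    )
])
```

Sorting the keys keeps the serialised fragment stable across runs, which makes changes between inventory revisions easy to diff.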

Audit Trail

The audit log is the single source of truth for examiner review:

from groundlens.audit import AuditLog

log = AuditLog(db_path="audit.sqlite")

# Quick chain integrity check before producing an examiner export
verification = log.verify_chain()
if not verification.valid:
    # Raise rather than assert: assertions are skipped when Python runs with -O
    raise RuntimeError(
        f"Audit chain broken at entry {verification.broken_at_entry_id}: "
        f"{verification.reason}"
    )

# Export the log for the examiner (restrict it to the requested period as needed)
log.export_jsonl("examiner_export_2026Q2.jsonl")

A broken chain is unambiguous evidence of post-hoc modification: the recomputed hash will not match the stored value. This tamper evidence is one of the strongest documentation guarantees available under SR 11-7 §5.
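The mechanism can be illustrated with a generic hash chain. The sketch below assumes SHA-256 over a canonical JSON payload concatenated with the previous entry's hash; the scheme `AuditLog` actually uses is not described here, so treat the field names and helper functions as hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev, "hash": entry_hash(prev, payload)})


def first_broken_index(chain: list):
    """Return the index of the first entry that fails verification, or None if intact."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        recomputed = entry_hash(prev, entry["payload"])
        if entry["prev_hash"] != prev or entry["hash"] != recomputed:
            return i
        prev = entry["hash"]
    return None


chain = []
append(chain, {"case_id": "c1", "score": 1.10, "flagged": False})
append(chain, {"case_id": "c2", "score": 0.80, "flagged": True})
append(chain, {"case_id": "c3", "score": 1.30, "flagged": False})
intact = first_broken_index(chain)

chain[1]["payload"]["flagged"] = False  # simulate a post-hoc edit of a stored entry
broken = first_broken_index(chain)
```

Because each hash covers the previous one, silently editing an entry would require recomputing every later hash as well, which is what makes the modification detectable on replay.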

Document the following explicitly in any SR 11-7 model validation report covering groundlens:

Within-frame errors not detectable. Type III hallucinations (correct frame, wrong facts) are documented as a property of embedding-based methods, not a defect. Complement with claim-level fact-checking for high-stakes domains.
Calibration drift. The DGI reference direction shifts as the underlying language patterns shift. Schedule calibration refresh with a frequency proportional to deployment volume and risk appetite.
Threshold sensitivity. Defaults are derived empirically from research benchmarks; tune to the specific risk tolerance of the deployment and document the tuning process in the model inventory.
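For the threshold-sensitivity item, a validation report can include a simple sweep showing how the flagged rate moves as the review threshold changes. The sketch below uses made-up scores and assumes that a case is flagged when its score falls strictly below the threshold; the function names are illustrative.

```python
from typing import Sequence


def flagged_rate(scores: Sequence[float], threshold: float) -> float:
    """Share of scores strictly below `threshold`, i.e. flagged for review."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(s < threshold for s in scores) / len(scores)


def sweep(scores: Sequence[float], thresholds: Sequence[float]) -> dict[float, float]:
    """Flagged rate at each candidate threshold."""
    return {t: flagged_rate(scores, t) for t in thresholds}


sample_scores = [0.70, 0.90, 0.96, 1.00, 1.10, 1.25, 1.30, 1.40]  # made-up scores
sensitivity = sweep(sample_scores, [0.90, 0.95, 1.00])
```

Running the same sweep on deployment data, and recording the chosen threshold with the resulting flagged rate, gives the model inventory a concrete basis for the tuning decision.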

Documentation requirement

The SR 11-7 model validation report should include these limitations explicitly, alongside the documented design intent of each scoring path via groundlens.compliance.get_mapping(). SR 11-7 prioritizes honest disclosure of model boundaries over claims of universal capability.

References

Board of Governors of the Federal Reserve System and Office of the Comptroller of the Currency. Supervisory Guidance on Model Risk Management (SR 11-7 / OCC 2011-12). April 4, 2011. federalreserve.gov/supervisionreg/srletters/sr1107.htm
Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771.
Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224v3.