SR 11-7 Compliance

The Federal Reserve / OCC Supervisory Guidance on Model Risk Management (SR 11-7, April 2011) is the dominant US standard for managing risks introduced by quantitative models used in banking decisions. It applies to any model whose output affects business decisions — and that includes LLM-based scoring and triage pipelines.

groundlens is designed to support SR 11-7 compliance through deterministic scoring, a full audit trail, and explicit documentation of design intent against the standard's clauses.

Why groundlens fits SR 11-7

SR 11-7 sets expectations around four pillars: model development, implementation, and use; validation; governance, policies, and controls; and ongoing monitoring. groundlens contributes to all four:

| SR 11-7 expectation | groundlens contribution |
| --- | --- |
| Conceptual soundness (§3) | SGI and DGI are published research with explicit mathematical formulations and known limitations |
| Outcomes analysis (§3) | Deterministic scoring means every output is reproducible byte-for-byte for retrospective review |
| Process verification (§3) | The rules module exposes audit-friendly sub-scores (spec / expl / bshift) with per-rule evidence |
| Documentation (§5) | Full audit log via groundlens.audit.AuditLog, including inputs, outputs, configuration, and hash-chained integrity |
| Governance, policies, controls (§7) | Thresholds are explicit constants in groundlens._internal.thresholds; rule weights configurable per deployment |
| Ongoing monitoring (§4) | Batch evaluation supports tracking flagged rate and score distributions over time |
| Effective challenge (§3) | The independent validation function can replay any past decision from the audit log and verify the chain has not been altered |

Replacing LLM-as-judge under SR 11-7

The most common pattern for hallucination detection — a second LLM acting as judge — is hard to defend under SR 11-7. groundlens removes the second LLM entirely.

| SR 11-7 concern | LLM-as-judge | groundlens |
| --- | --- | --- |
| Conceptual soundness | The judge LLM shares failure modes with the model under evaluation; circular trust | Geometric scorer in embedding space; deterministic mathematical operations |
| Reproducibility | Non-deterministic sampling; same input may produce different verdicts | Same inputs always produce the same score, byte-for-byte |
| Outcomes analysis | Outcomes shift when the judge LLM is updated or its prompts retuned | Method does not change with LLM upgrades; historical comparisons stay valid |
| Documentation | The judge LLM's reasoning is opaque and version-dependent | Distance ratios and angle alignments; every score decomposable into inspectable components |
| Effective challenge | Hard to mount an independent challenge against an opaque LLM judge | An auditor can replay any score from logged inputs and verify the result |

§3 Model Validation

SR 11-7 §3 requires three components of independent model validation: conceptual soundness, ongoing monitoring, and outcomes analysis.

Conceptual soundness

groundlens implements three published methods with explicit mathematical definitions:

  • SGI (arXiv:2512.13771) — distance ratio dist(response, question) / dist(response, context) in sentence-transformer embedding space.
  • DGI (arXiv:2602.13224v3) — directional alignment dot(unit(phi(r) - phi(q)), mu_hat) with calibrated reference direction.
  • Rule sub-scores — checklist evaluation producing specificity, explanatory linkage, and boundary shift signals in [0, 1].
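The two geometric formulas above can be illustrated on toy vectors. This is a minimal sketch, not the library's implementation: it assumes Euclidean distance for dist (the formula does not name a metric), uses hand-picked 2-D vectors in place of real sentence embeddings, and takes the reference direction mu_hat as given rather than calibrating it.

```python
import math

def _sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def _norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def sgi(q, c, r):
    """Distance ratio dist(r, q) / dist(r, c). Euclidean distance is an assumption.

    Undefined (division by zero) when the response and context embeddings coincide.
    """
    return _norm(_sub(r, q)) / _norm(_sub(r, c))

def dgi(q, r, mu_hat):
    """Dot product of the unit displacement (r - q) with the reference direction."""
    d = _sub(r, q)
    n = _norm(d)
    return sum((x / n) * m for x, m in zip(d, mu_hat))

# Toy 2-D "embeddings": hypothetical values, not model output.
question = [0.0, 0.0]
context = [3.0, 4.0]
response = [3.0, 0.0]
mu_hat = [1.0, 0.0]  # assumed unit-length reference direction

sgi_value = sgi(question, context, response)  # 3 / 4
dgi_value = dgi(question, response, mu_hat)   # displacement is parallel to mu_hat
```

Here the response sits 3 units from the question and 4 units from the context, so the ratio is 0.75; the displacement from question to response points exactly along mu_hat, so the alignment is 1.0.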

Limitations are documented explicitly in Confabulation Boundary: Type III within-frame errors (right vocabulary, wrong facts) are not detectable by embedding geometry. This is documented as a property, not a defect.

Ongoing monitoring

from groundlens import evaluate_batch
from groundlens.audit import AuditLog

log = AuditLog(db_path="production_audit.sqlite")

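# batch_inputs (a list of dicts, each carrying a "case_id") and batch_id
# are supplied by the calling pipeline.
# Per-batch summary fed to the monitoring dashboard
results = evaluate_batch(batch_inputs)
# Guard against an empty batch to avoid a ZeroDivisionError
flagged_rate = sum(r.flagged for r in results) / len(results) if results else 0.0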

for inputs, result in zip(batch_inputs, results):
    log.record(
        identifier=inputs["case_id"],
        method=result.method,
        score=result.value,
        flagged=result.flagged,
        inputs=inputs,
        metadata={"batch_id": batch_id, "model_version": "2026.6.7"},
    )

Define performance triggers (e.g. flagged rate > 15% for three consecutive days) that initiate investigation per SR 11-7 §4.
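A trigger of that shape is easy to state precisely in code. The sketch below is one possible reading of "flagged rate > 15% for three consecutive days"; the function name and the example rates are hypothetical, and the daily rates are assumed to arrive oldest first.

```python
def trigger_fired(daily_flagged_rates, threshold=0.15, consecutive_days=3):
    """Return True if the rate exceeded `threshold` on `consecutive_days` days in a row."""
    run = 0
    for rate in daily_flagged_rates:
        # Extend the current streak, or reset it on any day at or below the threshold.
        run = run + 1 if rate > threshold else 0
        if run >= consecutive_days:
            return True
    return False

# Hypothetical daily flagged rates, oldest first.
quiet_week = [0.08, 0.16, 0.17, 0.12, 0.18, 0.16, 0.10]  # never three in a row
bad_week = [0.08, 0.16, 0.17, 0.19, 0.12, 0.10, 0.09]    # days 2-4 all exceed 15%

quiet_result = trigger_fired(quiet_week)
bad_result = trigger_fired(bad_week)
```

Requiring consecutive days rather than a single breach keeps one noisy batch from opening an investigation; the threshold and streak length are the parameters to record in the monitoring plan.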

Outcomes analysis

Because every evaluation is deterministic, retrospective analysis is straightforward: replay any historical decision from the audit log, and the produced score matches the originally logged score byte-for-byte. This is the property SR 11-7 §3 calls for and that sampling-based judges cannot provide.
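A replay check along these lines might look like the following sketch. It does not use the groundlens API: the record layout (identifier, inputs, score) and the stand-in scorer are assumptions for illustration, and any deterministic scoring callable could be substituted.

```python
import struct

def same_bytes(a: float, b: float) -> bool:
    """Compare two floats by their IEEE 754 double encoding, not by tolerance."""
    return struct.pack("<d", a) == struct.pack("<d", b)

def replay_mismatches(records, scorer):
    """Re-score each logged record and return the identifiers whose score changed.

    `records` is an iterable of dicts with hypothetical keys "identifier",
    "inputs", and "score"; `scorer` maps the logged inputs to a float.
    """
    return [
        rec["identifier"]
        for rec in records
        if not same_bytes(scorer(**rec["inputs"]), rec["score"])
    ]

def toy_scorer(question, context, response):
    """Deterministic stand-in for a real scorer (illustration only)."""
    return len(response) / (len(question) + len(context))

# Hypothetical logged records: case-1 matches its replay, case-2 does not.
records = [
    {"identifier": "case-1",
     "inputs": {"question": "abcd", "context": "abcdef", "response": "abc"},
     "score": 0.3},
    {"identifier": "case-2",
     "inputs": {"question": "ab", "context": "ab", "response": "ab"},
     "score": 0.5000001},
]

mismatches = replay_mismatches(records, toy_scorer)
```

Comparing encodings rather than using a tolerance is what makes the check "byte-for-byte": a difference in the last bit counts as a mismatch.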

§5 Documentation

SR 11-7 §5 calls for documentation that allows third parties to understand and replicate the model's behavior. The groundlens.audit.AuditLog captures the necessary material per evaluation:

from groundlens import compute_sgi
from groundlens.audit import AuditLog
from groundlens.compliance import sgi_compliance_mapping

log = AuditLog(db_path="audit.sqlite")

result = compute_sgi(
    question=case["question"],
    context=case["context"],
    response=case["response"],
)

log.record(
    identifier=case["case_id"],
    method=result.method,
    score=result.value,
    flagged=result.flagged,
    inputs={
        "question": case["question"],
        "context": case["context"],
        "response": case["response"],
    },
    metadata={
        "model": "all-MiniLM-L6-v2",
        "groundlens_version": "2026.6.7",
        "operator": "model_validation_unit",
    },
    compliance_mapping={
        "standards": list(sgi_compliance_mapping().standards()),
    },
)

The hash chain in AuditLog provides cryptographic evidence the log has not been altered between evaluation and review.

§6 Vendor Models

SR 11-7 §6 explicitly extends model risk management to vendor models. groundlens is open source under the MIT license: the full method, implementation, and validation procedures are public and inspectable. The acquiring bank does not need to take vendor documentation on faith, because every line of code is available for independent review at github.com/groundlens-dev/groundlens.

The bundled calibration corpus is also open: see groundlens.data.banking_reference_pairs_path for the banking-specific corpus shipped with the library, and the calibration guide for extending it with deployment-specific pairs while keeping documentation transparent.

§7 Governance, Policies, and Controls

Thresholds and policy are explicit and version-controlled in the codebase:

from groundlens._internal.thresholds import (
    SGI_REVIEW,        # 0.95 — flagged below this
    SGI_STRONG_PASS,   # 1.20 — strong pass above this
    DGI_PASS,          # 0.30 — DGI pass threshold
)

For domain-specific risk tolerance, the rules module accepts a quality_floor parameter:

from groundlens.rules import banking_rules

# Higher floor = more conservative flagging — useful for high-stakes paths
strict_rules = banking_rules(quality_floor=0.4)

Any deviation from defaults should be documented in the model inventory entry for the deployment, with the rationale (risk tier, historical false-positive / false-negative rates, business justification).
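One way to keep such deviations from going undocumented is to generate the inventory rows from the deployed configuration and refuse to proceed when a rationale is missing. The sketch below is a hypothetical helper, not part of groundlens; it copies the three default values from the thresholds listing above, and the names ThresholdDeviation and deviations are invented for illustration.

```python
from dataclasses import dataclass, asdict

# Default values as shown in the thresholds listing in this section.
DEFAULTS = {"SGI_REVIEW": 0.95, "SGI_STRONG_PASS": 1.20, "DGI_PASS": 0.30}

@dataclass(frozen=True)
class ThresholdDeviation:
    """One row of a model inventory entry (hypothetical layout)."""
    name: str
    default: float
    deployed: float
    rationale: str

def deviations(deployed, rationales):
    """List every deployed threshold that differs from its default.

    Raises ValueError when a changed threshold has no recorded rationale.
    """
    rows = []
    for name, default in DEFAULTS.items():
        value = deployed.get(name, default)
        if value == default:
            continue  # unchanged thresholds need no inventory row
        if not rationales.get(name):
            raise ValueError(f"No rationale recorded for {name}")
        rows.append(ThresholdDeviation(name, default, value, rationales[name]))
    return rows

# Hypothetical deployment: only SGI_REVIEW is tightened.
entry = deviations(
    deployed={"SGI_REVIEW": 1.00, "DGI_PASS": 0.30},
    rationales={"SGI_REVIEW": "Tier 1 decisions; lower tolerance for false negatives"},
)
inventory_rows = [asdict(row) for row in entry]
```

The resulting dictionaries can be serialized into whatever format the model inventory uses.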

Audit Trail

The audit log is the single source of truth for examiner review:

from groundlens.audit import AuditLog

log = AuditLog(db_path="audit.sqlite")

# Quick chain integrity check before producing an examiner export.
# An explicit raise is used because assert statements are skipped under python -O.
verification = log.verify_chain()
if not verification.valid:
    raise RuntimeError(
        f"Audit chain broken at entry {verification.broken_at_entry_id}: "
        f"{verification.reason}"
    )

# Export the period requested by the examiner
log.export_jsonl("examiner_export_2026Q2.jsonl")

A broken chain is strong evidence that an entry was modified after it was written: the recomputed hash will not match the stored value. This tamper evidence is one of the strongest documentation guarantees available under SR 11-7 §5.
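The mechanism behind a hash-chained log can be shown in a few lines. This is a generic sketch of the technique using SHA-256 and canonical JSON; it is not the AuditLog implementation, and the entry layout, genesis value, and function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's predecessor

def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append(chain, payload):
    """Append a payload, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "hash": entry_hash(prev, payload)})

def first_broken_index(chain):
    """Return the index of the first entry whose stored hash fails recomputation."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        if entry_hash(prev, entry["payload"]) != entry["hash"]:
            return i
        prev = entry["hash"]
    return None

chain = []
append(chain, {"identifier": "case-1", "score": 0.91, "flagged": True})
append(chain, {"identifier": "case-2", "score": 1.31, "flagged": False})
append(chain, {"identifier": "case-3", "score": 1.02, "flagged": False})

intact = first_broken_index(chain)    # None: every hash recomputes
chain[1]["payload"]["score"] = 0.99   # simulate a post-hoc edit
tampered = first_broken_index(chain)  # index of the edited entry
```

A chain of this kind detects edits to stored payloads. Someone who can rewrite every later hash as well could rebuild a consistent chain, which is why the latest hash is usually also recorded somewhere outside the log.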

Known limitations for SR 11-7 compliance

Document the following explicitly in any SR 11-7 model validation report covering groundlens:

  1. Within-frame errors not detectable. Type III hallucinations (correct frame, wrong facts) are documented as a property of embedding-based methods, not a defect. Complement with claim-level fact-checking for high-stakes domains.
  2. Calibration drift. The calibrated DGI reference direction becomes less representative as the underlying language patterns shift. Schedule calibration refresh with a frequency proportional to deployment volume and risk appetite.
  3. Threshold sensitivity. Defaults are derived empirically from research benchmarks; tune to the specific risk tolerance of the deployment and document the tuning process in the model inventory.
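For the threshold sensitivity item, the tuning process can be documented as a sweep over candidate thresholds on a labeled validation set. The sketch below assumes SGI-style flagging (scores below the threshold are flagged); the sample scores and labels are invented for illustration and are not benchmark results.

```python
def error_rates(samples, threshold):
    """Return (false_positive_rate, false_negative_rate) when flagging scores below `threshold`.

    `samples` is a list of (score, is_hallucination) pairs from a labeled
    validation set. A false positive is a grounded response that was flagged;
    a false negative is a hallucination that was not.
    """
    fp = sum(1 for s, bad in samples if s < threshold and not bad)
    fn = sum(1 for s, bad in samples if s >= threshold and bad)
    grounded = sum(1 for _, bad in samples if not bad)
    hallucinated = len(samples) - grounded
    fpr = fp / grounded if grounded else 0.0
    fnr = fn / hallucinated if hallucinated else 0.0
    return fpr, fnr

# Hypothetical labeled scores: (score, is_hallucination)
samples = [
    (0.70, True), (0.85, True), (0.97, True), (1.05, True),
    (0.90, False), (1.00, False), (1.10, False), (1.30, False),
]

# Error rates at three candidate review thresholds.
sweep = {t: error_rates(samples, t) for t in (0.95, 1.00, 1.10)}
```

On this toy data, raising the threshold from 0.95 to 1.10 removes the false negatives while doubling the false-positive rate; the model inventory should record the trade-off that was accepted and why.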

Documentation requirement

The SR 11-7 model validation report should include these limitations explicitly, alongside the documented design intent of each scoring path via groundlens.compliance.get_mapping(). SR 11-7 prioritizes honest disclosure of model boundaries over claims of universal capability.

References

  • Board of Governors of the Federal Reserve System and Office of the Comptroller of the Currency. Supervisory Guidance on Model Risk Management (SR 11-7 / OCC 2011-12). April 4, 2011. federalreserve.gov/supervisionreg/srletters/sr1107.htm
  • Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771.
  • Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224v3.