CLI Reference

groundlens provides a command-line interface for quick checks, batch evaluation, calibration, and benchmarking. All commands are available via the groundlens entry point.

groundlens --help
groundlens --version

groundlens check

Evaluate a single response for hallucination risk.

# With context (uses SGI)
groundlens check \
    --question "What is the capital of France?" \
    --response "The capital of France is Paris." \
    --context "France is in Western Europe. Its capital is Paris."

# Without context (uses DGI)
groundlens check \
    --question "What causes seasons on Earth?" \
    --response "Seasons are caused by Earth's 23.5-degree axial tilt."

Output:

Method:      sgi
Score:       1.2341
Normalized:  0.6142
Flagged:     False
Explanation: SGI=1.234 -- strong context engagement (pass)

Options:

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `--question` | Yes | --- | The input question |
| `--response` | Yes | --- | The LLM response to evaluate |
| `--context` | No | None | Source context (enables SGI when provided) |
| `--model` | No | `all-MiniLM-L6-v2` | Sentence-transformer model |

groundlens evaluate

Batch evaluate a CSV file of question/response pairs.

groundlens evaluate input.csv --output results.csv

Input CSV format:

question,response,context
"What is X?","X is Y.","According to the manual, X is Y."
"What causes rain?","Rain is caused by condensation.",

The context column is optional. When present, SGI is used; when absent or empty, DGI is used.
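An input file in this format can be written with Python's standard csv module. The rows below are illustrative; the column names match the format above:

```python
import csv

# Rows to evaluate; an empty context means DGI will be used for that row.
rows = [
    {"question": "What is X?", "response": "X is Y.",
     "context": "According to the manual, X is Y."},
    {"question": "What causes rain?", "response": "Rain is caused by condensation.",
     "context": ""},  # empty context -> DGI
]

with open("input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "response", "context"])
    writer.writeheader()
    writer.writerows(rows)
```

Passing `newline=""` to `open()` is the documented way to avoid blank-line artifacts when writing CSV on Windows.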

Output CSV includes all original columns plus:

| Column | Description |
|--------|-------------|
| `groundlens_method` | `sgi` or `dgi` |
| `groundlens_score` | Raw score value |
| `groundlens_normalized` | Score in [0, 1] |
| `groundlens_flagged` | `True` or `False` |
| `groundlens_explanation` | Human-readable interpretation |

Options:

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `input_csv` | Yes (positional) | --- | Input CSV file path |
| `--output` | Yes | --- | Output CSV file path |
| `--model` | No | `all-MiniLM-L6-v2` | Sentence-transformer model |
| `--reference-csv` | No | None | DGI calibration CSV path |

CI/CD integration

Use groundlens evaluate in your CI pipeline to gate deployments on hallucination scores. Parse the output CSV and fail the build if any row has groundlens_flagged=True.
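A minimal gate script, sketched with the standard library only. The column names come from the table above; the sample rows and explanation strings are made up for the example:

```python
import csv
import io

def gate(csv_text: str) -> list[dict]:
    """Return the rows flagged by groundlens in an evaluate output CSV."""
    return [row for row in csv.DictReader(io.StringIO(csv_text))
            if row.get("groundlens_flagged") == "True"]

# Illustrative evaluate output (a subset of columns; contents are made up).
sample = (
    "question,response,groundlens_flagged,groundlens_explanation\n"
    'What is X?,X is Z.,True,"DGI=0.12 -- weak grounding (flag)"\n'
    "What causes rain?,Condensation.,False,DGI=1.05 -- ok\n"
)

flagged = gate(sample)
if flagged:
    print(f"{len(flagged)} flagged row(s)")
```

In a real pipeline you would read `results.csv` from disk and call `sys.exit(1)` when `gate()` returns a non-empty list, since a non-zero exit code fails the build in most CI systems.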

groundlens calibrate

Compute a DGI reference direction from domain-specific calibration pairs.

groundlens calibrate \
    --pairs domain_pairs.csv \
    --output calibration.json

Input CSV format:

question,response
"What is the dosage for ibuprofen?","The recommended dosage is 200-400mg every 4-6 hours."
"What are the side effects of aspirin?","Common side effects include stomach irritation and bleeding risk."

The CSV must contain verified grounded (question, response) pairs from your target domain. A minimum of 5 pairs is required; 20--100 pairs are recommended for reliable calibration.
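A pre-flight check of a pairs file before running calibrate can be sketched with the standard library; the thresholds below mirror the guidance above:

```python
import csv

MIN_PAIRS = 5                   # hard minimum required by calibrate
RECOMMENDED = (20, 100)         # recommended range for reliable calibration

def check_pairs(path: str) -> int:
    """Validate a calibration pairs CSV and return the usable pair count."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if not {"question", "response"} <= set(reader.fieldnames or []):
            raise ValueError("CSV must contain question,response columns")
        n = sum(1 for row in reader if row["question"] and row["response"])
    if n < MIN_PAIRS:
        raise ValueError(f"only {n} pairs; at least {MIN_PAIRS} required")
    if n < RECOMMENDED[0]:
        print(f"warning: {n} pairs; {RECOMMENDED[0]}-{RECOMMENDED[1]} recommended")
    return n
```

Rows with an empty question or response are skipped rather than counted, on the assumption that calibrate cannot use them.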

Output:

Calibration complete.
  Pairs:         47
  Embedding dim: 384
  Concentration: 12.34
  Saved to:      calibration.json

The saved JSON contains the reference direction vector (mu_hat), the concentration parameter (kappa), and metadata. To apply the same domain calibration during batch evaluation, pass the original pairs CSV to evaluate via --reference-csv:

groundlens evaluate input.csv --output results.csv --reference-csv domain_pairs.csv
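The saved file can be inspected with the standard json module. Note the key names below (mu_hat, kappa) are assumptions based on the description above, not a documented schema; verify them against the file your groundlens version actually writes:

```python
import json

def load_calibration(path: str) -> tuple[list[float], float]:
    """Load a saved calibration file.

    Key names (mu_hat, kappa) are assumed from the prose above --
    check them against your groundlens version's actual output.
    """
    with open(path) as f:
        calib = json.load(f)
    return calib["mu_hat"], calib["kappa"]
```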

Options:

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `--pairs` | Yes | --- | CSV with `question,response` columns |
| `--output` | Yes | --- | Output JSON file path |
| `--model` | No | `all-MiniLM-L6-v2` | Sentence-transformer model |

groundlens benchmark

Run the confabulation benchmark against a HuggingFace dataset.

groundlens benchmark
groundlens benchmark --dataset cert-framework/human-confabulation-benchmark

Output:

Loading dataset: cert-framework/human-confabulation-benchmark
Running benchmark on 200 items...
  Processed 200/200

--- Benchmark Results ---
SGI AUROC: 0.8234 (n=150)
DGI AUROC: 0.9580 (n=200)

Dependencies

The benchmark command requires datasets (HuggingFace) and scikit-learn for AUROC computation. Install with:

pip install datasets scikit-learn
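AUROC here is the usual ranking metric: the probability that a randomly chosen grounded item scores higher than a randomly chosen confabulated one (SGI is only computed where context is available, which is why its n can be smaller). A small illustration with made-up labels and scores, using the same scikit-learn function:

```python
from sklearn.metrics import roc_auc_score

# Illustrative labels (1 = grounded, 0 = confabulated) and scores
# (higher = more grounded). One grounded item (0.35) ranks below one
# confabulated item (0.4), so 8 of 9 pairs are ordered correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.35, 0.4, 0.3, 0.2]

auc = roc_auc_score(labels, scores)
print(f"AUROC: {auc:.4f}")  # 8/9 correctly ordered pairs -> 0.8889
```

A perfect separation of the two classes gives 1.0; random scores give about 0.5.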

Options:

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `--dataset` | No | `cert-framework/human-confabulation-benchmark` | HuggingFace dataset name |
| `--model` | No | `all-MiniLM-L6-v2` | Sentence-transformer model |