Build an 'If unsure, do X' policy with semantic-entropy signals. Use LLM clustering to score uncertainty and auto-route to cite-first, retrieval, or escalation—plus ready prompts, a mini lab, and production tips.
When to escalate, re-ask, or add retrieval—using signals you can actually compute in a prompt-only stack.
If you only make one upgrade to your prompts this quarter, make it this: teach your system to notice when it’s unsure—and do something smarter than guessing. By the end of this guide you’ll build an “If unsure, do X” policy you can drop into real workflows: re-query for clarity, switch to cite-first mode, add retrieval, or escalate to a stronger model or tool. We’ll adapt semantic entropy—a meaning-level uncertainty signal shown to track confabulations—to prompt-only proxies you can run today (no logits, no fine-tuning). The result is fewer ungrounded answers and clearer routes to truth.
Why now. Recent work shows that semantic disagreement across samples is a strong indicator of “confabulations” (arbitrary, fluent, wrong statements) and outperforms shallower uncertainty measures on several tasks. The original method is heavyweight; newer variants and approximations open a path to practical deployment. (Nature, PubMed) At the same time, researchers caution this won’t catch confident, systematic errors—so your policy must also include explicit verification and retrieval paths. (University of Oxford)
Uncertainty isn’t just about wording. Ask a model the same question five times. If you get five paraphrases of the same answer, you have low semantic uncertainty—even if the wording varies. If you get different meanings (different years, names, formulas), your semantic uncertainty is high. Semantic entropy formalizes this: cluster multiple generations by meaning, then compute the entropy over those clusters. Higher entropy → higher disagreement → higher likelihood of a confabulation. (Nature, oatml.cs.ox.ac.uk)
In the original Nature paper, the pipeline runs: generate candidate factoids, ask targeted questions, sample multiple answers, cluster by meaning, then compute entropy; high entropy flagged risky content. It worked well at catching confabulations across datasets. (Nature)
Reality check. The full method is compute-hungry. But follow-ups introduce cheaper approximations, like Semantic Entropy Probes (SEPs) that estimate uncertainty from a single pass’s hidden states (useful insight even if you can’t train probes). For prompt-only stacks, we’ll mimic the spirit: multi-sample + meaning checks via an LLM-as-judge. (arXiv)
You’ll build a three-stage routine that runs in milliseconds per branch and fits neatly around your current prompts.
Probe (generate a small bundle of candidates).
Compare meanings (cluster answers by equivalence using a compact judge prompt).
Act on thresholds (route to answer, cite-first, retrieval, or escalation).
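Stitched together, the three stages form one small routine. Here is a minimal sketch, assuming hypothetical hooks `sample_answers`, `cluster_by_meaning`, and `route` that you would implement with the prompts shown later in this guide:

```python
def uncertainty_routine(question, sample_answers, cluster_by_meaning, route):
    """Probe -> compare meanings -> act. All three hooks are hypothetical:
    sample_answers(question, n) returns n candidate answers,
    cluster_by_meaning(answers) groups them with an LLM judge,
    route(question, answers, cluster_sizes) applies your thresholds."""
    answers = sample_answers(question, n=8)      # Stage 1: probe
    clusters = cluster_by_meaning(answers)       # Stage 2: compare meanings
    sizes = [len(c) for c in clusters]
    return route(question, answers, sizes)       # Stage 3: act on thresholds
```

Keeping the hooks injectable makes each stage testable in isolation and lets you swap in cheaper variants per route.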
You want meaning diversity, not just temperature noise. Use 6–10 samples with controlled variation:
Vary seeds and temperature in a narrow band (e.g., T=0.5–0.8).
Light role/constraint perturbations (e.g., “Answer concisely as a fact-checker” vs. “Answer like a careful researcher”).
Stable output schema so you’re judging content, not format (e.g., {"answer": "...", "support": ["..."]}).
💡 Insight: Two to three prompt variants often reveal more semantic spread than ten identical prompts at higher temperature. You’re testing meaning agreement, not creativity.
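The probe stage can be sketched as follows. `complete` stands in for your provider's completion call (a hypothetical signature, not a real client API), and the roles, temperatures, and seeds mirror the controlled-variation recipe above:

```python
import itertools
import json

PROBE_TEMPLATE = (
    "{role} Answer the question below. If unsure, say \"Unknown\".\n"
    "Return JSON only: {{\"answer\": \"...\", \"support\": [\"...\"]}}\n"
    "Question: {question}"
)

# Light role perturbations surface meaning disagreement better than raw temperature.
ROLES = (
    "Answer concisely as a fact-checker.",
    "Answer like a careful researcher.",
)

def probe(question, complete, temps=(0.5, 0.8), seeds=(1, 2)):
    """Generate 2 roles x 2 temperatures x 2 seeds = 8 schema-stable candidates.
    `complete(prompt, temperature, seed) -> str` is a hypothetical hook
    wrapping your provider's completion call."""
    candidates = []
    for role, t, s in itertools.product(ROLES, temps, seeds):
        prompt = PROBE_TEMPLATE.format(role=role, question=question)
        candidates.append(json.loads(complete(prompt, temperature=t, seed=s)))
    return candidates
```

Because every candidate shares one JSON schema, the judge in the next stage compares content rather than formatting noise.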
For each pair (or for each item vs. a “current best”), ask a micro-judge:
Judge prompt (1 sentence). “Do A and B assert the same factual claim? Answer same / different and explain briefly.”
Keep it scoped to the claim, not the rationale. Build clusters by union-find: start a new cluster when the judge says different. Count members per cluster.
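A minimal clustering sketch along these lines, using the cheaper compare-to-representative variant: each answer is judged only against the first member of each existing cluster, and a new cluster starts when no representative matches. `judge_same` is a hypothetical wrapper around the pairwise judge prompt that returns True for "same":

```python
def cluster_by_meaning(answers, judge_same):
    """Greedy meaning clustering. `judge_same(a, b) -> bool` is a
    hypothetical hook around the pairwise judge prompt."""
    clusters = []  # each cluster is a list; clusters[i][0] is its representative
    for ans in answers:
        for cluster in clusters:
            if judge_same(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            # judge said "different" from every representative: new cluster
            clusters.append([ans])
    return clusters
```

This costs O(n · k) judge calls instead of O(n²) pairwise comparisons, at the price of some sensitivity to which answer becomes a representative first.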
Entropy proxy. Let p_i be the frequency of cluster i over k clusters; compute H = −∑ p_i log p_i.
Normalize by log k if you want a 0–1 scale.
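If your orchestration runs Python, the entropy proxy is a few lines of plain math once you have the cluster sizes; no model call is needed for this step:

```python
import math

def semantic_entropy(cluster_sizes, normalize=True):
    """Entropy over meaning-clusters, optionally normalized to 0-1 by log(k).
    cluster_sizes: member counts per cluster, e.g. [5, 2, 1]."""
    total = sum(cluster_sizes)
    probs = [c / total for c in cluster_sizes]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    if normalize and len(cluster_sizes) > 1:
        h /= math.log(len(cluster_sizes))
    return h
```

A single cluster (full agreement) yields 0; evenly split clusters yield 1 on the normalized scale.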
No math API? Ask the model to compute the normalized entropy inside a structured “judge” step and to return the number clearly. You’re not asking for chain-of-thought, just a scalar and the cluster labels.
Why this works. The research signal is about meaning-level disagreement, not token probabilities. Paraphrase-aware clustering approximates the “semantic space” used in the original method—without logits. (Nature, oatml.cs.ox.ac.uk)
Define three bands from your dev data:
Low uncertainty (H ≤ L): Answer normally.
Medium (L < H ≤ M): Switch to cite-first mode and include confidence notes.
High (H > M): Trigger retrieval or escalate to a stronger model / tool; optionally re-ask the user to clarify.
Choose L and M via calibration on a small labeled set (or from practical defaults below), then log decisions to refine.
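One crude but serviceable way to choose L and M from a labeled dev set is to sweep the observed H values and keep the widest band whose error rate stays within tolerance. A sketch, not a tuned calibrator:

```python
def calibrate_thresholds(samples, target_error=0.05):
    """Pick (L, M) from a dev set of (H, correct) pairs.
    L: widest 'answer normally' band with error rate <= target_error;
    M: the same sweep at double the tolerance for the cite-first band."""
    def error_below(t):
        below = [ok for h, ok in samples if h <= t]
        return sum(not ok for ok in below) / len(below) if below else 0.0

    candidates = sorted({h for h, _ in samples})
    L = max((t for t in candidates if error_below(t) <= target_error), default=0.25)
    M = max((t for t in candidates if error_below(t) <= 2 * target_error), default=0.60)
    return L, max(L, M)
```

The defaults of 0.25 and 0.60 fall back in when the dev set is too small to support a threshold; the doubled tolerance for M is an arbitrary starting point you should tune.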
Scenario. You run a help-bot for product docs. A user asks: “Does Model X support passkeys on Linux?” Your internal docs are spotty.
Probe (8 candidates). Two prompt variants × two temperatures × two seeds.
Cluster. Judge says:
Cluster A (5 items): “Yes, via PAM module since v1.3.”
Cluster B (2 items): “No, Linux support planned for v1.4.”
Cluster C (1 item): “Works only on Ubuntu with a third-party module.”
The entropy proxy H computes to ≈0.82 normalized.
Policy. Threshold M=0.6. High uncertainty → retrieve docs + cite-first.
Rerun answer with retrieved passages; if clusters now collapse (H ≤ 0.2), respond; else escalate to an expert queue.
This is exactly what semantic entropy is for: agreeing when grounded, diverging when the model’s “semantic landscape” is fractured. (Nature)
Below are compact snippets you can paste into your orchestration. They avoid hidden chain-of-thought and keep outputs structured.
A) Probe prompt (answer once). What it does: produces a single, schema-conformant answer.
You are a careful assistant. Answer the user question below. Return JSON with keys: answer, support (array of citable strings or "N/A"). If unsure, do not guess; say "Unknown" in answer. Question: {{USER_QUESTION}} Constraints: be concise; no extra claims; include support if known. Output JSON only.
B) Judge prompt (pairwise meaning check). What it does: tells you whether two answers assert the same claim.
Task: Do Answer A and Answer B assert the same factual claim? Focus on the main claim only, ignore phrasing. A: {{ANSWER_A}} B: {{ANSWER_B}} Respond with: {"relation":"same|different","rationale":"one short sentence"}
C) Router policy (pseudocode). What it does: maps entropy to actions.
H = semantic_entropy(cluster_sizes)  # normalized 0..1
if H <= 0.25:
    return best_answer(cite_first=False)
elif H <= 0.60:
    return best_answer(cite_first=True)  # require sources inline
else:
    ctx = retrieve_topk(query=expand_query(USER_QUESTION))
    if ctx:
        rerun = answer_with_context(USER_QUESTION, ctx)
        if rerun.H <= 0.25:
            return best_answer(rerun, cite_first=True)
    return escalate_to_stronger_model_or_human(USER_QUESTION)
D) Cite-first mode (answer with sources first). What it does: forces sourcing before claims—valuable in the medium band.
Answer in two parts:
1) "Sources" — list 1–3 specific citations (short quotes or doc IDs).
2) "Answer" — only claims supported by those sources.
If sources are insufficient, say "Insufficient sources" and stop.
How many samples? The Nature paper used multiple re-asks to build robust clusters; in production, 6–10 is a sweet spot for cost/latency. For short factual Q&A, 6 often suffices; for open-ended synthesis, lean toward 8–10. (Nature)
Clustering tricks.
If pairwise judging is too slow, compare each candidate only to a running representative of each cluster.
To avoid spurious “different,” add a tie-break: if judge rationale mentions scope/format, ask a second judge “Does this difference change the factual claim?” before splitting clusters.
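The tie-break can be sketched as a two-stage check. Both judge hooks are hypothetical wrappers around the prompts described above: the first returns the judge's JSON verdict, the second answers "Does this difference change the factual claim?" as a boolean:

```python
def robust_same(a, b, judge, tiebreak_judge):
    """Two-stage meaning check to avoid spurious splits.
    judge(a, b) -> {"relation": "same"|"different", "rationale": str}
    tiebreak_judge(a, b) -> bool  (True if the claims still match)"""
    verdict = judge(a, b)
    if verdict["relation"] == "same":
        return True
    # If the stated reason is about scope/format/phrasing, ask a second
    # judge whether the difference actually changes the factual claim.
    if any(w in verdict["rationale"].lower() for w in ("scope", "format", "phrasing")):
        return tiebreak_judge(a, b)
    return False
```

The keyword list is an illustrative heuristic; in practice you might have the first judge emit an explicit "difference_kind" field instead of parsing the rationale.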
Thresholds. Start with L=0.25, M=0.60. Calibrate on a tiny dev set: flag 30–50 prompts where you know correctness, run the probe, plot H vs. errors, shift thresholds to meet your precision/recall goals. (The Oxford team emphasizes that high semantic entropy is particularly good at spotting confabulations, not systematic confident errors—so keep a retrieval/verification path for those.) (University of Oxford)
Cheaper variants.
Tri-ask only: sample 3 answers; if you see ≥2 distinct meanings, treat as high-uncertainty.
Re-paraphrase probe: ask the same question via two paraphrases and check cross-agreement.
Confidence-from-diversity: have the model list “two plausible alternatives” before answering; more alternatives → more uncertainty. (Use this as a weak, early filter.)
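The tri-ask variant fits in a few lines; `sample_answers` and `judge_same` are the same hypothetical hooks as before:

```python
def tri_ask_filter(question, sample_answers, judge_same):
    """Cheap early filter: 3 samples; >=2 distinct meanings => high uncertainty.
    Returns True when the full probe (or retrieval) should run."""
    a, b, c = sample_answers(question, n=3)
    meanings = 1
    if not judge_same(a, b):
        meanings += 1
    # c counts as a new meaning only if it matches neither earlier answer
    if not judge_same(a, c) and not judge_same(b, c):
        meanings += 1
    return meanings >= 2
```

At most four judge calls per question, which is why it works well as a gate in front of the full 8-ask probe on low-risk routes.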
Beyond entropy. Emerging work proposes alternate energy-style measures and improved estimators. Treat them as R&D tracks, not blockers to shipping a simple proxy today. (arXiv, aclanthology.org)
When not to use it.
Deterministic transforms (regex-like text ops, format conversions): semantic entropy just adds latency.
Stable, high-recall retrieval tasks: invest in retrieval quality and citation checks first.
Known-safe closed sets (SKU code lookups): route straight to the database.
All clusters wrong, low entropy. The model agrees—confidently wrong. That’s the classic limit of semantic entropy. Fix: add mandatory retrieval for queries about named entities, dates, or stats; or trigger cite-first by policy for those categories. (University of Oxford)
Judge is inconsistent. Pairwise judging can be noisy. Fix: normalize answers to a short, canonical statement first (“Extract the main claim as a one-sentence proposition”), then judge those. A cleaner proposition reduces spurious splits.
Latency spikes. Parallelize the probe, standardize outputs, and cut to tri-ask for low-risk routes (e.g., internal tooltips) while keeping full 8-ask for higher-risk endpoints (e.g., external help center).
“Cite-first” gives weak sources. Strengthen the rubric: “Only use sources containing the exact claim or a direct quote; otherwise say ‘Insufficient sources.’”
Goal. Wire a tiny router on your own model endpoint and watch it switch modes.
Pick 6–8 real user questions from last week—mix in some you know the model often bungles (e.g., niche product versions).
Run the probe (8 candidates) with two small prompt variants.
Cluster with the judge prompt, compute H (ask the model for the scalar).
Route: normal if H ≤ 0.25; cite-first if 0.25 < H ≤ 0.60; retrieval otherwise (even a naive vector search over your docs is fine).
Record: final answer, H, action taken, and outcome (correct / needs fix).
Expected pattern. “Easy” questions collapse to one meaning (H~0.0–0.2). Questions about fast-changing details (versions, dates) show multiple meanings (H>0.6) and benefit immediately from retrieval + cite-first, or escalation if your corpus is thin. This matches reported strengths of semantic-uncertainty signals on confabulation-prone queries. (Nature)
Cost/latency: A full 8-ask adds overhead. Use risk-based routing: run tri-ask on low-risk surfaces; escalate to 8-ask only if tri-ask finds a disagreement. Cache judgments for repeated questions.
Observability: Log H, cluster counts, action chosen, and whether retrieval resolved disagreement. Over a week, pick new thresholds that minimize downstream escalations while keeping error rates acceptable.
Human-in-the-loop: For H>0.8 with no adequate sources, send to an expert queue and store the resolved answer as a gold demo to reduce future uncertainty.
Security & integrity: Never let the judge prompt read chain-of-thought; you’re classifying claims, not revealing reasoning. Return scalars and short rationales only.
The core idea isn’t mystical. When the model’s internal distribution spreads across meanings, not just phrasing, it’s telling you “I’m not anchored.” The Nature study quantifies that spread and shows it correlates with confabulations; later work explores cheaper or stronger estimators. Your prompt-only proxy measures the same signal—agreement of claims—and gives you a practical route to act: require sources, add retrieval, or escalate. (Nature, arXiv)
Uncertainty-aware prompting is about listening to the model’s meaning-level hesitation and turning it into action. Semantic entropy gives a principled frame: disagreements in what is being said predict confabulations better than superficial variation. We adapted that into a lightweight, prompt-only routine: probe with a few diverse generations, cluster by meaning with a tiny judge, compute a simple entropy proxy, and route accordingly.
You now have a working “If unsure, do X” policy: answer when clusters align, cite-first when they mostly align, retrieve or escalate when they split. Calibrate thresholds on your data, log everything, and expect immediate gains on the very queries that used to produce the most embarrassing guesses.
Next steps
Add a mandatory retrieval rule for entity/date/number questions regardless of H; measure the impact on corrections.
Build a weeklong calibration set; tune L/M thresholds to your precision/recall targets.
Experiment with a tri-ask early filter and an 8-ask fallback to balance latency with protection.
OATML explainer on semantic entropy. Helpful intuition with examples of meaning vs. wording variation. (oatml.cs.ox.ac.uk)
Semantic Entropy Probes (SEPs). Cheaper, probe-based estimator; useful conceptual grounding even if you won’t train probes. (arXiv)
Oxford news release on limitations. Why confident, systematic errors require additional guardrails (retrieval, rules). (University of Oxford)
Beyond SE (emerging). Explorations of alternative energy/uncertainty measures; good for R&D curiosity. (arXiv, aclanthology.org)