Turn models into reliable planners with Programs as Thoughts. Learn PAL for reasoning with Python snippets, PoT for tabular and finance tasks, when these methods beat Chain-of-Thought, how to add self-consistency voting, and how to sandbox execution.
Promise: By the end of this guide you’ll turn messy math and finance questions into tiny, trustworthy programs the model writes—and your interpreter executes. You’ll know when to reach for plain Chain-of-Thought (CoT), when to switch to PAL (Program-Aided Language models), and when Program-of-Thoughts (PoT) plus self-consistency is the most reliable path.
Language models are talented planners, but they’re sloppy calculators. CoT asks the model to both reason and compute in natural language; that’s where arithmetic slips appear. PAL and PoT split the job: the model expresses reasoning as short code, and a real interpreter (Python) performs the computation. That simple division sharply reduces numeric and logic errors, and in the original studies it often beats much larger models doing CoT alone. (arXiv)
A useful mental model:
CoT = “Explain the steps and do the math in prose.”
PAL = “Write a tiny Python function that carries out the steps. Let Python do the math.” (arXiv)
PoT = “Represent the reasoning as a program of steps (often more structured and table-aware) and execute it; combine with self-consistency to vote across multiple programs.” (arXiv)
Empirically, PAL showed large gains on math word problems (e.g., beating PaLM-540B CoT on GSM8K by ~15 percentage points with Codex-PAL), while PoT reported ~12% average improvements over CoT across math and finance benchmarks, with further boosts when paired with self-consistency. (arXiv)
You’ll orchestrate three simple phases:
Parse & Plan in code. Prompt the model to produce a minimal Python function solve() (no I/O, no imports), using variables derived from the problem.
Execute in a sandbox. Run the function in a restricted interpreter; capture its return value.
(Optional) Vote. For harder tasks, sample multiple programs (temperature > 0), execute each, and pick the most consistent answer (self-consistency). (arXiv)
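The three phases above can be sketched as one small harness. This is a minimal sketch, not a definitive implementation: `llm(prompt, temperature)` and `run_safely(code)` are assumed stand-ins for your model client and sandboxed executor.

```python
import collections

def pal_answer(problem: str, llm, run_safely, n_samples: int = 1):
    """Parse-and-plan, execute, optionally vote.

    Assumes `llm(prompt, temperature)` returns a `def solve(): ...` string
    and `run_safely(code)` executes it in a sandbox, returning a float.
    """
    prompt = f"Write def solve() returning the numeric answer.\n# Problem\n{problem}"
    answers = []
    for _ in range(n_samples):
        # Phase 1: the model plans in code (temperature > 0 only when sampling)
        code = llm(prompt, temperature=0.0 if n_samples == 1 else 0.7)
        # Phase 2: a real interpreter computes
        try:
            answers.append(round(run_safely(code), 6))
        except Exception:
            continue  # discard programs that fail to execute
    if not answers:
        return None
    # Phase 3 (optional): vote on the most frequent rounded value
    return collections.Counter(answers).most_common(1)[0][0]
```

With `n_samples=1` this is plain PAL; raising it turns on self-consistency voting.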
💡 Insight: When you force code generation, you automatically get variable binding, order of operations, and precise numerics. That’s where a lot of CoT failures hide.
Scenario: You’re reconciling two ETF buys and want the volume-weighted average price (VWAP) for your cost basis.
Orders:
Buy 30 @ 50.25
Buy 20 @ 52.40
Expected: VWAP = (30 · 50.25 + 20 · 52.40) / 50 = 51.11.
Prompt (PAL harness). This system message tells the model exactly what kind of code to emit and why.
What it does: Instructs the model to emit a single solve() function that returns the numeric answer; no imports or I/O.
```
You are a code-only solver. Write a single Python function `def solve(): ...` that:
- Parses the problem embedded in comments.
- Uses plain Python (no imports, no I/O, no randomness).
- Returns the final numeric answer as a float (not a string).
- Avoids printing; just `return` the answer.
- Keeps code short and readable.

# Problem
We bought 30 shares at 50.25 and 20 shares at 52.40. Compute the
volume-weighted average price of the position after these two buys.
```
One plausible model output (abridged):
```python
def solve():
    qty1, px1 = 30, 50.25
    qty2, px2 = 20, 52.40
    total_qty = qty1 + qty2
    total_cost = qty1*px1 + qty2*px2
    return total_cost / total_qty  # 51.11
```
You run this in your sandbox and surface 51.11 to the user. The “thinking” didn’t vanish; it became executable reasoning.
They’re siblings with different habits.
PAL is a style of few-shot prompting where the model writes runnable code as the intermediate step. It shines on word problems and algorithmic tasks (calendar arithmetic, combinatorics, string ops) because the code can be entirely bespoke and minimal. (arXiv)
PoT is a format that explicitly structures reasoning as a small program “plan” and often expects table/finance contexts. In the original work, PoT targeted math datasets (GSM8K, AQuA, SVAMP, etc.) and financial QA datasets (FinQA, ConvFinQA, TAT-QA), consistently outscoring CoT; adding self-consistency typically pushed it to SOTA or near-SOTA. (arXiv)
A pragmatic split: reach for PAL when the question is free-form or algorithmic; reach for PoT when the question is numerical and structured (tabular statements, multi-step asset calculations), and you’d like to ensemble several sampled programs.
Finance questions often mix retrieval/lookup (which row? which column?) with precise arithmetic (ratios, YoY growth, weighted sums). PoT templates ask the model to:
Restate variables as code.
Transform a small table into lists/dicts.
Compute via pure Python.
Return only the numeric answer.
Then you sample N programs (e.g., 5–20), run them, and pick the most frequent numeric value within a tolerance (e.g., 1e-6). This is textbook self-consistency, but the voting happens over program outputs rather than prose answers. (arXiv)
Minimal voting wrapper (conceptual):
```python
answers = []
for _ in range(N):
    # vary temperature > 0 at decode time
    code = llm(prompt, temperature=0.7)   # returns a `def solve(): ...` string
    ans = run_safely(code)                # sandboxed exec, returns float
    answers.append(round(ans, 6))

# choose the modal value; if continuous, cluster within a tolerance
final = most_common_with_tolerance(answers, tol=1e-6)
```
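The wrapper above leaves `most_common_with_tolerance` unspecified. One plausible implementation, a greedy single-pass clustering (an illustrative sketch, not the only option):

```python
def most_common_with_tolerance(answers, tol=1e-6):
    """Group numeric answers that agree within `tol`; return the
    densest group's center (mean of its members)."""
    clusters = []  # list of (representative, members)
    for a in answers:
        for rep, members in clusters:
            if abs(a - rep) <= tol:
                members.append(a)
                break
        else:
            clusters.append((a, [a]))
    if not clusters:
        return None
    _, members = max(clusters, key=lambda c: len(c[1]))
    return sum(members) / len(members)
```

Ties go to the first cluster seen; for voting over rounded values, `collections.Counter` is an even simpler drop-in.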
Why it helps: each sample may choose slightly different decompositions (e.g., rounding order, unit normalization). The correct result tends to attract agreement across samples. (arXiv)
If your failures are semantic (the model misunderstood the story) and the math is trivial, CoT may be fastest and cheapest. But if your failures are numerical/logic—off-by-one dates, rounding, unit conversion, or multi-step arithmetic—switch to PAL/PoT immediately.
Accuracy: PAL/PoT typically beat vanilla CoT on arithmetic-heavy tasks. The PAL paper reports large jumps on GSM8K (e.g., Codex-PAL > PaLM-540B CoT by ~15 points); PoT reports ~12% average gains vs. CoT across math/finance, and more with self-consistency. (arXiv)
Latency & cost: One program + run is often cheaper than generating long CoT prose. But self-consistency (multiple samples) multiplies tokens and interpreter calls—use it only when single-shot is unstable. (arXiv)
Determinism: With PAL/PoT, setting temperature=0 and constraining the scaffold yields highly repeatable outputs; CoT can drift stylistically and numerically.
1) PAL system scaffold. Use this when you want the model to output a tiny, self-contained solver.
What it does: Forces a single function with a numeric return and no side effects.
```
You are a Python reasoning assistant. Output ONLY valid Python with:

def solve():
    """
    - Read the problem in comments below.
    - Use plain Python (no imports, file I/O, network, randomness).
    - Use clear variables; avoid magic numbers.
    - Return the final numeric answer (float or int).
    """
    ...

# Problem
{{PROBLEM_TEXT}}
```
2) PoT for tables/finance with voting. Use this when questions refer to small tables or multiple steps.
What it does: Asks for a short program that parses a mini-table and returns a single number; ideal for self-consistency.
```
Write a short Python program with a single function solve() that:
- Encodes the table below as data structures in code.
- Computes the requested quantity precisely.
- Returns ONLY the numeric answer (float or int).
- No imports, I/O, randomness, or printing.

# Table (CSV-like; header in row 1)
{{TABLE_TEXT}}

# Task
{{QUESTION_TEXT}}
```
3) Self-consistency decode note (for your orchestrator): Raise temperature to 0.6–0.8 and sample 5–20 programs; vote on the numeric return with a tolerance. This is the same strategy that improved CoT, now applied to program outputs. (arXiv)
Executing model-written code demands a few non-negotiables:
Sandbox the interpreter. Disallow import, filesystem, and network; run with a short CPU/memory/time budget; kill on exceptions or long loops.
Enforce a schema. Require exactly one symbol—solve()—that returns a number. Reject anything else.
Unit tests when stakes are high. Provide 1–2 hidden test cases in the prompt scaffold (e.g., “if input were X, expected is Y”) and verify before accepting the final return.
Deterministic numerics. Standardize rounding (banker’s vs. half-up) and decimal precision in the harness when dealing with currency.
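The schema and sandbox rules above can be enforced with a small AST check before execution. This is a minimal sketch, not a full sandbox: in production you would still run it in a subprocess or container with CPU, memory, and time limits.

```python
import ast

# A deliberately small allow-list of builtins for model-written code
ALLOWED_BUILTINS = {"abs": abs, "min": min, "max": max, "sum": sum,
                    "round": round, "len": len, "range": range,
                    "float": float, "int": int}

def run_safely(code: str) -> float:
    """Validate and execute a model-written solve()."""
    tree = ast.parse(code)
    # Schema: exactly one top-level symbol, a function named solve
    if not (len(tree.body) == 1
            and isinstance(tree.body[0], ast.FunctionDef)
            and tree.body[0].name == "solve"):
        raise ValueError("expected a single def solve()")
    # Reject imports and dunder attribute escapes anywhere in the tree
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder access is not allowed")
    ns = {"__builtins__": ALLOWED_BUILTINS}
    exec(compile(tree, "<solver>", "exec"), ns)
    result = ns["solve"]()
    if not isinstance(result, (int, float)):
        raise ValueError("solve() must return a number")
    return float(result)
```

The AST pass rejects the most common scaffold violations (imports, `print`-adjacent dunder tricks) before any code runs; the time and memory budget still belongs in the process layer.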
These measures turn “programs as thoughts” from a neat trick into a dependable tool.
The model references imports or prints. It’s pattern-matching from prior code. Tighten your scaffold (“no imports, no print, just return”), and add a simple rejection rule in your runner.
Variable name drift or table misread. If the table is embedded as text, the model may miss a header or unit. Make column names short and unambiguous, and show one explicit example of row-to-dict mapping in the instruction comment.
Infinite loops or heavy computation. Rare but possible. Add a strict instruction (“no loops over more than 10^6 iterations”), and enforce a time/step limit in your sandbox.
Float jitter breaks voting. Round each program’s return (e.g., 6 decimals) before tallying. If answers cluster (not identical), choose the densest cluster center rather than exact mode.
Still unstable? Add a one-line check step to the scaffold (“verify the result by recomputing from primitives”) or sample a few more programs at a slightly lower temperature.
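For the table-misread pitfall, one concrete fix is to embed a single explicit row-to-dict example in the scaffold. A sketch, using the revenue table from the walkthrough later in this guide (the column names are illustrative):

```python
# Instruction comment to embed in the scaffold so the model parses
# headers and units correctly. The CSV row "Q2-2024,120.5" maps to:
row_example = {"Quarter": "Q2-2024", "Revenue_MUSD": 120.5}
```

One worked mapping is usually enough; the model generalizes the pattern to the remaining rows.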
Goal: Use PoT + self-consistency on a tiny finance question.
Data (CSV-like):
```
Quarter,Revenue_MUSD
Q2-2024,120.5
Q2-2025,138.2
```
Task: Compute year-over-year growth (%) from Q2-2024 to Q2-2025.
Your prompt (paste into the PoT template):
```
# Table
Quarter,Revenue_MUSD
Q2-2024,120.5
Q2-2025,138.2

# Task
Return the YoY growth percentage from Q2-2024 to Q2-2025 as a float.
```
One plausible program:
```python
def solve():
    last = 120.5
    now = 138.2
    return (now - last) / last * 100.0
```
Expected answer: ~14.6888 → 14.69% (rounded to two decimals).
Now try sampling 5 programs at temperature 0.7 and vote on the rounded (2-dp) percentage. You should see strong agreement near 14.69.
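The round-before-tally step matters here. A sketch with hypothetical returns from five sampled programs (the raw floats differ in the low decimals depending on where each program rounds):

```python
from collections import Counter

# Hypothetical returns from five sampled solve() programs
samples = [14.688796680497926, 14.688796680497926, 14.6888,
           14.688796680497926, 14.69]

rounded = [round(a, 2) for a in samples]
value, votes = Counter(rounded).most_common(1)[0]
# Rounding to 2 dp absorbs the jitter: all five votes land on 14.69
```

Without the rounding, three distinct raw values would split the vote even though every program is arithmetically "right."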
Beyond Python? PAL/PoT work with any deterministic interpreter. For analytics teams, a SQL-first variant is natural: emit a single aggregate query against a tiny, in-prompt table. The trade-off is setup friction; Python remains the easiest universal runtime.
When not to use PAL/PoT: If the question is open-ended or evaluative (“Explain the market impact of X”), code yields little. Fall back to CoT, or combine with tool-use for retrieval, then summarize.
Combining with other strategies: Self-consistency is a near-free multiplier. Beam search or ToT-style branch-and-score can help on long multi-part problems, but consider cost. (For background on self-consistency, see Wang et al.) (arXiv)
Evaluation: For numeric QA, adopt exact-match with tolerance and log failure types: “execution error,” “wrong parse,” or “wrong math.” If “wrong math” dominates CoT, move to PAL/PoT; if “wrong parse” dominates PAL/PoT, refine the scaffold and table format.
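Tolerance-based exact match and the failure taxonomy above fit in a few lines. A sketch; the bucket labels mirror the three failure types named in the text, and how you detect "wrong parse" (e.g., by inspecting which table cells the program referenced) is up to your harness.

```python
def exact_match_with_tol(pred, gold, tol=1e-4):
    """Tolerance-based exact match for numeric QA."""
    try:
        return abs(float(pred) - float(gold)) <= tol
    except (TypeError, ValueError):
        return False  # non-numeric output never matches

def classify_failure(code_ran: bool, parsed_right: bool, matched: bool) -> str:
    """Coarse buckets for error analysis across an eval set."""
    if not code_ran:
        return "execution error"
    if not parsed_right:
        return "wrong parse"
    return "correct" if matched else "wrong math"
```

Tallying these buckets over a 10-20 question eval set is what tells you whether to tighten the scaffold or enable voting.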
Programs-as-thoughts reframes reasoning: the model plans, the interpreter calculates. PAL is your go-to for bespoke, free-form program snippets that execute a plan precisely. PoT scales that idea to tabular and financial QA, and when you pair it with self-consistency, it becomes a robust, state-of-the-art pattern for numerical questions.
Choose the technique by error profile and latency. If prose reasoning is fine but numbers wobble, stop asking the model to be a calculator. Hand the arithmetic to Python and let the model think in code.
A small harness—scaffold, sandbox, and optional voting—turns this from an experiment into week-to-week reliability. Once you’ve built it, you’ll find PAL/PoT cheaper and steadier than sprawling chains of thought for anything math-heavy.
Wrap a minimal PAL/PoT runner with sandboxing, timeouts, and a tolerance-based voter; wire it to your preferred LLM.
Build a tiny numeric eval set (10–20 questions) with exact answers; track error types to decide when to enable voting.
Extend your scaffold for tables (CSV → Python dicts) and set consistent rounding rules for currency tasks.
PAL: Program-Aided Language Models. Gao, Madaan, Zhou, Alon, Liu, Yang, Callan, Neubig. arXiv:2211.10435. (Key idea: generate programs as intermediate steps and offload computation to Python; reports large gains on GSM8K vs. PaLM-540B CoT.) (arXiv)
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Chen, Ma, Wang, Cohen. arXiv:2211.12588 / TMLR 2023. (Key idea: structure reasoning as programs; ~12% avg gain over CoT across math and finance; +self-consistency for SOTA.) (arXiv)
Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang, Wei, Schuurmans, Le, Chi, Narang, Chowdhery, Zhou. arXiv:2203.11171 / ICLR 2023. (Decoding strategy: sample diverse reasoning paths and vote.) (arXiv)
(Background on CoT itself: Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” 2022.) (arXiv)