From Pipeline to Parliament: 5 Agents That Underwrite a Deal
The Problem with Linear Pipelines
The V2 underwriting system looked clean on a whiteboard:
Raw Documents → Gemini Extraction → DealSchema → 11 SME Evals (parallel)
→ Deterministic Aggregation → Single-Pass Report → Questions
Every stage runs once. No iteration, no feedback loops, no self-assessment. The extraction either got it right or it didn't. Each of the 11 subject matter experts — GPU engineer, financial analyst, legal analyst, and eight others — evaluated in total isolation. When they disagreed, the system just... concatenated their opinions. Question deduplication was literally "compare the first 100 characters."
This worked for a while. Then we started seeing the cracks:
- No evidence tracking. An SME might flag "CapEx assumptions are aggressive" with zero link to which document, which page, which number.
- No cross-SME reasoning. Legal says PASS, risk says FAIL — the system presents both opinions side by side with no attempt to resolve the conflict.
- No quality gate. A bad extraction propagates silently through every downstream stage. Garbage in, garbage out, with no checkpoint in between.
- No fact-checking. Every claim from an LLM call was taken at face value.
The V3 Architecture: 5 Autonomous Agents
V3 replaces the linear pipeline with a system of agents that can think, verify, and argue:
ExtractionAgent (5-phase, reflective)
↓ DealSchema
11× SMEAgent (4-phase, tool-grounded)
↓ 11× SMEEvaluation
ICPanelAgent (5-step: aggregate → synthesize → debate → challenge → questions)
↓ Enhanced UnderwritingPacket
ReportAgent (3-phase: draft → self-review → revise)
QuestionAgent (3-phase: gap-analysis → web-filter → prioritize)
Each agent is an autonomous state machine. They don't just call an LLM once and return the result — they iterate through phases, reflect on their own output quality, use tools to verify claims, and only emit a final result when a quality threshold is met.
The Runtime: Observe-Think-Act-Reflect
Every agent runs inside the same AgentRuntime — a generic execution engine that implements a single loop:
```
for iteration in range(max_iterations):
    action = agent.plan(context)            # Think: what to do next?
    match action.type:
        case DONE:      break
        case TOOL_CALL: execute tool → on_tool_result()
        case LLM_CALL:  call LLM → on_llm_response()
        case REFLECT:   self-assess → on_reflection()
        case MESSAGE:   inter-agent message → on_message_sent()
```
Three design decisions that matter:
- Runtime is agent-agnostic. Any `BaseAgent` subclass plugs in. The runtime doesn't know or care whether it's running an extraction agent or a report writer.
- Context is immutable. Every `with_tool_result()`, `with_llm_response()`, `with_memory()` call returns a new instance. No aliasing bugs, no surprise mutations.
- Every step is traced. Full cost, token count, duration, tool inputs/outputs — all recorded in `AgentTrace`. You can audit exactly what happened and why.
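The immutability guarantee is easy to sketch with a frozen dataclass. This is a minimal illustration, not the actual `AgentContext`; the field names and the exact method signatures here are assumptions:

```python
from dataclasses import dataclass, replace
from typing import Any

@dataclass(frozen=True)
class AgentContext:
    """Immutable snapshot of an agent's working state (hypothetical shape)."""
    tool_results: tuple = ()
    llm_responses: tuple = ()

    def with_tool_result(self, result: Any) -> "AgentContext":
        # Returns a NEW context; the original is never mutated.
        return replace(self, tool_results=self.tool_results + (result,))

    def with_llm_response(self, response: str) -> "AgentContext":
        return replace(self, llm_responses=self.llm_responses + (response,))

ctx = AgentContext()
ctx2 = ctx.with_tool_result({"tool": "web_search", "ok": True})
assert ctx.tool_results == ()        # original untouched
assert ctx2.tool_results[0]["ok"]    # new instance carries the result
```

Because every `with_*` call returns a fresh instance, a reflection step can safely hold a reference to an earlier context without fear of it changing underneath.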
The Five Agents
ExtractionAgent — Don't Extract Once, Extract Until It's Right
V2 ran Gemini extraction once and hoped for completeness. V3's ExtractionAgent has five phases:
- Bulk extract — run the legacy extraction pipeline for an initial DealSchema
- Reflect — self-assess completeness. What fields are missing? What looks wrong?
- Targeted reads — re-read specific documents for the top-5 critical missing fields
- Verify — cross-reference high-value fields (operator name, GPU model, total CapEx) across documents, web search, and fact-checking
- Final reflect — quality gate before emitting
Model: gemini-3.1-pro. Quality threshold: 80/100.
SMEAgent — 11 Domain Experts That Show Their Work
V2's SME was a single LLM call producing an opinionated blob. V3's SMEAgent runs four phases per domain:
- Initial evaluation — structured analysis with evidence-linked strengths and concerns
- Reflect — check evidence backing ratio (target: 80%+), confidence (target: 0.7+)
- Tool verification — use `query_schema`, `web_search`, `calculate` to back up findings
- Final reflect — quality gate
The quality formula weights evidence heavily: concerns_backed × 40 + strengths_backed × 30 + confidence × 20 + summary_quality × 10.
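Treating each input as a 0–1 ratio (an assumption about the scaling) makes the weighting concrete:

```python
def sme_quality(concerns_backed: float, strengths_backed: float,
                confidence: float, summary_quality: float) -> float:
    """Evidence-weighted quality score; all inputs are assumed 0..1 ratios."""
    return (concerns_backed * 40 + strengths_backed * 30
            + confidence * 20 + summary_quality * 10)

# An SME whose concerns are fully evidence-backed scores well even
# with only moderate confidence: 40 + 24 + 15 + 9 = 88
score = sme_quality(concerns_backed=1.0, strengths_backed=0.8,
                    confidence=0.75, summary_quality=0.9)
assert abs(score - 88.0) < 1e-9
```

Note how the two evidence terms alone carry 70 of the 100 points: an SME with confident but unsourced findings cannot pass the gate.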
Each SME now produces domain-scoped decisions (PASS / PASS_WITH_CONDITIONS / NEEDS_CLARIFICATION / FAIL) instead of an overall deal recommendation. An SME stays in their lane.
11 domains: gpu_engineer, financial_analyst, legal_analyst, risk_underwriting, datacenter_engineer, product_manager, operations_analyst, commercial_analyst, market_analyst, security_compliance, energy_infrastructure.
ICPanelAgent — The Parliament
This is the most interesting agent. Where V2 just concatenated SME outputs, V3's IC Panel runs a 5-step synthesis pipeline:
| Step | What It Does |
|---|---|
| Aggregate | Run deterministic compiler for a baseline packet |
| Synthesize | Write a cohesive executive summary from 11 SME summaries |
| Debate | When SMEs disagree (legal says PASS, risk says FAIL), generate structured debate with evidence from each side and a reasoned resolution |
| Challenge | Validate CRITICAL-severity concerns individually — if evidence is weak or another SME has mitigating findings, downgrade the concern |
| Question Synthesis | Semantic dedup of questions across all SMEs (replaces first-100-chars matching) |
The debate and challenge steps are conditional — they only fire when there are actual conflicts or CRITICAL concerns. No unnecessary work.
Quality threshold: 90/100. This is the highest bar in the system, because the IC Panel output is what humans actually read.
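The question-synthesis step replaces prefix matching with semantic dedup. A cheap stand-in for the idea, using normalized string similarity instead of embeddings (the real system's similarity measure and threshold are not specified here), shows why it catches duplicates that first-100-chars matching would miss:

```python
from difflib import SequenceMatcher

def dedup_questions(questions: list[str], threshold: float = 0.85) -> list[str]:
    """Collapse near-duplicate questions. String similarity is a stand-in
    for true semantic (embedding-based) comparison."""
    kept: list[str] = []
    for q in questions:
        if not any(SequenceMatcher(None, q.lower(), k.lower()).ratio() >= threshold
                   for k in kept):
            kept.append(q)
    return kept

qs = [
    "What is the total CapEx for the GPU cluster?",
    "What is the total CapEx for the GPU clusters?",
    "Who holds title to the datacenter land?",
]
assert len(dedup_questions(qs)) == 2  # the two CapEx variants collapse to one
```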
ReportAgent — Draft, Critique, Revise
V2 generated reports in one shot. V3 runs three phases:
- Draft — generate initial report from the UnderwritingPacket
- Self-review — the same model critiques its own draft: Are concerns addressed? Are numbers sourced? Is the tone appropriate for the allocator type?
- Revise — apply improvements, add missing citations, fix unsourced numbers
Reports are generated per allocator bucket: EQUITY reports focus on IRR and upside scenarios, DEBT_SENIOR reports focus on credit rationale and covenant analysis.
QuestionAgent — Ask Only What You Can't Google
V2 generated all questions in one shot. V3's QuestionAgent filters out noise:
- Gap analysis — identify missing fields, conflicts, and draft questions
- Web filter — check if answers are publicly available. If they are, remove those questions (saves the operator time)
- Prioritize — rank remaining questions by deal impact, CRITICAL first
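The prioritization step amounts to a severity-first sort. In this sketch the numeric `impact` field is a hypothetical tiebreaker, not a documented part of the system:

```python
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def prioritize(questions: list[dict]) -> list[dict]:
    """Rank questions CRITICAL-first; break ties by estimated deal
    impact (a hypothetical 0..1 field, higher first)."""
    return sorted(questions,
                  key=lambda q: (SEVERITY_RANK[q["severity"]], -q["impact"]))

qs = [
    {"text": "Confirm power purchase agreement term", "severity": "HIGH",     "impact": 0.6},
    {"text": "Clarify GPU delivery schedule",         "severity": "CRITICAL", "impact": 0.9},
    {"text": "Confirm insurance carrier",             "severity": "LOW",      "impact": 0.2},
]
assert prioritize(qs)[0]["severity"] == "CRITICAL"
```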
Evidence Provenance
Every finding in V3 traces back to its source:
- Source document — which uploaded file
- Location — where in the document
- Snippet — the exact text that supports the claim
Concern severity uses a 4-level system (CRITICAL / HIGH / MEDIUM / LOW) instead of V2's binary pro/con. Every concern also gets suggested mitigations.
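Putting provenance and severity together, a finding might be shaped roughly like this. The class and field names are illustrative, chosen to mirror the three provenance elements above:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass(frozen=True)
class Evidence:
    source_document: str   # which uploaded file
    location: str          # where in the document (e.g. a page or section)
    snippet: str           # the exact text that supports the claim

@dataclass(frozen=True)
class Concern:
    summary: str
    severity: Severity
    evidence: tuple[Evidence, ...]
    suggested_mitigations: tuple[str, ...] = ()

e = Evidence("capex_model.xlsx", "tab 'Assumptions', row 14",
             "GPU unit cost: $24,500")
c = Concern("CapEx assumptions are aggressive", Severity.HIGH, (e,),
            ("Obtain a vendor quote to validate unit cost",))
assert c.evidence[0].source_document == "capex_model.xlsx"
```

An evidence-less concern is still representable (`evidence=()`), which is exactly what the SME reflection phase's evidence-backing ratio is designed to catch.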
Communication: Phased to Prevent Infinite Loops
Agents can message each other through a structured bus (REQUEST, RESPONSE, CHALLENGE, ESCALATION, INFO). But unrestricted communication would create infinite loops. So it's phased:
| Phase | Rules |
|---|---|
| 1 | No messages — SMEs evaluate independently |
| 2 | SME-to-SME only, max 2 rounds per pair |
| 3 | No messages — IC Panel reads all evaluations |
| 4 | IC Panel → SME only, max 1 round per SME |
| 5 | No messages — final aggregation |
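The phase table above reduces to a simple gate the bus can check before delivering any message. This sketch encodes the rules directly; the rule-table representation and the `rounds_used` bookkeeping are assumptions about how such a gate might be implemented:

```python
# (allowed sender kind, allowed receiver kind, max rounds) per phase;
# None means the phase is silent.
PHASE_RULES = {
    1: None,                   # SMEs evaluate independently
    2: ("sme", "sme", 2),      # SME-to-SME only, max 2 rounds per pair
    3: None,                   # IC Panel reads all evaluations
    4: ("ic_panel", "sme", 1), # IC Panel -> SME only, max 1 round per SME
    5: None,                   # final aggregation
}

def allowed(phase: int, sender: str, receiver: str, rounds_used: int) -> bool:
    """Can `sender` message `receiver` in this phase, given rounds already used?"""
    rule = PHASE_RULES.get(phase)
    if rule is None:
        return False
    s, r, max_rounds = rule
    return sender == s and receiver == r and rounds_used < max_rounds

assert allowed(2, "sme", "sme", 0)
assert not allowed(2, "sme", "sme", 2)        # round cap reached
assert not allowed(1, "sme", "sme", 0)        # silent phase
assert allowed(4, "ic_panel", "sme", 0)
```

Because every rule bounds the round count, the total number of messages per deal is finite by construction: no phase can loop.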
The Numbers
Cost per deal (Arkane Cloud test case):
| Agent | Tokens | Duration |
|---|---|---|
| SME (gpu_engineer) | 75K | ~5 min |
| IC Panel | 20K | ~30 sec |
| Report (equity) | 95K | ~3 min |
| Question Agent | 21K | ~1 min |
| Total (single SME) | ~211K | ~10 min |
Full pipeline with all 11 SMEs: a few dollars per deal -- orders of magnitude cheaper than a human analyst.
63 tests total: 58 integration tests (mocked LLM, runs in 0.1s) + 5 end-to-end tests (real API calls). The entire test suite was written in a single Claude Code session — that's a story for another post.
What Actually Changed
The shift from V2 to V3 isn't about adding more LLM calls. It's about changing the relationship between the system and its own output:
- From "call once and hope" to iterative refinement. Agents reflect on quality and re-try until a threshold is met.
- From opinions to evidence. Every claim links to a source document with location and snippet.
- From isolated evaluation to structured debate. When SMEs disagree, the conflict is resolved with reasoning — not hidden by concatenation.
- From single-pass reports to self-critiqued reports. The model reviews its own work before shipping.
- From "generate all questions" to "ask only what matters." Web filtering removes questions that have publicly available answers.
The pipeline didn't get smarter by using a better model. It got smarter by giving agents the ability to question themselves and each other.