From Pipeline to Parliament: 5 Agents That Underwrite a Deal
The Problem with Linear Pipelines
The V2 underwriting system looked clean on a whiteboard:
Raw Documents → Gemini Extraction → DealSchema → 11 SME Evals (parallel)
→ Deterministic Aggregation → Single-Pass Report → Questions
Every stage runs once. No iteration, no feedback loops, no self-assessment. The extraction either got it right or it didn't. Each of the 11 subject matter experts — GPU engineer, financial analyst, legal analyst, and eight others — evaluated in total isolation. When they disagreed, the system just... concatenated their opinions. Question deduplication was literally "compare the first 100 characters."
This worked for a while. Then we started seeing the cracks:
- No evidence tracking. An SME might flag "CapEx assumptions are aggressive" with zero link to which document, which page, which number.
- No cross-SME reasoning. Legal says PASS, risk says FAIL — the system presents both opinions side by side with no attempt to resolve the conflict.
- No quality gate. A bad extraction propagates silently through every downstream stage. Garbage in, garbage out, with no checkpoint in between.
- No fact-checking. Every claim from an LLM call was taken at face value.
The V3 Architecture: 5 Autonomous Agents
V3 replaces the linear pipeline with a system of agents that can think, verify, and argue:
ExtractionAgent (5-phase, reflective)
↓ DealSchema
11× SMEAgent (4-phase, tool-grounded)
↓ 11× SMEEvaluation
ICPanelAgent (5-step: aggregate → synthesize → debate → challenge → questions)
↓ Enhanced UnderwritingPacket
ReportAgent (3-phase: draft → self-review → revise)
QuestionAgent (3-phase: gap-analysis → web-filter → prioritize)
Each agent is an autonomous state machine. They don't just call an LLM once and return the result — they iterate through phases, reflect on their own output quality, use tools to verify claims, and only emit a final result when a quality threshold is met.
The Runtime: Observe-Think-Act-Reflect
Every agent runs inside the same AgentRuntime — a generic execution engine that implements a single loop:
```
for iteration in range(max_iterations):
    action = agent.plan(context)            # Think: what to do next?
    match action.type:
        case DONE:      break
        case TOOL_CALL: execute tool → on_tool_result()
        case LLM_CALL:  call LLM → on_llm_response()
        case REFLECT:   self-assess → on_reflection()
        case MESSAGE:   inter-agent message → on_message_sent()
```
Three design decisions that matter:
- Runtime is agent-agnostic. Any `BaseAgent` subclass plugs in. The runtime doesn't know or care whether it's running an extraction agent or a report writer.
- Context is immutable. Every `with_tool_result()`, `with_llm_response()`, `with_memory()` call returns a new instance. No aliasing bugs, no surprise mutations.
- Every step is traced. Full cost, token count, duration, tool inputs/outputs — all recorded in `AgentTrace`. You can audit exactly what happened and why.
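The immutability guarantee is easy to sketch with a frozen dataclass. This is a minimal illustration, not the actual `AgentContext`; the field names and the exact method signatures here are assumptions:

```python
from dataclasses import dataclass, replace
from typing import Any

@dataclass(frozen=True)
class AgentContext:
    """Immutable snapshot of an agent's working state (hypothetical shape)."""
    tool_results: tuple = ()
    llm_responses: tuple = ()

    def with_tool_result(self, result: Any) -> "AgentContext":
        # Returns a NEW context; the original is never mutated.
        return replace(self, tool_results=self.tool_results + (result,))

    def with_llm_response(self, response: str) -> "AgentContext":
        return replace(self, llm_responses=self.llm_responses + (response,))

ctx = AgentContext()
ctx2 = ctx.with_tool_result({"tool": "web_search", "ok": True})
assert ctx.tool_results == ()        # original untouched
assert ctx2.tool_results[0]["ok"]    # new instance carries the result
```

Because every `with_*` call returns a fresh instance, a reflection step can safely hold a reference to an earlier context without fear of it changing underneath.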
The Five Agents
ExtractionAgent — Don't Extract Once, Extract Until It's Right
V2 ran Gemini extraction once and hoped for completeness. V3's ExtractionAgent has five phases:
- Bulk extract — run the legacy extraction pipeline for an initial DealSchema
- Reflect — self-assess completeness. What fields are missing? What looks wrong?
- Targeted reads — re-read specific documents for the top-5 critical missing fields
- Verify — cross-reference high-value fields (operator name, GPU model, total CapEx) across documents, web search, and fact-checking
- Final reflect — quality gate before emitting
Model: gemini-3.1-pro. Quality threshold: 80/100.
SMEAgent — 11 Domain Experts That Show Their Work
V2's SME was a single LLM call producing an opinionated blob. V3's SMEAgent runs four phases per domain:
- Initial evaluation — structured analysis with evidence-linked strengths and concerns
- Reflect — check evidence backing ratio (target: 80%+), confidence (target: 0.7+)
- Tool verification — use `query_schema`, `web_search`, `calculate` to back up findings
- Final reflect — quality gate
The quality formula weights evidence heavily: concerns_backed × 40 + strengths_backed × 30 + confidence × 20 + summary_quality × 10.
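Treating each input as a 0–1 ratio (an assumption about the scaling) makes the weighting concrete:

```python
def sme_quality(concerns_backed: float, strengths_backed: float,
                confidence: float, summary_quality: float) -> float:
    """Evidence-weighted quality score; all inputs are assumed 0..1 ratios."""
    return (concerns_backed * 40 + strengths_backed * 30
            + confidence * 20 + summary_quality * 10)

# An SME whose concerns are fully evidence-backed scores well even
# with only moderate confidence: 40 + 24 + 15 + 9 = 88
score = sme_quality(concerns_backed=1.0, strengths_backed=0.8,
                    confidence=0.75, summary_quality=0.9)
assert abs(score - 88.0) < 1e-9
```

Note how the two evidence terms alone carry 70 of the 100 points: an SME with confident but unsourced findings cannot pass the gate.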
Each SME now produces domain-scoped decisions (PASS / PASS_WITH_CONDITIONS / NEEDS_CLARIFICATION / FAIL) instead of an overall deal recommendation. An SME stays in their lane.
11 domains: gpu_engineer, financial_analyst, legal_analyst, risk_underwriting, datacenter_engineer, product_manager, operations_analyst, commercial_analyst, market_analyst, security_compliance, energy_infrastructure.
ICPanelAgent — The Parliament
This is the most interesting agent. Where V2 just concatenated SME outputs, V3's IC Panel runs a 5-step synthesis pipeline:
| Step | What It Does |
|---|---|
| Aggregate | Run deterministic compiler for a baseline packet |
| Synthesize | Write a cohesive executive summary from 11 SME summaries |
| Debate | When SMEs disagree (legal says PASS, risk says FAIL), generate structured debate with evidence from each side and a reasoned resolution |
| Challenge | Validate CRITICAL-severity concerns individually — if evidence is weak or another SME has mitigating findings, downgrade the concern |
| Question Synthesis | Semantic dedup of questions across all SMEs (replaces first-100-chars matching) |
The debate and challenge steps are conditional — they only fire when there are actual conflicts or CRITICAL concerns. No unnecessary work.
Quality threshold: 90/100. This is the highest bar in the system, because the IC Panel output is what humans actually read.
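The question-synthesis step replaces prefix matching with semantic dedup. A cheap stand-in for the idea, using normalized string similarity instead of embeddings (the real system's similarity measure and threshold are not specified here), shows why it catches duplicates that first-100-chars matching would miss:

```python
from difflib import SequenceMatcher

def dedup_questions(questions: list[str], threshold: float = 0.85) -> list[str]:
    """Collapse near-duplicate questions. String similarity is a stand-in
    for true semantic (embedding-based) comparison."""
    kept: list[str] = []
    for q in questions:
        if not any(SequenceMatcher(None, q.lower(), k.lower()).ratio() >= threshold
                   for k in kept):
            kept.append(q)
    return kept

qs = [
    "What is the total CapEx for the GPU cluster?",
    "What is the total CapEx for the GPU clusters?",
    "Who holds title to the datacenter land?",
]
assert len(dedup_questions(qs)) == 2  # the two CapEx variants collapse to one
```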
ReportAgent — Draft, Critique, Revise
V2 generated reports in one shot. V3 runs three phases:
- Draft — generate initial report from the UnderwritingPacket
- Self-review — the same model critiques its own draft: Are concerns addressed? Are numbers sourced? Is the tone appropriate for the allocator type?
- Revise — apply improvements, add missing citations, fix unsourced numbers
Reports are generated per allocator bucket: EQUITY reports focus on IRR and upside scenarios, DEBT_SENIOR reports focus on credit rationale and covenant analysis.
QuestionAgent — Ask Only What You Can't Google
V2 generated all questions in one shot. V3's QuestionAgent filters out noise:
- Gap analysis — identify missing fields, conflicts, and draft questions
- Web filter — check if answers are publicly available. If they are, remove those questions (saves the operator time)
- Prioritize — rank remaining questions by deal impact, CRITICAL first
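The prioritization step amounts to a severity-first sort. In this sketch the numeric `impact` field is a hypothetical tiebreaker, not a documented part of the system:

```python
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def prioritize(questions: list[dict]) -> list[dict]:
    """Rank questions CRITICAL-first; break ties by estimated deal
    impact (a hypothetical 0..1 field, higher first)."""
    return sorted(questions,
                  key=lambda q: (SEVERITY_RANK[q["severity"]], -q["impact"]))

qs = [
    {"text": "Confirm power purchase agreement term", "severity": "HIGH",     "impact": 0.6},
    {"text": "Clarify GPU delivery schedule",         "severity": "CRITICAL", "impact": 0.9},
    {"text": "Confirm insurance carrier",             "severity": "LOW",      "impact": 0.2},
]
assert prioritize(qs)[0]["severity"] == "CRITICAL"
```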
Evidence Provenance
Every finding in V3 traces back to its source:
- Source document — which uploaded file
- Location — where in the document
- Snippet — the exact text that supports the claim
Concern severity uses a 4-level system (CRITICAL / HIGH / MEDIUM / LOW) instead of V2's binary pro/con. Every concern also gets suggested mitigations.
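Putting provenance and severity together, a finding might be shaped roughly like this. The class and field names are illustrative, chosen to mirror the three provenance elements above:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass(frozen=True)
class Evidence:
    source_document: str   # which uploaded file
    location: str          # where in the document (e.g. a page or section)
    snippet: str           # the exact text that supports the claim

@dataclass(frozen=True)
class Concern:
    summary: str
    severity: Severity
    evidence: tuple[Evidence, ...]
    suggested_mitigations: tuple[str, ...] = ()

e = Evidence("capex_model.xlsx", "tab 'Assumptions', row 14",
             "GPU unit cost: $24,500")
c = Concern("CapEx assumptions are aggressive", Severity.HIGH, (e,),
            ("Obtain a vendor quote to validate unit cost",))
assert c.evidence[0].source_document == "capex_model.xlsx"
```

An evidence-less concern is still representable (`evidence=()`), which is exactly what the SME reflection phase's evidence-backing ratio is designed to catch.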
Communication: Phased to Prevent Infinite Loops
Agents can message each other through a structured bus (REQUEST, RESPONSE, CHALLENGE, ESCALATION, INFO). But unrestricted communication would create infinite loops. So it's phased:
| Phase | Rules |
|---|---|
| 1 | No messages — SMEs evaluate independently |
| 2 | SME-to-SME only, max 2 rounds per pair |
| 3 | No messages — IC Panel reads all evaluations |
| 4 | IC Panel → SME only, max 1 round per SME |
| 5 | No messages — final aggregation |
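The phase table above reduces to a simple gate the bus can check before delivering any message. This sketch encodes the rules directly; the rule-table representation and the `rounds_used` bookkeeping are assumptions about how such a gate might be implemented:

```python
# (allowed sender kind, allowed receiver kind, max rounds) per phase;
# None means the phase is silent.
PHASE_RULES = {
    1: None,                   # SMEs evaluate independently
    2: ("sme", "sme", 2),      # SME-to-SME only, max 2 rounds per pair
    3: None,                   # IC Panel reads all evaluations
    4: ("ic_panel", "sme", 1), # IC Panel -> SME only, max 1 round per SME
    5: None,                   # final aggregation
}

def allowed(phase: int, sender: str, receiver: str, rounds_used: int) -> bool:
    """Can `sender` message `receiver` in this phase, given rounds already used?"""
    rule = PHASE_RULES.get(phase)
    if rule is None:
        return False
    s, r, max_rounds = rule
    return sender == s and receiver == r and rounds_used < max_rounds

assert allowed(2, "sme", "sme", 0)
assert not allowed(2, "sme", "sme", 2)        # round cap reached
assert not allowed(1, "sme", "sme", 0)        # silent phase
assert allowed(4, "ic_panel", "sme", 0)
```

Because every rule bounds the round count, the total number of messages per deal is finite by construction: no phase can loop.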
The Numbers
Cost per deal (Arkane Cloud test case):
| Agent | Tokens | Duration |
|---|---|---|
| SME (gpu_engineer) | 75K | ~5 min |
| IC Panel | 20K | ~30 sec |
| Report (equity) | 95K | ~3 min |
| Question Agent | 21K | ~1 min |
| Total (single SME) | ~211K | ~10 min |
Full pipeline with all 11 SMEs: a few dollars per deal -- orders of magnitude cheaper than a human analyst.
63 tests total: 58 integration tests (mocked LLM, runs in 0.1s) + 5 end-to-end tests (real API calls). The entire test suite was written in a single Claude Code session — that's a story for another post.
What Actually Changed
The shift from V2 to V3 isn't about adding more LLM calls. It's about changing the relationship between the system and its own output:
- From "call once and hope" to iterative refinement. Agents reflect on quality and re-try until a threshold is met.
- From opinions to evidence. Every claim links to a source document with location and snippet.
- From isolated evaluation to structured debate. When SMEs disagree, the conflict is resolved with reasoning — not hidden by concatenation.
- From single-pass reports to self-critiqued reports. The model reviews its own work before shipping.
- From "generate all questions" to "ask only what matters." Web filtering removes questions that have publicly available answers.
The pipeline didn't get smarter by using a better model. It got smarter by giving agents the ability to question themselves and each other.