Two Years Building AI Agents for Financial Services: Architecture Lessons from Fintool
In regulated industries, AI accuracy is an architectural problem that must be designed for from the start — companies that treat it as a fine-tuning step will find the production gap far more costly than anticipated.
By Nicolas Bustamante
Building AI agents that actually work in financial services is a different problem than building AI agents that impress in a demo. After two years developing Fintool — an AI agent platform that now serves clients like Kennedy Capital and PwC, and achieves 97% accuracy on the rigorous FinanceBench benchmark — here are the architectural lessons that matter most.
The Demo-to-Production Gap Is Larger in Regulated Industries
Every software project has a gap between a working prototype and a production system. In financial services, that gap is substantially wider. Regulatory requirements, auditability expectations, and the cost of errors all conspire to make the last 20% of accuracy the hardest and most important work you will do.
Financial services does not tolerate the same error rates that might be acceptable in a consumer application. A hallucinated figure in a financial analysis is not a minor UX failure — it is a compliance risk and a trust-destroying event. This forces a different engineering posture from day one.
Context Management for Financial Documents
Financial documents — 10-Ks, earnings transcripts, fund prospectuses — are long, dense, and structurally complex. Naive retrieval strategies that work well for general knowledge bases break down when precision matters at the level of a specific footnote or line item.
Effective context management in this domain means being deliberate about what the agent sees, when it sees it, and how that context is structured. Chunking strategies, document hierarchy preservation, and targeted retrieval based on query intent all become first-order concerns rather than implementation details.
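As an illustration of hierarchy preservation, here is a minimal sketch of a chunker that keeps each chunk attached to its heading path, so a retrieved footnote arrives with its full document context. The `Chunk` structure and `chunk_filing` helper are hypothetical, not Fintool's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    # Breadcrumb of section headings, e.g. ["Item 8", "Note 12: Leases"],
    # so the retriever can surface a line item with its place in the filing.
    path: list[str] = field(default_factory=list)

def chunk_filing(sections: list[tuple[list[str], str]],
                 max_chars: int = 2000) -> list[Chunk]:
    """Split each section body into chunks, preserving its heading path."""
    chunks: list[Chunk] = []
    for path, body in sections:
        buf = ""
        for para in body.split("\n\n"):
            # Flush the buffer before it grows past the chunk budget.
            if buf and len(buf) + len(para) > max_chars:
                chunks.append(Chunk(text=buf.strip(), path=path))
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(Chunk(text=buf.strip(), path=path))
    return chunks
```

In practice the heading path can also be prepended to the chunk text before embedding, so semantically similar footnotes from different sections remain distinguishable to the retriever.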
Tool Design for Structured Data Extraction
Language models produce unstructured output, so they need structured scaffolding when operating on financial data. The design of the tools agents use — how they call calculations, how they query tables, how they cross-reference figures across documents — determines whether the system is reliable or merely plausible-sounding.
Tool design is where domain expertise becomes engineering constraint. The tools have to reflect how a skilled financial analyst actually navigates these documents, not how a generalist might assume they do.
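One way to make that concrete: give the agent a lookup tool that returns figures from a parsed statement rather than letting it quote numbers from raw text, so every cited figure traces back to a structured source. The schema and `lookup_line_item` function below are an illustrative sketch, not Fintool's actual tools:

```python
# Hypothetical tool schema the agent would call instead of reading
# numbers out of unstructured text.
LOOKUP_TOOL = {
    "name": "lookup_line_item",
    "description": "Fetch a reported figure from a parsed financial statement.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string"},
            "concept": {"type": "string",
                        "enum": ["revenue", "net_income", "total_debt"]},
            "period": {"type": "string",
                       "description": "e.g. 'FY2023' or 'Q2-2024'"},
        },
        "required": ["ticker", "concept", "period"],
    },
}

def lookup_line_item(store: dict, ticker: str, concept: str, period: str) -> dict:
    """Deterministic lookup: returns the value plus its source citation."""
    record = store.get((ticker, concept, period))
    if record is None:
        # Return an explicit miss rather than letting the model improvise.
        return {"found": False}
    value, source = record
    return {"found": True, "value": value, "source": source}
```

The explicit miss is the important design choice: a plausible-sounding agent fills gaps; a reliable one reports them.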
Embedding Domain Expertise in Agent Prompts
Prompt engineering in accuracy-critical domains is not about clever phrasing. It is about encoding the judgment of subject matter experts directly into how the agent reasons. This means understanding what a competent analyst checks, what edge cases exist in financial reporting, and where ambiguity in source documents typically hides.
The agents that perform at 97% on FinanceBench are not smarter models — they are models guided by well-structured domain knowledge at every step of their reasoning process.
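A sketch of what "encoding judgment" can look like in practice: an analyst's verification checklist rendered into the system prompt, so the checks run on every answer rather than living in one expert's head. The checklist items here are illustrative examples of financial-reporting pitfalls, not Fintool's actual prompt:

```python
# Hypothetical analyst checks, written down once and applied to every query.
ANALYST_CHECKS = [
    "Confirm whether figures are GAAP or non-GAAP, and say which.",
    "Check whether the period is fiscal or calendar, and label it.",
    "If a figure was restated in a later filing, prefer the restatement and note it.",
    "Never compute a ratio from figures reported in different units.",
]

def build_system_prompt(checks: list[str]) -> str:
    """Render the expert checklist into a numbered system prompt."""
    numbered = "\n".join(f"{i}. {check}" for i, check in enumerate(checks, 1))
    return (
        "You are a financial analysis agent. Before answering, verify each of "
        "the following, and state explicitly any check that cannot be "
        "verified from the source documents:\n" + numbered
    )
```

The value of this structure is maintainability: when a new edge case surfaces in evaluation, it becomes one more checklist line, reviewed by a domain expert, rather than an ad-hoc prompt tweak.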
Evaluation Frameworks That Match the Stakes
You cannot improve what you do not measure, and general-purpose LLM benchmarks do not capture what matters in finance. Building domain-specific evaluation frameworks — with test sets drawn from real financial documents and graded against expert answers — is not optional. It is the feedback loop that makes systematic improvement possible.
Evaluation in accuracy-critical domains also needs to distinguish between types of failure. A missing figure and a wrong figure are not the same kind of error. Your evaluation framework should reflect that distinction.
What This Means for Leaders
If you are considering AI transformation in a regulated or accuracy-critical domain, the core lesson from Fintool's two years is this: the technical architecture has to be designed around the accuracy requirements of your domain, not retrofitted to meet them later. That means investing in evaluation infrastructure early, involving domain experts in agent design, and treating the demo-to-production gap as a known cost to plan for — not a surprise to manage around.