Tidalwave and Columbia University’s DAPLab Release First Public Benchmark for AI Accuracy in Mortgage Origination
Joint study finds Tidalwave’s mortgage-trained SOLO scored 95% on underwriting compliance checks where a generic LLM scored 42%
NEW YORK--(BUSINESS WIRE)--Tidalwave, the agentic AI mortgage platform, and Columbia University’s DAPLab today released results from the first public benchmark measuring AI accuracy on real mortgage origination tasks. The study evaluated Tidalwave’s SOLO against Anthropic’s Claude 4.5 on realistic questions loan officers ask during loan origination, and found that loan officers using the mortgage-trained SOLO receive significantly more accurate answers to underwriting questions than when using a general-purpose large language model.
The study tested Tidalwave’s SOLO against Anthropic’s Claude 4.5 on 90 questions that loan officers routinely ask during the origination process: Does the payroll match the stated employer? Are there buy-now-pay-later payments? Could any deposits be from a foreign source? Is there undisclosed income?
On boolean verification questions, the type that flags payroll mismatches, undisclosed liabilities, and suspicious transactions, Tidalwave’s SOLO scored 95% versus 42% for the baseline model.
Tidalwave’s SOLO scored 84% overall accuracy compared to 71% for Claude 4.5. The widest gap was on yes/no compliance checks, the questions that determine whether a loan gets flagged or approved.
Benchmark results

| Question Type | Tidalwave’s SOLO | Anthropic’s Claude 4.5 |
| --- | --- | --- |
| Yes/no compliance checks | 95% | 42% |
| Transaction identification | 83% | 80% |
| Account verification | 67% | 86% |
| Overall accuracy | 84% | 71% |

Source: Tidalwave / Columbia University Benchmark Technical Report, 2025–2026. Measured by F1 score across 90 questions and 10 borrower scenarios.
Why the compliance gap matters
Yes/no compliance checks are the backbone of loan quality review. They’re the questions that catch payroll mismatches, undisclosed debts, suspicious deposit patterns, and structurally inconsistent bank statements. A 42% accuracy rate means a general-purpose model produces the wrong answer more often than the right one on exactly the questions where errors lead to bad loans, compliance violations, or missed fraud.
The gap exists because general-purpose models process a loan file as raw text. Tidalwave’s SOLO is integrated with Fannie Mae and Freddie Mac underwriting systems and trained on structured mortgage data, including ULAD (Uniform Loan Application Dataset) files and bank statement transaction records. It understands what the numbers in a loan file mean, not just what they say.
Tidalwave’s SOLO scored lower than Claude 4.5 in one category, account verification (67% vs. 86%). The company attributes this gap to its practice of stripping personally identifiable information (PII) from SOLO-enabled AI interactions and says its next-generation capability is designed to close that performance gap while safeguarding sensitive data.
Why this benchmark matters now
Loan officers across the U.S. are already using general-purpose AI tools to work through 500+ page loan files and 43-day closing timelines. The average lender loses $600+ per loan originated. The pressure to adopt AI is real. But until now, no public benchmark has measured whether the AI tools lenders are adopting actually produce accurate answers on the questions that matter for loan quality. This study is the first to put a number on that gap.
“42% on compliance questions should worry every lender relying on off-the-shelf AI right now,” said Diane Yu, co-founder and CEO of Tidalwave. “When I was building technology at Better.com, I watched general-purpose tools fail on mortgage data over and over. They’d miss a payroll mismatch or fail to recognize a deposit, and a human had to catch it every time. That’s why we built Tidalwave’s SOLO differently, and that’s why we tested it with Columbia University, not internally. If you’re going to tell lenders your AI is accurate, you should be willing to prove it publicly.”
Yu previously co-founded FreeWheel, a video ad-tech company acquired by Comcast for $320M. She later served as CTO at digital mortgage lender Better (NASDAQ: BETR) before starting Tidalwave.
Study methodology
The benchmark was conducted in fall and winter 2025 as a collaboration between Tidalwave’s engineering team and researchers at Columbia University. Researchers built 90 questions across 10 borrower scenarios, each with a complete loan application file (ULAD) and up to two months of bank statement transaction data. All borrower data was fully synthetic, constructed from synthetic Plaid data to protect privacy.
A mortgage industry subject-matter expert designed all questions from actual usage patterns of Tidalwave’s SOLO. The benchmark intentionally included edge cases, such as foreign transactions, mismatches between bank statements and applications, and deposits from lesser-known vendors, to test agents under realistic conditions. Performance was measured using F1 score, a standard accuracy metric that gives partial credit for partially correct answers on list-type questions and uses binary scoring on yes/no questions.
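To make the scoring concrete, here is a minimal sketch of how F1-based grading of this kind typically works. This is illustrative only, not the study’s actual evaluation code: the function names and the example answers are hypothetical, but the scoring logic (set-based F1 with partial credit on list answers, strict binary scoring on yes/no answers) matches the standard definition of F1 described above.

```python
# Illustrative sketch of F1-based benchmark scoring (not the study's
# actual code). List-type answers earn partial credit via F1; yes/no
# answers are scored 1 or 0.

def f1_score(predicted, gold):
    """F1 between a predicted and a gold set of answer items."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: trivially correct
    tp = len(predicted & gold)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def score_answer(question_type, predicted, gold):
    """Binary score for yes/no questions, F1 for list-type questions."""
    if question_type == "yes_no":
        return 1.0 if predicted == gold else 0.0
    return f1_score(predicted, gold)
```

For example, a model that identifies two of three flagged deposits plus one false positive has precision 2/3 and recall 2/3, for an F1 of about 0.67, whereas a wrong answer on a yes/no compliance check scores 0 outright.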
“We partnered with Tidalwave on this benchmark to reflect the actual decision points loan officers face during origination, not abstract NLP tasks,” said Zhou Yu, Associate Professor at Columbia University. “By using realistic borrower scenarios, synthetic but structured data, and F1 scoring on both retrieval and yes/no checks, we can see where systems truly help loan officers and where they quietly fail. We hope this becomes a template for evaluating AI in other high-stakes, regulated workflows as well.”
The full technical report, including dataset statistics and failure mode analysis, is available here.
About Tidalwave: Tidalwave is an agentic AI platform that automates the full mortgage lifecycle, from application through closing. The company integrates directly with Fannie Mae DU and Freddie Mac LPA, along with verification partners Plaid, Argyle, and Truv. Lenders on the platform have automated up to 70% of manual tasks, cut processing from 45 days to under 15, and saved up to $1,500 per loan. Customers include DHI Mortgage (D.R. Horton’s lending arm, 70,000+ loans annually), NEXA Mortgage (3,200+ loan officers), and First Colony Mortgage. Tidalwave raised a $22M Series A in November 2025 and was named to the 2026 HousingWire Tech100. For more information, visit tidalwave.ai.
About Columbia University’s DAPLab: The Data, Agents, and Processes Lab (DAPLab) at Columbia University is building the foundations for a future where AI agents safely and reliably automate complex work. DAPLab is co-directed by Eugene Wu and Zhou Yu and includes 16 Columbia faculty members. DAPLab brings together researchers in data systems, applied AI, operating systems, HCI, algorithms, and business to invent the infrastructure, algorithms, and design principles needed to deploy agents in the real world. DAPLab's work spans the full stack, from systems and training frameworks to human–agent interaction, process automation, and digital twins. The lab builds open-source tools, collaborates closely with industry partners, and prototypes agentic technologies that reimagine how work gets done. Based in the heart of New York City, home to some of the world’s largest enterprises, DAPLab is uniquely positioned to explore how agent automation transforms real organizational processes. DAPLab is a home for students, researchers, and partners who want to shape the next generation of AI-native systems.
Contacts
Tanya Gillogley
tgillogley@tidalhq.com
