arxiv:2603.08262

FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

Published on Mar 9 · Submitted by Jiaxuan Lu on Mar 18

Abstract

FinToolBench presents the first real-world benchmark for evaluating financial tool learning agents, featuring 760 executable tools and comprehensive evaluation criteria beyond simple execution success.

AI-generated summary

The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.
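The abstract describes an evaluation framework that gates on execution success and then scores finance-critical dimensions (timeliness, intent type, regulatory domain alignment). As an illustrative sketch only, one way such a scorer could be structured is below; every name and the equal weighting are assumptions for exposition, not the paper's actual evaluation code.

```python
from dataclasses import dataclass

# Hypothetical record of one agent tool call; field names are
# illustrative and are NOT taken from the FinToolBench release.
@dataclass
class ToolCallResult:
    executed_ok: bool        # did the tool call run without error?
    data_fresh: bool         # was returned data within the required recency window?
    intent_matched: bool     # did the chosen tool match the query's intent type?
    domain_compliant: bool   # did the call stay within the allowed regulatory domain?

def score(result: ToolCallResult) -> float:
    """Combine the finance-critical dimensions into a single score.

    Execution success acts as a gate: a failed call scores 0 regardless
    of the other dimensions. The remaining checks are weighted equally
    here purely for illustration.
    """
    if not result.executed_ok:
        return 0.0
    checks = (result.data_fresh, result.intent_matched, result.domain_compliant)
    return sum(checks) / len(checks)
```

For example, a call that executes and is fresh and compliant but mismatches the query intent would score 2/3 under this sketch; the actual FinToolBench protocol may weight or report these dimensions differently.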

Community

Paper author and submitter:

We introduce FinToolBench, a benchmark for evaluating LLM agents in realistic financial tool-use scenarios. It focuses not only on tool-calling capability, but also on finance-specific requirements such as timeliness, intent alignment, and domain compliance.

We release a runnable benchmark with real-world financial tools, evaluation protocols, and a finance-aware baseline (FATR).

GitHub: https://github.com/Double-wk/FinToolBench

