AI Accounting Benchmark: New OpenAI GPT-5.4 Model Tops 19 AI Systems in Real Accounting Workflow Test

Woosung Chun
CFO, DualEntry

Woosung Chun is the CFO of DualEntry, with experience in corporate finance, accounting, strategy, and acquisitions. He previously built from scratch and led the M&A and Finance teams at Benitago, completing more than 12 acquisitions in 2 years. He graduated with a BS from NYU Stern.

Learn about our editorial policies.
Last updated: March 11, 2026


Independent benchmark across 101 accounting tasks finds leading AI models still struggle to reach enterprise-grade financial accuracy

NEW YORK — March 2026

DualEntry today released the results of a large-scale benchmark evaluating how modern AI models perform across real accounting workflows. The benchmark tested 19 leading AI models on 101 domain-specific accounting tasks, covering transaction classification, journal entry creation, bank reconciliation, financial reporting, and month-end close operations.

The newly released OpenAI GPT-5.4 model achieved the highest overall accuracy at 77.3%, significantly outperforming other models tested in the benchmark.

Despite rapid improvements in reasoning models, the results highlight ongoing reliability gaps in financial automation: no model exceeded 80% accuracy, and most systems failed more than one-third of accounting tasks.

Key Findings

  • OpenAI GPT-5.4 achieved the highest accuracy at 77.3%.
  • The second-best model, Gemini 3.1 Pro, scored 66.0%, more than 11 percentage points behind GPT-5.4.
  • Most models scored below 65% accuracy across accounting workflows.
  • Older models such as GPT-4 scored only 19.8% on the same task set.
  • Even the best performing model still fails roughly 1 in 4 accounting tasks.

AI Model Accuracy on Accounting Workflows

Accuracy of leading AI models across 101 accounting workflow tasks in the DualEntry benchmark.

AI Accounting Benchmark Leaderboard

Rank  Model              Accuracy
1     OpenAI GPT-5.4     77.3%
2     Gemini 3.1 Pro     66.0%
3     Z.ai GLM-5         65.3%
4     MiniMax M2.5       65.3%
5     Claude Sonnet 4.6  63.4%
6     Claude Haiku 4.5   61.4%
7     Claude Sonnet 4.5  59.4%
8     OpenAI GPT-5.2     58.4%
9     OpenAI GPT-5.1     57.4%
10    Qwen3 Coder Next   57.4%

Full Model Comparison

Top models in the DualEntry AI Accounting Benchmark. Full leaderboard includes 19 models.

Models That Struggled Most

Model                  Accuracy
Claude Opus 4.6        38.6%
Nemotron Nano 12B      32.7%
Gemini 2.5 Flash Lite  27.7%
GPT-4                  19.8%
GPT-4-0613             19.8%

The results illustrate the rapid evolution of reasoning models, with newer models significantly outperforming earlier generations.

What This Means for Enterprise AI

Large language models are increasingly capable at generating structured text, categorizing transactions, and drafting journal entries. These capabilities can accelerate repetitive accounting tasks such as first-pass transaction classification and draft financial reporting.

However, accounting systems do not run on drafts.

Financial operations depend on validated records, entries that balance, reconciliations that resolve to zero, and reports that withstand audit scrutiny. The distance between a plausible draft and a validated record is where operational risk emerges.
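The system-level checks described above can be made concrete. The following is a minimal illustrative sketch (not DualEntry's actual implementation) of two such validations: a journal entry whose debits must equal its credits, and a reconciliation that must resolve to zero. The function names and tuple layout are assumptions for illustration; `Decimal` is used because binary floats are unsuitable for money.

```python
from decimal import Decimal

def entry_is_balanced(lines):
    """Check that a journal entry's total debits equal its total credits.

    `lines` is a list of (account, debit, credit) tuples with amounts
    given as strings, e.g. ("Cash", "100.00", "0.00").
    """
    total_debit = sum(Decimal(d) for _, d, _ in lines)
    total_credit = sum(Decimal(c) for _, _, c in lines)
    return total_debit == total_credit

def reconciliation_resolves(book_balance, bank_balance, reconciling_items):
    """A reconciliation 'resolves to zero' when the book balance,
    adjusted for signed reconciling items, matches the bank balance."""
    adjusted = Decimal(book_balance) + sum(Decimal(x) for x in reconciling_items)
    return adjusted == Decimal(bank_balance)

# A balanced entry passes validation; an unbalanced draft is flagged.
ok = entry_is_balanced([("Cash", "100.00", "0.00"),
                        ("Revenue", "0.00", "100.00")])
bad = entry_is_balanced([("Cash", "100.00", "0.00"),
                         ("Revenue", "0.00", "99.00")])
```

Gating model-drafted entries behind checks like these is what turns a plausible draft into a record that can safely post to the ledger.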

“Large language models are powerful drafting tools, but finance doesn’t run on drafts; it runs on validated records,” said Santiago Nestares, co-founder of DualEntry. “The benchmark shows that AI can accelerate accounting workflows, but without system-level controls and validation, errors can quickly cascade through financial reporting.”

Benchmark Methodology

The benchmark was designed as a task-oriented evaluation of real accounting workflows, rather than trivia-style knowledge questions.

A total of 101 accounting tasks were constructed using a provisioned chart of accounts and minimal context designed to simulate real operational environments.

Tasks were divided into eight workflow categories.

Category                    Questions  What It Tests
Transaction Classification  13         Mapping bank transactions to the correct chart of accounts
Journal Entry Creation      13         Creating balanced journal entries
Accounts Payable            13         Bills, vendor payments, credits
Accounts Receivable         12         Invoices, customer payments
Bank Reconciliation         12         Identifying reconciling items
Financial Reporting         13         Ratios, cash flow, balance sheet analysis
Month-End Close             12         Accruals, deferrals, depreciation
AI Accounting Knowledge     13         Conceptual accounting knowledge

Each model was tested in an isolated environment with no connection to external financial systems.

All responses were graded using deterministic binary scoring (each answer marked correct or incorrect), with no partial credit or subjective interpretation.

Multiple runs per model were permitted to compute overall accuracy and difficulty tiers.
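Under binary scoring, the aggregation is straightforward: a model's accuracy is the share of tasks answered correctly, averaged over its runs. The sketch below illustrates this with hypothetical run data (the numbers are invented for the example and are not the benchmark's actual per-run scores).

```python
def accuracy(results):
    """Binary-scored results: each item is True (correct) or False
    (incorrect). Accuracy is the share of correct answers."""
    return sum(results) / len(results)

# Hypothetical example: one model graded on 101 tasks across two runs.
runs = [
    [True] * 78 + [False] * 23,   # run 1: 78/101 correct
    [True] * 76 + [False] * 25,   # run 2: 76/101 correct
]

# Overall accuracy is the mean per-run accuracy.
overall = sum(accuracy(r) for r in runs) / len(runs)
```

Averaging per-run accuracy rather than grading a single run reduces the variance introduced by non-deterministic model outputs.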

Tested AI Providers

The benchmark evaluated models from leading AI developers including:

  • OpenAI
  • Google
  • Anthropic
  • Alibaba
  • Zhipu AI
  • MiniMax
  • Moonshot AI
  • NVIDIA

A total of 19 models were evaluated.

Full Benchmark Results

Full benchmark results and model comparisons are available here:

https://www.dualentry.com/accounting-ai-benchmark

About DualEntry

DualEntry is an AI-native ERP platform designed for companies scaling from mid-market to IPO. The platform embeds AI directly into financial workflows including journal entry drafting, reconciliation, and reporting while maintaining the validation controls required for enterprise accounting.

Media Contact

Ana Marturet
anam@dualentry.com

Website: https://www.dualentry.com/
