
jfinqa

Japanese Financial Numerical Reasoning QA Benchmark.


What is this?

jfinqa is a benchmark for evaluating LLMs on Japanese financial numerical reasoning. Unlike existing benchmarks that focus on classification or simple lookup, jfinqa requires multi-step arithmetic over financial statement tables extracted from real Japanese corporate disclosures (EDINET). Questions include DuPont decomposition (6-step), growth rate calculations, and cross-statement ratio analysis.
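To illustrate the kind of multi-step arithmetic involved, here is a DuPont-style ROE decomposition with made-up figures (a sketch only; the numbers are not from the dataset, and the benchmark's actual step breakdown may differ):

```python
# DuPont decomposition: ROE = net margin x asset turnover x financial leverage.
# All figures are hypothetical (millions of yen), not taken from the benchmark.
net_income = 120_000
sales = 1_500_000
total_assets = 2_000_000
equity = 800_000

net_margin = net_income / sales           # profitability
asset_turnover = sales / total_assets     # efficiency
leverage = total_assets / equity          # financial leverage
roe = net_margin * asset_turnover * leverage
roe_pct = roe * 100

# Sanity check: the decomposition must equal NI / equity.
assert abs(roe - net_income / equity) < 1e-12
print(f"{roe_pct:.1f}%")  # 15.0%
```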

Three Subtasks

| Subtask | Description | Example |
|---|---|---|
| Numerical Reasoning | Calculate financial metrics from table data | 「2024年3月期の売上高成長率は何%か?」(What was the revenue growth rate in FY2024/3?) |
| Consistency Checking | Verify internal consistency of reported figures | 「資産合計は流動資産と固定資産の合計と一致するか?」(Do total assets equal the sum of current and fixed assets?) |
| Temporal Reasoning | Analyze trends and changes across periods | 「売上高が最も低かったのはどの年度か?」(In which fiscal year was revenue lowest?) |
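A consistency-checking question reduces to an identity test over reported figures. A minimal sketch with hypothetical numbers (not from the dataset):

```python
# Does 資産合計 (total assets) equal 流動資産 (current) + 固定資産 (fixed)?
# Hypothetical figures in thousands of yen.
current_assets = 900_000
fixed_assets = 1_100_000
total_assets = 2_000_000

# Allow a small slack for rounding in published statements.
consistent = abs(total_assets - (current_assets + fixed_assets)) <= 1
print("一致する" if consistent else "一致しない")  # 一致する
```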

Dataset Statistics

| | Total | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
|---|---|---|---|---|
| Questions | 1000 | 550 | 200 | 250 |
| Companies | 68 | | | |
| Accounting standards | J-GAAP 58%, IFRS 38%, US-GAAP 4% | | | |
| Avg. program steps | 2.59 | 2.84 | 2.00 | 2.54 |
| Avg. table rows | 13.3 | | | |
| Max program steps | 6 (DuPont) | | | |

Baseline Results

| Model | Overall | Numerical Reasoning | Consistency Checking | Temporal Reasoning |
|---|---|---|---|---|
| GPT-4o | 87.0% | 80.2% | 90.5% | 99.2% |
| Gemini 2.0 Flash | 80.4% | 86.2% | 83.5% | 65.2% |
| GPT-4o-mini | 67.7% | 79.3% | 83.5% | 29.6% |
| Qwen2.5-3B-Instruct | 39.6% | 46.4% | 51.0% | 15.6% |

1000 questions, zero-shot, temperature = 0. Evaluation uses numerical matching with a 1% tolerance. Qwen2.5-3B-Instruct was run locally with MLX (4-bit quantization).
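The tolerance check can be sketched as follows; `parse_number` and `numeric_match` are hypothetical names, and the 1% tolerance is assumed here to be relative (the package's actual matcher may differ):

```python
def parse_number(s: str) -> float:
    """Strip common decorations (commas, '%') and parse. A sketch, not jfinqa's parser."""
    return float(s.replace(",", "").rstrip("%"))

def numeric_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """True if the prediction is within 1% relative tolerance of the gold answer."""
    try:
        p, g = parse_number(pred), parse_number(gold)
    except ValueError:
        return False
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= rel_tol

print(numeric_match("25.2%", "25.0%"))  # 0.8% off -> True
print(numeric_match("26.0%", "25.0%"))  # 4% off -> False
```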

View full leaderboard →

Error Analysis

Systematic error analysis revealed both benchmark design issues and genuine LLM failure patterns.

Key findings:

  • Clear capability gradient: GPT-4o (87%) > Gemini 2.0 Flash (80%) > GPT-4o-mini (68%) >> Qwen2.5-3B (40%), validating that the benchmark discriminates across model sizes and capabilities.
  • Temporal reasoning separates frontier models: GPT-4o achieves 99.2% on TR, while Gemini drops to 65.2% and GPT-4o-mini to 29.6%. This subtask requires strict output format compliance ("増収"/"減収" rather than "はい"/"いいえ"), which strongly differentiates models.
  • Gemini 2.0 Flash leads on numerical reasoning (86.2% vs GPT-4o's 80.2%), suggesting strong arithmetic capabilities, but falls behind on consistency checking and temporal reasoning where format compliance matters more.
  • DuPont decomposition is the hardest question type: the six-step ROE decomposition questions (56 in total) show significant accuracy drops even for frontier models, and 3B models rarely solve them correctly.
  • GPT-4o-mini has a systematic prompt compliance issue in temporal reasoning. It answers "はい" (yes) to questions like "増収か減収か?" despite correctly analyzing the direction in its reasoning chain (122 of 176 TR errors follow this pattern).
  • J-GAAP balance sheet structure is a major error source. Models confuse 純資産合計 (net assets) with 株主資本 (shareholders' equity), and decompose 総資産 into 4 sub-categories instead of the standard 2.
  • Qwen2.5-3B-Instruct struggles most with temporal reasoning (15.6%) and consistency checking (51.0%), suggesting that smaller models have difficulty with instruction-following and multi-step verification tasks in Japanese.

Key Features

  • FinQA-compatible: Same data format as FinQA for cross-benchmark comparison
  • Japan-specific: Handles J-GAAP, IFRS, US-GAAP, and Japanese number formats (百万円, 億円, △)
  • Dual evaluation: Exact match and numerical match with tolerance
  • lm-evaluation-harness integration: Ready-to-use YAML task configs
  • Source provenance: Every question links back to its EDINET filing

Quick Start

Installation

```bash
pip install jfinqa
# or
uv add jfinqa
```

Evaluate Your Model

```python
from jfinqa import load_dataset, evaluate

# Load benchmark questions
questions = load_dataset("numerical_reasoning")

# Provide predictions keyed by question id
predictions = {"nr_001": "25.0%", "nr_002": "16.0%"}
result = evaluate(questions, predictions=predictions)
print(result.summary())
```

Or Use a Model Function

```python
from jfinqa import load_dataset, evaluate

questions = load_dataset()

def my_model(question: str, context: str) -> str:
    # Your model inference here
    return "42.5%"

result = evaluate(questions, model_fn=my_model)
print(result.summary())
```

CLI

```bash
# Inspect dataset questions
jfinqa inspect -s numerical_reasoning -n 5

# Evaluate a predictions file
jfinqa evaluate -p predictions.json

# Evaluate with local data
jfinqa evaluate -p predictions.json -d local_data.json -s numerical_reasoning
```

lm-evaluation-harness

Support in lm-evaluation-harness is pending as PR #3570. Once merged:

```bash
lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0
```

Before merge, use --include_path:

```bash
lm-eval run --model openai-completions \
    --model_args model=gpt-4o \
    --tasks jfinqa \
    --num_fewshot 0 \
    --include_path lm_eval_tasks/
```

Data Format

Each question follows the FinQA schema with additional metadata:

```json
{
  "id": "nr_001",
  "subtask": "numerical_reasoning",
  "pre_text": ["以下はA社の連結損益計算書の抜粋である。"],
  "post_text": ["当期は前期比で増収増益となった。"],
  "table": {
    "headers": ["", "2024年3月期", "2023年3月期"],
    "rows": [
      ["売上高", "1,500,000", "1,200,000"],
      ["営業利益", "200,000", "150,000"]
    ]
  },
  "qa": {
    "question": "2024年3月期の売上高成長率は何%か?",
    "program": ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"],
    "answer": "25.0%",
    "gold_evidence": [0]
  },
  "edinet_code": "E00001",
  "filing_year": "2024",
  "accounting_standard": "J-GAAP"
}
```
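The `program` field uses FinQA-style operations, with `#n` referring to the result of step n. A minimal interpreter for the basic arithmetic ops can be sketched as follows (`run_program` is a hypothetical helper, not the package's executor, which may support more operations):

```python
import re

def run_program(steps: list[str]) -> float:
    """Execute FinQA-style steps like 'divide(#0, 1200000)'. Sketch only."""
    ops = {"add": lambda a, b: a + b,
           "subtract": lambda a, b: a - b,
           "multiply": lambda a, b: a * b,
           "divide": lambda a, b: a / b}
    results: list[float] = []
    for step in steps:
        m = re.fullmatch(r"(\w+)\(([^,]+),\s*([^)]+)\)", step.strip())
        op, a, b = m.group(1), m.group(2), m.group(3)
        # '#n' resolves to the result of step n; anything else is a literal.
        resolve = lambda t: results[int(t[1:])] if t.startswith("#") else float(t)
        results.append(ops[op](resolve(a), resolve(b)))
    return results[-1]

program = ["subtract(1500000, 1200000)", "divide(#0, 1200000)", "multiply(#1, 100)"]
print(run_program(program))  # 25.0
```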

Japanese Number Handling

jfinqa correctly normalizes Japanese financial number formats:

| Input | Extracted value | Notes |
|---|---|---|
| △1,000 | -1,000 | Triangle negative marker |
| １２,３４５ | 12,345 | Fullwidth digits + comma removal |
| 24,956百万円 | 24,956 | Compound financial units treated as labels |
| 50億 | 5,000,000,000 | Bare kanji multiplier applied |
| 42.5% | 42.5 | Percentage sign stripped |
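These rules can be sketched roughly as follows (`normalize_jp_number` is a hypothetical name; jfinqa's actual normalizer handles more cases than this):

```python
import re
import unicodedata

def normalize_jp_number(s: str) -> float:
    """Normalize Japanese financial number strings per the rules above. Sketch only."""
    s = unicodedata.normalize("NFKC", s).strip()  # fullwidth digits/commas -> ASCII
    negative = s.startswith("△")                  # triangle marks a negative value
    if negative:
        s = s[1:]
    s = s.replace(",", "")
    multiplier = 1
    if s.endswith("億"):                           # bare kanji multiplier: 1億 = 10^8
        multiplier, s = 100_000_000, s[:-1]
    s = s.rstrip("%")
    # Compound units like 百万円 are labels: strip trailing non-numeric characters.
    s = re.sub(r"[^\d.]+$", "", s)
    return float(s) * multiplier * (-1 if negative else 1)

print(normalize_jp_number("△1,000"))  # -1000.0
print(normalize_jp_number("50億"))     # 5000000000.0
```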

Development

```bash
git clone https://github.com/ajtgjmdjp/jfinqa
cd jfinqa
uv sync --dev --extra dev
uv run pytest -v
uv run ruff check .
uv run mypy src/
```

Data Attribution

Source financial data is obtained from EDINET (Electronic Disclosure for Investors' NETwork), operated by the Financial Services Agency of Japan (金融庁). EDINET data is provided under the Public Data License 1.0.

The data format is compatible with FinQA (Chen et al., 2021).

Related Projects

  • FinQA — English financial QA benchmark (Chen et al., 2021)
  • TAT-QA — Tabular and textual QA
  • edinet-mcp — EDINET XBRL parser (companion project)
  • EDINET-Bench — Sakana AI's financial classification benchmark

Citation

If you use jfinqa in your research, please cite it as follows:

```bibtex
@dataset{jfinqa2025,
  title={jfinqa: Japanese Financial Numerical Reasoning QA Benchmark},
  author={ajtgjmdjp},
  year={2025},
  url={https://github.com/ajtgjmdjp/jfinqa},
  license={Apache-2.0}
}
```

License

Apache-2.0. See NOTICE for third-party attributions.
