
ABAP-Bench: A Comprehensive Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization


ABAP-Bench


A professional AI benchmark for evaluating LLM understanding of SAP ABAP and S/4HANA modernization.



Overview

ABAP-Bench evaluates large language models on real-world SAP enterprise software tasks: migrating legacy ABAP code to S/4HANA APIs, detecting defects in business-critical programs, rewriting classic reports as Fiori applications, and more.

Enterprise software modernization is a $50B+ market. As organizations migrate from ECC to S/4HANA, LLMs are increasingly used to assist developers — yet no standardized benchmark existed to measure their true capability in this domain. ABAP-Bench fills that gap.

Why it matters:

  • SAP systems manage >70% of global transaction revenue; migration errors have real financial consequences
  • ABAP is a niche language with sparse training data; general coding benchmarks do not capture domain-specific knowledge
  • S/4HANA introduces breaking API changes (BAPI → I_Journal APIs, classical ALV → CL_SALV, etc.) that require precise understanding
  • China-specific regulatory and localization requirements (Golden Tax, ChinaTax, PIPL) demand dedicated evaluation coverage

Leaderboard (v4.0)

Last updated: 2026-04-03 | Scoring: 4-layer (Rubric 30% + Quality 20% + Semantic 20% + Judge 30%)

| Rank | Model | Score (/100) |
|------|-------|--------------|
| 1 | Qwen3 235B | 75 |
| 2 | Grok 4 | 75 |
| 3 | GLM-5.1 | 74 |
| 4 | DeepSeek R1 (0528) | 73 |
| 5 | MiMo-V2-Pro | 68 |
| 6 | MiniMax M2.7 | 66 |

Per-dimension scores (Migration, Defects, Rewriting, China, Risk, Security, Architecture, Performance, Ecosystem) are not yet published.

Scores above are from v3.2 (30 tasks, rubric-only scoring). Full v4.0 re-evaluation with 60 tasks and 4-layer scoring is in progress.

Submit your model via Issues.


Quick Start

# Clone and install
git clone https://github.com/abap-bench/abap-bench
cd abap-bench
pip install -e .

# Configure API keys
cp .env.example .env
# Edit .env with your keys

# Run full benchmark (all models in configs/models.yaml)
python -m src.run_benchmark

# Run single model
python -m src.run_benchmark --model glm-5.1

# Score a response (3-layer, no API needed)
python -m src.evaluate_v2 --task T01 --response "I_JournalEntry ACDOCA..." --breakdown

# Score with LLM-as-Judge (4-layer, needs API key)
python -m src.evaluate_v2 --task T01 --response "..." --with-judge --judge-model glm-4-plus --judge-backend zhipuai --breakdown

# Run judge on a full result file
python -m src.judge batch results/v4.0/glm-5.1.json --output results/judge/glm-5.1.json

Benchmark Design

9 Evaluation Dimensions

┌──────────────────────────────────────────────────────────────────────────┐
│                           ABAP-Bench v4.0                                │
│                        60 Tasks × 20 pts = 1200                          │
├───────────────┬───────────────┬───────────────┬──────────────────────────┤
│  D1 Migration │  D2 Defects   │  D3 Rewriting │  D4 China Compliance     │
│  T01,T09,T10  │  T02,T11,T12  │  T03,T13,T14  │  T04,T15,T16             │
│  T31,T32,T33  │  T34,T35,T36  │  T37,T38,T39  │  T40,T41,T42             │
├───────────────┼───────────────┼───────────────┼──────────────────────────┤
│  D5 Risk      │  D6 Security  │  D7 Architect │  D8 Performance          │
│  T05,T17,T18  │  T06,T19,T20  │  T07,T21,T22  │  T08,T23,T24             │
│  T43,T44,T45  │  T46,T47,T48  │  T49,T50,T51  │  T52,T53,T54             │
├───────────────┴───────────────┴───────────────┴──────────────────────────┤
│  D9 Modern Ecosystem: T25-T30, T55-T60 (12 tasks)                       │
│  Clean Core · Unit Testing · Fiori · BAdI · LUW · Integration Suite      │
│  Workflow · Output Mgmt · IDoc/ALE · BDC · Change Mgmt · Code Inspector  │
└──────────────────────────────────────────────────────────────────────────┘
| # | Dimension | Tasks | Description |
|----|-----------|-------|-------------|
| D1 | Code Migration | 6 | ECC → S/4HANA API replacement (BKPF → ACDOCA, BAPI → Released API) |
| D2 | Defect Discovery | 6 | Finding hidden bugs in ABAP code (N+1 queries, scope leaks, silent data loss) |
| D3 | Code Rewriting | 6 | Modernizing classical ABAP to clean code, RAP, ABAP Cloud |
| D4 | China Compliance | 6 | Golden Tax, ChinaTax VAT, PIPL privacy, social insurance & housing fund (五险一金) payroll |
| D5 | Migration Risk | 6 | Change-impact analysis, RFC dependency chains, transport risks |
| D6 | Security & Auth | 6 | Authority checks, SQL injection, authorization trace, transport security |
| D7 | S/4HANA Architecture | 6 | ACDOCA, CDS views, FI-CO integration, ledger architecture |
| D8 | Performance Engineering | 6 | SELECT optimization, HANA column store, parallel processing |
| D9 | Modern Ecosystem | 12 | Clean Core, unit testing, Fiori, BAdI, LUW, IDoc, workflow, BDC |
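The task-to-dimension mapping above can be recovered programmatically from data/tasks.jsonl. A minimal sketch, assuming the `task_id`/`dimension` field names from the example line under "Adding New Tasks" (the `tasks_per_dimension` helper name is ours, not part of `src/`):

```python
import json
from collections import Counter

def tasks_per_dimension(path="data/tasks.jsonl"):
    """Count benchmark tasks per dimension from the JSONL task file."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue                      # skip blank lines
            task = json.loads(line)           # one task object per line
            counts[task["dimension"]] += 1
    return counts
```

With the shipped 60-task file this should reproduce the 6/6/.../12 split shown in the table.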

4-Layer Scoring

┌─────────────────────────────────────────────────────────────┐
│  Layer 4: LLM-as-Judge (optional)              30%         │
│  ├── Correctness · Completeness · Specificity              │
│  ├── Structure · Insight (each 1-5, total /25)             │
│  └── Reference-guided via golden answers                   │
├─────────────────────────────────────────────────────────────┤
│  Layer 3: Semantic Similarity                  20% (30%*)  │
│  ├── BM25 text similarity against golden answers           │
│  └── Concept coverage (key_concepts hit rate)              │
├─────────────────────────────────────────────────────────────┤
│  Layer 2: Code & Structure Quality             20% (30%*)  │
│  ├── ABAP syntax checks (for code tasks)                   │
│  └── Answer structure analysis (for knowledge tasks)       │
├─────────────────────────────────────────────────────────────┤
│  Layer 1: Rubric Matching                      30% (40%*)  │
│  ├── Keyword matching (weighted_terms, keyword_group)      │
│  ├── Compound matching (key_term + context_keywords)       │
│  └── Penalty rules (incorrect S/4HANA statements: -1~-3)  │
└─────────────────────────────────────────────────────────────┘
  * Weights in parentheses: 3-layer mode (Layer 4 disabled)
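The concept-coverage component of Layer 3 amounts to a hit rate over the golden answer's `key_concepts` list (field name taken from the diagram above). An illustrative sketch, not the shipped `evaluate_v2` implementation:

```python
def concept_coverage(response: str, key_concepts: list) -> float:
    """Fraction of expected key concepts mentioned in the response (case-insensitive)."""
    if not key_concepts:
        return 0.0
    text = response.lower()
    hits = sum(1 for concept in key_concepts if concept.lower() in text)
    return hits / len(key_concepts)
```

For example, a response mentioning ACDOCA and I_JournalEntry but not BKPF against those three key concepts scores 2/3.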

  • 3-layer mode (default, no API needed): Rubric 40% + Quality 30% + Semantic 30%
  • 4-layer mode (--with-judge): Rubric 30% + Quality 20% + Semantic 20% + Judge 30%
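Combining the layers is a weighted sum. A sketch using the documented weights, assuming each layer score is normalized to 0-1 and scaled to the 20-point task maximum (the `combine` function and `WEIGHTS` dict names are ours):

```python
# Layer weights as documented for the two scoring modes.
WEIGHTS = {
    "3-layer": {"rubric": 0.40, "quality": 0.30, "semantic": 0.30},
    "4-layer": {"rubric": 0.30, "quality": 0.20, "semantic": 0.20, "judge": 0.30},
}

def combine(layer_scores: dict, mode: str = "3-layer") -> float:
    """Weighted sum of normalized (0-1) layer scores, scaled to 20 points per task."""
    weights = WEIGHTS[mode]
    return 20.0 * sum(weights[k] * layer_scores[k] for k in weights)
```

A perfect response scores 20.0 in either mode; in 4-layer mode a response with rubric 0.5, quality 1.0, semantic 0.0, judge 1.0 scores 13.0.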


Project Structure

ABAP-Bench/
├── README.md                    # This file
├── pyproject.toml               # Python packaging (pip install -e .)
├── benchmark_card.yaml          # HuggingFace Dataset Card format
├── CITATION.cff                 # Citation metadata
├── CHANGELOG.md                 # Version history
├── LICENSE                      # Apache-2.0
├── .env.example                 # API key template
│
├── data/
│   ├── tasks.jsonl              # 60 task definitions (JSONL)
│   ├── dimensions.json          # 9 dimensions metadata
│   ├── rubrics/                 # 60 scoring rubric JSONs (T01-T60)
│   ├── golden/                  # 60 golden reference answers (T01-T60)
│   └── test_code/               # ABAP code samples for code-review tasks
│       ├── zvat_invoice_process.abap
│       ├── zhr_salary_calc.abap
│       └── zdyn_query.abap
│
├── src/
│   ├── __init__.py              # Package init (version: 4.0.0)
│   ├── run_benchmark.py         # Main runner: load tasks → call LLM → score → save
│   ├── evaluate.py              # Scoring engine v1 (T01-T30, rubric-only)
│   ├── evaluate_v2.py           # Scoring engine v2 (4-layer, all 60 tasks)
│   ├── judge.py                 # LLM-as-Judge module (Layer 4)
│   └── models.py                # Multi-backend LLM client (zero external deps)
│
├── configs/
│   └── models.yaml              # Model registry (7 models, 4 backends)
│
├── results/
│   ├── schema.json              # Result file JSON Schema
│   └── v4.0/                    # Per-model evaluation results
│
├── scripts/
│   ├── validate_rubrics.py      # Data integrity validation
│   └── migrate_from_legacy.py   # v3.2 → v4.0 migration helper
│
├── tests/                       # Unit & integration tests
│   ├── test_evaluate.py
│   ├── test_data_integrity.py
│   └── test_judge.py
│
└── docs/
    ├── DECONTAMINATION.md       # Data provenance & contamination statement
    └── IMPLEMENTATION_PLAN.md   # Development roadmap (P0-P5)

Adding New Tasks

  1. Append a JSON line to data/tasks.jsonl:
     {"task_id":"T61","title":"New Task","dimension":"Code Migration Knowledge","max_score":20,"prompt_template":"...","requires_test_code":false,"version":"4.1"}
  2. Create the rubric: data/rubrics/T61.json
  3. Create the golden answer: data/golden/T61.json
  4. Update data/dimensions.json to include T61
  5. Validate: python scripts/validate_rubrics.py
  6. Test: python -m src.evaluate_v2 --task T61 --response "..." --breakdown
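Before running the shipped validator, a quick local sanity check of a new tasks.jsonl line might look like the sketch below. The required-field list is taken from the example line above; `validate_task_line` is an illustrative helper, not part of scripts/validate_rubrics.py:

```python
import json

# Fields present in the documented example task line.
REQUIRED_FIELDS = {"task_id", "title", "dimension", "max_score",
                   "prompt_template", "requires_test_code", "version"}

def validate_task_line(line: str) -> list:
    """Return a list of problems with one tasks.jsonl line (empty list = OK)."""
    try:
        task = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    problems = []
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    task_id = task.get("task_id", "")
    if not (task_id.startswith("T") and task_id[1:].isdigit()):
        problems.append(f"bad task_id: {task_id!r}")
    return problems
```

This catches malformed JSON, missing fields, and task IDs that don't match the T01-T60 pattern before the full validator runs.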

Known Limitations

  • No code execution: ABAP requires licensed SAP systems; scoring relies on static analysis + LLM-as-Judge instead of unit tests
  • 60 tasks: Below the 100+ statistical significance threshold of top benchmarks (SWE-bench: 2294, BigCodeBench: 1140)
  • No human correlation study: Inter-annotator agreement not yet measured (planned: Spearman ρ target > 0.85)
  • Primarily Chinese prompts: May disadvantage models weaker in Chinese language understanding

See docs/IMPLEMENTATION_PLAN.md for the full roadmap.


Citation

@misc{abapbench2026,
  title        = {ABAP-Bench: A Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization},
  author       = {ABAP-Bench Contributors},
  year         = {2026},
  version      = {4.0},
  howpublished = {\url{https://github.com/abap-bench/abap-bench}},
  note         = {60 tasks, 9 dimensions, 4-layer scoring}
}

License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Benchmark task prompts and rubrics are released under Apache-2.0. Golden reference answers (data/golden/) are provided for evaluation use only and should NOT be included in LLM training data. Model responses collected during evaluation remain the property of their respective model providers.
