ABAP-Bench: A Comprehensive Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization
ABAP-Bench
A professional AI benchmark for evaluating LLM understanding of SAP ABAP and S/4HANA modernization.
Overview
ABAP-Bench evaluates large language models on real-world SAP enterprise software tasks: migrating legacy ABAP code to S/4HANA APIs, detecting defects in business-critical programs, rewriting classic reports as Fiori applications, and more.
Enterprise software modernization is a $50B+ market. As organizations migrate from ECC to S/4HANA, LLMs are increasingly used to assist developers, yet no standardized benchmark has existed to measure how well they handle this domain. ABAP-Bench fills that gap.
Why it matters:
- By SAP's own figures, more than 70% of global transaction revenue touches an SAP system; migration errors have real financial consequences
- ABAP is a niche language with sparse training data; general coding benchmarks do not capture domain-specific knowledge
- S/4HANA introduces breaking API changes (BAPI → I_Journal APIs, classical ALV → CL_SALV, etc.) that require precise understanding
- China-specific regulatory and localization requirements (Golden Tax, ChinaTax, PIPL) demand dedicated evaluation coverage
Leaderboard (v4.0)
Last updated: 2026-04-03 | Scoring: 4-layer (Rubric 30% + Quality 20% + Semantic 20% + Judge 30%)
| Rank | Model | Score (/100) | Migration | Defects | Rewriting | China | Risk | Security | Architecture | Performance | Ecosystem |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3 235B | 75 | — | — | — | — | — | — | — | — | — |
| 2 | Grok 4 | 75 | — | — | — | — | — | — | — | — | — |
| 3 | GLM-5.1 | 74 | — | — | — | — | — | — | — | — | — |
| 4 | DeepSeek R1 (0528) | 73 | — | — | — | — | — | — | — | — | — |
| 5 | MiMo-V2-Pro | 68 | — | — | — | — | — | — | — | — | — |
| 6 | MiniMax M2.7 | 66 | — | — | — | — | — | — | — | — | — |
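Note: the scores above come from v3.2 (30 tasks, rubric-only scoring), not from the 4-layer scoring named in the header line, and per-dimension scores are not yet available. The full v4.0 re-evaluation with 60 tasks and 4-layer scoring is in progress.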
Submit your model via Issues.
Quick Start
# Clone and install
git clone https://github.com/abap-bench/abap-bench
cd abap-bench
pip install -e .
# Configure API keys
cp .env.example .env
# Edit .env with your keys
# Run full benchmark (all models in configs/models.yaml)
python -m src.run_benchmark
# Run single model
python -m src.run_benchmark --model glm-5.1
# Score a response (3-layer, no API needed)
python -m src.evaluate_v2 --task T01 --response "I_JournalEntry ACDOCA..." --breakdown
# Score with LLM-as-Judge (4-layer, needs API key)
python -m src.evaluate_v2 --task T01 --response "..." --with-judge --judge-model glm-4-plus --judge-backend zhipuai --breakdown
# Run judge on a full result file
python -m src.judge batch results/v4.0/glm-5.1.json --output results/judge/glm-5.1.json
Benchmark Design
9 Evaluation Dimensions
┌──────────────────────────────────────────────────────────────────────────┐
│ ABAP-Bench v4.0 │
│ 60 Tasks × 20 pts = 1200 │
├───────────────┬───────────────┬───────────────┬──────────────────────────┤
│ D1 Migration │ D2 Defects │ D3 Rewriting │ D4 China Compliance │
│ T01,T09,T10 │ T02,T11,T12 │ T03,T13,T14 │ T04,T15,T16 │
│ T31,T32,T33 │ T34,T35,T36 │ T37,T38,T39 │ T40,T41,T42 │
├───────────────┼───────────────┼───────────────┼──────────────────────────┤
│ D5 Risk │ D6 Security │ D7 Architect │ D8 Performance │
│ T05,T17,T18 │ T06,T19,T20 │ T07,T21,T22 │ T08,T23,T24 │
│ T43,T44,T45 │ T46,T47,T48 │ T49,T50,T51 │ T52,T53,T54 │
├───────────────┴───────────────┴───────────────┴──────────────────────────┤
│ D9 Modern Ecosystem: T25-T30, T55-T60 (12 tasks) │
│ Clean Core · Unit Testing · Fiori · BAdI · LUW · Integration Suite │
│ Workflow · Output Mgmt · IDoc/ALE · BDC · Change Mgmt · Code Inspector │
└──────────────────────────────────────────────────────────────────────────┘
| # | Dimension | Tasks | Description |
|---|---|---|---|
| D1 | Code Migration | 6 | ECC → S/4HANA API replacement (BKPF→ACDOCA, BAPI→Released API) |
| D2 | Defect Discovery | 6 | Finding hidden bugs in ABAP code (N+1 queries, scope leaks, silent data loss) |
| D3 | Code Rewriting | 6 | Modernizing classical ABAP to clean code, RAP, ABAP Cloud |
| D4 | China Compliance | 6 | Golden Tax, ChinaTax VAT, PIPL privacy, social insurance and housing fund (五险一金) payroll |
| D5 | Migration Risk | 6 | Change impact analysis, RFC dependency chains, transport risks |
| D6 | Security & Auth | 6 | Authority checks, SQL injection, authorization trace, transport security |
| D7 | S/4HANA Architecture | 6 | ACDOCA, CDS views, FI-CO integration, ledger architecture |
| D8 | Performance Engineering | 6 | SELECT optimization, HANA column store, parallel processing |
| D9 | Modern Ecosystem | 12 | Clean Core, unit testing, Fiori, BAdI, LUW, IDoc, workflow, BDC |
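The task allocation in the diagram and table above can be cross-checked with a few lines of Python. This is only a sketch: the `DIMENSIONS` dictionary and the `task_ids` helper are hypothetical names, the mapping is transcribed by hand from the diagram, and its layout is not meant to mirror the schema of data/dimensions.json.

```python
# Task IDs per dimension, transcribed from the diagram above.
# The dictionary layout is illustrative and is NOT the schema of data/dimensions.json.
def task_ids(*numbers):
    """Format task numbers as zero-padded IDs such as 'T01'."""
    return [f"T{n:02d}" for n in numbers]

DIMENSIONS = {
    "D1 Migration": task_ids(1, 9, 10, 31, 32, 33),
    "D2 Defects": task_ids(2, 11, 12, 34, 35, 36),
    "D3 Rewriting": task_ids(3, 13, 14, 37, 38, 39),
    "D4 China Compliance": task_ids(4, 15, 16, 40, 41, 42),
    "D5 Risk": task_ids(5, 17, 18, 43, 44, 45),
    "D6 Security": task_ids(6, 19, 20, 46, 47, 48),
    "D7 Architecture": task_ids(7, 21, 22, 49, 50, 51),
    "D8 Performance": task_ids(8, 23, 24, 52, 53, 54),
    "D9 Modern Ecosystem": task_ids(*range(25, 31), *range(55, 61)),
}

POINTS_PER_TASK = 20

all_tasks = [task for tasks in DIMENSIONS.values() for task in tasks]

# Every task from T01 to T60 should appear exactly once.
assert len(all_tasks) == len(set(all_tasks)) == 60
assert set(all_tasks) == set(task_ids(*range(1, 61)))

total_points = len(all_tasks) * POINTS_PER_TASK
print(total_points)  # 1200
```

A check like this is useful after adding tasks, since a task ID that is listed twice or left out of every dimension would silently skew per-dimension totals.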
4-Layer Scoring
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: LLM-as-Judge (optional) 30% │
│ ├── Correctness · Completeness · Specificity │
│ ├── Structure · Insight (each 1-5, total /25) │
│ └── Reference-guided via golden answers │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: Semantic Similarity 20% (30%*) │
│ ├── BM25 text similarity against golden answers │
│ └── Concept coverage (key_concepts hit rate) │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: Code & Structure Quality 20% (30%*) │
│ ├── ABAP syntax checks (for code tasks) │
│ └── Answer structure analysis (for knowledge tasks) │
├─────────────────────────────────────────────────────────────┤
│ Layer 1: Rubric Matching 30% (40%*) │
│ ├── Keyword matching (weighted_terms, keyword_group) │
│ ├── Compound matching (key_term + context_keywords) │
│ └── Penalty rules (incorrect S/4HANA statements: -1~-3) │
└─────────────────────────────────────────────────────────────┘
* Weights in parentheses: 3-layer mode (Layer 4 disabled)
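To illustrate the concept-coverage part of Layer 3, the sketch below computes a hit rate over a list of key concepts. It is a minimal sketch under assumptions: the function name `concept_coverage`, the case-insensitive substring match, and the sample concept list are hypothetical and may differ from what src/evaluate_v2.py actually does.

```python
def concept_coverage(response: str, key_concepts: list[str]) -> float:
    """Return the fraction of key concepts mentioned in the response (0.0 to 1.0)."""
    if not key_concepts:
        return 0.0
    text = response.lower()
    # Assumption: a concept counts as "hit" if it appears as a substring.
    hits = sum(1 for concept in key_concepts if concept.lower() in text)
    return hits / len(key_concepts)

# Hypothetical key concepts for a migration task.
concepts = ["ACDOCA", "I_JournalEntry", "CDS view", "Released API"]
answer = "Read line items from ACDOCA through the I_JournalEntry CDS view."
print(concept_coverage(answer, concepts))  # 0.75
```

Plain substring matching is easy to audit but also easy to game with keyword lists, which is one reason a similarity measure and an optional judge layer sit alongside it.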
3-layer mode (default, no API needed): Rubric 40% + Quality 30% + Semantic 30%
4-layer mode (--with-judge): Rubric 30% + Quality 20% + Semantic 20% + Judge 30%
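The two weighting modes can be expressed as a short worked example. Only the weights and the judge total of 25 come from the text above; the function names, the assumption that every layer score is first normalized to the range 0 to 1, and the rounding are hypothetical choices for this sketch.

```python
# Layer weights as stated above; everything else in this sketch is an assumption.
WEIGHTS_3_LAYER = {"rubric": 0.40, "quality": 0.30, "semantic": 0.30}
WEIGHTS_4_LAYER = {"rubric": 0.30, "quality": 0.20, "semantic": 0.20, "judge": 0.30}

def normalize_judge(total: int) -> float:
    """Map a judge total (five criteria, each 1-5) to a fraction by dividing by 25."""
    return total / 25

def combine_layers(layer_scores: dict[str, float], max_score: float = 20.0) -> float:
    """Combine per-layer scores (each assumed to be in 0..1) into a task score.

    If a "judge" score is present, the 4-layer weights are used;
    otherwise the 3-layer weights apply.
    """
    weights = WEIGHTS_4_LAYER if "judge" in layer_scores else WEIGHTS_3_LAYER
    missing = set(weights) - set(layer_scores)
    if missing:
        raise ValueError(f"missing layer scores: {sorted(missing)}")
    fraction = sum(weights[name] * layer_scores[name] for name in weights)
    return round(fraction * max_score, 2)

base = {"rubric": 0.8, "quality": 0.5, "semantic": 0.6}
three_layer = combine_layers(base)
four_layer = combine_layers({**base, "judge": normalize_judge(20)})
print(three_layer, four_layer)  # 13.0 14.0
```

Because the weights in each mode sum to 1, a response that scores full marks on every layer reaches the task maximum of 20 points in either mode.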
Project Structure
ABAP-Bench/
├── README.md # This file
├── pyproject.toml # Python packaging (pip install -e .)
├── benchmark_card.yaml # HuggingFace Dataset Card format
├── CITATION.cff # Citation metadata
├── CHANGELOG.md # Version history
├── LICENSE # Apache-2.0
├── .env.example # API key template
│
├── data/
│ ├── tasks.jsonl # 60 task definitions (JSONL)
│ ├── dimensions.json # 9 dimensions metadata
│ ├── rubrics/ # 60 scoring rubric JSONs (T01-T60)
│ ├── golden/ # 60 golden reference answers (T01-T60)
│ └── test_code/ # ABAP code samples for code-review tasks
│ ├── zvat_invoice_process.abap
│ ├── zhr_salary_calc.abap
│ └── zdyn_query.abap
│
├── src/
│ ├── __init__.py # Package init (version: 4.0.0)
│ ├── run_benchmark.py # Main runner: load tasks → call LLM → score → save
│ ├── evaluate.py # Scoring engine v1 (T01-T30, rubric-only)
│ ├── evaluate_v2.py # Scoring engine v2 (4-layer, all 60 tasks)
│ ├── judge.py # LLM-as-Judge module (Layer 4)
│ └── models.py # Multi-backend LLM client (zero external deps)
│
├── configs/
│ └── models.yaml # Model registry (7 models, 4 backends)
│
├── results/
│ ├── schema.json # Result file JSON Schema
│ └── v4.0/ # Per-model evaluation results
│
├── scripts/
│ ├── validate_rubrics.py # Data integrity validation
│ └── migrate_from_legacy.py # v3.2 → v4.0 migration helper
│
├── tests/ # Unit & integration tests
│ ├── test_evaluate.py
│ ├── test_data_integrity.py
│ └── test_judge.py
│
└── docs/
    ├── DECONTAMINATION.md # Data provenance & contamination statement
    └── IMPLEMENTATION_PLAN.md # Development roadmap (P0-P5)
Adding New Tasks
- Append a JSON line to data/tasks.jsonl:
  {"task_id":"T61","title":"New Task","dimension":"Code Migration Knowledge","max_score":20,"prompt_template":"...","requires_test_code":false,"version":"4.1"}
- Create rubric: data/rubrics/T61.json
- Create golden answer: data/golden/T61.json
- Update data/dimensions.json to include T61
- Validate: python scripts/validate_rubrics.py
- Test: python -m src.evaluate_v2 --task T61 --response "..." --breakdown
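The first step, appending a task definition to the JSONL file, can be scripted so that malformed or duplicate entries are caught early. The following is a sketch, not part of the package: `append_task` and `REQUIRED_FIELDS` are hypothetical names, the field list is copied from the T61 example, and treating every field as required is an assumption. It does not replace scripts/validate_rubrics.py.

```python
import json
from pathlib import Path

# Field names are taken from the T61 example above; treating all of them
# as required is an assumption of this sketch.
REQUIRED_FIELDS = {
    "task_id": str,
    "title": str,
    "dimension": str,
    "max_score": int,
    "prompt_template": str,
    "requires_test_code": bool,
    "version": str,
}

def append_task(task: dict, path: Path) -> None:
    """Validate a task definition and append it to a JSONL file as one line."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in task:
            raise ValueError(f"missing field: {field}")
        if not isinstance(task[field], expected_type):
            raise TypeError(f"{field} must be of type {expected_type.__name__}")

    # Collect the IDs already present so a task is never added twice.
    existing_ids = set()
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            existing_ids = {json.loads(line)["task_id"] for line in handle if line.strip()}
    if task["task_id"] in existing_ids:
        raise ValueError(f"duplicate task_id: {task['task_id']}")

    # ensure_ascii=False keeps Chinese prompt text readable in the file.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(task, ensure_ascii=False) + "\n")

new_task = {
    "task_id": "T61",
    "title": "New Task",
    "dimension": "Code Migration Knowledge",
    "max_score": 20,
    "prompt_template": "...",
    "requires_test_code": False,
    "version": "4.1",
}
# Usage (path as in the project layout): append_task(new_task, Path("data/tasks.jsonl"))
```

The sketch assumes the existing file ends with a newline; if it does not, the appended record would be joined to the last line and break the JSONL format.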
Known Limitations
- No code execution: ABAP requires licensed SAP systems; scoring relies on static analysis + LLM-as-Judge instead of unit tests
- 60 tasks: Fewer than the 100+ tasks typical of established benchmarks (SWE-bench: 2,294; BigCodeBench: 1,140), so small score gaps between models may not be statistically significant
- No human correlation study: Agreement between automatic scores and human expert ratings has not yet been measured (planned target: Spearman ρ > 0.85)
- Primarily Chinese prompts: May disadvantage models weaker in Chinese language understanding
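For the planned human correlation study, Spearman's ρ can be computed as the Pearson correlation of rank-transformed scores. The sketch below uses only the standard library; the function names are hypothetical and the two score lists are made-up illustrative numbers, not benchmark results.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5

# Made-up scores for five responses (illustrative only).
automatic_scores = [15, 12, 18, 9, 14]
human_scores = [16, 13, 17, 10, 12]
rho = spearman_rho(automatic_scores, human_scores)
print(round(rho, 3))  # 0.9
```

In this toy example the two rankings differ only by one swapped pair, which gives ρ = 0.9 and would clear the stated 0.85 target; a real study would need far more than five items for the estimate to be stable.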
See IMPLEMENTATION_PLAN.md for the full future roadmap.
Citation
@misc{abapbench2026,
title = {ABAP-Bench: A Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization},
author = {ABAP-Bench Contributors},
year = {2026},
version = {4.0},
howpublished = {\url{https://github.com/abap-bench/abap-bench}},
note = {60 tasks, 9 dimensions, 4-layer scoring}
}
License
This project is licensed under the Apache License 2.0. See LICENSE for details.
Benchmark task prompts and rubrics are released under Apache-2.0. Golden reference answers (data/golden/) are provided for evaluation use only and should NOT be included in LLM training data. Model responses collected during evaluation remain the property of their respective model providers.