
ABAP-Bench: A Comprehensive Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization


ABAP-Bench


A professional AI benchmark for evaluating LLM understanding of SAP ABAP and S/4HANA modernization.



Overview

ABAP-Bench evaluates large language models on real-world SAP enterprise software tasks: migrating legacy ABAP code to S/4HANA APIs, detecting defects in business-critical programs, rewriting classic reports as Fiori applications, and more.

Enterprise software modernization is a $50B+ market. As organizations migrate from ECC to S/4HANA, LLMs are increasingly used to assist developers — yet no standardized benchmark existed to measure their true capability in this domain. ABAP-Bench fills that gap.

Why it matters:

  • SAP systems manage >70% of global transaction revenue; migration errors have real financial consequences
  • ABAP is a niche language with sparse training data; general coding benchmarks do not capture domain-specific knowledge
  • S/4HANA introduces breaking API changes (BAPI → I_Journal APIs, classical ALV → CL_SALV, etc.) that require precise understanding
  • China-specific regulatory and localization requirements (Golden Tax, ChinaTax, PIPL) demand dedicated evaluation coverage

Leaderboard (v4.0)

Last updated: 2026-04-03 | Scoring: 4-layer (Rubric 30% + Quality 20% + Semantic 20% + Judge 30%)

| Rank | Model | Score (/100) |
|------|-------|--------------|
| 1 | Qwen3 235B | 75 |
| 2 | Grok 4 | 75 |
| 3 | GLM-5.1 | 74 |
| 4 | DeepSeek R1 (0528) | 73 |
| 5 | MiMo-V2-Pro | 68 |
| 6 | MiniMax M2.7 | 66 |

Per-dimension scores (Migration, Defects, Rewriting, China, Risk, Security, Architecture, Performance, Ecosystem) are not yet published.

Scores above are from v3.2 (30 tasks, rubric-only scoring). Full v4.0 re-evaluation with 60 tasks and 4-layer scoring is in progress.

Submit your model via Issues.


Quick Start

# Clone and install
git clone https://github.com/abap-bench/abap-bench
cd abap-bench
pip install -e .

# Configure API keys
cp .env.example .env
# Edit .env with your keys

# Run full benchmark (all models in configs/models.yaml)
python -m src.run_benchmark

# Run single model
python -m src.run_benchmark --model glm-5.1

# Score a response (3-layer, no API needed)
python -m src.evaluate_v2 --task T01 --response "I_JournalEntry ACDOCA..." --breakdown

# Score with LLM-as-Judge (4-layer, needs API key)
python -m src.evaluate_v2 --task T01 --response "..." --with-judge --judge-model glm-4-plus --judge-backend zhipuai --breakdown

# Run judge on a full result file
python -m src.judge batch results/v4.0/glm-5.1.json --output results/judge/glm-5.1.json

Benchmark Design

9 Evaluation Dimensions

┌──────────────────────────────────────────────────────────────────────────┐
│                           ABAP-Bench v4.0                                │
│                        60 Tasks × 20 pts = 1200                          │
├───────────────┬───────────────┬───────────────┬──────────────────────────┤
│  D1 Migration │  D2 Defects   │  D3 Rewriting │  D4 China Compliance     │
│  T01,T09,T10  │  T02,T11,T12  │  T03,T13,T14  │  T04,T15,T16             │
│  T31,T32,T33  │  T34,T35,T36  │  T37,T38,T39  │  T40,T41,T42             │
├───────────────┼───────────────┼───────────────┼──────────────────────────┤
│  D5 Risk      │  D6 Security  │  D7 Architect │  D8 Performance          │
│  T05,T17,T18  │  T06,T19,T20  │  T07,T21,T22  │  T08,T23,T24             │
│  T43,T44,T45  │  T46,T47,T48  │  T49,T50,T51  │  T52,T53,T54             │
├───────────────┴───────────────┴───────────────┴──────────────────────────┤
│  D9 Modern Ecosystem: T25-T30, T55-T60 (12 tasks)                       │
│  Clean Core · Unit Testing · Fiori · BAdI · LUW · Integration Suite      │
│  Workflow · Output Mgmt · IDoc/ALE · BDC · Change Mgmt · Code Inspector  │
└──────────────────────────────────────────────────────────────────────────┘
| # | Dimension | Tasks | Description |
|----|-----------|-------|-------------|
| D1 | Code Migration | 6 | ECC → S/4HANA API replacement (BKPF → ACDOCA, BAPI → Released API) |
| D2 | Defect Discovery | 6 | Finding hidden bugs in ABAP code (N+1 queries, scope leaks, silent data loss) |
| D3 | Code Rewriting | 6 | Modernizing classical ABAP to clean code, RAP, ABAP Cloud |
| D4 | China Compliance | 6 | Golden Tax, ChinaTax VAT, PIPL privacy, social insurance & housing fund (五险一金) payroll |
| D5 | Migration Risk | 6 | Change-impact analysis, RFC dependency chains, transport risks |
| D6 | Security & Auth | 6 | Authority checks, SQL injection, authorization trace, transport security |
| D7 | S/4HANA Architecture | 6 | ACDOCA, CDS views, FI-CO integration, ledger architecture |
| D8 | Performance Engineering | 6 | SELECT optimization, HANA column store, parallel processing |
| D9 | Modern Ecosystem | 12 | Clean Core, unit testing, Fiori, BAdI, LUW, IDoc, workflow, BDC |
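The task-to-dimension mapping above can be recovered programmatically from data/tasks.jsonl. A minimal sketch, assuming the `task_id`/`dimension` field names from the example line under "Adding New Tasks" (the `tasks_per_dimension` helper name is ours, not part of `src/`):

```python
import json
from collections import Counter

def tasks_per_dimension(path="data/tasks.jsonl"):
    """Count benchmark tasks per dimension from the JSONL task file."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue                      # skip blank lines
            task = json.loads(line)           # one task object per line
            counts[task["dimension"]] += 1
    return counts
```

With the shipped 60-task file this should reproduce the 6/6/.../12 split shown in the table.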

4-Layer Scoring

┌─────────────────────────────────────────────────────────────┐
│  Layer 4: LLM-as-Judge (optional)              30%         │
│  ├── Correctness · Completeness · Specificity              │
│  ├── Structure · Insight (each 1-5, total /25)             │
│  └── Reference-guided via golden answers                   │
├─────────────────────────────────────────────────────────────┤
│  Layer 3: Semantic Similarity                  20% (30%*)  │
│  ├── BM25 text similarity against golden answers           │
│  └── Concept coverage (key_concepts hit rate)              │
├─────────────────────────────────────────────────────────────┤
│  Layer 2: Code & Structure Quality             20% (30%*)  │
│  ├── ABAP syntax checks (for code tasks)                   │
│  └── Answer structure analysis (for knowledge tasks)       │
├─────────────────────────────────────────────────────────────┤
│  Layer 1: Rubric Matching                      30% (40%*)  │
│  ├── Keyword matching (weighted_terms, keyword_group)      │
│  ├── Compound matching (key_term + context_keywords)       │
│  └── Penalty rules (incorrect S/4HANA statements: -1~-3)  │
└─────────────────────────────────────────────────────────────┘
  * Weights in parentheses: 3-layer mode (Layer 4 disabled)
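The concept-coverage component of Layer 3 amounts to a hit rate over the golden answer's `key_concepts` list (field name taken from the diagram above). An illustrative sketch, not the shipped `evaluate_v2` implementation:

```python
def concept_coverage(response: str, key_concepts: list) -> float:
    """Fraction of expected key concepts mentioned in the response (case-insensitive)."""
    if not key_concepts:
        return 0.0
    text = response.lower()
    hits = sum(1 for concept in key_concepts if concept.lower() in text)
    return hits / len(key_concepts)
```

For example, a response mentioning ACDOCA and I_JournalEntry but not BKPF against those three key concepts scores 2/3.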

  • 3-layer mode (default, no API needed): Rubric 40% + Quality 30% + Semantic 30%
  • 4-layer mode (--with-judge): Rubric 30% + Quality 20% + Semantic 20% + Judge 30%
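Combining the layers is a weighted sum. A sketch using the documented weights, assuming each layer score is normalized to 0-1 and scaled to the 20-point task maximum (the `combine` function and `WEIGHTS` dict names are ours):

```python
# Layer weights as documented for the two scoring modes.
WEIGHTS = {
    "3-layer": {"rubric": 0.40, "quality": 0.30, "semantic": 0.30},
    "4-layer": {"rubric": 0.30, "quality": 0.20, "semantic": 0.20, "judge": 0.30},
}

def combine(layer_scores: dict, mode: str = "3-layer") -> float:
    """Weighted sum of normalized (0-1) layer scores, scaled to 20 points per task."""
    weights = WEIGHTS[mode]
    return 20.0 * sum(weights[k] * layer_scores[k] for k in weights)
```

A perfect response scores 20.0 in either mode; in 4-layer mode a response with rubric 0.5, quality 1.0, semantic 0.0, judge 1.0 scores 13.0.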


Project Structure

ABAP-Bench/
├── README.md                    # This file
├── pyproject.toml               # Python packaging (pip install -e .)
├── benchmark_card.yaml          # HuggingFace Dataset Card format
├── CITATION.cff                 # Citation metadata
├── CHANGELOG.md                 # Version history
├── LICENSE                      # Apache-2.0
├── .env.example                 # API key template
│
├── data/
│   ├── tasks.jsonl              # 60 task definitions (JSONL)
│   ├── dimensions.json          # 9 dimensions metadata
│   ├── rubrics/                 # 60 scoring rubric JSONs (T01-T60)
│   ├── golden/                  # 60 golden reference answers (T01-T60)
│   └── test_code/               # ABAP code samples for code-review tasks
│       ├── zvat_invoice_process.abap
│       ├── zhr_salary_calc.abap
│       └── zdyn_query.abap
│
├── src/
│   ├── __init__.py              # Package init (version: 4.0.0)
│   ├── run_benchmark.py         # Main runner: load tasks → call LLM → score → save
│   ├── evaluate.py              # Scoring engine v1 (T01-T30, rubric-only)
│   ├── evaluate_v2.py           # Scoring engine v2 (4-layer, all 60 tasks)
│   ├── judge.py                 # LLM-as-Judge module (Layer 4)
│   └── models.py                # Multi-backend LLM client (zero external deps)
│
├── configs/
│   └── models.yaml              # Model registry (7 models, 4 backends)
│
├── results/
│   ├── schema.json              # Result file JSON Schema
│   └── v4.0/                    # Per-model evaluation results
│
├── scripts/
│   ├── validate_rubrics.py      # Data integrity validation
│   └── migrate_from_legacy.py   # v3.2 → v4.0 migration helper
│
├── tests/                       # Unit & integration tests
│   ├── test_evaluate.py
│   ├── test_data_integrity.py
│   └── test_judge.py
│
└── docs/
    ├── DECONTAMINATION.md       # Data provenance & contamination statement
    └── IMPLEMENTATION_PLAN.md   # Development roadmap (P0-P5)

Adding New Tasks

  1. Append a JSON line to data/tasks.jsonl:
     {"task_id":"T61","title":"New Task","dimension":"Code Migration Knowledge","max_score":20,"prompt_template":"...","requires_test_code":false,"version":"4.1"}
  2. Create the rubric: data/rubrics/T61.json
  3. Create the golden answer: data/golden/T61.json
  4. Update data/dimensions.json to include T61
  5. Validate: python scripts/validate_rubrics.py
  6. Test: python -m src.evaluate_v2 --task T61 --response "..." --breakdown
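Before running the shipped validator, a quick local sanity check of a new tasks.jsonl line might look like the sketch below. The required-field list is taken from the example line above; `validate_task_line` is an illustrative helper, not part of scripts/validate_rubrics.py:

```python
import json

# Fields present in the documented example task line.
REQUIRED_FIELDS = {"task_id", "title", "dimension", "max_score",
                   "prompt_template", "requires_test_code", "version"}

def validate_task_line(line: str) -> list:
    """Return a list of problems with one tasks.jsonl line (empty list = OK)."""
    try:
        task = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    problems = []
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    task_id = task.get("task_id", "")
    if not (task_id.startswith("T") and task_id[1:].isdigit()):
        problems.append(f"bad task_id: {task_id!r}")
    return problems
```

This catches malformed JSON, missing fields, and task IDs that don't match the T01-T60 pattern before the full validator runs.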

Known Limitations

  • No code execution: ABAP requires licensed SAP systems; scoring relies on static analysis + LLM-as-Judge instead of unit tests
  • 60 tasks: Below the 100+ statistical significance threshold of top benchmarks (SWE-bench: 2294, BigCodeBench: 1140)
  • No human correlation study: Inter-annotator agreement not yet measured (planned: Spearman ρ target > 0.85)
  • Primarily Chinese prompts: May disadvantage models weaker in Chinese language understanding

See docs/IMPLEMENTATION_PLAN.md for the full roadmap.


Citation

@misc{abapbench2026,
  title        = {ABAP-Bench: A Benchmark for Evaluating LLM Understanding of SAP ABAP and S/4HANA Modernization},
  author       = {ABAP-Bench Contributors},
  year         = {2026},
  version      = {4.0},
  howpublished = {\url{https://github.com/abap-bench/abap-bench}},
  note         = {60 tasks, 9 dimensions, 4-layer scoring}
}

License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Benchmark task prompts and rubrics are released under Apache-2.0. Golden reference answers (data/golden/) are provided for evaluation use only and should NOT be included in LLM training data. Model responses collected during evaluation remain the property of their respective model providers.
