MMATH - packaged by NVIDIA
Project description
MMATH
This is the official repository for the paper "MMATH: A Multilingual Benchmark for Mathematical Reasoning".
📖 Introduction
MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency.
NVIDIA Eval Factory
This package provides evaluation clients that are purpose-built to evaluate model endpoints using our Standard API.
Launching an evaluation for an LLM
Install the package
pip install nvidia-mmath
(Optional) Set a token to your API endpoint if it's protected
export MY_API_KEY="your_api_key_here"
export HF_TOKEN="your_huggingface_token_here"
List the available evaluations
eval-factory ls
Available tasks:
- mmath_en (in mmath)
- mmath_zh (in mmath)
- mmath_ar (in mmath)
- mmath_es (in mmath)
- mmath_fr (in mmath)
- mmath_ja (in mmath)
- mmath_ko (in mmath)
- mmath_pt (in mmath)
- mmath_th (in mmath)
- mmath_vi (in mmath)
Run the evaluation of your choice
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir /workspace/results
Gather the results
cat /workspace/results/results.yml
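Since results.yml is plain YAML, the scores can also be pulled out programmatically. The sketch below uses a deliberately simplistic flat `key: value` parser so it needs no third-party YAML library; the actual results.yml schema may be nested, in which case a real YAML parser is required.

```python
# Minimal sketch: extract "key: value" pairs from a simple YAML file.
# The real results.yml may be nested; this handles only flat lines
# and is meant for quick inspection.
def load_flat_yaml(text: str) -> dict:
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        entries[key.strip()] = value.strip()
    return entries

sample = """\
task: mmath_en
score: 0.85
"""
print(load_flat_yaml(sample))  # {'task': 'mmath_en', 'score': '0.85'}
```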
Command-Line Tool
Each package ships with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the mmath evaluations:
Commands
1. List Evaluation Types
eval-factory ls
Displays the evaluation types available within the harness.
2. Run an evaluation
The eval-factory run_eval command executes the evaluation process. Below are the flags and their descriptions:
Required flags:
- --eval_type <string>: The type of evaluation to perform (e.g., mmath_en, mmath_zh).
- --model_id <string>: The name or identifier of the model to evaluate.
- --model_url <url>: The API endpoint where the model is accessible.
- --model_type <string>: The type of the model to evaluate; currently one of "chat", "completions", or "vlm".
- --output_dir <directory>: The directory to use as the working directory for the evaluation. The results, including the results.yml output file, are saved here.
Optional flags:
- --api_key_name <string>: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
- --run_config <path>: The path to a YAML file containing the evaluation definition.
- --overrides <string>: Override configuration parameters (e.g., 'config.params.limit_samples=10').
Example
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results
If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:
export MY_API_KEY="your_api_key_here"
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--api_key_name MY_API_KEY \
--output_dir ./evaluation_results
Configuring evaluations via YAML
Evaluations in MMATH are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API which ensures consistency across evaluations.
Example of a YAML config:
config:
type: mmath_en
params:
parallelism: 50
limit_samples: 20
target:
api_endpoint:
model_id: meta/llama-3.1-8b-instruct
type: chat
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key: NVIDIA_API_KEY
The priority of overrides is as follows:
- command line arguments
- user config (as seen above)
- task defaults (defined per task type)
- framework defaults
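The priority order above is a standard layered-configuration merge, where each layer overrides the one below it. A minimal sketch of the idea (the keys are illustrative, not the tool's actual config schema):

```python
# Sketch of the override priority described above: layers are merged
# lowest-priority first, so later layers win on conflicting keys.
def merge_layers(*layers: dict) -> dict:
    merged = {}
    for layer in layers:  # lowest priority first
        merged.update(layer)
    return merged

framework_defaults = {"parallelism": 1, "limit_samples": None}
task_defaults = {"limit_samples": 100}
user_config = {"parallelism": 50, "limit_samples": 20}
cli_args = {"limit_samples": 10}

final = merge_layers(framework_defaults, task_defaults, user_config, cli_args)
print(final)  # {'parallelism': 50, 'limit_samples': 10}
```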
The --dry_run option allows you to print the final run configuration and command without executing the evaluation.
Example:
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results \
--dry_run
🛠️ Code and Data Usage
The repository provides the following resources. The mmath folder contains the benchmark data, and the train folder contains the training data described in Section 3 of the paper. mmath_eval.py is the main program for evaluating accuracy, while calculate_lcr.py computes the LCR metric defined in the paper.
│ calculate_lcr.py
│ mmath_eval.py
│ utils.py
│
├─mmath
│ ar.json
│ en.json
│ es.json
│ fr.json
│ ja.json
│ ko.json
│ pt.json
│ th.json
│ vi.json
│ zh.json
│
└─train
enSFT-3k.json
enThink-3k.json
nativeThink-3k.json
Each entry in mmath/xx.json has the following format:
{
"question": "Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.", # The question
"answer": "204", # The answer
"data_source": "AIME2024", # The data source, which might be AIME2024/AIME2025/CNMO/MATH500
"data_source_id": 0, # The index in original data
"lang": "en", # Language type
"gid": 0 # The global id in our benchmark MMATH
},
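Loading and sanity-checking a benchmark file is straightforward with the standard library. The sketch below validates the fields listed above against an inline stand-in record (the question text here is made up for brevity):

```python
import json

# Sketch: load a mmath/xx.json-style list of records and check that
# every record carries the fields documented above.
sample = json.loads("""
[{"question": "1+1=?", "answer": "2", "data_source": "MATH500",
  "data_source_id": 0, "lang": "en", "gid": 0}]
""")

REQUIRED = {"question", "answer", "data_source", "data_source_id", "lang", "gid"}
for record in sample:
    missing = REQUIRED - record.keys()
    assert not missing, f"missing fields: {missing}"
print(f"{len(sample)} record(s) validated")
```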
Each entry in train/yy.json has the following format:
{
"index":0,
"answer":"364",
"conversations":[
{
"from":"user",
"value":"For a positive integer \\( n \\), consider the function\n\n\\[ \nf(n)=\\frac{4n+\\sqrt{4n^2-1}}{\\sqrt{2n+1}+\\sqrt{2n-1}} \n\\]\n\nCalculate the value of\n\n\\[ \nf(1)+f(2)+f(3)+\\cdots+f(40) \n\\]"
},
{
"from":"assistant",
"value":"<think>\nOkay, let's see. I need to find the sum of f(1) + f(2) + ...Thus, the sum is:\n\n\\[\n\\frac{1}{2} (729 - 1) = \\frac{728}{2} = 364.\n\\]\n\nThe final answer is:\n\n\\[\n\\boxed{364}\n\\]"
}
]
},
🧪 Experiment Setup
Environment Setups
To speed up environment setup, we use uv to manage packages. Our training code is based on LLaMA-Factory; install it according to your requirements (e.g., with the -e option).
conda create -n mmath python=3.10
conda activate mmath
pip install uv
uv pip install -r requirements.txt
Evaluation Commands
Accuracy Results
To calculate accuracy results on our benchmark, run:
export CUDA_VISIBLE_DEVICES=0,1
python mmath_eval.py --model_name_or_path DeepSeek-R1-Distill-Qwen-32B --tensor_parallel_size 2
This generates a results directory containing a subdirectory named after the model (e.g., DeepSeek-R1-Distill-Qwen-32B). Inside are per-language result files, such as en.json.
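The exact schema of the per-language result files is not documented here, but aggregating an accuracy from such a file might look like the sketch below. The field names prediction and answer are assumptions for illustration, not the actual keys written by mmath_eval.py:

```python
# Hedged sketch of per-language accuracy aggregation. The record format
# (prediction/answer keys) is assumed, not the exact mmath_eval.py schema.
def accuracy(records: list[dict]) -> float:
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return correct / len(records) if records else 0.0

en_results = [
    {"gid": 0, "answer": "204", "prediction": "204"},
    {"gid": 1, "answer": "364", "prediction": "370"},
]
print(f"en accuracy: {accuracy(en_results):.2f}")  # en accuracy: 0.50
```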
LCR Results
To calculate LCR, first download the fastText language-identification model lid.176.bin:
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
After that, set the model_list_full variable in calculate_lcr.py, then run python calculate_lcr.py.
model_list_full = [
"DeepSeek-R1-Distill-Qwen-32B",
# ...
]
This rewrites some keys in results/model_name/xx.json and outputs a LaTeX table summarizing the overall results.
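As a rough illustration of a language-consistency computation: the sketch below uses a stub detector in place of the real fastText lid.176.bin model, and assumes LCR means the fraction of responses detected to be in the target language, which is an interpretation based on the paper's description rather than the script's actual definition.

```python
# Rough sketch of a language-consistency ratio. detect_lang is a stub;
# calculate_lcr.py uses fastText's lid.176.bin instead. The LCR definition
# here (share of responses in the target language) is an assumption.
def detect_lang(text: str) -> str:
    # Stub: treat any CJK character as Chinese, everything else as English.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

def lcr(responses: list[str], target_lang: str) -> float:
    hits = sum(detect_lang(r) == target_lang for r in responses)
    return hits / len(responses) if responses else 0.0

responses = ["答案是 364。", "The answer is 364.", "所以结果为 204。"]
print(f"LCR(zh) = {lcr(responses, 'zh'):.2f}")  # LCR(zh) = 0.67
```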
Training Setups
As mentioned before, our training code is based on LLaMA-Factory. Here we provide the hyperparameters used in our paper.
### model
model_name_or_path: Qwen2.5-32B-Instruct
trust_remote_code: true
### method
stage: sft
template: qwen
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
packing: false
### dataset
dataset: en-Think
cutoff_len: 32768
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: Qwen2.5-32B-Instruct-en-Think
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
save_only_model: true
save_total_limit: 10
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
enable_liger_kernel: true
📄 Attribution
This project is a fork of the original MMATH: A Multilingual Benchmark for Mathematical Reasoning repository created by the RUCAIBox team at Renmin University of China.
Original Repository
- Repository: RUCAIBox/MMATH
- Original Paper: MMATH: A Multilingual Benchmark for Mathematical Reasoning
Original Authors
The original MMATH benchmark was created by:
- Wenyang Luo - Renmin University of China
- Wayne Xin Zhao - Renmin University of China
- Jing Sha - Renmin University of China
- Shijin Wang - Renmin University of China
- Ji-Rong Wen - Renmin University of China
License
The original MMATH repository is licensed under the MIT License. This fork maintains the same license terms while adding NVIDIA-specific packaging and evaluation capabilities.
Acknowledgments
We thank the original MMATH authors for creating this comprehensive multilingual mathematical reasoning benchmark and making it publicly available to the research community.
📄 Citation
@article{luo2025mmath,
title={MMATH: A Multilingual Benchmark for Mathematical Reasoning},
author={Luo, Wenyang and Zhao, Wayne Xin and Sha, Jing and Wang, Shijin and Wen, Ji-Rong},
journal={arXiv preprint arXiv:2505.19126},
year={2025}
}
File details
Details for the file nvidia_mmath-25.8-py3-none-any.whl.
File metadata
- Download URL: nvidia_mmath-25.8-py3-none-any.whl
- Upload date:
- Size: 401.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56610dd2c84a80fdbfda34054dd7df01a946f36e18ddf0c5010c3f04d7001ec0 |
| MD5 | d96e46eedce84596442e9980fffcf119 |
| BLAKE2b-256 | c8bf2563d8688b795a01d7e9425cd50b8a48b07521d4730ed8872a82ababc3e9 |