MMATH - packaged by NVIDIA
Project description
MMATH
This is the official repository for the paper "MMATH: A Multilingual Benchmark for Mathematical Reasoning".
📖 Introduction
MMATH is a new benchmark specifically designed for multilingual complex reasoning. It comprises 374 carefully selected math problems from high-quality sources, including AIME, CNMO, and MATH-500, and covers ten typologically and geographically diverse languages. Each problem is translated and validated through a rigorous pipeline that combines frontier LLMs with human verification, ensuring semantic consistency.
NVIDIA Eval Factory
This package provides evaluation clients that are purpose-built to evaluate model endpoints using our Standard API.
Launching an evaluation for an LLM
Install the package
pip install nvidia-mmath
(Optional) Set a token to your API endpoint if it's protected
export MY_API_KEY="your_api_key_here"
export HF_TOKEN="your_huggingface_token_here"
List the available evaluations
eval-factory ls
Available tasks:
- mmath_en (in mmath)
- mmath_zh (in mmath)
- mmath_ar (in mmath)
- mmath_es (in mmath)
- mmath_fr (in mmath)
- mmath_ja (in mmath)
- mmath_ko (in mmath)
- mmath_pt (in mmath)
- mmath_th (in mmath)
- mmath_vi (in mmath)
Run the evaluation of your choice
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir /workspace/results
Gather the results
cat /workspace/results/results.yml
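Since results.yml is plain YAML, the scores can also be pulled out programmatically. The sketch below uses a deliberately simplistic flat `key: value` parser so it needs no third-party YAML library; the actual results.yml schema may be nested, in which case a real YAML parser is required.

```python
# Minimal sketch: extract "key: value" pairs from a simple YAML file.
# The real results.yml may be nested; this handles only flat lines
# and is meant for quick inspection.
def load_flat_yaml(text: str) -> dict:
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        entries[key.strip()] = value.strip()
    return entries

sample = """\
task: mmath_en
score: 0.85
"""
print(load_flat_yaml(sample))  # {'task': 'mmath_en', 'score': '0.85'}
```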
Command-Line Tool
Each package ships with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the mmath evaluations:
Commands
1. List Evaluation Types
eval-factory ls
Displays the evaluation types available within the harness.
2. Run an evaluation
The eval-factory run_eval command executes the evaluation process. Below are the flags and their descriptions:
Required flags:
- --eval_type <string>: The type of evaluation to perform (e.g., mmath_en, mmath_zh).
- --model_id <string>: The name or identifier of the model to evaluate.
- --model_url <url>: The API endpoint where the model is accessible.
- --model_type <string>: The type of the model to evaluate; currently one of "chat", "completions", or "vlm".
- --output_dir <directory>: The directory to use as the working directory for the evaluation. The results, including the results.yml output file, are saved here.
Optional flags:
- --api_key_name <string>: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
- --run_config <path>: The path to a YAML file containing the evaluation definition.
- --overrides <string>: Override configuration parameters (e.g., 'config.params.limit_samples=10').
Example
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results
If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:
export MY_API_KEY="your_api_key_here"
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--api_key_name MY_API_KEY \
--output_dir ./evaluation_results
Configuring evaluations via YAML
Evaluations in MMATH are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API which ensures consistency across evaluations.
Example of a YAML config:
config:
type: mmath_en
params:
parallelism: 50
limit_samples: 20
target:
api_endpoint:
model_id: meta/llama-3.1-8b-instruct
type: chat
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key: NVIDIA_API_KEY
The priority of overrides is as follows:
- command line arguments
- user config (as seen above)
- task defaults (defined per task type)
- framework defaults
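The priority order above is a standard layered-configuration merge, where each layer overrides the one below it. A minimal sketch of the idea (the keys are illustrative, not the tool's actual config schema):

```python
# Sketch of the override priority described above: layers are merged
# lowest-priority first, so later layers win on conflicting keys.
def merge_layers(*layers: dict) -> dict:
    merged = {}
    for layer in layers:  # lowest priority first
        merged.update(layer)
    return merged

framework_defaults = {"parallelism": 1, "limit_samples": None}
task_defaults = {"limit_samples": 100}
user_config = {"parallelism": 50, "limit_samples": 20}
cli_args = {"limit_samples": 10}

final = merge_layers(framework_defaults, task_defaults, user_config, cli_args)
print(final)  # {'parallelism': 50, 'limit_samples': 10}
```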
The --dry_run option allows you to print the final run configuration and command without executing the evaluation.
Example:
eval-factory run_eval \
--eval_type mmath_en \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results \
--dry_run
🛠️ Code and Data Usage
The repository provides the following resources. The mmath folder contains the benchmark data, and the train folder contains the training data described in Section 3 of the paper. mmath_eval.py is the main program for evaluating accuracy, while calculate_lcr.py computes the LCR metric defined in the paper.
│ calculate_lcr.py
│ mmath_eval.py
│ utils.py
│
├─mmath
│ ar.json
│ en.json
│ es.json
│ fr.json
│ ja.json
│ ko.json
│ pt.json
│ th.json
│ vi.json
│ zh.json
│
└─train
enSFT-3k.json
enThink-3k.json
nativeThink-3k.json
Each entry in mmath/xx.json has the following format:
{
"question": "Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.", # The question
"answer": "204", # The answer
"data_source": "AIME2024", # The data source, which might be AIME2024/AIME2025/CNMO/MATH500
"data_source_id": 0, # The index in original data
"lang": "en", # Language type
"gid": 0 # The global id in our benchmark MMATH
},
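Loading and sanity-checking a benchmark file is straightforward with the standard library. The sketch below validates the fields listed above against an inline stand-in record (the question text here is made up for brevity):

```python
import json

# Sketch: load a mmath/xx.json-style list of records and check that
# every record carries the fields documented above.
sample = json.loads("""
[{"question": "1+1=?", "answer": "2", "data_source": "MATH500",
  "data_source_id": 0, "lang": "en", "gid": 0}]
""")

REQUIRED = {"question", "answer", "data_source", "data_source_id", "lang", "gid"}
for record in sample:
    missing = REQUIRED - record.keys()
    assert not missing, f"missing fields: {missing}"
print(f"{len(sample)} record(s) validated")
```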
Each entry in train/yy.json has the following format:
{
"index":0,
"answer":"364",
"conversations":[
{
"from":"user",
"value":"For a positive integer \\( n \\), consider the function\n\n\\[ \nf(n)=\\frac{4n+\\sqrt{4n^2-1}}{\\sqrt{2n+1}+\\sqrt{2n-1}} \n\\]\n\nCalculate the value of\n\n\\[ \nf(1)+f(2)+f(3)+\\cdots+f(40) \n\\]"
},
{
"from":"assistant",
"value":"<think>\nOkay, let's see. I need to find the sum of f(1) + f(2) + ...Thus, the sum is:\n\n\\[\n\\frac{1}{2} (729 - 1) = \\frac{728}{2} = 364.\n\\]\n\nThe final answer is:\n\n\\[\n\\boxed{364}\n\\]"
}
]
},
🧪 Experiment Setup
Environment Setups
To speed up environment setup, we use uv to manage packages. Our training code is based on LLaMA-Factory; install it according to your requirements (e.g., with the -e option).
conda create -n mmath python=3.10
conda activate mmath
pip install uv
uv pip install -r requirements.txt
Evaluation Commands
Accuracy Results
To calculate accuracy results on our benchmark, run:
export CUDA_VISIBLE_DEVICES=0,1
python mmath_eval.py --model_name_or_path DeepSeek-R1-Distill-Qwen-32B --tensor_parallel_size 2
This generates a results directory containing a subdirectory named after the model (e.g., DeepSeek-R1-Distill-Qwen-32B). Inside are per-language result files, such as en.json.
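The exact schema of the per-language result files is not documented here, but aggregating an accuracy from such a file might look like the sketch below. The field names prediction and answer are assumptions for illustration, not the actual keys written by mmath_eval.py:

```python
# Hedged sketch of per-language accuracy aggregation. The record format
# (prediction/answer keys) is assumed, not the exact mmath_eval.py schema.
def accuracy(records: list[dict]) -> float:
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return correct / len(records) if records else 0.0

en_results = [
    {"gid": 0, "answer": "204", "prediction": "204"},
    {"gid": 1, "answer": "364", "prediction": "370"},
]
print(f"en accuracy: {accuracy(en_results):.2f}")  # en accuracy: 0.50
```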
LCR Results
To calculate LCR, first download the fastText language-identification model lid.176.bin:
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
After that, set the model_list_full variable in calculate_lcr.py, then run python calculate_lcr.py.
model_list_full = [
"DeepSeek-R1-Distill-Qwen-32B",
# ...
]
This rewrites some keys in results/model_name/xx.json and outputs a LaTeX table summarizing the overall results.
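As a rough illustration of a language-consistency computation: the sketch below uses a stub detector in place of the real fastText lid.176.bin model, and assumes LCR means the fraction of responses detected to be in the target language, which is an interpretation based on the paper's description rather than the script's actual definition.

```python
# Rough sketch of a language-consistency ratio. detect_lang is a stub;
# calculate_lcr.py uses fastText's lid.176.bin instead. The LCR definition
# here (share of responses in the target language) is an assumption.
def detect_lang(text: str) -> str:
    # Stub: treat any CJK character as Chinese, everything else as English.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

def lcr(responses: list[str], target_lang: str) -> float:
    hits = sum(detect_lang(r) == target_lang for r in responses)
    return hits / len(responses) if responses else 0.0

responses = ["答案是 364。", "The answer is 364.", "所以结果为 204。"]
print(f"LCR(zh) = {lcr(responses, 'zh'):.2f}")  # LCR(zh) = 0.67
```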
Training Setups
As mentioned before, our training code is based on LLaMA-Factory. Here we provide the hyperparameters used in our paper.
### model
model_name_or_path: Qwen2.5-32B-Instruct
trust_remote_code: true
### method
stage: sft
template: qwen
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
packing: false
### dataset
dataset: en-Think
cutoff_len: 32768
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: Qwen2.5-32B-Instruct-en-Think
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
save_only_model: true
save_total_limit: 10
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
enable_liger_kernel: true
📄 Attribution
This project is a fork of the original MMATH: A Multilingual Benchmark for Mathematical Reasoning repository created by the RUCAIBox team at Renmin University of China.
Original Repository
- Repository: RUCAIBox/MMATH
- Original Paper: MMATH: A Multilingual Benchmark for Mathematical Reasoning
Original Authors
The original MMATH benchmark was created by:
- Wenyang Luo - Renmin University of China
- Wayne Xin Zhao - Renmin University of China
- Jing Sha - Renmin University of China
- Shijin Wang - Renmin University of China
- Ji-Rong Wen - Renmin University of China
License
The original MMATH repository is licensed under the MIT License. This fork maintains the same license terms while adding NVIDIA-specific packaging and evaluation capabilities.
Acknowledgments
We thank the original MMATH authors for creating this comprehensive multilingual mathematical reasoning benchmark and making it publicly available to the research community.
📄 Citation
@article{luo2025mmath,
title={MMATH: A Multilingual Benchmark for Mathematical Reasoning},
author={Luo, Wenyang and Zhao, Wayne Xin and Sha, Jing and Wang, Shijin and Wen, Ji-Rong},
journal={arXiv preprint arXiv:2505.19126},
year={2025}
}
File details
Details for the file nvidia_mmath-25.8-py3-none-any.whl.
File metadata
- Download URL: nvidia_mmath-25.8-py3-none-any.whl
- Upload date:
- Size: 401.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56610dd2c84a80fdbfda34054dd7df01a946f36e18ddf0c5010c3f04d7001ec0 |
| MD5 | d96e46eedce84596442e9980fffcf119 |
| BLAKE2b-256 | c8bf2563d8688b795a01d7e9425cd50b8a48b07521d4730ed8872a82ababc3e9 |