A user-friendly MiDashengLM fine-tuning toolkit.
MDL-Toolkit
MDL-Toolkit is a user-friendly MiDashengLM fine-tuning toolkit that wraps the entire MDL fine-tuning workflow into a unified CLI. It uses a simple CSV data format and a LoRA-based approach to provide out-of-the-box fine-tuning, supports several memory optimization options and distributed training, and scales from a single GPU to multi-node clusters. It also offers a quick inference command for checking results right after training.
Installation
It is strongly recommended to install mdl-toolkit into a dedicated virtual environment to avoid dependency conflicts with other projects.
To install mdl-toolkit, you can use the following commands:
# Create and activate a dedicated virtual environment with uv
uv venv path/to/mdl-toolkit-venv
source path/to/mdl-toolkit-venv/bin/activate
# Or, use venv
python -m venv path/to/mdl-toolkit-venv
source path/to/mdl-toolkit-venv/bin/activate
# Or, use conda/mamba
mamba create -n mdl-toolkit python=3.13 pip
mamba activate mdl-toolkit
# Install mdl-toolkit
pip install mdl-toolkit
# Or, if you need optional features
pip install 'mdl-toolkit[modelscope,quantization]'
# You can now use the mdl-toolkit command
mdl-toolkit --help
For more installation options, please refer to the Installation Guide.
Usage
This section describes how to use mdl-toolkit for model training. We also provide a Jupyter Notebook demonstrating fine-tuning MiDashengLM with ESC-50.
Data Preparation
Before starting training, you need to prepare the dataset. mdl-toolkit uses a CSV-formatted dataset, where each row represents one audio sample, and the first row must contain column names. Irrelevant columns will be ignored. The dataset can contain the following columns:
- audio: Required. The path to the audio file, or a URL starting with http:// or https://. The specified path will be resolved relative to the directory where the script is run or the base directory specified by the --base-dir option. The specified URL will be downloaded when generating the dataset.
- system_prompt: Optional. System prompt text. If not provided or is null, the command-line option will be used if provided; otherwise it will be set to empty.
- user_prompt: Optional. User prompt text. If not provided or is null, the command-line option will be used if provided; otherwise it will be set to empty.
- prediction: Required for training; the model's predicted output, which will be used as labels for supervised learning during training. For inference, this column will be ignored and replaced with the inference result.
For example, for the ESC-50 dataset, you can use the following format:
audio,prediction
audio/1-100032-A-0.wav,"target: 0, category: dog"
audio/1-100038-A-14.wav,"target: 14, category: chirping_birds"
audio/1-100210-A-36.wav,"target: 36, category: vacuum_cleaner"
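A CSV in this shape can be generated with Python's standard csv module. The following is a minimal sketch, not part of mdl-toolkit: the samples list and the build_csv helper are hypothetical names, and the rows simply mirror the example above. Writing through csv.writer matters here because the label text contains a comma and therefore has to be quoted.

```python
import csv
import io

# Hypothetical ESC-50-style metadata: (filename, target id, category name).
samples = [
    ("1-100032-A-0.wav", 0, "dog"),
    ("1-100038-A-14.wav", 14, "chirping_birds"),
    ("1-100210-A-36.wav", 36, "vacuum_cleaner"),
]


def build_csv(rows, audio_dir="audio"):
    """Return CSV text with the `audio` and `prediction` columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["audio", "prediction"])  # header row with column names
    for filename, target, category in rows:
        label = f"target: {target}, category: {category}"
        # The label contains a comma, so csv.writer quotes it automatically.
        writer.writerow([f"{audio_dir}/{filename}", label])
    return buffer.getvalue()


csv_text = build_csv(samples)
print(csv_text)
```

To save the result, write csv_text to a file and pass that file to the commands described in the following sections.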
You can optionally specify system and user prompts:
audio,system_prompt,user_prompt,prediction
audio/1-100032-A-0.wav,null,What is the sound in the audio?,It sounds like a dog barking.
audio/1-100038-A-14.wav,Classify the audio according to the ESC-50 categories.,null,chirping_birds
audio/1-100210-A-36.wav,Answer user's question about the audio.,Is that a vacuum cleaner?,Yes.
System and user prompts can also be specified using command-line options.
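The fallback order for prompts described above (CSV cell first, then the command-line option, then empty) can be sketched as a small function. This is only an illustration of the documented precedence, not the toolkit's actual code: resolve_prompt is a hypothetical name, and treating an empty cell the same as the literal text null is an assumption.

```python
def resolve_prompt(cell, cli_default=None):
    """Pick the prompt for one row: CSV cell, then CLI option, then empty."""
    # Assumption: a missing column (None), an empty cell, and the literal
    # text "null" all count as "not provided".
    if cell is not None and cell.strip() not in ("", "null"):
        return cell
    if cli_default is not None:
        return cli_default
    return ""


# Row-level prompt wins over the command-line default.
print(resolve_prompt("Is that a vacuum cleaner?", "Describe the audio."))
# A null cell falls back to the command-line default.
print(resolve_prompt("null", "Describe the audio."))
```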
Converting the Dataset
Running mdl-toolkit convert-dataset converts the CSV-formatted dataset into the format required for model training. The command reads the input CSV, loads audio files, performs necessary preprocessing, and saves the results to the specified output directory. Converting the dataset is optional: you can pass the CSV file directly to the training command and have it processed on the fly. Preconverting, however, lets you reuse the result across multiple training runs, so the preprocessing is done only once.
mdl-toolkit convert-dataset \
path/to/input.csv \
--output path/to/output/
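Because conversion loads every audio file, it can be useful to check beforehand that local paths resolve to existing files. The sketch below is a hypothetical pre-flight check, not something mdl-toolkit provides. It assumes that entries starting with http:// or https:// are URLs and that all other entries are resolved against a base directory, as described in the Data Preparation section.

```python
from pathlib import Path


def resolve_audio(entry, base_dir="."):
    """Classify an `audio` cell as a URL or a path resolved against base_dir."""
    if entry.startswith(("http://", "https://")):
        return ("url", entry)
    return ("path", str(Path(base_dir) / entry))


def missing_files(entries, base_dir="."):
    """Return the local entries whose resolved path is not an existing file."""
    missing = []
    for entry in entries:
        kind, resolved = resolve_audio(entry, base_dir)
        if kind == "path" and not Path(resolved).is_file():
            missing.append(entry)
    return missing


entries = ["audio/1-100032-A-0.wav", "https://example.com/clip.wav"]
print(missing_files(entries, base_dir="."))
```

URLs are skipped by the check, since they are only fetched when the dataset is generated.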
Training
Use the mdl-toolkit train command to start training. This command reads the converted dataset, loads the base model, and trains using default hyperparameters.
mdl-toolkit train \
--train-dataset path/to/converted/train/ \
--eval-dataset path/to/converted/eval/ \
--output path/to/output/
If you don't use an evaluation set, you can omit the --eval-dataset parameter.
During training, logs such as loss values and learning rate will be printed. Checkpoints will be saved under checkpoint-{step} subdirectories of the output directory. Training may take a long time depending on the dataset size, model size, and hardware. After training completes, the results will be saved under the final subdirectory of the output directory. By default, the final directory contains the full model weights with LoRA adapters merged, and you can load and use this model the same way as the base model.
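Given the checkpoint-{step} naming described above, the most recent checkpoint can be located by parsing the step number from each subdirectory name. This is a minimal sketch with a hypothetical helper name, latest_checkpoint. It compares steps numerically, since sorting the names as strings would rank checkpoint-50 above checkpoint-200.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-{step} subdirectory with the highest step, or None."""
    pattern = re.compile(r"checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

For example, latest_checkpoint("path/to/output/") returns the Path of the newest checkpoint directory and ignores the final subdirectory.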
Tuning Hyperparameters
mdl-toolkit provides a set of tunable hyperparameters to help optimize model performance during training. You can specify these hyperparameters via command-line options, for example:
mdl-toolkit train \
--lr 1e-4 \
--lora-rank 32 \
...
mdl-toolkit provides default values for all hyperparameters, but the defaults may not be suitable for all tasks. Below are some commonly used hyperparameters and their default values:
- --lr: Default: 1e-4. Learning rate, controls the rate at which the optimizer updates parameters.
- --lora-rank: Default: 32. The rank of LoRA, which controls the complexity of the LoRA adapters. A higher rank can capture more features but also increases compute and storage overhead and the risk of overfitting.
- --batch-size: Default: 8. The number of samples processed per GPU device in each training step. A larger batch size may improve training speed and stability but also increases memory usage.
For the full list of hyperparameters, default values, and other available options, please refer to the Command-Line Interface Reference.
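To get a feel for how the LoRA rank affects adapter size, recall that a standard LoRA adapter on a weight matrix of shape d_out x d_in adds two low-rank factors, for a total of rank * (d_in + d_out) parameters. The sketch below works through that arithmetic. The 4096 x 4096 layer size is a hypothetical value chosen for illustration, not a property of MiDashengLM.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical 4096 x 4096 projection layer, used only for illustration.
full = 4096 * 4096
for rank in (8, 32, 128):
    extra = lora_extra_params(4096, 4096, rank)
    print(f"rank={rank}: {extra} adapter parameters ({extra / full:.2%} of the full matrix)")
```

The adapter size grows linearly with the rank, which is why a higher rank increases compute and storage overhead.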
Distributed Training
mdl-toolkit is compatible with both torchrun and accelerate. To use distributed training, prepend the corresponding launcher to the mdl-toolkit command. Without a launcher, training runs on a single GPU by default. For more information, refer to the Distributed Training Guide.
For example, to use torchrun for distributed training on a single node:
torchrun --standalone --nproc-per-node gpu --no-python \
mdl-toolkit train \
--train-dataset path/to/converted/train/ \
--eval-dataset path/to/converted/eval/ \
--output path/to/output/
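Since --batch-size is counted per GPU device, the number of samples consumed per training step grows with the number of processes. The following arithmetic sketch assumes plain data parallelism with one process per GPU and no gradient accumulation; global_batch_size is a hypothetical helper, not a toolkit function.

```python
def global_batch_size(per_device_batch, num_nodes=1, gpus_per_node=1):
    """Samples per optimizer step under data parallelism without accumulation."""
    return per_device_batch * num_nodes * gpus_per_node


# Default --batch-size of 8 on a single node with 4 GPUs.
print(global_batch_size(8, num_nodes=1, gpus_per_node=4))
```

If results change when moving from one GPU to several, the larger effective batch is one factor worth checking when tuning hyperparameters such as --lr.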
To use torchrun for multi-node distributed training, make sure all nodes can reach each other over the network, then run the same command on every node. Replace $NUM_NODES with the actual number of nodes, $JOB_ID with a unique job ID, and $HOST_NODE_ADDR with the address (and optional port) of the host node in the form <host>[:<port>]:
torchrun --nnodes $NUM_NODES --nproc-per-node gpu \
--rdzv-id $JOB_ID \
--rdzv-backend c10d \
--rdzv-endpoint $HOST_NODE_ADDR \
--no-python \
mdl-toolkit train \
--train-dataset path/to/converted/train/ \
--eval-dataset path/to/converted/eval/ \
--output path/to/output/
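When the same command has to be launched on every node, for example from a job script, it can help to assemble it programmatically. The sketch below builds the argument list shown above. The torchrun_command helper and the example job ID and host address are hypothetical; the flags themselves are taken from the command above.

```python
import shlex


def torchrun_command(num_nodes, job_id, host_node_addr, train_args):
    """Assemble the multi-node torchrun invocation as an argument list."""
    return [
        "torchrun",
        "--nnodes", str(num_nodes),
        "--nproc-per-node", "gpu",
        "--rdzv-id", str(job_id),
        "--rdzv-backend", "c10d",
        "--rdzv-endpoint", host_node_addr,
        "--no-python",
        "mdl-toolkit", "train",
        *train_args,
    ]


cmd = torchrun_command(
    num_nodes=2,
    job_id="job-42",               # hypothetical job ID
    host_node_addr="node0:29500",  # hypothetical host and port
    train_args=[
        "--train-dataset", "path/to/converted/train/",
        "--output", "path/to/output/",
    ],
)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues with paths that contain spaces.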
To use accelerate for distributed training, first run accelerate config on each node for configuration, then launch training with accelerate launch:
accelerate config # Follow the interactive prompts
accelerate launch \
mdl-toolkit train \
--train-dataset path/to/converted/train/ \
--eval-dataset path/to/converted/eval/ \
--output path/to/output/
Inference
The merged model is used for inference in the same way as the base model. Some frameworks also support loading LoRA adapters directly for inference. The system and user prompts fed to the model during inference should match those used during training so that the model produces the expected output:
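import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

# Load the merged model from the final training output
model_path = "path/to/output/final/"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Use the same system and user prompts as during training
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "System prompt"}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "User prompt"},
            {"type": "audio", "path": "/path/to/example.wav"},
        ],
    },
]

# Generation step: an assumption that follows the base model's usual
# transformers usage; adjust it if the base model documents a different call.
with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    )
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)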
For large-scale inference, it's recommended to use the vLLM framework for better performance and more comprehensive features.
In addition, MDL-Toolkit provides an inference command based on transformers, which makes it convenient to quickly run basic inference tasks after training, though it doesn't perform as well as specialized inference frameworks like vLLM. The inference input is a CSV file with the same format as the training dataset, except the prediction column becomes optional. The inference output will copy all input columns and replace the prediction column with the model's predictions. You can run inference with the following command:
mdl-toolkit inference \
path/to/input.csv \
--output path/to/output.csv \
--model-name model_name_or_path \
--max-length 128
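Because the output CSV copies the input columns and replaces prediction with the model's output, a labeled input file can be compared against the output file to get a quick quality estimate. The sketch below computes exact-match accuracy. It is an illustration rather than a toolkit feature: exact_match_accuracy is a hypothetical helper, and it assumes that the audio values are unique so they can serve as a join key.

```python
import csv
import io


def exact_match_accuracy(reference_csv, output_csv):
    """Fraction of reference rows whose label equals the predicted text."""
    ref_rows = list(csv.DictReader(io.StringIO(reference_csv)))
    out_rows = list(csv.DictReader(io.StringIO(output_csv)))
    if not ref_rows:
        return 0.0
    # Assumption: each audio value appears once, so it can act as a key.
    predicted = {row["audio"]: row["prediction"] for row in out_rows}
    hits = sum(
        predicted.get(row["audio"], "").strip() == row["prediction"].strip()
        for row in ref_rows
    )
    return hits / len(ref_rows)


reference = "audio,prediction\na.wav,dog\nb.wav,chirping_birds\n"
output = "audio,prediction\na.wav,dog\nb.wav,vacuum_cleaner\n"
print(exact_match_accuracy(reference, output))
```

The function takes CSV text, so files on disk can be passed in via Path("path/to/input.csv").read_text(). Exact matching is strict; free-form answers usually need a task-specific comparison instead.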