Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine - Advanced platform for processing datasets using generative models including vision-language models.
Project description
MMIRAGE
MMIRAGE, which stands for Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine, is an advanced platform designed to streamline the processing of datasets using generative models, including vision-language models (VLMs). It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
How to install
To install the library, clone it from GitHub and install it with pip. It is recommended to install torch and sglang first to take advantage of GPU acceleration.

```shell
git clone git@github.com:EPFLiGHT/MMIRAGE.git
pip install -e ./MMIRAGE
```
For testing and for scripts that use the library, it is advised to create a .env file:

```shell
./scripts/generate_env.sh
```
Key features
- Multimodal support: process both text and images with vision-language models
- Easily configurable via a YAML file, which specifies:
  - The prompt to the LLM (using Jinja2 templating)
  - Input variables, each with a name and a JMESPath key into the JSON sample
  - Image inputs for multimodal processing
- Parallelizable with multi-node support
  - The training pipeline uses distributed inference with sharding
- Supports a variety of LLMs and VLMs (vision-language models)
- Supports arbitrary dataset schemas (configurable in the YAML file)
- Outputs either JSON (or any other structured format) or plain text
- Modular architecture with pluggable processors, loaders, and writers
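The JMESPath-based variable extraction described above can be sketched as follows. This is a simplified stand-in that covers only dotted fields and integer indexing (the real library uses full JMESPath); the sample and variable names are illustrative, not MMIRAGE's internal API:

```python
import re

def lookup(key: str, data):
    """Resolve a simple JMESPath-style key like 'conversations[1].content'.

    Toy stand-in for the jmespath library: handles only dotted fields
    and integer indexing -- enough for the examples in this README.
    """
    for part in key.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        field, idx = m.group(1), m.group(2)
        data = data[field]
        if idx is not None:
            data = data[int(idx)]
    return data

# An input sample in the conversational schema used later in this README.
sample = {
    "conversations": [
        {"role": "user", "content": "Describe the image"},
        {"role": "assistant", "content": "A badly formatted answer"},
    ],
    "modalities": [],
}

# Each configured input is a (name, JMESPath key) pair; extraction is one
# lookup per variable.
inputs = {
    "user_prompt": "conversations[0].content",
    "assistant_answer": "conversations[1].content",
}
variables = {name: lookup(key, sample) for name, key in inputs.items()}
print(variables["user_prompt"])  # → Describe the image
```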
Example usage
Running (single command)
Run the pipeline via the Python CLI. Retry behavior is driven by your YAML config:

- `execution_params.retry: true` → automatically retries failed shards until completion or `max_retries`
- `execution_params.retry: false` → submits/runs once; you can later trigger retries via `check`

```shell
python -m mmirage.cli run --config configs/config_mock.yaml
```

To check status only:

```shell
python -m mmirage.cli check --config configs/config_mock.yaml
```

To check status and submit retries for failed shards:

```shell
python -m mmirage.cli check --config configs/config_mock.yaml --retry
```
Text-only: Reformatting dataset
Suppose you have a dataset with samples of the following format:

```json
{
  "conversations": [
    {"role": "user", "content": "Describe the image"},
    {"role": "assistant", "content": "This is a badly formatted answer"}
  ],
  "modalities": ["<the images>"]
}
```
The dataset contains assistant answers that are badly formatted. The goal is to use an LLM to reformat each answer in Markdown. With MMIRAGE, this is as simple as defining a YAML configuration file:
```yaml
processors:
  - type: llm
    server_args:
      model_path: Qwen/Qwen3-8B
      tp_size: 4
      trust_remote_code: true
    default_sampling_params:
      temperature: 0.1
      top_p: 1.0
      max_new_tokens: 384

loading_params:
  state_dir: /path/to/state/dir
  datasets:
    - path: /path/to/dataset
      type: loadable
      output_dir: /path/to/output/shards
  num_shards: 4
  shard_id: "$SLURM_ARRAY_TASK_ID"
  batch_size: 64

processing_params:
  inputs:
    - name: assistant_answer
      key: conversations[1].content
    - name: user_prompt
      key: conversations[0].content
    - name: modalities
      key: modalities
  outputs:
    - name: formatted_answer
      type: llm
      output_type: plain
      prompt: |
        Reformat the answer in a markdown format without adding anything else:
        {{ assistant_answer }}
  remove_columns: false
  output_schema:
    conversations:
      - role: user
        content: "{{ user_prompt }}"
      - role: assistant
        content: "{{ formatted_answer }}"
    modalities: "{{ modalities }}"

execution_params:
  mode: local
  retry: false
```
Configuration explanation:
- `processors`: List of processor configurations. Currently supports the `llm` type for LLM-based generation.
- `loading_params`: Parameters for loading and sharding datasets.
  - `state_dir`: Optional shared directory for shard status/retry state. Defaults to `~/.cache/MMIRAGE/state_dir`.
  - `datasets`: List of dataset configurations with path, type, and output directory.
- `processing_params`:
  - `inputs`: Variables extracted from the input dataset using JMESPath queries.
  - `outputs`: Variables created by processors. Prompts use Jinja2 templating (`{{ variable }}`).
  - `output_schema`: Defines the structure of output samples.
- `execution_params`:
  - `mode`: `local` runs shard processing in the current Python environment; `slurm` runs through SLURM by submitting an sbatch array job.
  - `retry`: If `true`, MMIRAGE automatically retries failed shards until they succeed or `max_retries` is reached. If `false`, the pipeline runs or submits once, and retries can be triggered later via the `check`/retry CLI commands.
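Since prompts and `output_schema` fields are plain Jinja2 templates over the extracted variables, the rendering step can be sketched with a tiny stand-in. This toy `render` covers only `{{ variable }}` substitution and is not MMIRAGE's actual writer; the variable values are illustrative:

```python
import re

def render(template: str, variables: dict) -> str:
    """Toy stand-in for Jinja2's {{ variable }} substitution,
    covering only plain variable references."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(variables[m.group(1)]), template)

variables = {
    "user_prompt": "Describe the image",
    "assistant_answer": "*This* is an answer",
    "formatted_answer": "**This** is an answer",
}

# The prompt template from the config above, rendered per sample:
prompt = render(
    "Reformat the answer in a markdown format without adding anything else:\n"
    "{{ assistant_answer }}",
    variables,
)

# output_schema fields are rendered the same way, field by field:
schema_field = render("{{ formatted_answer }}", variables)
print(schema_field)  # → **This** is an answer
```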
Multimodal: Processing images with VLMs
MMIRAGE supports multimodal processing with vision-language models:
```yaml
processors:
  - type: llm
    server_args:
      model_path: Qwen/Qwen2-VL-7B-Instruct
      tp_size: 4
      trust_remote_code: true
      chat_template: qwen2-vl  # Required for VLMs
    default_sampling_params:
      temperature: 0.1
      top_p: 0.95
      max_new_tokens: 768

loading_params:
  state_dir: path/to/state/dir
  datasets:
    - path: /path/to/image/dataset
      type: loadable
      output_dir: /path/to/output/shards
  num_shards: 4
  shard_id: "$SLURM_ARRAY_TASK_ID"
  batch_size: 32

processing_params:
  inputs:
    - name: medical_image
      key: image
      type: image  # Mark as image input
      image_base_path: /path/to/images  # Base directory for relative paths
    - name: original_caption
      key: caption
      type: text
  outputs:
    - name: enhanced_caption
      type: llm
      output_type: plain
      prompt: |
        Describe the medical image in detail.
        Original caption for context: {{ original_caption }}
  remove_columns: false
  output_schema:
    image: "{{ medical_image }}"
    caption: "{{ enhanced_caption }}"
    original_caption: "{{ original_caption }}"

execution_params:
  mode: local
  retry: false
```
Key multimodal features:
- `chat_template`: Specify the VLM chat template (e.g., `qwen2-vl`)
- `type: image`: Mark input variables as images
- `image_base_path`: Base directory for resolving relative image paths
- Supports PIL Images, URLs, and file paths
Architecture
MMIRAGE uses a modular architecture:
```
mmirage/
├── config/              # Configuration loading and validation
├── core/
│   ├── loader/          # Dataset loaders (JSONL, HuggingFace)
│   ├── process/         # Processors (LLM, etc.) and variable system
│   │   └── processors/
│   │       └── llm/     # LLM processor with multimodal support
│   └── writer/          # Output rendering with Jinja2
├── shard_process.py     # Main processing script
└── merge_shards.py      # Shard merging utility
```
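The pluggable loader/processor/writer split above can be sketched with Python protocols. The interface and function names here are illustrative, not MMIRAGE's exact class names:

```python
from typing import Iterable, Protocol

class Loader(Protocol):
    def load(self) -> Iterable[dict]: ...

class Processor(Protocol):
    def process(self, sample: dict) -> dict: ...

class Writer(Protocol):
    def write(self, sample: dict) -> None: ...

def run_pipeline(loader: Loader, processors: list[Processor], writer: Writer) -> int:
    """Pull samples through each processor in order, then hand them to the writer.

    Returns the number of samples written.
    """
    count = 0
    for sample in loader.load():
        for processor in processors:
            sample = processor.process(sample)
        writer.write(sample)
        count += 1
    return count
```

Because the pipeline only depends on these three interfaces, swapping a JSONL loader for a HuggingFace one, or a plain-text writer for a JSON one, does not touch the processing logic.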
File details
Details for the file mmirage-0.1.4.tar.gz.
File metadata
- Download URL: mmirage-0.1.4.tar.gz
- Upload date:
- Size: 32.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.1 {"installer":{"name":"uv","version":"0.11.1","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1cd7f714b16e6e269b1abd3bd044f269bdd0083426d097545631867591f523d3` |
| MD5 | `8fb1194c72d75e5524d4eb3256f316cc` |
| BLAKE2b-256 | `d2c3129d33303d9f41a7cd59f5ddbb6726478527f2eb9a0d677ed8bfe80d251f` |
File details
Details for the file mmirage-0.1.4-py3-none-any.whl.
File metadata
- Download URL: mmirage-0.1.4-py3-none-any.whl
- Upload date:
- Size: 44.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.1 {"installer":{"name":"uv","version":"0.11.1","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3ca1b91984cc9ed8c38e2ff4078f2b89bb0ddadb7b77eca844966e81b480a8ca` |
| MD5 | `92f8a83304ef517d6f07927b7dd4ef01` |
| BLAKE2b-256 | `61ca0fbdaca251d91c0e7e6eafd061940989282c21c74caa2f5486e18ead7489` |