
MMIRAGE

MMIRAGE, which stands for Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine, is an advanced platform designed to streamline the processing of datasets using generative models. It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.

How to install

To install the library, clone it from GitHub and install it with pip. It is recommended to install torch and sglang beforehand to take advantage of GPU acceleration.

git clone git@github.com:EPFLiGHT/MMIRAGE.git
pip install -e ./MMIRAGE

For testing and scripts that make use of the library, it is advised to create a .env file. You can do this by running the following command:

curl https://raw.githubusercontent.com/EPFLiGHT/MMIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh


Key features

  • Easily configurable with a YAML file, which configures the following parameters:
    • The prompt sent to the LLM
    • Input variables, each with a name and a key into the JSON sample
  • Parallelizable with multi-node support
    • The pipeline can run distributed inference using accelerate
  • Supports a variety of LLMs and VLMs (LLMs only in a first version)
  • Supports any dataset schema (configurable in the YAML file)
  • The ability to output either structured output (JSON or another structured format) or plain text
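The last point, structured versus plain output, can be illustrated with a small sketch (a hypothetical helper for illustration, not the actual MMIRAGE API): a completion configured with output_type: JSON is parsed into a dict of generated variables, while plain output is kept verbatim.

```python
import json

def parse_output(text: str, output_type: str):
    """Interpret a model completion according to the configured output_type.
    Hypothetical helper for illustration, not part of the MMIRAGE API."""
    if output_type == "JSON":
        # Structured output: a dict whose keys become pipeline variables
        return json.loads(text)
    # Plain output: the raw completion is the variable's value
    return text

parse_output('{"question": "Q?", "answer": "A."}', "JSON")
# → {'question': 'Q?', 'answer': 'A.'}
```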

Example usage

Reformatting dataset

Suppose you have a dataset with samples of the following format:

{
    "conversations": [
        {"role": "user", "content": "Describe the image"},
        {"role": "assistant", "content": "This is a badly formmatted answer"}
    ],
    "modalities": [<the images>]
}

The dataset contains assistant answers that are badly formatted. The goal is to use an LLM to reformat each answer in Markdown. With MMIRAGE, this is as simple as defining a YAML configuration file, in which we specify:

inputs:
  - name: assistant_answer
    key: conversations[1].content
  - name: user_prompt
    key: conversations[0].content
  - name: modalities
    key: modalities

outputs:
  - name: formatted_answer
    type: llm
    output_type: plain
    prompt: | 
      Reformat the answer in a markdown format without adding anything else:
      {assistant_answer}
      
output_schema:
  conversations:
    - role: user
      content: {user_prompt}
    - role: assistant
      content: {formatted_answer}
  modalities: {modalities}

Configuration explanation:

  • inputs: specifies variables that are read from the input dataset. For instance, the key conversations[1].content means that the variable corresponds to sample["conversations"][1]["content"]
  • outputs: specifies variables that are created by the pipeline, along with how each one should be created:
    • Here formatted_answer is created from an LLM prompt and is a plain-text variable (as opposed to a JSON variable)
  • output_schema: specifies the output schema of the dataset, which every sample will follow. Here each sample will contain two keys: conversations and modalities
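To make the key syntax concrete, here is a minimal stdlib sketch (illustrative only, not the MMIRAGE implementation — the project lists JMESPath as the actual tool for this) that resolves a key like conversations[1].content against a sample and substitutes the resulting variables into the prompt template:

```python
import re

def resolve(key: str, sample):
    """Resolve a dotted key such as 'conversations[1].content' against a
    sample, mirroring sample['conversations'][1]['content']."""
    value = sample
    for part in key.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        value = value[m.group(1)]          # dict lookup by name
        if m.group(2) is not None:
            value = value[int(m.group(2))] # optional list index
    return value

sample = {
    "conversations": [
        {"role": "user", "content": "Describe the image"},
        {"role": "assistant", "content": "This is a badly formmatted answer"},
    ],
    "modalities": [],
}

# Resolve the configured inputs, then fill the prompt template
inputs = {
    "assistant_answer": resolve("conversations[1].content", sample),
    "user_prompt": resolve("conversations[0].content", sample),
}
prompt = (
    "Reformat the answer in a markdown format without adding anything else:\n"
    "{assistant_answer}"
).format(**inputs)
```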

Transforming datasets

In the second example, we want to generate questions from a plain-text document. The three keys that we want to generate are:

  • "question"
  • "answer"
  • "explanation"

Suppose we have samples of the following format:

{
    "text": "This is a very interesting article about cancer"
}

and the following YAML configuration:

inputs:
  - name: plain_text
    key: text
    
outputs:
  - name: output_dict
    type: prompt
    output_type: JSON
    prompt: | 
      I want to generate Q/A pairs from the following text:
      {plain_text}
    output_schema:
      - question
      - explanation
      - answer
        
output_schema:
  conversations:
    - role: user
      content: {question}
    - role: assistant
      content: |
        {explanation}
        Answer: {answer}

Here, we choose to output a JSON answer with three keys ("question", "explanation" and "answer"), which are then matched against the variables used in output_schema.
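Assuming the model returns such a JSON object, the final sample is assembled by plugging the three keys into output_schema, roughly like this (hypothetical completion text, sketch for illustration only):

```python
import json

# A hypothetical model completion following the requested output_schema keys
completion = json.loads(
    '{"question": "What is the article about?", '
    '"explanation": "The text discusses cancer.", '
    '"answer": "Cancer."}'
)

# Assemble the final sample according to output_schema from the YAML
formatted_sample = {
    "conversations": [
        {"role": "user", "content": completion["question"]},
        {
            "role": "assistant",
            "content": f"{completion['explanation']}\nAnswer: {completion['answer']}",
        },
    ]
}
```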

Useful tools

  • Jinja2 to process the YAML: #link
  • JMESPath: #link
  • SGLang: #link
  • Paper on the performance drop: #link



Download files


Source Distribution

mmirage-0.1.2.tar.gz (4.6 kB view details)

Uploaded Source

Built Distribution


mmirage-0.1.2-py3-none-any.whl (3.1 kB view details)

Uploaded Python 3

File details

Details for the file mmirage-0.1.2.tar.gz.

File metadata

  • Download URL: mmirage-0.1.2.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.27 {"installer":{"name":"uv","version":"0.9.27","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for mmirage-0.1.2.tar.gz
  • SHA256: 144719f8b4fe217db86d12dd4ea1d95ddee25d74c8faf65e2f4bcc8b39d22b96
  • MD5: b21376c4fbcf0774e534cb7187334cfd
  • BLAKE2b-256: a2650a85cb5b7ca73c69066734a0f60798339a47a3e3eb0303dbf868f329dec7


File details

Details for the file mmirage-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: mmirage-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 3.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.27 {"installer":{"name":"uv","version":"0.9.27","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for mmirage-0.1.2-py3-none-any.whl
  • SHA256: 07bce9eca5428662ce78d104bdae2eb64b1c723fe28b2b58825478fc302078b8
  • MD5: 11f7a342ffc06db24ffec42b2ee84a97
  • BLAKE2b-256: 3eff0b3072d0ee681fcc4a69935978488a300f43e64c2869133389e94f869a1f

