tts-webui.mimo-audio

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiMo Audio: Audio Language Models are Few-Shot Learners
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Introduction

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans can generalize to new audio tasks from only a few examples or simple instructions. GPT-3 showed that scaling next-token-prediction pretraining enables strong generalization in text, and we believe this paradigm applies equally to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance among open-source models on both speech intelligence and audio understanding benchmarks. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. It also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks, spoken dialogue benchmarks, and instruct-TTS evaluations, approaching or surpassing closed-source models.

Results

Architecture

MiMo-Audio-Tokenizer

MiMo-Audio-Tokenizer is a 1.2B-parameter Transformer operating at 25 Hz. It employs an eight-layer RVQ stack to generate 200 tokens per second. By jointly optimizing semantic and reconstruction objectives, we train MiMo-Audio-Tokenizer from scratch on a 10-million-hour corpus, achieving superior reconstruction quality and facilitating downstream language modeling.
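The token rate quoted above follows directly from the frame rate and the number of RVQ layers. The short sketch below only reproduces that arithmetic; the function names are illustrative and are not part of the MiMo-Audio codebase.

```python
# Token-rate arithmetic for the tokenizer figures quoted in the text.
# Function names here are hypothetical helpers, not MiMo-Audio APIs.

FRAME_RATE_HZ = 25   # tokenizer frame rate (from the text)
NUM_RVQ_LAYERS = 8   # residual vector quantization layers (from the text)


def tokens_per_second(frame_rate_hz: int, num_layers: int) -> int:
    """Each frame yields one token per RVQ layer."""
    return frame_rate_hz * num_layers


def tokens_for_duration(seconds: float) -> int:
    """Total RVQ tokens produced for a clip of the given length."""
    return round(seconds * tokens_per_second(FRAME_RATE_HZ, NUM_RVQ_LAYERS))


print(tokens_per_second(FRAME_RATE_HZ, NUM_RVQ_LAYERS))  # 200
print(tokens_for_duration(30))                           # 6000
```

A 30-second clip therefore expands to 6000 discrete tokens.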

[Figure: Tokenizer]

MiMo-Audio couples a patch encoder, an LLM, and a patch decoder to improve modeling efficiency for high-rate sequences and bridge the length mismatch between speech and text. The patch encoder aggregates four consecutive time steps of RVQ tokens into a single patch, downsampling the sequence to a 6.25 Hz representation for the LLM. The patch decoder autoregressively generates the full 25 Hz RVQ token sequence via a delayed-generation scheme.
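To make the downsampling concrete, here is a minimal sketch of the grouping step only: four consecutive 25 Hz frames become one patch, giving a 6.25 Hz sequence. How the real patch encoder turns each group into a single representation is not described here, so the code stops at the grouping; all names are illustrative assumptions.

```python
from typing import List

PATCH_SIZE = 4          # consecutive time steps per patch (from the text)
FRAME_RATE_HZ = 25.0    # RVQ frame rate (from the text)

Frame = List[int]       # one time step: one token id per RVQ layer


def patch_rate_hz(frame_rate_hz: float, patch_size: int = PATCH_SIZE) -> float:
    """Sequence rate seen by the LLM after patching."""
    return frame_rate_hz / patch_size


def group_into_patches(frames: List[Frame],
                       patch_size: int = PATCH_SIZE) -> List[List[Frame]]:
    """Split a frame sequence into consecutive, non-overlapping patches."""
    if len(frames) % patch_size != 0:
        raise ValueError("frame count must be a multiple of the patch size")
    return [frames[i:i + patch_size] for i in range(0, len(frames), patch_size)]


# Four seconds of dummy tokens: 100 frames, 8 RVQ layers each.
frames = [[t * 8 + layer for layer in range(8)] for t in range(100)]
patches = group_into_patches(frames)

print(patch_rate_hz(FRAME_RATE_HZ))  # 6.25
print(len(patches))                  # 25
```

Four seconds of audio thus occupy 25 positions in the LLM sequence instead of 100 frames.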

MiMo-Audio

[Figure: MiMo-Audio architecture]

Explore MiMo-Audio Now! 🚀🚀🚀

🎧 Try the Hugging Face demo: MiMo-Audio Demo
📰 Read the Official Blog: MiMo-Audio Blog
📄 Dive into the Technical Report: MiMo-Audio Technical Report

Model Download

Models	🤗 Hugging Face
MiMo-Audio-Tokenizer	XiaomiMiMo/MiMo-Audio-Tokenizer
MiMo-Audio-7B-Base	XiaomiMiMo/MiMo-Audio-7B-Base
MiMo-Audio-7B-Instruct	XiaomiMiMo/MiMo-Audio-7B-Instruct

pip install huggingface-hub

hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download XiaomiMiMo/MiMo-Audio-7B-Base --local-dir ./models/MiMo-Audio-7B-Base
hf download XiaomiMiMo/MiMo-Audio-7B-Instruct --local-dir ./models/MiMo-Audio-7B-Instruct
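If you prefer to script the downloads, the same result can likely be achieved from Python with `huggingface_hub.snapshot_download`. This is a sketch rather than an officially documented workflow for this project; the helper names and the `./models` root simply mirror the commands above.

```python
# Python alternative to the `hf download` commands above (a sketch).
# MODEL_REPOS mirrors the download table; helper names are hypothetical.

MODEL_REPOS = {
    "MiMo-Audio-Tokenizer": "XiaomiMiMo/MiMo-Audio-Tokenizer",
    "MiMo-Audio-7B-Base": "XiaomiMiMo/MiMo-Audio-7B-Base",
    "MiMo-Audio-7B-Instruct": "XiaomiMiMo/MiMo-Audio-7B-Instruct",
}


def local_dir_for(name: str, root: str = "./models") -> str:
    """Target directory, matching the --local-dir values used above."""
    return f"{root}/{name}"


def download_all(root: str = "./models") -> None:
    """Download every model snapshot into its own directory under root."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    for name, repo_id in MODEL_REPOS.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir_for(name, root))
```

Calling `download_all()` fetches all three repositories; the weights are large, so expect a long download.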

Getting Started

Spin up the MiMo-Audio demo in minutes with the built-in Gradio app.

Prerequisites (Linux)

Python 3.12
CUDA >= 12.0
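A quick way to check these prerequisites before installing is sketched below. It assumes that "Python 3.12" means exactly the 3.12 series, and it reads the CUDA version that PyTorch was built against (`torch.version.cuda`), which may differ from the driver or toolkit installed on the machine. The function names are illustrative.

```python
import sys
from typing import Optional, Sequence, Tuple

REQUIRED_PYTHON = (3, 12)  # assumption: the 3.12 series exactly
MIN_CUDA = (12, 0)         # from the prerequisites list


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '12.4' into (12, 4); only major and minor are compared."""
    return tuple(int(part) for part in text.split(".")[:2])


def python_ok(version_info: Sequence[int] = sys.version_info) -> bool:
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def cuda_ok(cuda_version: Optional[str]) -> bool:
    """False when no CUDA version is reported (e.g. a CPU-only build)."""
    if cuda_version is None:
        return False
    return parse_version(cuda_version) >= MIN_CUDA


def detect_cuda_version() -> Optional[str]:
    """CUDA version PyTorch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda
```

For example, `python_ok() and cuda_ok(detect_cuda_version())` gives a single yes/no answer for the current environment.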

Installation

git clone https://github.com/XiaomiMiMo/MiMo-Audio.git
cd MiMo-Audio
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1

Note: If compiling flash-attn takes too long, you can download the precompiled wheel and install it manually:

Download Precompiled Wheel
pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
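A precompiled wheel only installs if its tags match your interpreter and platform. The sketch below splits a wheel filename into its standard components so you can confirm, for instance, that `cp312` matches Python 3.12. It is a simplified parser for illustration (the helper names are hypothetical) and does not replace pip's own compatibility checks.

```python
from typing import Dict


def parse_wheel_filename(filename: str) -> Dict[str, str]:
    """Split name-version[-build]-python-abi-platform.whl into its fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 when an optional build tag is present
        raise ValueError("unexpected number of wheel filename fields")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def wheel_matches_cpython(filename: str, major: int, minor: int) -> bool:
    """True when the wheel's Python tag targets the given CPython version."""
    return parse_wheel_filename(filename)["python_tag"] == f"cp{major}{minor}"


wheel = ("flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE"
         "-cp312-cp312-linux_x86_64.whl")
print(parse_wheel_filename(wheel)["platform_tag"])  # linux_x86_64
print(wheel_matches_cpython(wheel, 3, 12))          # True
```

The local version segment (`+cu12torch2.6cxx11abiFALSE`) suggests this particular build targets CUDA 12 and PyTorch 2.6, so the installed PyTorch version should match as well.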

Run the demo

python run_mimo_audio.py

This launches a local Gradio interface where you can try MiMo-Audio interactively.

[Figure: Demo UI]

Enter the local paths for MiMo-Audio-Tokenizer and MiMo-Audio-7B-Instruct, then enjoy the full functionality of MiMo-Audio!

Inference Scripts

Base Model

We provide an example script to explore the in-context learning capabilities of MiMo-Audio-7B-Base.
See: inference_example_pretrain.py

Instruct Model

To try the instruction-tuned model MiMo-Audio-7B-Instruct, use the corresponding inference script.
See: inference_example_sft.py

Evaluation Toolkit

The full evaluation suite is available at 🌐 MiMo-Audio-Eval.

This toolkit is designed to evaluate MiMo-Audio and other recent audio LLMs as mentioned in the paper. It provides a flexible and extensible framework, supporting a wide range of datasets, tasks, and models.

Citation

@misc{coreteam2025mimoaudio,
      title={MiMo-Audio: Audio Language Models are Few-Shot Learners}, 
      author={LLM-Core-Team Xiaomi},
      year={2025},
      url={https://github.com/XiaomiMiMo/MiMo-Audio}, 
}

Contact

Please contact us at mimo@xiaomi.com or open an issue if you have any questions.

Project details

Release history

0.0.3: Sep 20, 2025
0.0.2: Sep 20, 2025
0.0.1: Sep 20, 2025 (this version)
0.0.0: Sep 20, 2025

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

tts_webui_mimo_audio-0.0.1-py3-none-any.whl (8.2 kB view details)

Uploaded Sep 20, 2025 Python 3

File details

Details for the file tts_webui_mimo_audio-0.0.1-py3-none-any.whl.

File metadata

Download URL: tts_webui_mimo_audio-0.0.1-py3-none-any.whl
Upload date: Sep 20, 2025
Size: 8.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for tts_webui_mimo_audio-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`46d90e6528758d2087a3268cfe5c5a6fcab520a17734850725bf19849b78bb3a`
MD5	`717c53cd09781c483381b0362ac10381`
BLAKE2b-256	`f4a9350660271ded147d5a69469628d42652303747ed96d0676ee0ac6ffeddfd`
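After downloading the wheel manually, you can compare it against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the function names are illustrative, and the file path is whatever location you saved the wheel to.

```python
import hashlib

# SHA256 digest copied from the hash table above.
EXPECTED_SHA256 = (
    "46d90e6528758d2087a3268cfe5c5a6fcab520a17734850725bf19849b78bb3a"
)


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_wheel("tts_webui_mimo_audio-0.0.1-py3-none-any.whl")` should return `True` for an intact download of this release. pip can also enforce hashes itself through `--require-hashes` with a pinned requirements file.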
