
Voxtral audio processing and model implementation for Apple Silicon using MLX


MLX Voxtral

MLX Voxtral is an optimized implementation of Mistral AI's Voxtral speech models for Apple Silicon, providing efficient audio transcription with support for model quantization and streaming processing.

Features

  • 🚀 Optimized for Apple Silicon - Leverages MLX framework for maximum performance on M1/M2/M3 chips
  • 🗜️ Model Quantization - Reduce model size by 4.3x with minimal quality loss
  • 🎙️ Full Audio Pipeline - Complete audio processing from file/URL to transcription
  • 🔧 CLI Tools - Command-line utilities for transcription and quantization
  • 📦 Pre-quantized Models - Ready-to-use quantized models available

Installation

Install from PyPI

# Install mlx-voxtral from PyPI
pip install mlx-voxtral

# Install transformers from GitHub (required)
pip install git+https://github.com/huggingface/transformers

Install from Source

# Clone the repository
git clone https://github.com/mzbac/mlx.voxtral
cd mlx.voxtral

# Install in development mode
pip install -e .

Quick Start

Simple Transcription

from mlx_voxtral import VoxtralForConditionalGeneration, VoxtralProcessor

# Load model and processor
model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

# Transcribe audio (the spelling of apply_transcrition_request matches the upstream API)
inputs = processor.apply_transcrition_request(
    language="en",
    audio="speech.mp3"
)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.0)
transcription = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(transcription)

Command Line Usage

# Basic transcription
mlx-voxtral.generate --audio speech.mp3

# With custom parameters
mlx-voxtral.generate --model mistralai/Voxtral-Mini-3B-2507 --max-token 2048 --temperature 0.1 --audio speech.mp3

# From URL
mlx-voxtral.generate --audio https://example.com/podcast.mp3

# Using quantized model
mlx-voxtral.generate --model ./voxtral-mini-4bit --audio speech.mp3

Model Quantization

MLX Voxtral includes quantization tools that reduce model size by up to 4.3x with minimal quality loss:

Quantization Tool

# Basic 4-bit quantization (recommended)
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 -o ./voxtral-mini-4bit

# Mixed precision quantization (best quality)
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 --output-dir ./voxtral-mini-mixed --mixed

# Custom quantization settings
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 \
    --output-dir ./voxtral-mini-8bit \
    --bits 8 \
    --group-size 32
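
Under the hood, these settings map onto MLX's group-wise affine quantization: each group of --group-size consecutive weights shares one scale and bias, so smaller groups track the original weights more closely at the cost of slightly more storage. The sketch below shows only the underlying MLX primitive, as an illustration; the actual tool also works layer by layer, handles the audio encoder, and saves the config and weights.

import mlx.core as mx

# Group-wise affine quantization: each group of `group_size` consecutive
# weights shares one scale and one bias.
w = mx.random.normal((4096, 4096))
w_q, scales, biases = mx.quantize(w, group_size=32, bits=8)

# Round-trip to inspect the error quantization introduces; it grows as
# bits shrink and group_size grows.
w_hat = mx.dequantize(w_q, scales, biases, group_size=32, bits=8)
print(mx.abs(w - w_hat).max())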

Using Quantized Models

# Load pre-quantized model (same API as original)
model = VoxtralForConditionalGeneration.from_pretrained("mzbac/voxtral-mini-3b-4bit-mixed")
processor = VoxtralProcessor.from_pretrained("mzbac/voxtral-mini-3b-4bit-mixed")

# Use exactly like the original model
transcription = model.transcribe("speech.mp3", processor)

Audio Processing Pipeline

Low-Level Audio Processing

from mlx_voxtral import process_audio_for_voxtral

# Process audio file for direct model input
result = process_audio_for_voxtral("speech.mp3")

# Access processed features
mel_features = result["input_features"]  # Shape: [n_chunks, 128, 3000]
print(f"Audio duration: {result['duration_seconds']:.2f}s")
print(f"Number of 30s chunks: {result['n_chunks']}")

The audio processing pipeline (a code sketch follows the list):

  1. Audio Loading: Supports files and URLs, resamples to 16kHz mono
  2. Chunking: Splits into 30-second chunks with proper padding
  3. STFT: 400-point FFT with 160 hop length
  4. Mel Spectrogram: 128 mel bins covering 0-8000 Hz
  5. Normalization: Log scale with global max normalization
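
As a rough numpy illustration of steps 2-5 (a simplified sketch, not the library's exact code: the 128-band mel filterbank is assumed given, and the normalization shown is the Whisper-style log-max scheme the description suggests):

import numpy as np

SAMPLE_RATE = 16_000               # step 1: 16 kHz mono
N_FFT = 400                        # step 3: 400-point FFT
HOP_LENGTH = 160                   # step 3: 160-sample hop (10 ms)
CHUNK_SAMPLES = 30 * SAMPLE_RATE   # step 2: 30-second chunks

def log_mel_chunk(chunk, mel_filters):
    # chunk: <= CHUNK_SAMPLES float32 samples; mel_filters: (128, N_FFT // 2 + 1)
    chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))       # step 2: pad last chunk
    window = np.hanning(N_FFT)
    frames = np.lib.stride_tricks.sliding_window_view(chunk, N_FFT)[::HOP_LENGTH]
    power = np.abs(np.fft.rfft(frames * window, axis=-1)) ** 2   # step 3: STFT
    mel = mel_filters @ power.T                                  # step 4: 128 mel bins
    log_mel = np.log10(np.maximum(mel, 1e-10))                   # step 5: log scale
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)           # step 5: global max norm
    return (log_mel + 4.0) / 4.0                                 # roughly [-1, 1]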

Advanced Usage

Streaming Transcription

# Process long audio files efficiently
for chunk in model.transcribe_stream("podcast.mp3", processor, chunk_length_s=30):
    print(chunk, end="", flush=True)

Custom Generation Parameters

inputs = processor.apply_transcrition_request(
    language="en",
    audio="speech.mp3"
)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.1,
    top_p=0.95,
    repetition_penalty=1.1
)

Processing Multiple Files

# Process multiple audio files sequentially
audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
transcriptions = []

for audio_file in audio_files:
    inputs = processor.apply_transcrition_request(language="en", audio=audio_file)
    outputs = model.generate(**inputs, max_new_tokens=1024)
    text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    transcriptions.append(text)

Note: The model processes one audio file at a time. For long audio files, it automatically splits them into 30-second chunks internally.

Pre-quantized Models

For convenience, pre-quantized models are available:

  • mzbac/voxtral-mini-3b-4bit-mixed: 3.2GB model with mixed-precision 4-bit quantization
  • mzbac/voxtral-mini-3b-8bit: 5.3GB model with 8-bit quantization

API Reference

VoxtralProcessor

processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

# Apply transcription formatting
inputs = processor.apply_transcrition_request(
    language="en",  # or "fr", "de", etc.
    audio="path/to/audio.mp3",
    task="transcribe",  # or "translate"
)

# Decode model outputs
text = processor.decode(token_ids, skip_special_tokens=True)

VoxtralForConditionalGeneration

import mlx.core as mx

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507",
    dtype=mx.bfloat16  # Optional: specify dtype
)

# Generate transcription
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.0,
    do_sample=False
)

Performance Tips

  1. Use Quantized Models: 4-bit quantization provides the best balance of size and quality
  2. Temperature Settings: Use temperature=0.0 for deterministic transcription
  3. Chunk Size: Default 30-second chunks are optimal for most use cases
  4. Long Audio: The model automatically handles long audio by splitting it into chunks (see the sketch below)
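
For reference, the chunk count is simple ceiling arithmetic (illustrative only; the library computes this internally):

import math

# A 95-second file becomes ceil(95 / 30) = 4 chunks; the last chunk is
# zero-padded to a full 30 seconds before feature extraction.
def n_chunks(duration_seconds):
    return math.ceil(duration_seconds / 30)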

Requirements

  • Python: 3.11 or higher
  • Platform: Apple Silicon Mac (M1/M2/M3)
  • Dependencies:
    • MLX >= 0.26.5
    • mlx-lm >= 0.26.0
    • mistral-common >= 1.8.2
    • transformers (latest from GitHub)
    • Audio: soundfile, soxr, or ffmpeg
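
A quick way to verify the environment once MLX is installed:

import mlx.core as mx

# On Apple Silicon the default device should be the GPU (Metal backend).
print(mx.default_device())
print(mx.__version__)  # expect >= 0.26.5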

TODO

  • Batch Processing Support: Implement batched inference for processing multiple audio files simultaneously
  • Transformers Tokenizer Integration: Add support for using Hugging Face Transformers tokenizers as an alternative to mistral-common
  • Swift Support: Create a Swift library for Voxtral support

License

See the LICENSE file for details.

Acknowledgments

  • This implementation is based on Mistral AI's Voxtral models and the Hugging Face Transformers implementation
  • Built using Apple's MLX framework for optimized performance on Apple Silicon

Download files

Download the file for your platform.

Source Distribution

mlx_voxtral-0.0.3.tar.gz (31.2 kB)


Built Distribution


mlx_voxtral-0.0.3-py3-none-any.whl (33.6 kB)


File details

Details for the file mlx_voxtral-0.0.3.tar.gz.

File metadata

  • Download URL: mlx_voxtral-0.0.3.tar.gz
  • Size: 31.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.0

File hashes

Hashes for mlx_voxtral-0.0.3.tar.gz

  • SHA256: 4f8c261e0c34ccc324bd7b83fff19e1ea501dfa8ba34836f55f79c30238f57b7
  • MD5: 17be4145902c113623e2359aaca5dbeb
  • BLAKE2b-256: 8b1ac6b7d2179bd5f99520f54de540ab8b2cc093a1483ee350b3663babc2d3c1


File details

Details for the file mlx_voxtral-0.0.3-py3-none-any.whl.


File hashes

Hashes for mlx_voxtral-0.0.3-py3-none-any.whl

  • SHA256: 200082fde33dc1d72d4b76087812695bb49acec2baf73f9bbdcfb721e49582c7
  • MD5: bd638b9671b572b1028135af94e0a52e
  • BLAKE2b-256: 9676e2fe58c0104d5330e80b3fca5a9416b01b0de9af1b207226c4153bedfea9

