MLX-Audio is a package for running text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) models locally on your Mac using MLX.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

MLX-Audio

A text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis and transcription on Apple Silicon.

Features

Fast inference on Apple Silicon (M series chips)
Multiple language support
Voice customization options
Adjustable speech speed control (0.5x to 2.0x)
Interactive web interface with 3D audio visualization
REST API for TTS generation
Quantization support for optimized performance
Direct access to output files via Finder/Explorer integration

Installation

# Install the package
pip install mlx-audio

# For web interface and API dependencies
pip install -r requirements.txt

Modular Imports

MLX-Audio supports modular imports, allowing you to use STT, TTS, or STS independently without loading unnecessary dependencies. This is useful for embedding in applications where you only need specific functionality.

# Import only STT (doesn't load TTS dependencies)
from mlx_audio.stt.utils import load_model as load_stt_model
model = load_stt_model("mlx-community/whisper-large-v3-turbo")

# Import only TTS (doesn't load STT dependencies)
from mlx_audio.tts.utils import load_model as load_tts_model
model = load_tts_model("prince-canuma/Kokoro-82M")

# Import shared DSP functions directly
from mlx_audio.dsp import stft, istft, mel_filters, hanning

Benefits:

Reduced bundle size: STT-only imports ~400MB vs full package ~1.7GB
Faster startup: Only loads required modules
Ideal for embedding: Perfect for iOS/macOS apps needing only transcription
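
One way an embedding application might take advantage of this is to defer the import until a feature is first used. The following is a minimal sketch, assuming only the two module paths shown above; `backend_module_name` and `load_backend` are hypothetical helper names for illustration and are not part of MLX-Audio.

```python
import importlib

# Module paths taken from the import examples above.
_BACKENDS = {
    "stt": "mlx_audio.stt.utils",
    "tts": "mlx_audio.tts.utils",
}


def backend_module_name(kind: str) -> str:
    """Return the module path for a backend kind ("stt" or "tts")."""
    try:
        return _BACKENDS[kind]
    except KeyError:
        raise ValueError(
            f"unknown backend {kind!r}; expected one of {sorted(_BACKENDS)}"
        ) from None


def load_backend(kind: str):
    """Import the backend module on first use (requires mlx-audio to be installed)."""
    return importlib.import_module(backend_module_name(kind))
```

With the package installed, `load_backend("stt").load_model("mlx-community/whisper-large-v3-turbo")` would import only the STT module path.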

Quick Start

To generate audio from the command line:

# Basic usage
mlx_audio.tts.generate --text "Hello, world"

# Specify prefix for output file
mlx_audio.tts.generate --text "Hello, world" --file_prefix hello

# Adjust speaking speed (0.5-2.0)
mlx_audio.tts.generate --text "Hello, world" --speed 1.4
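
If you prefer to drive the command-line tool from a script, the commands above can be wrapped with `subprocess`. This is a sketch under stated assumptions: it uses only the `--text`, `--file_prefix`, and `--speed` flags shown above, and `build_tts_command` and `run_tts` are hypothetical helper names, not part of the package.

```python
import subprocess


def build_tts_command(text, file_prefix=None, speed=None):
    """Build the argv list for mlx_audio.tts.generate using only the flags shown above."""
    command = ["mlx_audio.tts.generate", "--text", text]
    if file_prefix is not None:
        command += ["--file_prefix", file_prefix]
    if speed is not None:
        # The documented speed range is 0.5 to 2.0.
        if not 0.5 <= speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        command += ["--speed", str(speed)]
    return command


def run_tts(text, **options):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_tts_command(text, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or spaces.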

How to call from Python

To generate audio from Python, call generate_audio:

from mlx_audio.tts.generate import generate_audio

# Example: Generate an audiobook chapter as WAV audio
generate_audio(
    text=("In the beginning, the universe was created...\n"
        "...or the simulation was booted up."),
    model_path="prince-canuma/Kokoro-82M",
    voice="af_heart",
    speed=1.2,
    lang_code="a", # Kokoro: (a)f_heart, or comment out for auto
    file_prefix="audiobook_chapter1",
    audio_format="wav",
    sample_rate=24000,
    join_audio=True,
    verbose=True  # Set to False to disable print messages
)

print("Audiobook chapter successfully generated!")
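
To sanity-check a generated file, you can inspect it with Python's standard `wave` module. This sketch assumes the output is a standard PCM WAV file; the exact output file name is not specified here, so the path is passed in by the caller. `wav_summary` is a hypothetical helper name.

```python
import wave


def wav_summary(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / sample_rate
    return sample_rate, channels, duration
```

For the example above you would expect the reported sample rate to match the `sample_rate=24000` argument.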

Web Interface & FastAPI Server

MLX-Audio provides a modern web interface with real-time audio visualization capabilities. The interface offers:

Text-to-Speech generation with customizable voices and parameters
Speech-to-Text transcription with support for multiple languages
Audio file upload and playback functionality
Interactive 3D audio visualization
Automatic audio file management in the outputs directory
Direct access to the output folder from the interface (local deployment only)

Key Features

Voice Customization: Select from multiple voice presets including AF Heart, AF Nova, AF Bella, and BF Emma
Speech Rate Control: Fine-tune speech generation speed using an intuitive slider (range: 0.5x - 2.0x)
Dynamic 3D Visualization: Experience audio through an interactive 3D orb that responds to frequency changes
Audio Management: Upload, play, and visualize custom audio files
Smart Playback: Optional automatic playback of generated audio
File Management: Quick access to the output directory through an integrated file explorer button
Speech Recognition: Convert speech to text with support for multiple languages and models

To start the web interface and API server:

UI:

# Configure the API base URL and port
export NEXT_PUBLIC_API_BASE_URL=http://localhost
export NEXT_PUBLIC_API_PORT=8000

# Start UI server
cd mlx_audio/ui
npm run dev

Server:

# Using the command-line interface
mlx_audio.server

# With custom host and port
mlx_audio.server --host 0.0.0.0 --port 9000

# With verbose logging
mlx_audio.server --verbose

Available command line arguments:

--host: Host address to bind the server to (default: 127.0.0.1)
--port: Port to bind the server to (default: 8000)
--verbose: Enable verbose logging

Then open your browser and navigate to:

http://127.0.0.1:8000

API Endpoints

The server provides the following REST API endpoints:

POST /v1/audio/speech: Generate speech from text following the OpenAI TTS specification.
- JSON body parameters:
  - model: Name or path of the TTS model to use.
  - input: Text to convert to speech.
  - voice: Optional voice preset.
  - speed: Optional speech speed (default 1.0).
- Returns the generated audio in WAV format.
POST /v1/audio/transcriptions: Transcribe audio files using an STT model in a format compatible with OpenAI's API.
- Multipart form parameters:
  - file: The audio file to transcribe.
  - model: Name or path of the STT model.
- Returns JSON containing the transcribed text.
GET /v1/models: List loaded models.
POST /v1/models: Load a model by name.
DELETE /v1/models: Unload a model.
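
As an illustration of the speech endpoint, the sketch below builds a `POST /v1/audio/speech` request with the standard library. It assumes the server is running at the default address shown earlier and uses only the JSON fields listed above; `build_speech_request` and `synthesize_to_file` are hypothetical helper names.

```python
import json
import urllib.request


def build_speech_request(text, model, voice=None, speed=1.0,
                         base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST /v1/audio/speech request."""
    payload = {"model": model, "input": text, "speed": speed}
    if voice is not None:
        payload["voice"] = voice
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request, output_path):
    """Send the request and write the returned WAV bytes to output_path."""
    with urllib.request.urlopen(request) as response:
        with open(output_path, "wb") as output_file:
            output_file.write(response.read())
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.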

Note: Generated audio files are stored in ~/.mlx_audio/outputs by default, or in a fallback directory if that location is not writable.
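
The "preferred directory with fallback" behavior described in the note can be sketched as follows. This is not the server's implementation: the fallback location used here (a folder under the system temp directory) is an assumption for illustration, and `resolve_output_dir` is a hypothetical helper name.

```python
import os
import tempfile


def resolve_output_dir(preferred="~/.mlx_audio/outputs", fallback=None):
    """Return a writable output directory, preferring `preferred`.

    The default fallback (a folder under the system temp dir) is an assumption;
    the server's actual fallback location may differ.
    """
    if fallback is None:
        fallback = os.path.join(tempfile.gettempdir(), "mlx_audio_outputs")
    for candidate in (os.path.expanduser(preferred), fallback):
        try:
            os.makedirs(candidate, exist_ok=True)
        except OSError:
            continue  # Cannot create this directory; try the next candidate.
        if os.access(candidate, os.W_OK):
            return candidate
    raise OSError("no writable output directory available")
```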

Models

Kokoro

Kokoro is a multilingual TTS model that supports various languages and voice styles.

Example Usage

from mlx_audio.tts.models.kokoro import KokoroPipeline
from mlx_audio.tts.utils import load_model
from IPython.display import Audio, display
import soundfile as sf

# Initialize the model
model_id = 'prince-canuma/Kokoro-82M'
model = load_model(model_id)

# Create a pipeline with American English
pipeline = KokoroPipeline(lang_code='a', model=model, repo_id=model_id)

# Generate audio; the pipeline yields one result per text segment
text = "The MLX King lives. Let him cook!"
for i, (_, _, audio) in enumerate(pipeline(text, voice='af_heart', speed=1, split_pattern=r'\n+')):
    # Display audio in notebook (if applicable)
    display(Audio(data=audio, rate=24000, autoplay=0))

    # Save each segment to its own file so later segments don't overwrite earlier ones
    sf.write(f'audio_{i}.wav', audio[0], 24000)

Language Options

🇺🇸 'a' - American English
🇬🇧 'b' - British English
🇯🇵 'j' - Japanese (requires pip install misaki[ja])
🇨🇳 'z' - Mandarin Chinese (requires pip install misaki[zh])
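
The language table above can be encoded as a small lookup for validating a `lang_code` before loading a pipeline. This is an illustrative sketch: the helper names are hypothetical, and `lang_code_from_voice` assumes that the first character of a voice name (as in `af_heart`) is its language code.

```python
# Language codes and optional dependencies as listed above.
KOKORO_LANGUAGES = {
    "a": ("American English", None),
    "b": ("British English", None),
    "j": ("Japanese", "misaki[ja]"),
    "z": ("Mandarin Chinese", "misaki[zh]"),
}


def describe_language(lang_code):
    """Return (language_name, required_extra_or_None) for a Kokoro language code."""
    try:
        return KOKORO_LANGUAGES[lang_code]
    except KeyError:
        raise ValueError(f"unsupported lang_code {lang_code!r}") from None


def lang_code_from_voice(voice):
    """Guess the language code from a voice name such as 'af_heart'.

    Assumes the first character of the voice name is the language code.
    """
    code = voice[:1]
    describe_language(code)  # Raises ValueError if the code is not listed.
    return code
```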

CSM (Conversational Speech Model)

CSM is a model from Sesame that supports text-to-speech and voice customization using reference audio samples.

Example Usage

# Generate speech using CSM-1B model with reference audio
python -m mlx_audio.tts.generate --model mlx-community/csm-1b --text "Hello from Sesame." --play --ref_audio ./conversational_a.wav

You can pass any audio file to clone the voice from, or download a sample audio file from here.

Advanced Features

Quantization

You can quantize models for improved performance:

from mlx_audio.tts.utils import quantize_model, load_model
import json
import os
import mlx.core as mx

model = load_model(repo_id='prince-canuma/Kokoro-82M')
config = model.config

# Quantize to 8-bit
group_size = 64
bits = 8
weights, config = quantize_model(model, config, group_size, bits)

# Create the output directory if it doesn't exist
os.makedirs('./8bit', exist_ok=True)

# Save quantized model
with open('./8bit/config.json', 'w') as f:
    json.dump(config, f)

mx.save_safetensors("./8bit/kokoro-v1_0.safetensors", weights, metadata={"format": "mlx"})
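
To get a feel for what `group_size` and `bits` mean for storage, the following back-of-the-envelope sketch estimates the average bits per weight. It assumes each group of weights stores one 16-bit scale and one 16-bit bias, a common layout for affine group quantization; the actual storage format may differ, and not every layer of a model is necessarily quantized. The function names are hypothetical.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Average storage per weight, including per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_size_mb(num_weights, bits, group_size):
    """Approximate size in megabytes (10**6 bytes) of the quantized weights."""
    return num_weights * effective_bits_per_weight(bits, group_size) / 8 / 1e6
```

Under these assumptions, `bits = 8` with `group_size = 64` costs 8.5 bits per weight. A smaller group size gives finer-grained scales at the cost of more overhead.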

Requirements

MLX
Python 3.8+
Apple Silicon Mac (for optimal performance)
For the web interface and API:
- FastAPI
- Uvicorn

Swift Integration

This repo ships a Swift package for on-device TTS, STT, and STS using Apple's MLX framework on macOS and iOS.

Available Products

| Product | Description | Dependencies |
| --- | --- | --- |
| `MLXAudio` | Text-to-Speech (Kokoro, Marvis, Orpheus) | Native Swift |
| `MLXAudioSTT` | Speech-to-Text (Whisper via PythonKit) | PythonKit + Python-Apple-support |
| `MLXAudioSTS` | Speech-to-Speech pipeline (STT → LLM → TTS) | MLXAudioSTT |

Supported Platforms

macOS: 14.0+
iOS: 17.0+

Adding the Swift Package Dependency

Via Xcode (Recommended)

Open your Xcode project
Navigate to File → Add Package Dependencies...
In the search bar, enter the package repository URL:
```
https://github.com/Blaizzy/mlx-audio.git
```
Select the package and choose the version you want to use
Add the desired product(s) to your target:
- MLXAudio - TTS only (native Swift)
- MLXAudioSTT - STT via PythonKit
- MLXAudioSTS - Full speech-to-speech pipeline

Via Package.swift

Add the following dependency to your Package.swift file:

dependencies: [
    .package(url: "https://github.com/Blaizzy/mlx-audio.git", from: "0.2.5")
],
targets: [
    .target(
        name: "YourTarget",
        dependencies: [
            // Choose the products you need:
            .product(name: "MLXAudio", package: "mlx-audio"),      // TTS
            .product(name: "MLXAudioSTT", package: "mlx-audio"),   // STT
            .product(name: "MLXAudioSTS", package: "mlx-audio"),   // STS
        ]
    )
]

TTS Usage

import MLXAudio

// Create a session with a built-in voice (auto-downloads model on first use)
let session = try await MarvisSession(voice: .conversationalA) // playback enabled by default

// One-shot generation (auto-plays if playback is enabled)
let result = try await session.generate(for: "Your text here")
print("Generated \(result.sampleCount) samples @ \(result.sampleRate) Hz")

Streaming generation

Get responsive audio chunks as they are decoded. Chunks are auto-played if playback is enabled.

import MLXAudio

let session = try await MarvisSession(voice: .conversationalA)

for try await chunk in session.stream(text: "Hello there from streaming mode", streamingInterval: 0.5) {
    // Each chunk includes PCM samples and timing metrics
    print("chunk samples=\(chunk.sampleCount) rtf=\(chunk.realTimeFactor)")
}

Raw audio (no playback)

If you want just the samples without auto-play, disable playback at init or call generateRaw.

import MLXAudio

// Option A: Disable playback globally for the session
let s1 = try await MarvisSession(voice: .conversationalA, playbackEnabled: false)
let raw1 = try await s1.generateRaw(for: "Save this to a file")

// Option B: Keep playback enabled but request a raw result for this call
let s2 = try await MarvisSession(voice: .conversationalA)
let raw2 = try await s2.generateRaw(for: "No auto-play for this one")

// rawX.audio is [Float] PCM at rawX.sampleRate (mono)

STT Usage (requires Python-Apple-support)

Note: STT uses PythonKit to bridge mlx_audio.stt (Whisper). You must bundle Python-Apple-support in your app.

import MLXAudioSTT

// Initialize Python environment (call once at app startup)
try PythonSetup.initialize()

// Create STT bridge with progress tracking
let stt = try await STTBridge(model: .whisperLargeV3Turbo) { progress in
    print("Loading: \(Int(progress.fractionCompleted * 100))%")
}

// Transcribe audio file
let result = try await stt.transcribe(audioURL: audioFileURL)
print("Transcription: \(result.text)")

// Or transcribe from an audio buffer
let buffer = try AudioBuffer.load(from: audioFileURL)
let bufferResult = try await stt.transcribe(buffer: buffer)

STS Usage (Speech-to-Speech Pipeline)

import MLXAudioSTS
import MLXAudio

// Initialize Python for STT
try PythonSetup.initialize()

// Create TTS session
let tts = try await MarvisSession(voice: .conversationalA)

// Create voice pipeline
let pipeline = try await VoicePipeline(
    config: VoicePipelineConfig(sttModel: .whisperLargeV3Turbo),
    responseGenerator: { input in
        // Your LLM response generation here
        return "You said: \(input)"
    },
    ttsGenerate: { text in
        _ = try await tts.generate(for: text)
    }
)

// Listen to pipeline events
Task {
    for await event in pipeline.events {
        switch event {
        case .transcription(let text):
            print("You said: \(text)")
        case .speaking:
            print("Speaking response...")
        case .error(let msg):
            print("Error: \(msg)")
        default:
            break
        }
    }
}

// Start listening
try await pipeline.start()

License

MIT License

Acknowledgements

Thanks to the Apple MLX team for providing a great framework for building TTS and STS models.
This project uses the Kokoro model architecture for text-to-speech synthesis.
The 3D visualization uses Three.js for rendering.

@misc{mlx-audio,
  author = {Canuma, Prince},
  title = {MLX Audio},
  year = {2025},
  howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
  note = {A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.}
}

This version

0.2.8

Dec 7, 2025

Download files

Source Distribution

better_mlx_audio-0.2.8.tar.gz (988.0 kB)

Uploaded Dec 7, 2025 Source

Built Distribution

better_mlx_audio-0.2.8-py3-none-any.whl (1.0 MB)

Uploaded Dec 7, 2025 Python 3

File details

Details for the file better_mlx_audio-0.2.8.tar.gz.

File metadata

Download URL: better_mlx_audio-0.2.8.tar.gz
Upload date: Dec 7, 2025
Size: 988.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for better_mlx_audio-0.2.8.tar.gz
Algorithm	Hash digest
SHA256	`107e375858ba3adcfd8eaf59ccc42987795323b888b47ee012346a3de680cdf1`
MD5	`cfe3d716e2e269144bcd9559d0719c32`
BLAKE2b-256	`2c0231b54d0540557242fbbe188ccac5dabb7ae00a7781fea745e0feea8f51d2`
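
A downloaded file can be checked against the published digest with Python's `hashlib`. The sketch below uses the SHA256 value listed above for the source distribution; `sha256_of` and `matches_expected` are hypothetical helper names, and the same check applies to any other file by passing its own expected digest.

```python
import hashlib
import hmac

# SHA256 digest listed above for better_mlx_audio-0.2.8.tar.gz.
EXPECTED_SHA256 = "107e375858ba3adcfd8eaf59ccc42987795323b888b47ee012346a3de680cdf1"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected value."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```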

File details

Details for the file better_mlx_audio-0.2.8-py3-none-any.whl.

File metadata

Download URL: better_mlx_audio-0.2.8-py3-none-any.whl
Upload date: Dec 7, 2025
Size: 1.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for better_mlx_audio-0.2.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e0cc566fef3e7e987cf1354145cc6a43cbdd5e822b37916526a59f961067b937`
MD5	`73c3e3490a20d5fe38ff78a7ed118714`
BLAKE2b-256	`ed67e3406b02747992e8fefb4ebbc5f585ee45129b652aef2719e5ca9b43233f`

better-mlx-audio 0.2.8
