mlx-whisperx
mlx-whisperx is a WhisperX-style transcription pipeline for Apple Silicon. It uses a vendored mlx-whisper ASR backend, then optionally applies WhisperX forced alignment and pyannote diarization.
The project is intended to provide a practical local pipeline with WhisperX-like JSON, subtitle, and text outputs while keeping ASR execution on MLX.
Why This Project Exists
This project adds WhisperX-like functionality to an mlx-whisper workflow. The goal is to keep ASR inference on MLX for Apple Silicon while providing the pipeline pieces people commonly use from WhisperX: VAD chunking, forced alignment, word timestamps, diarization hooks, and familiar JSON/subtitle outputs.
The implementation borrows ideas and code from both upstream projects:
- WhisperX, for the pipeline structure, alignment workflow, diarization integration, and output conventions.
- mlx-whisper, for the Apple Silicon ASR backend and model execution path.
This repository vendors and adapts code where needed so the pieces work together as a standalone mlx-whisperx package.
Pipeline
audio -> VAD -> mlx-whisper ASR -> forced alignment -> optional diarization -> writers
Default behavior:
- ASR model: `mlx-community/whisper-turbo`
- VAD backend: Silero
- Decoding: beam search with `beam_size=5` and `temperature=0`
- Alignment: enabled for transcription
- Diarization: disabled unless `--diarize` is passed
Installation
Clone the repository and install it into a Python environment:
git clone https://github.com/seedds/mlx-whisperx.git
cd mlx-whisperx
python -m pip install -e .
Install diarization support only when you need pyannote:
python -m pip install -e ".[diarize]"
Install everything this repository currently exposes:
python -m pip install -e ".[full]"
ffmpeg must be available on PATH because audio loading is handled through the ffmpeg CLI.
On macOS with Homebrew:
brew install ffmpeg
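Because audio loading shells out to the ffmpeg CLI, a quick preflight check avoids confusing runtime errors. This helper is a minimal sketch (not part of the package's API):

```python
import shutil

def check_ffmpeg() -> bool:
    """Return True when the ffmpeg CLI is discoverable on PATH."""
    return shutil.which("ffmpeg") is not None

if not check_ffmpeg():
    print("ffmpeg not found on PATH; install it first (e.g. `brew install ffmpeg`)")
```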
Optional pyannote VAD and diarization use pyannote models and may require a Hugging Face token, depending on the selected model.
Usage
mlx-whisperx AUDIO [AUDIO ...] [OPTIONS]
By default, mlx-whisperx:
- uses `mlx-community/whisper-turbo`
- runs Silero VAD
- performs forced alignment for word timestamps
- writes outputs to the current directory
- writes every supported output format when `--output_format` is not specified
Common options:
- `--model`: local model path or Hugging Face repo
- `--language`: language code such as `en`, `ja`, or `fr`
- `--task`: `transcribe` or `translate`
- `--output_dir`: directory to write output files
- `--output_name`: custom basename for output files
- `--output_format`: `all`, `json`, `srt`, `vtt`, `txt`, `tsv`, or `aud`
- `--no_align`: skip forced alignment
- `--diarize`: attach speaker labels when diarization is enabled
- `--hf_token`: Hugging Face token for gated pyannote models
Example: output JSON
mlx-whisperx audio.wav \
--output_dir transcripts \
--output_name audio \
--output_format json
This writes transcripts/audio.json.
Example JSON output:
{
"segments": [
{
"start": 0.0,
"end": 2.52,
"text": "Hello and welcome to mlx-whisperx.",
"words": [
{"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
{"word": "and", "start": 0.44, "end": 0.58, "score": 0.98},
{"word": "welcome", "start": 0.6, "end": 1.05, "score": 0.97},
{"word": "to", "start": 1.07, "end": 1.18, "score": 0.98},
{"word": "mlx-whisperx.", "start": 1.2, "end": 2.52, "score": 0.96}
]
}
],
"word_segments": [
{"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
{"word": "and", "start": 0.44, "end": 0.58, "score": 0.98}
],
"language": "en"
}
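Downstream tooling can consume this JSON shape directly. The sketch below flattens `word_segments` into tab-separated rows; it assumes only the shape shown above and is not part of the package:

```python
import json

# A small document in the WhisperX-style shape shown above.
doc = json.loads("""
{"segments": [{"start": 0.0, "end": 2.52,
  "text": "Hello and welcome to mlx-whisperx.",
  "words": [{"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99}]}],
 "word_segments": [{"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99}],
 "language": "en"}
""")

def to_tsv(result: dict) -> str:
    """Flatten word_segments into start/end/word TSV lines."""
    rows = ["start\tend\tword"]
    for w in result["word_segments"]:
        rows.append(f"{w['start']:.2f}\t{w['end']:.2f}\t{w['word']}")
    return "\n".join(rows)

print(to_tsv(doc))
```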
Example: output SRT
mlx-whisperx audio.wav \
--output_dir subtitles \
--output_name audio \
--output_format srt \
--max_line_width 42 \
--max_line_count 2
This writes subtitles/audio.srt.
Example SRT output:
1
00:00:00,000 --> 00:00:02,520
Hello and welcome to
mlx-whisperx.
2
00:00:02,700 --> 00:00:05,100
This example shows subtitle
output.
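SRT cues use the `HH:MM:SS,mmm` timestamp style shown above. A minimal formatter for that style (illustrative, not the package's internal writer) looks like:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm style used in SRT cues."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```

For example, the first cue above spans `srt_timestamp(0.0)` to `srt_timestamp(2.52)`.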
Complete example with optional parameters:
mlx-whisperx audio.wav \
--model mlx-community/whisper-large-v3-turbo \
--model_dir ./models \
--model_cache_only False \
--device cpu \
--compute_type float32 \
--output_dir ./out \
--output_name meeting \
--output_format all \
--verbose True \
--log-level info \
--task transcribe \
--language en \
--align_model jonatasgrosman/wav2vec2-large-xlsr-53-english \
--interpolate_method nearest \
--return_char_alignments \
--vad_method pyannote \
--vad_onset 0.5 \
--vad_offset 0.363 \
--vad_model pyannote/segmentation-3.0 \
--chunk_size 30 \
--vad_dump_path ./out/meeting.vad.json \
--diarize \
--min_speakers 2 \
--max_speakers 4 \
--diarize_model pyannote/speaker-diarization-community-1 \
--speaker_embeddings \
--hf_token YOUR_HF_TOKEN \
--temperature 0.0 \
--temperature_increment_on_fallback 0.2 \
--best_of 5 \
--beam_size 5 \
--patience 1.0 \
--length_penalty 1.0 \
--suppress_tokens -1 \
--suppress_numerals \
--initial_prompt "Technical meeting about MLX WhisperX." \
--hotwords "MLX, WhisperX, pyannote, diarization" \
--condition_on_previous_text True \
--compression_ratio_threshold 2.4 \
--logprob_threshold -1.0 \
--no_speech_threshold 0.6 \
--max_line_width 42 \
--max_line_count 2 \
--max_words_per_line 8 \
--highlight_words False \
--print_progress True
Notes:
- `--output_format all` writes `.txt`, `.vtt`, `.srt`, `.tsv`, `.json`, and `.aud`.
- `--max_line_count` only has an effect when `--max_line_width` is also set.
- `--highlight_words` applies to `srt` and `vtt`.
- `--hf_token` is only needed for gated pyannote models.
Python API
from mlx_whisperx import transcribe
result = transcribe(
"audio.wav",
model="mlx-community/whisper-large-v3-turbo",
language="en",
)
Print one transcript segment per line:
for segment in result["segments"]:
print(segment["text"].strip())
Print segment timestamps:
for segment in result["segments"]:
print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text'].strip()}")
Print word-level timestamps:
for word in result["word_segments"]:
print(f"[{word['start']:.2f} -> {word['end']:.2f}] {word['word']}")
Common API options match the CLI names:
result = transcribe(
"audio.wav",
model="mlx-community/whisper-turbo",
language="en",
beam_size=5,
temperature=0.0,
no_align=False,
diarize=False,
vad_method="silero",
)
`language` accepts either canonical codes such as `en` or case-insensitive names and aliases such as `English` or `Portuguese`.
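Case-insensitive name resolution can be sketched as a simple alias lookup. The table below is illustrative only, not the package's actual alias table:

```python
# Hypothetical alias table; the real package supports many more languages.
ALIASES = {"english": "en", "portuguese": "pt", "japanese": "ja", "french": "fr"}

def normalize_language(value: str) -> str:
    """Map a case-insensitive name/alias to a canonical code; pass codes through."""
    v = value.strip().lower()
    return ALIASES.get(v, v)
```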
Output Schema
JSON output follows the WhisperX-style shape:
{
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "Example transcript text.",
"words": [
{"word": "Example", "start": 0.0, "end": 0.6, "score": 0.98}
]
}
],
"word_segments": [
{"word": "Example", "start": 0.0, "end": 0.6, "score": 0.98}
],
"language": "en"
}
When diarization is enabled, speaker labels are included where available:
{"word": "Hello", "start": 0.0, "end": 0.4, "score": 0.99, "speaker": "SPEAKER_00"}
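A common post-processing step is grouping diarized words by speaker. This sketch assumes only the word shape shown above (the `speaker` key may be absent when diarization could not label a word):

```python
from collections import defaultdict

def by_speaker(word_segments: list[dict]) -> dict[str, list[str]]:
    """Group word texts by their speaker label, defaulting to UNKNOWN."""
    groups: dict[str, list[str]] = defaultdict(list)
    for w in word_segments:
        groups[w.get("speaker", "UNKNOWN")].append(w["word"])
    return dict(groups)

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4, "score": 0.99, "speaker": "SPEAKER_00"},
    {"word": "Hi", "start": 0.5, "end": 0.7, "score": 0.98, "speaker": "SPEAKER_01"},
]
```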
CLI Reference
Basic options:
- `--model`: `mlx-whisper` model directory or Hugging Face repo.
- `--language`: language code or case-insensitive name/alias. If omitted, the language is auto-detected by ASR.
- `--task`: `transcribe` or `translate`.
- `--output_format`: `all`, `srt`, `vtt`, `txt`, `tsv`, `json`, or `aud`.
- `--output_dir`: directory for output files.
- `--output_name`: custom output basename.
- `--verbose`: print transcript and logs.
English-only Whisper models such as `.en` checkpoints force `language=en` and do not support `--task translate`.
Decoding options:
- `--temperature`: sampling temperature. Default is `0.0`.
- `--beam_size`: beam size when `temperature=0`. Default is `5`.
- `--best_of`: number of candidates when sampling with `temperature > 0`.
- `--patience`: beam-search patience.
- `--length_penalty`: beam-search length penalty.
- `--suppress_tokens`: comma-separated token IDs to suppress.
- `--suppress_numerals`: suppress numeric and currency-symbol tokens.
- `--initial_prompt`: initial prompt for ASR.
- `--hotwords`: hint phrases appended to the prompt.
- `--condition_on_previous_text`: prompt backend windows with previous text inside each VAD chunk.
Precision and model-cache options:
- `--compute_type float16`: force MLX ASR fp16. This is the default.
- `--compute_type float32`: force MLX ASR fp32.
- `--model_dir`: cache directory for ASR, alignment, pyannote VAD, and diarization models.
- `--model_cache_only`: use only cached ASR, alignment, pyannote VAD, and diarization models; do not download.
VAD options:
- `--vad_method silero`: default VAD backend.
- `--vad_method pyannote`: use pyannote VAD if your environment supports it.
- `--vad_onset`: VAD onset threshold.
- `--vad_offset`: VAD offset threshold.
- `--vad_model`: Hugging Face pyannote segmentation model used with `--vad_method pyannote`. Defaults to `pyannote/segmentation-3.0`.
- `--chunk_size`: merged VAD chunk size in seconds.
- `--no_vad`: transcribe the full file as one chunk.
- `--clip_timestamps`: comma-separated clip start/end pairs in seconds. Requires `--no_vad`.
- `--vad_dump_path`: write VAD chunks and settings to JSON.
Silero VAD loads from the local Torch Hub cache first. To force a local Silero checkout, set:
export MLX_WHISPERX_SILERO_VAD_PATH=/path/to/snakers4_silero-vad
Alignment options:
- `--no_align`: skip forced alignment.
- `--align_model`: override the alignment model.
- `--interpolate_method`: `nearest`, `linear`, or `ignore`.
- `--return_char_alignments`: include character alignments in JSON.
Diarization options:
- `--diarize`: assign speaker labels.
- `--diarize_model`: pyannote diarization model name.
- `--min_speakers`: minimum speaker count.
- `--max_speakers`: maximum speaker count.
- `--speaker_embeddings`: include speaker embeddings in JSON when available.
- `--hf_token`: Hugging Face token for gated pyannote models.
Subtitle options:
- `--max_line_width`: target subtitle line width.
- `--max_line_count`: maximum lines per subtitle cue. Requires `--max_line_width`.
- `--max_words_per_line`: maximum words per subtitle cue.
- `--highlight_words`: underline the active word in SRT/VTT output.
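The interaction of `--max_line_width` and `--max_line_count` can be illustrated with standard greedy word wrapping. This mimics the effect of the options, not the package's exact algorithm:

```python
import textwrap

def wrap_cue(text: str, max_line_width: int = 42, max_line_count: int = 2) -> list[str]:
    """Greedily wrap cue text to at most max_line_count lines of max_line_width."""
    lines = textwrap.wrap(text, width=max_line_width)
    return lines[:max_line_count]
```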
Examples
Inspect VAD chunks before ASR:
mlx-whisperx audio.wav \
--output_format json \
--vad_dump_path audio.vad.json
Transcribe only selected clips without VAD chunking:
mlx-whisperx audio.wav \
--no_vad \
--clip_timestamps 0,15,30,45 \
--output_format json
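The `--clip_timestamps` value is a flat list of seconds that pairs up into start/end clips. A hypothetical parser (the CLI's actual parsing may differ) makes the pairing explicit:

```python
def parse_clip_timestamps(spec: str) -> list[tuple[float, float]]:
    """Parse '0,15,30,45' into [(0.0, 15.0), (30.0, 45.0)] start/end pairs."""
    vals = [float(v) for v in spec.split(",") if v.strip()]
    if len(vals) % 2:
        raise ValueError("clip_timestamps needs an even number of values (start,end pairs)")
    return list(zip(vals[::2], vals[1::2]))
```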
Run deterministic beam search explicitly:
mlx-whisperx audio.wav \
--language en \
--temperature 0 \
--beam_size 5 \
--output_format json
Use temperature fallback:
mlx-whisperx audio.wav \
--temperature 0 \
--temperature_increment_on_fallback 0.2
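Whisper-style fallback decoding retries a failed segment at progressively higher temperatures. A sketch of the schedule implied by the flags above (assuming a conventional maximum of 1.0):

```python
def fallback_temperatures(temperature: float = 0.0,
                          increment: float = 0.2,
                          maximum: float = 1.0) -> list[float]:
    """Temperature schedule: start at `temperature`, retry in `increment`
    steps up to `maximum` when decoding quality checks fail."""
    temps = []
    t = temperature
    while t <= maximum + 1e-9:  # tolerate float accumulation error
        temps.append(round(t, 2))
        t += increment
    return temps
```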
Suppress numerals and currency symbols during decoding:
mlx-whisperx audio.wav --suppress_numerals --output_format json
Use pyannote VAD instead of the default Silero VAD:
mlx-whisperx audio.wav \
--vad_method pyannote \
--vad_model pyannote/segmentation-3.0 \
--hf_token YOUR_HF_TOKEN \
--output_format json
Skip forced alignment:
mlx-whisperx audio.wav --no_align --output_format json
Run diarization:
mlx-whisperx audio.wav \
--diarize \
--hf_token YOUR_HF_TOKEN \
--output_format json
Process multiple files:
mlx-whisperx first.wav second.wav third.wav --output_dir transcripts --output_format all
Current Behavior and Limitations
- ASR decodes merged VAD chunks serially; there is no `batch_size` CLI or API option.
- `translate` skips forced alignment because alignment models are transcription-language specific.
- `clip_timestamps` is only supported with `--no_vad` because VAD chunking changes the timing base before ASR runs.
- Pyannote VAD and diarization depend on compatible PyTorch, torchaudio, and pyannote installations, and on Hugging Face model access when the selected model is gated.
- The vendored ASR backend lives under `mlx_whisperx.backend.mlx_whisper` so decoder behavior can be changed without modifying external reference repositories.
Development Checks
Compile the package:
python -m py_compile mlx_whisperx/**/*.py
Check CLI help:
python -m mlx_whisperx --help
Build a wheel:
python -m build