
WhisperX-style transcription pipeline using an internal mlx-whisper ASR backend.

Project description

mlx-whisperx

mlx-whisperx is a WhisperX-style transcription pipeline for Apple Silicon. It uses a vendored mlx-whisper ASR backend, then optionally applies WhisperX forced alignment and pyannote diarization.

The project is intended to provide a practical local pipeline with WhisperX-like JSON, subtitle, and text outputs while keeping ASR execution on MLX.

Why This Project Exists

This project adds WhisperX-like functionality to an mlx-whisper workflow. The goal is to keep ASR inference on MLX for Apple Silicon while providing the pipeline pieces people commonly use from WhisperX: VAD chunking, forced alignment, word timestamps, diarization hooks, and familiar JSON/subtitle outputs.

The implementation borrows ideas and code from both upstream projects:

  • WhisperX, for the pipeline structure, alignment workflow, diarization integration, and output conventions.
  • mlx-whisper, for the Apple Silicon ASR backend and model execution path.

This repository vendors and adapts code where needed so the pieces work together as a standalone mlx-whisperx package.

Pipeline

audio -> VAD -> mlx-whisper ASR -> forced alignment -> optional diarization -> writers

Default behavior:

  • ASR model: mlx-community/whisper-turbo
  • VAD backend: Silero
  • Decoding: beam search with beam_size=5 and temperature=0
  • Alignment: enabled for transcription
  • Diarization: disabled unless --diarize is passed
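As a quick orientation, the stage order and defaults above can be sketched in plain Python (illustrative names only, not the package's internal API):

```python
# Illustration only: the documented defaults expressed as a plain dict.
# Keys mirror the CLI/API option names; values come from the list above.
DEFAULTS = {
    "model": "mlx-community/whisper-turbo",
    "vad_method": "silero",
    "beam_size": 5,
    "temperature": 0.0,
    "no_align": False,   # alignment enabled for transcription
    "diarize": False,    # off unless --diarize is passed
}

def pipeline_stages(diarize: bool = False) -> list[str]:
    """Stage order from the diagram above; diarization is optional."""
    stages = ["vad", "asr", "alignment"]
    if diarize:
        stages.append("diarization")
    stages.append("writers")
    return stages

print(pipeline_stages())
print(pipeline_stages(diarize=True))
```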

Installation

Clone the repository and install it into a Python environment:

git clone https://github.com/seedds/mlx-whisperx.git
cd mlx-whisperx
python -m pip install -e .

Install diarization support only when you need pyannote:

python -m pip install -e ".[diarize]"

Install everything this repository currently exposes:

python -m pip install -e ".[full]"

ffmpeg must be available on PATH because audio loading is handled through the ffmpeg CLI.

On macOS with Homebrew:

brew install ffmpeg

Optional pyannote VAD and diarization use pyannote models and may require a Hugging Face token, depending on the selected model.

Usage

mlx-whisperx AUDIO [AUDIO ...] [OPTIONS]

By default, mlx-whisperx:

  • uses mlx-community/whisper-turbo
  • runs Silero VAD
  • performs forced alignment for word timestamps
  • writes outputs to the current directory
  • writes every supported output format when --output_format is not specified

Common options:

  • --model: local model path or Hugging Face repo
  • --language: language code such as en, ja, or fr
  • --task: transcribe or translate
  • --output_dir: directory to write output files
  • --output_name: custom basename for output files
  • --output_format: all, json, srt, vtt, txt, tsv, or aud
  • --no_align: skip forced alignment
  • --diarize: attach speaker labels when diarization is enabled
  • --hf_token: Hugging Face token for gated pyannote models

Example: output JSON

mlx-whisperx audio.wav \
  --output_dir transcripts \
  --output_name audio \
  --output_format json

This writes transcripts/audio.json.

Example JSON output:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.52,
      "text": "Hello and welcome to mlx-whisperx.",
      "words": [
        {"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
        {"word": "and", "start": 0.44, "end": 0.58, "score": 0.98},
        {"word": "welcome", "start": 0.6, "end": 1.05, "score": 0.97},
        {"word": "to", "start": 1.07, "end": 1.18, "score": 0.98},
        {"word": "mlx-whisperx.", "start": 1.2, "end": 2.52, "score": 0.96}
      ]
    }
  ],
  "word_segments": [
    {"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
    {"word": "and", "start": 0.44, "end": 0.58, "score": 0.98}
  ],
  "language": "en"
}
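Once written, the JSON can be consumed with the standard library alone. A minimal sketch that flattens the segments shown above back into word rows (the inline JSON is an abbreviated copy of the example):

```python
import json

def flatten_words(result: dict) -> list[tuple[float, float, str]]:
    """Flatten per-segment word lists into (start, end, word) rows."""
    rows = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            rows.append((w["start"], w["end"], w["word"]))
    return rows

# Abbreviated copy of the example JSON above; in practice, load it with
# json.load() from transcripts/audio.json.
result = json.loads("""
{"segments": [{"start": 0.0, "end": 2.52,
  "text": "Hello and welcome to mlx-whisperx.",
  "words": [{"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
            {"word": "and", "start": 0.44, "end": 0.58, "score": 0.98}]}],
 "language": "en"}
""")
for start, end, word in flatten_words(result):
    print(f"{start:6.2f} {end:6.2f} {word}")
```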

Example: output SRT

mlx-whisperx audio.wav \
  --output_dir subtitles \
  --output_name audio \
  --output_format srt \
  --max_line_width 42 \
  --max_line_count 2

This writes subtitles/audio.srt.

Example SRT output:

1
00:00:00,000 --> 00:00:02,520
Hello and welcome to
mlx-whisperx.

2
00:00:02,700 --> 00:00:05,100
This example shows subtitle
output.
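The cue timestamps follow the standard SubRip HH:MM:SS,mmm form. A small helper showing the conversion (not the package's actual writer code):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm style used in .srt cues."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(0.0))    # 00:00:00,000
print(srt_timestamp(2.52))   # 00:00:02,520
```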

Complete example with optional parameters:

mlx-whisperx audio.wav \
  --model mlx-community/whisper-large-v3-turbo \
  --model_dir ./models \
  --model_cache_only False \
  --device cpu \
  --compute_type float32 \
  --output_dir ./out \
  --output_name meeting \
  --output_format all \
  --verbose True \
  --log-level info \
  --task transcribe \
  --language en \
  --align_model jonatasgrosman/wav2vec2-large-xlsr-53-english \
  --interpolate_method nearest \
  --return_char_alignments \
  --vad_method pyannote \
  --vad_onset 0.5 \
  --vad_offset 0.363 \
  --vad_model pyannote/segmentation-3.0 \
  --chunk_size 30 \
  --vad_dump_path ./out/meeting.vad.json \
  --diarize \
  --min_speakers 2 \
  --max_speakers 4 \
  --diarize_model pyannote/speaker-diarization-community-1 \
  --speaker_embeddings \
  --hf_token YOUR_HF_TOKEN \
  --temperature 0.0 \
  --temperature_increment_on_fallback 0.2 \
  --best_of 5 \
  --beam_size 5 \
  --patience 1.0 \
  --length_penalty 1.0 \
  --suppress_tokens -1 \
  --suppress_numerals \
  --initial_prompt "Technical meeting about MLX WhisperX." \
  --hotwords "MLX, WhisperX, pyannote, diarization" \
  --condition_on_previous_text True \
  --compression_ratio_threshold 2.4 \
  --logprob_threshold -1.0 \
  --no_speech_threshold 0.6 \
  --max_line_width 42 \
  --max_line_count 2 \
  --max_words_per_line 8 \
  --highlight_words False \
  --print_progress True

Notes:

  • --output_format all writes .txt, .vtt, .srt, .tsv, .json, and .aud.
  • --max_line_count only has an effect when --max_line_width is also set.
  • --highlight_words applies to srt and vtt.
  • --hf_token is only needed for gated pyannote models.

Python API

from mlx_whisperx import transcribe

result = transcribe(
    "audio.wav",
    model="mlx-community/whisper-large-v3-turbo",
    language="en",
)

Print one transcript segment per line:

for segment in result["segments"]:
    print(segment["text"].strip())

Print segment timestamps:

for segment in result["segments"]:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text'].strip()}")

Print word-level timestamps:

for word in result["word_segments"]:
    print(f"[{word['start']:.2f} -> {word['end']:.2f}] {word['word']}")
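A full plain-text transcript is just the stripped segment texts joined with newlines, as in this standalone sketch (toy data standing in for result["segments"]; roughly what the txt output amounts to):

```python
# Toy segments in the shape produced by transcribe(); real data comes
# from result["segments"].
segments = [
    {"text": " Hello and welcome to mlx-whisperx."},
    {"text": " This example shows subtitle output."},
]
transcript = "\n".join(s["text"].strip() for s in segments)
print(transcript)
```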

Common API options match the CLI names:

result = transcribe(
    "audio.wav",
    model="mlx-community/whisper-turbo",
    language="en",
    beam_size=5,
    temperature=0.0,
    no_align=False,
    diarize=False,
    vad_method="silero",
)

language accepts either canonical codes such as en or case-insensitive names and aliases such as English or Portuguese.
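The lookup behaves like a case-insensitive table from names and aliases to codes. A hypothetical sketch (the real table lives inside the package and covers far more languages):

```python
# Hypothetical alias table; entries here are examples only.
ALIASES = {"english": "en", "japanese": "ja", "portuguese": "pt"}
CODES = {"en", "ja", "fr", "pt"}

def normalize_language(value: str) -> str:
    """Map a code or case-insensitive name/alias to a canonical code."""
    v = value.strip().lower()
    if v in CODES:
        return v
    if v in ALIASES:
        return ALIASES[v]
    raise ValueError(f"unknown language: {value!r}")

print(normalize_language("English"))
print(normalize_language("pt"))
```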

Output Schema

JSON output follows the WhisperX-style shape:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Example transcript text.",
      "words": [
        {"word": "Example", "start": 0.0, "end": 0.6, "score": 0.98}
      ]
    }
  ],
  "word_segments": [
    {"word": "Example", "start": 0.0, "end": 0.6, "score": 0.98}
  ],
  "language": "en"
}

When diarization is enabled, speaker labels are included where available:

{"word": "Hello", "start": 0.0, "end": 0.4, "score": 0.99, "speaker": "SPEAKER_00"}
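With speaker fields present, consecutive same-speaker words can be collapsed into turns. A standalone sketch using made-up data in the schema's shape:

```python
from itertools import groupby

def by_speaker(words: list[dict]) -> list[tuple[str, str]]:
    """Collapse consecutive same-speaker words into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.get("speaker", "UNKNOWN")):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns

# Made-up word_segments with speaker labels, as in the example above.
words = [
    {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "SPEAKER_00"},
    {"word": "there.", "start": 0.5, "end": 0.8, "speaker": "SPEAKER_00"},
    {"word": "Hi!", "start": 1.0, "end": 1.3, "speaker": "SPEAKER_01"},
]
for speaker, text in by_speaker(words):
    print(f"{speaker}: {text}")
```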

CLI Reference

Basic options:

  • --model: mlx-whisper model directory or Hugging Face repo.
  • --language: language code or case-insensitive name/alias. If omitted, language is auto-detected by ASR.
  • --task: transcribe or translate.
  • --output_format: all, srt, vtt, txt, tsv, json, or aud.
  • --output_dir: directory for output files.
  • --output_name: custom output basename.
  • --verbose: print transcript and logs.

English-only Whisper models such as .en checkpoints force language=en and do not support --task translate.

Decoding options:

  • --temperature: sampling temperature. Default is 0.0.
  • --beam_size: beam size when temperature=0. Default is 5.
  • --best_of: number of candidates when sampling with temperature > 0.
  • --patience: beam-search patience.
  • --length_penalty: beam-search length penalty.
  • --suppress_tokens: comma-separated token IDs to suppress.
  • --suppress_numerals: suppress numeric and currency-symbol tokens.
  • --initial_prompt: initial prompt for ASR.
  • --hotwords: hint phrases appended to the prompt.
  • --condition_on_previous_text: condition each decoding window on the text already decoded within the same VAD chunk.

Precision and model-cache options:

  • --compute_type float16: force MLX ASR fp16. This is the default.
  • --compute_type float32: force MLX ASR fp32.
  • --model_dir: cache directory for ASR, alignment, pyannote VAD, and diarization models.
  • --model_cache_only: use only locally cached ASR, alignment, pyannote VAD, and diarization models; do not download.

VAD options:

  • --vad_method silero: default VAD backend.
  • --vad_method pyannote: use pyannote VAD if your environment supports it.
  • --vad_onset: VAD onset threshold.
  • --vad_offset: VAD offset threshold.
  • --vad_model: Hugging Face pyannote segmentation model used with --vad_method pyannote. Defaults to pyannote/segmentation-3.0.
  • --chunk_size: merged VAD chunk size in seconds.
  • --no_vad: transcribe the full file as one chunk.
  • --clip_timestamps: comma-separated clip start/end pairs in seconds. Requires --no_vad.
  • --vad_dump_path: write VAD chunks and settings to JSON.

Silero VAD loads from the local Torch Hub cache first. To force a local Silero checkout, set:

export MLX_WHISPERX_SILERO_VAD_PATH=/path/to/snakers4_silero-vad
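The chunking idea behind --chunk_size can be sketched as merging adjacent speech regions until a merged chunk would exceed the limit. This is illustrative only; the vendored implementation may differ in details such as padding and boundary handling:

```python
# Sketch: merge adjacent (start, end) speech regions into chunks no longer
# than chunk_size seconds (30 is the CLI default). Regions must be sorted.
def merge_chunks(regions: list[tuple[float, float]],
                 chunk_size: float = 30.0) -> list[tuple[float, float]]:
    merged = []
    cur_start, cur_end = regions[0]
    for start, end in regions[1:]:
        if end - cur_start <= chunk_size:
            cur_end = end          # absorb this region into the current chunk
        else:
            merged.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    merged.append((cur_start, cur_end))
    return merged

print(merge_chunks([(0.0, 4.0), (5.0, 12.0), (14.0, 29.0), (31.0, 40.0)]))
```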

Alignment options:

  • --no_align: skip forced alignment.
  • --align_model: override the alignment model.
  • --interpolate_method: nearest, linear, or ignore.
  • --return_char_alignments: include character alignments in JSON.

Diarization options:

  • --diarize: assign speaker labels.
  • --diarize_model: pyannote diarization model name.
  • --min_speakers: minimum speaker count.
  • --max_speakers: maximum speaker count.
  • --speaker_embeddings: include speaker embeddings in JSON when available.
  • --hf_token: Hugging Face token for gated pyannote models.

Subtitle options:

  • --max_line_width: target subtitle line width.
  • --max_line_count: maximum lines per subtitle cue. Requires --max_line_width.
  • --max_words_per_line: maximum words per subtitle cue.
  • --highlight_words: underline the active word in SRT/VTT output.
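The width/count interaction can be approximated with the standard library: wrap text to --max_line_width, then split the wrapped lines into cues of at most --max_line_count lines each. This is a sketch of the idea, not the package's segmentation logic:

```python
import textwrap

def wrap_cue(text: str, max_line_width: int, max_line_count: int) -> list[list[str]]:
    """Wrap text to the target width, then chunk into cues of <= max_line_count lines."""
    lines = textwrap.wrap(text, width=max_line_width)
    return [lines[i:i + max_line_count] for i in range(0, len(lines), max_line_count)]

print(wrap_cue("Hello and welcome to mlx-whisperx.", 22, 2))
```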

Examples

Inspect VAD chunks before ASR:

mlx-whisperx audio.wav \
  --output_format json \
  --vad_dump_path audio.vad.json

Transcribe only selected clips without VAD chunking:

mlx-whisperx audio.wav \
  --no_vad \
  --clip_timestamps 0,15,30,45 \
  --output_format json
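The clip list reads as flat start/end pairs in seconds. A hypothetical parser showing the interpretation (the CLI's actual parsing may differ):

```python
def parse_clips(spec: str) -> list[tuple[float, float]]:
    """Parse "0,15,30,45" into [(0.0, 15.0), (30.0, 45.0)]-style pairs."""
    values = [float(v) for v in spec.split(",")]
    if len(values) % 2 != 0:
        raise ValueError("clip_timestamps needs start/end pairs")
    return list(zip(values[::2], values[1::2]))

print(parse_clips("0,15,30,45"))
```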

Run deterministic beam search explicitly:

mlx-whisperx audio.wav \
  --language en \
  --temperature 0 \
  --beam_size 5 \
  --output_format json

Use temperature fallback:

mlx-whisperx audio.wav \
  --temperature 0 \
  --temperature_increment_on_fallback 0.2
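Temperature fallback follows the Whisper convention: decode at temperature 0 first, then retry at increasing temperatures when the output looks degenerate. A toy sketch of the control flow (decode_fn and the toy decoder are stand-ins, not the package's decoder; thresholds mirror --compression_ratio_threshold and --logprob_threshold):

```python
def decode_with_fallback(decode_fn, temperature=0.0, increment=0.2,
                         max_temperature=1.0,
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Retry decoding at higher temperatures until quality checks pass."""
    t = temperature
    while True:
        result = decode_fn(t)
        ok = (result["compression_ratio"] <= compression_ratio_threshold
              and result["avg_logprob"] >= logprob_threshold)
        if ok or t >= max_temperature:
            return result, t
        t = round(t + increment, 10)

# Toy decoder: degenerate (repetitive) output at t=0, acceptable at t=0.2.
def toy_decode(t):
    if t < 0.2:
        return {"compression_ratio": 3.1, "avg_logprob": -0.4}
    return {"compression_ratio": 1.8, "avg_logprob": -0.3}

result, used_t = decode_with_fallback(toy_decode)
print(used_t)
```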

Suppress numerals and currency symbols during decoding:

mlx-whisperx audio.wav --suppress_numerals --output_format json

Use pyannote VAD instead of the default Silero VAD:

mlx-whisperx audio.wav \
  --vad_method pyannote \
  --vad_model pyannote/segmentation-3.0 \
  --hf_token YOUR_HF_TOKEN \
  --output_format json

Skip forced alignment:

mlx-whisperx audio.wav --no_align --output_format json

Run diarization:

mlx-whisperx audio.wav \
  --diarize \
  --hf_token YOUR_HF_TOKEN \
  --output_format json

Process multiple files:

mlx-whisperx first.wav second.wav third.wav --output_dir transcripts --output_format all

Current Behavior and Limitations

  • ASR decodes merged VAD chunks serially.
  • There is no batch_size CLI or API option.
  • translate skips forced alignment because alignment models are specific to the language being transcribed.
  • clip_timestamps is only supported with --no_vad because VAD chunking changes the timing base before ASR runs.
  • Pyannote VAD and diarization depend on a compatible PyTorch, torchaudio, pyannote installation, and Hugging Face model access when the selected model is gated.
  • The vendored ASR backend lives under mlx_whisperx.backend.mlx_whisper so decoder behavior can be changed without modifying external reference repositories.

Development Checks

Compile the package (the ** glob requires recursive globbing, e.g. zsh or Bash with globstar enabled):

python -m py_compile mlx_whisperx/**/*.py

Check CLI help:

python -m mlx_whisperx --help

Build a wheel:

python -m build

Download files


Source Distribution

mlx_whisperx-0.1.0.tar.gz (823.4 kB)


Built Distribution


mlx_whisperx-0.1.0-py3-none-any.whl (829.5 kB)


File details

Details for the file mlx_whisperx-0.1.0.tar.gz.

File metadata

  • Download URL: mlx_whisperx-0.1.0.tar.gz
  • Upload date:
  • Size: 823.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mlx_whisperx-0.1.0.tar.gz:

  • SHA256: 1e045ee5a9571f90ea107202059f14cafb9aa83ff11e67505e55fe248109b1ad
  • MD5: 3f17800e90df11b357e27f0398d3fc78
  • BLAKE2b-256: bcde0d8cbad61a611c837011d5b36539b1e77a441eb829b67cfc3eb9746f1c34


Provenance

The following attestation bundles were made for mlx_whisperx-0.1.0.tar.gz:

Publisher: publish.yml on seedds/mlx-whisperx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mlx_whisperx-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mlx_whisperx-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 829.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mlx_whisperx-0.1.0-py3-none-any.whl:

  • SHA256: 2701fa83f554a00438dfebc73d6eeafe339f99180bd2e538009a3b128a7e1c63
  • MD5: 1ada7795ac075cbeb957e73a528af406
  • BLAKE2b-256: 9b7c113f07cc2c4951798aa51cde869a96700006df2d41866dc7801632e2ccf9


Provenance

The following attestation bundles were made for mlx_whisperx-0.1.0-py3-none-any.whl:

Publisher: publish.yml on seedds/mlx-whisperx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
