
WhisperX-style transcription pipeline using an internal mlx-whisper ASR backend.

Project description

mlx-whisperx

mlx-whisperx is a WhisperX-style transcription pipeline for Apple Silicon. It uses a vendored mlx-whisper ASR backend, then optionally applies WhisperX forced alignment and pyannote diarization.

The project is intended to provide a practical local pipeline with WhisperX-like JSON, subtitle, and text outputs while keeping ASR execution on MLX.

Why This Project Exists

This project adds WhisperX-like functionality to an mlx-whisper workflow. The goal is to keep ASR inference on MLX for Apple Silicon while providing the pipeline pieces people commonly use from WhisperX: VAD chunking, forced alignment, word timestamps, diarization hooks, and familiar JSON/subtitle outputs.

The implementation borrows ideas and code from both upstream projects:

  • WhisperX, for the pipeline structure, alignment workflow, diarization integration, and output conventions.
  • mlx-whisper, for the Apple Silicon ASR backend and model execution path.

This repository vendors and adapts code where needed so the pieces work together as a standalone mlx-whisperx package.

Pipeline

audio -> VAD -> mlx-whisper ASR -> forced alignment -> optional diarization -> writers

Default behavior:

  • ASR model: mlx-community/whisper-turbo
  • VAD backend: Silero
  • Decoding: beam search with beam_size=5 and temperature=0
  • Alignment: enabled for transcription
  • Diarization: disabled unless --diarize is passed
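As a quick orientation, the stage order and defaults above can be sketched in plain Python (illustrative names only, not the package's internal API):

```python
# Illustration only: the documented defaults expressed as a plain dict.
# Keys mirror the CLI/API option names; values come from the list above.
DEFAULTS = {
    "model": "mlx-community/whisper-turbo",
    "vad_method": "silero",
    "beam_size": 5,
    "temperature": 0.0,
    "no_align": False,   # alignment enabled for transcription
    "diarize": False,    # off unless --diarize is passed
}

def pipeline_stages(diarize: bool = False) -> list[str]:
    """Stage order from the diagram above; diarization is optional."""
    stages = ["vad", "asr", "alignment"]
    if diarize:
        stages.append("diarization")
    stages.append("writers")
    return stages

print(pipeline_stages())
print(pipeline_stages(diarize=True))
```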

Installation

Clone the repository and install it into a Python environment:

git clone https://github.com/seedds/mlx-whisperx.git
cd mlx-whisperx
python -m pip install -e .

Install diarization support only when you need pyannote:

python -m pip install -e ".[diarize]"

Install everything this repository currently exposes:

python -m pip install -e ".[full]"

ffmpeg must be available on PATH because audio loading is handled through the ffmpeg CLI.

On macOS with Homebrew:

brew install ffmpeg

Optional pyannote VAD and diarization use pyannote models and may require a Hugging Face token, depending on the selected model.

Usage

mlx-whisperx AUDIO [AUDIO ...] [OPTIONS]

By default, mlx-whisperx:

  • uses mlx-community/whisper-turbo
  • runs Silero VAD
  • performs forced alignment for word timestamps
  • writes outputs to the current directory
  • writes every supported output format when --output_format is not specified

Common options:

  • --model: local model path or Hugging Face repo
  • --language: language code such as en, ja, or fr
  • --task: transcribe or translate
  • --output_dir: directory to write output files
  • --output_name: custom basename for output files
  • --output_format: all, json, srt, vtt, txt, tsv, or aud
  • --no_align: skip forced alignment
  • --diarize: attach speaker labels when diarization is enabled
  • --hf_token: Hugging Face token for gated pyannote models

Example: output JSON

mlx-whisperx audio.wav \
  --output_dir transcripts \
  --output_name audio \
  --output_format json

This writes transcripts/audio.json.

Example JSON output:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.52,
      "text": "Hello and welcome to mlx-whisperx.",
      "words": [
        {"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
        {"word": "and", "start": 0.44, "end": 0.58, "score": 0.98},
        {"word": "welcome", "start": 0.6, "end": 1.05, "score": 0.97},
        {"word": "to", "start": 1.07, "end": 1.18, "score": 0.98},
        {"word": "mlx-whisperx.", "start": 1.2, "end": 2.52, "score": 0.96}
      ]
    }
  ],
  "word_segments": [
    {"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
    {"word": "and", "start": 0.44, "end": 0.58, "score": 0.98}
  ],
  "language": "en"
}
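Once written, the JSON can be consumed with the standard library alone. A minimal sketch that flattens the segments shown above back into word rows (the inline JSON is an abbreviated copy of the example):

```python
import json

def flatten_words(result: dict) -> list[tuple[float, float, str]]:
    """Flatten per-segment word lists into (start, end, word) rows."""
    rows = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            rows.append((w["start"], w["end"], w["word"]))
    return rows

# Abbreviated copy of the example JSON above; in practice, load it with
# json.load() from transcripts/audio.json.
result = json.loads("""
{"segments": [{"start": 0.0, "end": 2.52,
  "text": "Hello and welcome to mlx-whisperx.",
  "words": [{"word": "Hello", "start": 0.0, "end": 0.42, "score": 0.99},
            {"word": "and", "start": 0.44, "end": 0.58, "score": 0.98}]}],
 "language": "en"}
""")
for start, end, word in flatten_words(result):
    print(f"{start:6.2f} {end:6.2f} {word}")
```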

Example: output SRT

mlx-whisperx audio.wav \
  --output_dir subtitles \
  --output_name audio \
  --output_format srt \
  --max_line_width 42 \
  --max_line_count 2

This writes subtitles/audio.srt.

Example SRT output:

1
00:00:00,000 --> 00:00:02,520
Hello and welcome to
mlx-whisperx.

2
00:00:02,700 --> 00:00:05,100
This example shows subtitle
output.
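The cue timestamps follow the standard SubRip HH:MM:SS,mmm form. A small helper showing the conversion (not the package's actual writer code):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm style used in .srt cues."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(0.0))    # 00:00:00,000
print(srt_timestamp(2.52))   # 00:00:02,520
```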

Complete example with optional parameters:

mlx-whisperx audio.wav \
  --model mlx-community/whisper-large-v3-turbo \
  --model_dir ./models \
  --model_cache_only False \
  --device cpu \
  --compute_type float32 \
  --output_dir ./out \
  --output_name meeting \
  --output_format all \
  --verbose True \
  --log-level info \
  --task transcribe \
  --language en \
  --align_model jonatasgrosman/wav2vec2-large-xlsr-53-english \
  --interpolate_method nearest \
  --return_char_alignments \
  --vad_method pyannote \
  --vad_onset 0.5 \
  --vad_offset 0.363 \
  --vad_model pyannote/segmentation-3.0 \
  --chunk_size 30 \
  --vad_dump_path ./out/meeting.vad.json \
  --diarize \
  --min_speakers 2 \
  --max_speakers 4 \
  --diarize_model pyannote/speaker-diarization-community-1 \
  --speaker_embeddings \
  --hf_token YOUR_HF_TOKEN \
  --temperature 0.0 \
  --temperature_increment_on_fallback 0.2 \
  --best_of 5 \
  --beam_size 5 \
  --patience 1.0 \
  --length_penalty 1.0 \
  --suppress_tokens -1 \
  --suppress_numerals \
  --initial_prompt "Technical meeting about MLX WhisperX." \
  --hotwords "MLX, WhisperX, pyannote, diarization" \
  --condition_on_previous_text True \
  --compression_ratio_threshold 2.4 \
  --logprob_threshold -1.0 \
  --no_speech_threshold 0.6 \
  --max_line_width 42 \
  --max_line_count 2 \
  --max_words_per_line 8 \
  --highlight_words False \
  --print_progress True

Notes:

  • --output_format all writes .txt, .vtt, .srt, .tsv, .json, and .aud.
  • --max_line_count only has an effect when --max_line_width is also set.
  • --highlight_words applies to srt and vtt.
  • --hf_token is only needed for gated pyannote models.

Python API

from mlx_whisperx import transcribe

result = transcribe(
    "audio.wav",
    model="mlx-community/whisper-large-v3-turbo",
    language="en",
)

Print one transcript segment per line:

for segment in result["segments"]:
    print(segment["text"].strip())

Print segment timestamps:

for segment in result["segments"]:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text'].strip()}")

Print word-level timestamps:

for word in result["word_segments"]:
    print(f"[{word['start']:.2f} -> {word['end']:.2f}] {word['word']}")
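A full plain-text transcript is just the stripped segment texts joined with newlines, as in this standalone sketch (toy data standing in for result["segments"]; roughly what the txt output amounts to):

```python
# Toy segments in the shape produced by transcribe(); real data comes
# from result["segments"].
segments = [
    {"text": " Hello and welcome to mlx-whisperx."},
    {"text": " This example shows subtitle output."},
]
transcript = "\n".join(s["text"].strip() for s in segments)
print(transcript)
```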

Common API options match the CLI names:

result = transcribe(
    "audio.wav",
    model="mlx-community/whisper-turbo",
    language="en",
    beam_size=5,
    temperature=0.0,
    no_align=False,
    diarize=False,
    vad_method="silero",
)

language accepts either canonical codes such as en or case-insensitive names and aliases such as English or Portuguese.
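The lookup behaves like a case-insensitive table from names and aliases to codes. A hypothetical sketch (the real table lives inside the package and covers far more languages):

```python
# Hypothetical alias table; entries here are examples only.
ALIASES = {"english": "en", "japanese": "ja", "portuguese": "pt"}
CODES = {"en", "ja", "fr", "pt"}

def normalize_language(value: str) -> str:
    """Map a code or case-insensitive name/alias to a canonical code."""
    v = value.strip().lower()
    if v in CODES:
        return v
    if v in ALIASES:
        return ALIASES[v]
    raise ValueError(f"unknown language: {value!r}")

print(normalize_language("English"))
print(normalize_language("pt"))
```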

Output Schema

JSON output follows the WhisperX-style shape:

{
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "Example transcript text.",
      "words": [
        {"word": "Example", "start": 0.0, "end": 0.6, "score": 0.98}
      ]
    }
  ],
  "word_segments": [
    {"word": "Example", "start": 0.0, "end": 0.6, "score": 0.98}
  ],
  "language": "en"
}

When diarization is enabled, speaker labels are included where available:

{"word": "Hello", "start": 0.0, "end": 0.4, "score": 0.99, "speaker": "SPEAKER_00"}
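With speaker fields present, consecutive same-speaker words can be collapsed into turns. A standalone sketch using made-up data in the schema's shape:

```python
from itertools import groupby

def by_speaker(words: list[dict]) -> list[tuple[str, str]]:
    """Collapse consecutive same-speaker words into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.get("speaker", "UNKNOWN")):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns

# Made-up word_segments with speaker labels, as in the example above.
words = [
    {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "SPEAKER_00"},
    {"word": "there.", "start": 0.5, "end": 0.8, "speaker": "SPEAKER_00"},
    {"word": "Hi!", "start": 1.0, "end": 1.3, "speaker": "SPEAKER_01"},
]
for speaker, text in by_speaker(words):
    print(f"{speaker}: {text}")
```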

CLI Reference

Basic options:

  • --model: mlx-whisper model directory or Hugging Face repo.
  • --language: language code or case-insensitive name/alias. If omitted, language is auto-detected by ASR.
  • --task: transcribe or translate.
  • --output_format: all, srt, vtt, txt, tsv, json, or aud.
  • --output_dir: directory for output files.
  • --output_name: custom output basename.
  • --verbose: print transcript and logs.

English-only Whisper models such as .en checkpoints force language=en and do not support --task translate.

Decoding options:

  • --temperature: sampling temperature. Default is 0.0.
  • --beam_size: beam size when temperature=0. Default is 5.
  • --best_of: number of candidates when sampling with temperature > 0.
  • --patience: beam-search patience.
  • --length_penalty: beam-search length penalty.
  • --suppress_tokens: comma-separated token IDs to suppress.
  • --suppress_numerals: suppress numeric and currency-symbol tokens.
  • --initial_prompt: initial prompt for ASR.
  • --hotwords: hint phrases appended to the prompt.
  • --condition_on_previous_text: condition each decoding window on the text already decoded within the same VAD chunk.

Precision and model-cache options:

  • --compute_type float16: force MLX ASR fp16. This is the default.
  • --compute_type float32: force MLX ASR fp32.
  • --model_dir: cache directory for ASR, alignment, pyannote VAD, and diarization models.
  • --model_cache_only: use only locally cached ASR, alignment, pyannote VAD, and diarization models; do not download.

VAD options:

  • --vad_method silero: default VAD backend.
  • --vad_method pyannote: use pyannote VAD if your environment supports it.
  • --vad_onset: VAD onset threshold.
  • --vad_offset: VAD offset threshold.
  • --vad_model: Hugging Face pyannote segmentation model used with --vad_method pyannote. Defaults to pyannote/segmentation-3.0.
  • --chunk_size: merged VAD chunk size in seconds.
  • --no_vad: transcribe the full file as one chunk.
  • --clip_timestamps: comma-separated clip start/end pairs in seconds. Requires --no_vad.
  • --vad_dump_path: write VAD chunks and settings to JSON.

Silero VAD loads from the local Torch Hub cache first. To force a local Silero checkout, set:

export MLX_WHISPERX_SILERO_VAD_PATH=/path/to/snakers4_silero-vad
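The chunking idea behind --chunk_size can be sketched as merging adjacent speech regions until a merged chunk would exceed the limit. This is illustrative only; the vendored implementation may differ in details such as padding and boundary handling:

```python
# Sketch: merge adjacent (start, end) speech regions into chunks no longer
# than chunk_size seconds (30 is the CLI default). Regions must be sorted.
def merge_chunks(regions: list[tuple[float, float]],
                 chunk_size: float = 30.0) -> list[tuple[float, float]]:
    merged = []
    cur_start, cur_end = regions[0]
    for start, end in regions[1:]:
        if end - cur_start <= chunk_size:
            cur_end = end          # absorb this region into the current chunk
        else:
            merged.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    merged.append((cur_start, cur_end))
    return merged

print(merge_chunks([(0.0, 4.0), (5.0, 12.0), (14.0, 29.0), (31.0, 40.0)]))
```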

Alignment options:

  • --no_align: skip forced alignment.
  • --align_model: override the alignment model.
  • --interpolate_method: nearest, linear, or ignore.
  • --return_char_alignments: include character alignments in JSON.

Diarization options:

  • --diarize: assign speaker labels.
  • --diarize_model: pyannote diarization model name.
  • --min_speakers: minimum speaker count.
  • --max_speakers: maximum speaker count.
  • --speaker_embeddings: include speaker embeddings in JSON when available.
  • --hf_token: Hugging Face token for gated pyannote models.

Subtitle options:

  • --max_line_width: target subtitle line width.
  • --max_line_count: maximum lines per subtitle cue. Requires --max_line_width.
  • --max_words_per_line: maximum words per subtitle cue.
  • --highlight_words: underline the active word in SRT/VTT output.
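The width/count interaction can be approximated with the standard library: wrap text to --max_line_width, then split the wrapped lines into cues of at most --max_line_count lines each. This is a sketch of the idea, not the package's segmentation logic:

```python
import textwrap

def wrap_cue(text: str, max_line_width: int, max_line_count: int) -> list[list[str]]:
    """Wrap text to the target width, then chunk into cues of <= max_line_count lines."""
    lines = textwrap.wrap(text, width=max_line_width)
    return [lines[i:i + max_line_count] for i in range(0, len(lines), max_line_count)]

print(wrap_cue("Hello and welcome to mlx-whisperx.", 22, 2))
```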

Examples

Inspect VAD chunks before ASR:

mlx-whisperx audio.wav \
  --output_format json \
  --vad_dump_path audio.vad.json

Transcribe only selected clips without VAD chunking:

mlx-whisperx audio.wav \
  --no_vad \
  --clip_timestamps 0,15,30,45 \
  --output_format json
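The clip list reads as flat start/end pairs in seconds. A hypothetical parser showing the interpretation (the CLI's actual parsing may differ):

```python
def parse_clips(spec: str) -> list[tuple[float, float]]:
    """Parse "0,15,30,45" into [(0.0, 15.0), (30.0, 45.0)]-style pairs."""
    values = [float(v) for v in spec.split(",")]
    if len(values) % 2 != 0:
        raise ValueError("clip_timestamps needs start/end pairs")
    return list(zip(values[::2], values[1::2]))

print(parse_clips("0,15,30,45"))
```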

Run deterministic beam search explicitly:

mlx-whisperx audio.wav \
  --language en \
  --temperature 0 \
  --beam_size 5 \
  --output_format json

Use temperature fallback:

mlx-whisperx audio.wav \
  --temperature 0 \
  --temperature_increment_on_fallback 0.2
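Temperature fallback follows the Whisper convention: decode at temperature 0 first, then retry at increasing temperatures when the output looks degenerate. A toy sketch of the control flow (decode_fn and the toy decoder are stand-ins, not the package's decoder; thresholds mirror --compression_ratio_threshold and --logprob_threshold):

```python
def decode_with_fallback(decode_fn, temperature=0.0, increment=0.2,
                         max_temperature=1.0,
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Retry decoding at higher temperatures until quality checks pass."""
    t = temperature
    while True:
        result = decode_fn(t)
        ok = (result["compression_ratio"] <= compression_ratio_threshold
              and result["avg_logprob"] >= logprob_threshold)
        if ok or t >= max_temperature:
            return result, t
        t = round(t + increment, 10)

# Toy decoder: degenerate (repetitive) output at t=0, acceptable at t=0.2.
def toy_decode(t):
    if t < 0.2:
        return {"compression_ratio": 3.1, "avg_logprob": -0.4}
    return {"compression_ratio": 1.8, "avg_logprob": -0.3}

result, used_t = decode_with_fallback(toy_decode)
print(used_t)
```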

Suppress numerals and currency symbols during decoding:

mlx-whisperx audio.wav --suppress_numerals --output_format json

Use pyannote VAD instead of the default Silero VAD:

mlx-whisperx audio.wav \
  --vad_method pyannote \
  --vad_model pyannote/segmentation-3.0 \
  --hf_token YOUR_HF_TOKEN \
  --output_format json

Skip forced alignment:

mlx-whisperx audio.wav --no_align --output_format json

Run diarization:

mlx-whisperx audio.wav \
  --diarize \
  --hf_token YOUR_HF_TOKEN \
  --output_format json

Process multiple files:

mlx-whisperx first.wav second.wav third.wav --output_dir transcripts --output_format all

Current Behavior and Limitations

  • ASR decodes merged VAD chunks serially.
  • There is no batch_size CLI or API option.
  • translate skips forced alignment because alignment models are specific to the language being transcribed.
  • clip_timestamps is only supported with --no_vad because VAD chunking changes the timing base before ASR runs.
  • Pyannote VAD and diarization depend on a compatible PyTorch, torchaudio, pyannote installation, and Hugging Face model access when the selected model is gated.
  • The vendored ASR backend lives under mlx_whisperx.backend.mlx_whisper so decoder behavior can be changed without modifying external reference repositories.

Development Checks

Compile the package (the ** glob requires recursive globbing, e.g. zsh or Bash with globstar enabled):

python -m py_compile mlx_whisperx/**/*.py

Check CLI help:

python -m mlx_whisperx --help

Build a wheel:

python -m build

Download files


Source Distribution

mlx_whisperx-0.1.0.tar.gz (823.4 kB)


Built Distribution


mlx_whisperx-0.1.0-py3-none-any.whl (829.5 kB)


File details

Details for the file mlx_whisperx-0.1.0.tar.gz.

File metadata

  • Download URL: mlx_whisperx-0.1.0.tar.gz
  • Upload date:
  • Size: 823.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mlx_whisperx-0.1.0.tar.gz:

  • SHA256: 1e045ee5a9571f90ea107202059f14cafb9aa83ff11e67505e55fe248109b1ad
  • MD5: 3f17800e90df11b357e27f0398d3fc78
  • BLAKE2b-256: bcde0d8cbad61a611c837011d5b36539b1e77a441eb829b67cfc3eb9746f1c34


Provenance

The following attestation bundles were made for mlx_whisperx-0.1.0.tar.gz:

Publisher: publish.yml on seedds/mlx-whisperx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mlx_whisperx-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mlx_whisperx-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 829.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mlx_whisperx-0.1.0-py3-none-any.whl:

  • SHA256: 2701fa83f554a00438dfebc73d6eeafe339f99180bd2e538009a3b128a7e1c63
  • MD5: 1ada7795ac075cbeb957e73a528af406
  • BLAKE2b-256: 9b7c113f07cc2c4951798aa51cde869a96700006df2d41866dc7801632e2ccf9


Provenance

The following attestation bundles were made for mlx_whisperx-0.1.0-py3-none-any.whl:

Publisher: publish.yml on seedds/mlx-whisperx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
