Whispermlx ASR Service

A native Apple-Silicon ASR API service powered by whispermlx (MLX) with FastAPI.

Runs natively on macOS with Apple Silicon (M1/M2/M3/M4). MLX Whisper inference runs on the Metal GPU automatically. No CUDA, no Docker, no Ray Serve.

What This Does

  • Transcribes audio files using OpenAI Whisper models via the MLX backend
  • Identifies speakers ("Who spoke when") using Pyannote.audio
  • Returns word-level timestamps via wav2vec2 alignment
  • Supports 90+ languages
  • Outputs JSON, SRT, VTT, TSV, and plain text formats
  • OpenAI-compatible API (/v1/audio/transcriptions, /v1/audio/translations, /v1/models)
  • Runs natively on Apple Silicon with uv and Python 3.13

Limitations

  • Not production-grade: Basic error handling, no authentication
  • Apple Silicon only: Requires an M-series Mac. No NVIDIA/CUDA support.
  • File size limits: Large audio files (>1GB) can cause out-of-memory errors
  • Memory usage: RAM consumption increases with file size and diarization. Peak ~2.3 GB for a small-model full pipeline on a 16 GB M1.
  • Alpha software: Expect bugs and breaking changes

How It Works

Audio --> MLX Whisper (transcription, Metal GPU) --> Wav2Vec2 (alignment) --> Pyannote (speaker ID) --> Output

The service runs as a single-process uvicorn server with an async queue. Requests are serialized through a semaphore so only one pipeline runs on the Metal GPU at a time. This is suitable for single-device, low-traffic, or development use.
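The serialization described above can be illustrated with a minimal asyncio sketch. This is not the service's code: `run_pipeline`, `handle_requests`, and the `stats` dictionary are hypothetical names, and the sleep stands in for the real transcription work.

```python
import asyncio

GPU_CONCURRENCY = 1  # mirrors the documented default; hypothetical constant name


async def run_pipeline(job_id, gpu_slots, stats):
    """Stand-in for transcription -> alignment -> diarization."""
    async with gpu_slots:  # wait here until a GPU slot is free
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for the real pipeline work
        stats["active"] -= 1
    return job_id


async def handle_requests(n_requests):
    """Submit n_requests concurrently; return results and peak concurrency."""
    gpu_slots = asyncio.Semaphore(GPU_CONCURRENCY)
    stats = {"active": 0, "peak": 0}
    done = await asyncio.gather(
        *(run_pipeline(i, gpu_slots, stats) for i in range(n_requests))
    )
    return done, stats["peak"]
```

Even though all requests are submitted at once, the semaphore lets only one pipeline body run at a time, so the recorded peak concurrency stays at 1.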

Device semantics: MLX Whisper ASR always runs on the Metal GPU automatically. The DEVICE environment variable (default mps) only controls where the VAD, wav2vec2 alignment, and pyannote diarization (torch-based stages) run. COMPUTE_TYPE and BATCH_SIZE are accepted for API compatibility but have no effect on the MLX backend.

Prerequisites

Hardware Requirements

  • Apple Silicon Mac (M1, M2, M3, or M4)
  • RAM: 16 GB recommended (8 GB may work with tiny/base models)
  • Storage: 50 GB SSD for model caching

Memory requirements vary by model size:

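| Whisper Model         | RAM (full pipeline*) | Notes                             |
|-----------------------|----------------------|-----------------------------------|
| tiny, base            | ~2 GB                | Fast, low quality                 |
| small                 | ~2.3 GB              | Good balance of speed and quality |
| medium                | ~5 GB                | Good quality, slower              |
| large-v3-turbo, turbo | ~5 GB                | Fast, high quality                |
| large-v3              | ~10+ GB              | Best quality, slowest             |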

*Full pipeline = Whisper model + alignment model + pyannote speaker diarization. Measured on M1 16 GB.
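As a rough illustration of how the table can guide model choice, the sketch below picks the highest-quality model whose approximate footprint fits a RAM budget. The `MODELS` list and `pick_model` are hypothetical helpers, not part of the service. The figures are the M1 16 GB measurements above, with "~10+ GB" treated as a 10 GB lower bound.

```python
# (model name, approximate full-pipeline RAM in GB), best quality first.
# Values are copied from the table above and are approximations only.
MODELS = [
    ("large-v3", 10.0),
    ("large-v3-turbo", 5.0),
    ("medium", 5.0),
    ("small", 2.3),
    ("base", 2.0),
    ("tiny", 2.0),
]


def pick_model(ram_budget_gb):
    """Return the best-quality model that fits the budget, or None."""
    for name, needed_gb in MODELS:
        if needed_gb <= ram_budget_gb:
            return name
    return None
```

The budget here means memory you are willing to give the service, not total installed RAM. Leave headroom for macOS and other applications.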

Software Requirements

  • macOS with Apple Silicon
  • uv (Python package manager)
  • Python 3.13 (installed via uv; system Python 3.14 is incompatible with whispermlx)
  • FFmpeg (for audio decoding; brew install ffmpeg)
  • Hugging Face Account (for speaker diarization models)

Quick Start

1. Install uv and Set Up Python 3.13

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/KalebJS/whispermlx-asr-service.git
cd whispermlx-asr-service

# Create a Python 3.13 virtual environment and install dependencies
uv venv --python 3.13
uv sync

2. Get Hugging Face Token (for Speaker Diarization)

Speaker diarization requires a Hugging Face token and model access:

a) Create a Hugging Face account.

b) Accept the model user agreement for pyannote/speaker-diarization-community-1.

c) Generate an access token.

Without the token and accepted agreement, diarization is gracefully skipped and transcription still works, but no speaker labels will be assigned.

3. Configure Environment

# Copy example environment file
cp .env.example .env

# Edit .env and add your Hugging Face token
nano .env

Minimal .env:

HF_TOKEN=hf_your_token_here
DEVICE=mps
PRELOAD_MODEL=large-v3
PORT=9001

4. Run the Service

# Export your .env vars first (entrypoint.sh does NOT auto-load .env)
set -a; source .env; set +a

# Start the service (binds 0.0.0.0:9001)
./entrypoint.sh

# Or start directly with uvicorn (binds localhost only)
uv run uvicorn app.main:app --host 127.0.0.1 --port 9001

# Or load .env and start in one step
uv run uvicorn app.main:app --host 127.0.0.1 --port 9001 --env-file .env

The service will be available at http://localhost:9001.

Note: entrypoint.sh hardcodes port 9001 and binds to 0.0.0.0 (all interfaces). The PORT env var is only respected when launching uvicorn directly with --port $PORT. Since the service has no authentication, prefer --host 127.0.0.1 unless you need remote access.

Port 9001 is the default. Port 9000 may be in use by other services (e.g., php-fpm on some macOS setups). The reserved port range for this service is 9001-9010.

5. Test the Service

# Health check
curl http://localhost:9001/health

# Test transcription
curl -X POST http://localhost:9001/asr \
  -F "audio_file=@your_audio.mp3" \
  -F "language=en"

A smoke test script is included:

./test-api.sh localhost 9001 path/to/audio.wav

API Documentation

Once running, visit http://localhost:9001/docs for interactive API documentation.

Main Endpoint: POST /asr

Parameters:

| Parameter                 | Type    | Default     | Description                                            |
|---------------------------|---------|-------------|--------------------------------------------------------|
| audio_file                | File    | Required    | Audio file to transcribe                               |
| task                      | String  | transcribe  | Task type: transcribe or translate                     |
| language                  | String  | Auto-detect | Language code (e.g., en, es, fr)                       |
| model                     | String  | large-v3    | Whisper model (see Model Selection)                    |
| initial_prompt            | String  | None        | Context or spelling guide to steer the model           |
| hotwords                  | String  | None        | Accepted but ignored by the MLX backend (see Hotwords) |
| output_format             | String  | json        | Output format: json, text, srt, vtt, tsv               |
| output                    | String  | None        | Legacy alias for output_format                         |
| word_timestamps           | Boolean | true        | Return word-level timestamps                           |
| diarize                   | Boolean | true        | Enable speaker diarization                             |
| enable_diarization        | Boolean | None        | Alias for diarize                                      |
| num_speakers              | Integer | Auto        | Exact number of speakers (overrides min/max)           |
| min_speakers              | Integer | Auto        | Minimum number of speakers                             |
| max_speakers              | Integer | Auto        | Maximum number of speakers                             |
| return_speaker_embeddings | Boolean | false       | Return 256-dimensional speaker embedding vectors       |

Example Request (JSON output):

curl -X POST http://localhost:9001/asr \
  -F "audio_file=@meeting.mp3" \
  -F "language=en" \
  -F "model=large-v3" \
  -F "output_format=json" \
  -F "diarize=true" \
  -F "min_speakers=2" \
  -F "max_speakers=5"

Example Request (SRT subtitles):

curl -X POST http://localhost:9001/asr \
  -F "audio_file=@video.mp4" \
  -F "language=en" \
  -F "output_format=srt" \
  -F "diarize=false"
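The curl requests above can also be issued from Python. The following is a minimal standard-library sketch, not an official client. `encode_multipart` and `transcribe` are hypothetical helper names, and the form field names are taken from the parameter table.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        )
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://localhost:9001", **fields):
    """POST an audio file to /asr and return the parsed JSON response.

    Only suitable for output_format=json; other formats return plain text.
    """
    audio = Path(path)
    body, content_type = encode_multipart(
        fields, "audio_file", audio.name, audio.read_bytes()
    )
    request = urllib.request.Request(
        f"{base_url}/asr",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example (requires a running service and a local file):
# result = transcribe("meeting.mp3", language="en", diarize="true")
```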

Example Response (JSON):

The text field is a JSON array mirroring the segments array (legacy drop-in shape from the original whisper-asr-webservice):

{
  "text": [
    {
      "start": 0.5,
      "end": 2.3,
      "text": " Hello, welcome to the meeting.",
      "speaker": "SPEAKER_00",
      "words": [
        {"word": "Hello", "start": 0.5, "end": 0.8, "score": 0.95},
        {"word": "welcome", "start": 0.9, "end": 1.2, "score": 0.93}
      ]
    }
  ],
  "language": "en",
  "segments": [...],
  "word_segments": [...]
}
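A response of this shape can be folded into speaker turns on the client side. The sketch below is a hypothetical helper, not part of the service. It reads the segments array and assumes each segment carries the start, end, text, and (when diarization ran) speaker keys shown above.

```python
def speaker_turns(response):
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for segment in response.get("segments", []):
        # "speaker" is absent when diarization was skipped or disabled.
        speaker = segment.get("speaker", "UNKNOWN")
        text = segment["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = segment["end"]
        else:
            turns.append(
                {
                    "speaker": speaker,
                    "start": segment["start"],
                    "end": segment["end"],
                    "text": text,
                }
            )
    return turns
```

Each returned turn keeps the start of its first segment and the end of its last, which is usually enough to print a readable "who said what" transcript.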

Hotwords (No-Op)

The hotwords parameter is accepted for API compatibility but is a no-op on the MLX backend. The whispermlx library has no hotwords mechanism. When you supply hotwords, the service logs a warning ("hotwords is ignored by the MLX backend") and proceeds with normal transcription. The parameter never causes an error.

To bias transcription toward specific spellings, use initial_prompt instead, which provides context that primes the model to expect certain terms:

# Use initial_prompt to guide spelling
curl -X POST "http://localhost:9001/asr?language=en&initial_prompt=Speakr+is+a+transcription+app." \
  -F "audio_file=@meeting.mp3"
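When building this URL in code, the prompt needs to be URL-encoded. A small sketch using the standard library follows. `asr_url` is a hypothetical helper, and the query-string style simply mirrors the curl example above.

```python
from urllib.parse import urlencode


def asr_url(base_url, **params):
    """Build an /asr URL with URL-encoded query parameters."""
    return f"{base_url}/asr?{urlencode(params)}"


url = asr_url(
    "http://localhost:9001",
    language="en",
    initial_prompt="Speakr is a transcription app.",
)
```

`urlencode` replaces spaces with `+`, which matches the hand-written query string in the curl command.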

Speaker Diarization

Speaker diarization assigns SPEAKER_NN labels to segments and words. It is enabled by default when HF_TOKEN is set.

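Requirements: a valid HF_TOKEN and an accepted user agreement for the pyannote diarization model (pyannote/speaker-diarization-community-1).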

When HF_TOKEN is missing or diarization fails: The service gracefully skips diarization and returns the transcription without speaker labels (HTTP 200, no crash). This ensures transcription is never blocked by a missing token.

Exact Speaker Count:

curl -X POST http://localhost:9001/asr \
  -F "audio_file=@interview.mp3" \
  -F "num_speakers=2" \
  -F "diarize=true"

num_speakers overrides min_speakers and max_speakers.

Speaker Embeddings:

curl -X POST http://localhost:9001/asr \
  -F "audio_file=@meeting.mp3" \
  -F "return_speaker_embeddings=true" \
  -F "diarize=true"

Returns a speaker_embeddings object keyed by speaker label, with one numeric vector per detected speaker. Embeddings are only included in json output format.
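One possible use of these vectors is matching a detected speaker against previously stored voices. The sketch below uses cosine similarity. It is an illustration only: the service itself does not offer speaker matching, and `cosine_similarity`, `best_match`, and the `known` dictionary are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(embedding, known):
    """Return (name, score) for the closest entry in `known`.

    `known` maps a person's name to a stored embedding vector.
    """
    scored = ((name, cosine_similarity(embedding, vec)) for name, vec in known.items())
    return max(scored, key=lambda pair: pair[1])
```

In practice you would call `best_match(response["speaker_embeddings"]["SPEAKER_00"], known)` and accept the match only above a similarity threshold tuned on your own recordings.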

OpenAI-Compatible Endpoints

The service provides drop-in OpenAI API compatibility:

POST /v1/audio/transcriptions

curl -X POST http://localhost:9001/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"

Supported response_format values:

  • json (default): returns {"text": "..."}
  • text, srt, vtt
  • verbose_json: full object with segments, with optional word timestamps via timestamp_granularities[]

The model field accepts OpenAI-style aliases (whisper-1, whisper-tiny, whisper-large-v3) and raw MLX model names (tiny, base, large-v3-turbo, etc.).

POST /v1/audio/translations

curl -X POST http://localhost:9001/v1/audio/translations \
  -F "file=@spanish_audio.mp3" \
  -F "model=whisper-1"

Translates non-English audio into English text. Same response formats as transcriptions. verbose_json reports task: "translate".

GET /v1/models

curl http://localhost:9001/v1/models

Returns an OpenAI-style list of available models, built from the MLX model map. Includes the whisper-1 alias and all canonical MLX model names.

GET /v1/models/{model_id}

curl http://localhost:9001/v1/models/large-v3

Returns the matching model object, or a 404 OpenAI error for unknown ids.

Health and Metrics

# Health check
curl http://localhost:9001/health
# {"status": "healthy", "device": "mps", "loaded_models": ["large-v3"], "serve_mode": "simple"}

# Root endpoint
curl http://localhost:9001/

# Prometheus metrics
curl http://localhost:9001/metrics

# Queue metrics (JSON)
curl http://localhost:9001/queue-metrics

Prometheus Metrics:

| Metric                                      | Type      | Notes                                                   |
|---------------------------------------------|-----------|---------------------------------------------------------|
| whisperx_requests_total{endpoint,status}    | Counter   | status is ok, http_<code>, or error                     |
| whisperx_request_duration_seconds{endpoint} | Histogram | End-to-end handler time                                 |
| whisperx_active_transcriptions              | Gauge     | In-flight /asr requests                                 |
| whisperx_loaded_models                      | Gauge     | Whisper models currently in cache                       |
| whisperx_model_evictions_total{model}       | Counter   | Models unloaded by the idle-eviction sweep              |
| whisperx_audio_duration_seconds             | Histogram | Submitted audio duration                                |
| whisperx_audio_size_megabytes               | Histogram | Submitted file size                                     |
| whisperx_vram_allocated_bytes               | Gauge     | MLX active memory (or 0)                                |
| whisperx_service_info                       | Info      | Static labels: version, device, compute_type, serve_mode |

The whisperx_vram_allocated_bytes gauge reports MLX active memory via mlx.core.get_active_memory() when available, or 0 otherwise. No torch.cuda is used.
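For quick scripting without a Prometheus server, the unlabelled gauges can be read straight from the /metrics text. The parser below is a minimal sketch that handles only simple unlabelled samples. `parse_unlabelled` is a hypothetical helper, and `SAMPLE` is made-up illustrative output, not captured from the service.

```python
SAMPLE = """\
# HELP whisperx_active_transcriptions In-flight /asr requests
# TYPE whisperx_active_transcriptions gauge
whisperx_active_transcriptions 1.0
whisperx_loaded_models 1.0
whisperx_requests_total{endpoint="/asr",status="ok"} 12.0
"""


def parse_unlabelled(metrics_text):
    """Return {metric_name: value} for samples that carry no labels."""
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled samples.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        values[parts[0]] = float(parts[1])
    return values


gauges = parse_unlabelled(SAMPLE)
```

For anything beyond ad-hoc checks, a real Prometheus scrape or a full exposition-format parser is the better choice.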


Model Selection

Available MLX Whisper models (speed vs accuracy tradeoff):

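| Model                 | Parameters | Speed     | Quality   |
|-----------------------|------------|-----------|-----------|
| tiny, tiny.en         | 39M        | Fastest   | Lowest    |
| base, base.en         | 74M        | Very Fast | Low       |
| small, small.en       | 244M       | Fast      | Medium    |
| medium, medium.en     | 769M       | Moderate  | Good      |
| large, large-v1       | 1550M      | Slow      | Excellent |
| large-v2              | 1550M      | Slow      | Excellent |
| large-v3              | 1550M      | Slow      | Best      |
| large-v3-turbo, turbo | 809M       | Fast      | High      |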

OpenAI-style aliases (whisper-1, whisper-tiny, whisper-large-v3, etc.) are also accepted and resolve to the corresponding MLX model.

Recommendation:

  • Use large-v3 for best quality
  • Use small or base for speed and lower memory usage
  • Use large-v3-turbo for a good balance of speed and quality

Models are downloaded on first use and cached in CACHE_DIR (default ~/.cache/whisperx-asr).


Configuration

Environment Variables

Edit .env to customize:

# Device for torch-based stages (VAD, alignment, diarization).
# MLX Whisper ASR always runs on the Metal GPU regardless of this setting.
DEVICE=mps              # mps (default, recommended) or cpu (fallback, slower diarization)

# Compute type and batch size (accepted but INERT under MLX — no effect on inference)
# Code defaults: COMPUTE_TYPE=int8, BATCH_SIZE=2 (leftover from CUDA era, unused)
#COMPUTE_TYPE=int8
#BATCH_SIZE=2

# Hugging Face token for diarization (REQUIRED for speaker labels)
HF_TOKEN=hf_xxx...

# Model preloading (optional, reduces first-request latency)
# Also sets the default model for /asr requests when no model= param is given
PRELOAD_MODEL=large-v3   # Leave empty to disable

# Override which model the OpenAI "whisper-1" alias resolves to
# (defaults to the PRELOAD_MODEL value, or large-v3 if unset)
#OPENAI_WHISPER1_MODEL=large-v3

# Service port (default 9001)
PORT=9001

# Model cache directories
CACHE_DIR=~/.cache/whisperx-asr
HF_HOME=~/.cache/whisperx-asr

# Maximum file size in MB (prevents out-of-memory errors)
MAX_FILE_SIZE_MB=1000

# GPU concurrency (Metal GPU semaphore; default 1)
#GPU_CONCURRENCY=1

# Maximum queued requests before rejecting with 503 (default 32)
#MAX_QUEUE_SIZE=32

# Idle model eviction (default disabled). When > 0, Whisper models that have
# not served a request in this many seconds are unloaded from memory.
#MODEL_KEEP_ALIVE_SECONDS=0
#MODEL_EVICTION_INTERVAL_SECONDS=60

# Offline mode (optional): set to 1 to prevent network requests after models are cached
#HF_HUB_OFFLINE=1
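The whisper-1 resolution order described in the comments above can be sketched as a simple fallback chain. This is an illustration of the documented precedence, not the service's code. `resolve_whisper1` is a hypothetical name, and treating an empty value as unset is an assumption.

```python
def resolve_whisper1(env):
    """Resolve the OpenAI "whisper-1" alias to an MLX model name.

    Precedence: OPENAI_WHISPER1_MODEL, then PRELOAD_MODEL, then large-v3.
    Empty strings are treated as unset (assumption).
    """
    return (
        env.get("OPENAI_WHISPER1_MODEL")
        or env.get("PRELOAD_MODEL")
        or "large-v3"
    )
```

Passing `os.environ` as `env` evaluates the chain against the current process environment.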

Idle Model Eviction

Set MODEL_KEEP_ALIVE_SECONDS to unload Whisper models that have been idle longer than the configured window. The next request that needs the model reloads it transparently:

MODEL_KEEP_ALIVE_SECONDS=3600          # unload models idle for 1 hour
MODEL_EVICTION_INTERVAL_SECONDS=60     # sweep cadence (floor 30 seconds)

Default is 0 (disabled; models stay loaded).
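The eviction rule can be pictured as a periodic sweep over last-used timestamps. The sketch below is a simplified model of that idea, not the service's implementation. The function names are hypothetical, and reading "floor 30 seconds" as a minimum clamp on the interval is an assumption.

```python
import time

MIN_SWEEP_INTERVAL_SECONDS = 30  # assumed meaning of the documented floor


def effective_interval(configured_seconds):
    """Clamp the sweep interval to the minimum."""
    return max(configured_seconds, MIN_SWEEP_INTERVAL_SECONDS)


def models_to_evict(last_used, keep_alive_seconds, now=None):
    """Return names of models idle for longer than keep_alive_seconds.

    `last_used` maps model name to the timestamp of its last request.
    A keep-alive of 0 (the default) disables eviction entirely.
    """
    if keep_alive_seconds <= 0:
        return []
    now = time.monotonic() if now is None else now
    return sorted(
        name for name, stamp in last_used.items() if now - stamp > keep_alive_seconds
    )
```

With a one-hour keep-alive, a model last used 4000 seconds ago is selected for eviction while one used 1000 seconds ago stays loaded.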


Integration with Speakr

To use this service with Speakr:

Update Speakr's .env file:

USE_ASR_ENDPOINT=true
ASR_BASE_URL=http://localhost:9001

If the service is on a different machine, replace localhost with the IP address and ensure the port is accessible through your firewall.

If Speakr runs in Docker: localhost inside a container refers to the container itself, not the host. Use host.docker.internal instead:

ASR_BASE_URL=http://host.docker.internal:9001

Model compatibility: distil-* models (e.g., distil-large-v2) are no longer available on the MLX backend. If Speakr was configured to use a distil-* model, switch to a standard model name such as large-v3, small, or large-v3-turbo. The hotwords parameter is accepted but ignored (the service only logs a warning); use initial_prompt for spelling bias instead.


Running the Service

# Using entrypoint.sh (export .env yourself first; the script binds 0.0.0.0:9001)
set -a; source .env; set +a
./entrypoint.sh

# Or directly with uvicorn (localhost only, auto-loads .env)
uv run uvicorn app.main:app --host 127.0.0.1 --port 9001 --env-file .env

# With environment variables inline (no .env needed)
DEVICE=mps PRELOAD_MODEL=base uv run uvicorn app.main:app --host 127.0.0.1 --port 9001

Offline Use

The service can run completely offline after an initial setup with internet access:

  1. Start the service with internet access
  2. Run at least one transcription request with diarization enabled to cache all models:
    curl -X POST http://localhost:9001/asr \
      -F "audio_file=@test.mp3" \
      -F "diarize=true"
    
  3. Set HF_HUB_OFFLINE=1 in your .env file
  4. Restart the service

The service will now operate without any network requests to Hugging Face.


Monitoring and Logs

View Logs

When running in the foreground, logs appear in the terminal. When running in the background:

# Export .env first, then start in the background
set -a; source .env; set +a
./entrypoint.sh &> service.log &
tail -f service.log

Health Check

curl http://localhost:9001/health
# {"status": "healthy", "device": "mps", "loaded_models": ["large-v3"], "serve_mode": "simple"}
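When starting the service from a script, it can help to wait for /health before sending work, since a preloaded model may take a while to load. The following is a minimal polling sketch with hypothetical helper names. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:9001"):
    """GET /health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
        return json.loads(response.read())


def wait_until_healthy(fetch=fetch_health, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll until the status is "healthy"; return False if attempts run out."""
    for _ in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused or timed out (OSError), or a non-JSON body
            # (ValueError): the server is probably still starting.
            pass
        sleep(delay)
    return False
```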

Supported Audio Formats

The service supports formats decodable by FFmpeg:

  • Audio: MP3, WAV, M4A, FLAC, AAC, OGG, WMA
  • Video: MP4, AVI, MOV, MKV, WebM (audio track extracted)
  • Other: AMR, 3GP, 3GPP

Troubleshooting

Speaker Diarization Not Working

Symptom: No speaker labels in output

Solutions:

  1. Verify HF_TOKEN is set correctly in .env
  2. Accept the model agreement at pyannote/speaker-diarization-community-1
  3. Check logs for diarization errors
  4. Ensure diarize=true is set in the request (diarization defaults to true when HF_TOKEN is set)

Without HF_TOKEN, diarization is silently skipped and transcription proceeds without speaker labels.

Out of Memory Errors

Solutions:

  1. Reduce MAX_FILE_SIZE_MB in .env
  2. Use a smaller model (small or base instead of large-v3)
  3. Disable diarization for very large files: diarize=false
  4. Split large audio files into smaller chunks before uploading
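Splitting (solution 4) can be scripted with FFmpeg. The sketch below only plans the cuts: it returns each chunk's start offset and an ffmpeg command for it, assuming you already know the total duration in seconds. `chunk_plan`, the ten-minute default, and the output naming are arbitrary illustrative choices.

```python
def chunk_plan(src, total_seconds, chunk_seconds=600, prefix="chunk"):
    """Return a list of (offset_seconds, ffmpeg_argv) covering the file."""
    plan = []
    start = 0
    index = 0
    while start < total_seconds:
        length = min(chunk_seconds, total_seconds - start)
        command = [
            "ffmpeg", "-y",
            "-ss", str(start),   # seek to the chunk start
            "-t", str(length),   # read only this many seconds
            "-i", src,
            "-vn",               # drop any video stream
            "-ac", "1",          # mono
            "-ar", "16000",      # 16 kHz sample rate
            f"{prefix}_{index:03d}.wav",
        ]
        plan.append((start, command))
        start += length
        index += 1
    return plan
```

Each command can be run with `subprocess.run(command, check=True)`. Timestamps returned for a chunk are relative to that chunk, so add its offset to every start and end value when merging results. Speaker labels from separately processed chunks are not guaranteed to refer to the same person.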

Slow Processing

Solutions:

  1. Use a smaller model for faster processing
  2. Disable diarization if not needed: diarize=false
  3. Use large-v3-turbo for a good speed/quality balance

API Returns 500 Errors

Check logs for error details. Common causes:

  • Invalid audio format (use FFmpeg to convert)
  • Model download failure (check internet access)
  • Incorrect parameters (check API docs at /docs)

Stress Testing

A stress test script is included to measure throughput and latency under concurrent load:

# Default: 4 concurrent workers, all files in testfiles/
uv run python tests/stress_test.py

# 8 concurrent workers, 3 rounds
uv run python tests/stress_test.py --workers 8 --rounds 3

# Test OpenAI-compat endpoint
uv run python tests/stress_test.py --endpoint openai

# Without diarization
uv run python tests/stress_test.py --no-diarize

Place audio files in the tests/testfiles/ directory (gitignored).


Security Notes

This service has NO built-in authentication or security features.

If exposing to a network:

  • Use firewall rules to restrict access
  • Consider putting the service behind a reverse proxy
  • Store HF_TOKEN securely in the .env file (never hardcode it)

License

This project is MIT licensed. See LICENSE for details.

WhisperX is licensed under BSD-4-Clause. See WhisperX repository for details.

Credits

Changelog

See CHANGELOG.md for version history.
