Realtime_mlx_STT

Real-time speech-to-text transcription optimized for Apple Silicon

High-performance speech-to-text transcription library optimized exclusively for Apple Silicon. It leverages the MLX framework for real-time, on-device transcription with low latency.

โš ๏ธ IMPORTANT: This library is designed for LOCAL USE ONLY on macOS with Apple Silicon. The included server is a development tool and should NOT be exposed to the internet or used in production environments without implementing proper security measures.

Features

  • Real-time transcription with low latency using MLX Whisper
  • Multiple APIs - Python API, REST API, and WebSocket for different use cases
  • Apple Silicon optimization using MLX with GPU (Metal) acceleration
  • Voice activity detection with WebRTC and Silero (configurable thresholds)
  • Wake word detection using Porcupine ("Jarvis", "Alexa", etc.)
  • OpenAI integration for cloud-based transcription alternative
  • Interactive CLI for easy exploration of features
  • Web UI with modern interface and real-time updates
  • Profile system for quick configuration switching
  • Event-driven architecture with command pattern
  • Thread-safe and production-ready

Language Selection

The Whisper large-v3-turbo model supports 99 languages with intelligent language detection:

  • Language-specific mode: When you select a specific language (e.g., Norwegian, French, Spanish), the model uses language-specific tokens that significantly improve transcription accuracy for that language
  • Multi-language capability: Even with a language selected, Whisper can still transcribe other languages if spoken - it's not restricted to only the selected language
  • Accuracy benefit: Selecting the primary language you'll be speaking provides much more accurate transcription compared to auto-detect mode
  • Auto-detect mode: When no language is specified, the model attempts to detect the language automatically, though with potentially lower accuracy

For example, if you select Norwegian (no) as your language:

  • Norwegian speech will be transcribed with high accuracy
  • English speech will still be transcribed correctly if spoken
  • The model uses the Norwegian language token (50288) to optimize for Norwegian

This behavior matches OpenAI's Whisper API - the language parameter guides but doesn't restrict the model.
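
A minimal sketch of the two modes, using the STTClient API shown later under Quick Start (language codes are ISO 639-1; passing None enables auto-detect):

from realtime_mlx_stt import STTClient

# Language-specific mode: Norwegian tokens guide the decoder,
# but other languages are still transcribed if spoken.
client = STTClient(default_language="no")
for result in client.transcribe(duration=10):
    print(result.text)

# Auto-detect mode: the model guesses the language itself,
# typically with somewhat lower accuracy.
client = STTClient(default_language=None)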

Requirements

  • macOS with Apple Silicon (M1/M2/M3) - Required, not optional
  • Python 3.9+ (3.11+ recommended for best performance)
  • MLX for Apple Silicon optimization
  • PyAudio for audio capture
  • WebRTC VAD and Silero VAD for voice activity detection
  • Porcupine for wake word detection (optional)
  • Torch and NumPy for audio processing

Important Note: This library is specifically optimized for Apple Silicon and will not work on Intel-based Macs or other platforms. It relies on the GPU and unified memory architecture of Apple Silicon chips to achieve optimal performance.

Installation

Install from PyPI (Recommended)

# Basic installation
pip install realtime-mlx-stt

# With OpenAI support for cloud transcription
pip install "realtime-mlx-stt[openai]"

# With development tools
pip install "realtime-mlx-stt[dev]"

# With server support for REST/WebSocket APIs
pip install "realtime-mlx-stt[server]"

# Install everything
pip install "realtime-mlx-stt[openai,server,dev]"


Install from Source

# Clone the repository
git clone https://github.com/kristofferv98/Realtime_mlx_STT.git
cd Realtime_mlx_STT

# Set up Python environment (requires Python 3.9+ but 3.11+ recommended)
python -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e .

Quick Start

Interactive CLI (Recommended)

The easiest way to explore all features:

python examples/cli.py

This provides a menu-driven interface for:

  • Quick 10-second transcription
  • Continuous streaming mode
  • OpenAI cloud transcription
  • Wake word detection
  • Audio device selection
  • Language configuration

Python API

from realtime_mlx_stt import STTClient

# Simple transcription
client = STTClient()
for result in client.transcribe(duration=10):
    print(result.text)

# With OpenAI
client = STTClient(openai_api_key="sk-...")
for result in client.transcribe(engine="openai"):
    print(result.text)

# Wake word mode
client.start_wake_word("jarvis")

Server Mode

Security Note: The server is for local development only and binds to localhost by default. Do NOT expose it to the internet without proper authentication and security measures.

# Start server (localhost only - safe)
cd example_server
python server_example.py

# Opens web UI at http://localhost:8000
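
Once the server is up, you can sanity-check it from Python. A small sketch using the third-party requests package (not a dependency of this library) against the status endpoint documented below:

import requests

# Query the local development server (binds to localhost by default).
resp = requests.get("http://localhost:8000/api/v1/system/status")
resp.raise_for_status()
print(resp.json())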

Architecture

The library provides two specialized interfaces built on a common Features layer:

┌─────────────────────────────────────────────────┐
│          User Interfaces                        │
│  • CLI (examples/cli.py)                        │
│  • Web UI (example_server/)                     │
├─────────────────────────────────────────────────┤
│          API Layers                             │
│  • Python API (realtime_mlx_stt/)               │
│  • REST/WebSocket (src/Application/Server/)     │
├─────────────────────────────────────────────────┤
│          Features Layer                         │
│  • AudioCapture                                 │
│  • VoiceActivityDetection                       │
│  • Transcription (MLX/OpenAI)                   │
│  • WakeWordDetection                            │
├─────────────────────────────────────────────────┤
│          Core & Infrastructure                  │
│  • Command/Event System                         │
│  • Logging & Configuration                      │
└─────────────────────────────────────────────────┘

Key Design Principles

  • Vertical Slice Architecture: Each feature is self-contained with Commands, Events, Handlers, and Models
  • Dual API Design: Python API optimized for direct use, Server API optimized for multi-client scenarios
  • Event-Driven: Features communicate via commands and events, not direct dependencies
  • Production Ready: Thread-safe, lazy initialization, comprehensive error handling
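
To illustrate the event-driven principle only, here is a toy publish/subscribe bus; the names below are hypothetical and are not the library's actual classes:

from collections import defaultdict
from typing import Callable

class ToyEventBus:
    """Minimal illustration: features publish events, others subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

# A transcription feature publishes; a UI subscribes -- no direct dependency.
bus = ToyEventBus()
bus.subscribe("transcription", lambda text: print(f"UI received: {text}"))
bus.publish("transcription", "hello world")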

API Documentation

Python API (realtime_mlx_stt)

from realtime_mlx_stt import STTClient, TranscriptionSession, create_transcriber

# Method 1: Modern Client API
client = STTClient(
    openai_api_key="sk-...",     # Optional
    default_engine="mlx_whisper", # or "openai"
    default_language="en"         # or None for auto-detect
)

# Transcribe for fixed duration
for result in client.transcribe(duration=10):
    print(f"{result.text} (confidence: {result.confidence})")

# Streaming with stop word
with client.stream() as stream:
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# Method 2: Session-based API
import time
from realtime_mlx_stt import TranscriptionSession, ModelConfig, VADConfig

session = TranscriptionSession(
    model=ModelConfig(engine="mlx_whisper", language="no"),
    vad=VADConfig(sensitivity=0.8),
    on_transcription=lambda r: print(r.text)
)

with session:
    time.sleep(30)  # Listen for 30 seconds

# Method 3: Simple Transcriber
from realtime_mlx_stt import Transcriber
transcriber = Transcriber(language="es")
text = transcriber.transcribe_from_mic(duration=5)
print(f"You said: {text}")

REST API

# Start system with profile
curl -X POST http://localhost:8000/api/v1/system/start \
  -H "Content-Type: application/json" \
  -d '{
    "profile": "vad-triggered",
    "custom_config": {
      "transcription": {"language": "fr"},
      "vad": {"sensitivity": 0.7}
    }
  }'

# Get system status
curl http://localhost:8000/api/v1/system/status

# Transcribe audio file
curl -X POST http://localhost:8000/api/v1/transcription/audio \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "base64_encoded_audio_data"}'

WebSocket Events

const ws = new WebSocket('ws://localhost:8000/events');

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    
    switch(data.type) {
        case 'transcription':
            if (data.is_final) {
                console.log(`Final: ${data.text}`);
            } else {
                console.log(`Transcribing: ${data.text}`);
            }
            break;
        case 'wake_word':
            console.log(`Wake word: ${data.wake_word}`);
            break;
    }
};
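
An equivalent listener in Python, sketched with the third-party websockets package (not bundled with this library):

import asyncio
import json
import websockets

async def listen():
    async with websockets.connect("ws://localhost:8000/events") as ws:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "transcription" and data.get("is_final"):
                print(f"Final: {data['text']}")

asyncio.run(listen())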

Configuration

Environment Variables

# API Keys
export OPENAI_API_KEY="sk-..."        # For OpenAI transcription
export PORCUPINE_ACCESS_KEY="..."     # For wake word detection
# Alternative names for Picovoice universal key (same as PORCUPINE_ACCESS_KEY):
# export PICOVOICE_ACCESS_KEY="..."
# export PICOVOICE_API_KEY="..."

# Logging
export LOG_LEVEL="INFO"               # DEBUG, INFO, WARNING, ERROR
export LOG_FORMAT="human"             # human, json, detailed
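
Keys set this way can also be read explicitly and handed to the client, using the openai_api_key parameter shown in the Python API section:

import os
from realtime_mlx_stt import STTClient

# Fall back to the environment when no key is passed explicitly.
client = STTClient(openai_api_key=os.environ.get("OPENAI_API_KEY"))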

Python Configuration

from realtime_mlx_stt import ModelConfig, VADConfig, WakeWordConfig

# Model configuration
model = ModelConfig(
    engine="mlx_whisper",        # or "openai"
    model="whisper-large-v3-turbo",
    language="en"                # or None for auto-detect
)

# VAD configuration
vad = VADConfig(
    enabled=True,
    sensitivity=0.6,             # 0.0-1.0
    min_speech_duration=0.25,    # seconds
    min_silence_duration=0.1     # seconds
)

# Wake word configuration
# Note: Requires PORCUPINE_ACCESS_KEY environment variable
wake_word = WakeWordConfig(
    words=["jarvis", "computer"],
    sensitivity=0.7,
    timeout=30                   # seconds
)
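
These config objects plug into the session API from the Python API section above; a sketch reusing model and vad (this README does not show how the wake-word config is consumed, so it is left out here):

import time
from realtime_mlx_stt import TranscriptionSession

session = TranscriptionSession(
    model=model,    # ModelConfig from above
    vad=vad,        # VADConfig from above
    on_transcription=lambda r: print(r.text),
)
with session:
    time.sleep(30)  # listen for 30 seconds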

Testing

The project includes comprehensive tests for each feature and component:

# Run all tests
python tests/run_tests.py

# Run tests for a specific feature or component
python tests/run_tests.py -f VoiceActivityDetection
python tests/run_tests.py -f Infrastructure
python tests/run_tests.py -f Application  # Server/Client tests

# Run a specific test with verbose output
python tests/run_tests.py -t webrtc_vad_test -v
python tests/run_tests.py -t test_server_module -v

# Test with PYTHONPATH (if imports fail)
PYTHONPATH=/path/to/Realtime_mlx_STT python tests/run_tests.py

The Server implementation includes tests for:

  • API Controllers (Transcription and System)
  • WebSocket connections and event broadcasting
  • Configuration and profile management
  • Command/Event integration

Performance

On Apple Silicon (M1/M2/M3), the MLX-optimized Whisper-large-v3-turbo model typically achieves:

  • Batch mode: processing takes roughly 0.3-0.5x the audio duration (60 seconds of audio in 20-30 seconds), i.e. about 2-3x faster than real time
  • Streaming mode: processing takes roughly 0.5-0.7x the audio duration, with about 2-3 seconds of end-to-end latency

The MLX implementation takes full advantage of the GPU and unified memory architecture of Apple Silicon chips, providing significantly better performance than CPU-based implementations.
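
As a rough check on your own machine, you can time the simple Transcriber from the API section. Note that transcribe_from_mic captures live audio, so the measured time includes the 10-second recording itself; only the remainder is processing overhead:

import time
from realtime_mlx_stt import Transcriber

transcriber = Transcriber(language="en")
start = time.time()
text = transcriber.transcribe_from_mic(duration=10)  # records 10s live
elapsed = time.time() - start
# Roughly elapsed - 10s is transcription overhead on top of capture.
print(f"Total {elapsed:.1f}s for 10s of audio: {text}")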

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Recent Updates

  • New Python API: Added high-level realtime_mlx_stt package with STTClient, TranscriptionSession, and Transcriber
  • Interactive CLI: New user-friendly CLI at examples/cli.py for exploring all features
  • Dual API Architecture: Python API optimized for direct use, Server API for multi-client scenarios
  • Improved Examples: Consolidated examples with clear documentation
  • Architecture Documentation: Added comprehensive architecture documentation
  • OpenAI Integration: Support for OpenAI's transcription API as alternative to local MLX

License

This project is licensed under the MIT License - see the LICENSE file for details.
