Realtime_mlx_STT
High-performance speech-to-text transcription library optimized exclusively for Apple Silicon. Leverages MLX framework for real-time on-device transcription with low latency.
⚠️ IMPORTANT: This library is designed for LOCAL USE ONLY on macOS with Apple Silicon. The included server is a development tool and should NOT be exposed to the internet or used in production environments without implementing proper security measures.
Features
- Real-time transcription with low latency using MLX Whisper
- Multiple APIs - Python API, REST API, and WebSocket for different use cases
- Apple Silicon optimization using MLX with Neural Engine acceleration
- Voice activity detection with WebRTC and Silero (configurable thresholds)
- Wake word detection using Porcupine ("Jarvis", "Alexa", etc.)
- OpenAI integration for cloud-based transcription alternative
- Interactive CLI for easy exploration of features
- Web UI with modern interface and real-time updates
- Profile system for quick configuration switching
- Event-driven architecture with command pattern
- Thread-safe and production-ready
Language Selection
The Whisper large-v3-turbo model supports 99 languages with intelligent language detection:
- Language-specific mode: When you select a specific language (e.g., Norwegian, French, Spanish), the model uses language-specific tokens that significantly improve transcription accuracy for that language
- Multi-language capability: Even with a language selected, Whisper can still transcribe other languages if spoken - it's not restricted to only the selected language
- Accuracy benefit: Selecting the primary language you'll be speaking provides much more accurate transcription compared to auto-detect mode
- Auto-detect mode: When no language is specified, the model attempts to detect the language automatically, though with potentially lower accuracy
For example, if you select Norwegian (no) as your language:
- Norwegian speech will be transcribed with high accuracy
- English speech will still be transcribed correctly if spoken
- The model uses the Norwegian language token (50288) to optimize for Norwegian
This behavior matches OpenAI's Whisper API - the language parameter guides but doesn't restrict the model.
Requirements
- macOS with Apple Silicon (M1/M2/M3) - Required, not optional
- Python 3.9+ (3.11+ recommended for best performance)
- MLX for Apple Silicon optimization
- PyAudio for audio capture
- WebRTC VAD and Silero VAD for voice activity detection
- Porcupine for wake word detection (optional)
- Torch and NumPy for audio processing
Important Note: This library is specifically optimized for Apple Silicon and will not work on Intel-based Macs or other platforms. It requires the Neural Engine found in Apple Silicon chips to achieve optimal performance.
Installation
Install from PyPI (Recommended)
```bash
# Basic installation
pip install realtime-mlx-stt

# With OpenAI support for cloud transcription
pip install "realtime-mlx-stt[openai]"

# With development tools
pip install "realtime-mlx-stt[dev]"

# With server support for REST/WebSocket APIs
pip install "realtime-mlx-stt[server]"

# Install everything
pip install "realtime-mlx-stt[openai,server,dev]"
```
Documentation
- Usage Guide - Common patterns and troubleshooting
- API Reference - Detailed API documentation
- Examples - Working code examples
Install from Source
```bash
# Clone the repository
git clone https://github.com/kristofferv98/Realtime_mlx_STT.git
cd Realtime_mlx_STT

# Set up Python environment (Python 3.9+ required, 3.11+ recommended)
python -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e .
```
Quick Start
Interactive CLI (Recommended)
The easiest way to explore all features:
```bash
python examples/cli.py
```
This provides a menu-driven interface for:
- Quick 10-second transcription
- Continuous streaming mode
- OpenAI cloud transcription
- Wake word detection
- Audio device selection
- Language configuration
Python API
```python
from realtime_mlx_stt import STTClient

# Simple transcription
client = STTClient()
for result in client.transcribe(duration=10):
    print(result.text)

# With OpenAI
client = STTClient(openai_api_key="sk-...")
for result in client.transcribe(engine="openai"):
    print(result.text)

# Wake word mode
client.start_wake_word("jarvis")
```
Server Mode
Security Note: The server is for local development only and binds to localhost by default. Do NOT expose it to the internet without proper authentication and security measures.
```bash
# Start server (localhost only - safe)
cd example_server
python server_example.py

# Opens web UI at http://localhost:8000
```
Architecture
The library provides two specialized interfaces built on a common Features layer:
```
┌─────────────────────────────────────────────────┐
│ User Interfaces                                 │
│  • CLI (examples/cli.py)                        │
│  • Web UI (example_server/)                     │
├─────────────────────────────────────────────────┤
│ API Layers                                      │
│  • Python API (realtime_mlx_stt/)               │
│  • REST/WebSocket (src/Application/Server/)     │
├─────────────────────────────────────────────────┤
│ Features Layer                                  │
│  • AudioCapture                                 │
│  • VoiceActivityDetection                       │
│  • Transcription (MLX/OpenAI)                   │
│  • WakeWordDetection                            │
├─────────────────────────────────────────────────┤
│ Core & Infrastructure                           │
│  • Command/Event System                         │
│  • Logging & Configuration                      │
└─────────────────────────────────────────────────┘
```
Key Design Principles
- Vertical Slice Architecture: Each feature is self-contained with Commands, Events, Handlers, and Models
- Dual API Design: Python API optimized for direct use, Server API optimized for multi-client scenarios
- Event-Driven: Features communicate via commands and events, not direct dependencies
- Production Ready: Thread-safe, lazy initialization, comprehensive error handling
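The event-driven principle can be illustrated with a minimal sketch. The class and method names below are hypothetical, chosen for illustration only; the library's actual command/event system lives under its Core layer and may look different:

```python
from collections import defaultdict
from typing import Callable

# Minimal illustration of the publish/subscribe side of an event-driven
# design: features emit events, other components react, and neither
# holds a direct reference to the other. Names here are ours, not the
# library's API.
class EventBus:
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

# A transcription feature publishes; a UI subscribes, with no direct
# dependency between the two.
bus = EventBus()
bus.subscribe("transcription", lambda e: print(e["text"]))
bus.publish("transcription", {"text": "hello", "is_final": True})
```

This decoupling is what lets the same Features layer sit under both the Python API and the server without modification.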
API Documentation
Python API (realtime_mlx_stt)
```python
import time

from realtime_mlx_stt import STTClient

# Method 1: Modern Client API
client = STTClient(
    openai_api_key="sk-...",       # Optional
    default_engine="mlx_whisper",  # or "openai"
    default_language="en"          # or None for auto-detect
)

# Transcribe for fixed duration
for result in client.transcribe(duration=10):
    print(f"{result.text} (confidence: {result.confidence})")

# Streaming with stop word
with client.stream() as stream:
    for result in stream:
        print(result.text)
        if "stop" in result.text.lower():
            break

# Method 2: Session-based API
from realtime_mlx_stt import TranscriptionSession, ModelConfig, VADConfig

session = TranscriptionSession(
    model=ModelConfig(engine="mlx_whisper", language="no"),
    vad=VADConfig(sensitivity=0.8),
    on_transcription=lambda r: print(r.text)
)

with session:
    time.sleep(30)  # Listen for 30 seconds

# Method 3: Simple Transcriber
from realtime_mlx_stt import Transcriber

transcriber = Transcriber(language="es")
text = transcriber.transcribe_from_mic(duration=5)
print(f"You said: {text}")
```
REST API
# Start system with profile
```bash
# Start system with profile
curl -X POST http://localhost:8000/api/v1/system/start \
  -H "Content-Type: application/json" \
  -d '{
    "profile": "vad-triggered",
    "custom_config": {
      "transcription": {"language": "fr"},
      "vad": {"sensitivity": 0.7}
    }
  }'

# Get system status
curl http://localhost:8000/api/v1/system/status

# Transcribe audio file
curl -X POST http://localhost:8000/api/v1/transcription/audio \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "base64_encoded_audio_data"}'
```
WebSocket Events
```javascript
const ws = new WebSocket('ws://localhost:8000/events');

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    switch (data.type) {
        case 'transcription':
            if (data.is_final) {
                console.log(`Final: ${data.text}`);
            } else {
                console.log(`Transcribing: ${data.text}`);
            }
            break;
        case 'wake_word':
            console.log(`Wake word: ${data.wake_word}`);
            break;
    }
};
```
Configuration
Environment Variables
```bash
# API Keys
export OPENAI_API_KEY="sk-..."        # For OpenAI transcription
export PORCUPINE_ACCESS_KEY="..."     # For wake word detection

# Alternative names for the Picovoice universal key (same as PORCUPINE_ACCESS_KEY):
# export PICOVOICE_ACCESS_KEY="..."
# export PICOVOICE_API_KEY="..."

# Logging
export LOG_LEVEL="INFO"      # DEBUG, INFO, WARNING, ERROR
export LOG_FORMAT="human"    # human, json, detailed
```
Python Configuration
```python
from realtime_mlx_stt import ModelConfig, VADConfig, WakeWordConfig

# Model configuration
model = ModelConfig(
    engine="mlx_whisper",            # or "openai"
    model="whisper-large-v3-turbo",
    language="en"                    # or None for auto-detect
)

# VAD configuration
vad = VADConfig(
    enabled=True,
    sensitivity=0.6,           # 0.0-1.0
    min_speech_duration=0.25,  # seconds
    min_silence_duration=0.1   # seconds
)

# Wake word configuration
# Note: Requires PORCUPINE_ACCESS_KEY environment variable
wake_word = WakeWordConfig(
    words=["jarvis", "computer"],
    sensitivity=0.7,
    timeout=30  # seconds
)
```
Testing
The project includes comprehensive tests for each feature and component:
```bash
# Run all tests
python tests/run_tests.py

# Run tests for a specific feature or component
python tests/run_tests.py -f VoiceActivityDetection
python tests/run_tests.py -f Infrastructure
python tests/run_tests.py -f Application  # Server/Client tests

# Run a specific test with verbose output
python tests/run_tests.py -t webrtc_vad_test -v
python tests/run_tests.py -t test_server_module -v

# Test with PYTHONPATH (if imports fail)
PYTHONPATH=/path/to/Realtime_mlx_STT python tests/run_tests.py
```
The Server implementation includes tests for:
- API Controllers (Transcription and System)
- WebSocket connections and event broadcasting
- Configuration and profile management
- Command/Event integration
Performance
On Apple Silicon (M1/M2/M3), the MLX-optimized Whisper-large-v3-turbo model typically achieves:
- Batch mode: ~0.3-0.5x realtime (processes 60 seconds of audio in 20-30 seconds)
- Streaming mode: ~0.5-0.7x realtime (processes audio with ~2-3 second latency)
The MLX implementation takes full advantage of the Neural Engine in Apple Silicon chips, providing significantly better performance than CPU-based implementations.
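The "x realtime" figures above are real-time factors: processing time divided by audio duration, where anything below 1.0 is faster than real time. The arithmetic can be checked directly (the helper function is ours, for illustration):

```python
# Real-time factor (RTF) = processing time / audio duration.
# RTF < 1.0 means the engine processes audio faster than it is spoken.
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# Batch-mode figure above: 60 s of audio processed in 20-30 s.
print(real_time_factor(20, 60))  # ~0.33 -> ~0.3x realtime
print(real_time_factor(30, 60))  # 0.5  -> 0.5x realtime
```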
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Recent Updates
- New Python API: Added high-level `realtime_mlx_stt` package with STTClient, TranscriptionSession, and Transcriber
- Interactive CLI: New user-friendly CLI at `examples/cli.py` for exploring all features
- Dual API Architecture: Python API optimized for direct use, Server API for multi-client scenarios
- Improved Examples: Consolidated examples with clear documentation
- Architecture Documentation: Added comprehensive architecture documentation
- OpenAI Integration: Support for OpenAI's transcription API as an alternative to local MLX
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- OpenAI Whisper for the base Whisper large-v3-turbo model
- MLX for Apple Silicon optimization
- RealtimeSTT for the original audio processing concepts
- Picovoice Porcupine for wake word detection
- Hugging Face for model distribution infrastructure