Live speech-to-text streaming on Apple Silicon
TextStream
Local real-time speech-to-text for Apple Silicon. One pip install. No API keys. No cloud. No cost.
Run pip install textstream-asr, then textstream.
TextStream turns your Mac's microphone into a live transcription server. It runs Qwen3-ASR (~2% word error rate) on-device through MLX, filters noise with Silero VAD, and streams text over SSE at localhost:7890/stream. Any app, script, or frontend can subscribe and get words as they're spoken.
Build voice-controlled tools. Add live captions to your app. Record meeting notes that write themselves. Pipe speech into your IDE. Whatever needs ears — point it at the stream.
Why this exists
Cloud speech APIs charge per minute and add network latency. Whisper runs offline but is not built for live streaming. TextStream gives you a live, local transcription endpoint that any process on your machine can read from, for free, at roughly 2% word error rate (LibriSpeech clean, default model).
Benchmarks
Numbers from published evaluations. Your actual RTF will depend on model size and what else is running.
Accuracy (Word Error Rate)
| Model | LibriSpeech clean | LibriSpeech other | Params |
|---|---|---|---|
| Qwen3-ASR 0.6B (default) | 2.11% | 4.55% | 600M |
| Qwen3-ASR 1.7B | 1.63% | 3.38% | 1.7B |
| Whisper-large-v3 | 1.51% | 3.97% | 1.5B |
| GPT-4o-Transcribe | 1.39% | 3.75% | — |
Source: Qwen3-ASR Technical Report
Speed (Apple Silicon via MLX)
| Metric | Value |
|---|---|
| Real-time factor (RTF) | ~0.06 (16x faster than real-time) |
| MLX vs PyTorch | ~4x faster on Apple Silicon |
| VAD latency | <1ms per 32ms audio chunk |
| Time to first token | ~92ms |
Source: mlx-qwen3-asr benchmarks, Silero VAD performance metrics
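To make the real-time factor concrete: RTF is processing time divided by audio duration, so at the quoted ~0.06 a default 2.5-second interval of audio would take roughly 0.15 seconds to transcribe. A small sketch of that arithmetic follows; the function names are illustrative and not part of TextStream.

```python
def processing_time(audio_seconds: float, rtf: float) -> float:
    """Seconds of compute needed to transcribe `audio_seconds` of audio."""
    return audio_seconds * rtf


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# Figures taken from the table above: RTF ~0.06, default interval 2.5 s.
print(f"{processing_time(2.5, 0.06):.2f} s of compute per 2.5 s of audio")
print(f"{speedup(0.06):.1f}x real time")
```

The remaining time in each interval is headroom, which is why the measured RTF on a busy machine can drift upward without the stream falling behind.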
Resource usage
- RAM: ~1.2GB for 0.6B model, ~3GB for 1.7B
- CPU/GPU: Runs on the GPU through MLX's Metal backend (MLX does not target the Neural Engine). CPU overhead is minimal because the transcription loop sleeps between intervals
- Disk: Models are cached by Hugging Face Hub (~1.2GB for the 0.6B model, ~3.4GB for the 1.7B model, downloaded on first use)
- Battery: Comparable to background music playback. MLX is designed for Apple Silicon power efficiency
Requirements
| Platform | Supported |
|---|---|
| macOS on Apple Silicon (M1/M2/M3/M4) | Yes |
| macOS on Intel | No — MLX requires Apple Silicon |
| Linux / Windows | Not yet — MLX is macOS-only. PyTorch backend planned |
| Python | 3.10+ |
Install
    pip install textstream-asr
Quick start
    textstream                        # start transcribing, opens browser UI
    textstream --no-browser           # headless: just the SSE server
    textstream --engine qwen-1.7b     # larger model, lower word error rate
    textstream --vad-threshold 0.5    # stricter voice detection (default 0.4)
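If you would rather start the server from a Python script than from a shell, one option is to assemble the same command line and hand it to `subprocess`. The sketch below only uses flags documented in this README; the `build_command` helper itself is hypothetical and not part of the package.

```python
def build_command(
    engine: str = "qwen",
    port: int = 7890,
    vad_threshold: float = 0.4,
    open_browser: bool = False,
) -> list[str]:
    """Assemble a `textstream` command line from the documented flags."""
    cmd = [
        "textstream",
        "--engine", engine,
        "--port", str(port),
        "--vad-threshold", str(vad_threshold),
    ]
    if not open_browser:
        cmd.append("--no-browser")  # headless: only the SSE server
    return cmd
```

Passing the result to `subprocess.Popen(build_command(engine="qwen-1.7b"))` would launch the server as a child process, assuming `textstream` is on your PATH.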
Connect from your app
    import json
    import urllib.request

    # Subscribe to the live transcript stream
    req = urllib.request.Request("http://localhost:7890/stream")
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            line = line.decode().strip()
            # SSE payload lines start with "data: "; skip everything else
            if line.startswith("data: "):
                event = json.loads(line[6:])
                if event["type"] == "stream":
                    print(event["finalized"], event["draft"])

The same stream from JavaScript (browser or Node):

    // Browser / Node SSE
    const src = new EventSource("http://localhost:7890/stream");
    src.onmessage = (e) => {
      const { finalized, draft } = JSON.parse(e.data);
      console.log(finalized, draft);
    };
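For anything longer than a quick script, it can help to separate SSE parsing from the network call so the parsing can be tested offline. The generator below is a minimal sketch: `iter_events` is a hypothetical helper, and it assumes each event's JSON fits on a single `data:` line, as the Python example in this section does.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON objects from the `data:` lines of an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore keep-alives or malformed payloads
        if isinstance(event, dict):
            yield event
```

Because an open `urllib` response is itself an iterable of byte lines, `for event in iter_events(resp): ...` should work as a drop-in for the inner loop of the Python client.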
How it works
Every --interval seconds (default 2.5), TextStream drains the mic buffer and runs Silero VAD on the chunk. If speech is detected, the chunk is fed to Qwen3-ASR's streaming decoder. The model returns stable (finalized) text and speculative (draft) text. Stable text gets persisted to disk and broadcast to all SSE subscribers.
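The VAD gate in that loop can be sketched roughly as follows. The 16 kHz sample rate, the 512-sample frame (32 ms at that rate), the `speech_prob` callable standing in for the Silero model, and the "any frame at or above the threshold" rule are all assumptions for illustration; the README does not say how TextStream aggregates per-frame scores.

```python
from typing import Callable, Sequence

SAMPLE_RATE = 16_000   # assumed; not stated in the README
FRAME_SAMPLES = 512    # 32 ms at 16 kHz


def contains_speech(
    samples: Sequence[float],
    speech_prob: Callable[[Sequence[float]], float],
    threshold: float = 0.4,
) -> bool:
    """Return True if any full frame scores at or above `threshold`."""
    last_start = len(samples) - FRAME_SAMPLES
    for start in range(0, last_start + 1, FRAME_SAMPLES):
        frame = samples[start:start + FRAME_SAMPLES]
        if speech_prob(frame) >= threshold:
            return True  # one speech frame is enough to pass the chunk on
    return False
```

In this sketch a chunk that fails the gate is simply dropped, so the decoder never sees silence or background noise. Raising `threshold` makes the gate stricter, which matches the description of `--vad-threshold`.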
If the model hallucinates on noise that slips past VAD, a pattern filter catches it and resets the stream. This is a safety net: with VAD active, it almost never fires.
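The README does not list the patterns the filter looks for. One common symptom of ASR hallucination is a short phrase repeated many times in a row, so a plausible heuristic (an assumption, not TextStream's actual filter) is to flag that:

```python
import re


def looks_hallucinated(text: str, max_repeats: int = 4) -> bool:
    """Flag text where a 1-3 word phrase repeats back to back more than `max_repeats` times."""
    words = re.findall(r"\w+", text.lower())
    for n in (1, 2, 3):
        for start in range(len(words) - n + 1):
            phrase = words[start:start + n]
            repeats = 1
            pos = start + n
            # Count how many times the same phrase follows itself immediately.
            while words[pos:pos + n] == phrase:
                repeats += 1
                pos += n
            if repeats > max_repeats:
                return True
    return False
```

A caller would discard the flagged text and reset the decoder state. The `max_repeats` default here is arbitrary and would need tuning against real transcripts.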
API
    GET /stream                   → SSE stream: {"type":"stream","finalized":"...","draft":"..."}
    GET /engine                   → {"engine":"qwen"}
    GET /switch?engine=qwen-1.7b  → hot-swap model without restart
    GET /pause                    → pause mic capture
    GET /resume                   → resume
    GET /stop                     → shutdown
    GET /                         → built-in browser UI
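Since every control endpoint is a plain GET, a thin wrapper over `urllib` is enough to drive the server from a script. This is a sketch: `endpoint` and `call` are hypothetical helpers, and because only the `/engine` response format is documented here, `call` returns the raw body as text.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7890"  # default port from this README


def endpoint(path: str, **params: str) -> str:
    """Build a URL for one of the documented GET endpoints."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def call(path: str, **params: str) -> str:
    """Send the GET request and return the raw response body as text."""
    with urllib.request.urlopen(endpoint(path, **params), timeout=5) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `call("/pause")` and `call("/resume")` would toggle mic capture, and `call("/switch", engine="qwen-1.7b")` would request the larger model.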
Configuration
| Flag | Default | Description |
|---|---|---|
| --port | 7890 | HTTP server port |
| --engine | qwen | qwen (0.6B) or qwen-1.7b |
| --interval | 2.5 | Seconds between transcription updates |
| --vad-threshold | 0.4 | Silero VAD speech probability threshold |
| --no-browser | n/a | Don't open browser on start |
Transcripts are saved to ~/Documents/textstream/transcripts/YYYY-MM-DD/.
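To pick up saved transcripts from another script, you can rebuild that dated path with `pathlib`. The README does not state file names or formats inside the folder, so this sketch only lists whatever files are there; both helper names are hypothetical.

```python
from datetime import date
from pathlib import Path

TRANSCRIPT_ROOT = Path.home() / "Documents" / "textstream" / "transcripts"


def transcript_dir(day: date) -> Path:
    """Directory that holds transcripts for `day` (YYYY-MM-DD layout)."""
    return TRANSCRIPT_ROOT / day.isoformat()


def list_transcripts(day: date) -> list[Path]:
    """Return the files saved for `day`, oldest first; empty if none exist."""
    folder = transcript_dir(day)
    if not folder.is_dir():
        return []
    files = [p for p in folder.iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)
```

For example, `list_transcripts(date.today())` would return today's files, if any have been written yet.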
Dependencies
- MLX — Apple Silicon ML framework
- mlx-qwen3-asr — Qwen3-ASR for MLX
- silero-vad-lite — Voice activity detection (~2MB, bundles ONNX runtime)
- sounddevice — PortAudio bindings
- NumPy
Author
Boris Djordjevic — 199 Biotechnologies
License
MIT
File details
Details for the file textstream_asr-0.2.0.tar.gz.
File metadata
- Download URL: textstream_asr-0.2.0.tar.gz
- Upload date:
- Size: 129.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23f46d3a72882b8c6fff43f6d506a217166fd066aa960a928687252d2d8952a5 |
| MD5 | e98d077da202667ca511f97006158f61 |
| BLAKE2b-256 | fa8bfa425aa0a55d1abe703baf8e989d87a52c16c074d4dc2aa51bfe1faf4df7 |
File details
Details for the file textstream_asr-0.2.0-py3-none-any.whl.
File metadata
- Download URL: textstream_asr-0.2.0-py3-none-any.whl
- Upload date:
- Size: 12.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b22565a93c66f47654cb4a3fcadc8567c85a4e9c0893f6ff8eb1206b36b44637 |
| MD5 | 0140718105bf60fc81c5e3c4c0ad400f |
| BLAKE2b-256 | 3da933d0f3b2f9becd6c1634fb6580ba53fe408a01d4418d4391327a3a2573a3 |