On-device Vietnamese speech recognition — Python bindings for phostt
phostt
On-device Vietnamese speech recognition
Local STT server powered by Zipformer-vi RNN-T — no cloud, no API keys, full privacy
phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; the binary is one command, the model is ~75 MB, everything runs locally.
```bash
cargo install phostt && phostt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API: http://127.0.0.1:9876/v1/transcribe
```
Or use Python:
```bash
pip install phostt
```

```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```
Or build from source:
```bash
git clone https://github.com/ekhodzitsky/phostt
cd phostt
cargo run --release -- serve
```
Table of Contents
- Why phostt?
- Features
- Platform Support
- Quick Start
- Benchmarks
- Quality / WER
- Architecture
- Mobile / FFI
- Roadmap
- Known Limitations
- Troubleshooting
- Contributing
- Security
- Acknowledgements
- License
Why phostt?
| | phostt | PhoWhisper-large | Cloud APIs |
|---|---|---|---|
| Architecture | Zipformer + RNN-T | Whisper enc-dec | varies |
| Model size (INT8) | ~75 MB | ~1.5 GB | server-side |
| WER (GigaSpeech2-vi) | ~7.7% | n/a | varies |
| Latency (3.7 s audio) | ~61 ms | ~300 ms | network + queue |
| Throughput | 61× RTF | ~3× RTF | varies |
| Privacy | 100% local | 100% local | data leaves device |
| Cost | free forever | free | $0.006/min+ |
| Setup | `cargo install` | Python + deps | API key + billing |
| Streaming | real-time WebSocket | batch only | varies |
The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).
Features
- Real-time streaming — partial transcription via WebSocket as you speak
- REST API + SSE — file transcription with instant or streaming response
- Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- Auto-download — model fetched from sherpa-onnx GitHub releases on first run
- Speaker diarization — optional `diarization` feature for multi-speaker sessions
- Docker ready — CPU and CUDA images with multi-stage builds
- Android FFI — C-ABI + Kotlin bridge for mobile integration
- Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting
Platform Support
| Platform | Target | Backend | Notes |
|---|---|---|---|
| macOS (Apple Silicon) | `aarch64-apple-darwin` | CoreML / CPU | Neural Engine + CPU fallback |
| macOS (Intel) | `x86_64-apple-darwin` | CPU | |
| Linux (x86_64) | `x86_64-unknown-linux-gnu` | CUDA 12+ / CPU | CUDA via `--features cuda` |
| Linux (ARM64) | `aarch64-unknown-linux-gnu` | CPU | Buildable, not CI-tested yet |
| Android | `aarch64-linux-android`, `armv7-linux-androideabi` | NNAPI / CPU | Via cargo-ndk + `ffi` feature |
| Windows | `x86_64-pc-windows-msvc` | CPU | Community-maintained |
iOS is theoretically supported via CoreML (`--features coreml,ffi`), but not yet verified in CI.
Quick Start
Install
```bash
cargo install phostt
```
The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into `~/.phostt/models/`.
Python
```bash
pip install phostt
```

```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```
`Engine` is thread-safe — multiple Python threads can call `transcribe_file` or `transcribe_bytes` concurrently (limited by the ONNX session pool size).
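To picture what "limited by the ONNX session pool size" means, here is a self-contained sketch that simulates the bound with a semaphore. The pool size and the stand-in `transcribe_file` below are illustrative, not phostt internals:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2  # hypothetical pool size; the real value is a phostt config detail
pool_gate = threading.Semaphore(POOL_SIZE)
state_lock = threading.Lock()
active = 0
peak = 0

def transcribe_file(path: str) -> str:
    """Stand-in for Engine.transcribe_file: inference is simulated with a sleep."""
    global active, peak
    with pool_gate:  # a real pool would check out an ONNX session here
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # pretend to run the encoder/decoder/joiner
        with state_lock:
            active -= 1
    return f"transcript of {path}"

# Sixteen jobs across eight threads, but at most POOL_SIZE run at once.
with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(transcribe_file, [f"clip_{i}.wav" for i in range(16)]))
```

Callers beyond the pool size simply queue on the semaphore rather than erroring, which is why a web server can fan many requests into one engine.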
Serve
```bash
phostt serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe
```
Smoke test
```bash
phostt transcribe ~/.phostt/models/test_wavs/0.wav
```
Expected output (from the bundled Vietnamese test fixture):
```
RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA
```
Usage Examples
REST API (single file):
```bash
curl -X POST http://localhost:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
```
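For a client in plain Python, the standard-library `wave` module can produce a valid request body. The endpoint and header mirror the curl call above; the one-second silence payload is just a placeholder:

```python
import io
import wave

def make_wav(pcm16: bytes, rate: int = 16000) -> bytes:
    """Wrap raw mono PCM16 samples in a minimal WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm16)
    return buf.getvalue()

body = make_wav(b"\x00\x00" * 16000)  # one second of silence
# POST it to a running server (`phostt serve` first):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:9876/v1/transcribe", data=body,
#       headers={"Content-Type": "audio/wav"}, method="POST")
#   print(urllib.request.urlopen(req).read().decode())
```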
REST API (streaming SSE):
```bash
curl -X POST http://localhost:9876/v1/stream \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
```
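The exact SSE event schema isn't documented here, so this sketch parses only the standard `data:` lines that any SSE stream carries (a blank line terminates an event):

```python
def parse_sse(raw: str) -> list[str]:
    """Collect the data payload of each SSE event from a raw stream."""
    events, data = [], []
    for line in raw.splitlines():
        if line.startswith("data:"):
            data.append(line[5:].lstrip())
        elif line == "" and data:          # blank line ends the current event
            events.append("\n".join(data))
            data = []
    if data:                               # flush a trailing unterminated event
        events.append("\n".join(data))
    return events

sample = "data: xin\n\ndata: xin chào\n\n"
print(parse_sse(sample))  # → ['xin', 'xin chào']
```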
WebSocket (real-time):
```bash
# Connect and stream PCM16 chunks as you speak
websocat ws://localhost:9876/v1/ws
```
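Before streaming over the WebSocket, a client has to slice its PCM16 audio into frames. The 100 ms frame length below is an illustrative choice, not a protocol requirement:

```python
RATE = 16000      # samples per second, mono PCM16
FRAME_MS = 100    # hypothetical frame length; pick what your client prefers
BYTES_PER_FRAME = RATE * FRAME_MS // 1000 * 2   # 2 bytes per 16-bit sample

def frames(pcm16: bytes):
    """Split a PCM16 byte stream into fixed-size frames for the WebSocket."""
    for off in range(0, len(pcm16), BYTES_PER_FRAME):
        yield pcm16[off:off + BYTES_PER_FRAME]

one_second = b"\x00\x00" * RATE
chunks = list(frames(one_second))
print(len(chunks), len(chunks[0]))  # → 10 3200
```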
Python:
```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```
See examples/python_binding.py for a runnable demo.
With hardware acceleration:
```bash
# macOS Apple Silicon — CoreML Neural Engine
phostt serve --features coreml

# Linux + NVIDIA — CUDA 12
phostt serve --features cuda
```
Docker
```bash
# CPU (any platform)
docker build -t phostt .
docker run -p 9876:9876 phostt

# CUDA (Linux + NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t phostt-cuda .
docker run --gpus all -p 9876:9876 phostt-cuda
```
Or pull from GitHub Container Registry:
```bash
docker pull ghcr.io/ekhodzitsky/phostt:latest
docker run -p 9876:9876 ghcr.io/ekhodzitsky/phostt:latest
```
Or use Docker Compose:
```bash
docker compose up
```
Benchmarks
Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:
| Backend | Mean Latency | Median | P95 | RTF | Peak RSS |
|---|---|---|---|---|---|
| CPU | 60 ms | 60 ms | 61 ms | 62× | 1.4 GB |
| CoreML (Neural Engine) | 93 ms | 90 ms | 124 ms | 40× | 1.2 GB |
RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
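The table's RTF figures follow directly from the latency column:

```python
def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: audio seconds processed per wall-clock second."""
    return audio_seconds / wall_seconds

# Numbers from the benchmark table above (3.74 s clip):
print(round(rtf(3.74, 0.060)))  # CPU, 60 ms mean latency → 62
print(round(rtf(3.74, 0.093)))  # CoreML, 93 ms mean latency → 40
```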
Auto-updated benchmark history: BENCHMARKS.md.
Quality / WER
GigaSpeech2-vi (clean): ~7.7% — published upstream benchmark on clean Vietnamese speech.
For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see BENCHMARKS.md.
Architecture
```
┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│     SessionPool     │
│  (WS/REST)  │     │   router    │     │   (async-channel)   │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                   │
                          ┌────────────────────────┘
                          ▼
                 ┌─────────────────┐
                 │ SessionTriplet  │──▶ Zipformer Encoder (ONNX)
                 │ (enc/dec/join)  │──▶ RNN-T Decoder (greedy)
                 └─────────────────┘──▶ Joiner
                          │
                          ▼
                 ┌─────────────────┐
                 │ StreamingState  │──▶ overlap-buffer / VAD
                 │ (per-connection)│──▶ partial + final segments
                 └─────────────────┘
```
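The overlap-buffer idea from the diagram, in miniature: each decoding window re-includes the tail of the previous one, so words falling on a chunk boundary aren't cut in half. The window and overlap sizes below are illustrative, not phostt's actual parameters:

```python
def windows_with_overlap(samples: list, window: int = 8, overlap: int = 2):
    """Yield fixed windows where each re-includes the last `overlap` samples."""
    step = window - overlap
    for start in range(0, max(len(samples) - overlap, 1), step):
        yield samples[start:start + window]

# 20 samples, 8-sample windows, 2-sample overlap → 3 windows covering everything
chunks = list(windows_with_overlap(list(range(20))))
```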
Mobile / FFI
phostt exposes a C-ABI for Android integration:
```c
PhosttEngine* engine = phostt_engine_new("/path/to/models");
PhosttStream* stream = phostt_stream_new(engine);
char* json = phostt_stream_process_chunk(engine, stream, pcm16, len, 16000);
// ... free with phostt_string_free(json) ...
```
See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.
Roadmap
- v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
- v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
- v0.4.1 — Dependency updates (rubato 2.0, sha2 0.11), docs polish, CI improvements
- iOS build verification (CoreML + `ffi` feature) — theoretically supported, not yet CI-tested
- Quantized embedding extractor for faster diarization
- Offline batch re-clustering pass for improved speaker accuracy
Known Limitations
- Out-of-domain audio (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a mono-lingual model trained on ~70,000 hours of Vietnamese speech.
- Memory footprint (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
- iOS is theoretically supported via CoreML + `ffi`, but has not been verified in CI.
- Windows builds are community-maintained and not CI-tested.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Model not found on first run | Auto-download failed or proxy blocks GitHub | Set `PHOSTT_MODEL_DIR` to a local path with extracted weights |
| High latency (>200 ms) on CPU | Debug build or missing release profile | Always run `cargo run --release` or `cargo install` |
| CoreML slower than CPU | Neural Engine overhead on short audio | CPU is actually faster for this 30M-param INT8 model; CoreML wins on larger models |
| SIGKILL during model load | OOM on low-RAM system | Close other apps, use the CPU backend, or run on a machine with ≥4 GB RAM |
| WebSocket closes immediately | Rate limit hit or origin mismatch | Check logs; disable rate limiting with `--rate-limit 0` for local testing |
| Diarization missing speakers | `diarization` feature not enabled | Rebuild with `--features diarization` |
See TODO.md for the full tracker.
Contributing
See CONTRIBUTING.md. Quick start for developers:
```bash
cargo build --release --features coreml   # or cuda
cargo test                                # 146 fast unit tests, no model needed
cargo clippy --all-targets -- -D warnings
cargo deny check
```
Security
Please report security vulnerabilities privately — see SECURITY.md for contact details and supported versions.
Acknowledgements
phostt is a Vietnamese fork of gigastt,
which provides the production-grade server scaffolding (HTTP/WS/SSE,
rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the
Zipformer-Transducer weights packaged by the
sherpa-onnx project.
License
MIT — see LICENSE.