On-device Vietnamese speech recognition — Python bindings for phostt

Project description

phostt

On-device Vietnamese speech recognition

Local STT server powered by Zipformer-vi RNN-T — no cloud, no API keys, full privacy



phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. The Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; installation is a single command, the model is ~75 MB, and everything runs locally.

cargo install phostt && phostt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe

Or use Python:

pip install phostt
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)

Or build from source:

git clone https://github.com/ekhodzitsky/phostt
cd phostt
cargo run --release -- serve

Why phostt?

|                        | phostt              | PhoWhisper-large | Cloud APIs         |
|------------------------|---------------------|------------------|--------------------|
| Architecture           | Zipformer + RNN-T   | Whisper enc-dec  | varies             |
| Model size (INT8)      | ~75 MB              | ~1.5 GB          | server-side        |
| WER (GigaSpeech2-vi)   | ~7.7%               | n/a              | varies             |
| Latency (3.7 s audio)  | ~61 ms              | ~300 ms          | network + queue    |
| Throughput             | 61× RTF             | ~3× RTF          | varies             |
| Privacy                | 100% local          | 100% local       | data leaves device |
| Cost                   | free forever        | free             | $0.006/min+        |
| Setup                  | cargo install       | Python + deps    | API key + billing  |
| Streaming              | real-time WebSocket | batch only       | varies             |

The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, with ICASSP-published WER results on the VLSP and GigaSpeech2 benchmarks).

Features

  • Real-time streaming — partial transcription via WebSocket as you speak
  • REST API + SSE — file transcription with instant or streaming response
  • Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
  • Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
  • Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
  • Auto-download — model fetched from sherpa-onnx GitHub releases on first run
  • Speaker diarization — optional diarization feature for multi-speaker sessions
  • Docker ready — CPU and CUDA images with multi-stage builds
  • Android FFI — C-ABI + Kotlin bridge for mobile integration
  • Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting

Platform Support

| Platform              | Target                                        | Backend        | Notes                        |
|-----------------------|-----------------------------------------------|----------------|------------------------------|
| macOS (Apple Silicon) | aarch64-apple-darwin                          | CoreML / CPU   | Neural Engine + CPU fallback |
| macOS (Intel)         | x86_64-apple-darwin                           | CPU            |                              |
| Linux (x86_64)        | x86_64-unknown-linux-gnu                      | CUDA 12+ / CPU | CUDA via --features cuda     |
| Linux (ARM64)         | aarch64-unknown-linux-gnu                     | CPU            | Buildable, not CI-tested yet |
| Android               | aarch64-linux-android, armv7-linux-androideabi | NNAPI / CPU   | Via cargo-ndk + ffi feature  |
| Windows               | x86_64-pc-windows-msvc                        | CPU            | Community-maintained         |

iOS is theoretically supported via CoreML (--features coreml,ffi), but not yet verified in CI.

Quick Start

Install

cargo install phostt

The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into ~/.phostt/models/.

Python

pip install phostt
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)

Engine is thread-safe — multiple Python threads can call transcribe_file or transcribe_bytes concurrently (limited by the ONNX session pool size).
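
As a sketch of that pattern (here `DummyEngine` is a stand-in for `phostt.Engine` with the same `transcribe_file(path) -> str` call shape, so the snippet runs without the model installed):

```python
from concurrent.futures import ThreadPoolExecutor

class DummyEngine:
    """Stand-in for phostt.Engine — same call shape, no model required."""
    def transcribe_file(self, path: str) -> str:
        return f"transcript of {path}"

engine = DummyEngine()  # with phostt installed: Engine("~/.phostt/models")
files = ["a.wav", "b.wav", "c.wav"]

# One shared engine, many worker threads — safe because Engine serializes
# access through its internal ONNX session pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(engine.transcribe_file, files))
```

Throughput scales with the session pool size, not with the number of Python threads submitting work.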

Serve

phostt serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe

Smoke test

phostt transcribe ~/.phostt/models/test_wavs/0.wav

Expected output (from the bundled Vietnamese test fixture):

RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA

Usage Examples

REST API (single file):

curl -X POST http://localhost:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
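
The same call from Python using only the standard library (a sketch — the request shape mirrors the curl example above; actually sending it requires a running `phostt serve`):

```python
import urllib.request

def build_transcribe_request(wav_bytes: bytes) -> urllib.request.Request:
    """POST raw WAV bytes to the local transcription endpoint."""
    return urllib.request.Request(
        "http://localhost:9876/v1/transcribe",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

# With the server running:
# with open("sample.wav", "rb") as f:
#     req = build_transcribe_request(f.read())
# print(urllib.request.urlopen(req).read().decode())
```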

REST API (streaming SSE):

curl -X POST http://localhost:9876/v1/stream \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
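
On the client side, SSE responses arrive as newline-delimited `data:` events. A minimal parser (sketch — assumes standard SSE framing; the exact payload fields phostt emits are not shown here):

```python
def parse_sse(raw: str):
    """Yield the data payload of each server-sent event in a text stream."""
    data_lines = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
        elif line == "" and data_lines:
            # a blank line terminates one event
            yield "\n".join(data_lines)
            data_lines = []
    if data_lines:  # flush a trailing event with no final blank line
        yield "\n".join(data_lines)
```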

WebSocket (real-time):

# Connect and stream PCM16 chunks as you speak
websocat ws://localhost:9876/v1/ws
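
Preparing audio for the socket: float samples must be packed as little-endian 16-bit integers and sliced into fixed-duration frames before sending. A sketch (the 100 ms chunk size and mono 16 kHz framing are assumptions here):

```python
import struct

def float_to_pcm16(samples) -> bytes:
    """Pack float samples in [-1.0, 1.0] as little-endian PCM16."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

def chunk_pcm16(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100):
    """Slice a PCM16 buffer into fixed-duration chunks for streaming."""
    step = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```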

With hardware acceleration:

# macOS Apple Silicon — CoreML / Neural Engine (compile-time Cargo feature)
cargo install phostt --features coreml
phostt serve

# Linux + NVIDIA — CUDA 12 (compile-time Cargo feature)
cargo install phostt --features cuda
phostt serve

Docker

# CPU (any platform)
docker build -t phostt .
docker run -p 9876:9876 phostt

# CUDA (Linux + NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t phostt-cuda .
docker run --gpus all -p 9876:9876 phostt-cuda

Or use Docker Compose:

docker compose up

Benchmarks

Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:

| Backend                | Mean Latency | Median | P95    | RTF | Peak RSS |
|------------------------|--------------|--------|--------|-----|----------|
| CPU                    | 60 ms        | 60 ms  | 61 ms  | 62× | 1.4 GB   |
| CoreML (Neural Engine) | 93 ms        | 90 ms  | 124 ms | 40× | 1.2 GB   |

RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
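
The RTF column follows directly from the latency column:

```python
audio_s = 3.74          # benchmark clip length in seconds
cpu_latency_s = 0.060
coreml_latency_s = 0.093

# RTF = audio seconds processed per wall-clock second
print(round(audio_s / cpu_latency_s))     # 62 — matches the CPU row
print(round(audio_s / coreml_latency_s))  # 40 — matches the CoreML row
```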

Auto-updated benchmark history: BENCHMARKS.md.

Quality / WER

GigaSpeech2-vi (clean): ~7.7% — published upstream benchmark on clean Vietnamese speech.

For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see BENCHMARKS.md.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│   SessionPool       │
│  (WS/REST)  │     │   router    │     │  (async-channel)    │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                │
                    ┌───────────────────────────┘
                    ▼
           ┌────────────────┐
           │ SessionTriplet │──▶ Zipformer Encoder (ONNX)
           │ (enc/dec/join) │──▶ RNN-T Decoder (greedy)
           └────────────────┘──▶ Joiner
                    │
                    ▼
           ┌────────────────┐
           │ StreamingState │──▶ overlap-buffer / VAD
           │ (per-connection)│    → partial + final segments
           └────────────────┘

Mobile / FFI

phostt exposes a C-ABI for Android integration:

// pcm16 / len: caller-supplied PCM16 sample buffer and its length; 16000 = sample rate
PhosttEngine* engine = phostt_engine_new("/path/to/models");
PhosttStream* stream = phostt_stream_new(engine);
char* json = phostt_stream_process_chunk(engine, stream, pcm16, len, 16000);
// ... free the returned string with phostt_string_free(json) ...

See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.

Roadmap

  • v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
  • v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
  • v0.4.1 — Dependency updates (rubato 2.0, sha2 0.11), docs polish, CI improvements
  • iOS build verification (CoreML + ffi feature) — theoretically supported, not yet CI-tested
  • Quantized embedding extractor for faster diarization
  • Offline batch re-clustering pass for improved speaker accuracy

Known Limitations

  • Out-of-domain audio (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a monolingual model trained on ~70,000 hours of Vietnamese speech.
  • Memory footprint (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
  • iOS is theoretically supported via CoreML + ffi, but has not been verified in CI.
  • Windows builds are community-maintained and not CI-tested.

Troubleshooting

| Symptom                        | Cause                                  | Fix |
|--------------------------------|----------------------------------------|-----|
| Model not found on first run   | Auto-download failed or proxy blocks GitHub | Set PHOSTT_MODEL_DIR to a local path with extracted weights |
| High latency (>200 ms) on CPU  | Debug build or missing release profile | Always run cargo run --release or cargo install |
| CoreML slower than CPU         | Neural Engine overhead on short audio  | CPU is actually faster for this 30M-param INT8 model; CoreML wins on larger models |
| SIGKILL during model load      | OOM on low-RAM system                  | Close other apps, use CPU backend, or run on a machine with ≥4 GB RAM |
| WebSocket closes immediately   | Rate limit hit or origin mismatch      | Check logs; disable rate limiting with --rate-limit 0 for local testing |
| Diarization missing speakers   | diarization feature not enabled        | Rebuild with --features diarization |

See TODO.md for the full tracker.

Contributing

See CONTRIBUTING.md. Quick start for developers:

cargo build --release --features coreml   # or cuda
cargo test                                # 146 fast unit tests, no model needed
cargo clippy --all-targets -- -D warnings
cargo deny check

Security

Please report security vulnerabilities privately — see SECURITY.md for contact details and supported versions.

Acknowledgements

phostt is a Vietnamese fork of gigastt, which provides the production-grade server scaffolding (HTTP/WS/SSE, rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the Zipformer-Transducer weights packaged by the sherpa-onnx project.

License

MIT — see LICENSE.

Download files

Download the file for your platform.

Source Distribution

phostt-0.4.2.tar.gz (1.6 MB)

Uploaded: Source

Built Distributions


phostt-0.4.2-cp314-cp314-macosx_11_0_arm64.whl (7.4 MB)

Uploaded: CPython 3.14, macOS 11.0+ ARM64

phostt-0.4.2-cp313-cp313-macosx_11_0_arm64.whl (7.4 MB)

Uploaded: CPython 3.13, macOS 11.0+ ARM64

File details

Details for the file phostt-0.4.2.tar.gz.

File metadata

  • Download URL: phostt-0.4.2.tar.gz
  • Upload date:
  • Size: 1.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.13.1

File hashes

Hashes for phostt-0.4.2.tar.gz
Algorithm Hash digest
SHA256 c34e526d65a212411c2859390f2a9c6d511e1fd5147984a6461ffecf70285429
MD5 9606afb00240a2298115deb50c90e3a1
BLAKE2b-256 0ce80a01b72f3c39d1c1a7822a8cc2f13d15117903c05ae047d26ce9188688b2


File details

Details for the file phostt-0.4.2-cp314-cp314-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for phostt-0.4.2-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8f10c4e0135699f81d6db0cbbf77906e50c9f28d580c551ac22f533c7bf00264
MD5 4f5868dfe8f010b2f9f223dd69b174e8
BLAKE2b-256 cacc8d8c978a4fe062f05cd0cbd9ec9884d38474ecc73efcd54bf06a9c008623


File details

Details for the file phostt-0.4.2-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for phostt-0.4.2-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 71cbcfd698b6043ef8b0d37365a4864a8525b05d085eb812728668810e7d04a8
MD5 b61c6b82d46eaf77358099653da6ebb2
BLAKE2b-256 a6c50afd2358026b736e608be04f9f964895e04fdb92be60113d344fb9745e1a

