On-device Vietnamese speech recognition — Python bindings for phostt

phostt

On-device Vietnamese speech recognition

Local STT server powered by Zipformer-vi RNN-T — no cloud, no API keys, full privacy



phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; the binary is one command, the model is ~75 MB, everything runs locally.

cargo install phostt && phostt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe

Or use Python:

pip install phostt
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)

Or build from source:

git clone https://github.com/ekhodzitsky/phostt
cd phostt
cargo run --release -- serve

Why phostt?

|                        | phostt              | PhoWhisper-large | Cloud APIs         |
|------------------------|---------------------|------------------|--------------------|
| Architecture           | Zipformer + RNN-T   | Whisper enc-dec  | varies             |
| Model size (INT8)      | ~75 MB              | ~1.5 GB          | server-side        |
| WER (GigaSpeech2-vi)   | ~7.7%               | n/a              | varies             |
| Latency (3.7 s audio)  | ~61 ms              | ~300 ms          | network + queue    |
| Throughput             | 61× RTF             | ~3× RTF          | varies             |
| Privacy                | 100% local          | 100% local       | data leaves device |
| Cost                   | free forever        | free             | $0.006/min+        |
| Setup                  | cargo install       | Python + deps    | API key + billing  |
| Streaming              | real-time WebSocket | batch only       | varies             |

The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).

Features

  • Real-time streaming — partial transcription via WebSocket as you speak
  • REST API + SSE — file transcription with instant or streaming response
  • Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
  • Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
  • Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
  • Auto-download — model fetched from sherpa-onnx GitHub releases on first run
  • Speaker diarization — optional diarization feature for multi-speaker sessions
  • Docker ready — CPU and CUDA images with multi-stage builds
  • Android FFI — C-ABI + Kotlin bridge for mobile integration
  • Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting

Platform Support

| Platform              | Target                                        | Backend        | Notes                        |
|-----------------------|-----------------------------------------------|----------------|------------------------------|
| macOS (Apple Silicon) | aarch64-apple-darwin                          | CoreML / CPU   | Neural Engine + CPU fallback |
| macOS (Intel)         | x86_64-apple-darwin                           | CPU            |                              |
| Linux (x86_64)        | x86_64-unknown-linux-gnu                      | CUDA 12+ / CPU | CUDA via --features cuda     |
| Linux (ARM64)         | aarch64-unknown-linux-gnu                     | CPU            | Buildable, not CI-tested yet |
| Android               | aarch64-linux-android, armv7-linux-androideabi | NNAPI / CPU   | Via cargo-ndk + ffi feature  |
| Windows               | x86_64-pc-windows-msvc                        | CPU            | Community-maintained         |

iOS is theoretically supported via CoreML (--features coreml,ffi), but not yet verified in CI.

Quick Start

Install

cargo install phostt

The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into ~/.phostt/models/.

Python

pip install phostt
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)

Engine is thread-safe — multiple Python threads can call transcribe_file or transcribe_bytes concurrently (limited by the ONNX session pool size).
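Because a single `Engine` can be shared across threads, batch jobs can fan out files over a thread pool. A minimal sketch (the helper name and worker count are illustrative; pass a real `phostt.Engine` in practice):

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_all(engine, paths, workers=4):
    """Transcribe many files with one shared engine.

    `engine` is expected to be a phostt.Engine: it is thread-safe, so a
    single instance can serve all worker threads (throughput is bounded
    in practice by the ONNX session pool size).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(engine.transcribe_file, paths))

# Usage (requires phostt and downloaded models):
#   from phostt import Engine
#   texts = transcribe_all(Engine("~/.phostt/models"), ["a.wav", "b.wav"])
```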

Serve

phostt serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe

Smoke test

phostt transcribe ~/.phostt/models/test_wavs/0.wav

Expected output (from the bundled Vietnamese test fixture):

RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA

Usage Examples

REST API (single file):

curl -X POST http://localhost:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
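The same request can be issued from Python with only the standard library. This sketch assumes nothing beyond the endpoint and header shown above; since the response body format is not specified here, it is read as plain text:

```python
import urllib.request

def build_transcribe_request(wav_bytes, host="http://localhost:9876"):
    """Build the POST for /v1/transcribe: raw WAV bytes in the body,
    Content-Type: audio/wav, mirroring the curl invocation above."""
    return urllib.request.Request(
        url=f"{host}/v1/transcribe",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

# Usage (requires a running `phostt serve`):
#   with open("sample.wav", "rb") as f:
#       req = build_transcribe_request(f.read())
#   text = urllib.request.urlopen(req).read().decode("utf-8")
```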

REST API (streaming SSE):

curl -X POST http://localhost:9876/v1/stream \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
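The SSE response can be consumed line by line. The exact event payload shape is not documented here, so this parser just yields the raw `data:` field of each event:

```python
def iter_sse_data(lines):
    """Yield the payload of each SSE `data:` field from an iterable of
    decoded lines (e.g. a streaming HTTP response body)."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield line[len("data:"):].strip()

# Usage against a live server (requires `phostt serve`):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:9876/v1/stream",
#       data=open("sample.wav", "rb").read(),
#       headers={"Content-Type": "audio/wav"},
#   )
#   with urllib.request.urlopen(req) as resp:
#       for data in iter_sse_data(raw.decode("utf-8") for raw in resp):
#           print(data)
```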

WebSocket (real-time):

# Connect and stream PCM16 chunks as you speak
websocat ws://localhost:9876/v1/ws
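A streaming client needs to cut its audio into small PCM16 chunks before sending them over the socket. The chunking logic is plain arithmetic; the commented-out send loop uses the third-party `websockets` package, and the binary message framing is an assumption, so treat it as a sketch rather than the wire protocol:

```python
def pcm16_chunks(pcm, chunk_ms=100, sample_rate=16000):
    """Split mono PCM16 audio (bytes) into fixed-duration chunks
    suitable for real-time streaming (2 bytes per sample)."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000
    for i in range(0, len(pcm), bytes_per_chunk):
        yield pcm[i:i + bytes_per_chunk]

# Streaming sketch (requires `pip install websockets` and `phostt serve`):
#   import asyncio, websockets
#   async def stream(pcm):
#       async with websockets.connect("ws://localhost:9876/v1/ws") as ws:
#           for chunk in pcm16_chunks(pcm):
#               await ws.send(chunk)          # binary frame per chunk
#           print(await ws.recv())            # server reply format: see docs
#   asyncio.run(stream(open("sample.pcm", "rb").read()))
```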

Python:

from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)

See examples/python_binding.py for a runnable demo.

With hardware acceleration (backends are compiled in as Cargo features; the installed binary is then run normally):

# macOS Apple Silicon — CoreML Neural Engine
cargo install phostt --features coreml
phostt serve

# Linux + NVIDIA — CUDA 12
cargo install phostt --features cuda
phostt serve

Docker

# CPU (any platform)
docker build -t phostt .
docker run -p 9876:9876 phostt

# CUDA (Linux + NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t phostt-cuda .
docker run --gpus all -p 9876:9876 phostt-cuda

Or pull from GitHub Container Registry:

docker pull ghcr.io/ekhodzitsky/phostt:latest
docker run -p 9876:9876 ghcr.io/ekhodzitsky/phostt:latest

Or use Docker Compose:

docker compose up

Benchmarks

Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:

| Backend                | Mean latency | Median | P95    | RTF | Peak RSS |
|------------------------|--------------|--------|--------|-----|----------|
| CPU                    | 60 ms        | 60 ms  | 61 ms  | 62× | 1.4 GB   |
| CoreML (Neural Engine) | 93 ms        | 90 ms  | 124 ms | 40× | 1.2 GB   |

RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
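Measuring RTF yourself is a one-liner around any transcription call; the helper below is generic, and the `Engine` usage in the comment assumes phostt and its models are installed:

```python
import time

def real_time_factor(audio_seconds, transcribe):
    """RTF = seconds of audio processed per wall-clock second.
    `transcribe` is any zero-argument callable that runs the inference."""
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed

# Usage (requires phostt):
#   from phostt import Engine
#   engine = Engine("~/.phostt/models")
#   rtf = real_time_factor(3.74, lambda: engine.transcribe_file("test.wav"))
#   print(f"{rtf:.0f}x real-time")
```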

Auto-updated benchmark history: BENCHMARKS.md.

Quality / WER

GigaSpeech2-vi (clean): ~7.7% — published upstream benchmark on clean Vietnamese speech.

For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see BENCHMARKS.md.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│   SessionPool       │
│  (WS/REST)  │     │   router    │     │  (async-channel)    │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                │
                    ┌───────────────────────────┘
                    ▼
           ┌────────────────┐
           │ SessionTriplet │──▶ Zipformer Encoder (ONNX)
           │ (enc/dec/join) │──▶ RNN-T Decoder (greedy)
           └────────────────┘──▶ Joiner
                    │
                    ▼
           ┌─────────────────┐
           │ StreamingState  │──▶ overlap-buffer / VAD
           │ (per-connection)│    → partial + final segments
           └─────────────────┘

Mobile / FFI

phostt exposes a C-ABI for Android integration:

PhosttEngine* engine = phostt_engine_new("/path/to/models");
PhosttStream* stream = phostt_stream_new(engine);
char* json = phostt_stream_process_chunk(engine, stream, pcm16, len, 16000);
// ... free with phostt_string_free(json) ...

See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.
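The same C-ABI can be exercised from Python via `ctypes`, which is handy for desktop testing before wiring up the Kotlin bridge. Only the four function names from the snippet above are taken from the source; the shared-library filename and the exact integer types of `len` and the sample rate are assumptions:

```python
import ctypes

def bind_phostt(lib):
    """Declare the C-ABI signatures on a loaded library handle,
    e.g. ctypes.CDLL("libphostt.so") (library name is an assumption)."""
    lib.phostt_engine_new.argtypes = [ctypes.c_char_p]
    lib.phostt_engine_new.restype = ctypes.c_void_p
    lib.phostt_stream_new.argtypes = [ctypes.c_void_p]
    lib.phostt_stream_new.restype = ctypes.c_void_p
    lib.phostt_stream_process_chunk.argtypes = [
        ctypes.c_void_p, ctypes.c_void_p,                  # engine, stream
        ctypes.POINTER(ctypes.c_int16), ctypes.c_size_t,   # pcm16, len (type assumed)
        ctypes.c_int,                                      # sample rate (type assumed)
    ]
    # Keep the raw pointer (not c_char_p) so it can be passed back to
    # phostt_string_free instead of being auto-converted and leaked.
    lib.phostt_stream_process_chunk.restype = ctypes.c_void_p
    lib.phostt_string_free.argtypes = [ctypes.c_void_p]
    return lib

# Usage (requires the built shared library):
#   lib = bind_phostt(ctypes.CDLL("libphostt.so"))
#   raw = lib.phostt_stream_process_chunk(engine, stream, buf, n, 16000)
#   text = ctypes.cast(raw, ctypes.c_char_p).value
#   lib.phostt_string_free(raw)
```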

Roadmap

  • v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
  • v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
  • v0.4.1 — Dependency updates (rubato 2.0, sha2 0.11), docs polish, CI improvements
  • iOS build verification (CoreML + ffi feature) — theoretically supported, not yet CI-tested
  • Quantized embedding extractor for faster diarization
  • Offline batch re-clustering pass for improved speaker accuracy

Known Limitations

  • Out-of-domain audio (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a mono-lingual model trained on ~70,000 hours of Vietnamese speech.
  • Memory footprint (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
  • iOS is theoretically supported via CoreML + ffi, but has not been verified in CI.
  • Windows builds are community-maintained and not CI-tested.

Troubleshooting

| Symptom                       | Cause                                    | Fix                                                            |
|-------------------------------|------------------------------------------|----------------------------------------------------------------|
| Model not found on first run  | Auto-download failed or proxy blocks GitHub | Set PHOSTT_MODEL_DIR to a local path with extracted weights |
| High latency (>200 ms) on CPU | Debug build or missing release profile   | Always run cargo run --release or cargo install                |
| CoreML slower than CPU        | Neural Engine overhead on short audio    | Expected: CPU is faster for this 30M-param INT8 model; CoreML wins on larger models |
| SIGKILL during model load     | OOM on low-RAM system                    | Close other apps, use the CPU backend, or run on a machine with ≥4 GB RAM |
| WebSocket closes immediately  | Rate limit hit or origin mismatch        | Check logs; disable rate limiting with --rate-limit 0 for local testing |
| Diarization missing speakers  | diarization feature not enabled          | Rebuild with --features diarization                            |

See TODO.md for the full tracker.

Contributing

See CONTRIBUTING.md. Quick start for developers:

cargo build --release --features coreml   # or cuda
cargo test                                # 146 fast unit tests, no model needed
cargo clippy --all-targets -- -D warnings
cargo deny check

Security

Please report security vulnerabilities privately — see SECURITY.md for contact details and supported versions.

Acknowledgements

phostt is a Vietnamese fork of gigastt, which provides the production-grade server scaffolding (HTTP/WS/SSE, rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the Zipformer-Transducer weights packaged by the sherpa-onnx project.

License

MIT — see LICENSE.

