On-device Vietnamese speech recognition — Python bindings for phostt
phostt
On-device Vietnamese speech recognition
Local STT server powered by Zipformer-vi RNN-T — no cloud, no API keys, full privacy
phostt turns any machine into a Vietnamese speech recognition server that runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from sherpa-onnx; the binary is one command, the model is ~75 MB, everything runs locally.
```bash
cargo install phostt && phostt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API: http://127.0.0.1:9876/v1/transcribe
```
Or use Python:
```bash
pip install phostt
```

```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```
Or build from source:
```bash
git clone https://github.com/ekhodzitsky/phostt
cd phostt
cargo run --release -- serve
```
Table of Contents
- Why phostt?
- Features
- Platform Support
- Quick Start
- Benchmarks
- Quality / WER
- Architecture
- Mobile / FFI
- Roadmap
- Known Limitations
- Troubleshooting
- Contributing
- Security
- Acknowledgements
- License
Why phostt?
| | phostt | PhoWhisper-large | Cloud APIs |
|---|---|---|---|
| Architecture | Zipformer + RNN-T | Whisper enc-dec | varies |
| Model size (INT8) | ~75 MB | ~1.5 GB | server-side |
| WER (GigaSpeech2-vi) | ~7.7% | n/a | varies |
| Latency (3.7 s audio) | ~61 ms | ~300 ms | network + queue |
| Throughput | 61× RTF | ~3× RTF | varies |
| Privacy | 100% local | 100% local | data leaves device |
| Cost | free forever | free | $0.006/min+ |
| Setup | `cargo install` | Python + deps | API key + billing |
| Streaming | real-time WebSocket | batch only | varies |
The Zipformer-vi-30M weights ship via sherpa-onnx releases (Apache 2.0, trained on ~70,000 hours of Vietnamese speech, ICASSP-published WER on the VLSP and GigaSpeech2 benchmarks).
Features
- Real-time streaming — partial transcription via WebSocket as you speak
- REST API + SSE — file transcription with instant or streaming response
- Hardware acceleration — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- Pre-quantized INT8 — encoder ships at ~75 MB INT8 from upstream
- Multi-format audio — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- Auto-download — model fetched from sherpa-onnx GitHub releases on first run
- Speaker diarization — optional `diarization` feature for multi-speaker sessions
- Docker ready — CPU and CUDA images with multi-stage builds
- Android FFI — C-ABI + Kotlin bridge for mobile integration
- Hardened — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting
Platform Support
| Platform | Target | Backend | Notes |
|---|---|---|---|
| macOS (Apple Silicon) | `aarch64-apple-darwin` | CoreML / CPU | Neural Engine + CPU fallback |
| macOS (Intel) | `x86_64-apple-darwin` | CPU | |
| Linux (x86_64) | `x86_64-unknown-linux-gnu` | CUDA 12+ / CPU | CUDA via `--features cuda` |
| Linux (ARM64) | `aarch64-unknown-linux-gnu` | CPU | Buildable, not CI-tested yet |
| Android | `aarch64-linux-android`, `armv7-linux-androideabi` | NNAPI / CPU | Via cargo-ndk + `ffi` feature |
| Windows | `x86_64-pc-windows-msvc` | CPU | Community-maintained |
iOS is theoretically supported via CoreML (`--features coreml,ffi`), but not yet verified in CI.
Quick Start
Install
```bash
cargo install phostt
```
The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into `~/.phostt/models/`.
Python
```bash
pip install phostt
```

```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```
`Engine` is thread-safe — multiple Python threads can call `transcribe_file` or `transcribe_bytes` concurrently (limited by the ONNX session pool size).
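To picture what "limited by the ONNX session pool size" means, here is a self-contained sketch that simulates the bound with a semaphore. The pool size and the stand-in `transcribe_file` below are illustrative, not phostt internals:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2  # hypothetical pool size; the real value is a phostt config detail
pool_gate = threading.Semaphore(POOL_SIZE)
state_lock = threading.Lock()
active = 0
peak = 0

def transcribe_file(path: str) -> str:
    """Stand-in for Engine.transcribe_file: inference is simulated with a sleep."""
    global active, peak
    with pool_gate:  # a real pool would check out an ONNX session here
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # pretend to run the encoder/decoder/joiner
        with state_lock:
            active -= 1
    return f"transcript of {path}"

# Sixteen jobs across eight threads, but at most POOL_SIZE run at once.
with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(transcribe_file, [f"clip_{i}.wav" for i in range(16)]))
```

Callers beyond the pool size simply queue on the semaphore rather than erroring, which is why a web server can fan many requests into one engine.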
Serve
```bash
phostt serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe
```
Smoke test
```bash
phostt transcribe ~/.phostt/models/test_wavs/0.wav
```
Expected output (from the bundled Vietnamese test fixture):
```
RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA
```
Usage Examples
REST API (single file):
```bash
curl -X POST http://localhost:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
```
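For a client in plain Python, the standard-library `wave` module can produce a valid request body. The endpoint and header mirror the curl call above; the one-second silence payload is just a placeholder:

```python
import io
import wave

def make_wav(pcm16: bytes, rate: int = 16000) -> bytes:
    """Wrap raw mono PCM16 samples in a minimal WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm16)
    return buf.getvalue()

body = make_wav(b"\x00\x00" * 16000)  # one second of silence
# POST it to a running server (`phostt serve` first):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:9876/v1/transcribe", data=body,
#       headers={"Content-Type": "audio/wav"}, method="POST")
#   print(urllib.request.urlopen(req).read().decode())
```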
REST API (streaming SSE):
```bash
curl -X POST http://localhost:9876/v1/stream \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
```
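The exact SSE event schema isn't documented here, so this sketch parses only the standard `data:` lines that any SSE stream carries (a blank line terminates an event):

```python
def parse_sse(raw: str) -> list[str]:
    """Collect the data payload of each SSE event from a raw stream."""
    events, data = [], []
    for line in raw.splitlines():
        if line.startswith("data:"):
            data.append(line[5:].lstrip())
        elif line == "" and data:          # blank line ends the current event
            events.append("\n".join(data))
            data = []
    if data:                               # flush a trailing unterminated event
        events.append("\n".join(data))
    return events

sample = "data: xin\n\ndata: xin chào\n\n"
print(parse_sse(sample))  # → ['xin', 'xin chào']
```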
WebSocket (real-time):
```bash
# Connect and stream PCM16 chunks as you speak
websocat ws://localhost:9876/v1/ws
```
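Before streaming over the WebSocket, a client has to slice its PCM16 audio into frames. The 100 ms frame length below is an illustrative choice, not a protocol requirement:

```python
RATE = 16000      # samples per second, mono PCM16
FRAME_MS = 100    # hypothetical frame length; pick what your client prefers
BYTES_PER_FRAME = RATE * FRAME_MS // 1000 * 2   # 2 bytes per 16-bit sample

def frames(pcm16: bytes):
    """Split a PCM16 byte stream into fixed-size frames for the WebSocket."""
    for off in range(0, len(pcm16), BYTES_PER_FRAME):
        yield pcm16[off:off + BYTES_PER_FRAME]

one_second = b"\x00\x00" * RATE
chunks = list(frames(one_second))
print(len(chunks), len(chunks[0]))  # → 10 3200
```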
Python:
```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```
See examples/python_binding.py for a runnable demo.
With hardware acceleration:
```bash
# macOS Apple Silicon — CoreML Neural Engine
phostt serve --features coreml

# Linux + NVIDIA — CUDA 12
phostt serve --features cuda
```
Docker
```bash
# CPU (any platform)
docker build -t phostt .
docker run -p 9876:9876 phostt

# CUDA (Linux + NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t phostt-cuda .
docker run --gpus all -p 9876:9876 phostt-cuda
```
Or pull from GitHub Container Registry:
```bash
docker pull ghcr.io/ekhodzitsky/phostt:latest
docker run -p 9876:9876 ghcr.io/ekhodzitsky/phostt:latest
```
Or use Docker Compose:
```bash
docker compose up
```
Benchmarks
Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:
| Backend | Mean Latency | Median | P95 | RTF | Peak RSS |
|---|---|---|---|---|---|
| CPU | 60 ms | 60 ms | 61 ms | 62× | 1.4 GB |
| CoreML (Neural Engine) | 93 ms | 90 ms | 124 ms | 40× | 1.2 GB |
RTF = real-time factor (audio seconds processed per wall-clock second). For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.
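The table's RTF figures follow directly from the latency column:

```python
def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: audio seconds processed per wall-clock second."""
    return audio_seconds / wall_seconds

# Numbers from the benchmark table above (3.74 s clip):
print(round(rtf(3.74, 0.060)))  # CPU, 60 ms mean latency → 62
print(round(rtf(3.74, 0.093)))  # CoreML, 93 ms mean latency → 40
```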
Auto-updated benchmark history: BENCHMARKS.md.
Quality / WER
GigaSpeech2-vi (clean): ~7.7% — published upstream benchmark on clean Vietnamese speech.
For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see BENCHMARKS.md.
Architecture
```
┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│     SessionPool     │
│  (WS/REST)  │     │   router    │     │   (async-channel)   │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                   │
                          ┌────────────────────────┘
                          ▼
                 ┌─────────────────┐
                 │ SessionTriplet  │──▶ Zipformer Encoder (ONNX)
                 │ (enc/dec/join)  │──▶ RNN-T Decoder (greedy)
                 └─────────────────┘──▶ Joiner
                          │
                          ▼
                 ┌─────────────────┐
                 │ StreamingState  │──▶ overlap-buffer / VAD
                 │ (per-connection)│──▶ partial + final segments
                 └─────────────────┘
```
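The overlap-buffer idea from the diagram, in miniature: each decoding window re-includes the tail of the previous one, so words falling on a chunk boundary aren't cut in half. The window and overlap sizes below are illustrative, not phostt's actual parameters:

```python
def windows_with_overlap(samples: list, window: int = 8, overlap: int = 2):
    """Yield fixed windows where each re-includes the last `overlap` samples."""
    step = window - overlap
    for start in range(0, max(len(samples) - overlap, 1), step):
        yield samples[start:start + window]

# 20 samples, 8-sample windows, 2-sample overlap → 3 windows covering everything
chunks = list(windows_with_overlap(list(range(20))))
```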
Mobile / FFI
phostt exposes a C-ABI for Android integration:
```c
PhosttEngine* engine = phostt_engine_new("/path/to/models");
PhosttStream* stream = phostt_stream_new(engine);
char* json = phostt_stream_process_chunk(engine, stream, pcm16, len, 16000);
// ... free with phostt_string_free(json) ...
```
See ANDROID.md for NDK setup, Kotlin bridge (ffi/android/PhosttBridge.kt), and model bundling strategies.
Roadmap
- v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
- v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
- v0.4.1 — Dependency updates (rubato 2.0, sha2 0.11), docs polish, CI improvements
- iOS build verification (CoreML + `ffi` feature) — theoretically supported, not yet CI-tested
- Quantized embedding extractor for faster diarization
- Offline batch re-clustering pass for improved speaker accuracy
Known Limitations
- Out-of-domain audio (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a mono-lingual model trained on ~70,000 hours of Vietnamese speech.
- Memory footprint (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
- iOS is theoretically supported via CoreML + `ffi`, but has not been verified in CI.
- Windows builds are community-maintained and not CI-tested.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Model not found on first run | Auto-download failed or proxy blocks GitHub | Set `PHOSTT_MODEL_DIR` to a local path with extracted weights |
| High latency (>200 ms) on CPU | Debug build or missing release profile | Always run `cargo run --release` or `cargo install` |
| CoreML slower than CPU | Neural Engine overhead on short audio | CPU is actually faster for this 30M-param INT8 model; CoreML wins on larger models |
| SIGKILL during model load | OOM on low-RAM system | Close other apps, use the CPU backend, or run on a machine with ≥4 GB RAM |
| WebSocket closes immediately | Rate limit hit or origin mismatch | Check logs; disable rate limiting with `--rate-limit 0` for local testing |
| Diarization missing speakers | `diarization` feature not enabled | Rebuild with `--features diarization` |
See TODO.md for the full tracker.
Contributing
See CONTRIBUTING.md. Quick start for developers:
```bash
cargo build --release --features coreml   # or cuda
cargo test                                # 146 fast unit tests, no model needed
cargo clippy --all-targets -- -D warnings
cargo deny check
```
Security
Please report security vulnerabilities privately — see SECURITY.md for contact details and supported versions.
Acknowledgements
phostt is a Vietnamese fork of gigastt,
which provides the production-grade server scaffolding (HTTP/WS/SSE,
rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the
Zipformer-Transducer weights packaged by the
sherpa-onnx project.
License
MIT — see LICENSE.