Python client and CLI for Volcengine/ByteDance Doubao seed-tts-2.0 streaming TTS and bigmodel streaming ASR — full voice (speech-in + speech-out) toolkit.
doubao-speech
A production-minded Python client and CLI for Volcengine Doubao voice APIs — seed-tts-2.0 (text → speech) and bigmodel (speech → text) in a single package. Native-quality Chinese voices with emotion control, and streaming ASR with ITN and punctuation.
Why
doubao-speech is the first PyPI package that covers both directions of
Volcengine's modern voice stack:
- Other Python TTS wrappers hit the older SAMI HTTP endpoint (no streaming, older voice quality).
- No published PyPI package speaks seed-tts-2.0 bidi-stream or the bigmodel ASR endpoint.
This package fills that gap with a clean, unified surface:
- synthesize()/transcribe() for text → speech / speech → text.
- A CLI that drops straight into agent frameworks (Hermes Agent, Dify, LangChain, n8n, …).
- Strict mypy on every public module.
- 95% unit test coverage, atomic output writes, proper credential redaction.
- Same credentials, same config file, same error hierarchy for both directions.
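As an illustration of what "atomic output writes" usually means, here is a minimal sketch of the write-to-temp-then-rename pattern. The function name `atomic_write_bytes` is hypothetical and this is not necessarily how doubao-speech implements it; it only shows the general technique.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: str, data: bytes) -> None:
    """Write data to target so readers never observe a partial file."""
    target_path = Path(target)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_name, target_path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If synthesis is interrupted halfway, a pattern like this leaves any earlier `hello.mp3` untouched instead of a truncated file.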
Install
pip install doubao-speech
# or with uv:
uv add doubao-speech
# CLI-only:
uv tool install doubao-speech
Quick start
Text → speech
from doubao_speech import synthesize
synthesize("你好,世界", "hello.mp3")
Speech → text
from doubao_speech import transcribe
text = transcribe("meeting.mp3")
print(text)
Async, both directions
import asyncio

from doubao_speech import synthesize_async, transcribe_async


async def main() -> None:
    await synthesize_async(
        "Hello from Doubao seed-tts-2.0!",
        "hello.mp3",
        voice="en-female-assistant",
        speed=1.1,
    )
    transcript = await transcribe_async("interview.wav", enable_punc=True)


asyncio.run(main())
CLI
# TTS
doubao-speech say "你好" --out hello.mp3
doubao-speech say "好激动!" --voice zh-female-warm --speed 1.2 --out excited.mp3
# STT
doubao-speech transcribe meeting.mp3
doubao-speech transcribe voice-note.ogg --out transcript.txt
doubao-speech transcribe recording.wav --no-punctuation --sample-rate 16000
# Voice catalog
doubao-speech list-voices --lang zh
# Inspect effective config (tokens redacted)
doubao-speech config show
Credentials
Resolve order — first match wins:
- Keyword arguments to synthesize(...)/transcribe(...)
- Environment variables: VOLCENGINE_APP_ID, VOLCENGINE_ACCESS_TOKEN (also accepted as DOUBAO_APP_ID, DOUBAO_ACCESS_TOKEN)
- ~/.doubao-speech/config.yaml
- Built-in defaults
Example ~/.doubao-speech/config.yaml:
app_id: "1234567890"
access_token: "volc_...."
speaker: zh_female_vv_uranus_bigtts
audio_format: mp3
sample_rate: 24000
Credentials come from the Volcengine Speech console. You need seed-tts-2.0 activated for TTS and a bigmodel ASR resource enabled for STT (free tier suffices for testing).
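The "first match wins" resolution order can be sketched as a small lookup function. This is an illustration only: `resolve_setting`, `ENV_NAMES`, and `DEFAULTS` are hypothetical names, the config file is represented by an already-parsed mapping, and checking `VOLCENGINE_*` before `DOUBAO_*` is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional

# Assumed default for illustration; mp3 is the documented default output format.
DEFAULTS = {"audio_format": "mp3"}

# Environment variable names from the section above; their relative priority is assumed.
ENV_NAMES = {
    "app_id": ("VOLCENGINE_APP_ID", "DOUBAO_APP_ID"),
    "access_token": ("VOLCENGINE_ACCESS_TOKEN", "DOUBAO_ACCESS_TOKEN"),
}


def resolve_setting(
    name: str,
    kwarg_value: Optional[str],
    env: Mapping[str, str],
    file_config: Mapping[str, str],
) -> Optional[str]:
    """Return the first value found: kwarg, env var, config file, then default."""
    if kwarg_value is not None:
        return kwarg_value
    for env_name in ENV_NAMES.get(name, ()):
        if env.get(env_name):
            return env[env_name]
    if name in file_config:
        return file_config[name]
    return DEFAULTS.get(name)


# Example: no keyword argument and no config file, so only the environment is consulted.
app_id = resolve_setting("app_id", None, os.environ, {})
```

Passing the environment and file contents in as mappings keeps the precedence logic easy to test without touching real credentials.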
Hermes Agent integration
Hermes Agent's declarative
tts.providers.<name> command-type surface makes doubao-speech a one-liner:
# ~/.hermes/config.yaml
tts:
  provider: doubao
  providers:
    doubao:
      type: command
      command: 'doubao-speech say --text-file {input_path} --out {output_path}'
      output_format: mp3
      max_text_length: 1024
      timeout: 30
Any Hermes voice-out path now routes through Doubao seed-tts-2.0.
Hermes does not yet have a command-type STT provider; if you want Doubao ASR in Hermes today, use the bundled voice-volcengine plugin (separate install) for STT while doubao-speech handles TTS.
Audio format support
| Direction | Input / Output | Notes |
|---|---|---|
| TTS | mp3 (default), wav, ogg, pcm | 24 kHz by default |
| STT | wav, mp3, ogg, flac, raw PCM | Auto-detected from extension; requires ffmpeg for non-WAV inputs |
For STT, non-WAV inputs are transcoded to PCM16 mono via ffmpeg at the
target sample rate. Install ffmpeg once (brew install ffmpeg /
apt install ffmpeg) and any format works.
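To make the transcoding step concrete, the sketch below builds an ffmpeg command that decodes any input to raw PCM16 mono on stdout. The helper names and the 16000 Hz default are assumptions for illustration; the package's own invocation may differ.

```python
import subprocess
from typing import List


def build_ffmpeg_command(input_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg invocation that emits raw PCM16 mono on stdout."""
    return [
        "ffmpeg",
        "-hide_banner", "-loglevel", "error",
        "-i", input_path,
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-acodec", "pcm_s16le",
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the target rate
        "pipe:1",                 # write the result to stdout
    ]


def transcode_to_pcm(input_path: str, sample_rate: int = 16000) -> bytes:
    """Run ffmpeg and return the decoded PCM bytes (requires ffmpeg on PATH)."""
    result = subprocess.run(
        build_ffmpeg_command(input_path, sample_rate),
        check=True,
        capture_output=True,
    )
    return result.stdout
```

Keeping the command builder separate from the subprocess call makes it possible to inspect or log the exact arguments without running ffmpeg.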
Voices (TTS)
The CLI ships with curated aliases for common voices:
| Alias | Language | Gender | Style |
|---|---|---|---|
| zh-female-warm (default) | zh-CN | female | warm, conversational |
| zh-female-reporter | zh-CN | female | crisp, news-reporter |
| zh-male-warm | zh-CN | male | warm, narrator |
| zh-male-energetic | zh-CN | male | energetic host |
| en-female-assistant | en-US | female | assistant, neutral |
| en-male-assistant | en-US | male | assistant, neutral |
Volcengine publishes hundreds more speaker IDs; pass any raw speaker ID
to voice= directly.
Emotion control
seed-tts-2.0 supports per-utterance emotion tags:
synthesize(
    "好激动,我终于做到了!",
    "out.mp3",
    emotion="excited",
    emotion_scale=4.0,  # 0-5; higher = more intense
)
Supported emotions vary by speaker — check the Volcengine console.
STT features
- ITN (inverse text normalization): "一百二十三" → "123"
- Punctuation: Automatic commas, periods, question marks
- Disfluency removal (DDC): Strips "嗯", "啊", repeated syllables
- Utterance timestamps: Available on the async generator path
- Low-latency streaming: 200 ms chunks by default; tune via --segment-ms
Disable any of the above with flags:
doubao-speech transcribe lecture.mp3 --no-itn --no-punctuation --no-ddc
Or in Python:
transcribe("lecture.mp3", enable_itn=False, enable_punc=False, enable_ddc=False)
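For a sense of what a 200 ms segment means in bytes, here is a small sketch that slices PCM16 mono audio into fixed-duration chunks. The function names are hypothetical and the package's internal chunking may differ; the arithmetic is simply sample rate × bytes per sample × channels × duration.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate: int, segment_ms: int = 200,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of PCM audio covering segment_ms milliseconds."""
    return sample_rate * bytes_per_sample * channels * segment_ms // 1000


def iter_chunks(pcm: bytes, sample_rate: int, segment_ms: int = 200) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices; the last one may be shorter."""
    size = chunk_size_bytes(sample_rate, segment_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At 16 kHz, a 200 ms segment of PCM16 mono is 6400 bytes, so one second of audio is sent as five chunks. Smaller segments lower the delay before the first partial transcript at the cost of more messages.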
Performance
- TTS: One synthesize call opens a fresh WebSocket and tears it down at the end. End-to-end latency for ~2 s of speech is ~750 ms on a healthy connection — network dominates.
- STT: Streaming starts returning partial transcripts within ~500 ms. Final transcript for a 10 s clip arrives in ~1.5-2 s total.
- import doubao_speech takes ~3 ms; websockets and yaml are loaded lazily, only when you actually call synthesize()/transcribe().
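The lazy-loading idea behind the fast import can be sketched as follows. This is a generic pattern, not the package's actual code: `lazy_module` and `dump_config` are hypothetical, and the stdlib `json` module stands in for a heavier dependency such as `websockets` or `yaml`.

```python
import importlib
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def lazy_module(name: str) -> ModuleType:
    """Import a module on first use and cache it for later calls."""
    return importlib.import_module(name)


def dump_config(config: dict) -> str:
    # json stands in for a heavier dependency; it is only imported when this runs.
    return lazy_module("json").dumps(config, sort_keys=True)
```

Because nothing heavy is imported at module top level, the cost is paid on the first real call rather than at `import` time.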
Error handling
All user-facing errors inherit from DoubaoSpeechError:
from doubao_speech import (
    DoubaoSpeechError, DoubaoConfigError,
    DoubaoAuthError, DoubaoAPIError, DoubaoTimeoutError,
    synthesize, transcribe,
)

try:
    transcribe("audio.mp3")
except DoubaoAuthError:
    ...  # rotate your token
except DoubaoTimeoutError:
    ...  # retry or check network
except DoubaoSpeechError as exc:
    ...  # catch-all
DoubaoTTSError is kept as a back-compat alias for users porting from
the earlier doubao-tts package.
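One way to act on the "retry or check network" hint is a small backoff wrapper. The sketch below is an assumption-laden illustration: `with_retries` is a hypothetical helper, and the two exception classes are local stand-ins so the example runs without the package installed.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


# Stand-ins so the sketch runs on its own; in real code, import
# DoubaoSpeechError and DoubaoTimeoutError from doubao_speech instead.
class DoubaoSpeechError(Exception):
    pass


class DoubaoTimeoutError(DoubaoSpeechError):
    pass


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Retry call on timeouts with exponential backoff; re-raise other errors."""
    for attempt in range(attempts):
        try:
            return call()
        except DoubaoTimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the real package this might be used as `with_retries(lambda: transcribe("audio.mp3"))`. Only timeouts are retried here; an authentication error would fail the same way on every attempt, so it is left to propagate.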
Security
- Access tokens are redacted in all logs and CLI output; see SECURITY.md for the exact policy.
- User text and transcribed content are not logged by default. Opt in with DOUBAO_SPEECH_TRACE_PAYLOADS=1 only for protocol debugging.
- ~/.doubao-speech/config.yaml is user-scoped; the shipped .gitignore excludes .env files.
- Report vulnerabilities: hypnus.yuan@gmail.com or a private GitHub security advisory.
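Token redaction of the kind described above can be illustrated with a short masking helper. This is a sketch under assumptions: `redact_token` and `redact_config` are hypothetical names, and the exact policy the package follows is the one in SECURITY.md, not this code.

```python
def redact_token(token: str, visible: int = 4) -> str:
    """Mask a secret, keeping only a short prefix for identification."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)


def redact_config(config: dict, secret_keys=("access_token",)) -> dict:
    """Return a copy of config that is safe to print or log."""
    return {
        key: redact_token(str(value)) if key in secret_keys else value
        for key, value in config.items()
    }
```

Returning a redacted copy, rather than mutating the live config, keeps the real token usable for requests while anything printed or logged only ever sees the masked value.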
Development
git clone https://github.com/Hypnus-Yuan/doubao-speech.git
cd doubao-speech
uv sync --all-extras --group dev
uv run pre-commit install
uv run pytest
See CONTRIBUTING.md for the full workflow.
Roadmap
- v0.2 — connection-reuse daemon (TCP+TLS amortization), streaming callback API for partial transcripts, richer voice metadata sync.
- v0.3 — LangChain/LlamaIndex/Dify integration recipes.
- v1.0 — API frozen, semver guarantees.
License
MIT — see LICENSE.
Credits
Protocol framing extracted and hardened from Hermes Agent community work. Thanks to the Volcengine Speech team for the seed-tts-2.0 bidirectional streaming API and the bigmodel ASR endpoint.
File details
Details for the file doubao_speech-0.1.0.tar.gz.
File metadata
- Download URL: doubao_speech-0.1.0.tar.gz
- Upload date:
- Size: 39.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b0d298595f9b972f0068a66eca1aca9d7d0538b66f74f7f5da3f2488d7db2dad |
| MD5 | 7b216475a20824b52c3ec0e38e2b00e0 |
| BLAKE2b-256 | a2e14ceeb67f945c88edfde57446fcd53cdc524ebd1f2b78f434d89a9819d83b |
File details
Details for the file doubao_speech-0.1.0-py3-none-any.whl.
File metadata
- Download URL: doubao_speech-0.1.0-py3-none-any.whl
- Upload date:
- Size: 29.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1928c8261b497286934767be2c00df2b9e2f6bf85da51f101f685e92b713ca3c |
| MD5 | 9eaef7786474ebd0decb8b5ff0fa601e |
| BLAKE2b-256 | ad5942d92c627e956fa822dc89446e5f0002798a4a5b3111f15c6ab2e5173235 |