Python client and CLI for Volcengine/ByteDance Doubao seed-tts-2.0 streaming TTS and bigmodel streaming ASR — full voice (speech-in + speech-out) toolkit.

doubao-speech

A production-minded Python client and CLI for Volcengine Doubao voice APIs — seed-tts-2.0 (text → speech) and bigmodel (speech → text) in a single package. Native-quality Chinese voices with emotion control, and streaming ASR with ITN and punctuation.

Why

doubao-speech is the first PyPI package that covers both directions of Volcengine's modern voice stack. The existing options leave two gaps:

Other Python TTS wrappers target the older SAMI HTTP endpoint, which has no streaming and older-generation voice quality.
No published PyPI package speaks the seed-tts-2.0 bidirectional streaming protocol or the bigmodel ASR endpoint.

This package fills that gap with a clean, unified surface:

synthesize() / transcribe() for text → speech / speech → text.
A CLI that drops straight into agent frameworks (Hermes Agent, Dify, LangChain, n8n, …).
Strict mypy on every public module.
95% unit test coverage, atomic output writes, proper credential redaction.
Same credentials, same config file, same error hierarchy for both directions.
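
"Atomic output writes" in the list above usually means that a half-written audio file never appears at the destination path. A minimal sketch of that pattern, assuming a write-to-temp-then-rename approach; the helper name atomic_write_bytes is hypothetical and is not part of the doubao-speech API:

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def atomic_write_bytes(dest: str | os.PathLike, data: bytes) -> None:
    """Write data so readers see either the old file or the complete new one."""
    dest_path = Path(dest)
    # Create the temp file next to the destination so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(dest_path.parent), prefix=dest_path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, dest_path)  # replaces any existing file in one step
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If synthesis fails halfway, the destination keeps its previous content (or stays absent) and no stray .tmp file is left behind.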

Install

pip install doubao-speech

# or with uv:
uv add doubao-speech

# CLI-only:
uv tool install doubao-speech

Quick start

Text → speech

from doubao_speech import synthesize

synthesize("你好，世界", "hello.mp3")

Speech → text

from doubao_speech import transcribe

text = transcribe("meeting.mp3")
print(text)

Async, both directions

import asyncio

from doubao_speech import synthesize_async, transcribe_async


async def main() -> None:
    await synthesize_async(
        "Hello from Doubao seed-tts-2.0!",
        "hello.mp3",
        voice="en-female-assistant",
        speed=1.1,
    )

    transcript = await transcribe_async("interview.wav", enable_punc=True)
    print(transcript)


asyncio.run(main())

CLI

# TTS
doubao-speech say "你好" --out hello.mp3
doubao-speech say "好激动！" --voice zh-female-warm --speed 1.2 --out excited.mp3

# STT
doubao-speech transcribe meeting.mp3
doubao-speech transcribe voice-note.ogg --out transcript.txt
doubao-speech transcribe recording.wav --no-punctuation --sample-rate 16000

# Voice catalog
doubao-speech list-voices --lang zh

# Inspect effective config (tokens redacted)
doubao-speech config show

Credentials

Resolution order (first match wins):

Keyword arguments to synthesize(...) / transcribe(...)
Environment variables: VOLCENGINE_APP_ID, VOLCENGINE_ACCESS_TOKEN (also accepted as DOUBAO_APP_ID, DOUBAO_ACCESS_TOKEN)
~/.doubao-speech/config.yaml
Built-in defaults
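
The lookup order above can be sketched as a small function. This is an illustration only, not the package's actual resolver: resolve_setting and ENV_NAMES are hypothetical names, and checking the VOLCENGINE_* variables before the DOUBAO_* ones is an assumption.

```python
from __future__ import annotations

from typing import Mapping, Optional

# Environment variable names from the list above. Trying VOLCENGINE_* before
# DOUBAO_* is an assumption made for this sketch.
ENV_NAMES = {
    "app_id": ("VOLCENGINE_APP_ID", "DOUBAO_APP_ID"),
    "access_token": ("VOLCENGINE_ACCESS_TOKEN", "DOUBAO_ACCESS_TOKEN"),
}


def resolve_setting(
    key: str,
    kwargs: Mapping[str, Optional[str]],
    env: Mapping[str, str],
    file_config: Mapping[str, str],
    defaults: Mapping[str, str],
) -> Optional[str]:
    """Return the first value found: kwargs, env, config file, then defaults."""
    if kwargs.get(key) is not None:
        return kwargs[key]
    for name in ENV_NAMES.get(key, ()):
        if env.get(name):
            return env[name]
    if key in file_config:
        return file_config[key]
    return defaults.get(key)
```

In real use, env would be os.environ and file_config the parsed contents of ~/.doubao-speech/config.yaml.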

Example ~/.doubao-speech/config.yaml:

app_id: "1234567890"
access_token: "volc_...."
speaker: zh_female_vv_uranus_bigtts
audio_format: mp3
sample_rate: 24000

Credentials come from the Volcengine Speech console. You need seed-tts-2.0 activated for TTS and a bigmodel ASR resource enabled for STT (free tier suffices for testing).

Hermes Agent integration

Hermes Agent's declarative tts.providers.<name> command-type surface lets doubao-speech plug in with a single command template:

# ~/.hermes/config.yaml
tts:
  provider: doubao
  providers:
    doubao:
      type: command
      command: 'doubao-speech say --text-file {input_path} --out {output_path}'
      output_format: mp3
      max_text_length: 1024
      timeout: 30

Any Hermes voice-out path now routes through Doubao seed-tts-2.0.
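
To make the {input_path} and {output_path} placeholders concrete, here is a rough sketch of how a command-type provider could expand the template into an argument list. How Hermes actually performs the substitution is an assumption; build_command is a hypothetical helper, and only the doubao-speech flags come from the config above.

```python
import shlex

# Command template copied from the Hermes config above.
COMMAND_TEMPLATE = "doubao-speech say --text-file {input_path} --out {output_path}"


def build_command(template: str, input_path: str, output_path: str) -> list[str]:
    """Fill in the placeholders and split the result into an argv list."""
    expanded = template.format(
        # Quote each path so spaces survive the split below.
        input_path=shlex.quote(input_path),
        output_path=shlex.quote(output_path),
    )
    return shlex.split(expanded)
```

The resulting list can be passed to subprocess.run(argv, check=True, timeout=30), which mirrors the timeout: 30 setting without involving a shell.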

Hermes does not yet have a command-type STT provider; if you want Doubao ASR in Hermes today, use the bundled voice-volcengine plugin (separate install) for STT while doubao-speech handles TTS.

Audio format support

Direction	Formats (TTS output / STT input)	Notes
TTS	`mp3` (default), `wav`, `ogg`, `pcm`	24 kHz by default
STT	`wav`, `mp3`, `ogg`, `flac`, `raw` PCM	Auto-detected from extension; requires `ffmpeg` for non-WAV inputs

For STT, non-WAV inputs are transcoded to PCM16 mono via ffmpeg at the target sample rate. Install ffmpeg once (brew install ffmpeg or apt install ffmpeg) and every format listed above works.
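
For reference, the transcoding step can be approximated with a plain ffmpeg call. This is a sketch under assumptions, not the command the package runs: the function names are hypothetical, and the 16 kHz default is borrowed from the --sample-rate 16000 CLI example rather than confirmed as the library default.

```python
import subprocess


def ffmpeg_pcm16_mono_args(src: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg argv that decodes src to raw PCM16 mono on stdout."""
    return [
        "ffmpeg",
        "-nostdin",             # never wait for keyboard input
        "-loglevel", "error",   # keep stderr quiet unless something fails
        "-i", src,
        "-ac", "1",             # downmix to mono
        "-ar", str(sample_rate),
        "-f", "s16le",          # raw signed 16-bit little-endian samples
        "pipe:1",               # write the result to stdout
    ]


def transcode_to_pcm(src: str, sample_rate: int = 16000) -> bytes:
    """Run ffmpeg and return PCM bytes; raises CalledProcessError on failure."""
    result = subprocess.run(
        ffmpeg_pcm16_mono_args(src, sample_rate),
        check=True,
        capture_output=True,
    )
    return result.stdout
```

Running the same arguments by hand is a quick way to check whether ffmpeg can decode a given file before handing it to transcribe().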

Voices (TTS)

The CLI ships with curated aliases for common voices:

Alias	Language	Gender	Style
`zh-female-warm` (default)	zh-CN	female	warm, conversational
`zh-female-reporter`	zh-CN	female	crisp, news-reporter
`zh-male-warm`	zh-CN	male	warm, narrator
`zh-male-energetic`	zh-CN	male	energetic host
`en-female-assistant`	en-US	female	assistant, neutral
`en-male-assistant`	en-US	male	assistant, neutral

Volcengine publishes hundreds more speaker IDs; pass any raw speaker ID to voice= directly.
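
Conceptually, alias handling is a dictionary lookup with a pass-through for anything that is not a known alias. A minimal sketch, assuming that behavior; resolve_voice is a hypothetical name and the speaker IDs below are placeholders, not real Volcengine IDs:

```python
from typing import Optional

# Alias names come from the table above. The values are PLACEHOLDERS, not
# real Volcengine speaker IDs; look up the real IDs in your console.
VOICE_ALIASES = {
    "zh-female-warm": "<speaker-id-zh-female-warm>",
    "zh-female-reporter": "<speaker-id-zh-female-reporter>",
    "zh-male-warm": "<speaker-id-zh-male-warm>",
    "zh-male-energetic": "<speaker-id-zh-male-energetic>",
    "en-female-assistant": "<speaker-id-en-female-assistant>",
    "en-male-assistant": "<speaker-id-en-male-assistant>",
}
DEFAULT_VOICE = "zh-female-warm"


def resolve_voice(voice: Optional[str] = None) -> str:
    """Map a curated alias to its speaker ID; pass other values through unchanged."""
    name = voice or DEFAULT_VOICE
    return VOICE_ALIASES.get(name, name)
```

With this shape, voice="zh-male-warm" and a raw ID such as the zh_female_vv_uranus_bigtts value from the config example both go through the same parameter.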

Emotion control

seed-tts-2.0 supports per-utterance emotion tags:

from doubao_speech import synthesize

synthesize(
    "好激动，我终于做到了！",
    "out.mp3",
    emotion="excited",
    emotion_scale=4.0,  # 0-5; higher = more intense
)

Supported emotions vary by speaker — check the Volcengine console.

STT features

ITN (inverse text normalization): "一百二十三" → "123"
Punctuation: Automatic commas, periods, question marks
Disfluency removal (DDC): Strips "嗯", "啊", repeated syllables
Utterance timestamps: Available on the async generator path
Low-latency streaming: 200 ms chunks by default; tune via --segment-ms

Disable any of the above with flags:

doubao-speech transcribe lecture.mp3 --no-itn --no-punctuation --no-ddc

Or in Python:

transcribe("lecture.mp3", enable_itn=False, enable_punc=False, enable_ddc=False)
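
To get a feel for what a 200 ms chunk means in bytes, the arithmetic for PCM16 mono audio is sample rate × duration × 2 bytes per sample. The sketch below assumes 16 kHz input (the rate used in the CLI example); chunk_size_bytes and iter_chunks are hypothetical helpers, not package functions.

```python
from typing import Iterator


def chunk_size_bytes(
    sample_rate: int = 16000,
    segment_ms: int = 200,
    bytes_per_sample: int = 2,  # PCM16
    channels: int = 1,          # mono
) -> int:
    """Number of bytes in one audio segment of segment_ms milliseconds."""
    samples = sample_rate * segment_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of pcm; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At 16 kHz, a 200 ms segment is 3,200 samples, or 6,400 bytes. Smaller --segment-ms values send more, smaller frames over the socket.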

Performance

TTS: Each synthesize() call opens a fresh WebSocket and tears it down at the end. End-to-end latency for ~2 s of speech is ~750 ms on a healthy connection; network time dominates.
STT: Partial transcripts start arriving within ~500 ms. The final transcript for a 10 s clip arrives in ~1.5-2 s total.
Import: import doubao_speech takes ~3 ms, because websockets and yaml are loaded lazily, on the first call to synthesize() or transcribe().
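
Lazy loading of heavy dependencies can be done in several ways. One common pattern is a small proxy that imports the real module on first attribute access. The sketch below shows that pattern; it is not necessarily how doubao-speech does it, LazyModule is a hypothetical name, and json stands in for a heavier dependency such as websockets.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Proxy that imports the real module on first attribute access."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def __getattr__(self, attr: str) -> Any:
        # __getattr__ only runs for attributes that normal lookup misses,
        # so _name and _module never reach this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Nothing is imported here; the cost is paid on the first real use.
lazy_json = LazyModule("json")
```

To check import cost on your own machine, CPython's python -X importtime -c "import doubao_speech" prints a per-module timing breakdown.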

Error handling

All user-facing errors inherit from DoubaoSpeechError:

from doubao_speech import (
    DoubaoSpeechError, DoubaoConfigError,
    DoubaoAuthError, DoubaoAPIError, DoubaoTimeoutError,
    synthesize, transcribe,
)

try:
    transcribe("audio.mp3")
except DoubaoAuthError:
    ...  # rotate your token
except DoubaoTimeoutError:
    ...  # retry or check network
except DoubaoSpeechError as exc:
    ...  # catch-all

DoubaoTTSError is kept as a back-compat alias for users porting from the earlier doubao-tts package.
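
For the "retry or check network" branch, a small retry helper with exponential backoff is often enough. This is a generic sketch, not part of the package: retry_on is a hypothetical helper, and the attempt count and delays are arbitrary starting points.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def retry_on(
    exc_type: Type[BaseException],
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on exc_type with delays of base_delay * 2**n seconds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exc_type:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Usage would look like retry_on(DoubaoTimeoutError, lambda: transcribe("audio.mp3")). DoubaoAuthError is deliberately not retried, since a rejected token will not fix itself.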

Security

Access tokens are redacted in all logs and CLI output — see SECURITY.md for the exact policy.
User text and transcribed content are not logged by default. Opt in with DOUBAO_SPEECH_TRACE_PAYLOADS=1 only for protocol debugging.
~/.doubao-speech/config.yaml is user-scoped; the shipped .gitignore excludes .env files.
Report vulnerabilities: hypnus.yuan@gmail.com or a private GitHub security advisory.
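
As an illustration of the token redaction mentioned above, here is one common masking scheme. The exact policy lives in SECURITY.md and may differ; redact is a hypothetical helper, and keeping four leading characters is an assumption.

```python
def redact(secret: str, visible: int = 4) -> str:
    """Mask a secret, keeping at most `visible` leading characters."""
    if not secret:
        return ""
    if len(secret) <= visible * 2:
        # Too short to show any part of it safely.
        return "****"
    # A fixed-width mask avoids leaking the secret's length.
    return secret[:visible] + "****"
```

Applied to a token like the volc_.... value in the config example, only a short prefix would appear in doubao-speech config show output and logs.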

Development

git clone https://github.com/Hypnus-Yuan/doubao-speech.git
cd doubao-speech

uv sync --all-extras --group dev
uv run pre-commit install
uv run pytest

See CONTRIBUTING.md for the full workflow.

Roadmap

v0.2 — connection-reuse daemon (TCP+TLS amortization), streaming callback API for partial transcripts, richer voice metadata sync.
v0.3 — LangChain/LlamaIndex/Dify integration recipes.
v1.0 — API frozen, semver guarantees.

License

MIT — see LICENSE.

Credits

Protocol framing extracted and hardened from Hermes Agent community work. Thanks to the Volcengine Speech team for the seed-tts-2.0 bidirectional streaming API and the bigmodel ASR endpoint.

Project details

Version 0.1.0, released Apr 30, 2026.
