Python client and CLI for Volcengine/ByteDance Doubao seed-tts-2.0 streaming TTS and bigmodel streaming ASR — full voice (speech-in + speech-out) toolkit.

doubao-speech

A production-minded Python client and CLI for Volcengine Doubao voice APIs — seed-tts-2.0 (text → speech) and bigmodel (speech → text) in a single package. Native-quality Chinese voices with emotion control, and streaming ASR with ITN and punctuation.

Why

doubao-speech is the first PyPI package that covers both directions of Volcengine's modern voice stack:

  • Other Python TTS wrappers hit the older SAMI HTTP endpoint (no streaming, older voice quality).
  • No published PyPI package speaks seed-tts-2.0 bidi-stream or the bigmodel ASR endpoint.

This package fills that gap with a clean, unified surface:

  • synthesize() / transcribe() for text → speech / speech → text.
  • A CLI that drops straight into agent frameworks (Hermes Agent, Dify, LangChain, n8n, …).
  • Strict mypy on every public module.
  • 95% unit test coverage, atomic output writes, proper credential redaction.
  • Same credentials, same config file, same error hierarchy for both directions.
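The "atomic output writes" item above can be illustrated with a common pattern: write to a temporary file in the destination directory, then rename it over the target. This is a minimal sketch of that general technique, not the package's actual implementation; the function name atomic_write_bytes is hypothetical.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: str | Path, data: bytes) -> Path:
    """Write data so that readers never observe a partially written file."""
    target = Path(target)
    # Create the temp file next to the target so os.replace stays on one
    # filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, target)
    except BaseException:
        # Best-effort cleanup; the temp file is gone if the rename succeeded.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

With this pattern an interrupted synthesis leaves either the previous file or no file at the target path, never a truncated one.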

Install

pip install doubao-speech

# or with uv:
uv add doubao-speech

# CLI-only:
uv tool install doubao-speech

Quick start

Text → speech

from doubao_speech import synthesize

synthesize("你好,世界", "hello.mp3")

Speech → text

from doubao_speech import transcribe

text = transcribe("meeting.mp3")
print(text)

Async, both directions

import asyncio

from doubao_speech import synthesize_async, transcribe_async


async def main() -> None:
    await synthesize_async(
        "Hello from Doubao seed-tts-2.0!",
        "hello.mp3",
        voice="en-female-assistant",
        speed=1.1,
    )

    transcript = await transcribe_async("interview.wav", enable_punc=True)
    print(transcript)


asyncio.run(main())

CLI

# TTS
doubao-speech say "你好" --out hello.mp3
doubao-speech say "好激动!" --voice zh-female-warm --speed 1.2 --out excited.mp3

# STT
doubao-speech transcribe meeting.mp3
doubao-speech transcribe voice-note.ogg --out transcript.txt
doubao-speech transcribe recording.wav --no-punctuation --sample-rate 16000

# Voice catalog
doubao-speech list-voices --lang zh

# Inspect effective config (tokens redacted)
doubao-speech config show

Credentials

Resolution order (first match wins):

  1. Keyword arguments to synthesize(...) / transcribe(...)
  2. Environment variables: VOLCENGINE_APP_ID, VOLCENGINE_ACCESS_TOKEN (also accepted as DOUBAO_APP_ID, DOUBAO_ACCESS_TOKEN)
  3. ~/.doubao-speech/config.yaml
  4. Built-in defaults

Example ~/.doubao-speech/config.yaml:

app_id: "1234567890"
access_token: "volc_...."
speaker: zh_female_vv_uranus_bigtts
audio_format: mp3
sample_rate: 24000

Credentials come from the Volcengine Speech console. You need seed-tts-2.0 activated for TTS and a bigmodel ASR resource enabled for STT (free tier suffices for testing).
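The first-match-wins lookup described above can be sketched as follows. This is an illustration of the precedence only, not the package's code: resolve_setting and ENV_NAMES are hypothetical names, the config file is passed in as an already-parsed mapping, and checking the VOLCENGINE_* name before its DOUBAO_* alias is an assumption.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Variable names are taken from the list above. Checking VOLCENGINE_* before
# the DOUBAO_* alias is an assumption, not documented behavior.
ENV_NAMES = {
    "app_id": ("VOLCENGINE_APP_ID", "DOUBAO_APP_ID"),
    "access_token": ("VOLCENGINE_ACCESS_TOKEN", "DOUBAO_ACCESS_TOKEN"),
}


def resolve_setting(
    key: str,
    kwargs: Mapping[str, str | None],
    config_file: Mapping[str, str],
    defaults: Mapping[str, str],
    env: Mapping[str, str] | None = None,
) -> str | None:
    """Return the first value found: kwargs, environment, config file, defaults."""
    env = os.environ if env is None else env
    # 1. Explicit keyword arguments.
    if kwargs.get(key) is not None:
        return kwargs[key]
    # 2. Environment variables (primary name, then alias).
    for name in ENV_NAMES.get(key, ()):
        if env.get(name):
            return env[name]
    # 3. Values parsed from the config file.
    if key in config_file:
        return config_file[key]
    # 4. Built-in defaults, or None if nothing matched.
    return defaults.get(key)
```

For example, with app_id set in both the environment and the config file, the environment value is returned unless the caller also passes app_id as a keyword argument.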

Hermes Agent integration

Hermes Agent's declarative tts.providers.<name> command-type surface lets you wire in doubao-speech with a single command line:

# ~/.hermes/config.yaml
tts:
  provider: doubao
  providers:
    doubao:
      type: command
      command: 'doubao-speech say --text-file {input_path} --out {output_path}'
      output_format: mp3
      max_text_length: 1024
      timeout: 30

Any Hermes voice-out path now routes through Doubao seed-tts-2.0.
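To make the command template concrete, here is a rough sketch of how a caller might expand the {input_path} and {output_path} placeholders into an argument list. How Hermes itself performs the substitution is not documented here, so the expansion logic and the build_command helper are assumptions for illustration.

```python
import shlex

# Template copied from the Hermes config above.
COMMAND_TEMPLATE = "doubao-speech say --text-file {input_path} --out {output_path}"


def build_command(template: str, input_path: str, output_path: str) -> list[str]:
    """Fill the placeholders and return an argv list for subprocess.run."""
    # Split first, then substitute, so a path containing spaces stays a
    # single argument instead of being broken apart by the shell-style split.
    return [
        part.format(input_path=input_path, output_path=output_path)
        for part in shlex.split(template)
    ]
```

The resulting list can be passed to subprocess.run(argv, check=True, timeout=30), which mirrors the timeout: 30 setting in the config.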

Hermes does not yet have a command-type STT provider; if you want Doubao ASR in Hermes today, use the bundled voice-volcengine plugin (separate install) for STT while doubao-speech handles TTS.

Audio format support

Direction   Input / Output                  Notes
TTS         mp3 (default), wav, ogg, pcm    24 kHz by default
STT         wav, mp3, ogg, flac, raw PCM    Auto-detected from extension; requires ffmpeg for non-WAV inputs

For STT, non-WAV inputs are transcoded to PCM16 mono via ffmpeg at the target sample rate. Install ffmpeg once (brew install ffmpeg / apt install ffmpeg) and any format works.
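The transcoding step described above corresponds roughly to the ffmpeg invocation below. This is a sketch under assumptions: the exact flags the package passes are not documented here, the helper names are hypothetical, and 16000 Hz is used as the default only because it appears in the CLI example.

```python
import subprocess


def ffmpeg_pcm16_mono_cmd(src: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that decodes src to raw PCM16 mono on stdout."""
    return [
        "ffmpeg",
        "-nostdin",            # never wait for keyboard input
        "-loglevel", "error",  # keep stderr quiet unless something fails
        "-i", src,
        "-ac", "1",            # downmix to mono
        "-ar", str(sample_rate),
        "-f", "s16le",         # raw signed 16-bit little-endian samples
        "-acodec", "pcm_s16le",
        "pipe:1",              # write the result to stdout
    ]


def decode_to_pcm(src: str, sample_rate: int = 16000) -> bytes:
    """Run ffmpeg and return the decoded PCM bytes (requires ffmpeg on PATH)."""
    result = subprocess.run(
        ffmpeg_pcm16_mono_cmd(src, sample_rate),
        check=True,
        capture_output=True,
    )
    return result.stdout
```

Calling decode_to_pcm("meeting.mp3") raises subprocess.CalledProcessError if ffmpeg cannot decode the file and FileNotFoundError if ffmpeg is not installed.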

Voices (TTS)

The CLI ships with curated aliases for common voices:

Alias                      Language   Gender   Style
zh-female-warm (default)   zh-CN      female   warm, conversational
zh-female-reporter         zh-CN      female   crisp, news-reporter
zh-male-warm               zh-CN      male     warm, narrator
zh-male-energetic          zh-CN      male     energetic host
en-female-assistant        en-US      female   assistant, neutral
en-male-assistant          en-US      male     assistant, neutral

Volcengine publishes hundreds more speaker IDs; pass any raw speaker ID to voice= directly.

Emotion control

seed-tts-2.0 supports per-utterance emotion tags:

synthesize(
    "好激动,我终于做到了!",
    "out.mp3",
    emotion="excited",
    emotion_scale=4.0,  # 0-5; higher = more intense
)

Supported emotions vary by speaker — check the Volcengine console.

STT features

  • ITN (inverse text normalization): "一百二十三" → "123"
  • Punctuation: Automatic commas, periods, question marks
  • Disfluency removal (DDC): Strips "嗯", "啊", repeated syllables
  • Utterance timestamps: Available on the async generator path
  • Low-latency streaming: 200 ms chunks by default; tune via --segment-ms

Disable any of the above with flags:

doubao-speech transcribe lecture.mp3 --no-itn --no-punctuation --no-ddc

Or in Python:

transcribe("lecture.mp3", enable_itn=False, enable_punc=False, enable_ddc=False)
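To give a sense of what a 200 ms segment means in bytes, the sketch below splits raw PCM16 mono audio into fixed-duration chunks. It illustrates the arithmetic only; the function name iter_pcm_chunks and the 16 kHz default are assumptions, and the package's own chunking may differ.

```python
from collections.abc import Iterator


def iter_pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16000,
    segment_ms: int = 200,
    sample_width: int = 2,  # bytes per sample for PCM16 mono
) -> Iterator[bytes]:
    """Yield consecutive chunks that each hold segment_ms of audio."""
    samples_per_chunk = sample_rate * segment_ms // 1000
    bytes_per_chunk = samples_per_chunk * sample_width
    if bytes_per_chunk <= 0:
        raise ValueError("segment_ms and sample_rate must yield a non-empty chunk")
    for start in range(0, len(pcm), bytes_per_chunk):
        # The final chunk may be shorter than the others.
        yield pcm[start:start + bytes_per_chunk]
```

At 16 kHz, a 200 ms segment is 3200 samples, or 6400 bytes, so one second of audio is sent as five chunks. Smaller segments lower the latency to the first partial transcript at the cost of more messages.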

Performance

  • TTS: One synthesize call opens a fresh WebSocket and tears it down at the end. End-to-end latency for ~2 s of speech is ~750 ms on a healthy connection — network dominates.
  • STT: Streaming starts returning partial transcripts within ~500 ms. Final transcript for a 10 s clip arrives in ~1.5-2 s total.
  • import doubao_speech takes ~3 ms; websockets and yaml are loaded lazily, only when you actually call synthesize() / transcribe().

Error handling

All user-facing errors inherit from DoubaoSpeechError:

from doubao_speech import (
    DoubaoSpeechError, DoubaoConfigError,
    DoubaoAuthError, DoubaoAPIError, DoubaoTimeoutError,
    synthesize, transcribe,
)

try:
    transcribe("audio.mp3")
except DoubaoAuthError:
    ...  # rotate your token
except DoubaoTimeoutError:
    ...  # retry or check network
except DoubaoSpeechError as exc:
    ...  # catch-all

DoubaoTTSError is kept as a back-compat alias for users porting from the earlier doubao-tts package.
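The "retry or check network" branch above could be filled in with a small exponential-backoff helper like the one below. This is a sketch, not something the package ships: with_retries is a hypothetical name, and the fallback exception class exists only so the example runs without doubao-speech installed.

```python
import time
from typing import Callable, TypeVar

try:
    from doubao_speech import DoubaoTimeoutError
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    class DoubaoTimeoutError(Exception):
        pass

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Run call(), retrying on DoubaoTimeoutError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except DoubaoTimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last timeout
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise ValueError("attempts must be at least 1")
```

Usage would look like with_retries(lambda: transcribe("audio.mp3")). Only timeouts are retried; authentication and configuration errors propagate immediately, since retrying them would not help.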

Security

  • Access tokens are redacted in all logs and CLI output — see SECURITY.md for the exact policy.
  • User text and transcribed content are not logged by default. Opt in with DOUBAO_SPEECH_TRACE_PAYLOADS=1 only for protocol debugging.
  • ~/.doubao-speech/config.yaml is user-scoped; the shipped .gitignore excludes .env files.
  • Report vulnerabilities: hypnus.yuan@gmail.com or a private GitHub security advisory.
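For readers who pass tokens through their own logging, one way to get similar redaction in application code is a logging filter that masks known secret values. This is a generic sketch using the standard logging module; it is not the package's redaction code (see SECURITY.md for the actual policy), and RedactSecretsFilter is a hypothetical name.

```python
import logging


class RedactSecretsFilter(logging.Filter):
    """Replace known secret values in log messages with a fixed marker."""

    def __init__(self, secrets: list[str]) -> None:
        super().__init__()
        # Ignore empty strings, which would otherwise match everywhere.
        self._secrets = [s for s in secrets if s]

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first so secrets passed as %-args are caught too.
        message = record.getMessage()
        for secret in self._secrets:
            message = message.replace(secret, "***REDACTED***")
        record.msg = message
        record.args = ()
        return True  # keep the record, now with secrets masked
```

Attach it with handler.addFilter(RedactSecretsFilter([access_token])) on each handler that may see the token. Filters attached to a logger do not apply to records propagated from child loggers, so handler-level attachment is the safer choice.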

Development

git clone https://github.com/Hypnus-Yuan/doubao-speech.git
cd doubao-speech

uv sync --all-extras --group dev
uv run pre-commit install
uv run pytest

See CONTRIBUTING.md for the full workflow.

Roadmap

  • v0.2 — connection-reuse daemon (TCP+TLS amortization), streaming callback API for partial transcripts, richer voice metadata sync.
  • v0.3 — LangChain/LlamaIndex/Dify integration recipes.
  • v1.0 — API frozen, semver guarantees.

License

MIT — see LICENSE.

Credits

Protocol framing extracted and hardened from Hermes Agent community work. Thanks to the Volcengine Speech team for the seed-tts-2.0 bidirectional streaming API and the bigmodel ASR endpoint.

