Maintainers

sunnyliu2

Doc-to-Speech

A Python library for converting various document formats to speech using the VibeVoice TTS (text-to-speech) model.

Features

Multiple Document Format Support: Converts text, PDF, HTML, Markdown, and other document formats to speech
High-Quality TTS: Uses the VibeVoice Hindi 7B model for natural-sounding speech synthesis
Multiple Output Formats: Supports both WAV and MP3 audio output
Smart Text Processing: Handles long documents by splitting text into chunks at word boundaries
Configurable Voice: Supports custom voice samples for personalized speech synthesis
GPU Acceleration: Runs on CUDA or MPS when available, with CPU as a fallback

Supported Document Formats

Text Files (.txt)
PDF Documents (.pdf) - Using PyMuPDF
HTML Files (.html, .htm) - Extracts clean text content
Markdown Files (.md, .mmd) - Converts to plain text
Office Documents - Word, PowerPoint, Excel and other formats via Docling
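The format list above maps naturally onto a lookup by file extension. The sketch below is a hypothetical illustration, not the library's actual dispatch code: the `READERS` table and the `pick_reader` function are assumptions, the reader names only echo the modules described under Document Readers, and the office extensions are examples of formats Docling is described as handling.

```python
from pathlib import Path

# Hypothetical suffix-to-reader table; the real library may dispatch differently.
READERS = {
    ".txt": "text_reader",
    ".pdf": "pdf_reader",
    ".html": "html_reader",
    ".htm": "html_reader",
    ".md": "markdown_reader",
    ".mmd": "markdown_reader",
    # Assumed office extensions routed to the Docling-based reader.
    ".docx": "doc_reader",
    ".pptx": "doc_reader",
    ".xlsx": "doc_reader",
}


def pick_reader(path: str) -> str:
    """Return the reader name for a path, or raise for unknown suffixes."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported document format: {suffix or '(none)'}"
        ) from None


print(pick_reader("report.PDF"))
```

Lower-casing the suffix means `report.PDF` and `report.pdf` are treated alike, and an unknown suffix fails early with a message that names it.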

Installation

pip install doc-to-speech

Requirements

Python >= 3.12
PyTorch with CUDA support (recommended for best performance)
See pyproject.toml for complete dependency list

Quick Start

Basic Usage

from dts.speaker import DocumentToSpeech

# Initialize the TTS engine
tts = DocumentToSpeech()

# Convert a document to WAV audio
tts.doc_to_wav("document.pdf", "output.wav")

# Convert a document to MP3 audio
tts.doc_to_mp3("document.txt", "output.mp3")

Advanced Configuration

from dts.speaker import DocumentToSpeech
import torch

# Custom configuration
tts = DocumentToSpeech(
    model_id="tarun7r/vibevoice-hindi-7b",  # TTS model
    device="cuda",                          # Device: "cuda", "mps", or "cpu"
    dtype=torch.float16,                    # Data type for optimization
    sample_rate=24000,                      # Audio sample rate
    ddpm_steps=5,                           # Diffusion steps
    cfg_scale=1.3,                          # Guidance scale
    lowercase=False,                        # Text preprocessing
    sample_voice_file="voice_sample.wav",   # Custom voice sample
)

# Process with custom settings
tts.doc_to_wav(
    doc_file="long_document.pdf",
    output_wav_path="speech.wav",
    max_chars=2048,                        # Characters per chunk
    progress=True                          # Show progress
)

Using Custom Voice Samples

from dts.speaker import DocumentToSpeech
from dts.vibevoice_tts import TextToSpeech

# Load a custom voice sample at construction time
tts = DocumentToSpeech(sample_voice_file="my_voice.wav")

# Or load it programmatically on the low-level engine
tts_engine = TextToSpeech()
tts_engine.load_voice_sample(voice_path="custom_voice.wav")

API Reference

DocumentToSpeech Class

The main class for document-to-speech conversion.

Constructor Parameters

model_id (str): HuggingFace model ID (default: "tarun7r/vibevoice-hindi-7b")
device (str, optional): Computing device ("cuda", "mps", "cpu")
dtype (torch.dtype, optional): Data type for model computation
sample_rate (int): Audio sample rate (default: 24000)
ddpm_steps (int): Diffusion model steps (default: 5)
cfg_scale (float): Classifier-free guidance scale (default: 1.3)
lowercase (bool): Convert text to lowercase (default: False)
sample_voice_file (str, optional): Path to voice sample file

Methods

`doc_to_wav(doc_file, output_wav_path=None, max_chars=2048, progress=True)`

Convert document to WAV audio format.

Parameters:

doc_file (str|Path): Path to input document
output_wav_path (str|Path, optional): Output WAV file path
max_chars (int): Maximum characters per processing chunk
progress (bool): Show progress during conversion

Returns: WAV bytes if output_wav_path is None, otherwise None
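When `output_wav_path` is omitted, the method is documented to return WAV bytes. A minimal sketch of inspecting such bytes with the standard-library `wave` module follows. It assumes the bytes form a complete PCM WAV container, which this README does not state. `wav_duration_seconds` and `make_silence` are hypothetical helpers, and the silent clip stands in for real output so the example runs without the model.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of an in-memory PCM WAV file, in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silence(seconds: float, sample_rate: int = 24000) -> bytes:
    """Build a silent mono 16-bit WAV clip, used here as stand-in data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


clip = make_silence(0.5)
print(wav_duration_seconds(clip))
```

With the library itself, the equivalent call would be `wav_duration_seconds(tts.doc_to_wav("document.pdf"))`, under the same assumption about the byte format.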

`doc_to_mp3(doc_file, output_mp3_path=None, max_chars=2048, progress=True)`

Convert document to MP3 audio format.

Parameters:

doc_file (str|Path): Path to input document
output_mp3_path (str|Path, optional): Output MP3 file path
max_chars (int): Maximum characters per processing chunk
progress (bool): Show progress during conversion

Returns: MP3 bytes if output_mp3_path is None, otherwise None

TextToSpeech Class

Low-level TTS engine for direct text synthesis.

Key Methods

`synthesize_text(text, max_chars=350, progress=True)`

Synthesize audio from text string.

`convert_text_to_wav(input_text, output_wav_path, max_chars=350, progress=True)`

Convert text directly to WAV file.

`convert_text_to_mp3(input_text, output_mp3_path, max_chars=350, progress=True)`

Convert text directly to MP3 file.

`load_voice_sample(voice_path=None, voice_array=None, voice_sr=None)`

Load custom voice sample for personalized synthesis.

Document Readers

The library includes specialized readers for different document formats:

Text Reader (`dts.readers.text_reader`)

Handles plain text files with UTF-8 encoding

PDF Reader (`dts.readers.pdf_reader`)

Uses PyMuPDF for robust PDF text extraction
Preserves text structure and formatting

HTML Reader (`dts.readers.html_reader`)

Uses BeautifulSoup for clean text extraction
Removes scripts, styles, and other non-content elements
Normalizes whitespace
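The three steps listed for the HTML reader can be sketched with the standard library alone. This is not the library's implementation: it swaps BeautifulSoup for `html.parser.HTMLParser` so the example has no third-party dependency, and the `TextExtractor` class and `html_to_text` function are hypothetical names.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip markup and non-content elements, then normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space, then collapse all whitespace runs.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


page = "<style>p {color: red}</style><p>Hello   world</p>\n<p>Second\nparagraph</p><script>alert(1)</script>"
print(html_to_text(page))
```

Joining text nodes with a space keeps adjacent paragraphs from running together, at the cost of an occasional extra space around inline tags.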

Markdown Reader (`dts.readers.markdown_reader`)

Converts Markdown to HTML then to plain text
Preserves document structure while removing formatting

Document Reader (`dts.readers.doc_reader`)

Uses Docling for comprehensive document format support
Handles Word, PowerPoint, Excel, and other office formats

Performance Optimization

GPU Acceleration

The library automatically detects and uses available GPU acceleration. A device can also be selected explicitly:

import torch
from dts.speaker import DocumentToSpeech

# CUDA (NVIDIA GPUs)
tts = DocumentToSpeech(device="cuda", dtype=torch.float16)

# MPS (Apple Silicon)
tts = DocumentToSpeech(device="mps")

# CPU (fallback)
tts = DocumentToSpeech(device="cpu", dtype=torch.float32)
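Automatic detection of this kind usually follows a fixed preference order. The sketch below shows one plausible order (CUDA, then MPS, then CPU). It is an assumption about the behavior, not the library's code, and `pick_device` and `detect_device` are hypothetical names. `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are existing PyTorch calls.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device string, preferring CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU without torch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )


print(pick_device(cuda_available=False, mps_available=True))
```

Keeping the decision in `pick_device` separate from the PyTorch queries makes the preference order easy to test on a machine without a GPU.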

Memory Management

For large documents, adjust the max_chars parameter to balance quality and memory usage:

# Conservative setting for limited memory
tts.doc_to_wav("large_doc.pdf", "output.wav", max_chars=350)

# Higher throughput for capable systems
tts.doc_to_wav("large_doc.pdf", "output.wav", max_chars=2048)
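As a rough guide to what `max_chars` changes, the number of synthesis chunks is at least the processed text length divided by `max_chars`, rounded up; splitting at word boundaries can only add chunks. The sketch below works through that arithmetic. `min_chunks` is a hypothetical helper and the 100,000-character document is a made-up figure.

```python
import math


def min_chunks(n_chars: int, max_chars: int) -> int:
    """Lower bound on the number of chunks for a text of n_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return math.ceil(n_chars / max_chars)


for limit in (350, 2048):
    print(limit, min_chunks(100_000, limit))
```

For this example the smaller limit yields at least 286 chunks and the larger at least 49, so the conservative setting trades lower peak memory per chunk for more synthesis calls.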

Text Processing

The library includes intelligent text processing:

Punctuation Removal: Cleans text for better TTS quality
Whitespace Normalization: Handles inconsistent formatting
Smart Chunking: Splits long text at word boundaries
Speaker Formatting: Formats text for VibeVoice model requirements
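Whitespace normalization and word-boundary chunking can be sketched as follows. This illustrates the general technique under simple assumptions (words are separated by whitespace, and a single word longer than the limit is kept whole in its own chunk); it is not the library's implementation, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars without cutting words.

    A single word longer than max_chars is emitted as its own chunk.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    # str.split() with no arguments also collapses runs of whitespace.
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("The quick  brown fox\njumps over the lazy dog", 15))
# → ['The quick brown', 'fox jumps over', 'the lazy dog']
```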

Error Handling

The library gracefully handles common issues:

Falls back to an alternative PDF reading method when one fails
Provides informative error messages for unsupported formats
Validates audio processing parameters
Handles memory constraints through chunking
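From the caller's side, errors such as an unsupported format can be handled per document so that one failure does not stop a batch. The sketch below assumes only that a conversion call raises an exception on failure. Which exception types the library raises is not documented here, so it catches `Exception` broadly. `convert_all` and `fake_convert` are hypothetical.

```python
from typing import Callable


def convert_all(
    paths: list[str], convert: Callable[[str], None]
) -> dict[str, str]:
    """Run convert on each path; return a mapping of failed path to error."""
    failures: dict[str, str] = {}
    for path in paths:
        try:
            convert(path)
        except Exception as exc:  # broad on purpose: exception types are unknown
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


def fake_convert(path: str) -> None:
    """Stand-in for a real conversion call; rejects one made-up extension."""
    if path.endswith(".xyz"):
        raise ValueError("unsupported format")


print(convert_all(["a.pdf", "b.xyz"], fake_convert))
```

With the library, `convert` could be a small wrapper such as `lambda p: tts.doc_to_wav(p, p + ".wav")`.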

Examples

Convert Multiple Documents

import os
from dts.speaker import DocumentToSpeech

tts = DocumentToSpeech()

# Convert all PDFs in a directory
pdf_dir = "documents/"
out_dir = "audio/"
os.makedirs(out_dir, exist_ok=True)  # make sure the output directory exists

for filename in os.listdir(pdf_dir):
    stem, ext = os.path.splitext(filename)
    if ext.lower() == ".pdf":  # also matches ".PDF"
        input_path = os.path.join(pdf_dir, filename)
        output_path = os.path.join(out_dir, f"{stem}.wav")
        tts.doc_to_wav(input_path, output_path)
        print(f"Converted {filename} to audio")

Batch Processing with Progress

import os
from dts.speaker import DocumentToSpeech

documents = ["doc1.pdf", "doc2.txt", "doc3.html"]
tts = DocumentToSpeech()
os.makedirs("audio", exist_ok=True)  # make sure the output directory exists

for i, doc in enumerate(documents, 1):
    print(f"Processing document {i}/{len(documents)}: {doc}")
    output_file = f"audio/output_{i}.mp3"
    tts.doc_to_mp3(doc, output_file, progress=True)

Custom Voice with Text Input

from dts.vibevoice_tts import TextToSpeech

# Direct text-to-speech with custom voice
tts = TextToSpeech(sample_voice_file="my_voice.wav")
text = "This is a test of custom voice synthesis."
tts.convert_text_to_wav(text, "custom_output.wav")

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

[Add your license information here]

Dependencies

Key dependencies include:

PyTorch: Deep learning framework
VibeVoice: TTS model for speech synthesis
PyMuPDF: PDF text extraction
BeautifulSoup4: HTML parsing
Docling: Universal document processing
librosa: Audio processing
pydub: Audio format conversion

See pyproject.toml for the complete dependency list.

Troubleshooting

Common Issues

CUDA Out of Memory: Reduce max_chars parameter
Unsupported Document Format: Check if format is in supported list
Audio Quality Issues: Verify voice sample quality and format
Slow Processing: Enable GPU acceleration if available
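The first issue in the list above can be handled automatically by retrying with a smaller chunk size. The sketch is hypothetical: `convert_with_backoff` is not part of the library, and it assumes an out-of-memory failure surfaces as a `RuntimeError` or `MemoryError`. PyTorch's CUDA out-of-memory error is generally a `RuntimeError` subclass, but that is worth verifying for your version.

```python
from typing import Callable


def convert_with_backoff(
    convert: Callable[[int], None],
    max_chars: int = 2048,
    min_chars: int = 256,
) -> int:
    """Call convert(max_chars), halving max_chars after each failure.

    Returns the max_chars value that succeeded. Note that any RuntimeError
    triggers a retry, not only out-of-memory errors.
    """
    while True:
        try:
            convert(max_chars)
            return max_chars
        except (RuntimeError, MemoryError):
            if max_chars // 2 < min_chars:
                raise  # cannot shrink further; surface the original error
            max_chars //= 2


def fake_convert(max_chars: int) -> None:
    """Stand-in that 'runs out of memory' for chunks above 600 characters."""
    if max_chars > 600:
        raise MemoryError("simulated out of memory")


print(convert_with_backoff(fake_convert))
```

With the library, `convert` could wrap a call such as `lambda n: tts.doc_to_wav("large_doc.pdf", "output.wav", max_chars=n)`.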

Getting Help

Check the documentation for parameter details
Verify input document format compatibility
Ensure proper PyTorch installation for your system
Monitor memory usage during processing

This library leverages the VibeVoice model for high-quality text-to-speech synthesis. For more information about the underlying TTS technology, visit the VibeVoice project.

Release history

0.1.0 (this version), released Nov 1, 2025

Download files

Source Distribution

doc_to_speech-0.1.0.tar.gz (692.8 kB), uploaded Nov 1, 2025, Source

Built Distribution

doc_to_speech-0.1.0-py3-none-any.whl (692.4 kB), uploaded Nov 1, 2025, Python 3

File details

Details for the file doc_to_speech-0.1.0.tar.gz.

File metadata

Download URL: doc_to_speech-0.1.0.tar.gz
Upload date: Nov 1, 2025
Size: 692.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doc_to_speech-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`fd9f6f80438074fc92e08718f3e62d3cd89808ed5996ad51c81f16ae5a920d01`
MD5	`f2617b976bfbbc3654d35a3d1093c443`
BLAKE2b-256	`827c627d8fe7f8d3d89654eb0facb2225cac81a8da5a6cc1cb0ce5c691ae2ac4`
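A downloaded file can be checked against the SHA256 digest listed above using the standard library. `sha256_of` is a hypothetical helper; the demonstration hashes a small temporary file so the sketch runs without downloading anything.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while block := handle.read(block_size):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    # Compare the streamed digest with one computed in memory.
    print(sha256_of(sample) == hashlib.sha256(b"hello").hexdigest())
```

For the source distribution, the check is whether `sha256_of(Path("doc_to_speech-0.1.0.tar.gz"))` equals the SHA256 value in the table above.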

Provenance

The following attestation bundles were made for doc_to_speech-0.1.0.tar.gz:

Publisher: python-package.yml on devsunny/doc-to-speech

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doc_to_speech-0.1.0.tar.gz
- Subject digest: fd9f6f80438074fc92e08718f3e62d3cd89808ed5996ad51c81f16ae5a920d01
- Sigstore transparency entry: 660746272
- Sigstore integration time: Nov 1, 2025
Source repository:
- Permalink: devsunny/doc-to-speech@1c5da6a4f22dfe2f227f0c7b3d89338a850517b7
- Branch / Tag: refs/heads/main
- Owner: https://github.com/devsunny
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@1c5da6a4f22dfe2f227f0c7b3d89338a850517b7
- Trigger Event: workflow_dispatch

File details

Details for the file doc_to_speech-0.1.0-py3-none-any.whl.

File metadata

Download URL: doc_to_speech-0.1.0-py3-none-any.whl
Upload date: Nov 1, 2025
Size: 692.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doc_to_speech-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`004c1b9bb01b0358158ef3c7d39fc74e5e3136dd2b21529dede82a8d5a6c2903`
MD5	`a9b81ed240969d445d87e70fa4a64248`
BLAKE2b-256	`e6c8c95546b76cea6b6aab3c2813b35d88ba7303f41e4d8ac255f15f32d9f8de`

Provenance

The following attestation bundles were made for doc_to_speech-0.1.0-py3-none-any.whl:

Publisher: python-package.yml on devsunny/doc-to-speech

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: doc_to_speech-0.1.0-py3-none-any.whl
- Subject digest: 004c1b9bb01b0358158ef3c7d39fc74e5e3136dd2b21529dede82a8d5a6c2903
- Sigstore transparency entry: 660746275
- Sigstore integration time: Nov 1, 2025
Source repository:
- Permalink: devsunny/doc-to-speech@1c5da6a4f22dfe2f227f0c7b3d89338a850517b7
- Branch / Tag: refs/heads/main
- Owner: https://github.com/devsunny
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@1c5da6a4f22dfe2f227f0c7b3d89338a850517b7
- Trigger Event: workflow_dispatch
