pdf2podcast
A Python library to convert PDF documents into podcasts using LLMs and Text-to-Speech.
Installation
pip install pdf2podcast
Requirements
- Python 3.8 or higher
- Google API key for Gemini LLM
- AWS credentials for Polly TTS (optional, can use Google TTS instead)
Dependencies
This library uses several key technologies:
- Text Processing
  - PyMuPDF: Advanced PDF processing with metadata and image caption extraction
  - sentence-transformers: Text embeddings for semantic analysis
  - faiss-cpu: Fast similarity search for text chunks
- Language Models
  - langchain-google-genai: Integration with Google's Gemini LLM
  - langchain-community: Core LangChain functionality
  - accelerate: ML model optimization
- Text-to-Speech
  - boto3: AWS Polly integration for high-quality TTS
  - ffmpeg-python: Audio processing and manipulation
  - gTTS: Google Text-to-Speech alternative
- Utils
  - python-dotenv: Environment variable management
  - pydantic: Data validation and settings management
  - datasets: Data handling utilities
Quick Start
from pdf2podcast import PodcastGenerator, SimplePDFProcessor

# Initialize PDF processor with advanced features
pdf_processor = SimplePDFProcessor(
    max_chars_per_chunk=8000,  # Customize chunk size
    extract_images=True,       # Include image captions
    metadata=True,             # Include document metadata
)

# Create podcast generator with configuration
generator = PodcastGenerator(
    rag_system=pdf_processor,
    llm_type="gemini",  # Specify LLM provider
    tts_type="aws",     # Specify TTS provider
    llm_config={
        "api_key": "your_google_api_key",
        "model_name": "gemini-1.5-flash",
        "temperature": 0.2,
    },
    tts_config={
        "voice_id": "Joanna",
        "region_name": "us-west-2",
    },
)

# Generate podcast
result = generator.generate(
    pdf_path="document.pdf",
    output_path="podcast.mp3",
    complexity="intermediate",  # Options: "simple", "intermediate", "advanced"
    voice_id="Joanna",          # Optional: override default voice
)

# Access results
print(f"Generated podcast: {result['audio']['path']}")
print(f"Audio size: {result['audio']['size']} bytes")
print(f"Script length: {len(result['script'])} characters")
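The result keys used above (`result['script']` and `result['audio']['path']`) are enough to keep a written transcript next to the audio. The following is a minimal sketch: `save_transcript` is a hypothetical helper, not part of pdf2podcast, and it assumes `result['script']` is a plain string, as the character count above suggests.

```python
from pathlib import Path


def save_transcript(result: dict) -> Path:
    """Write result['script'] to a .txt file beside the generated audio.

    Hypothetical helper: relies only on the result keys shown in the
    Quick Start and assumes the script is a plain string.
    """
    audio_path = Path(result["audio"]["path"])
    transcript_path = audio_path.with_suffix(".txt")  # podcast.mp3 -> podcast.txt
    transcript_path.write_text(result["script"], encoding="utf-8")
    return transcript_path
```

Calling `save_transcript(result)` after `generator.generate(...)` would then leave `podcast.txt` alongside `podcast.mp3`.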
Available Providers
LLM Providers
- "gemini": Google's Gemini LLM
  - Requires: GENAI_API_KEY
  - Configuration options:
    - model_name: Model version to use
    - temperature: Output randomness (0-1)
    - max_output_tokens: Maximum output length
    - top_p: Nucleus sampling parameter
    - streaming: Enable/disable streaming mode
    - prompt_builder: Custom prompt builder instance
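Because `llm_config` is a plain dictionary, a typo in a numeric option may only surface once the LLM is called. The sketch below checks the numeric options against the ranges listed above before the dictionary is passed on. `validate_llm_config` is a hypothetical helper, not part of the library; the 0 to 1 bound for `top_p` is an assumption based on it being a probability mass.

```python
def validate_llm_config(config: dict) -> dict:
    """Return a copy of config after basic range checks (hypothetical helper)."""
    checked = dict(config)

    temperature = checked.get("temperature")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError(f"temperature must be in [0, 1], got {temperature}")

    # Assumption: top_p is a cumulative probability, so it must lie in [0, 1].
    top_p = checked.get("top_p")
    if top_p is not None and not 0 <= top_p <= 1:
        raise ValueError(f"top_p must be in [0, 1], got {top_p}")

    max_tokens = checked.get("max_output_tokens")
    if max_tokens is not None and (not isinstance(max_tokens, int) or max_tokens <= 0):
        raise ValueError(f"max_output_tokens must be a positive integer, got {max_tokens}")

    return checked
```

The returned dictionary can be passed as `llm_config=validate_llm_config({...})`; options the helper does not know about are left untouched.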
TTS Providers
- "aws": Amazon Polly
  - Requires: AWS credentials
  - Configuration options:
    - voice_id: Voice to use (e.g., "Joanna", "Matthew")
    - region_name: AWS region
    - engine: "standard" or "neural"
- "google": Google Text-to-Speech
  - No API key required
  - Configuration options:
    - language: Language code (e.g., "en", "es")
    - tld: Top-level domain for accent (e.g., "com", "co.uk")
    - slow: Speech speed
PDF Processing Features
The library offers advanced PDF processing capabilities:
Basic Features
- Metadata extraction (title, author, subject, keywords)
- Image caption extraction from documents
- Efficient processing of large documents
- Support for complex PDF layouts
Text Processing
- Smart text chunking with customizable size
- Paragraph-aware text splitting
- Sentence boundary preservation
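To make the chunking behaviour listed above concrete, here is a minimal sketch of paragraph-aware splitting with a character budget. It is an illustration under stated assumptions, not the library's implementation: `chunk_text` is a hypothetical name, paragraphs are assumed to be separated by blank lines, and sentence ends are detected with a simple punctuation regex.

```python
import re
from typing import List, Tuple


def chunk_text(text: str, max_chars: int = 8000) -> List[str]:
    """Greedily pack paragraphs (or sentences) into chunks of at most max_chars."""
    # Each unit is (separator placed before it, text).
    units: List[Tuple[str, str]] = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            units.append(("\n\n", paragraph))
        else:
            # Paragraph exceeds the budget: fall back to sentence boundaries.
            sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
            for i, sentence in enumerate(sentences):
                units.append(("\n\n" if i == 0 else " ", sentence))

    chunks: List[str] = []
    current = ""
    for separator, unit in units:
        candidate = f"{current}{separator}{unit}" if current else unit
        if len(candidate) <= max_chars or not current:
            # A single sentence longer than max_chars is kept whole rather than cut.
            current = candidate
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

The design choice here is to prefer intact sentences over a strict size limit: only a single sentence longer than `max_chars` can produce an oversized chunk.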
Semantic Search & Retrieval
- Vector-based semantic search using FAISS
- Embedding generation with Sentence Transformers
- Retrieval of relevant text chunks based on queries
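The retrieval step above amounts to embedding the query and every chunk, then returning the chunks whose vectors are closest to the query. The sketch below shows that flow using only the standard library. It deliberately swaps in a toy word-count embedding and a brute-force cosine search as stand-ins for Sentence Transformers and FAISS, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    # Brute-force ranking; a vector index such as FAISS replaces this at scale.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

With dense embeddings the structure stays the same: only `embed` and the ranking step change.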
Configuration
Environment Variables
You can set these environment variables instead of passing them directly:
GENAI_API_KEY=your_google_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=your_aws_region
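If you prefer to fail early when a variable is missing, the values can also be read explicitly and passed in. The sketch below assumes the `api_key` and `region_name` configuration keys shown in the Quick Start. `config_from_env` is a hypothetical helper, the `us-west-2` fallback is only an illustrative default, and loading a local `.env` file through python-dotenv is optional.

```python
import os


def config_from_env() -> dict:
    """Collect settings from environment variables (hypothetical helper)."""
    try:
        # Optional: if python-dotenv is installed, load variables from a .env file.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass

    api_key = os.environ.get("GENAI_API_KEY")
    if not api_key:
        raise RuntimeError("GENAI_API_KEY is not set")

    return {
        "llm_config": {"api_key": api_key},
        # Assumption: fall back to us-west-2 when no region is configured.
        "tts_config": {"region_name": os.environ.get("AWS_DEFAULT_REGION", "us-west-2")},
    }
```

The two dictionaries can then be merged into the `llm_config` and `tts_config` arguments of `PodcastGenerator`.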
Advanced Configuration Examples
High-Quality Production Setup
# processor is a SimplePDFProcessor instance, created as in the Quick Start
generator = PodcastGenerator(
    rag_system=processor,
    llm_type="gemini",
    tts_type="aws",
    llm_config={
        "model_name": "gemini-1.5-flash",
        "temperature": 0.2,
        "top_p": 0.9,
        "max_output_tokens": 8192,
        "streaming": True,
    },
    tts_config={
        "voice_id": "Joanna",
        "engine": "neural",
        "region_name": "us-west-2",
    },
)
Fast Development Setup
# processor is a SimplePDFProcessor instance, created as in the Quick Start
generator = PodcastGenerator(
    rag_system=processor,
    llm_type="gemini",
    tts_type="google",  # Faster, no API key needed
    llm_config={
        "temperature": 0.3,
        "max_output_tokens": 4096,
    },
    tts_config={
        "language": "en",
        "tld": "com",
    },
)
Custom Prompt Builders
You can customize how content is processed by creating custom prompt builders:
from pdf2podcast import PodcastGenerator
from pdf2podcast.core.base import BasePromptBuilder


class TechnicalPromptBuilder(BasePromptBuilder):
    """Specialized for technical documentation."""

    def build_prompt(self, text: str, **kwargs) -> str:
        return (
            "Create a technical podcast script following these guidelines:\n"
            "1. Start with a high-level overview\n"
            "2. Break down complex concepts step by step\n"
            "3. Include practical examples\n\n"
            f"Content: {text}\n"
            f"Complexity: {kwargs.get('complexity', 'intermediate')}"
        )


# Use custom prompt builder
# (processor is a SimplePDFProcessor instance, created as in the Quick Start)
generator = PodcastGenerator(
    rag_system=processor,
    llm_type="gemini",
    tts_type="aws",
    llm_config={
        "prompt_builder": TechnicalPromptBuilder(),
        "temperature": 0.2,
    },
)
Common Use Cases
Academic Paper Processing
generator = PodcastGenerator(
    rag_system=SimplePDFProcessor(
        max_chars_per_chunk=8000,
        extract_images=True,
        metadata=True,
    ),
    llm_type="gemini",
    tts_type="aws",
    llm_config={
        "temperature": 0.2,
        "max_output_tokens": 8192,
    },
    tts_config={
        "voice_id": "Joanna",
        "engine": "neural",
    },
)

result = generator.generate(
    pdf_path="paper.pdf",
    output_path="paper_podcast.mp3",
    complexity="advanced",
    query="Focus on methodology and key findings",
)
Business Report Summary
generator = PodcastGenerator(
    rag_system=SimplePDFProcessor(
        max_chars_per_chunk=4000,
    ),
    llm_type="gemini",
    tts_type="aws",
    llm_config={
        "temperature": 0.3,
        "max_output_tokens": 4096,
    },
)

result = generator.generate(
    pdf_path="report.pdf",
    output_path="summary.mp3",
    complexity="intermediate",
    query="Summarize key business metrics and trends",
)
Educational Content
generator = PodcastGenerator(
    rag_system=SimplePDFProcessor(extract_images=True),
    llm_type="gemini",
    tts_type="google",
    llm_config={
        "temperature": 0.4,
        "max_output_tokens": 6144,
    },
    tts_config={
        "language": "en",
        "tld": "com",
        "slow": True,  # Better for learning
    },
)

result = generator.generate(
    pdf_path="lesson.pdf",
    output_path="tutorial.mp3",
    complexity="simple",
)
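The use cases above each convert a single file. For a folder of PDFs, the same `generate()` call can be wrapped in a loop. This is a minimal sketch: `batch_generate` is a hypothetical helper, and it uses only the `pdf_path`, `output_path`, and `complexity` arguments shown in the examples.

```python
from pathlib import Path
from typing import Dict


def batch_generate(
    generator,
    pdf_dir: str,
    out_dir: str,
    complexity: str = "intermediate",
) -> Dict[str, dict]:
    """Run generator.generate() for every PDF in pdf_dir.

    Hypothetical helper: returns the per-file results keyed by PDF file name.
    """
    output_root = Path(out_dir)
    output_root.mkdir(parents=True, exist_ok=True)

    results: Dict[str, dict] = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # lesson.pdf -> <out_dir>/lesson.mp3
        output_path = output_root / f"{pdf_path.stem}.mp3"
        results[pdf_path.name] = generator.generate(
            pdf_path=str(pdf_path),
            output_path=str(output_path),
            complexity=complexity,
        )
    return results
```

For example, `batch_generate(generator, "lessons/", "podcasts/", complexity="simple")` would process every PDF in `lessons/` with one of the generators configured above.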
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file pdf2podcast-0.1.1.tar.gz.
File metadata
- Download URL: pdf2podcast-0.1.1.tar.gz
- Upload date:
- Size: 20.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89e54268ca8999b1883fefe2dbf15981f61b71c0d9b1e237da7c5f1e014d2d02 |
| MD5 | 699f50434c6158529ca0356a0c4f6b3e |
| BLAKE2b-256 | 8a5dbdc550fd2565a3fe1ea39d102dc59d290c650f43b6c626db504711b7dedc |
File details
Details for the file pdf2podcast-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pdf2podcast-0.1.1-py3-none-any.whl
- Upload date:
- Size: 21.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b81448125753eb113bb6a001d9c887da86eaf67082f3c506d0cb0579d542e33 |
| MD5 | 6e1806ba5d0aec763fa7432c86142f95 |
| BLAKE2b-256 | b6d50f7f211d60ff96fe52b0894a2f8e16fc3e5414693ea46f0db502b37a76f8 |