Professional EPUB to Text/Markdown/JSON converter with batch processing
EPUB to Text Converter
A professional, high-performance library for extracting content from EPUB files and converting it to multiple formats (Text, Markdown, JSON). It supports both single-file and batch processing, with optional parallel execution.
Features:
- Extract chapters, images, and metadata from EPUB files
- Export to multiple formats: Text, Markdown, JSON, HTML
- Batch process multiple EPUB files in parallel
- Extract and link images with proper paths
- Get detailed book information and statistics
- Simple CLI and comprehensive Python API
Installation
pip install epub-to-text
Or from source:
pip install git+https://github.com/thinh-vu/epub_to_text.git
Quick Start
Command Line
# Convert to markdown chapters with images
epub-to-text your_book.epub --chapters-markdown --extract-images
# Convert to all formats
epub-to-text book.epub --all -o output/
# Show book information
epub-to-text your_book.epub --info
# Batch process EPUBs with parallel execution
epub-to-text /your_epub_folder_path --batch --all --parallel
Python API
from epub_to_text import EpubProcessor
# Basic usage
processor = EpubProcessor('book.epub', 'output/')
summary = processor.get_summary()
processor.export_chapters_markdown()
processor.extract_images()
from epub_to_text import BatchProcessor
# Batch processing
batch = BatchProcessor(max_workers=4)
result = batch.process_batch(
    '/epub/folder',
    './output',
    {'chapters_markdown': True, 'extract_images': True},
    recursive=True,
    parallel=True,
)
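To illustrate what `parallel=True` and `max_workers` imply, here is a minimal sketch of fanning work out over a folder with the standard library's `concurrent.futures`. It is not the package's implementation: `find_epubs`, `convert_one`, and `process_folder` are hypothetical names, `convert_one` is a stand-in for the real per-file work, and `BatchProcessor` may use a different strategy internally.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def find_epubs(folder, recursive=True):
    """Return sorted .epub paths under folder (hypothetical helper)."""
    pattern = "**/*.epub" if recursive else "*.epub"
    return sorted(Path(folder).glob(pattern))


def convert_one(path):
    """Hypothetical stand-in for per-file work: just report the file size."""
    return {"file": path.name, "bytes": path.stat().st_size}


def process_folder(folder, max_workers=4, recursive=True):
    """Run convert_one over every EPUB, collecting results and errors."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(convert_one, path): path
            for path in find_epubs(folder, recursive)
        }
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:  # one bad file should not stop the batch
                errors.append((futures[future], exc))
    return results, errors
```

Collecting errors per file rather than raising on the first failure keeps a long batch running when a single EPUB is corrupt. Results arrive in completion order, so sort them if a stable order matters.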
Documentation
Complete documentation is available in the docs/ folder:
- Quick Start - Get started in minutes
- API Reference - Complete class/method documentation
- Architecture Guide - System design and patterns
- Integration Guide - AI agent integration patterns
- Advanced Usage - Custom processors and optimization
CLI Options
Usage: epub-to-text [OPTIONS] <file_or_directory>
Options:
--single-text Export entire book as text
--single-markdown Export entire book as markdown
--chapters-text Export each chapter as text files
--chapters-markdown Export each chapter as markdown files
--json Export as JSON with metadata
--all Export in all formats
--extract-images Extract and save images
--batch Process multiple EPUBs
--recursive Search subdirectories
--parallel Process files in parallel
--max-workers N Number of parallel workers (default: 4)
--info Show book information only
--verbose Detailed output
-o, --output DIR Output directory (default: ./exported_books)
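For readers who want to see how such flags map onto parsed values, the following sketch rebuilds a subset of the options above with `argparse`. It is an assumption about how a CLI like this could be wired, not the contents of the project's `cli.py`; `build_parser` is a hypothetical name, and only the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser():
    """Build a parser mirroring a subset of the documented options (a sketch)."""
    parser = argparse.ArgumentParser(prog="epub-to-text")
    parser.add_argument("path", help="EPUB file or directory")
    parser.add_argument("--chapters-markdown", action="store_true")
    parser.add_argument("--extract-images", action="store_true")
    parser.add_argument("--all", action="store_true")
    parser.add_argument("--batch", action="store_true")
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--max-workers", type=int, default=4)
    parser.add_argument("-o", "--output", default="./exported_books")
    return parser


# Parse the same arguments as the batch example from the Quick Start,
# plus an explicit worker count.
args = build_parser().parse_args(
    ["/your_epub_folder_path", "--batch", "--all", "--parallel", "--max-workers", "8"]
)
print(args.batch, args.max_workers, args.output)
```

Note that `argparse` turns dashes into underscores, so `--max-workers` is read back as `args.max_workers`.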
Output Structure
Single file mode:
output/
├── book.md # Complete book as markdown
├── book.txt # Complete book as text
└── book.json # Structured data
Chapter-wise mode:
output/
└── Book_Title/
    ├── 01_Introduction.md
    ├── 02_Chapter_Two.md
    ├── 03_Conclusion.md
    └── images/
        ├── cover.jpg
        └── diagram1.png
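The chapter file names above follow a zero-padded index plus a title with spaces replaced by underscores. One way to produce names of that shape is sketched below. `chapter_filename` is a hypothetical helper, and the exact sanitising rules the package applies to unusual titles are an assumption.

```python
import re


def chapter_filename(index, title, ext="md", width=2):
    """Build a name like '01_Introduction.md' from a 1-based index and a title."""
    # Collapse every run of non-word characters into a single underscore.
    slug = re.sub(r"[^\w]+", "_", title).strip("_")
    # Fall back to a generic name when the title has no usable characters.
    return f"{index:0{width}d}_{slug or 'chapter'}.{ext}"


titles = ["Introduction", "Chapter Two", "Conclusion"]
names = [chapter_filename(i, t) for i, t in enumerate(titles, start=1)]
print(names)
```

Zero-padding keeps the files in reading order when a file manager sorts them by name.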
Project Structure
epub_to_text/
├── __init__.py # Package initialization
├── cli.py # Command-line interface
├── reader.py # EPUB file reading
├── extractor.py # Content extraction
├── converter.py # Format conversion
├── processor.py # Single-file processing
└── batch_processor.py # Batch processing
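As a rough idea of what the extraction and conversion steps involve, the sketch below turns chapter HTML into plain text using only the standard library. The package itself lists BeautifulSoup as a dependency, so this is a deliberately simplified substitute and not the code in `extractor.py` or `converter.py`; `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, inserting line breaks after block-level tags."""

    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return plain text with one line per non-empty block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>"))
```

A real converter would also need to handle entities in attribute values, nested lists, and image references, which is where a full HTML library earns its place.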
Key Classes
| Class | Purpose |
|---|---|
| EpubProcessor | High-level single-file processing |
| BatchProcessor | Batch processing with parallel support |
| EpubExtractor | Extract chapters, images, metadata |
| ContentConverter | Format conversion utilities |
| EpubReader | Low-level EPUB file reading |
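To make "low-level EPUB file reading" concrete: an EPUB is a ZIP archive whose `META-INF/container.xml` points at a package (OPF) file holding the metadata and manifest. The sketch below reads the title and manifest with the standard library only. It does not show how `EpubReader` works (the package relies on ebooklib); `read_title_and_manifest` and `make_demo_epub` are hypothetical names, and the demo archive is a stripped-down stand-in, not a complete valid EPUB.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_title_and_manifest(epub_file):
    """Return (title, manifest hrefs) from an EPUB path or file-like object."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the package document via its full-path attribute.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))
    title = package.findtext("opf:metadata/dc:title", default="", namespaces=NS)
    hrefs = [item.get("href") for item in package.findall("opf:manifest/opf:item", NS)]
    return title, hrefs


def make_demo_epub():
    """Build a tiny in-memory EPUB-like archive for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr(
            "META-INF/container.xml",
            '<container version="1.0" '
            'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles></container>',
        )
        zf.writestr(
            "OEBPS/content.opf",
            '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
            '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
            "<dc:title>Demo Book</dc:title></metadata>"
            '<manifest><item id="c1" href="ch1.xhtml" '
            'media-type="application/xhtml+xml"/></manifest></package>',
        )
    buf.seek(0)
    return buf


print(read_title_and_manifest(make_demo_epub()))
```

Manifest `href` values are relative to the package file's directory, so a reader has to join them with that directory before opening chapter files.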
Requirements
- Python 3.10+
- ebooklib >= 0.17.1
- beautifulsoup4 >= 4.9.0
Use Cases
- Knowledge Base: Extract EPUB content for building AI training datasets
- Content Analysis: Process multiple books for NLP tasks
- Digital Library: Convert EPUB collections to searchable text/markdown
- Accessibility: Generate alternative formats from EPUB books
- Content Preservation: Archive book content in multiple formats
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For issues and questions:
- Check the documentation
- Review API Reference and Architecture Guide
- Search existing issues
Acknowledgments
- ebooklib - EPUB parsing
- BeautifulSoup - HTML/XML parsing