extraction
Multi-format document extraction library for processing EPUB, PDF, HTML, Markdown, and JSON documents into structured, hierarchical chunks with domain-specific metadata enrichment.
Version: 2.5.0
Features
- Multi-format support: EPUB, PDF, HTML, Markdown, JSON
- Chunking strategies: RAG/embeddings mode (100-500 words) and NLP/paragraph mode
- Hierarchical structure: Maintains 6-level heading hierarchy across documents
- Quality scoring: Automatic document quality assessment with routing (A/B/C)
- Noise filtering: Removes index pages, copyright boilerplate, navigation fragments
- Domain analyzers: Catholic literature and generic analyzers
- Formatting preservation: Poetry, blockquotes, lists, tables, emphasis
- Reference extraction: Scripture references, cross-references, dates
- Vatican pipeline: Specialized pipeline for vatican.va document processing
- Token re-chunking: Optimize chunks for embedding models (embeddinggemma-300m)
Installation
From PyPI (Recommended)
pip install doc-extraction
# With optional dependencies
pip install doc-extraction[pdf] # PDF support
pip install doc-extraction[vatican] # Vatican pipeline with S3
pip install doc-extraction[images] # Image scraping + EPUB creation
pip install doc-extraction[finetuning] # Token re-chunking tools
pip install doc-extraction[dev] # Testing tools
From Source (Development)
git clone https://github.com/hello-world-bfree/extraction.git
cd extraction
uv pip install -e ".[pdf,dev]"
Quick Start
Basic Usage
from extraction.extractors import EpubExtractor
from extraction.analyzers import GenericAnalyzer
# Extract chunks from EPUB
extractor = EpubExtractor("book.epub")
extractor.load()
extractor.parse()
metadata = extractor.extract_metadata()
# Get output
output = extractor.get_output_data()
print(f"Extracted {len(output['chunks'])} chunks")
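To persist the result for downstream tooling, the output dict serializes directly to JSON (a minimal sketch; the filename is arbitrary):

import json

with open("book.json", "w", encoding="utf-8") as f:
    json.dump(output, f, ensure_ascii=False, indent=2)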
With Domain Analysis
from extraction.extractors import EpubExtractor
from extraction.analyzers import CatholicAnalyzer
extractor = EpubExtractor("encyclical.epub")
extractor.load()
extractor.parse()
metadata = extractor.extract_metadata()
# Enrich with Catholic-specific metadata
analyzer = CatholicAnalyzer()
enriched = analyzer.enrich_metadata(
    metadata.to_dict(),
    extractor.full_text,
    [c.to_dict() for c in extractor.chunks]
)
output = extractor.get_output_data()
output['metadata'].update(enriched)
CLI Usage
# Extract single document (default: RAG chunking, noise filtering enabled)
extract document.epub
# Batch processing
extract documents/ -r --output-dir outputs/
# Custom chunking strategy
extract book.epub --chunking-strategy rag --min-chunk-words 200 --max-chunk-words 800
extract document.pdf --chunking-strategy nlp # Paragraph-level chunks
# Disable noise filtering (keep index pages, copyright, etc.)
extract document.html --no-filter-noise
# Enable formatting preservation
extract book.epub --preserve-formatting
# With Catholic domain analysis
extract encyclical.epub --analyzer catholic
# Vatican archive pipeline
vatican-extract --sections BIBLE CATECHISM --upload
Chunking Strategies
Choose between two chunking modes across all formats (EPUB, PDF, HTML, Markdown, JSON):
RAG/Semantic Mode (Default)
- Merges paragraphs under same heading hierarchy
- Target: 100-500 words per chunk (optimal for embeddings)
- Use for: Vector search, RAG systems, semantic retrieval
- ~60-80% reduction in chunk count vs paragraph mode
extract document.epub # Default
extract document.pdf --min-chunk-words 200 --max-chunk-words 800
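Conceptually, RAG mode is a greedy pass: paragraphs that share a heading path accumulate until the word budget is reached, then the buffer is flushed as one chunk. An illustrative sketch (not the library's actual implementation; input is assumed to be (heading_path, text) pairs):

def merge_paragraphs(paragraphs, max_words=500):
    """Greedily merge (heading_path, text) pairs into RAG-sized chunks."""
    chunks, buffer, buffer_path, words = [], [], None, 0
    for path, text in paragraphs:
        n = len(text.split())
        # Flush when the heading changes or the word budget would be exceeded
        if buffer and (path != buffer_path or words + n > max_words):
            chunks.append((buffer_path, " ".join(buffer)))
            buffer, words = [], 0
        buffer_path = path
        buffer.append(text)
        words += n
    if buffer:
        chunks.append((buffer_path, " ".join(buffer)))
    return chunks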
NLP/Paragraph Mode
- Paragraph-level chunks (one paragraph = one chunk)
- Use for: Fine-grained NLP tasks, sentence classification, NER
- Preserves exact paragraph boundaries
extract document.epub --chunking-strategy nlp
Token-Based Re-Chunking for Embeddings
Transform word-based extraction output into token-optimized chunks for embedding models:
# Install finetuning dependencies
pip install doc-extraction[finetuning]
# Retrieval mode: 256-400 tokens (precision-optimized)
token-rechunk document.json --mode retrieval
# Recommendation mode: 512-700 tokens (context-optimized)
token-rechunk document.json --mode recommendation
# Custom configuration
token-rechunk document.json --min-tokens 300 --max-tokens 500 --overlap-percent 0.12
# Batch processing for RAG applications
mkdir rag_corpus/
for file in extractions/*.json; do
  token-rechunk "$file" --mode retrieval --output "rag_corpus/$(basename "$file" .json).jsonl"
done
Features:
- Sentence-aware overlap (10-20%)
- Actual tokenization using embeddinggemma-300m tokenizer
- 2048 token hard limit with automatic splitting
- Hierarchy preservation across chunks
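The core mechanism amounts to packing sentences into a token budget and carrying a sentence-aligned tail into the next chunk as overlap. An illustrative sketch (not the shipped implementation), parameterized over any token-counting callable, e.g. one backed by the embeddinggemma-300m tokenizer:

def rechunk_by_tokens(sentences, count_tokens, max_tokens=400, overlap=0.15):
    """Pack sentences into token-budgeted chunks with sentence-aware overlap."""
    chunks, current, current_tokens = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry roughly `overlap` of the budget into the next chunk,
            # always on whole-sentence boundaries
            carried, carried_tokens = [], 0
            for prev in reversed(current):
                t = count_tokens(prev)
                if carried_tokens + t > max_tokens * overlap:
                    break
                carried.insert(0, prev)
                carried_tokens += t
            current, current_tokens = carried, carried_tokens
        current.append(sent)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# e.g. count_tokens = lambda s: len(tokenizer.encode(s)) with a Hugging Face tokenizer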
Architecture
Three-Layer Design
- Core Utilities (src/extraction/core/): format-agnostic text processing, chunking, quality scoring
  - Models: Chunk, Metadata, Provenance, Quality, Hierarchy
- Extractors (src/extraction/extractors/): format-specific parsers for EPUB, PDF, HTML, Markdown, JSON
  - All inherit from the BaseExtractor ABC
  - Produce uniform Chunk objects regardless of format
- Analyzers (src/extraction/analyzers/): domain-specific metadata enrichment
  - CatholicAnalyzer: document type, subjects, themes, related documents, geographic focus
  - GenericAnalyzer: basic metadata extraction for general content
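Adding a new format means subclassing the base extractor and producing Chunk objects. A hypothetical skeleton (the method names mirror the Quick Start above; the exact BaseExtractor interface, attributes, and import path are assumptions):

from extraction.extractors import BaseExtractor  # assumed import path

class TxtExtractor(BaseExtractor):
    """Hypothetical plain-text extractor, for illustration only."""

    def load(self):
        with open(self.source_file, encoding="utf-8") as f:
            self._raw = f.read()

    def parse(self):
        # One chunk per blank-line-separated paragraph (NLP-style chunking)
        self._paragraphs = [p.strip() for p in self._raw.split("\n\n") if p.strip()]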
Output Format
All extractors produce identical JSON structure:
{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "provenance": {
      "doc_id": "unique-id",
      "source_file": "path/to/file.epub"
    },
    "quality": {
      "score": 0.95,
      "route": "A"
    },
    "document_type": "Encyclical",
    "subjects": ["Liturgy", "Sacraments"]
  },
  "chunks": [
    {
      "stable_id": "abc123...",
      "text": "Chunk text content",
      "hierarchy": {
        "level_1": "Part I",
        "level_2": "Chapter 1"
      },
      "word_count": 42,
      "scripture_references": ["John 3:16"],
      "formatted_text": "> Blockquote with *emphasis*",
      "structure_metadata": {...}
    }
  ]
}
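Because the structure is uniform, downstream code can consume any extractor's output the same way; for example, keeping only documents the quality scorer routed to "A" (a minimal sketch; the filename is arbitrary):

import json

with open("document.json", encoding="utf-8") as f:
    doc = json.load(f)

# Skip documents the quality scorer routed below "A"
if doc["metadata"]["quality"]["route"] == "A":
    for chunk in doc["chunks"]:
        heading = " > ".join(chunk["hierarchy"].values())
        print(f"[{heading}] {chunk['word_count']} words")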
Noise Filtering
Automatic detection and removal of content with zero semantic value:
- Index pages: Reference lists, number sequences
- Copyright boilerplate: Legal notices, ISBN numbers, publisher info
- Navigation fragments: TOC entries, page numbers, "Next/Previous" links
Impact: ~3-5% chunk reduction with zero false positives (tested on 72k+ chunks)
# Enabled by default
extract document.html
# Disable if needed
extract document.html --no-filter-noise
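For a rough sense of what the filter targets (an illustrative heuristic, not the library's actual rules), a fragment that is mostly digits, a bare navigation phrase, or an ISBN line carries no semantic value:

import re

NAV_PHRASES = {"next", "previous", "table of contents", "back to top"}

def looks_like_noise(text: str) -> bool:
    """Illustrative heuristic: mostly-numeric or navigation-only fragments."""
    words = text.split()
    if not words:
        return True
    numeric_ratio = sum(w.strip(".,").isdigit() for w in words) / len(words)
    is_nav = text.strip().lower() in NAV_PHRASES
    is_isbn = bool(re.match(r"isbn[\s:\d-]+$", text.strip(), re.I))
    return numeric_ratio > 0.5 or is_nav or is_isbn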
Formatting Preservation
Preserve structural intent during extraction:
# Enable all formatting preservation
extract book.epub --preserve-formatting
# Fine-grained control
extract book.epub --preserve-formatting --no-preserve-tables
What gets preserved:
- Poetry/verse line breaks
- Blockquotes with attribution
- Nested lists (ordered/unordered)
- Tables (markdown format)
- Emphasis (italic/bold)
- Code blocks
Image Extraction to EPUB
Scrape images from websites and create EPUB photo galleries:
# Install image dependencies
pip install doc-extraction[images]
# Basic usage - scrape and build EPUB
extract-images https://example.com/gallery --title "Gallery" --output gallery.epub
# Just download images (no EPUB)
extract-images https://example.com/photos --output-dir ./images --no-epub
# With S3 backup
extract-images https://example.com --title "Photos" \
  --output photos.epub \
  --upload-s3 --s3-bucket my-bucket --s3-prefix "galleries/"
Configuration
Each extractor accepts a config dict with format-specific options:
from extraction.extractors import EpubExtractor
extractor = EpubExtractor("book.epub", config={
    'chunking_strategy': 'rag',              # 'rag' or 'nlp'
    'min_chunk_words': 100,                  # Minimum chunk size
    'max_chunk_words': 500,                  # Maximum chunk size
    'filter_noise': True,                    # Enable noise filtering
    'preserve_formatting': True,             # Preserve structure
    'preserve_hierarchy_across_docs': True,  # EPUB: hierarchy flows across spine
    'toc_hierarchy_level': 1,                # EPUB: TOC level to use
})
See documentation for all options.
Testing
# Run all tests
uv run pytest
# Skip integration tests
uv run pytest -m "not integration"
# Run with coverage
uv run pytest --cov=src/extraction --cov-report=html
Current status: 228 tests, 41% coverage
Requirements
- Python 3.13+ (required)
- uv for package management (recommended)
Documentation
- Homepage: https://hello-world-bfree.github.io/extraction/
- PyPI: https://pypi.org/project/doc-extraction/
- GitHub: https://github.com/hello-world-bfree/extraction
- Issues: https://github.com/hello-world-bfree/extraction/issues
For detailed documentation on architecture, adding extractors/analyzers, testing strategy, and common patterns, see CLAUDE.md.
Use Cases
Catholic Literature Processing
- Encyclicals, catechisms, prayer books
- Vatican archive document extraction
- Scripture reference extraction
General Document Processing
- Multi-format document conversion
- Hierarchical chunking for large documents
- Quality-based routing for document review
RAG/Embedding Applications
- Vector database population
- Semantic search corpus preparation
- Token-optimized chunk generation
Project Structure
extraction/
├── src/extraction/
│ ├── core/ # Core utilities (chunking, quality, extraction)
│ ├── extractors/ # Format-specific extractors
│ ├── analyzers/ # Domain analyzers
│ ├── builders/ # EPUB builder for image galleries
│ ├── scrapers/ # Image scraping utilities
│ ├── storage/ # S3 upload support
│ ├── cli/ # CLI entry points
│ ├── tools/ # Token re-chunking
│ └── pipelines/ # Specialized pipelines (Vatican)
├── tests/ # Test suite
├── docs/ # MkDocs documentation
├── examples/ # Example scripts
├── pyproject.toml # Package configuration
└── CLAUDE.md # Detailed development guide