
extraction

Multi-format document extraction library for processing EPUB, PDF, HTML, Markdown, and JSON documents into structured, hierarchical chunks with domain-specific metadata enrichment.

Version: 2.5.0

Features

  • Multi-format support: EPUB, PDF, HTML, Markdown, JSON
  • Chunking strategies: RAG/embeddings mode (100-500 words) and NLP/paragraph mode
  • Hierarchical structure: Maintains 6-level heading hierarchy across documents
  • Quality scoring: Automatic document quality assessment with routing (A/B/C)
  • Noise filtering: Removes index pages, copyright boilerplate, navigation fragments
  • Domain analyzers: Catholic literature and generic analyzers
  • Formatting preservation: Poetry, blockquotes, lists, tables, emphasis
  • Reference extraction: Scripture references, cross-references, dates
  • Vatican pipeline: Specialized pipeline for vatican.va document processing
  • Token re-chunking: Optimize chunks for embedding models (embeddinggemma-300m)

Installation

From PyPI (Recommended)

pip install doc-extraction

# With optional dependencies
pip install doc-extraction[pdf]              # PDF support
pip install doc-extraction[vatican]          # Vatican pipeline with S3
pip install doc-extraction[images]           # Image scraping + EPUB creation
pip install doc-extraction[finetuning]       # Token re-chunking tools
pip install doc-extraction[dev]              # Testing tools

From Source (Development)

git clone https://github.com/hello-world-bfree/extraction.git
cd extraction
uv pip install -e ".[pdf,dev]"

Quick Start

Basic Usage

from extraction.extractors import EpubExtractor
from extraction.analyzers import GenericAnalyzer

# Extract chunks from EPUB
extractor = EpubExtractor("book.epub")
extractor.load()
extractor.parse()
metadata = extractor.extract_metadata()

# Get output
output = extractor.get_output_data()
print(f"Extracted {len(output['chunks'])} chunks")

With Domain Analysis

from extraction.extractors import EpubExtractor
from extraction.analyzers import CatholicAnalyzer

extractor = EpubExtractor("encyclical.epub")
extractor.load()
extractor.parse()
metadata = extractor.extract_metadata()

# Enrich with Catholic-specific metadata
analyzer = CatholicAnalyzer()
enriched = analyzer.enrich_metadata(
    metadata.to_dict(),
    extractor.full_text,
    [c.to_dict() for c in extractor.chunks]
)

output = extractor.get_output_data()
output['metadata'].update(enriched)

CLI Usage

# Extract single document (default: RAG chunking, noise filtering enabled)
extract document.epub

# Batch processing
extract documents/ -r --output-dir outputs/

# Custom chunking strategy
extract book.epub --chunking-strategy rag --min-chunk-words 200 --max-chunk-words 800
extract document.pdf --chunking-strategy nlp  # Paragraph-level chunks

# Disable noise filtering (keep index pages, copyright, etc.)
extract document.html --no-filter-noise

# Enable formatting preservation
extract book.epub --preserve-formatting

# With Catholic domain analysis
extract encyclical.epub --analyzer catholic

# Vatican archive pipeline
vatican-extract --sections BIBLE CATECHISM --upload

Chunking Strategies

Choose between two chunking modes across all formats (EPUB, PDF, HTML, Markdown, JSON):

RAG/Semantic Mode (Default)

  • Merges paragraphs under the same heading hierarchy
  • Target: 100-500 words per chunk (optimal for embeddings)
  • Use for: Vector search, RAG systems, semantic retrieval
  • ~60-80% reduction in chunk count vs paragraph mode
extract document.epub  # Default
extract document.pdf --min-chunk-words 200 --max-chunk-words 800

NLP/Paragraph Mode

  • Paragraph-level chunks (one paragraph = one chunk)
  • Use for: Fine-grained NLP tasks, sentence classification, NER
  • Preserves exact paragraph boundaries
extract document.epub --chunking-strategy nlp

Token-Based Re-Chunking for Embeddings

Transform word-based extraction output into token-optimized chunks for embedding models:

# Install finetuning dependencies
pip install doc-extraction[finetuning]

# Retrieval mode: 256-400 tokens (precision-optimized)
token-rechunk document.json --mode retrieval

# Recommendation mode: 512-700 tokens (context-optimized)
token-rechunk document.json --mode recommendation

# Custom configuration
token-rechunk document.json --min-tokens 300 --max-tokens 500 --overlap-percent 0.12

# Batch processing for RAG applications
mkdir rag_corpus/
for file in extractions/*.json; do
    token-rechunk "$file" --mode retrieval --output "rag_corpus/$(basename $file .json).jsonl"
done

Features:

  • Sentence-aware overlap (10-20%)
  • Actual tokenization using embeddinggemma-300m tokenizer
  • 2048 token hard limit with automatic splitting
  • Hierarchy preservation across chunks
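
For intuition, here is a minimal sketch of sentence-aware packing with overlap. It is not the library's implementation: a plain whitespace split stands in for the embeddinggemma-300m tokenizer, and all names are illustrative.

# Illustrative stand-in for the real tokenizer (embeddinggemma-300m).
def count_tokens(text: str) -> int:
    return len(text.split())

def rechunk(sentences, max_tokens=400, overlap_percent=0.15):
    """Pack sentences into token-bounded chunks with sentence-aware overlap."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward so ~overlap_percent of the
            # finished chunk is repeated at the start of the next one.
            budget = int(current_tokens * overlap_percent)
            carried, carried_tokens = [], 0
            for prev in reversed(current):
                carried.insert(0, prev)
                carried_tokens += count_tokens(prev)
                if carried_tokens >= budget:
                    break
            current, current_tokens = carried, carried_tokens
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks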

Architecture

Three-Layer Design

  1. Core Utilities (src/extraction/core/)
    • Format-agnostic text processing, chunking, quality scoring
    • Models: Chunk, Metadata, Provenance, Quality, Hierarchy
  2. Extractors (src/extraction/extractors/)
    • Format-specific parsers: EPUB, PDF, HTML, Markdown, JSON
    • All inherit from the BaseExtractor ABC
    • Produce uniform Chunk objects regardless of format
  3. Analyzers (src/extraction/analyzers/)
    • Domain-specific metadata enrichment
    • CatholicAnalyzer: Document type, subjects, themes, related documents, geographic focus
    • GenericAnalyzer: Basic metadata extraction for general content
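
Analyzers plug in behind a single method. A hypothetical custom analyzer, assuming only the enrich_metadata(metadata, full_text, chunks) call pattern shown in Quick Start, might look like this:

# Hypothetical example; the class name and heuristic are illustrative.
class LegalAnalyzer:
    def enrich_metadata(self, metadata: dict, full_text: str,
                        chunks: list[dict]) -> dict:
        # Return only the new fields; the caller merges them into
        # output['metadata'] as in the Quick Start example.
        enriched = {}
        if "hereinafter" in full_text.lower():
            enriched["document_type"] = "Contract"
        return enriched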

Output Format

All extractors produce identical JSON structure:

{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "provenance": {
      "doc_id": "unique-id",
      "source_file": "path/to/file.epub"
    },
    "quality": {
      "score": 0.95,
      "route": "A"
    },
    "document_type": "Encyclical",
    "subjects": ["Liturgy", "Sacraments"]
  },
  "chunks": [
    {
      "stable_id": "abc123...",
      "text": "Chunk text content",
      "hierarchy": {
        "level_1": "Part I",
        "level_2": "Chapter 1"
      },
      "word_count": 42,
      "scripture_references": ["John 3:16"],
      "formatted_text": "> Blockquote with *emphasis*",
      "structure_metadata": {...}
    }
  ]
}
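
Because every extractor emits the same shape, downstream code can stay format-agnostic. A small stdlib-only example of consuming this output (field names taken from the sample above):

import json

with open("document.json") as f:
    doc = json.load(f)

# Gate on the quality route before indexing.
if doc["metadata"]["quality"]["route"] == "A":
    for chunk in doc["chunks"]:
        path = " > ".join(chunk["hierarchy"].values())
        print(f"[{path}] {chunk['word_count']} words")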

Noise Filtering

Automatic detection and removal of content with zero semantic value:

  • Index pages: Reference lists, number sequences
  • Copyright boilerplate: Legal notices, ISBN numbers, publisher info
  • Navigation fragments: TOC entries, page numbers, "Next/Previous" links

Impact: ~3-5% chunk reduction with zero false positives (tested on 72k+ chunks)

# Enabled by default
extract document.html

# Disable if needed
extract document.html --no-filter-noise
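
The same switch is available programmatically through the config dict (see Configuration below); a minimal sketch:

from extraction.extractors import EpubExtractor

# filter_noise mirrors the CLI flag above.
extractor = EpubExtractor("book.epub", config={"filter_noise": False})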

Formatting Preservation

Preserve structural intent during extraction:

# Enable all formatting preservation
extract book.epub --preserve-formatting

# Fine-grained control
extract book.epub --preserve-formatting --no-preserve-tables

What gets preserved:

  • Poetry/verse line breaks
  • Blockquotes with attribution
  • Nested lists (ordered/unordered)
  • Tables (markdown format)
  • Emphasis (italic/bold)
  • Code blocks
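
When enabled, the preserved structure is carried in each chunk's formatted_text field alongside the plain text (see Output Format above). A minimal sketch:

from extraction.extractors import EpubExtractor

extractor = EpubExtractor("book.epub", config={"preserve_formatting": True})
extractor.load()
extractor.parse()
extractor.extract_metadata()

for chunk in extractor.get_output_data()["chunks"]:
    # Falls back to plain text when a chunk has no special formatting.
    print(chunk.get("formatted_text") or chunk["text"])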

Image Extraction to EPUB

Scrape images from websites and create EPUB photo galleries:

# Install image dependencies
pip install doc-extraction[images]

# Basic usage - scrape and build EPUB
extract-images https://example.com/gallery --title "Gallery" --output gallery.epub

# Just download images (no EPUB)
extract-images https://example.com/photos --output-dir ./images --no-epub

# With S3 backup
extract-images https://example.com --title "Photos" \
  --output photos.epub \
  --upload-s3 --s3-bucket my-bucket --s3-prefix "galleries/"

Configuration

Each extractor accepts a config dict with format-specific options:

from extraction.extractors import EpubExtractor

extractor = EpubExtractor("book.epub", config={
    'chunking_strategy': 'rag',           # 'rag' or 'nlp'
    'min_chunk_words': 100,               # Minimum chunk size
    'max_chunk_words': 500,               # Maximum chunk size
    'filter_noise': True,                 # Enable noise filtering
    'preserve_formatting': True,          # Preserve structure
    'preserve_hierarchy_across_docs': True,  # EPUB: hierarchy flows across spine
    'toc_hierarchy_level': 1,             # EPUB: TOC level to use
})

See documentation for all options.

Testing

# Run all tests
uv run pytest

# Skip integration tests
uv run pytest -m "not integration"

# Run with coverage
uv run pytest --cov=src/extraction --cov-report=html

Current status: 228 tests, 41% coverage

Requirements

  • Python 3.13+ (required)
  • uv for package management (recommended)

Documentation

For detailed documentation on architecture, adding extractors/analyzers, testing strategy, and common patterns, see CLAUDE.md.

Use Cases

Catholic Literature Processing

  • Encyclicals, catechisms, prayer books
  • Vatican archive document extraction
  • Scripture reference extraction

General Document Processing

  • Multi-format document conversion
  • Hierarchical chunking for large documents
  • Quality-based routing for document review

RAG/Embedding Applications

  • Vector database population
  • Semantic search corpus preparation
  • Token-optimized chunk generation

Project Structure

extraction/
├── src/extraction/
│   ├── core/          # Core utilities (chunking, quality, extraction)
│   ├── extractors/    # Format-specific extractors
│   ├── analyzers/     # Domain analyzers
│   ├── builders/      # EPUB builder for image galleries
│   ├── scrapers/      # Image scraping utilities
│   ├── storage/       # S3 upload support
│   ├── cli/           # CLI entry points
│   ├── tools/         # Token re-chunking
│   └── pipelines/     # Specialized pipelines (Vatican)
├── tests/             # Test suite
├── docs/              # MkDocs documentation
├── examples/          # Example scripts
├── pyproject.toml     # Package configuration
└── CLAUDE.md          # Detailed development guide
