
PDF Chunker Library v2.0

A production-ready Python library for intelligently chunking PDF documents using sophisticated font analysis, enhanced content filtering, and strategic header detection.

🚀 Features

  • Strategic Header Chunking: Advanced font-size analysis with frequency-based header selection
  • Enhanced Meaning Detection: AI-powered content analysis with metadata pattern filtering
  • Multi-Level Processing: Undersized → Oversized → Hierarchical sub-chunking pipeline
  • Robust Content Filtering: Removes document metadata, page markers, and meaningless fragments
  • Smart Chunk Processing: Intelligent merging of meaningful short chunks
  • Professional Summarization: Extractive summaries with rich metadata output
  • Dual Usage Modes: Simple convenience methods AND advanced custom processing
  • Multiple Output Formats: JSON, CSV, and custom formats with rich metadata

📦 Installation

Basic Installation

pip install PyMuPDF pypdf

🎯 Quick Start - Two Approaches

🟢 Approach 1: Simple Convenience (Recommended for Most Users)

Perfect for: Quick prototyping, standard use cases, minimal configuration

from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Initialize and process in one line
chunker = CleanHybridPDFChunker()
output_file = chunker.process_and_save('document.pdf')
print(f"✅ Chunks saved to: {output_file}")

Run the example:

cd examples/
python simple_usage.py

What you get:

  • Automatic header detection and chunking
  • JSON output with metadata
  • Multiple format options (JSON/CSV)
  • Error handling and validation

🔵 Approach 2: Advanced Custom Processing

Perfect for: Custom applications, data analysis, integration with other systems

from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Get raw chunk data for custom processing
chunker = CleanHybridPDFChunker()
chunks, headers = chunker.strategic_header_chunking('document.pdf')

# Now you have direct access to chunk data
for chunk in chunks:
    topic = chunk['topic']
    content = chunk['content'] 
    word_count = chunk['word_count']
    # Your custom logic here...

# Save however you want
import json
with open('my_chunks.json', 'w') as f:
    json.dump({'chunks': chunks}, f, indent=2)

Run the example:

cd examples/
python advanced_usage.py

What you get:

  • Direct access to chunk data and headers
  • Custom filtering and analysis
  • Multiple output formats with custom metadata
  • Advanced statistics and reporting
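With direct access to the chunk data, reporting is plain Python. The sketch below computes simple statistics over a list of chunk dicts, assuming only the `topic`, `content`, and `word_count` keys shown in the advanced example above:

```python
from collections import Counter

def chunk_statistics(chunks):
    """Compute simple statistics over chunk dicts.

    Assumes each chunk is a dict with 'topic' and 'word_count' keys,
    as shown in the advanced example above.
    """
    if not chunks:
        return {"count": 0, "total_words": 0, "avg_words": 0.0}
    counts = [c["word_count"] for c in chunks]
    topics = Counter(c["topic"] for c in chunks)
    return {
        "count": len(chunks),
        "total_words": sum(counts),
        "avg_words": sum(counts) / len(counts),
        "min_words": min(counts),
        "max_words": max(counts),
        "most_common_topic": topics.most_common(1)[0][0],
    }

# Example with synthetic chunk data
sample = [
    {"topic": "Intro", "content": "...", "word_count": 120},
    {"topic": "Methods", "content": "...", "word_count": 240},
    {"topic": "Methods", "content": "...", "word_count": 180},
]
stats = chunk_statistics(sample)
```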

With Enhanced NLP (recommended)

pip install PyMuPDF pypdf spacy
python -m spacy download en_core_web_sm

Development Installation

pip install -e .[dev,nlp]

Quick Start

from pdf_chunker_for_rag import CleanHybridPDFChunker

# Initialize the production chunker
chunker = CleanHybridPDFChunker()

# Process PDF with strategic header chunking
chunks, headers = chunker.strategic_header_chunking(
    pdf_path="your_document.pdf",
    target_words_per_chunk=200
)

print(f"✅ Created {len(chunks)} structured chunks")
print(f"📊 Average chunk size: {sum(c.get('word_count', 0) for c in chunks) // len(chunks)} words")

# Access chunk data
for chunk in chunks:
    print(f"📖 {chunk['topic']} ({chunk['word_count']} words)")
    print(f"📋 {chunk['summary']}")
    print()

Advanced Usage

from pdf_chunker_for_rag import PDFChunker, ChunkingConfig, SummarizationMethod

# Custom configuration
config = ChunkingConfig(
    target_words_per_chunk=300,
    min_header_occurrences=2,
    oversized_threshold=600,
    critical_threshold=1000,
    min_meaningful_words=30,
    summarization_method=SummarizationMethod.EXTRACTIVE
)

chunker = PDFChunker(config)
result = chunker.chunk_pdf("your_document.pdf")

Key Classes

PDFChunker

Main interface for PDF chunking operations.

Methods:

  • chunk_pdf(pdf_path): Complete chunking process
  • detect_headers(pdf_path): Header detection only
  • extract_text(pdf_path): Text extraction only
  • get_font_analysis(pdf_path): Font analysis only

ChunkingConfig

Configuration for chunking behavior.

Parameters:

  • target_words_per_chunk: Target words per chunk (default: 200)
  • min_header_occurrences: Minimum header occurrences for selection (default: 3)
  • font_size_tolerance: Tolerance for font size grouping (default: 2.0)
  • oversized_threshold: Word count threshold for oversized chunks (default: 500)
  • critical_threshold: Critical threshold requiring forced splitting (default: 800)
  • min_meaningful_words: Minimum words for meaningful chunks (default: 50)

Data Structures

ChunkData: Represents a processed chunk

  • chunk_id: Unique identifier
  • topic: Header/topic text
  • content: Chunk content
  • word_count: Number of words
  • summary: Generated summary
  • parent_chunk_info: Information about parent chunk (for split chunks)

HeaderData: Represents a detected header

  • text: Header text
  • font_size: Font size in points
  • page: Page number
  • is_bold: Whether header is bold
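The header fields above are what frequency-based selection works from. The following is an illustrative sketch of that idea (not the library's exact algorithm): group detected headers by font size within a tolerance, keep sizes that recur often enough, and prefer the largest qualifying size.

```python
from collections import defaultdict

def select_header_level(headers, min_occurrences=3, tolerance=2.0):
    """Pick a header font size by frequency -- an illustrative sketch,
    not the library's exact algorithm.

    `headers` are dicts with the HeaderData fields
    ('text', 'font_size', 'page', 'is_bold').
    """
    groups = defaultdict(list)
    for h in headers:
        # Bucket font sizes by tolerance so 13.9pt and 14.1pt group together
        bucket = round(h["font_size"] / tolerance) * tolerance
        groups[bucket].append(h)
    # Keep only sizes that occur often enough to structure the document
    candidates = {s: hs for s, hs in groups.items() if len(hs) >= min_occurrences}
    if not candidates:
        return None, []
    # Prefer the largest qualifying size (most likely section-level headers)
    best = max(candidates)
    return best, candidates[best]

headers = [
    {"text": "Overview", "font_size": 14.0, "page": 1, "is_bold": True},
    {"text": "Details", "font_size": 14.1, "page": 2, "is_bold": True},
    {"text": "Appendix", "font_size": 13.9, "page": 5, "is_bold": True},
    {"text": "Title", "font_size": 22.0, "page": 1, "is_bold": True},
]
size, selected = select_header_level(headers, min_occurrences=3)
```

Here the 22pt title occurs only once, so the recurring ~14pt level is selected.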

Processing Pipeline

  1. Font Analysis: Analyze document fonts and determine normal text size
  2. Header Detection: Identify potential headers based on font size
  3. Strategic Selection: Select optimal header level using frequency analysis
  4. Text Extraction: Extract text with proper reading order
  5. Chunk Creation: Create initial chunks based on headers
  6. Content Filtering: Remove meaningless content and merge short meaningful chunks
  7. Summarization: Generate summaries for all chunks
  8. Oversized Processing: Handle large chunks through sub-header detection or forced splitting

Content Quality Features

Meaningless Content Detection

  • Version numbers and dates
  • Page markers and formatting artifacts
  • Low meaningful word ratios
  • Incomplete sentences and titles
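A heuristic for the first three categories can be sketched with a few regular expressions. The patterns below are illustrative stand-ins, not the library's actual filter rules:

```python
import re

# Illustrative patterns for the categories listed above -- not the
# library's actual filter rules.
METADATA_PATTERNS = [
    re.compile(r"^v?\d+\.\d+(\.\d+)?$"),          # version numbers
    re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),     # dates like 3/14/2024
    re.compile(r"^page \d+( of \d+)?$", re.I),    # page markers
]

def looks_meaningless(text, min_words=5):
    """Heuristic check: metadata patterns or too few words."""
    stripped = text.strip()
    if any(p.match(stripped) for p in METADATA_PATTERNS):
        return True
    return len(stripped.split()) < min_words
```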

Smart Merging

  • Preserves short but meaningful content
  • Forward-direction merging with adjacent chunks
  • Maintains topic coherence
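Forward-direction merging can be sketched as follows: a chunk below the word threshold is held and prepended to the next chunk rather than dropped. The dict layout and the choice to keep the later chunk's topic are assumptions for illustration:

```python
def merge_short_chunks(chunks, min_words=50):
    """Forward-merge chunks shorter than min_words into the next chunk.

    A sketch of the merging direction described above; chunks are dicts
    with 'topic', 'content', and 'word_count' keys.
    """
    merged = []
    carry = None
    for chunk in chunks:
        if carry is not None:
            # Prepend the held short chunk, keeping the later topic
            chunk = {
                "topic": chunk["topic"],
                "content": carry["content"] + "\n" + chunk["content"],
                "word_count": carry["word_count"] + chunk["word_count"],
            }
            carry = None
        if chunk["word_count"] < min_words:
            carry = chunk  # hold it to merge forward
        else:
            merged.append(chunk)
    if carry is not None and merged:  # trailing short chunk: merge backward
        last = merged[-1]
        merged[-1] = {
            "topic": last["topic"],
            "content": last["content"] + "\n" + carry["content"],
            "word_count": last["word_count"] + carry["word_count"],
        }
    elif carry is not None:
        merged.append(carry)
    return merged

short_then_long = [
    {"topic": "A", "content": "short intro", "word_count": 10},
    {"topic": "B", "content": "full section body", "word_count": 120},
]
result = merge_short_chunks(short_then_long)
```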

NLP-Enhanced Analysis (with spaCy)

  • Sentence structure analysis
  • Named entity recognition
  • Vocabulary diversity scoring
  • Professional content detection
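Vocabulary diversity scoring can be approximated with a type-token ratio. The library's analysis uses spaCy; the sketch below uses plain whitespace tokens so it runs without a model installed, and is only a rough stand-in:

```python
def vocabulary_diversity(text):
    """Type-token ratio as a rough diversity score in [0, 1].

    The library's NLP analysis uses spaCy tokenisation; this sketch
    uses plain whitespace tokens so it runs without a model installed.
    """
    tokens = [t.lower().strip(".,;:!?") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

score_repetitive = vocabulary_diversity("data data data data")
score_varied = vocabulary_diversity("Chunking splits documents into coherent pieces.")
```

Repetitive text scores low (one unique token out of four here), while varied text approaches 1.0.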

Library Architecture

pdf_chunker_for_rag/
├── core/           # Core types and main chunker class
├── analysis/       # Font analysis and header detection
├── filtering/      # Content quality filtering and merging
├── processing/     # Summarization and oversized chunk handling
└── utils/          # Text extraction and utility functions

Examples

Processing Multiple PDFs

import os
from pdf_chunker_for_rag import PDFChunker

chunker = PDFChunker()
results = {}

for filename in os.listdir("pdfs/"):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join("pdfs", filename)
        results[filename] = chunker.chunk_pdf(pdf_path)

# Analyze results
for filename, result in results.items():
    print(f"{filename}: {len(result.chunks)} chunks, "
          f"avg {result.average_chunk_size:.0f} words")

Custom Content Filtering

from pdf_chunker_for_rag.filtering import ContentFilter

# Create custom filter
content_filter = ContentFilter(min_meaningful_words=30)

# Check if content is meaningful
is_meaningful = content_filter.has_meaningful_sentence_structure("Your text here")
is_meaningless = content_filter.is_meaningless_content("Your text here")

Font Analysis Only

from pdf_chunker_for_rag.analysis import FontAnalyzer

analyzer = FontAnalyzer()
font_info = analyzer.analyze_document_fonts("document.pdf")

print(f"Normal text size: {font_info['normal_font_size']:.1f}pt")
print(f"Header threshold: {font_info['min_header_threshold']:.1f}pt")
print(f"Unique font sizes: {len(font_info['all_font_sizes'])}")

Requirements

  • Python 3.8+
  • PyMuPDF (fitz) >= 1.20.0
  • pypdf >= 3.0.0
  • spaCy >= 3.4.0 (optional, for enhanced NLP features)

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Changelog

Version 1.0.0

  • Initial release
  • Complete modular architecture
  • Font-based header detection
  • Content quality filtering
  • Smart chunk merging
  • Multiple summarization methods
  • Oversized chunk processing
