PDF Chunker Library v2.0
A production-ready Python library for intelligently chunking PDF documents using sophisticated font analysis, enhanced content filtering, and strategic header detection.
🚀 Features
- Strategic Header Chunking: Advanced font-size analysis with frequency-based header selection
- Enhanced Meaning Detection: AI-powered content analysis with metadata pattern filtering
- Multi-Level Processing: Undersized → Oversized → Hierarchical sub-chunking pipeline
- Robust Content Filtering: Removes document metadata, page markers, and meaningless fragments
- Smart Chunk Processing: Intelligent merging of meaningful short chunks
- Professional Summarization: Extractive summaries with rich metadata output
- Dual Usage Modes: Simple convenience methods AND advanced custom processing
- Multiple Output Formats: JSON, CSV, and custom formats with rich metadata
📦 Installation
Basic Installation
```bash
pip install PyMuPDF pypdf
```
🎯 Quick Start - Two Approaches
🟢 Approach 1: Simple Convenience (Recommended for Most Users)
Perfect for: Quick prototyping, standard use cases, minimal configuration
```python
from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Initialize and process in one line
chunker = CleanHybridPDFChunker()
output_file = chunker.process_and_save('document.pdf')
print(f"✅ Chunks saved to: {output_file}")
```
Run the example:
```bash
cd examples/
python simple_usage.py
```
What you get:
- Automatic header detection and chunking
- JSON output with metadata
- Multiple format options (JSON/CSV)
- Error handling and validation
🔵 Approach 2: Advanced Custom Processing
Perfect for: Custom applications, data analysis, integration with other systems
```python
import json

from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Get raw chunk data for custom processing
chunker = CleanHybridPDFChunker()
chunks, headers = chunker.strategic_header_chunking('document.pdf')

# Now you have direct access to chunk data
for chunk in chunks:
    topic = chunk['topic']
    content = chunk['content']
    word_count = chunk['word_count']
    # Your custom logic here...

# Save however you want (your own save logic here)
with open('my_chunks.json', 'w') as f:
    json.dump({'chunks': chunks}, f, indent=2)
```
**Run the example:**
```bash
cd examples/
python advanced_usage.py
```
What you get:
- Direct access to chunk data and headers
- Custom filtering and analysis
- Multiple output formats with custom metadata
- Advanced statistics and reporting
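As one example of custom processing, the raw chunk dictionaries returned by `strategic_header_chunking` can be written to CSV with the standard library (the field names below match the chunk keys shown in the example above; the output filename is arbitrary):

```python
import csv

# chunks as returned by strategic_header_chunking (same shape as above)
chunks = [
    {"topic": "Introduction", "content": "Overview of the library.", "word_count": 4},
    {"topic": "Installation", "content": "pip install instructions.", "word_count": 3},
]

with open("my_chunks.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["topic", "word_count", "content"])
    writer.writeheader()
    writer.writerows(chunks)
```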
With Enhanced NLP (recommended)
```bash
pip install PyMuPDF pypdf spacy
python -m spacy download en_core_web_sm
```
Development Installation
```bash
pip install -e .[dev,nlp]
```
Quick Start
```python
from pdf_chunker_for_rag import CleanHybridPDFChunker

# Initialize the production chunker
chunker = CleanHybridPDFChunker()

# Process PDF with strategic header chunking
chunks = chunker.strategic_header_chunking(
    pdf_path="your_document.pdf",
    target_words_per_chunk=200
)

print(f"✅ Created {len(chunks)} structured chunks")
print(f"📊 Average chunk size: {sum(c.get('word_count', 0) for c in chunks) // len(chunks)} words")

# Access chunk data
for chunk in chunks:
    print(f"📖 {chunk['topic']} ({chunk['word_count']} words)")
    print(f"📋 {chunk['summary']}")
    print()
```
Advanced Usage
```python
from pdf_chunker_for_rag import PDFChunker, ChunkingConfig, SummarizationMethod

# Custom configuration
config = ChunkingConfig(
    target_words_per_chunk=300,
    min_header_occurrences=2,
    oversized_threshold=600,
    critical_threshold=1000,
    min_meaningful_words=30,
    summarization_method=SummarizationMethod.EXTRACTIVE
)

chunker = PDFChunker(config)
result = chunker.chunk_pdf("your_document.pdf")
```
Key Classes
PDFChunker
Main interface for PDF chunking operations.
Methods:
- `chunk_pdf(pdf_path)`: Complete chunking process
- `detect_headers(pdf_path)`: Header detection only
- `extract_text(pdf_path)`: Text extraction only
- `get_font_analysis(pdf_path)`: Font analysis only
ChunkingConfig
Configuration for chunking behavior.
Parameters:
- `target_words_per_chunk`: Target words per chunk (default: 200)
- `min_header_occurrences`: Minimum header occurrences for selection (default: 3)
- `font_size_tolerance`: Tolerance for font size grouping (default: 2.0)
- `oversized_threshold`: Word count threshold for oversized chunks (default: 500)
- `critical_threshold`: Critical threshold requiring forced splitting (default: 800)
- `min_meaningful_words`: Minimum words for meaningful chunks (default: 50)
Data Structures
ChunkData: Represents a processed chunk
- `chunk_id`: Unique identifier
- `topic`: Header/topic text
- `content`: Chunk content
- `word_count`: Number of words
- `summary`: Generated summary
- `parent_chunk_info`: Information about the parent chunk (for split chunks)
HeaderData: Represents a detected header
- `text`: Header text
- `font_size`: Font size in points
- `page`: Page number
- `is_bold`: Whether the header is bold
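The two structures can be sketched as dataclasses (field names are taken from the lists above; the library's actual class definitions may differ in detail):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeaderData:
    text: str
    font_size: float  # points
    page: int
    is_bold: bool


@dataclass
class ChunkData:
    chunk_id: str
    topic: str
    content: str
    word_count: int
    summary: str
    parent_chunk_info: Optional[dict] = None  # set only for split chunks


header = HeaderData(text="Introduction", font_size=16.0, page=1, is_bold=True)
chunk = ChunkData(chunk_id="chunk_001", topic=header.text,
                  content="PDF chunking splits documents into sections.",
                  word_count=6, summary="PDF chunking overview.")
```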
Processing Pipeline
1. Font Analysis: Analyze document fonts and determine normal text size
2. Header Detection: Identify potential headers based on font size
3. Strategic Selection: Select the optimal header level using frequency analysis
4. Text Extraction: Extract text in proper reading order
5. Chunk Creation: Create initial chunks based on headers
6. Content Filtering: Remove meaningless content and merge short meaningful chunks
7. Summarization: Generate summaries for all chunks
8. Oversized Processing: Handle large chunks through sub-header detection or forced splitting
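The oversized-processing stage can be pictured as a simple routing decision on word count, using the `ChunkingConfig` defaults listed below (this is an illustrative sketch, not the library's internal code):

```python
def route_oversized_chunk(chunk, oversized_threshold=500, critical_threshold=800):
    """Illustrative routing for the oversized-processing stage:
    normal chunks are kept, oversized ones get sub-header detection,
    and chunks past the critical threshold are force-split."""
    wc = chunk["word_count"]
    if wc <= oversized_threshold:
        return "keep"
    if wc <= critical_threshold:
        return "sub_header_split"  # try hierarchical sub-chunking first
    return "forced_split"          # too large: split unconditionally
```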
Content Quality Features
Meaningless Content Detection
- Version numbers and dates
- Page markers and formatting artifacts
- Low meaningful word ratios
- Incomplete sentences and titles
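A rough sketch of how those categories can be caught with pattern matching (the patterns and threshold here are illustrative; the library's `ContentFilter` rules are not shown):

```python
import re

# Illustrative patterns for document metadata that carries no content
METADATA_PATTERNS = [
    re.compile(r"^v?\d+\.\d+(\.\d+)?$"),             # version numbers
    re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),        # dates
    re.compile(r"^page\s+\d+(\s+of\s+\d+)?$", re.I), # page markers
]


def looks_meaningless(text: str, min_words: int = 5) -> bool:
    """Flag metadata fragments and texts with too few meaningful words."""
    stripped = text.strip()
    if any(p.match(stripped) for p in METADATA_PATTERNS):
        return True
    # crude "meaningful word" proxy: alphabetic tokens of 3+ letters
    words = re.findall(r"[A-Za-z]{3,}", stripped)
    return len(words) < min_words
```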
Smart Merging
- Preserves short but meaningful content
- Forward-direction merging with adjacent chunks
- Maintains topic coherence
NLP-Enhanced Analysis (with spaCy)
- Sentence structure analysis
- Named entity recognition
- Vocabulary diversity scoring
- Professional content detection
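Vocabulary diversity scoring can be approximated without spaCy as a type-token ratio; this is only a rough stand-in for the library's NLP-based analysis:

```python
import re


def vocabulary_diversity(text: str) -> float:
    """Type-token ratio: unique words / total words, in [0.0, 1.0].
    Repetitive boilerplate scores low; varied prose scores high."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```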
Library Architecture
```
pdf_chunker_for_rag/
├── core/        # Core types and main chunker class
├── analysis/    # Font analysis and header detection
├── filtering/   # Content quality filtering and merging
├── processing/  # Summarization and oversized chunk handling
└── utils/       # Text extraction and utility functions
```
Examples
Processing Multiple PDFs
```python
import os

from pdf_chunker_for_rag import PDFChunker

chunker = PDFChunker()
results = {}

for filename in os.listdir("pdfs/"):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join("pdfs", filename)
        results[filename] = chunker.chunk_pdf(pdf_path)

# Analyze results
for filename, result in results.items():
    print(f"{filename}: {len(result.chunks)} chunks, "
          f"avg {result.average_chunk_size:.0f} words")
```
Custom Content Filtering
```python
from pdf_chunker_for_rag.filtering import ContentFilter

# Create a custom filter (named to avoid shadowing the builtin `filter`)
content_filter = ContentFilter(min_meaningful_words=30)

# Check if content is meaningful
is_meaningful = content_filter.has_meaningful_sentence_structure("Your text here")
is_meaningless = content_filter.is_meaningless_content("Your text here")
```
Font Analysis Only
```python
from pdf_chunker_for_rag.analysis import FontAnalyzer

analyzer = FontAnalyzer()
font_info = analyzer.analyze_document_fonts("document.pdf")

print(f"Normal text size: {font_info['normal_font_size']:.1f}pt")
print(f"Header threshold: {font_info['min_header_threshold']:.1f}pt")
print(f"Unique font sizes: {len(font_info['all_font_sizes'])}")
```
Requirements
- Python 3.8+
- PyMuPDF (fitz) >= 1.20.0
- pypdf >= 3.0.0
- spaCy >= 3.4.0 (optional, for enhanced NLP features)
License
MIT License - see LICENSE file for details.
Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
Changelog
Version 1.0.0
- Initial release
- Complete modular architecture
- Font-based header detection
- Content quality filtering
- Smart chunk merging
- Multiple summarization methods
- Oversized chunk processing