PDF Chunker Library v2.0
A production-ready Python library for intelligently chunking PDF documents using sophisticated font analysis, enhanced content filtering, and strategic header detection.
🚀 Features
- Strategic Header Chunking: Advanced font-size analysis with frequency-based header selection
- Enhanced Meaning Detection: AI-powered content analysis with metadata pattern filtering
- Multi-Level Processing: Undersized → Oversized → Hierarchical sub-chunking pipeline
- Robust Content Filtering: Removes document metadata, page markers, and meaningless fragments
- Smart Chunk Processing: Intelligent merging of meaningful short chunks
- Professional Summarization: Extractive summaries with rich metadata output
- Dual Usage Modes: Simple convenience methods AND advanced custom processing
- Multiple Output Formats: JSON, CSV, and custom formats with rich metadata
📦 Installation
Basic Installation
```bash
pip install PyMuPDF pypdf
```
🎯 Quick Start - Two Approaches
🟢 Approach 1: Simple Convenience (Recommended for Most Users)
Perfect for: Quick prototyping, standard use cases, minimal configuration
```python
from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Initialize and process in one line
chunker = CleanHybridPDFChunker()
output_file = chunker.process_and_save('document.pdf')
print(f"✅ Chunks saved to: {output_file}")
```
Run the example:
```bash
cd examples/
python simple_usage.py
```
What you get:
- Automatic header detection and chunking
- JSON output with metadata
- Multiple format options (JSON/CSV)
- Error handling and validation
🔵 Approach 2: Advanced Custom Processing
Perfect for: Custom applications, data analysis, integration with other systems
```python
import json

from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Get raw chunk data for custom processing
chunker = CleanHybridPDFChunker()
chunks, headers = chunker.strategic_header_chunking('document.pdf')

# Now you have direct access to chunk data
for chunk in chunks:
    topic = chunk['topic']
    content = chunk['content']
    word_count = chunk['word_count']
    # Your custom logic here...

# Save however you want (your own save logic here)
with open('my_chunks.json', 'w') as f:
    json.dump({'chunks': chunks}, f, indent=2)
```
**Run the example:**
```bash
cd examples/
python advanced_usage.py
```
What you get:
- Direct access to chunk data and headers
- Custom filtering and analysis
- Multiple output formats with custom metadata
- Advanced statistics and reporting
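As one example of custom processing, the raw chunk dictionaries returned by `strategic_header_chunking` can be written to CSV with the standard library (the field names below match the chunk keys shown in the example above; the output filename is arbitrary):

```python
import csv

# chunks as returned by strategic_header_chunking (same shape as above)
chunks = [
    {"topic": "Introduction", "content": "Overview of the library.", "word_count": 4},
    {"topic": "Installation", "content": "pip install instructions.", "word_count": 3},
]

with open("my_chunks.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["topic", "word_count", "content"])
    writer.writeheader()
    writer.writerows(chunks)
```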
With Enhanced NLP (recommended)
```bash
pip install PyMuPDF pypdf spacy
python -m spacy download en_core_web_sm
```
Development Installation
```bash
pip install -e .[dev,nlp]
```
Quick Start
```python
from pdf_chunker_for_rag import CleanHybridPDFChunker

# Initialize the production chunker
chunker = CleanHybridPDFChunker()

# Process PDF with strategic header chunking
chunks = chunker.strategic_header_chunking(
    pdf_path="your_document.pdf",
    target_words_per_chunk=200
)

print(f"✅ Created {len(chunks)} structured chunks")
print(f"📊 Average chunk size: {sum(c.get('word_count', 0) for c in chunks) // len(chunks)} words")

# Access chunk data
for chunk in chunks:
    print(f"📖 {chunk['topic']} ({chunk['word_count']} words)")
    print(f"📋 {chunk['summary']}")
    print()
```
Advanced Usage
```python
from pdf_chunker_for_rag import PDFChunker, ChunkingConfig, SummarizationMethod

# Custom configuration
config = ChunkingConfig(
    target_words_per_chunk=300,
    min_header_occurrences=2,
    oversized_threshold=600,
    critical_threshold=1000,
    min_meaningful_words=30,
    summarization_method=SummarizationMethod.EXTRACTIVE
)

chunker = PDFChunker(config)
result = chunker.chunk_pdf("your_document.pdf")
```
Key Classes
PDFChunker
Main interface for PDF chunking operations.
Methods:
- `chunk_pdf(pdf_path)`: Complete chunking process
- `detect_headers(pdf_path)`: Header detection only
- `extract_text(pdf_path)`: Text extraction only
- `get_font_analysis(pdf_path)`: Font analysis only
ChunkingConfig
Configuration for chunking behavior.
Parameters:
- `target_words_per_chunk`: Target words per chunk (default: 200)
- `min_header_occurrences`: Minimum header occurrences for selection (default: 3)
- `font_size_tolerance`: Tolerance for font size grouping (default: 2.0)
- `oversized_threshold`: Word count threshold for oversized chunks (default: 500)
- `critical_threshold`: Critical threshold requiring forced splitting (default: 800)
- `min_meaningful_words`: Minimum words for meaningful chunks (default: 50)
Data Structures
ChunkData: Represents a processed chunk
- `chunk_id`: Unique identifier
- `topic`: Header/topic text
- `content`: Chunk content
- `word_count`: Number of words
- `summary`: Generated summary
- `parent_chunk_info`: Information about the parent chunk (for split chunks)
HeaderData: Represents a detected header
- `text`: Header text
- `font_size`: Font size in points
- `page`: Page number
- `is_bold`: Whether the header is bold
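The two structures can be sketched as dataclasses (field names are taken from the lists above; the library's actual class definitions may differ in detail):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeaderData:
    text: str
    font_size: float  # points
    page: int
    is_bold: bool


@dataclass
class ChunkData:
    chunk_id: str
    topic: str
    content: str
    word_count: int
    summary: str
    parent_chunk_info: Optional[dict] = None  # set only for split chunks


header = HeaderData(text="Introduction", font_size=16.0, page=1, is_bold=True)
chunk = ChunkData(chunk_id="chunk_001", topic=header.text,
                  content="PDF chunking splits documents into sections.",
                  word_count=6, summary="PDF chunking overview.")
```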
Processing Pipeline
1. Font Analysis: Analyze document fonts and determine normal text size
2. Header Detection: Identify potential headers based on font size
3. Strategic Selection: Select the optimal header level using frequency analysis
4. Text Extraction: Extract text in proper reading order
5. Chunk Creation: Create initial chunks based on headers
6. Content Filtering: Remove meaningless content and merge short meaningful chunks
7. Summarization: Generate summaries for all chunks
8. Oversized Processing: Handle large chunks through sub-header detection or forced splitting
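The oversized-processing stage can be pictured as a simple routing decision on word count, using the `ChunkingConfig` defaults listed below (this is an illustrative sketch, not the library's internal code):

```python
def route_oversized_chunk(chunk, oversized_threshold=500, critical_threshold=800):
    """Illustrative routing for the oversized-processing stage:
    normal chunks are kept, oversized ones get sub-header detection,
    and chunks past the critical threshold are force-split."""
    wc = chunk["word_count"]
    if wc <= oversized_threshold:
        return "keep"
    if wc <= critical_threshold:
        return "sub_header_split"  # try hierarchical sub-chunking first
    return "forced_split"          # too large: split unconditionally
```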
Content Quality Features
Meaningless Content Detection
- Version numbers and dates
- Page markers and formatting artifacts
- Low meaningful word ratios
- Incomplete sentences and titles
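A rough sketch of how those categories can be caught with pattern matching (the patterns and threshold here are illustrative; the library's `ContentFilter` rules are not shown):

```python
import re

# Illustrative patterns for document metadata that carries no content
METADATA_PATTERNS = [
    re.compile(r"^v?\d+\.\d+(\.\d+)?$"),             # version numbers
    re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),        # dates
    re.compile(r"^page\s+\d+(\s+of\s+\d+)?$", re.I), # page markers
]


def looks_meaningless(text: str, min_words: int = 5) -> bool:
    """Flag metadata fragments and texts with too few meaningful words."""
    stripped = text.strip()
    if any(p.match(stripped) for p in METADATA_PATTERNS):
        return True
    # crude "meaningful word" proxy: alphabetic tokens of 3+ letters
    words = re.findall(r"[A-Za-z]{3,}", stripped)
    return len(words) < min_words
```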
Smart Merging
- Preserves short but meaningful content
- Forward-direction merging with adjacent chunks
- Maintains topic coherence
NLP-Enhanced Analysis (with spaCy)
- Sentence structure analysis
- Named entity recognition
- Vocabulary diversity scoring
- Professional content detection
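Vocabulary diversity scoring can be approximated without spaCy as a type-token ratio; this is only a rough stand-in for the library's NLP-based analysis:

```python
import re


def vocabulary_diversity(text: str) -> float:
    """Type-token ratio: unique words / total words, in [0.0, 1.0].
    Repetitive boilerplate scores low; varied prose scores high."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```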
Library Architecture
```
pdf_chunker_for_rag/
├── core/        # Core types and main chunker class
├── analysis/    # Font analysis and header detection
├── filtering/   # Content quality filtering and merging
├── processing/  # Summarization and oversized chunk handling
└── utils/       # Text extraction and utility functions
```
Examples
Processing Multiple PDFs
```python
import os

from pdf_chunker_for_rag import PDFChunker

chunker = PDFChunker()
results = {}

for filename in os.listdir("pdfs/"):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join("pdfs", filename)
        results[filename] = chunker.chunk_pdf(pdf_path)

# Analyze results
for filename, result in results.items():
    print(f"{filename}: {len(result.chunks)} chunks, "
          f"avg {result.average_chunk_size:.0f} words")
```
Custom Content Filtering
```python
from pdf_chunker_for_rag.filtering import ContentFilter

# Create a custom filter (named to avoid shadowing the builtin `filter`)
content_filter = ContentFilter(min_meaningful_words=30)

# Check if content is meaningful
is_meaningful = content_filter.has_meaningful_sentence_structure("Your text here")
is_meaningless = content_filter.is_meaningless_content("Your text here")
```
Font Analysis Only
```python
from pdf_chunker_for_rag.analysis import FontAnalyzer

analyzer = FontAnalyzer()
font_info = analyzer.analyze_document_fonts("document.pdf")

print(f"Normal text size: {font_info['normal_font_size']:.1f}pt")
print(f"Header threshold: {font_info['min_header_threshold']:.1f}pt")
print(f"Unique font sizes: {len(font_info['all_font_sizes'])}")
```
Requirements
- Python 3.8+
- PyMuPDF (fitz) >= 1.20.0
- pypdf >= 3.0.0
- spaCy >= 3.4.0 (optional, for enhanced NLP features)
License
MIT License - see LICENSE file for details.
Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
Changelog
Version 1.0.0
- Initial release
- Complete modular architecture
- Font-based header detection
- Content quality filtering
- Smart chunk merging
- Multiple summarization methods
- Oversized chunk processing