Production-grade PDF parsing: spec-compliant text extraction, intelligent reading order, OCR support. Ultra-fast Rust performance.
Project description
PDFOxide
High-performance PDF text extraction and markdown conversion library built in Rust.
A production-ready, high-performance PDF parsing and conversion library with Python bindings. Processes 103 PDFs in 5.43 seconds with 100% success rate.
๐ Documentation | ๐ Comparison | ๐ค Contributing | ๐ Security
Why This Library?
- โจ Ultra-fast - Process 100 PDFs in 5.3 seconds (average 53ms per PDF)
- ๐ Form field extraction - Complete form field structure and hierarchy
- ๐ฏ 100% text accuracy - Perfect word spacing and bold detection
- ๐ Production ready - 100% success rate on 103-file test suite
- โก Low latency - Average 53ms per PDF, perfect for web services
- ๐ฆ Pure Rust - Memory-safe, no C dependencies, single binary
Features
Currently Available (v0.2.0+)
- ๐ Complete PDF Parsing - PDF 1.0-1.7 with robust error handling and cycle detection
- ๐ Text Extraction - 100% accurate with perfect word spacing and Unicode support
- โ๏ธ Bold Detection - Accurate font weight detection (16,074 bold sections in test suite)
- ๐ Form Field Extraction - Unique feature: extracts complete form field structure and hierarchy
- ๐ Bookmarks/Outline - Extract PDF document outline with hierarchical structure
- ๐ Annotations - Extract PDF annotations including comments, highlights, and links
- ๐ฏ Layout Analysis - DBSCAN clustering, XY-Cut, and structure tree-based reading order
- ๐ง Intelligent Text Processing - Auto-detection of OCR vs native PDFs with per-block processing (NEW - v0.2.0)
- ๐ Markdown Export - Clean, properly formatted output with reading order preservation
- ๐ผ๏ธ Image Extraction - Extract embedded images with CCITT bilevel support
- ๐ Comprehensive Extraction - Captures all text including OCR and technical diagrams
- โก Ultra-Fast Processing - 5.43 seconds for 103 PDFs (average 53ms per PDF)
- ๐พ Efficient Output - Compact markdown and HTML generation
- ๐ฏ PDF Spec Aligned - Section 9, 14.7-14.8 compliance with proper reading order (NEW - v0.2.0)
Python Integration
- ๐ Python Bindings - Easy-to-use API via PyO3
- ๐ฆ Pure Rust Core - Memory-safe, fast, no C dependencies
- ๐ฆ Single Binary - No complex dependencies or installations
- ๐งช Production Ready - 100% success rate on comprehensive test suite
- ๐ Well Documented - Complete API documentation and examples
v0.2.0 Enhancements (Current) โจ
- ๐ง Intelligent Text Processing - Auto-detects OCR vs native PDFs per text block
- ๐ Reading Order Strategies - XY-Cut spatial analysis, structure tree, column-aware
- ๐๏ธ Modern Pipeline Architecture - Extensible OutputConverter trait, OrderedTextSpan metadata
- ๐ฏ PDF Spec Aligned - PDF 1.7 spec compliance (Sections 9, 14.7-14.8)
- ๐งน Code Quality - 72% warning reduction, no dead code, 946 tests passing
- ๐ Backward Compatible - Old API still works, deprecated with migration path
- ๐๏ธ CCITT Bilevel Images - Group 3/4 decompression for scanned PDFs
Future Enhancements (v0.3.0+) - Bidirectional Features
v0.3.0 - PDF Creation Foundations
- ๐ PDF Creation API - Fluent PdfBuilder for programmatic PDF generation
- ๐ Markdown โ PDF - Convert Markdown files to PDF documents
- ๐ HTML โ PDF - Convert HTML content to PDF (basic CSS support)
- ๐ Text โ PDF - Generate PDFs from plain text with styling
- ๐จ PDF Templates - Reusable document templates and code-based layouts
- ๐ผ๏ธ Image Embedding - JPEG/PNG/TIFF image support in generated PDFs
v0.4.0 - Structured Data
- ๐ Tables (Read โ Write) - Extract table structure โ Generate tables with borders/headers
- ๐ Forms (Read โ Write) - Extract filled forms โ Create fillable interactive forms
- ๐๏ธ Document Hierarchy (Read โ Write) - Parse outlines โ Generate bookmarks/TOC
v0.5.0 - Advanced Structure
- ๐ผ๏ธ Figures & Captions (Read โ Write) - Extract with context โ Place with auto-numbering
- ๐ Citations (Read โ Write) - Parse bibliography โ Generate citations
- ๐ Footnotes (Read โ Write) - Extract footnotes โ Create footnotes automatically
v0.6.0 - Interactivity & Accessibility
- ๐ฌ Annotations (Read โ Write) - Extract comments/highlights โ Add programmatically
- โฟ Tagged PDF (Read โ Write) - Parse structure trees โ Create accessible PDFs (WCAG/Section 508)
- ๐ Hyperlinks (Read โ Write) - Extract URLs/links โ Create clickable links
v0.7.0+ - Specialized Features
- ๐งฎ Math Formulas (Read โ Write) - Extract equations โ LaTeX to PDF
- ๐ Multi-Script (Read โ Write) - Bidirectional text, vertical CJK, complex ligatures
- ๐ Encryption (Read โ Write) - Decrypt/permissions โ Encrypt/sign PDFs
- ๐ฆ Embedded Files (Read โ Write) - Extract attachments โ PDF portfolios
- โ๏ธ Vector Graphics (Read โ Write) - Extract paths โ SVG to PDF
Quick Start
Rust - Basic Usage
use pdf_oxide::PdfDocument;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Open a PDF
let mut doc = PdfDocument::open("paper.pdf")?;
// Get page count
println!("Pages: {}", doc.page_count());
// Extract text from first page
let text = doc.extract_text(0)?;
println!("{}", text);
// Convert to Markdown (uses intelligent processing automatically)
let markdown = doc.to_markdown(0, Default::default())?;
// Extract images
let images = doc.extract_images(0)?;
println!("Found {} images", images.len());
// Get bookmarks/outline
if let Some(outline) = doc.get_outline()? {
for item in outline {
println!("Bookmark: {}", item.title);
}
}
// Get annotations
let annotations = doc.get_annotations(0)?;
for annot in annotations {
if let Some(contents) = annot.contents {
println!("Annotation: {}", contents);
}
}
Ok(())
}
Rust - Advanced Usage (v0.2.0 Pipeline API)
use pdf_oxide::PdfDocument;
use pdf_oxide::pipeline::{TextPipeline, TextPipelineConfig, ReadingOrderContext};
use pdf_oxide::pipeline::converters::{MarkdownOutputConverter, OutputConverter};
use pdf_oxide::converters::ConversionOptions;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut doc = PdfDocument::open("paper.pdf")?;
// Extract spans (raw text with positions)
let spans = doc.extract_spans(0)?;
// Step 1: Apply intelligent text processing (auto-detects OCR vs native PDF)
let spans = doc.apply_intelligent_text_processing(spans)?;
// Step 2: Create pipeline with reading order strategy
let config = TextPipelineConfig::from_conversion_options(&ConversionOptions::default());
let pipeline = TextPipeline::with_config(config.clone());
// Step 3: Create reading order context
let context = ReadingOrderContext::new().with_page(0);
// Step 4: Process through pipeline (applies reading order + intelligent processing)
let ordered_spans = pipeline.process(spans, context)?;
// Step 5: Convert to Markdown or other format
let converter = MarkdownOutputConverter::new();
let markdown = converter.convert(&ordered_spans, &config)?;
println!("{}", markdown);
Ok(())
}
Key v0.2.0 Improvements
- Automatic OCR Detection: Detects scanned PDFs per text block
- Reading Order: Proper document reading order via structure tree (PDF spec Section 14.7)
- Intelligent Processing: Three-stage pipeline (punctuation, ligatures, hyphenation)
- Per-Block Analysis: No global configuration needed, adapts per text span
- PDF Spec Aligned: Follows ISO 32000-1:2008 (PDF 1.7)
Rust - HTML Conversion Example
use pdf_oxide::PdfDocument;
use pdf_oxide::pipeline::converters::HtmlOutputConverter;
use pdf_oxide::pipeline::{TextPipeline, TextPipelineConfig};
use pdf_oxide::converters::ConversionOptions;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut doc = PdfDocument::open("document.pdf")?;
let spans = doc.extract_spans(0)?;
// Create pipeline
let config = TextPipelineConfig::from_conversion_options(&ConversionOptions::default());
let pipeline = TextPipeline::with_config(config.clone());
// Process through pipeline
let ordered_spans = pipeline.process(spans, Default::default())?;
// Convert to HTML instead of Markdown
let converter = HtmlOutputConverter::new();
let html = converter.convert(&ordered_spans, &config)?;
println!("{}", html);
Ok(())
}
Rust - Markdown with Configuration
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut doc = PdfDocument::open("paper.pdf")?;
// Create custom conversion options
let options = ConversionOptions {
detect_headings: true, // Auto-detect heading levels by font size
include_images: true, // Extract and reference images
preserve_layout: false, // Use semantic structure instead of visual layout
image_output_dir: Some("./extracted_images".to_string()),
};
// Convert to Markdown with options
let markdown = doc.to_markdown(0, options)?;
println!("{}", markdown);
// Convert entire document
let full_markdown = doc.to_markdown_all(options)?;
std::fs::write("output.md", &full_markdown)?;
Ok(())
}
Rust - Intelligent OCR Detection (Mixed Documents)
use pdf_oxide::PdfDocument;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut doc = PdfDocument::open("mixed_content.pdf")?;
let spans = doc.extract_spans(0)?;
// Apply intelligent text processing
// Automatically detects OCR blocks and applies appropriate cleaning:
// - Punctuation reconstruction for OCR text
// - Ligature handling (fi, fl, etc.)
// - Hyphenation cleanup
let processed = doc.apply_intelligent_text_processing(spans)?;
for span in &processed {
println!("Text: '{}' (cleaned: {})",
&span.text,
span.text.len()); // OCR artifacts automatically removed
}
Ok(())
}
Rust - Form Field Extraction
use pdf_oxide::PdfDocument;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let mut doc = PdfDocument::open("form.pdf")?;
// Extract form fields from page
let fields = doc.extract_form_fields(0)?;
for field in fields {
println!("Field: {}", field.name);
println!(" Type: {:?}", field.field_type); // Text, Checkbox, Radio, Dropdown, etc.
println!(" Value: {:?}", field.value);
println!(" Required: {}", field.required);
println!(" Options: {:?}", field.options); // For dropdown/radio fields
println!();
}
Ok(())
}
Python - HTML Conversion
from pdf_oxide import PdfDocument
# Open PDF and extract spans
doc = PdfDocument("document.pdf")
spans = doc.extract_spans(0)
# Apply intelligent text processing
processed_spans = doc.apply_intelligent_text_processing(spans)
# Convert to HTML (semantic mode - best for readability)
html = doc.to_html(
0,
preserve_layout=False,
detect_headings=True,
include_images=True,
image_output_dir="./images"
)
print(html)
# Or use layout mode (preserves visual positioning)
html_layout = doc.to_html(0, preserve_layout=True)
Python - Markdown with Configuration
from pdf_oxide import PdfDocument
# Open a PDF
doc = PdfDocument("paper.pdf")
# Convert to Markdown with options
markdown = doc.to_markdown(
0,
detect_headings=True, # Auto-detect heading levels
include_images=True, # Extract and reference images
image_output_dir="./extracted_images"
)
print(markdown)
# Convert entire document to single Markdown file
full_markdown = doc.to_markdown_all(
detect_headings=True,
include_images=True,
image_output_dir="./doc_images"
)
# Save to file
with open("output.md", "w") as f:
f.write(full_markdown)
Python - Intelligent OCR Detection
from pdf_oxide import PdfDocument
# Open PDF with mixed native and scanned content
doc = PdfDocument("mixed_content.pdf")
# Extract spans (text with positions)
spans = doc.extract_spans(0)
# Apply intelligent text processing
# Automatically detects and cleans OCR blocks:
# - Punctuation reconstruction
# - Ligature handling (fi, fl, etc.)
# - Hyphenation cleanup
processed = doc.apply_intelligent_text_processing(spans)
# Use processed spans for higher quality conversion
markdown = doc.to_markdown(0, detect_headings=True)
html = doc.to_html(0, preserve_layout=False, detect_headings=True)
Python - Form Field Extraction
from pdf_oxide import PdfDocument
# Open PDF with form fields
doc = PdfDocument("form.pdf")
# Extract form fields
fields = doc.extract_form_fields(0)
# Access field information
for field in fields:
print(f"Field Name: {field.name}")
print(f"Type: {field.field_type}") # Text, Checkbox, Radio, Dropdown, etc.
print(f"Value: {field.value}")
print(f"Required: {field.required}")
if field.options: # For dropdown/radio buttons
print(f"Options: {field.options}")
print()
# Extract all form data from page
form_data = {field.name: field.value for field in fields}
print(f"Form Data: {form_data}")
What's Coming in v0.3.0 - PDF Creation
v0.3.0 will introduce PDF generation from code with support for multiple input formats:
// Build PDFs programmatically
use pdf_oxide::builder::{PdfBuilder, PdfPage, PdfText};
let pdf = PdfBuilder::new()
.add_page(PdfPage::new(8.5, 11.0))
.add_text("Document Title", 24.0, 72.0, 750.0)
.add_markdown("# Introduction\n\nThis is a **markdown** document.")
.add_text("Page 1 content here", 12.0, 72.0, 650.0)
.build()?
.save("output.pdf")?;
// Convert Markdown to PDF
let markdown_content = std::fs::read_to_string("document.md")?;
let pdf = PdfBuilder::from_markdown(&markdown_content)?
.save("document.pdf")?;
// Convert HTML to PDF
let html_content = "<h1>Title</h1><p>HTML content</p>";
let pdf = PdfBuilder::from_html(html_content)?
.save("output.pdf")?;
// Use templates for consistent styling
let pdf = PdfBuilder::with_template("business_letter")
.add_content("This is the letter content")
.save("letter.pdf")?;
v0.3.0 Features:
- โ๏ธ
PdfBuilder- Fluent API for PDF creation - ๐
PdfPage- Page management with custom sizing - ๐ค
PdfText- Text with font and styling - ๐๏ธ
PdfImage- Image embedding and positioning - ๐ Markdown โ PDF conversion
- ๐ HTML โ PDF conversion (with CSS support)
- ๐ Text โ PDF generation
- ๐จ Template system for consistent designs
- ๐ค Font embedding and selection
This positions pdf_oxide as a bidirectional PDF toolkit - extract from PDFs AND create them!
Installation
Rust Library
Add to your Cargo.toml:
[dependencies]
pdf_oxide = "0.2"
Python Package
pip install pdf_oxide
Python API Reference
PdfDocument - Main class for PDF operations
Constructor:
PdfDocument(path: str)- Open a PDF file
Methods:
version() -> Tuple[int, int]- Get PDF version (major, minor)page_count() -> int- Get number of pagesextract_text(page: int) -> str- Extract text from a pageto_markdown(page, preserve_layout=False, detect_headings=True, include_images=True, image_output_dir=None) -> strto_html(page, preserve_layout=False, detect_headings=True, include_images=True, image_output_dir=None) -> strto_markdown_all(...) -> str- Convert all pages to Markdownto_html_all(...) -> str- Convert all pages to HTML
See python/pdf_oxide/__init__.pyi for full type hints and documentation.
Python Examples
See examples/python_example.py for a complete working example demonstrating all features.
Project Structure
pdf_oxide/
โโโ src/ # Rust source code
โ โโโ lib.rs # Main library entry point
โ โโโ error.rs # Error types
โ โโโ object.rs # PDF object types
โ โโโ lexer.rs # PDF lexer
โ โโโ parser.rs # PDF parser
โ โโโ document.rs # Document API
โ โโโ decoders.rs # Stream decoders
โ โโโ geometry.rs # Geometric primitives
โ โโโ layout.rs # Layout analysis
โ โโโ content.rs # Content stream parsing
โ โโโ fonts.rs # Font handling
โ โโโ text.rs # Text extraction
โ โโโ images.rs # Image extraction
โ โโโ converters.rs # Format converters
โ โโโ config.rs # Configuration
โ โโโ ml/ # ML integration (optional)
โ
โโโ python/ # Python bindings
โ โโโ src/lib.rs # PyO3 bindings
โ โโโ pdf_oxide.pyi # Type stubs
โ
โโโ tests/ # Integration tests
โ โโโ fixtures/ # Test PDFs
โ โโโ *.rs # Test files
โ
โโโ benches/ # Benchmarks
โ โโโ *.rs # Criterion benchmarks
โ
โโโ examples/ # Usage examples
โ โโโ rust/ # Rust examples
โ โโโ python/ # Python examples
โ
โโโ docs/ # Documentation
โ โโโ spec/ # PDF specification reference
โ โโโ pdf.md # ISO 32000-1:2008 excerpts
โ
โโโ training/ # ML training scripts (optional)
โ โโโ dataset/ # Dataset tools
โ โโโ finetune_*.py # Fine-tuning scripts
โ โโโ evaluate.py # Evaluation
โ
โโโ models/ # ONNX models (optional)
โ โโโ registry.json # Model metadata
โ โโโ *.onnx # Model files
โ
โโโ Cargo.toml # Rust dependencies
โโโ LICENSE-MIT # MIT license
โโโ LICENSE-APACHE # Apache-2.0 license
โโโ README.md # This file
Development Roadmap
โ Completed (v0.1.0)
- Core PDF Parsing - Complete PDF 1.0-1.7 support with robust error handling
- Text Extraction - 100% accurate extraction with perfect word spacing
- Layout Analysis - DBSCAN clustering and XY-Cut algorithms
- Markdown Export - Clean formatting with bold detection and form fields
- Image Extraction - Extract embedded images with metadata
- Python Bindings - Full PyO3 integration
- Performance Optimization - Ultra-fast processing (53ms average per PDF)
- Production Quality - 100% success rate on comprehensive test suite
โ Completed (v0.2.0) - PDF Spec Alignment & Intelligent Processing
- Intelligent Text Processing - Auto-detection of OCR vs native PDFs per text block
- Reading Order Strategies - XY-Cut spatial analysis, structure tree navigation
- Modern Pipeline Architecture - Extensible OutputConverter trait, OrderedTextSpan metadata
- PDF Spec Compliance - ISO 32000-1:2008 (PDF 1.7) Sections 9, 14.7-14.8
- Code Quality - 72% warning reduction, no dead code, 946 tests passing
- API Migration - Old APIs deprecated, modern TextPipeline recommended
- CCITT Bilevel Support - Group 3/4 image decompression for scanned PDFs
๐ง In Development (v0.3.0) - PDF Creation Foundations
- PDF Builder API - Fluent interface for programmatic PDF creation
- Markdown โ PDF - Convert Markdown files to PDF documents
- HTML โ PDF - Convert HTML with CSS to PDF
- Text โ PDF - Generate PDFs from plain text with styling
- PDF Templates - Reusable document templates for consistent designs
- Image Embedding - Support for embedded images in generated PDFs
- Bidirectional Toolkit - Extract FROM PDFs AND create PDFs
๐ฎ Planned (v0.4.0-v0.6.0) - Bidirectional Features
- Tables (Read โ Write) - v0.4.0
- Forms (Read โ Write) - v0.4.0
- Figures & Citations (Read โ Write) - v0.5.0
- Annotations & Tagged PDF (Read โ Write) - v0.6.0
- Hyperlinks & Advanced Graphics (Read โ Write) - v0.6.0
๐ฎ Future (v0.7.0+) - Specialized Features
- Math Formulas (Read โ Write) - Extract/generate equations
- Multi-Script Support - Bidirectional text, vertical CJK
- Encryption & Signatures - Password protection, digital signatures
- Embedded Files - PDF portfolios and attachments
- Vector Graphics - SVG to PDF, path extraction
- Advanced OCR - Multi-language detection and processing
- Performance Optimizations - Streaming, parallel processing, WASM
Versioning Philosophy: pdf_oxide follows forever 0.x versioning (0.1, 0.2, ... 0.100, 0.101, ...). We believe software evolves continuously rather than reaching a "1.0 finish line." Each version represents progress toward comprehensive PDF mastery, inspired by TeX's asymptotic approach (ฯ = 3.1, 3.14, 3.141...).
Current Status: โ v0.2.0 Production Ready - Spec-aligned with intelligent processing | ๐ง v0.3.0 - PDF Creation in development
Versioning Philosophy: Forever 0.x
pdf_oxide follows continuous evolution versioning:
- Versions: 0.1 โ 0.2 โ 0.3 โ ... โ 0.10 โ ... โ 0.100 โ ... (never 1.0)
- Rationale: Software is never "finished." Like TeX approaching ฯ asymptotically (3.1, 3.14, 3.141...), we approach perfect PDF handling without claiming to be done.
- Why not 1.0? Version 1.0 implies "feature complete" or "API frozen," but PDFs evolve and so should we.
- Production-Ready from 0.1.0+ - The 0.x doesn't mean unstable; it means "continuously improving"
Breaking Changes Policy
- Major features (v0.x.0): Possible breaking changes with deprecation warnings
- Minor features (v0.x.y): Backward compatible improvements
- Patches (v0.x.y.z): Bug fixes and security updates
Deprecation Examples
- v0.2.0:
MarkdownConvertermarked deprecated - v0.3.0-v0.4.0: Still works but flagged with migration warnings
- v0.5.0+: Removed (3+ versions later)
This gives users time to migrate while maintaining a clean codebase.
Building from Source
Prerequisites
- Rust 1.70+ (Install Rust)
- Python 3.8+ (for Python bindings)
- C compiler (gcc/clang)
Build Core Library
# Clone repository
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
# Build
cargo build --release
# Run tests
cargo test
# Run benchmarks
cargo bench
Build Python Package
# Development install
maturin develop
# Release build
maturin build --release
# Install wheel
pip install target/wheels/*.whl
Performance
Real-world benchmark results (103 diverse PDFs including forms, financial documents, and technical papers):
Benchmark Results
| Metric | Result |
|---|---|
| Total Time (103 PDFs) | 5.43s |
| Average Per PDF | 53ms |
| Success Rate | 100% (103/103) |
| Bold Sections Detected | 16,074 |
Scaling Projections
- 100 PDFs: ~5.3 seconds
- 1,000 PDFs: ~53 seconds
- 10,000 PDFs: ~8.8 minutes
- 100,000 PDFs: ~1.5 hours
Perfect for:
- High-throughput batch processing
- Real-time web services (53ms average latency)
- Cost-effective cloud deployments
- Resource-constrained environments
See COMPARISON.md for detailed analysis.
Quality Metrics & Improvements
Based on comprehensive analysis of diverse PDFs and recent validation testing (49ms median performance, 100% success rate), with improvements to achieve production-grade accuracy:
Overall Quality
| Metric | Result | Details |
|---|---|---|
| Quality Score | 8.5+/10 | Up from 3.4/10 (150% improvement) |
| Text Extraction | 100% | Perfect character extraction with proper encoding |
| Word Spacing | 100% | Unified adaptive threshold algorithm |
| Bold Detection | 16,074 | Bold sections detected in test suite |
| Form Field Extraction | 13 files | Complete form structure extraction |
| Quality Rating | 67% GOOD+ | 67% of files rated GOOD or EXCELLENT |
| Success Rate | 100% | All 103 PDFs processed successfully |
Specific Quality Improvements (v0.1.2+)
Fixed Issues from previous versions:
| Issue | Before | After | Improvement |
|---|---|---|---|
| Spurious Spaces | 1,623 in arxiv PDF | <50 | 96.9% reduction |
| Word Fusions | 3 instances | 0 | 100% elimination |
| Empty Bold Markers | 3 instances | 0 | 100% elimination |
Root Causes Addressed:
- Unified Space Decision: Single source of truth eliminates double space insertion
- Split Boundary Preservation: CamelCase words stay split during merging
- Bold Pre-Validation: Whitespace blocks filtered before bold grouping
- Adaptive Thresholds: Document profile detection tunes thresholds automatically
See docs/QUALITY_FIX_IMPLEMENTATION.md for comprehensive documentation.
Comprehensive Extraction Approach
- Adaptive Quality: Automatically adjusts extraction strategy based on document type (academic papers, policy documents, mixed layouts)
- Captures all text: Including technical diagrams and annotations
- Preserves structure: Form fields, bookmarks, and annotations intact
- Extracts metadata: PDF metadata, outline, and annotations
- Perfect for: Archival, search indexing, complete content analysis, LLM consumption
Text Extraction Quality Troubleshooting
Common Issues and Solutions
Problem: Double spaces in extracted text (e.g., "Over the past")
- Cause: Adaptive threshold too low for document's gap distribution
- Solution: Increase adaptive threshold multiplier or use legacy fixed thresholds
- See: docs/QUALITY_FIX_IMPLEMENTATION.md#troubleshooting-guide
Problem: CamelCase words fused (e.g., "theGeneralwas")
- Cause: CamelCase detection or split preservation disabled
- Solution: Enable CamelCase detection in config or use default settings
- See: docs/QUALITY_FIX_IMPLEMENTATION.md#camelcase-words-arent-being-split
Problem: Empty bold markers in output (e.g., ** **)
- Cause: Whitespace blocks inheriting bold styling
- Solution: Pre-validation filtering is enabled by default; file an issue if still occurs
- See: docs/QUALITY_FIX_IMPLEMENTATION.md#bold-formatting-is-missing
For detailed troubleshooting and configuration options, see the comprehensive guide: docs/QUALITY_FIX_IMPLEMENTATION.md
Testing
# Run all tests
cargo test
# Run with features
cargo test --features ml
# Run integration tests
cargo test --test '*'
# Run quality-specific tests
cargo test quality
# Run benchmarks
cargo bench
# Run performance benchmarks
cargo bench --bench pdf_extraction_performance
# Generate coverage report
cargo install cargo-tarpaulin
cargo tarpaulin --out Html
Documentation
Specification References
- docs/spec/pdf.md - ISO 32000-1:2008 sections 9, 14.7-14.8 (PDF specification excerpts)
API Documentation
# Generate and open docs
cargo doc --open
# With all features
cargo doc --all-features --open
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
What this means:
โ You CAN:
- Use this library freely for any purpose (personal, commercial, SaaS, web services)
- Modify and distribute the code
- Use it in proprietary applications without open-sourcing your code
- Sublicense and redistribute under different terms
โ ๏ธ You MUST:
- Include the copyright notice and license text in your distributions
- If using Apache-2.0 and modifying the library, note that you've made changes
โ You DON'T need to:
- Open-source your application code
- Share your modifications (but we'd appreciate contributions!)
- Pay any fees or royalties
Why MIT OR Apache-2.0?
We chose dual MIT/Apache-2.0 licensing (standard in the Rust ecosystem) to:
- Maximize adoption - No restrictions on commercial or proprietary use
- Patent protection - Apache-2.0 provides explicit patent grants
- Flexibility - Users can choose the license that best fits their needs
Apache-2.0 offers stronger patent protection, while MIT is simpler and more permissive. Choose whichever works best for your project.
See LICENSE-MIT and LICENSE-APACHE for full terms.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Contributing
We welcome contributions! To get started:
Getting Started
- Familiarize yourself with the codebase:
src/for Rust,python/for Python bindings - Check open issues for areas needing help
- Create an issue to discuss your approach
- Submit a pull request with tests
Development Setup
# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build
# Install development tools
cargo install cargo-watch cargo-tarpaulin
# Run tests on file changes
cargo watch -x test
# Format code
cargo fmt
# Run linter
cargo clippy -- -D warnings
Acknowledgments
Research Sources:
- PDF Reference 1.7 (ISO 32000-1:2008)
- Academic papers on document layout analysis
- Open-source implementations (lopdf, pdf-rs, pdfium-render)
Support
- Documentation:
docs/planning/ - Issues: GitHub Issues
Citation
If you use this library in academic research, please cite:
@software{pdf_oxide,
title = {PDF Oxide: High-Performance PDF Parsing in Rust},
author = {Yury Fedoseev},
year = {2025},
url = {https://github.com/yfedoseev/pdf_oxide}
}
Built with ๐ฆ Rust + ๐ Python
Status: โ Production Ready | v0.2.0 | ๐ 53ms per PDF | ๐ง Intelligent OCR Detection | ๐ PDF Spec Aligned (1.7) | โ Quality Validated (100% success) | ๐ Bidirectional Read/Write | โพ๏ธ Forever 0.x (Continuous Evolution)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pdf_oxide-0.2.4.tar.gz.
File metadata
- Download URL: pdf_oxide-0.2.4.tar.gz
- Upload date:
- Size: 2.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b480f8dcc6788666f5012768040a44e3a51f1e90f5cdefd7278768882ef47ebe
|
|
| MD5 |
66731c731e0a5fb8dcf0bb1deb7a02c9
|
|
| BLAKE2b-256 |
a8a3f0080cbfa08d48626d6d8edc3be614514e2132fa89741d0a6d5dbfed9af5
|
File details
Details for the file pdf_oxide-0.2.4-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: pdf_oxide-0.2.4-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b7d12d9c57ebfc1c4671f83ac1616ebe8a2943deb373cbf8cdebfa019f11f306
|
|
| MD5 |
6f0b299bb2183dbfff3049f33c70d27e
|
|
| BLAKE2b-256 |
e4fa8d6e511f9c360fb1875e6e4eeba49ee4b1e72a2be99278930f94aef42339
|
File details
Details for the file pdf_oxide-0.2.4-cp311-cp311-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: pdf_oxide-0.2.4-cp311-cp311-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.11, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8ca8c802047fc21e9728f7842e1150b5bd5f209f7ca73351a0383551538eba03
|
|
| MD5 |
351d0721f113314774925f534f3fc1cf
|
|
| BLAKE2b-256 |
b2e3e0efe3d9c0c071c487fcbc6b969875f227da4c666dab33c7f8df1a6ad7c7
|
File details
Details for the file pdf_oxide-0.2.4-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: pdf_oxide-0.2.4-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4c0b559e05348cbd86810ec6d2848f8047487be6e6cc3c5d27342ce057bad114
|
|
| MD5 |
461d3dc9f68457cf69176ec012782695
|
|
| BLAKE2b-256 |
6ccb52bb3ba4185246e284c14a0e89935966c80ae5b00224b317643541778bc1
|
File details
Details for the file pdf_oxide-0.2.4-cp311-cp311-macosx_10_12_x86_64.whl.
File metadata
- Download URL: pdf_oxide-0.2.4-cp311-cp311-macosx_10_12_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.11, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2ce524a7acad9fda790cd6f7086d6da46d05b87c1f2e724a54534e6160f8c7e0
|
|
| MD5 |
575e6528662fdb8f189f4f72abedc3a2
|
|
| BLAKE2b-256 |
f450f81f34380ba408e402444f4c8de38c7b52277370016dadbdf85bccb2e3d8
|