PDF parsing and content extraction for academic papers

Paper2Data Parser

A Python library for extracting and parsing content from academic papers. It transforms PDF files, arXiv papers, and DOI-referenced documents into structured, searchable data repositories.

🚀 Features

  • 📄 Multi-format Input: PDF files, arXiv URLs, DOI resolution with automatic retrieval
  • 🔍 Intelligent Parsing: Advanced section detection, table extraction to CSV, figure processing
  • 🌐 API Integration: Live arXiv and CrossRef DOI resolution with metadata enrichment
  • ⚡ Performance Optimized: Rate limiting, caching, and batch processing capabilities
  • 🔧 Advanced Configuration: YAML-based configuration with smart defaults and validation
  • 🔌 Enhanced Plugin System v1.1: Marketplace, dependency management, and auto-updates
  • 🧮 Mathematical Processing: LaTeX equation detection, conversion, and MathML support
  • 🖼️ Advanced Figure Processing: AI-powered figure classification and caption extraction
  • 📚 Enhanced Metadata: Institution detection, author disambiguation, and funding information
  • 📖 Bibliographic Parsing: Citation style detection, reference normalization, and network analysis
  • 🎨 Multi-Format Export: HTML, LaTeX, Word, EPUB, Markdown with professional templates
  • 🧪 Production Ready: 100% test coverage with comprehensive quality assurance
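
Accepting PDF paths, arXiv URLs, and DOIs through one entry point implies some step that decides which kind of input it was given. The sketch below shows one way such a step could look using only the standard library. `classify_source` and both regular expressions are hypothetical and are not part of the paper2data API; the arXiv identifier in the usage note is a made-up example.

```python
import re
from pathlib import Path

# Hypothetical patterns, not taken from paper2data.
ARXIV_URL = re.compile(
    r"^https?://arxiv\.org/(?:abs|pdf)/(?P<id>[\w.\-/]+?)(?:\.pdf)?$",
    re.IGNORECASE,
)
DOI = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:)?(?P<doi>10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def classify_source(source: str) -> tuple[str, str]:
    """Return (kind, identifier) where kind is 'arxiv', 'doi', or 'pdf'."""
    source = source.strip()
    match = ARXIV_URL.match(source)
    if match:
        return "arxiv", match.group("id")
    match = DOI.match(source)
    if match:
        return "doi", match.group("doi")
    # Fall back to treating anything that ends in .pdf as a local file path.
    if Path(source).suffix.lower() == ".pdf":
        return "pdf", source
    raise ValueError(f"Unrecognized input: {source!r}")
```

For example, `classify_source("https://arxiv.org/abs/2101.00001")` yields `("arxiv", "2101.00001")`, and a bare `10.1000/xyz123` is treated as a DOI. The arXiv check runs first so that an arXiv URL is never mistaken for a file path.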

📦 Installation

# Install the latest version
pip install paper2data-parser

# Install with API integration dependencies
# (quoting the requirement keeps shells such as zsh from expanding the brackets)
pip install "paper2data-parser[api]"

# Install with development dependencies
pip install "paper2data-parser[dev]"

🛠️ Quick Start

Basic Usage

from paper2data import PDFIngestor, extract_all_content

# Initialize ingestor
ingestor = PDFIngestor()

# Extract content from a PDF
content = ingestor.ingest("path/to/paper.pdf")

# Extract all content with optimization
results = extract_all_content("path/to/paper.pdf")
print(f"Extracted {len(results.sections)} sections")
print(f"Found {len(results.figures)} figures")
print(f"Extracted {len(results.tables)} tables")

Advanced Usage with Configuration

from paper2data import (
    create_config_interactive,
    extract_all_content_optimized,
    MultiFormatExporter
)

# Create configuration interactively
config = create_config_interactive()

# Extract content with full optimization
results = extract_all_content_optimized(
    "path/to/paper.pdf",
    config=config,
    enable_parallel=True,
    enable_caching=True
)

# Export to multiple formats
exporter = MultiFormatExporter({
    "formats": ["html", "latex", "word"],
    "theme": "academic"
})
exporter.export_document(results, "output/")

Enhanced Plugin System v1.1

import asyncio

from paper2data import initialize_enhanced_plugin_system


async def main():
    # Initialize the enhanced plugin system
    system = initialize_enhanced_plugin_system({
        "auto_update_enabled": True,
        "health_monitoring_enabled": True
    })

    # Search and install plugins
    results = system.search_plugins("latex", min_rating=4.0)
    await system.install_plugin("latex-processor")

    # Monitor system health
    metrics = system.get_system_metrics()
    print(f"Active plugins: {metrics.active_plugins}")


# install_plugin is awaited, so the calls must run inside an event loop
asyncio.run(main())

Mathematical Processing

from paper2data import EquationProcessor

# Process mathematical equations
processor = EquationProcessor()
equations = processor.extract_equations("path/to/paper.pdf")

for eq in equations:
    print(f"LaTeX: {eq.latex}")
    print(f"MathML: {eq.mathml}")
    print(f"Complexity: {eq.complexity_score}")
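
How `EquationProcessor` finds equations is not documented here. As a rough illustration of delimiter-based LaTeX detection, the sketch below scans text that already contains LaTeX markup. The function name and patterns are hypothetical, and text extracted from a real PDF will rarely be this clean.

```python
import re

# Illustrative patterns only; not the library's detection logic.
DISPLAY_ENV = re.compile(
    r"\\begin\{(equation|align)\*?\}(.+?)\\end\{\1\*?\}", re.DOTALL
)
DISPLAY_DOLLARS = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
INLINE = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$")


def find_equations(text: str) -> list[tuple[str, str]]:
    """Return (kind, latex_body) pairs: environments first, then $$...$$, then $...$."""
    found = []
    for match in DISPLAY_ENV.finditer(text):
        found.append(("display", match.group(2).strip()))
    # Blank out what was already matched so later patterns do not see it again.
    text = DISPLAY_ENV.sub(" ", text)
    for match in DISPLAY_DOLLARS.finditer(text):
        found.append(("display", match.group(1).strip()))
    text = DISPLAY_DOLLARS.sub(" ", text)
    for match in INLINE.finditer(text):
        found.append(("inline", match.group(1).strip()))
    return found
```

Results are grouped by pattern precedence rather than by position in the document; a real processor would likely also record page and offset for each match.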

Advanced Figure Processing

from paper2data import AdvancedFigureProcessor

# Process figures with AI analysis
processor = AdvancedFigureProcessor()
figures = processor.process_figures("path/to/paper.pdf")

for fig in figures:
    print(f"Type: {fig.figure_type}")
    print(f"Caption: {fig.caption.text}")
    print(f"Quality: {fig.analysis.quality}")

Enhanced Metadata Extraction

from paper2data import EnhancedMetadataExtractor

# Extract comprehensive metadata
extractor = EnhancedMetadataExtractor()
metadata = extractor.extract_metadata("path/to/paper.pdf")

print(f"Title: {metadata.title}")
print(f"Authors: {[author.full_name for author in metadata.authors]}")
print(f"Institutions: {[inst.name for inst in metadata.institutions]}")
print(f"Funding: {[fund.name for fund in metadata.funding_sources]}")

🎯 Key Components

Core Extraction

  • PDFIngestor: Primary PDF processing engine
  • ContentExtractor: Comprehensive content extraction
  • SectionExtractor: Intelligent section detection
  • FigureExtractor: Image and figure processing
  • TableExtractor: Table detection and CSV conversion
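
To make the "CSV conversion" step concrete, here is a minimal sketch of turning already-detected table rows into CSV text with the standard `csv` module. `table_to_csv` is a hypothetical helper written for illustration; it is not the `TableExtractor` implementation, and the row data in the usage note is invented.

```python
import csv
import io


def table_to_csv(rows: list[list[str]]) -> str:
    """Serialize table rows to CSV text, padding short rows to the widest row."""
    width = max((len(row) for row in rows), default=0)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Pad ragged rows so every CSV line has the same number of columns.
        writer.writerow(cells + [""] * (width - len(cells)))
    return buffer.getvalue()
```

Calling `table_to_csv([["Model", "Accuracy"], ["Baseline, v1", " 0.81 "]])` quotes the cell that contains a comma and trims the stray whitespace around `0.81`.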

Advanced Processing

  • EquationProcessor: Mathematical content processing
  • AdvancedFigureProcessor: AI-powered figure analysis
  • EnhancedMetadataExtractor: Comprehensive metadata extraction
  • BibliographicParser: Citation and reference processing

Plugin System v1.1

  • PluginManager: Core plugin management
  • DependencyManager: Automatic dependency resolution
  • PluginMarketplace: Community plugin ecosystem
  • EnhancedPluginSystem: Unified management interface

Output & Export

  • MultiFormatExporter: Professional multi-format export
  • OutputFormatters: Specialized format converters
  • ConfigManager: Advanced configuration management

🔧 Configuration

Paper2Data uses YAML-based configuration with smart defaults:

processing:
  max_workers: 4
  enable_caching: true
  cache_size: 1000
  
extraction:
  extract_figures: true
  extract_tables: true
  extract_equations: true
  
output:
  base_dir: "./output"
  formats: ["html", "markdown"]
  
plugins:
  auto_update: true
  security_scan: true
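
The README does not show how defaults and validation are applied. One plausible approach is to overlay user settings on a defaults dictionary and reject anything unexpected. In the sketch below, `merge_config`, `DEFAULTS`, and `KNOWN_FORMATS` are hypothetical names; the default values are copied from the YAML above, and the format names come from the export feature list. The `overrides` argument stands for a dictionary that a YAML parser has already loaded.

```python
from copy import deepcopy

# Defaults mirror the YAML example; this dict is an assumption, not library code.
DEFAULTS = {
    "processing": {"max_workers": 4, "enable_caching": True, "cache_size": 1000},
    "extraction": {
        "extract_figures": True,
        "extract_tables": True,
        "extract_equations": True,
    },
    "output": {"base_dir": "./output", "formats": ["html", "markdown"]},
    "plugins": {"auto_update": True, "security_scan": True},
}
KNOWN_FORMATS = {"html", "latex", "word", "epub", "markdown"}


def merge_config(overrides: dict) -> dict:
    """Overlay overrides on DEFAULTS, rejecting unknown keys and bad values."""
    config = deepcopy(DEFAULTS)
    for section, values in overrides.items():
        if section not in config:
            raise ValueError(f"Unknown section: {section}")
        for key, value in values.items():
            if key not in config[section]:
                raise ValueError(f"Unknown key: {section}.{key}")
            expected = type(config[section][key])
            # Strict type match, so True is not accepted where an int is expected.
            if type(value) is not expected:
                raise TypeError(f"{section}.{key} must be {expected.__name__}")
            config[section][key] = value
    if config["processing"]["max_workers"] < 1:
        raise ValueError("processing.max_workers must be at least 1")
    unknown = set(config["output"]["formats"]) - KNOWN_FORMATS
    if unknown:
        raise ValueError(f"Unknown output formats: {sorted(unknown)}")
    return config
```

With this design, `merge_config({"processing": {"max_workers": 8}})` changes one value and leaves every other default in place, while a typo such as `max_worker` fails immediately instead of being silently ignored.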

📊 Performance Features

  • Parallel Processing: Multi-threaded extraction
  • Intelligent Caching: Smart result caching
  • Memory Optimization: Efficient memory usage
  • Batch Processing: Process multiple documents
  • Progress Tracking: Real-time progress monitoring
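
As a generic illustration of how parallel batch processing and result caching fit together, the sketch below uses a thread pool and `functools.lru_cache` from the standard library. `extract` is a stand-in that does no real PDF work, and `process_batch` is a hypothetical name; neither reflects how paper2data implements these features.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1000)
def extract(path: str) -> dict:
    """Stand-in for a real extraction call; returns a tiny summary dict."""
    return {"path": path, "name_length": len(path)}


def process_batch(paths: list[str], max_workers: int = 4) -> list[dict]:
    """Run extract() over paths in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, paths))
```

`pool.map` returns results in the order of `paths`, so callers can pair each result with its input. Note that `lru_cache` hands back the same dictionary object on a cache hit, so callers should treat results as read-only.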

🔌 Plugin Ecosystem

The enhanced plugin system v1.1 provides:

  • Plugin Marketplace: Discover and install community plugins
  • Dependency Management: Automatic dependency resolution
  • Security Scanning: Automated security validation
  • Health Monitoring: Real-time plugin performance tracking
  • Auto-Updates: Background plugin updates
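
Automatic dependency resolution usually comes down to a topological sort: install each plugin only after everything it depends on. The sketch below shows that idea with `graphlib` from the standard library (Python 3.9+). `install_order` is a hypothetical helper, and apart from `latex-processor` the plugin names in the usage note are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def install_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return plugin names ordered so each plugin follows its dependencies."""
    try:
        # Keys are plugins; values are the plugins they depend on.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as error:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"Circular plugin dependency: {error.args[1]}") from error
```

Given `{"latex-processor": {"math-core"}, "math-core": {"base"}, "base": set()}`, the function returns `base`, then `math-core`, then `latex-processor`. A cycle raises `ValueError` rather than producing a partial order.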

🧪 Testing & Quality

  • 100% Test Coverage: Comprehensive test suite
  • Type Hints: Full type annotation support
  • Linting: Code quality enforcement
  • Performance Testing: Benchmarking and optimization
  • Integration Testing: End-to-end validation

📚 Documentation

  • API Reference: Complete API documentation
  • Examples: Comprehensive usage examples
  • Tutorials: Step-by-step guides
  • Best Practices: Recommended patterns

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🚀 What's New in v1.1

Enhanced Plugin System

  • Plugin architecture with marketplace integration
  • Automatic dependency resolution and conflict management
  • Security scanning and health monitoring
  • Background auto-updates and performance analytics

Mathematical Processing

  • Advanced LaTeX equation detection and extraction
  • MathML conversion for web compatibility
  • Mathematical complexity analysis
  • Symbol recognition and validation

Advanced Figure Processing

  • AI-powered figure classification
  • Automatic caption extraction with OCR fallback
  • Image quality assessment and analysis
  • Figure-text association and context analysis

Enhanced Metadata Extraction

  • Author disambiguation and institution detection
  • Funding source identification and categorization
  • Enhanced bibliographic data extraction
  • Cross-reference validation and enrichment

Multi-Format Export

  • Professional HTML export with interactive features
  • LaTeX reconstruction for academic submission
  • Microsoft Word compatibility
  • EPUB generation for e-book readers
  • Enhanced Markdown with rich formatting

Paper2Data v1.1 - Transform academic papers into structured data repositories with enterprise-grade processing capabilities.
