PDF parsing and content extraction for academic papers

Paper2Data Parser

A powerful Python library for extracting and parsing content from academic papers. Transform PDF files, arXiv papers, and DOI-referenced documents into structured, searchable data repositories.

🚀 Features

  • 📄 Multi-format Input: PDF files, arXiv URLs, DOI resolution with automatic retrieval
  • 🔍 Intelligent Parsing: Advanced section detection, table extraction to CSV, figure processing
  • 🌐 API Integration: Live arXiv and CrossRef DOI resolution with metadata enrichment
  • ⚡ Performance Optimized: Rate limiting, caching, and batch processing capabilities
  • 🔧 Advanced Configuration: YAML-based configuration with smart defaults and validation
  • 🔌 Enhanced Plugin System v1.1: Marketplace, dependency management, and auto-updates
  • 🧮 Mathematical Processing: LaTeX equation detection, conversion, and MathML support
  • 🖼️ Advanced Figure Processing: AI-powered figure classification and caption extraction
  • 📚 Enhanced Metadata: Institution detection, author disambiguation, and funding information
  • 📖 Bibliographic Parsing: Citation style detection, reference normalization, and network analysis
  • 🎨 Multi-Format Export: HTML, LaTeX, Word, EPUB, Markdown with professional templates
  • 🧪 Production Ready: 100% test coverage with comprehensive quality assurance
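As a rough illustration of the multi-format input idea, the sketch below classifies an input string as an arXiv URL, a DOI, or a local PDF path before any retrieval happens. The regular expressions and the `classify_input` helper are assumptions made for this example; they are not taken from the paper2data source.

```python
import re

# Hypothetical patterns for illustration; not taken from the paper2data source.
ARXIV_URL = re.compile(
    r"^https?://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$"
)
DOI = re.compile(
    r"^(?:doi:|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$", re.IGNORECASE
)


def classify_input(source: str) -> tuple[str, str]:
    """Return (kind, identifier), where kind is 'arxiv', 'doi', or 'pdf'."""
    source = source.strip()
    match = ARXIV_URL.match(source)
    if match:
        return "arxiv", match.group(1)  # bare arXiv ID without version suffix
    match = DOI.match(source)
    if match:
        return "doi", match.group(1)  # DOI without any resolver prefix
    if source.lower().endswith(".pdf"):
        return "pdf", source  # treat as a local file path
    raise ValueError(f"Unrecognized input: {source!r}")


if __name__ == "__main__":
    for item in ("https://arxiv.org/abs/1706.03762v5",
                 "doi:10.1000/xyz123",
                 "path/to/paper.pdf"):
        print(item, "->", classify_input(item))
```

A dispatcher like this keeps network access (arXiv or CrossRef lookups) separate from parsing, so each branch can be rate limited or cached independently.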

📦 Installation

# Install the latest version
pip install paper2data-parser

# Install with API integration dependencies
# (quotes keep shells such as zsh from expanding the brackets)
pip install "paper2data-parser[api]"

# Install with development dependencies
pip install "paper2data-parser[dev]"

🛠️ Quick Start

Basic Usage

from paper2data import PDFIngestor, extract_all_content

# Initialize ingestor
ingestor = PDFIngestor()

# Extract content from a PDF
content = ingestor.ingest("path/to/paper.pdf")

# Extract all content with optimization
results = extract_all_content("path/to/paper.pdf")
print(f"Extracted {len(results.sections)} sections")
print(f"Found {len(results.figures)} figures")
print(f"Extracted {len(results.tables)} tables")

Advanced Usage with Configuration

from paper2data import (
    create_config_interactive,
    extract_all_content_optimized,
    MultiFormatExporter
)

# Create configuration interactively
config = create_config_interactive()

# Extract content with full optimization
results = extract_all_content_optimized(
    "path/to/paper.pdf",
    config=config,
    enable_parallel=True,
    enable_caching=True
)

# Export to multiple formats
exporter = MultiFormatExporter({
    "formats": ["html", "latex", "word"],
    "theme": "academic"
})
exporter.export_document(results, "output/")

Enhanced Plugin System v1.1

import asyncio

from paper2data import initialize_enhanced_plugin_system


async def main():
    # Initialize the enhanced plugin system
    system = initialize_enhanced_plugin_system({
        "auto_update_enabled": True,
        "health_monitoring_enabled": True
    })

    # Search the marketplace, then install a plugin
    # (install_plugin is awaited, so it must run inside a coroutine)
    results = system.search_plugins("latex", min_rating=4.0)
    await system.install_plugin("latex-processor")

    # Monitor system health
    metrics = system.get_system_metrics()
    print(f"Active plugins: {metrics.active_plugins}")


asyncio.run(main())

Mathematical Processing

from paper2data import EquationProcessor

# Process mathematical equations
processor = EquationProcessor()
equations = processor.extract_equations("path/to/paper.pdf")

for eq in equations:
    print(f"LaTeX: {eq.latex}")
    print(f"MathML: {eq.mathml}")
    print(f"Complexity: {eq.complexity_score}")

Advanced Figure Processing

from paper2data import AdvancedFigureProcessor

# Process figures with AI analysis
processor = AdvancedFigureProcessor()
figures = processor.process_figures("path/to/paper.pdf")

for fig in figures:
    print(f"Type: {fig.figure_type}")
    print(f"Caption: {fig.caption.text}")
    print(f"Quality: {fig.analysis.quality}")

Enhanced Metadata Extraction

from paper2data import EnhancedMetadataExtractor

# Extract comprehensive metadata
extractor = EnhancedMetadataExtractor()
metadata = extractor.extract_metadata("path/to/paper.pdf")

print(f"Title: {metadata.title}")
print(f"Authors: {[author.full_name for author in metadata.authors]}")
print(f"Institutions: {[inst.name for inst in metadata.institutions]}")
print(f"Funding: {[fund.name for fund in metadata.funding_sources]}")

🎯 Key Components

Core Extraction

  • PDFIngestor: Primary PDF processing engine
  • ContentExtractor: Comprehensive content extraction
  • SectionExtractor: Intelligent section detection
  • FigureExtractor: Image and figure processing
  • TableExtractor: Table detection and CSV conversion
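To make the "CSV conversion" step concrete, here is a minimal sketch of serializing already-detected table rows to CSV with the standard library. The `table_to_csv` helper and the sample rows are hypothetical; how TableExtractor actually represents tables is not documented here.

```python
import csv
import io


def table_to_csv(rows: list[list[str]]) -> str:
    """Serialize table rows to CSV text, padding short rows with empty cells."""
    width = max((len(row) for row in rows), default=0)
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # default dialect quotes cells containing commas
    for row in rows:
        writer.writerow(list(row) + [""] * (width - len(row)))
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical rows standing in for a table detected in a PDF
    print(table_to_csv([["Model", "Accuracy"], ["A, small", "0.91"], ["B"]]))
```

Padding ragged rows keeps every CSV line at the same column count, which matters because tables recovered from PDFs often have merged or missing cells.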

Advanced Processing

  • EquationProcessor: Mathematical content processing
  • AdvancedFigureProcessor: AI-powered figure analysis
  • EnhancedMetadataExtractor: Comprehensive metadata extraction
  • BibliographicParser: Citation and reference processing

Plugin System v1.1

  • PluginManager: Core plugin management
  • DependencyManager: Automatic dependency resolution
  • PluginMarketplace: Community plugin ecosystem
  • EnhancedPluginSystem: Unified management interface

Output & Export

  • MultiFormatExporter: Professional multi-format export
  • OutputFormatters: Specialized format converters
  • ConfigManager: Advanced configuration management

🔧 Configuration

Paper2Data uses YAML-based configuration with smart defaults:

processing:
  max_workers: 4
  enable_caching: true
  cache_size: 1000
  
extraction:
  extract_figures: true
  extract_tables: true
  extract_equations: true
  
output:
  base_dir: "./output"
  formats: ["html", "markdown"]
  
plugins:
  auto_update: true
  security_scan: true
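One way to think about "smart defaults" is as a merge of user overrides onto a default tree, with validation of keys and types. The sketch below does that with plain dictionaries that mirror the YAML above; the `merge_config` function and its validation rules are assumptions for illustration, not ConfigManager's actual behavior.

```python
from copy import deepcopy

# Defaults mirror the YAML example; merge and validation rules are illustrative assumptions.
DEFAULTS = {
    "processing": {"max_workers": 4, "enable_caching": True, "cache_size": 1000},
    "extraction": {"extract_figures": True, "extract_tables": True,
                   "extract_equations": True},
    "output": {"base_dir": "./output", "formats": ["html", "markdown"]},
    "plugins": {"auto_update": True, "security_scan": True},
}


def merge_config(user: dict, defaults: dict = DEFAULTS) -> dict:
    """Overlay user settings on the defaults, rejecting unknown keys and wrong types."""
    merged = deepcopy(defaults)  # never mutate the shared defaults
    for section, values in user.items():
        if section not in merged:
            raise KeyError(f"Unknown section: {section}")
        for key, value in values.items():
            if key not in merged[section]:
                raise KeyError(f"Unknown option: {section}.{key}")
            expected = type(merged[section][key])
            if type(value) is not expected:  # strict check so True is not accepted as int
                raise TypeError(
                    f"{section}.{key} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
            merged[section][key] = value
    return merged


if __name__ == "__main__":
    config = merge_config({"processing": {"max_workers": 8}})
    print(config["processing"])
```

Failing on unknown keys catches typos such as `max_worker` early, instead of silently falling back to the default.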

📊 Performance Features

  • Parallel Processing: Multi-threaded extraction
  • Intelligent Caching: Smart result caching
  • Memory Optimization: Efficient memory usage
  • Batch Processing: Process multiple documents
  • Progress Tracking: Real-time progress monitoring
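The combination of parallel processing, caching, batching, and progress reporting can be sketched with the standard library alone. Here `process_document` is a placeholder standing in for real extraction work, and `process_batch` is a hypothetical helper; neither is part of the paper2data API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import lru_cache


@lru_cache(maxsize=1000)
def process_document(path: str) -> dict:
    """Placeholder for real extraction; results are cached per path."""
    return {"path": path, "n_chars": len(path)}


def process_batch(paths: list[str], max_workers: int = 4) -> dict[str, dict]:
    """Process documents on a thread pool and report progress as each finishes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_document, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            results[path] = future.result()  # re-raises any worker exception
            print(f"[{done}/{len(futures)}] {path}")
    return results


if __name__ == "__main__":
    process_batch(["a.pdf", "b.pdf", "c.pdf"])
    process_batch(["a.pdf"])  # served from the cache
    print(process_document.cache_info())
```

Threads suit I/O-bound work such as downloads; CPU-heavy PDF parsing would more likely benefit from `ProcessPoolExecutor`, at the cost of an in-process cache no longer being shared between workers.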

🔌 Plugin Ecosystem

The enhanced plugin system v1.1 provides:

  • Plugin Marketplace: Discover and install community plugins
  • Dependency Management: Automatic dependency resolution
  • Security Scanning: Automated security validation
  • Health Monitoring: Real-time plugin performance tracking
  • Auto-Updates: Background plugin updates
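Automatic dependency resolution usually comes down to a topological sort of the plugin dependency graph. The following sketch uses Python's `graphlib` for that; the `install_order` helper and every plugin name other than `latex-processor` are hypothetical, and the actual DependencyManager may work differently.

```python
from graphlib import CycleError, TopologicalSorter


def install_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return plugin names ordered so every plugin follows its dependencies.

    `dependencies` maps each plugin to the set of plugins it requires.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes forming the cycle
        raise ValueError(f"Circular plugin dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    # Hypothetical dependency graph for illustration
    graph = {
        "latex-processor": {"math-core"},
        "figure-ocr": {"math-core", "image-utils"},
        "math-core": set(),
        "image-utils": set(),
    }
    print(install_order(graph))
```

Reporting the cycle explicitly gives the user something actionable when two plugins require each other, rather than a silent failure during installation.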

🧪 Testing & Quality

  • 100% Test Coverage: Comprehensive test suite
  • Type Hints: Full type annotation support
  • Linting: Code quality enforcement
  • Performance Testing: Benchmarking and optimization
  • Integration Testing: End-to-end validation

📚 Documentation

  • API Reference: Complete API documentation
  • Examples: Comprehensive usage examples
  • Tutorials: Step-by-step guides
  • Best Practices: Recommended patterns

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🚀 What's New in v1.1

Enhanced Plugin System

  • Plugin architecture with marketplace integration
  • Automatic dependency resolution and conflict management
  • Security scanning and health monitoring
  • Background auto-updates and performance analytics

Mathematical Processing

  • Advanced LaTeX equation detection and extraction
  • MathML conversion for web compatibility
  • Mathematical complexity analysis
  • Symbol recognition and validation
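For text that already contains LaTeX markup, equation detection and a crude complexity measure can be approximated with regular expressions. The patterns, the `find_equations` helper, and the "number of LaTeX commands" score below are all assumptions for illustration; they do not describe how EquationProcessor computes `complexity_score`, and detection from raw PDF layout is a harder problem than this.

```python
import re

# Illustrative patterns only; paper2data's real detection logic is not shown in this README.
EQUATION_PATTERN = re.compile(
    r"\\begin\{(equation|align)\*?\}(?P<env>.+?)\\end\{\1\*?\}"  # LaTeX environments
    r"|\$\$(?P<display>.+?)\$\$"  # $$ display math $$
    r"|(?<!\\)\$(?P<inline>[^$\n]+?)(?<!\\)\$",  # $ inline math $
    re.DOTALL,
)
COMMAND_PATTERN = re.compile(r"\\[A-Za-z]+")


def find_equations(text: str) -> list[dict]:
    """Return each equation with its kind, LaTeX body, and a naive complexity score."""
    equations = []
    for match in EQUATION_PATTERN.finditer(text):
        for kind in ("env", "display", "inline"):
            body = match.group(kind)
            if body is not None:
                latex = body.strip()
                equations.append({
                    "kind": kind,
                    "latex": latex,
                    # Assumed metric: count of LaTeX commands such as \frac or \int
                    "complexity": len(COMMAND_PATTERN.findall(latex)),
                })
                break
    return equations


if __name__ == "__main__":
    sample = r"Let $E = mc^2$ and $$\int_0^1 x\,dx = \frac{1}{2}$$"
    for equation in find_equations(sample):
        print(equation)
```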

Advanced Figure Processing

  • AI-powered figure classification
  • Automatic caption extraction with OCR fallback
  • Image quality assessment and analysis
  • Figure-text association and context analysis

Enhanced Metadata Extraction

  • Author disambiguation and institution detection
  • Funding source identification and categorization
  • Enhanced bibliographic data extraction
  • Cross-reference validation and enrichment

Multi-Format Export

  • Professional HTML export with interactive features
  • LaTeX reconstruction for academic submission
  • Microsoft Word compatibility
  • EPUB generation for e-book readers
  • Enhanced Markdown with rich formatting
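Multi-format export is commonly structured as a table of renderers keyed by format name. The sketch below shows that pattern for Markdown and HTML only; the `export` function, the renderer names, and the `(heading, body)` section shape are hypothetical and do not reflect MultiFormatExporter's internals.

```python
from html import escape

Section = tuple[str, str]  # (heading, body); an assumed shape for this example


def to_markdown(title: str, sections: list[Section]) -> str:
    parts = [f"# {title}"]
    parts += [f"## {heading}\n\n{body}" for heading, body in sections]
    return "\n\n".join(parts) + "\n"


def to_html(title: str, sections: list[Section]) -> str:
    parts = [f"<h1>{escape(title)}</h1>"]
    parts += [
        f"<h2>{escape(heading)}</h2>\n<p>{escape(body)}</p>"  # escape <, >, & in content
        for heading, body in sections
    ]
    return "\n".join(parts) + "\n"


RENDERERS = {"markdown": to_markdown, "html": to_html}


def export(title: str, sections: list[Section], formats: list[str]) -> dict[str, str]:
    """Render the document once per requested format."""
    unknown = set(formats) - set(RENDERERS)
    if unknown:
        raise ValueError(f"Unsupported formats: {sorted(unknown)}")
    return {fmt: RENDERERS[fmt](title, sections) for fmt in formats}


if __name__ == "__main__":
    output = export("Sample Paper", [("Introduction", "We show a < b.")],
                    ["markdown", "html"])
    print(output["html"])
```

Adding a format then means registering one more function in `RENDERERS`, without touching the export loop.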

Paper2Data v1.1 - Transform academic papers into structured data repositories with enterprise-grade processing capabilities.
