PDF parsing and content extraction for academic papers

Paper2Data Parser

A powerful Python library for extracting and parsing content from academic papers. Transform PDF files, arXiv papers, and DOI-referenced documents into structured, searchable data repositories.

🚀 Features

📄 Multi-format Input: PDF files, arXiv URLs, DOI resolution with automatic retrieval
🔍 Intelligent Parsing: Advanced section detection, table extraction to CSV, figure processing
🌐 API Integration: Live arXiv and CrossRef DOI resolution with metadata enrichment
⚡ Performance Optimized: Rate limiting, caching, and batch processing capabilities
🔧 Advanced Configuration: YAML-based configuration with smart defaults and validation
🔌 Enhanced Plugin System v1.1: Marketplace, dependency management, and auto-updates
🧮 Mathematical Processing: LaTeX equation detection, conversion, and MathML support
🖼️ Advanced Figure Processing: AI-powered figure classification and caption extraction
📚 Enhanced Metadata: Institution detection, author disambiguation, and funding information
📖 Bibliographic Parsing: Citation style detection, reference normalization, and network analysis
🎨 Multi-Format Export: HTML, LaTeX, Word, EPUB, Markdown with professional templates
🧪 Production Ready: 100% test coverage with comprehensive quality assurance
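
The multi-format input listed above implies some way of telling a local PDF path from an arXiv reference or a DOI. A minimal sketch of such a dispatch step follows; `classify_source` and both regular expressions are illustrative assumptions, not part of the paper2data API, and the patterns cover only the common modern form of each identifier.

```python
import re

# Hypothetical helper, not part of paper2data: guesses what kind of input a string is.
ARXIV_RE = re.compile(
    r"(?:arxiv\.org/(?:abs|pdf)/|^arxiv:)(\d{4}\.\d{4,5})(?:v\d+)?",
    re.IGNORECASE,
)
DOI_RE = re.compile(r"(10\.\d{4,9}/\S+)")


def classify_source(source: str) -> tuple[str, str]:
    """Return (kind, identifier), where kind is 'arxiv', 'doi', or 'pdf'."""
    text = source.strip()
    match = ARXIV_RE.search(text)
    if match:
        return "arxiv", match.group(1)
    match = DOI_RE.search(text)
    if match:
        return "doi", match.group(1)
    # Anything else is treated as a path to a local PDF file.
    return "pdf", text


for candidate in (
    "https://arxiv.org/abs/1706.03762",
    "https://doi.org/10.1000/182",
    "path/to/paper.pdf",
):
    print(classify_source(candidate))
```

Checking for arXiv before DOI is a deliberate choice in this sketch, since an arXiv URL is the more specific pattern.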

📦 Installation

# Install the latest version
pip install paper2data-parser

# Install with API integration dependencies
# (quotes keep shells such as zsh from expanding the brackets)
pip install "paper2data-parser[api]"

# Install with development dependencies
pip install "paper2data-parser[dev]"

🛠️ Quick Start

Basic Usage

from paper2data import PDFIngestor, extract_all_content

# Initialize ingestor
ingestor = PDFIngestor()

# Extract content from a PDF
content = ingestor.ingest("path/to/paper.pdf")

# Extract all content with optimization
results = extract_all_content("path/to/paper.pdf")
print(f"Extracted {len(results.sections)} sections")
print(f"Found {len(results.figures)} figures")
print(f"Extracted {len(results.tables)} tables")

Advanced Usage with Configuration

from paper2data import (
    create_config_interactive,
    extract_all_content_optimized,
    MultiFormatExporter
)

# Create configuration interactively
config = create_config_interactive()

# Extract content with full optimization
results = extract_all_content_optimized(
    "path/to/paper.pdf",
    config=config,
    enable_parallel=True,
    enable_caching=True
)

# Export to multiple formats
exporter = MultiFormatExporter({
    "formats": ["html", "latex", "word"],
    "theme": "academic"
})
exporter.export_document(results, "output/")

Enhanced Plugin System v1.1

import asyncio

from paper2data import initialize_enhanced_plugin_system


async def main():
    # Initialize the enhanced plugin system
    system = initialize_enhanced_plugin_system({
        "auto_update_enabled": True,
        "health_monitoring_enabled": True
    })

    # Search and install plugins
    results = system.search_plugins("latex", min_rating=4.0)
    await system.install_plugin("latex-processor")

    # Monitor system health
    metrics = system.get_system_metrics()
    print(f"Active plugins: {metrics.active_plugins}")


# `await` is only valid inside a coroutine, so the steps run through asyncio.run
asyncio.run(main())

Mathematical Processing

from paper2data import EquationProcessor

# Process mathematical equations
processor = EquationProcessor()
equations = processor.extract_equations("path/to/paper.pdf")

for eq in equations:
    print(f"LaTeX: {eq.latex}")
    print(f"MathML: {eq.mathml}")
    print(f"Complexity: {eq.complexity_score}")

Advanced Figure Processing

from paper2data import AdvancedFigureProcessor

# Process figures with AI analysis
processor = AdvancedFigureProcessor()
figures = processor.process_figures("path/to/paper.pdf")

for fig in figures:
    print(f"Type: {fig.figure_type}")
    print(f"Caption: {fig.caption.text}")
    print(f"Quality: {fig.analysis.quality}")

Enhanced Metadata Extraction

from paper2data import EnhancedMetadataExtractor

# Extract comprehensive metadata
extractor = EnhancedMetadataExtractor()
metadata = extractor.extract_metadata("path/to/paper.pdf")

print(f"Title: {metadata.title}")
print(f"Authors: {[author.full_name for author in metadata.authors]}")
print(f"Institutions: {[inst.name for inst in metadata.institutions]}")
print(f"Funding: {[fund.name for fund in metadata.funding_sources]}")

🎯 Key Components

Core Extraction

PDFIngestor: Primary PDF processing engine
ContentExtractor: Comprehensive content extraction
SectionExtractor: Intelligent section detection
FigureExtractor: Image and figure processing
TableExtractor: Table detection and CSV conversion
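
TableExtractor is described as converting detected tables to CSV. The sketch below shows only that final serialization step with the standard library; `table_to_csv` and the sample rows are made up for illustration, and the extractor's real output format is not documented here.

```python
import csv
import io


def table_to_csv(rows: list[list[str]]) -> str:
    """Serialize a table (a list of rows of cell strings) to CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms;
    # the csv module quotes cells that contain commas or quotes.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative rows only; a real table would come from the PDF.
rows = [
    ["Method", "Setting", "Notes"],
    ["A", "default", "first run, no tuning"],
    ["B", "custom", "second run"],
]
print(table_to_csv(rows))
```

Using the `csv` module rather than joining cells with commas matters as soon as a cell contains a comma, as the second row here does.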

Advanced Processing

EquationProcessor: Mathematical content processing
AdvancedFigureProcessor: AI-powered figure analysis
EnhancedMetadataExtractor: Comprehensive metadata extraction
BibliographicParser: Citation and reference processing

Plugin System v1.1

PluginManager: Core plugin management
DependencyManager: Automatic dependency resolution
PluginMarketplace: Community plugin ecosystem
EnhancedPluginSystem: Unified management interface

Output & Export

MultiFormatExporter: Professional multi-format export
OutputFormatters: Specialized format converters
ConfigManager: Advanced configuration management

🔧 Configuration

Paper2Data uses YAML-based configuration with smart defaults:

processing:
  max_workers: 4
  enable_caching: true
  cache_size: 1000
  
extraction:
  extract_figures: true
  extract_tables: true
  extract_equations: true
  
output:
  base_dir: "./output"
  formats: ["html", "markdown"]
  
plugins:
  auto_update: true
  security_scan: true
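
One way "smart defaults" can work is to layer a user's partial configuration over a full set of defaults and then validate the result. The sketch below assumes that approach; `merge_config`, `validate`, and the single validation rule are hypothetical and do not describe ConfigManager's actual behavior. The dictionaries stand in for parsed YAML.

```python
import copy

# Defaults mirror the YAML above; overrides would normally come from a parsed YAML file.
DEFAULTS = {
    "processing": {"max_workers": 4, "enable_caching": True, "cache_size": 1000},
    "extraction": {
        "extract_figures": True,
        "extract_tables": True,
        "extract_equations": True,
    },
    "output": {"base_dir": "./output", "formats": ["html", "markdown"]},
    "plugins": {"auto_update": True, "security_scan": True},
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults updated with overrides, recursing into nested sections."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def validate(config: dict) -> None:
    """Reject an obviously invalid value (illustrative rule, not the library's)."""
    workers = config["processing"]["max_workers"]
    if not isinstance(workers, int) or workers < 1:
        raise ValueError(f"processing.max_workers must be a positive integer, got {workers!r}")


user_overrides = {"processing": {"max_workers": 8}, "output": {"formats": ["latex"]}}
config = merge_config(DEFAULTS, user_overrides)
validate(config)
print(config["processing"])
print(config["output"])
```

Lists such as `formats` are replaced rather than concatenated in this sketch, so a user who asks for one format gets exactly that format.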

📊 Performance Features

Parallel Processing: Multi-threaded extraction
Intelligent Caching: Smart result caching
Memory Optimization: Efficient memory usage
Batch Processing: Process multiple documents
Progress Tracking: Real-time progress monitoring
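
The combination of multi-threaded extraction, result caching, and batch processing can be sketched with the standard library alone. Everything named below (`extract_stub`, `process_batch`) is a stand-in invented for illustration; the stub does no PDF work, and the library's own scheduling and cache are not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1000)
def extract_stub(path: str) -> dict:
    """Placeholder for a real extraction call; returns a fake summary."""
    return {"path": path, "name_length": len(path)}


def process_batch(paths: list[str], max_workers: int = 4) -> list[dict]:
    """Process documents concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `paths` regardless of completion order.
        return list(pool.map(extract_stub, paths))


paths = ["a.pdf", "b.pdf", "a.pdf"]
results = process_batch(paths)
print([r["path"] for r in results])
```

Threads suit work that waits on disk or network; CPU-bound parsing in pure Python would more likely need a process pool, which is outside the scope of this sketch.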

🔌 Plugin Ecosystem

The enhanced plugin system v1.1 provides:

Plugin Marketplace: Discover and install community plugins
Dependency Management: Automatic dependency resolution
Security Scanning: Automated security validation
Health Monitoring: Real-time plugin performance tracking
Auto-Updates: Background plugin updates
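
Dependency resolution of the kind listed above usually reduces to ordering plugins so that each one is installed after the plugins it needs. The sketch below uses `graphlib` from the standard library (Python 3.9+). Apart from `latex-processor`, the plugin names and the dependency map are hypothetical, and this is not a description of how DependencyManager is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical plugins mapped to the plugins they depend on.
dependencies = {
    "latex-processor": {"math-core"},
    "figure-classifier": {"image-utils"},
    "math-core": set(),
    "image-utils": set(),
}


def install_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every plugin comes after its dependencies."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular plugin dependency: {exc.args[1]}") from exc


order = install_order(dependencies)
print(order)
```

Several valid orders exist for this graph; the only guarantee is that a dependency precedes the plugin that needs it.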

🧪 Testing & Quality

100% Test Coverage: Comprehensive test suite
Type Hints: Full type annotation support
Linting: Code quality enforcement
Performance Testing: Benchmarking and optimization
Integration Testing: End-to-end validation

📚 Documentation

API Reference: Complete API documentation
Examples: Comprehensive usage examples
Tutorials: Step-by-step guides
Best Practices: Recommended patterns

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links

Homepage: https://github.com/paper2data/paper2data
Documentation: https://paper2data.readthedocs.io
PyPI: https://pypi.org/project/paper2data-parser/
Issues: https://github.com/paper2data/paper2data/issues

🚀 What's New in v1.1

Enhanced Plugin System

Plugin architecture with marketplace integration
Automatic dependency resolution and conflict management
Security scanning and health monitoring
Background auto-updates and performance analytics

Mathematical Processing

Advanced LaTeX equation detection and extraction
MathML conversion for web compatibility
Mathematical complexity analysis
Symbol recognition and validation

Advanced Figure Processing

AI-powered figure classification
Automatic caption extraction with OCR fallback
Image quality assessment and analysis
Figure-text association and context analysis

Enhanced Metadata Extraction

Author disambiguation and institution detection
Funding source identification and categorization
Enhanced bibliographic data extraction
Cross-reference validation and enrichment

Multi-Format Export

Professional HTML export with interactive features
LaTeX reconstruction for academic submission
Microsoft Word compatibility
EPUB generation for e-book readers
Enhanced Markdown with rich formatting

Paper2Data v1.1 - Transform academic papers into structured data repositories with enterprise-grade processing capabilities.

Release history

1.1.2: Jul 12, 2025
1.1.1: Jul 12, 2025
1.1.0: Jul 12, 2025 (this version)

Download files

Source Distribution

paper2data_parser-1.1.0.tar.gz (235.3 kB)

Uploaded Jul 12, 2025 Source

Built Distribution

paper2data_parser-1.1.0-py3-none-any.whl (217.5 kB)

Uploaded Jul 12, 2025 Python 3

File details

Details for the file paper2data_parser-1.1.0.tar.gz.

File metadata

Download URL: paper2data_parser-1.1.0.tar.gz
Upload date: Jul 12, 2025
Size: 235.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for paper2data_parser-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c393b5341946d517d769008259cba8679b2c4dd86b10c79fc7a239dccccdddf8`
MD5	`571af2a1e7d494e9d15ba0e5465d42c8`
BLAKE2b-256	`b45ad9d769fd26498e6746bc4b916e97d647c8e0d4dcccb949b19c85bb686487`
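
The published digests can be used to check a download before installing it. A minimal sketch using only the standard library follows; `sha256_matches` is an illustrative helper name, and the path assumes the archive was saved to the current directory.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's SHA256 digest to the expected hex string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())


if __name__ == "__main__":
    # SHA256 value copied from the table above.
    EXPECTED = "c393b5341946d517d769008259cba8679b2c4dd86b10c79fc7a239dccccdddf8"
    archive = Path("paper2data_parser-1.1.0.tar.gz")  # assumed download location
    if archive.exists():
        print("hash OK" if sha256_matches(archive, EXPECTED) else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```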

File details

Details for the file paper2data_parser-1.1.0-py3-none-any.whl.

File metadata

Download URL: paper2data_parser-1.1.0-py3-none-any.whl
Upload date: Jul 12, 2025
Size: 217.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for paper2data_parser-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e9605b660a59f70d664a94b40311e7580832b756704ac5d60b5b0f2ba2a9024`
MD5	`cbc526879dc7f3e85d14166421ef145f`
BLAKE2b-256	`208c471d62ed7588300d1e3ff84d654ec2949001e6e90e8da2685a6efd824f59`
