
Intelligent cleaning pipeline for academic documents in AI for Science, ensuring MinerU-parsed data meets LLM input standards


PorosData-Processor

Python 3.8+ | License: MIT

Academic Literature Data Engineering Toolkit - Specializes in cleaning MinerU outputs (scientific literature) and evaluating Token efficiency for LLMs, supporting the complete "AI for Science" workflow.

🎯 Project Positioning

In the AI for Science field, high-quality academic data preprocessing is the foundation for models to understand scientific literature. PorosData-Processor serves as a data engineering tool that specifically addresses the "last mile" problem from MinerU document parsing to LLM input, ensuring academic documents achieve maximum Token efficiency while maintaining integrity.

🌟 Core Features

  • 🔬 Professional Academic Cleaning: Intelligently handles LaTeX formulas, control characters, citation formats, and other academic document-specific issues
  • ⚡ Multi-process Parallel Processing: Cross-platform concurrent processing built on pathlib and the spawn start method, supporting Windows and Linux
  • 📊 Real-time Token Evaluation: Integrated GPT-2 tokenizer, providing precise Token compression rate calculations
  • 🛡️ Intelligent Protection Mechanism: 20% compression rate threshold protection, ensuring text semantic integrity
  • 🔄 Self-healing Quality Assurance: Automatic detection and repair of data corruption during processing
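
To make the Token evaluation feature concrete, here is a minimal sketch of how a Token compression rate could be computed. It is not the package's implementation: it assumes the rate is defined as cleaned tokens divided by original tokens (which matches a rate of 0.098 corresponding to 90.2% savings), and `compression_rate` and `whitespace_encode` are hypothetical names. The whitespace tokenizer is only a stand-in so the example runs without downloads.

```python
from typing import Callable, List, Sequence


def whitespace_encode(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split on whitespace."""
    return text.split()


def compression_rate(
    original: str,
    cleaned: str,
    encode: Callable[[str], Sequence] = whitespace_encode,
) -> float:
    """Return cleaned_tokens / original_tokens (assumed definition)."""
    original_tokens = len(encode(original))
    if original_tokens == 0:
        return 1.0  # empty input: nothing was compressed
    return len(encode(cleaned)) / original_tokens


# Hypothetical before/after pair, for illustration only
raw = "The  result \x00 is shown in Fig. 1 [12] [13] [14]"
clean = "The result is shown in Fig. 1"

rate = compression_rate(raw, clean)
print(f"rate={rate:.3f}, savings={1 - rate:.1%}")
```

To count GPT-2 tokens instead, a real encoder such as `tiktoken.get_encoding("gpt2").encode` could be passed as `encode`. Whether the package uses tiktoken for this step is an assumption.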

🚀 Quick Start

Environment Requirements

pip install porosdata-processor ijson tiktoken psutil

Process MinerU Data

# Process all JSON files in the data/mineru_output_raw_data directory
# The HF_ENDPOINT mirror is configured automatically to speed up downloads for users in China
python run_processor.py --enable-evaluation

Output Data Format

Processed JSON files contain standardized fields:

  • text: Cleaned academic text, suitable for LLM input
  • original_text: Original input text, for quality comparison
  • healed_count: Self-healing repair count, reflecting data quality
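
The fields above can be inspected with a few lines of Python. The following is a sketch under assumptions: the source does not state the top-level layout of the output file, so a JSON array of objects is assumed here, and both `summarize_records` and the sample records are hypothetical.

```python
import json
from typing import Dict, List


def summarize_records(records: List[Dict]) -> Dict[str, int]:
    """Aggregate a few quality indicators from processed records."""
    return {
        "items": len(records),
        "healed_total": sum(int(r.get("healed_count", 0)) for r in records),
        "chars_before": sum(len(r.get("original_text", "")) for r in records),
        "chars_after": sum(len(r.get("text", "")) for r in records),
    }


# Hypothetical output content (assumed to be a JSON array of objects)
sample_json = """
[
  {"text": "E = mc^2", "original_text": "E  =  mc^2", "healed_count": 0},
  {"text": "Pore size: 2 nm", "original_text": "Pore size:  2 nm [1]", "healed_count": 1}
]
"""

records = json.loads(sample_json)
summary = summarize_records(records)
print(summary)
```

Comparing `text` against `original_text` per record is a quick way to spot items where cleaning removed more than expected.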

📈 Performance Showcase

Materials Science Data Pilot Test Results

| Test Metric | Value | Description |
| --- | --- | --- |
| Files Processed | 3 files | MinerU-parsed academic paper JSON files |
| Items Processed | 127 items | Including text, formulas, tables, and other structured content |
| Avg Token Compression Rate | 0.098 | 90.2% Token savings |
| Processing Time | 0.456 s | Multi-process parallel processing efficiency |
| Memory Peak | 204.2 MB | Streaming processing ensures memory efficiency |

Key Insights:

  • Token compression rate reaches 0.098 (90.2% reduction), significantly reducing LLM inference costs
  • Processing speed of 278.5 items/second, suitable for large-scale academic data processing
  • Peak memory usage of only 204.2 MB in this test; streaming processing is designed to keep memory bounded so the pipeline can scale toward TB-level data
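
The derived figures follow directly from the table values. A short check, using only the numbers reported above (variable names are illustrative):

```python
# Values taken from the pilot test table
items_processed = 127
elapsed_seconds = 0.456
avg_compression_rate = 0.098

# Savings is the complement of the compression rate
token_savings = 1.0 - avg_compression_rate

# Throughput is items divided by wall-clock processing time
throughput = items_processed / elapsed_seconds

print(f"Token savings: {token_savings:.1%}")          # 90.2%
print(f"Throughput: {throughput:.1f} items/second")   # 278.5 items/second
```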

🛠️ Academic Tools Suite

The project includes a complete academic data processing toolchain:

Core Processing Scripts

  • Batch Processing: academic_tools/standalone/batch_process.py
  • Single File Processing: academic_tools/standalone/process_single_json.py
  • Compatibility Processing: academic_tools/standalone/process_with_cleanlit.py

Advanced Configuration Options

# Custom input/output directories
python run_processor.py --input-dir ./data/input --output-dir ./data/output

# Basic cleaning only (no Token evaluation)
python run_processor.py

# Force reprocessing of all files
python run_processor.py --enable-evaluation --force-reprocess

📚 Technical Documentation

🔬 Technical Specifications

Supported Data Formats

  • Input: MinerU-parsed JSON format academic documents
  • Output: Standardized JSON, compatible with LLM training and inference
  • Encoding: UTF-8 cross-platform support, automatic control character handling
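
As an illustration of what control character handling can involve, the sketch below removes Unicode control characters while keeping newlines and tabs. This is a generic technique, not the package's actual cleaning logic; `strip_control_chars` and the choice of characters to keep are assumptions.

```python
import unicodedata

# Whitespace controls that are usually meaningful in text (assumption)
KEEP = {"\n", "\t"}


def strip_control_chars(text: str) -> str:
    """Drop characters in Unicode category Cc, except those in KEEP."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )


# Hypothetical parser output containing NUL, BEL, and a carriage return
dirty = "Pore\x00 size\x07:\t2\u00a0nm\r\n"
print(repr(strip_control_chars(dirty)))
```

Non-breaking spaces and other non-ASCII characters are left untouched, since they belong to different Unicode categories and are often legitimate in scientific text.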

Quality Assurance Mechanisms

  • Compression Rate Protection: ≤20% threshold ensures text semantic integrity
  • Boundary Self-healing: Automatic detection and repair of Shield protection anomalies
  • Integrity Auditing: Multi-layer verification ensures data quality

📄 Open Source License

MIT License - see LICENSE file for details

📖 Citation

If you use PorosData-Processor in your research, please cite:

@software{porosdata_processor,
  title = {PorosData-Processor: Academic Document Intelligent Cleaning Pipeline},
  author = {YE, Kivent},
  year = {2025},
  url = {https://github.com/KiventYip/PorosData-doc},
  version = {0.2.4}
}

🤝 Contributions and Feedback

Issues and Pull Requests are welcome! The project adopts a data-driven development philosophy, and any performance optimization suggestions will be seriously considered.
