An intelligent cleaning pipeline for academic documents in AI for Science workflows, preparing MinerU-parsed data for LLM input

PorosData-Processor

Python 3.8+ | License: MIT

Academic Literature Data Engineering Toolkit - Specializes in cleaning MinerU outputs (scientific literature) and evaluating Token efficiency for LLMs, supporting the complete "AI for Science" workflow.

🎯 Project Positioning

In AI for Science, high-quality preprocessing of academic data is the foundation for models to understand scientific literature. PorosData-Processor is a data engineering tool that addresses the "last mile" between MinerU document parsing and LLM input: it reduces the Token count of academic documents as far as possible while preserving their content integrity.

🌟 Core Features

  • 🔬 Professional Academic Cleaning: Intelligently handles LaTeX formulas, control characters, citation formats, and other academic document-specific issues
  • ⚡ Multi-process Parallel Processing: Cross-platform concurrent processing built on pathlib and the spawn process start method, supporting Windows and Linux
  • 📊 Real-time Token Evaluation: Integrated GPT-2 tokenizer, providing precise Token compression rate calculations
  • 🛡️ Intelligent Protection Mechanism: 20% compression rate threshold protection, ensuring text semantic integrity
  • 🔄 Self-healing Quality Assurance: Automatic detection and repair of data corruption during processing
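
As a rough illustration of the Token evaluation feature, the sketch below computes a compression rate as the cleaned token count divided by the original token count. That definition is an assumption inferred from the pilot results (a rate of 0.098 reported as 90.2% savings); the function name `token_compression_rate` and the `encode` parameter are hypothetical and are not part of the package's API.

```python
from typing import Callable, Sequence


def token_compression_rate(
    original: str,
    cleaned: str,
    encode: Callable[[str], Sequence],
) -> float:
    """Return cleaned-token count divided by original-token count.

    Assumption: lower values mean more tokens were removed by cleaning.
    """
    original_tokens = len(encode(original))
    if original_tokens == 0:
        return 0.0  # nothing to compress; avoid division by zero
    return len(encode(cleaned)) / original_tokens


# Whitespace splitting stands in for a real tokenizer in this example.
raw = "Fig. 1 shows the pore size distribution [1] [2] [3]"
clean = "Fig. 1 shows the pore size distribution"
rate = token_compression_rate(raw, clean, str.split)
print(f"compression rate: {rate:.2f}")
```

To count GPT-2 tokens instead, `tiktoken.get_encoding("gpt2").encode` can be passed as `encode`; tiktoken fetches the encoding files on first use, so that variant needs network access once.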

🚀 Quick Start

Environment Requirements

pip install porosdata-processor ijson tiktoken psutil

Process MinerU Data

# Process all JSON files in the data/mineru_output_raw_data directory
# Automatically configure HF_ENDPOINT mirror for Chinese users to accelerate downloads
python run_processor.py --enable-evaluation

Output Data Format

Processed JSON files contain standardized fields:

  • text: Cleaned academic text, suitable for LLM input
  • original_text: Original input text, for quality comparison
  • healed_count: Self-healing repair count, reflecting data quality
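
A minimal sketch of reading these fields back in Python follows. It assumes each processed file holds a top-level JSON list of objects, which the description above does not confirm; `load_records` and `summarize` are hypothetical helper names, not functions shipped with the package.

```python
import json
from pathlib import Path


def load_records(path):
    """Load one processed file; assumes a top-level JSON list of objects."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of records")
    return data


def summarize(records):
    """Count records, total self-healing repairs, and characters removed."""
    return {
        "items": len(records),
        "healed_total": sum(r.get("healed_count", 0) for r in records),
        "chars_removed": sum(
            len(r.get("original_text", "")) - len(r.get("text", ""))
            for r in records
        ),
    }
```

Comparing `original_text` against `text` per record is a quick way to spot items where cleaning removed more than expected.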

📈 Performance Showcase

Materials Science Data Pilot Test Results

Test Metric                | Value     | Description
Files Processed            | 3 files   | MinerU-parsed academic paper JSON files
Items Processed            | 127 items | Including text, formulas, tables, and other structured content
Avg Token Compression Rate | 0.098     | 90.2% Token savings
Processing Time            | 0.456 s   | Multi-process parallel processing efficiency
Memory Peak                | 204.2 MB  | Streaming processing ensures memory efficiency

Key Insights:

  • Token compression rate reaches 0.098 (90.2% reduction), significantly reducing LLM inference costs
  • Processing speed of 278.5 items/second, suitable for large-scale academic data processing
  • Memory peak usage of only 204.2MB, supporting TB-level data processing
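
The derived figures above follow directly from the pilot measurements. The short check below reproduces them; the variable names are illustrative only, and the savings formula assumes the compression rate is the ratio of cleaned to original tokens.

```python
# Values taken from the pilot test results reported above.
items_processed = 127
processing_time_s = 0.456
compression_rate = 0.098  # assumed: cleaned tokens / original tokens

token_savings_pct = (1 - compression_rate) * 100
throughput = items_processed / processing_time_s

print(f"Token savings: {token_savings_pct:.1f}%")      # Token savings: 90.2%
print(f"Throughput: {throughput:.1f} items/second")    # Throughput: 278.5 items/second
```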

🛠️ Academic Tools Suite

The project includes a complete academic data processing toolchain:

Core Processing Scripts

  • Batch Processing: academic_tools/standalone/batch_process.py
  • Single File Processing: academic_tools/standalone/process_single_json.py
  • Compatibility Processing: academic_tools/standalone/process_with_cleanlit.py

Advanced Configuration Options

# Custom input/output directories
python run_processor.py --input-dir ./data/input --output-dir ./data/output

# Basic cleaning only (no Token evaluation)
python run_processor.py

# Force reprocessing of all files
python run_processor.py --enable-evaluation --force-reprocess
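
The same options can be driven from a Python script. The sketch below only assembles and runs the command lines shown above; `build_command` and `run_processor` are hypothetical wrappers, and the code assumes it is run from the directory that contains run_processor.py.

```python
import subprocess
import sys


def build_command(input_dir=None, output_dir=None,
                  enable_evaluation=False, force_reprocess=False):
    """Assemble the run_processor.py command line from the documented flags."""
    cmd = [sys.executable, "run_processor.py"]
    if input_dir is not None:
        cmd += ["--input-dir", str(input_dir)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if enable_evaluation:
        cmd.append("--enable-evaluation")
    if force_reprocess:
        cmd.append("--force-reprocess")
    return cmd


def run_processor(**options):
    """Run the processor and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(**options), check=True)
```

For example, `run_processor(enable_evaluation=True, force_reprocess=True)` corresponds to the last command above.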

📚 Technical Documentation

🔬 Technical Specifications

Supported Data Formats

  • Input: MinerU-parsed JSON format academic documents
  • Output: Standardized JSON, compatible with LLM training and inference
  • Encoding: UTF-8 cross-platform support, automatic control character handling
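
To make the control character handling concrete, here is one common way to strip such characters from UTF-8 text. It is a sketch under assumptions, not the package's actual cleaning rules: which characters the tool removes is not documented here, and `strip_control_chars` is a hypothetical name. This version drops the C0 control characters and DEL while keeping tab, newline, and carriage return.

```python
import re

# C0 control characters and DEL, excluding \t (0x09), \n (0x0a), and \r (0x0d).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Remove non-printing control characters; leave all other text untouched."""
    return _CONTROL_CHARS.sub("", text)


sample = "pore\x00 size\x07 = 2 nm\n"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

Non-ASCII characters such as Greek letters and subscripts in formulas are not matched by the pattern, so they pass through unchanged.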

Quality Assurance Mechanisms

  • Compression Rate Protection: ≤20% threshold ensures text semantic integrity
  • Boundary Self-healing: Automatic detection and repair of Shield protection anomalies
  • Integrity Auditing: Multi-layer verification ensures data quality

📄 Open Source License

MIT License - see LICENSE file for details

📖 Citation

If you use PorosData-Processor in your research, please cite:

@software{porosdata_processor,
  title = {PorosData-Processor: Academic Document Intelligent Cleaning Pipeline},
  author = {YE, Kivent},
  year = {2025},
  url = {https://github.com/KiventYip/PorosData-doc},
  version = {0.2.2}
}

🤝 Contributions and Feedback

Issues and Pull Requests are welcome! The project adopts a data-driven development philosophy, and any performance optimization suggestions will be seriously considered.
