An intelligent cleaning pipeline for academic documents in AI for Science, ensuring that MinerU-parsed data meets LLM input standards

PorosData-Processor

Academic Literature Data Engineering Toolkit: cleans MinerU output for scientific literature and evaluates token efficiency for LLMs, supporting the complete "AI for Science" workflow.

🎯 Project Positioning

In the AI for Science field, high-quality academic data preprocessing is the foundation for models to understand scientific literature. PorosData-Processor serves as a data engineering tool that specifically addresses the "last mile" problem from MinerU document parsing to LLM input, ensuring academic documents achieve maximum Token efficiency while maintaining integrity.

🌟 Core Features

🔬 Professional Academic Cleaning: Intelligently handles LaTeX formulas, control characters, citation formats, and other academic document-specific issues
⚡ Multi-process Parallel Processing: Cross-platform concurrent processing built on pathlib paths and the multiprocessing spawn start method, supporting Windows and Linux
📊 Real-time Token Evaluation: Integrated GPT-2 tokenizer, providing precise Token compression rate calculations
🛡️ Intelligent Protection Mechanism: A 20% compression rate threshold protects the semantic integrity of the text
🔄 Self-healing Quality Assurance: Automatic detection and repair of data corruption during processing
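
The spawn-based parallelism mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `count_items` and `process_directory` are hypothetical names, and the worker only counts top-level JSON items where the real pipeline would clean them.

```python
import json
import multiprocessing as mp
from pathlib import Path


def count_items(path_str: str) -> tuple[str, int]:
    """Worker (hypothetical): load one JSON file, return (file name, item count)."""
    path = Path(path_str)
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    return path.name, len(data)


def process_directory(input_dir: Path, workers: int = 2) -> dict[str, int]:
    """Fan the JSON files of a directory out to a spawn-based process pool."""
    files = sorted(str(p) for p in input_dir.glob("*.json"))
    if not files:
        return {}  # nothing to do, so no pool is started
    # "spawn" behaves the same on Windows and Linux, unlike the fork default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=workers) as pool:
        return dict(pool.map(count_items, files))


if __name__ == "__main__":
    # The guard is required with spawn: child processes re-import this module.
    print(process_directory(Path("data/mineru_output_raw_data")))
```

Paths are passed to workers as plain strings and rebuilt with `Path` inside the worker, which keeps the arguments trivially picklable on every platform.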

🚀 Quick Start

Environment Requirements

pip install porosdata-processor ijson tiktoken psutil

Process MinerU Data

# Process all JSON files in the data/mineru_output_raw_data directory
# The HF_ENDPOINT mirror is configured automatically to speed up downloads for users in China
python run_processor.py --enable-evaluation

Output Data Format

Processed JSON files contain standardized fields:

text: Cleaned academic text, suitable for LLM input
original_text: Original input text, for quality comparison
healed_count: Self-healing repair count, reflecting data quality
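
As a rough illustration of consuming these fields, the sketch below checks that each record carries the three documented keys and aggregates a few numbers. It assumes, without confirmation from the project, that an output file is a JSON array of such records; `summarize` and the sample record are hypothetical.

```python
import json

REQUIRED_FIELDS = ("text", "original_text", "healed_count")


def summarize(records: list[dict]) -> dict:
    """Check the three documented fields and aggregate simple quality numbers."""
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            raise KeyError(f"record {i} is missing fields: {missing}")
    return {
        "records": len(records),
        "healed_total": sum(r["healed_count"] for r in records),
        "chars_before": sum(len(r["original_text"]) for r in records),
        "chars_after": sum(len(r["text"]) for r in records),
    }


# Hypothetical sample in the assumed layout (a JSON array of records).
sample = json.loads(
    '[{"text": "E = mc^2", "original_text": "E = mc^2\\u0000\\u0000", "healed_count": 1}]'
)
print(summarize(sample))
```

Comparing `text` against `original_text` per record is a cheap way to spot files where cleaning removed more than expected.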

📈 Performance Showcase

Materials Science Data Pilot Test Results

Test Metric	Value	Description
Files Processed	3 files	MinerU-parsed academic paper JSON files
Items Processed	127 items	Including text, formulas, tables, and other structured content
Avg Token Compression Rate	0.098	90.2% Token Savings
Processing Time	0.456s	Multi-process parallel processing efficiency
Memory Peak	204.2MB	Streaming processing ensures memory efficiency

Key Insights:

Token compression rate reaches 0.098 (90.2% reduction), significantly reducing LLM inference costs
Processing speed of 278.5 items/second, suitable for large-scale academic data processing
Peak memory usage of only 204.2MB in this test; streaming processing is designed to keep memory bounded for TB-level data processing
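
The derived figures above follow from the table by simple arithmetic. The sketch below reproduces them, assuming (as the pairing of 0.098 with 90.2% savings suggests) that the compression rate is cleaned tokens divided by original tokens. The function names are illustrative only.

```python
def compression_rate(original_tokens: int, cleaned_tokens: int) -> float:
    """Assumed definition: cleaned token count divided by original token count."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return cleaned_tokens / original_tokens


def token_savings(rate: float) -> float:
    """Share of tokens removed for a given compression rate."""
    return 1.0 - rate


def throughput(items: int, seconds: float) -> float:
    """Items processed per second."""
    return items / seconds


print(f"{token_savings(0.098):.1%}")     # 90.2%
print(f"{throughput(127, 0.456):.1f}")   # 278.5
```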

🛠️ Academic Tools Suite

The project includes a complete academic data processing toolchain:

Core Processing Scripts

Batch Processing: academic_tools/standalone/batch_process.py
Single File Processing: academic_tools/standalone/process_single_json.py
Compatibility Processing: academic_tools/standalone/process_with_cleanlit.py

Advanced Configuration Options

# Custom input/output directories
python run_processor.py --input-dir ./data/input --output-dir ./data/output

# Basic cleaning only (no Token evaluation)
python run_processor.py

# Force reprocessing of all files
python run_processor.py --enable-evaluation --force-reprocess

📚 Technical Documentation

Architecture Design - Core components and implementation principles
Usage Guide - Detailed API and configuration instructions
Testing Guide - Development and testing environment setup

🔬 Technical Specifications

Supported Data Formats

Input: MinerU-parsed JSON format academic documents
Output: Standardized JSON, compatible with LLM training and inference
Encoding: UTF-8 cross-platform support, automatic control character handling
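
Control character handling of the kind listed above might look like the following. This is only a sketch under stated assumptions: the README does not say which characters the project removes, so the character class and the names `strip_control_chars` and `read_utf8` are illustrative.

```python
import re

# C0 control characters except tab (\x09), newline (\x0a) and
# carriage return (\x0d), plus DEL (\x7f).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Remove non-printing control characters while keeping ordinary whitespace."""
    return _CONTROL_CHARS.sub("", text)


def read_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8 (replacing undecodable bytes), then strip control characters."""
    return strip_control_chars(raw.decode("utf-8", errors="replace"))
```

Decoding with `errors="replace"` trades strictness for robustness: malformed bytes become U+FFFD instead of aborting a batch run.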

Quality Assurance Mechanisms

Compression Rate Protection: ≤20% threshold ensures text semantic integrity
Boundary Self-healing: Automatic detection and repair of Shield protection anomalies
Integrity Auditing: Multi-layer verification ensures data quality

📄 Open Source License

MIT License - see LICENSE file for details

📖 Citation

If you use PorosData-Processor in your research, please cite:

@software{porosdata_processor,
  title = {PorosData-Processor: Academic Document Intelligent Cleaning Pipeline},
  author = {YE, Kivent},
  year = {2025},
  url = {https://github.com/KiventYip/PorosData-doc},
  version = {0.2.4}
}

🤝 Contributions and Feedback

Issues and Pull Requests are welcome! The project adopts a data-driven development philosophy, and any performance optimization suggestions will be seriously considered.

Release history

0.4.1: May 16, 2026
0.4.0: Apr 28, 2026
0.3.0: Apr 8, 2026
0.2.4 (this version): Feb 3, 2026
0.2.3: Feb 3, 2026
0.2.2: Dec 25, 2025

Download files

Source Distribution

porosdata_processor-0.2.4.tar.gz (68.7 kB)

Uploaded Feb 3, 2026 Source

Built Distribution

porosdata_processor-0.2.4-py3-none-any.whl (58.9 kB)

Uploaded Feb 3, 2026 Python 3

File details

Details for the file porosdata_processor-0.2.4.tar.gz.

File metadata

Download URL: porosdata_processor-0.2.4.tar.gz
Upload date: Feb 3, 2026
Size: 68.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for porosdata_processor-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`82d6c955785e8fbbd844da994db72cd7a3b1bc4d78356db6b2e5bc41d527188c`
MD5	`793bb224173ac8e00e23ad83f98ff7a8`
BLAKE2b-256	`01a31fff504b3bef25cdcec67cb67f6bffab00f38bb65608f75e216868ccebc1`

File details

Details for the file porosdata_processor-0.2.4-py3-none-any.whl.

File metadata

Download URL: porosdata_processor-0.2.4-py3-none-any.whl
Upload date: Feb 3, 2026
Size: 58.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for porosdata_processor-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6d05e1ff80a52343e703c5a88fb9148780a5feadfd0c3e6c359c552ef737177`
MD5	`586bc770461e7ec0d45068e4332f5e26`
BLAKE2b-256	`71cab12a1a729f46fda4132c1df5f8f3c6f3af30d19b047d6eb4e3948215d9c8`
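
A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. The sketch below is a generic approach, not tooling shipped with the project; `sha256_of` and `verify` are illustrative names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the wheel, this would be called as `verify(Path("porosdata_processor-0.2.4-py3-none-any.whl"), "<SHA256 from the table>")` after downloading the file into the working directory.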
