Intelligent academic document cleaning pipeline for AI for Science, ensuring MinerU-parsed data meets LLM input standards
PorosData-Processor
Academic Literature Data Engineering Toolkit - Specializes in cleaning MinerU outputs (scientific literature) and evaluating Token efficiency for LLMs, supporting the complete "AI for Science" workflow.
🎯 Project Positioning
In the AI for Science field, high-quality academic data preprocessing is the foundation for models to understand scientific literature. PorosData-Processor serves as a data engineering tool that specifically addresses the "last mile" problem from MinerU document parsing to LLM input, ensuring academic documents achieve maximum Token efficiency while maintaining integrity.
🌟 Core Features
- 🔬 Professional Academic Cleaning: Intelligently handles LaTeX formulas, control characters, citation formats, and other academic document-specific issues
- ⚡ Multi-process Parallel Processing: Cross-platform concurrent processing built on pathlib and the multiprocessing spawn start method, supporting Windows and Linux
- 📊 Real-time Token Evaluation: Integrated GPT-2 tokenizer, providing precise Token compression rate calculations
- 🛡️ Intelligent Protection Mechanism: 20% compression rate threshold protection, ensuring text semantic integrity
- 🔄 Self-healing Quality Assurance: Automatic detection and repair of data corruption during processing
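The multi-process feature can be pictured with a minimal sketch, assuming one worker task per JSON file. The names `count_items` and `process_directory` are hypothetical and are not part of the PorosData-Processor API; only the use of `pathlib` and the `spawn` start method comes from the feature list above.

```python
import json
import multiprocessing as mp
from pathlib import Path


def count_items(path_str: str) -> tuple[str, int]:
    """Hypothetical worker: load one JSON file and report how many items it holds."""
    path = Path(path_str)
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    n = len(data) if isinstance(data, list) else 1
    return path.name, n


def process_directory(input_dir: str, workers: int = 2) -> dict[str, int]:
    """Fan the JSON files in input_dir out to a pool of spawned workers."""
    files = sorted(str(p) for p in Path(input_dir).glob("*.json"))
    # "spawn" is the only start method available on Windows, so requesting it
    # explicitly keeps the behaviour the same on Linux.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=workers) as pool:
        return dict(pool.map(count_items, files))
```

Because `spawn` re-imports the main module in every worker, a script that calls `process_directory(...)` should do so under an `if __name__ == "__main__":` guard.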
🚀 Quick Start
Environment Requirements
pip install porosdata-processor ijson tiktoken psutil
Process MinerU Data
# Process all JSON files in the data/mineru_output_raw_data directory
# The HF_ENDPOINT mirror is configured automatically to speed up downloads for users in China
python run_processor.py --enable-evaluation
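What the evaluation step measures can be illustrated with a small sketch. It assumes the compression rate is the cleaned token count divided by the original token count, which is consistent with the reported pairing of a 0.098 rate with 90.2% savings. The function name and the whitespace tokenizer are placeholders for illustration; a GPT-2 count could be plugged in by passing `tiktoken.get_encoding("gpt2").encode` as `encode`.

```python
from typing import Callable, Sequence


def compression_rate(original: str, cleaned: str,
                     encode: Callable[[str], Sequence] = str.split) -> float:
    """Return cleaned_tokens / original_tokens (assumed definition).

    `encode` is any callable that turns text into a token sequence. The
    default splits on whitespace so the sketch runs without extra packages.
    """
    original_tokens = len(encode(original))
    if original_tokens == 0:
        return 0.0  # nothing to compress
    return len(encode(cleaned)) / original_tokens


# Illustrative strings only, not real MinerU output.
raw = "Fig.  1   shows \x07 the   pore   size [1] [2] [3]"
clean = "Fig. 1 shows the pore size"
rate = compression_rate(raw, clean)
print(f"rate={rate:.3f}, savings={1 - rate:.1%}")
```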
Output Data Format
Processed JSON files contain standardized fields:
- text: Cleaned academic text, suitable for LLM input
- original_text: Original input text, for quality comparison
- healed_count: Self-healing repair count, reflecting data quality
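A short sketch of how these fields might be consumed downstream. It assumes a processed file is a JSON array of objects that each carry the three fields; that top-level layout and the `summarize` helper are assumptions for illustration, not documented behaviour.

```python
import json


def summarize(records: list[dict]) -> dict:
    """Summarize cleaned records using the three documented fields."""
    return {
        "records": len(records),
        "healed_total": sum(r.get("healed_count", 0) for r in records),
        "chars_before": sum(len(r.get("original_text", "")) for r in records),
        "chars_after": sum(len(r.get("text", "")) for r in records),
    }


# Illustrative data only; real files are produced by run_processor.py.
sample = json.loads("""
[
  {"text": "E = mc^2", "original_text": "E  =  mc^2\\u0000", "healed_count": 1},
  {"text": "Pore size: 5 nm", "original_text": "Pore size: 5 nm", "healed_count": 0}
]
""")
print(summarize(sample))
```

Comparing `chars_before` with `chars_after` gives a quick sanity check on how much a file changed, and a non-zero `healed_total` flags files worth inspecting by hand.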
📈 Performance Showcase
Materials Science Data Pilot Test Results
| Test Metric | Value | Description |
|---|---|---|
| Files Processed | 3 files | MinerU-parsed academic paper JSON files |
| Items Processed | 127 items | Including text, formulas, tables, and other structured content |
| Avg Token Compression Rate | 0.098 | 90.2% Token Savings |
| Processing Time | 0.456s | Multi-process parallel processing efficiency |
| Memory Peak | 204.2MB | Streaming processing ensures memory efficiency |
Key Insights:
- Token compression rate reaches 0.098 (90.2% reduction), significantly reducing LLM inference costs
- Processing speed of 278.5 items/second, suitable for large-scale academic data processing
- Peak memory usage of only 204.2 MB in this pilot, with streaming processing designed to scale toward TB-level data
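The derived figures follow from the table values by simple arithmetic. The sketch below reproduces them, assuming the compression rate is defined as cleaned tokens divided by original tokens.

```python
# Values copied from the pilot-test table.
compression_rate = 0.098   # cleaned tokens / original tokens (assumed definition)
items = 127
seconds = 0.456

token_savings_pct = (1 - compression_rate) * 100
throughput = items / seconds

print(f"Token savings: {token_savings_pct:.1f}%")
print(f"Throughput: {throughput:.1f} items/second")
```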
🛠️ Academic Tools Suite
The project includes a complete academic data processing toolchain:
Core Processing Scripts
- Batch Processing: academic_tools/standalone/batch_process.py
- Single File Processing: academic_tools/standalone/process_single_json.py
- Compatibility Processing: academic_tools/standalone/process_with_cleanlit.py
Advanced Configuration Options
# Custom input/output directories
python run_processor.py --input-dir ./data/input --output-dir ./data/output
# Basic cleaning only (no Token evaluation)
python run_processor.py
# Force reprocessing of all files
python run_processor.py --enable-evaluation --force-reprocess
📚 Technical Documentation
- Architecture Design - Core components and implementation principles
- Usage Guide - Detailed API and configuration instructions
- Testing Guide - Development and testing environment setup
🔬 Technical Specifications
Supported Data Formats
- Input: MinerU-parsed JSON format academic documents
- Output: Standardized JSON, compatible with LLM training and inference
- Encoding: UTF-8 cross-platform support, automatic control character handling
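Automatic control character handling could look like the following sketch. The function name is hypothetical, and the rule it applies (drop every character in Unicode category `Cc` except newline and tab) is an assumption rather than the project's documented behaviour.

```python
import unicodedata


def strip_control_chars(text: str, keep: str = "\n\t") -> str:
    """Remove Unicode control characters (category Cc) except those in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


dirty = "pore\x00 size\x07:\t5 nm\n"
print(repr(strip_control_chars(dirty)))
```

Keeping newline and tab preserves paragraph and table structure, while stray bytes such as NUL or BEL that can leak in from PDF extraction are dropped.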
Quality Assurance Mechanisms
- Compression Rate Protection: ≤20% threshold ensures text semantic integrity
- Boundary Self-healing: Automatic detection and repair of Shield protection anomalies
- Integrity Auditing: Multi-layer verification ensures data quality
📄 Open Source License
MIT License - see LICENSE file for details
📖 Citation
If you use PorosData-Processor in your research, please cite:
@software{porosdata_processor,
title = {PorosData-Processor: Academic Document Intelligent Cleaning Pipeline},
author = {YE, Kivent},
year = {2025},
url = {https://github.com/KiventYip/PorosData-doc},
version = {0.2.2}
}
🤝 Contributions and Feedback
Issues and Pull Requests are welcome! The project adopts a data-driven development philosophy, and any performance optimization suggestions will be seriously considered.
File details
Details for the file porosdata_processor-0.2.3.tar.gz.
File metadata
- Download URL: porosdata_processor-0.2.3.tar.gz
- Upload date:
- Size: 68.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1834bafaef0ff3d45e86b8106042d261e2c5fe1d147213296e2f842724d193bc |
| MD5 | 98b76bf54412ac7c08ded367e4e4f528 |
| BLAKE2b-256 | 67e70e59e648d56de65a3ef8050fa867083b07de2871e0c0b2c541aa50fcab44 |
File details
Details for the file porosdata_processor-0.2.3-py3-none-any.whl.
File metadata
- Download URL: porosdata_processor-0.2.3-py3-none-any.whl
- Upload date:
- Size: 59.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3306849bf28fc860200d200e33e3f69bba007b5cdb91f13dc1aefcac10bfce9f |
| MD5 | 42e75a1fba2acae267a02e0fe456a55d |
| BLAKE2b-256 | c9934daffb4ef0727e6303867a7c151a52fc21505b1e0833a6a3ddbcb0899d0e |