Professional text cleaning tool supporting Markdown code block and LaTeX formula protection, automatic citation normalization, and Chinese-English typesetting optimization
PorosData-Processor
PorosData-Processor is a deep text cleaning pipeline designed for AI for Science scenarios. It currently focuses on fine-grained processing of the structured JSON output of document parsers (MinerU), addressing the serious flaws that arise when scientific literature is converted into formats readable by large language models (LLMs). The goal is for academic documents to reach format standardization, token minimization, and logical completeness before they are fed to an LLM.
📖 Project Motivation
In the AI for Science field, high-quality data preprocessing is the foundation for models to understand academic literature. Although MinerU provides powerful PDF parsing capabilities, its raw output still faces the following challenges:
- Formula Damage: Standard text cleaning rules often accidentally affect LaTeX formulas, leading to loss of scientific meaning.
- Structural Fragmentation: Deeply nested JSON and non-standard reference markers degrade RAG system indexing quality.
- Token Waste: Large amounts of redundant spaces and non-standard characters in academic documents increase LLM inference overhead.
🌟 Core Highlights
PorosData-Processor aims to strike a balance between "cleaning" and "protection".
- LaTeX Formula Protection: Automatically identifies and locks inline $ ... $ and block $$...$$ formulas to avoid damaging key information during text cleaning.
- Code Block Preservation: Protects Markdown code blocks and inline code.
- Placeholder Mechanism: Uses fixed-width placeholders so that whitespace compression does not affect layout.
- Academic Standardization: Automatically fixes Greek letters ($\alpha, \beta$), Roman numerals, and chapter numbering.
- Citation Normalization: Unifies diverse citation formats (like 【 1 】) into the standard [1].
- Token Optimization: Removes redundant spaces in LaTeX formulas to reduce LLM token consumption.
🛡️ Intelligent Shield Protection Mechanism
For sensitive content in academic documents, we developed a three-stage "preprocessing-cleaning-restoration" protection process so that core information passes through cleaning undamaged:
- Pre-Shield: Uses a regex engine to lock LaTeX formulas, code blocks, and other protected areas, mapping them to fixed-width placeholders (like __CLEANLIT_SHIELD_001__).
- Safe Cleaning: Performs aggressive chapter standardization and whitespace compression on the plain-text areas outside the placeholders.
- Precise Restoration: After cleaning, restores each placeholder to its original content.
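The three stages above can be sketched roughly as follows. This is a minimal illustration and not the library's implementation: the regex and the names `PROTECTED`, `shield`, `restore`, `clean_with_shield`, and `squeeze_spaces` are assumptions made for this example; only the placeholder format `__CLEANLIT_SHIELD_001__` comes from the description above.

```python
import re

# Assumed pattern for protected spans, tried in priority order.
PROTECTED = re.compile(
    r"`{3}.*?`{3}"        # fenced code blocks
    r"|`[^`\n]+`"         # inline code
    r"|\$\$.*?\$\$"       # block formulas
    r"|\$[^$\n]+\$",      # inline formulas
    re.DOTALL,
)


def shield(text):
    """Stage 1: swap protected spans for placeholders and remember the originals."""
    vault = {}

    def _lock(match):
        key = f"__CLEANLIT_SHIELD_{len(vault) + 1:03d}__"
        vault[key] = match.group(0)
        return key

    return PROTECTED.sub(_lock, text), vault


def restore(text, vault):
    """Stage 3: put the original spans back in place of their placeholders."""
    for key, original in vault.items():
        text = text.replace(key, original)
    return text


def clean_with_shield(text, clean):
    """Run an arbitrary cleaning function on the unprotected text only."""
    shielded, vault = shield(text)
    return restore(clean(shielded), vault)  # stage 2 is the call to clean()


def squeeze_spaces(text):
    """Example cleaner: collapse runs of spaces and tabs into one space."""
    return re.sub(r"[ \t]{2,}", " ", text)


sample = "Energy  is   $E = m  c^2$ and `a  =  b`  here."
result = clean_with_shield(sample, squeeze_spaces)
print(result)
```

The spaces inside the formula and the inline code survive, while the surrounding prose is compressed. The approach relies on the cleaning function never altering the placeholder tokens themselves.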
🔌 Modular Extension Plugins
Built on a Plugin Registry architecture, the tool lets developers extend business logic through decorators. This design fully decouples cleaning rules from the core pipeline, so users can assemble their own pipeline to suit different corpora and research needs.
@PluginRegistry.register("custom_academic_rule")
def my_rule(text: str) -> str:
    # Custom cleaning logic for specific fields (such as physics, biology)
    processed_text = text  # placeholder: apply field-specific transformations here
    return processed_text
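For readers unfamiliar with the pattern, a decorator-based registry can be sketched as below. `SketchRegistry` and its `run` method are hypothetical stand-ins written for this example; they are not the library's `PluginRegistry`, whose internals are not shown here.

```python
from typing import Callable, Dict, List

Rule = Callable[[str], str]


class SketchRegistry:
    """Hypothetical stand-in for a plugin registry (not the library's class)."""

    _rules: Dict[str, Rule] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Rule], Rule]:
        """Return a decorator that stores the function under the given name."""
        def decorator(func: Rule) -> Rule:
            cls._rules[name] = func
            return func  # the function itself is left unchanged
        return decorator

    @classmethod
    def run(cls, text: str, pipeline: List[str]) -> str:
        """Apply the named rules to the text in the order listed."""
        for name in pipeline:
            text = cls._rules[name](text)
        return text


@SketchRegistry.register("strip_edges")
def strip_edges(text: str) -> str:
    return text.strip()


@SketchRegistry.register("collapse_spaces")
def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


output = SketchRegistry.run("  proton   mass  ", ["strip_edges", "collapse_spaces"])
print(output)
```

Because rules are looked up by name at run time, a pipeline is just a list of strings, which is what makes per-corpus combinations cheap to express.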
📊 Core Function Matrix
| Function Module | Problem Solved | Example (Input -> Output) |
|---|---|---|
| Chinese-English Punctuation Fix | Fix mixed punctuation and extra spaces | "Hello,world" -> Standardized output |
| Reference Literature Standardization | Unify citation markers like 【1】, [ 2 ] | 【1】 -> [1] |
| Roman Numeral Conversion | Unify number representation (like II, III) | Chapter II -> Chapter 2 |
| Chapter Numbering Standardization | Fix chaotic document structure numbering | 第1章 1.1节 -> Chapter 1, Section 1.1 |
| Greek Letter Conversion | Convert Greek characters to LaTeX academic symbols | α + β -> \alpha + \beta |
| Whitespace Optimization | Clear redundant spaces and illegal line breaks | Text with spaces -> Standardized spacing |
| LaTeX Formula Compression | Preprocess spaces in formulas to optimize token consumption | $ \alpha + \beta $ -> $\alpha+\beta$ |
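Three of the rows above are simple enough to illustrate with short functions. The sketch below is an assumption about how such rules could be written, not the package's actual plugins: the function names, the three-letter Greek mapping, and the restriction of Roman numerals to the word "Chapter" are all choices made for this example.

```python
import re

# Assumed subset of the mapping; a real rule would cover the full alphabet.
GREEK_TO_LATEX = {"α": r"\alpha", "β": r"\beta", "γ": r"\gamma"}

# Full-width or ASCII brackets around a number, with optional inner spaces.
CITATION = re.compile(r"[【\[]\s*(\d+)\s*[】\]]")

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def normalize_citations(text):
    """Rewrite markers such as 【 1 】 or [ 2 ] as [1] and [2]."""
    return CITATION.sub(r"[\1]", text)


def greek_to_latex(text):
    """Replace each mapped Greek character with its LaTeX command."""
    return "".join(GREEK_TO_LATEX.get(ch, ch) for ch in text)


def roman_to_int(numeral):
    """Convert a Roman numeral using the subtractive rule (IV = 4)."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted.
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def convert_chapter_numerals(text):
    """Convert numerals only after 'Chapter' to avoid touching the pronoun 'I'."""
    return re.sub(
        r"\bChapter ([IVXLCDM]+)\b",
        lambda m: f"Chapter {roman_to_int(m.group(1))}",
        text,
    )


print(normalize_citations("see 【 1 】 and [ 2 ]"))
print(greek_to_latex("α + β"))
print(convert_chapter_numerals("Chapter II"))
```

One caveat with the character-level Greek replacement: a letter directly followed by a Latin character (for example αx) would yield \alphax, so a production rule would need to insert a separator in that case.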
⚙️ Installation and Naming Specifications
⚠️ Important Notice:
- PyPI Installation Name: PorosData-Processor (using hyphen -)
- Python Import Name: import porosdata_processor (using underscore _)
- Command Line Tool: porosdata-processor
pip install porosdata-processor
Notice: Although the PyPI package name is PorosData-Processor (with hyphen -), it must be imported with an underscore in code: import porosdata_processor
To verify the installation:
python -c "import porosdata_processor; print('porosdata_processor imported successfully')"
🚀 Quick Start
from porosdata_processor import TextCleaner
# Method 1: Use default pipeline (recommended)
cleaner = TextCleaner()
# Method 2: Custom plugin combination
cleaner = TextCleaner(pipeline=["patterns_cleaning", "greek_to_latex"])
# Execute cleaning
raw_text = "Identify α particles described in literature 【1】."
cleaned_text = cleaner.clean(raw_text)
# Enable advanced options: clean redundant spaces inside formulas
cleaner = TextCleaner(clean_options={"clean_latex_math_spaces": True})
# Enable only specific plugins
custom_cleaner = TextCleaner(pipeline=["citation_rules", "greek_to_latex"])
✅ Reliability Guarantee
This project is covered by strict unit tests along the following core dimensions:
- Core Registration System Coverage: 100%
- Encoding Compatibility Verification: UTF-8 handling is verified on Windows, Linux, and macOS
🗺️ Development Roadmap
- Field-Specific Optimization: Deep cleaning for arXiv papers and programming language-specific formats.
- AI Model Integration: Introduce a lightweight LLM to help assess cleaning quality.