Professional text cleaning tool supporting Markdown code block and LaTeX formula protection, automatic citation normalization, and Chinese-English typesetting optimization

PorosData-Processor

PorosData-Processor is a deep text cleaning pipeline designed for AI for Science scenarios. It currently focuses on fine-grained processing of the structured JSON output of document parsers (MinerU), fixing the critical flaws that arise when scientific literature is converted into formats readable by large language models (LLMs). The goal is for academic documents to be standardized in format, minimal in token count, and logically complete before they are passed to an LLM.

📖 Project Motivation

In the AI for Science field, high-quality data preprocessing is the foundation for models to understand academic literature. Although MinerU provides powerful PDF parsing capabilities, its raw output still faces the following challenges:

Formula Damage: Standard text cleaning rules often corrupt LaTeX formulas, destroying their scientific meaning.
Structural Fragmentation: Deeply nested JSON and non-standard reference markers degrade RAG indexing quality.
Token Waste: Redundant spaces and non-standard characters in academic documents increase LLM inference overhead.

🌟 Core Highlights

PorosData-Processor is designed to balance "cleaning" with "protection".

LaTeX Formula Protection: Automatically identifies and locks inline $ ... $ and block $$...$$ formulas to avoid damaging key information during text cleaning.
Code Block Preservation: Protects Markdown code blocks and inline code.
Placeholder Mechanism: Uses fixed-width intelligent placeholders to prevent space compression from affecting layout.
Academic Standardization: Automatically normalizes Greek letters (for example, to $\alpha, \beta$), Roman numerals, and chapter numbering.
Citation Cleanup: Unifies diverse citation formats (such as 【 1 】) into the standard [1].
Token Optimization: Cleans redundant spaces in LaTeX formulas to reduce LLM token consumption.
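
The token optimization highlight can be illustrated with a small sketch. It is not the package's implementation: the function name `compress_math_spaces` and the regular expression are assumptions made for this example, and it treats every `$` as a math delimiter, so prose containing currency amounts or escaped `\$` would need extra handling.

```python
import re

# Block math ($$...$$) is tried before inline math ($...$).
MATH_RE = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)


def compress_math_spaces(text: str) -> str:
    """Remove redundant whitespace inside LaTeX math; prose is left untouched."""

    def _compress(match: re.Match) -> str:
        is_block = match.group(1) is not None
        delim = "$$" if is_block else "$"
        body = match.group(1) if is_block else match.group(2)
        # A space between a control word and a following letter is required
        # (\alpha x must not become \alphax), so mark it with a NUL sentinel.
        body = re.sub(
            r"(\\[A-Za-z]+)\s+(?=[A-Za-z])",
            lambda m: m.group(1) + "\x00",
            body,
        )
        body = re.sub(r"\s+", "", body)  # drop all remaining whitespace
        return delim + body.replace("\x00", " ") + delim

    return MATH_RE.sub(_compress, text)


print(compress_math_spaces(r"Energy $ \alpha + \beta $ is conserved."))
# prints: Energy $\alpha+\beta$ is conserved.
```

The sentinel step is the one design choice worth noting: stripping every space blindly would merge a command with the identifier that follows it.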

🛡️ Intelligent Shield Protection Mechanism

For sensitive content in academic documents, we developed a three-stage "shield, clean, restore" process so that protected content passes through cleaning unchanged:

Pre-Shield: Uses a regex engine to lock LaTeX formulas, code blocks, and other protected areas, mapping them to fixed-width placeholders (such as __CLEANLIT_SHIELD_001__).
Safe Cleaning: Performs aggressive chapter standardization and whitespace compression on the plain-text areas outside the placeholders.
Precise Restoration: After cleaning, replaces each placeholder with its original content.
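
The three stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: `shielded_clean`, `PROTECTED_RE`, and `squeeze` are hypothetical names, and only the placeholder format is taken from the description above.

```python
import re
from typing import Callable, Dict

# Protected areas, tried in order: fenced code, block math, inline code, inline math.
PROTECTED_RE = re.compile(
    r"```.*?```|\$\$.+?\$\$|`[^`\n]+`|\$[^$\n]+\$", re.DOTALL
)
PLACEHOLDER_RE = re.compile(r"__CLEANLIT_SHIELD_\d+__")


def shielded_clean(text: str, clean: Callable[[str], str]) -> str:
    """Run `clean` on plain text only; protected spans are restored verbatim."""
    vault: Dict[str, str] = {}

    def _shield(match: re.Match) -> str:
        key = f"__CLEANLIT_SHIELD_{len(vault) + 1:03d}__"
        vault[key] = match.group(0)
        return key

    shielded = PROTECTED_RE.sub(_shield, text)  # 1. pre-shield
    cleaned = clean(shielded)                   # 2. safe cleaning
    # 3. restoration; unknown placeholder-like text is left as it is
    return PLACEHOLDER_RE.sub(lambda m: vault.get(m.group(0), m.group(0)), cleaned)


def squeeze(text: str) -> str:
    """Example cleaning rule: collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", text)


print(shielded_clean("Mass   is  $ m \\, c^2 $   and `a  b`  here.", squeeze))
# prints: Mass is $ m \, c^2 $ and `a  b` here.
```

The sketch only works if the cleaning rule leaves the placeholder tokens intact, which is why the tokens contain no whitespace or punctuation that a typical rule would rewrite.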

🔌 Modular Extension Plugins

Built on a Plugin Registry architecture, the tool lets developers add cleaning logic through decorators. This design decouples cleaning rules from the core pipeline, so users can assemble their own pipeline to suit different corpora and research needs.

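# PluginRegistry is provided by porosdata_processor (import path not shown in this README).
@PluginRegistry.register("custom_academic_rule")
def my_rule(text: str) -> str:
    # Custom cleaning logic for a specific field (such as physics or biology);
    # this placeholder body only trims surrounding whitespace.
    processed_text = text.strip()
    return processed_text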
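
To show how a decorator-based registry keeps rules separate from the pipeline, here is a self-contained sketch. `RuleRegistry` is a hypothetical stand-in written for this example; it is not the package's `PluginRegistry`, whose internals are not documented here.

```python
from typing import Callable, Dict, List

Rule = Callable[[str], str]


class RuleRegistry:
    """Hypothetical stand-in for a plugin registry (not the package's class)."""

    _rules: Dict[str, Rule] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Rule], Rule]:
        """Return a decorator that stores the function under `name`."""

        def decorator(func: Rule) -> Rule:
            if name in cls._rules:
                raise ValueError(f"rule {name!r} is already registered")
            cls._rules[name] = func
            return func  # the function itself is returned unchanged

        return decorator

    @classmethod
    def run(cls, names: List[str], text: str) -> str:
        """Apply the named rules to `text`, one after another."""
        for name in names:
            text = cls._rules[name](text)
        return text


@RuleRegistry.register("strip_edges")
def strip_edges(text: str) -> str:
    return text.strip()


@RuleRegistry.register("squeeze_spaces")
def squeeze_spaces(text: str) -> str:
    return " ".join(text.split())


print(RuleRegistry.run(["strip_edges", "squeeze_spaces"], "  a   b "))
# prints: a b
```

Because the pipeline refers to rules only by name, a new rule can be added without touching the code that runs them.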

📊 Core Function Matrix

Function Module	Problem Solved	Example (Input -> Output)
Chinese-English Punctuation Fix	Fix mixed punctuation and extra spaces	`"Hello，world"` -> Standardized output
Citation Standardization	Unify citation markers like `【1】`, `[ 2 ]`	`【1】` -> `[1]`
Roman Numeral Conversion	Unify number representation (like `II`, `III`)	`Chapter II` -> `Chapter 2`
Chapter Numbering Standardization	Fix chaotic document structure numbering	`第1章 1.1节` -> `Chapter 1, Section 1.1`
Greek Letter Conversion	Convert Greek characters to LaTeX academic symbols	`α + β` -> `\alpha + \beta`
Whitespace Optimization	Clear redundant spaces and illegal line breaks	`Text with spaces` -> Standardized spacing
LaTeX Formula Compression	Remove redundant spaces in formulas to reduce token consumption	`$ \alpha + \beta $` -> `$\alpha+\beta$`
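
Three rows of the matrix (citation markers, Roman numerals, Greek letters) can be approximated with a few lines of Python. The function names and the partial `GREEK` table are assumptions made for this sketch; the package's own rules may differ. Restricting Roman numeral conversion to words after "Chapter" or "Section" is a choice made here to avoid rewriting ordinary capital letters.

```python
import re

GREEK = {"α": r"\alpha", "β": r"\beta", "γ": r"\gamma"}  # partial, illustrative
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def normalize_citations(text: str) -> str:
    """Rewrite markers such as 【 1 】 or [ 2 ] as [1] and [2]."""
    return re.sub(r"[【\[]\s*(\d+)\s*[】\]]", r"[\1]", text)


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral, subtracting a symbol that precedes a larger one."""
    total = 0
    for current, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN[current]
        total += -value if ROMAN.get(following, 0) > value else value
    return total


def convert_heading_numerals(text: str) -> str:
    """Convert Roman numerals that follow 'Chapter' or 'Section'."""
    return re.sub(
        r"\b(Chapter|Section)\s+([IVXLCDM]+)\b",
        lambda m: f"{m.group(1)} {roman_to_int(m.group(2))}",
        text,
    )


def greek_to_latex(text: str) -> str:
    """Replace the Greek characters listed in GREEK with LaTeX commands."""
    return "".join(GREEK.get(ch, ch) for ch in text)


print(normalize_citations("see 【 1 】 and [ 2 ]"))  # prints: see [1] and [2]
print(convert_heading_numerals("Chapter II"))        # prints: Chapter 2
print(greek_to_latex("α + β"))                       # prints: \alpha + \beta
```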

⚙️ Installation and Naming Specifications

⚠️ Important Notice:

PyPI Installation Name: PorosData-Processor (using hyphen -)
Python Import Name: import porosdata_processor (using underscore _)
Command Line Tool: porosdata-processor

pip install porosdata-processor

To verify the installation, import the package using its underscore name:

python -c "import porosdata_processor; print('porosdata_processor imported successfully')"
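
The reason the lowercase `pip install porosdata-processor` command resolves to the PorosData-Processor project is that package indexes compare names in normalized form. The sketch below applies the normalization rule from PEP 503; the function name is chosen for this example. The import name is set by the package itself and is not derived by pip.

```python
import re


def normalize_project_name(name: str) -> str:
    """PEP 503: runs of '-', '_' and '.' become one '-', and the name is lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_project_name("PorosData-Processor"))  # prints: porosdata-processor
print(normalize_project_name("porosdata_processor"))  # prints: porosdata-processor
```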

🚀 Quick Start

from porosdata_processor import TextCleaner

# Method 1: Use the default pipeline (recommended)
cleaner = TextCleaner()

# Execute cleaning
raw_text = "Identify α particles described in literature 【1】."
cleaned_text = cleaner.clean(raw_text)
print(cleaned_text)

# Method 2: Custom plugin combination
pattern_cleaner = TextCleaner(pipeline=["patterns_cleaning", "greek_to_latex"])

# Enable only specific plugins
custom_cleaner = TextCleaner(pipeline=["citation_rules", "greek_to_latex"])

# Enable advanced options: clean redundant spaces inside formulas
latex_cleaner = TextCleaner(clean_options={"clean_latex_math_spaces": True})
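
Since the project targets nested JSON produced by a document parser, one way to apply a cleaning function to such data is a recursive walk over every string value. The sketch below is an assumption-laden illustration: `clean_json_strings` and `squeeze` are hypothetical helpers, and the sample JSON layout is invented for the example and is not MinerU's actual schema. In practice, `cleaner.clean` from the Quick Start could be passed in place of `squeeze`.

```python
import json
from typing import Any, Callable


def clean_json_strings(node: Any, clean: Callable[[str], str]) -> Any:
    """Return a copy of `node` with `clean` applied to every string value."""
    if isinstance(node, str):
        return clean(node)
    if isinstance(node, list):
        return [clean_json_strings(item, clean) for item in node]
    if isinstance(node, dict):
        # Keys are kept as they are; only values are cleaned.
        return {key: clean_json_strings(value, clean) for key, value in node.items()}
    return node  # numbers, booleans and None pass through unchanged


def squeeze(text: str) -> str:
    """Stand-in cleaning rule: collapse runs of whitespace."""
    return " ".join(text.split())


# Hypothetical parser output, invented for this example.
raw = json.loads('{"pages": [{"index": 0, "blocks": [{"text": "Deep   nested  text"}]}]}')
print(json.dumps(clean_json_strings(raw, squeeze)))
# prints: {"pages": [{"index": 0, "blocks": [{"text": "Deep nested text"}]}]}
```

This version cleans every string it finds; a real pipeline would probably restrict cleaning to the fields that hold document text.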

✅ Reliability Guarantee

The project is covered by unit tests in the following areas:

Core Registration System Coverage: 100%
Encoding Compatibility Verification: UTF-8 handling is tested on Windows, Linux, and macOS

🗺️ Development Roadmap

Field-Specific Optimization: Deep cleaning for arXiv papers and for programming-language-specific formats.
AI Model Integration: Introduce a lightweight LLM to help assess cleaning quality.

Release history

0.4.1: May 16, 2026
0.4.0: Apr 28, 2026
0.3.0: Apr 8, 2026
0.2.4: Feb 3, 2026
0.2.3: Feb 3, 2026
0.2.2: Dec 25, 2025 (this version)

Download files

Source Distribution

porosdata_processor-0.2.2.tar.gz (31.3 kB view details)

Uploaded Dec 25, 2025 Source

Built Distribution

porosdata_processor-0.2.2-py3-none-any.whl (31.2 kB view details)

Uploaded Dec 25, 2025 Python 3

File details

Details for the file porosdata_processor-0.2.2.tar.gz.

File metadata

Download URL: porosdata_processor-0.2.2.tar.gz
Upload date: Dec 25, 2025
Size: 31.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for porosdata_processor-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`2e09a933c8a4c3030721b9df707fd4ecce3606fd96fe2dfe1dfe895236adbbb2`
MD5	`4c8b1e852d556da0784d148e64dde378`
BLAKE2b-256	`30f3af3ee8f99147a7246ec02ee1a2c20e366b2562c4a00a699e54ea400d2c22`

File details

Details for the file porosdata_processor-0.2.2-py3-none-any.whl.

File metadata

Download URL: porosdata_processor-0.2.2-py3-none-any.whl
Upload date: Dec 25, 2025
Size: 31.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for porosdata_processor-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3020e42038f9fa099b6c417b40dbc324bf7c9233ec7f73be42fcbc6936bfcf29`
MD5	`18e04f71f3d7a6d8a83e44dd07f28a19`
BLAKE2b-256	`e3a72ee6ea193d84b12807508c46d489f05a9453bc631aa2911495d94feb3fe1`
