Professional text cleaning tool supporting Markdown code block and LaTeX formula protection, automatic citation normalization, and Chinese-English typesetting optimization

Project description

PorosData-Processor

Python 3.8+ License: MIT

PorosData-Processor is a deep text-cleaning pipeline designed for AI for Science scenarios. It currently focuses on fine-grained processing of the structured JSON output produced by document parsers (MinerU), and it addresses the critical flaws that arise when scientific literature is converted into a format readable by large language models (LLMs). The goal is for academic documents to reach the LLM with a standardized format, a minimal token count, and their logical structure intact.

📖 Project Motivation

In the AI for Science field, high-quality data preprocessing is the foundation for models to understand academic literature. Although MinerU provides powerful PDF parsing capabilities, its raw output still faces the following challenges:

  • Formula Damage: Standard text cleaning rules often accidentally affect LaTeX formulas, leading to loss of scientific meaning.
  • Structural Fragmentation: Deeply nested JSON and non-standard reference markers degrade the indexing quality of RAG systems.
  • Token Waste: Large amounts of redundant spaces and non-standard characters in academic documents increase LLM inference overhead.

🌟 Core Highlights

PorosData-Processor is built to balance "cleaning" against "protection".

  • LaTeX Formula Protection: Automatically identifies and locks inline $ ... $ and block $$...$$ formulas to avoid damaging key information during text cleaning.
  • Code Block Preservation: Protects Markdown code blocks and inline code.
  • Placeholder Mechanism: Uses fixed-width intelligent placeholders to prevent space compression from affecting layout.
  • Academic Standardization: Automatically fixes Greek letters ($\alpha, \beta$), Roman numerals, and chapter numbering.
  • Citation Normalization: Unifies diverse citation formats (such as 【 1 】) into the standard [1].
  • Token Optimization: Cleans redundant spaces in LaTeX formulas to reduce LLM consumption.
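
The formula space cleanup described in the last bullet can be sketched as follows. This is a minimal illustration and not the package's actual implementation: the names `MATH_RE`, `compress_math_spaces`, and `compress_formulas` are hypothetical, and the sketch deliberately ignores escaped dollar signs and spaces inside `\text{...}`, which a production rule would need to handle.

```python
import re

# Block formulas ($$...$$) are tried before inline formulas ($...$).
MATH_RE = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+?)\$", re.DOTALL)


def compress_math_spaces(body: str) -> str:
    """Remove whitespace inside a formula where LaTeX math mode ignores it."""
    body = body.strip()

    def repl(match: re.Match) -> str:
        before = match.string[: match.start()]
        after = match.string[match.end() : match.end() + 1]
        # Keep one space after a control word followed by a letter,
        # otherwise "\alpha x" would turn into the undefined "\alphax".
        if re.search(r"\\[A-Za-z]+$", before) and after.isalpha():
            return " "
        # Keep a control space ("\ ") or the space after a line break ("\\").
        if before.endswith("\\"):
            return " "
        return ""

    return re.sub(r"\s+", repl, body)


def compress_formulas(text: str) -> str:
    """Compress spaces inside every formula and leave other text untouched."""

    def repl(match: re.Match) -> str:
        if match.group(1) is not None:
            return "$$" + compress_math_spaces(match.group(1)) + "$$"
        return "$" + compress_math_spaces(match.group(2)) + "$"

    return MATH_RE.sub(repl, text)


print(compress_formulas(r"Energy $ \alpha + \beta $ here"))
```

The control-word check is the important design choice: blindly deleting every space would save a few more tokens but could change what the formula means.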

🛡️ Intelligent Shield Protection Mechanism

For sensitive content in academic documents, we developed a three-stage "preprocessing-cleaning-restoration" process so that formulas, code, and other protected content pass through cleaning unchanged:

  • Pre-Shield: Uses a regex engine to lock LaTeX formulas, code blocks, and other protected regions, mapping each one to a fixed-width placeholder (such as __CLEANLIT_SHIELD_001__).
  • Safe Cleaning: Applies aggressive chapter standardization and whitespace compression only to the plain-text regions outside the placeholders.
  • Precise Restoration: After cleaning, replaces each placeholder with the original content it stands for.
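
The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the placeholder format follows the example given in the list, but the regular expression and the function names `shield`, `restore`, and `clean_with_shield` are hypothetical and are not taken from the package.

```python
import re
from typing import Callable, Dict, Tuple

# Order matters: fenced code and $$...$$ are tried before inline code and $...$.
PROTECTED_RE = re.compile(
    r"```.*?```"      # fenced code block
    r"|\$\$.+?\$\$"   # block formula
    r"|`[^`\n]+`"     # inline code
    r"|\$[^$\n]+\$",  # inline formula
    re.DOTALL,
)


def shield(text: str) -> Tuple[str, Dict[str, str]]:
    """Stage 1: swap protected regions for numbered placeholders."""
    vault: Dict[str, str] = {}

    def stash(match: re.Match) -> str:
        key = f"__CLEANLIT_SHIELD_{len(vault) + 1:03d}__"
        vault[key] = match.group(0)
        return key

    return PROTECTED_RE.sub(stash, text), vault


def restore(text: str, vault: Dict[str, str]) -> str:
    """Stage 3: put the original content back in place of each placeholder."""
    for key, original in vault.items():
        text = text.replace(key, original)
    return text


def clean_with_shield(text: str, cleaner: Callable[[str], str]) -> str:
    """Run a cleaner (stage 2) on the text outside the protected regions only."""
    shielded, vault = shield(text)
    return restore(cleaner(shielded), vault)


def collapse(s: str) -> str:
    # Example cleaner: collapse runs of spaces and tabs into one space.
    return re.sub(r"[ \t]+", " ", s)


raw = "Mass   is  $ m  c^2 $   and `a   b` stays."
print(clean_with_shield(raw, collapse))
```

Because the placeholders contain no whitespace, the whitespace cleaner cannot alter them, which is why the spacing inside the formula and the inline code survives.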

🔌 Modular Extension Plugins

Built on a Plugin Registry architecture, the tool lets developers extend business logic through decorators. This design fully decouples cleaning rules from the core pipeline, so users can dynamically assemble their own pipeline for different corpora and research needs.

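# PluginRegistry must be imported from the package first (import path not shown here).
@PluginRegistry.register("custom_academic_rule")
def my_rule(text: str) -> str:
    # Custom cleaning logic for a specific field (such as physics or biology)
    processed_text = text  # replace with the field-specific transformation
    return processed_text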
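
To show how a decorator-based registry of this kind can work in general, here is a self-contained sketch. `SketchRegistry` is a hypothetical stand-in written for this illustration; it is not the package's `PluginRegistry`, whose internals and method names may differ.

```python
from typing import Callable, Dict, List

Rule = Callable[[str], str]


class SketchRegistry:
    """Hypothetical stand-in for a plugin registry (not the real class)."""

    _plugins: Dict[str, Rule] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Rule], Rule]:
        """Return a decorator that stores the rule under the given name."""

        def decorator(func: Rule) -> Rule:
            cls._plugins[name] = func
            return func  # the function itself stays usable on its own

        return decorator

    @classmethod
    def run(cls, text: str, pipeline: List[str]) -> str:
        """Apply the named rules in order."""
        for name in pipeline:
            text = cls._plugins[name](text)
        return text


@SketchRegistry.register("strip_soft_hyphens")
def strip_soft_hyphens(text: str) -> str:
    return text.replace("\u00ad", "")


@SketchRegistry.register("collapse_spaces")
def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


print(SketchRegistry.run("co\u00adoperate   now", ["strip_soft_hyphens", "collapse_spaces"]))
```

Since the pipeline is just a list of names, rules can be reordered or dropped per corpus without touching the code that runs them.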

📊 Core Function Matrix

| Function Module | Problem Solved | Example (Input -> Output) |
| --- | --- | --- |
| Chinese-English Punctuation Fix | Fix mixed punctuation and extra spaces | "Hello,world" -> Standardized output |
| Reference Literature Standardization | Unify citation markers like 【1】, [ 2 ] | 【1】 -> [1] |
| Roman Numeral Conversion | Unify number representation (like II, III) | Chapter II -> Chapter 2 |
| Chapter Numbering Standardization | Fix chaotic document structure numbering | 第1章 1.1节 -> Chapter 1, Section 1.1 |
| Greek Letter Conversion | Convert Greek characters to LaTeX academic symbols | α + β -> \alpha + \beta |
| Whitespace Optimization | Clear redundant spaces and illegal line breaks | Text with spaces -> Standardized spacing |
| LaTeX Formula Compression | Preprocess spaces in formulas to optimize token consumption | $ \alpha + \beta $ -> $\alpha+\beta$ |
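
Three of the rules in the matrix are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the package's implementation: the function names and the three-letter `GREEK` table are hypothetical, and the Greek conversion does not guard against a letter being glued to the next character (for example "αx").

```python
import re

GREEK = {"α": r"\alpha", "β": r"\beta", "γ": r"\gamma"}  # deliberately partial
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def normalize_citations(text: str) -> str:
    """Turn markers such as 【 1 】 or [ 2 ] into [1] and [2]."""
    return re.sub(r"[【\[]\s*(\d+)\s*[】\]]", r"[\1]", text)


def roman_to_int(numeral: str) -> int:
    """Convert a well-formed Roman numeral using the subtractive rule."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def convert_chapter_numerals(text: str) -> str:
    """Rewrite 'Chapter II' as 'Chapter 2'; other words are left untouched."""
    return re.sub(
        r"\b(Chapter)\s+([IVXLCDM]+)\b",
        lambda m: f"{m.group(1)} {roman_to_int(m.group(2))}",
        text,
    )


def greek_to_latex(text: str) -> str:
    """Replace each known Greek character with its LaTeX command."""
    return "".join(GREEK.get(ch, ch) for ch in text)


print(normalize_citations("see 【 1 】 and [ 2 ]"))
print(convert_chapter_numerals("Chapter II"))
print(greek_to_latex("α + β"))
```

Restricting the Roman numeral rule to the word "Chapter" is a conservative choice: it avoids rewriting ordinary capital letters such as the pronoun "I".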

⚙️ Installation and Naming Specifications

⚠️ Important Notice:

  • PyPI Installation Name: PorosData-Processor (using hyphen -)
  • Python Import Name: import porosdata_processor (using underscore _)
  • Command Line Tool: porosdata-processor

pip install porosdata-processor

After installing, verify that the package imports under its underscore name:

python -c "import porosdata_processor; print('porosdata_processor imported successfully')"

🚀 Quick Start

from porosdata_processor import TextCleaner

# Method 1: Use default pipeline (recommended)
cleaner = TextCleaner()

# Method 2: Custom plugin combination
cleaner = TextCleaner(pipeline=["patterns_cleaning", "greek_to_latex"])

# Execute cleaning
raw_text = "Identify α particles described in literature 【1】."
cleaned_text = cleaner.clean(raw_text)

# Enable advanced options: clean redundant spaces inside formulas
cleaner = TextCleaner(clean_options={"clean_latex_math_spaces": True})

# Enable only specific plugins
custom_cleaner = TextCleaner(pipeline=["citation_rules", "greek_to_latex"])

✅ Reliability Guarantee

This project is covered by a strict unit test suite along the following core dimensions:

  • Core Registration System Coverage: 100%
  • Encoding Compatibility Verification: UTF-8 handling is verified on Windows, Linux, and macOS.

🗺️ Development Roadmap

  • Field-Specific Optimization: Deep cleaning for arXiv papers and for formats specific to programming languages.
  • AI Model Integration: Introduce a lightweight LLM to help assess cleaning quality.

Download files

Source Distribution

porosdata_processor-0.2.2.tar.gz (31.3 kB)

Uploaded Source

Built Distribution

porosdata_processor-0.2.2-py3-none-any.whl (31.2 kB)

Uploaded Python 3

File details

Details for the file porosdata_processor-0.2.2.tar.gz.

File metadata

  • Download URL: porosdata_processor-0.2.2.tar.gz
  • Upload date:
  • Size: 31.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for porosdata_processor-0.2.2.tar.gz
Algorithm Hash digest
SHA256 2e09a933c8a4c3030721b9df707fd4ecce3606fd96fe2dfe1dfe895236adbbb2
MD5 4c8b1e852d556da0784d148e64dde378
BLAKE2b-256 30f3af3ee8f99147a7246ec02ee1a2c20e366b2562c4a00a699e54ea400d2c22

File details

Details for the file porosdata_processor-0.2.2-py3-none-any.whl.

File hashes

Hashes for porosdata_processor-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 3020e42038f9fa099b6c417b40dbc324bf7c9233ec7f73be42fcbc6936bfcf29
MD5 18e04f71f3d7a6d8a83e44dd07f28a19
BLAKE2b-256 e3a72ee6ea193d84b12807508c46d489f05a9453bc631aa2911495d94feb3fe1
