An intelligent cleaning pipeline for academic documents in AI for Science workflows, ensuring that MinerU-parsed data meets LLM input standards.

PorosData-Processor

PorosData-Processor cleans MinerU-derived scientific document text for downstream LLM and data-mining workflows. It normalizes text without breaking LaTeX formulas, repairs OCR errors, cleans up citations and numbering, and optionally evaluates token efficiency.

Installation

pip install porosdata-processor

Optional extras:

# Token evaluation support
pip install "porosdata-processor[eval]"

# Streaming JSON parsing for large files
pip install "porosdata-processor[batch]"

Python 3.8+ is supported.

What The Package Does

  • Cleans scientific text while preserving LaTeX formulas via a placeholder-based protect/restore mechanism (Shield).
  • Normalizes whitespace, Unicode, citations, and numbering, and repairs OCR errors.
  • Provides a Python API for single-text processing.
  • Provides a CLI for batch processing MinerU *_content_list.json files.
  • Optionally computes token-efficiency statistics when transformers is installed.
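
The placeholder-based protect/restore idea mentioned above can be illustrated with a small standalone sketch. This is not the package's Shield implementation: the regular expression, the placeholder format, and the function names `protect`, `restore`, and `clean_with_shield` are all assumptions made for illustration, and only `$...$` and `$$...$$` math is recognized.

```python
import re

# Hypothetical pattern: display math ($$...$$) first, then inline math ($...$).
MATH_PATTERN = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)


def protect(text):
    """Replace each formula with an opaque placeholder and remember the original."""
    formulas = []

    def stash(match):
        formulas.append(match.group(0))
        # NUL-delimited token that ordinary cleaning rules will not touch.
        return f"\x00MATH{len(formulas) - 1}\x00"

    return MATH_PATTERN.sub(stash, text), formulas


def restore(text, formulas):
    """Swap every placeholder back for the formula it stands for."""
    return re.sub(r"\x00MATH(\d+)\x00", lambda m: formulas[int(m.group(1))], text)


def clean_with_shield(text):
    shielded, formulas = protect(text)
    # Any cleaning step could run here; collapsing whitespace is only an example.
    cleaned = re.sub(r"\s+", " ", shielded).strip()
    return restore(cleaned, formulas)


print(clean_with_shield("Energy   is $E  =  mc^2$   here."))
```

The whitespace inside the formula survives because the cleaning step only ever sees the placeholder, which is the point of shielding formulas before normalization.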

Python API

from porosdata_processor import TextCleaner

cleaner = TextCleaner()
result = cleaner.clean("Recent studies 【1】 show α-phase stability.")
print(result)

Example output:

Recent studies ref[1] show \alpha-phase stability.

Custom pipeline example:

from porosdata_processor import TextCleaner

cleaner = TextCleaner(
    pipeline=[
        "unicode_normalization",
        "patterns_cleaning",
        "normalize_whitespace",
    ]
)

result = cleaner.clean("Text   with   extra spaces")
print(result)

Optional evaluation mode:

from porosdata_processor import TextCleaner

cleaner = TextCleaner.with_tokenizer_evaluation("gpt2")
result = cleaner.clean("Recent studies 【1】 show α-phase stability.", eval_mode=True)
print(result["processed_text"])
print(result["evaluation"]["overall"]["compression_rate"])
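
The README does not define how `compression_rate` is computed. As a rough sketch only, assuming it means the fraction of tokens removed by cleaning, the calculation could look like the following. The function name, the formula, and the whitespace tokenizer default are illustrative assumptions, not the package's implementation.

```python
def compression_rate(original, processed, tokenize=str.split):
    """Fraction of tokens removed by cleaning (assumed definition)."""
    before = len(tokenize(original))
    after = len(tokenize(processed))
    if before == 0:
        # Nothing to compress; avoid dividing by zero.
        return 0.0
    return 1.0 - after / before


# Whitespace tokens stand in for a real tokenizer in this example.
print(compression_rate("a  b  c d", "a b"))
```

With transformers installed, a real tokenizer's `tokenize` method (for example from `AutoTokenizer.from_pretrained("gpt2")`) could be passed as the `tokenize` argument instead of `str.split`.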

CLI Batch Processing

The CLI recursively scans the input directory for MinerU *_content_list.json files and writes cleaned JSON outputs plus processing_report.json to the output directory.

porosdata-processor run \
    --input-dir data/raw \
    --output-dir data/processed \
    --max-workers 4
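
For readers who want to preview which files a run would pick up, the recursive scan can be approximated in a few lines of Python. This is a sketch rather than the CLI's own code: `find_content_lists` and `output_path_for` are hypothetical helpers, and mirroring the input layout under the output directory is an assumption about how outputs are arranged.

```python
from pathlib import Path


def find_content_lists(input_dir):
    """Return every *_content_list.json file under input_dir, sorted."""
    return sorted(Path(input_dir).rglob("*_content_list.json"))


def output_path_for(source, input_dir, output_dir):
    """Mirror the source file's relative location under output_dir (assumed layout)."""
    return Path(output_dir) / Path(source).relative_to(input_dir)


for path in find_content_lists("data/raw"):
    print(path, "->", output_path_for(path, "data/raw", "data/processed"))
```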

Enable token-efficiency evaluation:

porosdata-processor run \
    --input-dir data/raw \
    --output-dir data/processed \
    --enable-evaluation \
    --max-workers 4

Common flags:

  • --input-dir: input directory containing MinerU outputs
  • --output-dir: output directory for cleaned files
  • --enable-evaluation: enable token-efficiency evaluation
  • --max-workers: set the number of worker processes
  • --force-reprocess: ignore existing outputs and re-run processing
  • --memory-limit: memory limit in MB
  • --log-level: DEBUG, INFO, WARNING, or ERROR
  • --heartbeat-seconds: emit runtime heartbeat logs every N seconds
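
When the CLI is driven from a Python script, the flags listed above can be assembled into an argument list. The helper below is a hypothetical convenience, not part of the package, and it assumes that `--log-level` is accepted by the `run` subcommand.

```python
import shlex


def build_run_command(input_dir, output_dir, max_workers=4,
                      enable_evaluation=False, log_level="INFO"):
    """Assemble the argument list for `porosdata-processor run`."""
    cmd = [
        "porosdata-processor", "run",
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
        "--max-workers", str(max_workers),
        "--log-level", log_level,
    ]
    if enable_evaluation:
        cmd.append("--enable-evaluation")
    return cmd


cmd = build_run_command("data/raw", "data/processed", enable_evaluation=True)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when directory names contain spaces.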

Scope And Input Format

  • Batch processing is designed for MinerU-generated JSON content lists, not for generic JSONL, Parquet, or HDF5 datasets.
  • The primary public API is TextCleaner for string-based cleaning and the porosdata-processor CLI for directory-based processing.
  • Commands such as audit, sample-validate, and delivery-gate are also available from the CLI, but they are intended for internal data-governance workflows.
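
To make the input format more concrete, here is a minimal sketch of applying a cleaning function to the text entries of a content list. It assumes each entry is a JSON object with a `type` field and, for text blocks, a `text` field; check these names against your own MinerU output. `clean_content_list` is a hypothetical helper, and the whitespace-collapsing lambda stands in for a real cleaner such as `TextCleaner().clean`.

```python
import json


def clean_content_list(items, clean):
    """Return a copy of items with `clean` applied to each text entry."""
    cleaned = []
    for item in items:
        item = dict(item)  # shallow copy so the input list stays untouched
        if item.get("type") == "text" and isinstance(item.get("text"), str):
            item["text"] = clean(item["text"])
        cleaned.append(item)
    return cleaned


# Tiny inline sample using the assumed field names.
sample = json.loads(
    '[{"type": "text", "text": "Text   with   extra spaces"},'
    ' {"type": "image", "img_path": "images/fig1.jpg"}]'
)
result = clean_content_list(sample, lambda s: " ".join(s.split()))
print(json.dumps(result, indent=2))
```

Non-text entries pass through unchanged, so images and other blocks keep their original metadata.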

License

PorosData-Processor is released under the MIT License.
