
Structured delivery toolkit: MinerU content_list to per-doc layout (content/structure/assets under one doc_id directory).


PorosData-Designer

PorosData-Designer converts MinerU-generated document parses into a unified per-document layout under {output_root}/{doc_id}/: structure-aware training text (*.content.json / *.content.txt), a datamining view (*.structure.json), and multimodal delivery (*.assets.index.json and images/).

It is designed for scientific data processing and structure-aware preparation centered on paragraphs, formulas, chemical expressions, and figure assets.

What it does

  • Builds a structure-aware full-text view from *_content_list.json.
  • Maps document sections, formulas, chemical expressions, and asset references into *.structure.json.
  • Extracts image/caption/mention relationships and ships copied assets plus Markdown cards under images/.

Install

pip install porosdata-designer

After installation, import the package as designer in Python and invoke the CLI with the designer command. The PyPI distribution name is porosdata-designer.

Python requirement: >=3.8.

Quick start (CLI)

  1. --input_dir: a directory tree containing MinerU outputs; it is searched recursively for *_content_list.json files.
  2. --output_dir: the directory where the structured delivery is written (when omitted, the default comes from the package config).
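To preview which files a run would pick up, the recursive search can be approximated with the standard library. This is a minimal sketch, not the package's own discovery code: the helper name find_content_lists is hypothetical, and it assumes the only selection rule is the *_content_list.json file-name pattern.

```python
from pathlib import Path
from typing import List


def find_content_lists(input_dir: str) -> List[Path]:
    """Return every *_content_list.json file under input_dir, sorted for a stable order.

    Hypothetical helper: it mirrors the "recursive *_content_list.json" wording
    of the README and does not call into the designer package.
    """
    root = Path(input_dir)
    # rglob walks the whole directory tree below root.
    return sorted(p for p in root.rglob("*_content_list.json") if p.is_file())


if __name__ == "__main__":
    for path in find_content_lists("path/to/input_dir"):
        print(path)
```

An empty result usually means --input_dir points at the wrong level of the MinerU output tree.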

Run the full pipeline:

designer run all --input_dir "path/to/input_dir" --output_dir "path/to/output_dir" --log_dir "path/to/log_dir"

You can also run only one stage:

designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"

Outputs

Each document is written to {output_root}/{doc_id}/, where {output_root} is the directory passed as --output_dir:

path/to/output_root/
└── {doc_id}/
    ├── {doc_id}.content.json
    ├── {doc_id}.content.txt
    ├── {doc_id}.structure.json
    ├── {doc_id}.assets.index.json
    └── images/
        ├── fig_1.jpg
        └── fig_1.md
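The layout above can be checked with a few lines of standard-library Python before handing a delivery on. This is a sketch under the assumption that the tree shown is complete for every document; expected_outputs and missing_outputs are hypothetical helpers, not part of the designer package, and the file names under images/ are not checked because they vary per document.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(output_root: str, doc_id: str) -> Dict[str, Path]:
    """Map a short label to each path the README lists for one document."""
    doc_dir = Path(output_root) / doc_id
    return {
        "content_json": doc_dir / f"{doc_id}.content.json",
        "content_txt": doc_dir / f"{doc_id}.content.txt",
        "structure_json": doc_dir / f"{doc_id}.structure.json",
        "assets_index": doc_dir / f"{doc_id}.assets.index.json",
        "images_dir": doc_dir / "images",
    }


def missing_outputs(output_root: str, doc_id: str) -> List[str]:
    """Return the labels of expected paths that do not exist on disk."""
    paths = expected_outputs(output_root, doc_id)
    return [label for label, path in paths.items() if not path.exists()]


if __name__ == "__main__":
    # "demo-0001" is a placeholder doc_id.
    print(missing_outputs("path/to/output_root", "demo-0001"))
```

This only checks that the paths exist; the designer validate commands remain the authoritative check of the contents.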

Validation / audit (CLI)

Examples (replace the placeholder paths with locations on your filesystem):

# Audit structured outputs (content/structure/*.json under each doc_id)
designer audit structured --root_dir "path/to/output_root"

# Validate structured outputs
designer validate structured --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Validate multimodal index files
designer validate multimodal --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Final acceptance validation
designer validate acceptance --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Validate against the delivery standard
designer validate delivery --root_dir "path/to/output_root" --log_dir "path/to/log_dir"
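When the checks above are run together, for example in CI, the command lines can be assembled in Python. The sketch below only builds argument lists from the subcommands and flags shown above; validation_commands is a hypothetical helper, and nothing is executed.

```python
import shlex
from typing import List


def validation_commands(output_root: str, log_dir: str) -> List[List[str]]:
    """Build the audit/validate command lines listed in the README, in order."""
    commands = [["designer", "audit", "structured", "--root_dir", output_root]]
    for target in ("structured", "multimodal", "acceptance"):
        commands.append(
            ["designer", "validate", target,
             "--output_dir", output_root, "--log_dir", log_dir]
        )
    # The delivery check takes --root_dir rather than --output_dir.
    commands.append(
        ["designer", "validate", "delivery",
         "--root_dir", output_root, "--log_dir", log_dir]
    )
    return commands


if __name__ == "__main__":
    for cmd in validation_commands("path/to/output_root", "path/to/log_dir"):
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True). Whether the CLI signals a failed validation through a non-zero exit code is an assumption to confirm against the installed version.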

Python usage

You can also use the package directly in Python:

from designer import DataMiningMapper, MultimodalInterleaver, TextAggregator

aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()

Text-side example:

from designer import DataMiningMapper, TextAggregator

content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]

aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)

mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})

print(structured_text)
print(datamining_view.pure_text_stream)
print(datamining_view.structured_json["sections"])

Expected outcome:

  • structured_text contains Poros tags such as <poros_doc>, <poros_section_*>, and <poros_paragraph>.
  • pure_text_stream removes the structure tags while keeping readable text.
  • structured_json exposes mined fields such as sections, formulas, chemical_formulas, and asset_refs.
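To make the relationship between structured_text and pure_text_stream concrete, the sketch below strips Poros-style tags with a regular expression. It is an illustration only, not the package's implementation: the tag syntax (plain <poros_name> openers with matching </poros_name> closers and no attributes), the sample string, and the helper strip_poros_tags are all assumptions.

```python
import re

# Assumption: tags look like <poros_doc> or </poros_section_1>, without attributes.
_POROS_TAG = re.compile(r"</?poros_[A-Za-z0-9_]*>")


def strip_poros_tags(tagged: str) -> str:
    """Drop Poros-style tags and return the remaining text, one block per line."""
    # Replace each tag with a newline so adjacent blocks do not run together.
    without_tags = _POROS_TAG.sub("\n", tagged)
    lines = (line.strip() for line in without_tags.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    # Hypothetical tagged text; real output from TextAggregator may differ.
    sample = (
        "<poros_doc><poros_section_1>Abstract"
        "<poros_paragraph>This work studies a Cu-Zr metallic glass system."
        "</poros_paragraph></poros_section_1></poros_doc>"
    )
    print(strip_poros_tags(sample))
```

For real data, prefer datamining_view.pure_text_stream, which comes from the package itself.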
