Structured delivery toolkit: MinerU content_list to per-doc layout (content/structure/assets under one doc_id directory).

PorosData-Designer

PorosData-Designer converts MinerU document parses into a unified per-document layout under {output_root}/{doc_id}/. Each document gets structure-aware training text (*.content.json and *.content.txt), a data-mining view (*.structure.json), and a multimodal delivery (*.assets.index.json plus images/).

It targets scientific documents and organizes the output around paragraphs, formulas, chemical expressions, and figure assets.

What it does

  • Builds a structure-aware full-text view from *_content_list.json.
  • Maps document sections, formulas, chemical expressions, and asset references into *.structure.json.
  • Extracts image/caption/mention relationships and ships copied assets plus Markdown cards under images/.

Install

pip install porosdata-designer

Requires Python 3.8 or later.

Quick start (CLI)

  1. --input_dir: a directory tree containing MinerU outputs; it is searched recursively for *_content_list.json files.
  2. --output_dir: where the structured delivery is written. When omitted, the default comes from the package config.
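
To preview which files --input_dir would pick up, the recursive *_content_list.json search can be approximated with the standard library. This is a minimal sketch, not the package's own discovery code. The helper name find_content_lists and the rule that a doc_id is the filename with the _content_list.json suffix removed are assumptions made for illustration.

```python
from pathlib import Path

SUFFIX = "_content_list.json"


def find_content_lists(input_dir):
    """Return {assumed_doc_id: path} for every *_content_list.json under input_dir."""
    found = {}
    for path in sorted(Path(input_dir).rglob("*" + SUFFIX)):
        if path.is_file():
            # Assumption: the doc_id is the filename minus the suffix.
            doc_id = path.name[: -len(SUFFIX)]
            found[doc_id] = path
    return found
```

Calling find_content_lists("path/to/input_dir") before a run is a cheap way to confirm that the directory tree contains the inputs you expect.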

Run the full pipeline:

designer run all --input_dir "path/to/input_dir" --output_dir "path/to/output_dir" --log_dir "path/to/log_dir"

You can also run a single stage:

designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"

Outputs

Each document is written to {output_root}/{doc_id}/:

path/to/output_root/
└── {doc_id}/
    ├── {doc_id}.content.json
    ├── {doc_id}.content.txt
    ├── {doc_id}.structure.json
    ├── {doc_id}.assets.index.json
    └── images/
        ├── fig_1.jpg
        └── fig_1.md
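
The layout above can be spot-checked with a few lines of Python. The sketch below only tests for the entries shown in the tree; the helper name missing_outputs is hypothetical, and it assumes every file is present after a full run, which may not hold if only one stage was executed.

```python
from pathlib import Path

# File suffixes taken from the layout tree above.
EXPECTED_SUFFIXES = (
    ".content.json",
    ".content.txt",
    ".structure.json",
    ".assets.index.json",
)


def missing_outputs(output_root, doc_id):
    """List expected entries that are absent from {output_root}/{doc_id}/."""
    doc_dir = Path(output_root) / doc_id
    missing = [
        doc_id + suffix
        for suffix in EXPECTED_SUFFIXES
        if not (doc_dir / (doc_id + suffix)).is_file()
    ]
    if not (doc_dir / "images").is_dir():
        missing.append("images/")
    return missing
```

An empty list means the directory matches the tree. This is a quick sanity check and does not replace the audit and validate commands.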

Validation / audit (CLI)

Examples (the paths are placeholders; substitute your own):

# Audit structured outputs (content/structure/*.json under each doc_id)
designer audit structured --root_dir "path/to/output_root"

# Validate structured outputs
designer validate structured --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Validate multimodal index files
designer validate multimodal --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Final acceptance validation
designer validate acceptance --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Validate against the delivery standard
designer validate delivery --root_dir "path/to/output_root" --log_dir "path/to/log_dir"
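
If you want to chain the validate commands from a script, one option is to build the argument lists in Python and stop at the first failure. This is a sketch under assumptions: the function names are hypothetical, and it assumes the CLI returns a non-zero exit code when a check fails, which the commands above do not confirm.

```python
import subprocess


def validation_commands(output_root, log_dir):
    """Build argv lists for the validate subcommands listed above."""
    commands = []
    for stage in ("structured", "multimodal", "acceptance"):
        commands.append(["designer", "validate", stage,
                         "--output_dir", output_root, "--log_dir", log_dir])
    # The delivery check takes --root_dir rather than --output_dir.
    commands.append(["designer", "validate", "delivery",
                     "--root_dir", output_root, "--log_dir", log_dir])
    return commands


def run_validations(output_root, log_dir):
    """Run each command in order; return the first failing argv, or None."""
    for argv in validation_commands(output_root, log_dir):
        # Assumption: a failed validation yields a non-zero exit code.
        if subprocess.run(argv).returncode != 0:
            return argv
    return None
```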

Python usage

You can also use the package directly in Python:

from designer import DataMiningMapper, MultimodalInterleaver, TextAggregator

aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()

Text-side example:

from designer import DataMiningMapper, TextAggregator

content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]

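# Build the tagged, structure-aware text from the MinerU content list.
aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)

# Map the tagged text into the data-mining view for this document.
mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})

print(structured_text)                              # text with Poros tags
print(datamining_view.pure_text_stream)             # readable text, tags removed
print(datamining_view.structured_json["sections"])  # mined section entries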

Expected outcome:

  • structured_text contains Poros tags such as <poros_doc>, <poros_section_*>, and <poros_paragraph>.
  • pure_text_stream removes the structure tags while keeping readable text.
  • structured_json exposes mined fields such as sections, formulas, chemical_formulas, and asset_refs.
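
To make the relationship between structured_text and pure_text_stream concrete, the sketch below strips Poros-style tags with a regular expression. It illustrates the idea only and is not the package's implementation. The closing-tag form (</poros_...>), the sample tag name poros_section_1, and the helper name strip_poros_tags are assumptions.

```python
import re

# Matches opening and closing tags whose name starts with "poros_".
POROS_TAG = re.compile(r"</?poros_[A-Za-z0-9_]+[^>]*>")


def strip_poros_tags(tagged_text):
    """Remove Poros-style tags and drop the blank lines they leave behind."""
    text = POROS_TAG.sub("", tagged_text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical tagged input; the real tag layout may differ.
sample = (
    "<poros_doc>\n"
    "<poros_section_1>Abstract</poros_section_1>\n"
    "<poros_paragraph>This work studies a Cu-Zr metallic glass system."
    "</poros_paragraph>\n"
    "</poros_doc>"
)
print(strip_poros_tags(sample))
```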
