Structured delivery toolkit: MinerU content_list to per-doc layout (content/structure/assets under one doc_id directory).
PorosData-Designer
PorosData-Designer converts MinerU-generated document parses into a unified per-document layout under {output_root}/{doc_id}/: structure-aware training text (*.content.json / *.content.txt), a datamining view (*.structure.json), and multimodal delivery (*.assets.index.json and images/).
It is designed for scientific data processing and structure-aware preparation centered on paragraphs, formulas, chemical expressions, and figure assets.
What it does
- Builds a structure-aware full-text view from *_content_list.json.
- Maps document sections, formulas, chemical expressions, and asset references into *.structure.json.
- Extracts image/caption/mention relationships and ships copied assets plus Markdown cards under images/.
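To make the idea of a structure-aware view concrete, here is a minimal sketch that groups MinerU content_list items under their headings. It is not the package's implementation: group_by_heading is a hypothetical helper, and it assumes only the usual content_list keys (type, text, text_level), treating any text item with a text_level as a heading.

```python
from typing import Any, Dict, List


def group_by_heading(content_list: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group text items into sections; an item with text_level starts a new section."""
    sections: List[Dict[str, Any]] = []
    current: Dict[str, Any] = {"title": None, "paragraphs": []}
    for item in content_list:
        # Assumption: only "text" items matter for this simplified view.
        if item.get("type") != "text":
            continue
        text = item.get("text", "").strip()
        if not text:
            continue
        if item.get("text_level"):
            # Close the running section before opening a new one.
            if current["title"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"title": text, "paragraphs": []}
        else:
            current["paragraphs"].append(text)
    if current["title"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections
```

Text that appears before the first heading ends up in a section whose title is None, so nothing is silently dropped.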
Install
pip install porosdata-designer
Python requirement: >=3.8.
Quick start (CLI)
- --input_dir: a directory tree containing MinerU outputs (searched recursively for *_content_list.json).
- --output_dir: where the structured delivery will be written (when omitted, the default comes from the package config).
Run the full pipeline:
designer run all --input_dir "path/to/input_dir" --output_dir "path/to/output_dir" --log_dir "path/to/log_dir"
You can also run only one stage:
designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
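To preview which files a run would pick up, the recursive *_content_list.json search described for --input_dir can be approximated with the standard library. This is a sketch under assumptions: find_content_lists is a hypothetical helper, and the package's own discovery logic may filter or order files differently.

```python
from pathlib import Path
from typing import List


def find_content_lists(input_dir: str) -> List[Path]:
    """Return every *_content_list.json under input_dir, sorted for stable ordering."""
    root = Path(input_dir)
    # rglob walks the whole directory tree, matching the recursive search described above.
    return sorted(p for p in root.rglob("*_content_list.json") if p.is_file())
```

Calling find_content_lists("path/to/input_dir") before a run is a cheap way to confirm that the input tree contains the MinerU outputs you expect.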
Outputs
Each document is written to {output_root}/{doc_id}/:
path/to/output_root/
└── {doc_id}/
    ├── {doc_id}.content.json
    ├── {doc_id}.content.txt
    ├── {doc_id}.structure.json
    ├── {doc_id}.assets.index.json
    └── images/
        ├── fig_1.jpg
        └── fig_1.md
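A quick way to sanity-check this layout after a full run is to look for the expected entries by name. The sketch below is not a substitute for the package's validate commands: missing_outputs is a hypothetical helper that only tests for existence, and it assumes the images/ directory is present for every document.

```python
from pathlib import Path
from typing import List


def missing_outputs(output_root: str, doc_id: str) -> List[str]:
    """List expected entries that are absent from {output_root}/{doc_id}/."""
    doc_dir = Path(output_root) / doc_id
    expected = [
        f"{doc_id}.content.json",
        f"{doc_id}.content.txt",
        f"{doc_id}.structure.json",
        f"{doc_id}.assets.index.json",
        "images",  # assumption: present after a full run, may be absent after a text-only run
    ]
    return [name for name in expected if not (doc_dir / name).exists()]
```

An empty list means every expected entry exists; it says nothing about whether the file contents are valid.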
Validation / audit (CLI)
Examples (paths are relative to your local filesystem):
# Audit structured outputs (content/structure/*.json under each doc_id)
designer audit structured --root_dir "path/to/output_root"
# Validate structured outputs
designer validate structured --output_dir "path/to/output_root" --log_dir "path/to/log_dir"
# Validate multimodal index files
designer validate multimodal --output_dir "path/to/output_root" --log_dir "path/to/log_dir"
# Final acceptance validation
designer validate acceptance --output_dir "path/to/output_root" --log_dir "path/to/log_dir"
# Validate against the delivery standard
designer validate delivery --root_dir "path/to/output_root" --log_dir "path/to/log_dir"
Python usage
You can also use the package directly in Python:
from designer import DataMiningMapper, MultimodalInterleaver, TextAggregator
aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()
Text-side example:
from designer import DataMiningMapper, TextAggregator
content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]
aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)
mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})
print(structured_text)
print(datamining_view.pure_text_stream)
print(datamining_view.structured_json["sections"])
Expected outcome:
- structured_text contains Poros tags such as <poros_doc>, <poros_section_*>, and <poros_paragraph>.
- pure_text_stream removes the structure tags while keeping readable text.
- structured_json exposes mined fields such as sections, formulas, chemical_formulas, and asset_refs.
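The relationship between the tagged text and the tag-free stream can be illustrated with a small regular expression. This is a sketch under stated assumptions rather than the package's logic: strip_poros_tags is a hypothetical helper, and it assumes the Poros tags are XML-style <poros_...> and </poros_...> markers, which this page does not spell out.

```python
import re

# Assumption: tags are XML-style markers whose names start with "poros_".
_POROS_TAG = re.compile(r"</?poros_[A-Za-z0-9_]+[^>]*>")


def strip_poros_tags(tagged: str) -> str:
    """Remove <poros_*> markers and drop the empty lines they leave behind."""
    # Replace each tag with a newline so adjacent blocks do not run together.
    text = _POROS_TAG.sub("\n", tagged)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

For real data, prefer datamining_view.pure_text_stream from the example above, since the package knows its own tag grammar.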