Structured delivery toolkit: MinerU content_list to per-doc layout (content/structure/assets under one doc_id directory).
PorosData-Designer
PorosData-Designer converts MinerU-generated document parses into a unified per-document layout under {output_root}/{doc_id}/: structure-aware training text (*.content.json / *.content.txt), a datamining view (*.structure.json), and multimodal delivery (*.assets.index.json and images/).
It is designed for scientific data processing, structure-aware training preparation, and atomic document design centered on paragraphs, formulas, chemical expressions, and figure assets.
What It Does
- Builds a structure-aware full-text view from *_content_list.json.
- Maps document sections, formulas, chemical expressions, and asset references into *.structure.json.
- Extracts image-caption-mention relationships with copied assets and Markdown cards under images/.
Install
Recommended Python version: 3.12.6 (validated on Windows win32 10.0.26200).
Minimum supported version remains 3.8; Python 3.9-3.11 are syntax-compatible but not fully regression-tested in this repository.
pip install porosdata-designer
Runtime install now pulls only one direct dependency: loguru>=0.7.0.
For development in this repository:
pip install -e ".[dev]"
Quick Start
--input_dir must point to a directory tree that contains MinerU outputs (recursive *_content_list.json). In this repository the conventional input root is data/Processed Database.
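To check what an input tree contains before running the pipeline, the recursive *_content_list.json lookup can be approximated with the standard library. This is a minimal sketch, not the package's own discovery code: find_content_lists and guess_doc_id are hypothetical helpers, and deriving a label by stripping the suffix from the file name is an assumption about naming, not documented behavior.

```python
from pathlib import Path

SUFFIX = "_content_list.json"


def find_content_lists(input_dir):
    """Return every *_content_list.json below input_dir, sorted for stable output."""
    return sorted(Path(input_dir).rglob("*" + SUFFIX))


def guess_doc_id(path):
    """Hypothetical helper: strip the suffix from the file name to get a label.

    The real doc_id scheme used by the package may differ.
    """
    return Path(path).name[: -len(SUFFIX)]


if __name__ == "__main__":
    for found in find_content_lists("data/Processed Database"):
        print(guess_doc_id(found), found)
```

An empty result usually means --input_dir points at the wrong level of the tree.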
Recommended in this repo (full pipeline):
./scripts/run_designeddataset.sh
# or explicit paths:
./scripts/run_designeddataset.sh "data/Processed Database" "data/Designed Database" logs
The script exports PYTHONPATH and runs python -m src.porosdata_designer.cli run all with the same defaults.
After editable install:
porosdata-designer run all \
--input_dir "data/Processed Database" \
--output_dir "data/Designed Database" \
--log_dir logs
Module mode (same as installed package):
python -m porosdata_designer run all \
--input_dir "data/Processed Database" \
--output_dir "data/Designed Database" \
--log_dir logs
Unpacked source without install (equivalent to the shell script):
export PYTHONPATH="${PWD}:${PWD}/src${PYTHONPATH:+:${PYTHONPATH}}"
python -m src.porosdata_designer.cli run all \
--input_dir "data/Processed Database" \
--output_dir "data/Designed Database" \
--log_dir logs
If you omit --output_dir, the default is data/Designed Database under the project root (see DEFAULT_DESIGNED_OUTPUT_DIR_NAME in src/porosdata_designer/runtime/config.py).
Stage Commands
Run text structuring only:
porosdata-designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
Run multimodal extraction only (same input tree as text in this repo):
porosdata-designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
Outputs
Each document is written to {output_root}/{doc_id}/:
path/to/output_root/
└── {doc_id}/
    ├── {doc_id}.content.json
    ├── {doc_id}.content.txt
    ├── {doc_id}.structure.json
    ├── {doc_id}.assets.index.json
    └── images/
        ├── fig_1.jpg
        └── fig_1.md
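The layout above can be spot-checked with a few lines of Python. The sketch below is illustrative only and assumes a full run all has completed, so that all four files and the images/ directory are expected; expected_outputs and missing_outputs are hypothetical helpers, not part of the package.

```python
from pathlib import Path

# File name suffixes taken from the layout listed above.
FILE_SUFFIXES = (
    ".content.json",
    ".content.txt",
    ".structure.json",
    ".assets.index.json",
)


def expected_outputs(output_root, doc_id):
    """List the paths the documented layout implies for one doc_id."""
    doc_dir = Path(output_root) / doc_id
    paths = [doc_dir / (doc_id + suffix) for suffix in FILE_SUFFIXES]
    paths.append(doc_dir / "images")
    return paths


def missing_outputs(output_root, doc_id):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(output_root, doc_id) if not p.exists()]
```

For a quick manual check, missing_outputs("data/Designed Database", "demo-0001") returns an empty list when every expected path is present. The package's own audit and validate commands remain the authoritative checks.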
Validation
Audit outputs (default root is data/Designed Database when --root_dir is omitted):
porosdata-designer audit structured --root_dir "path/to/output_root"
Validate *.content.json files:
porosdata-designer validate structured --output_dir "path/to/output_root"
Validate multimodal indexes:
porosdata-designer validate multimodal --output_dir "path/to/output_root"
Run final acceptance validation:
porosdata-designer validate acceptance --output_dir "path/to/output_root"
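The four checks above can be chained from a script. The following is a sketch under one stated assumption: that the CLI reports failure through a nonzero exit code, which this README does not confirm. build_commands and run_checks are hypothetical names, and the subcommands and flags are copied from the commands listed above.

```python
import subprocess

# Subcommand and path flag for each check, as listed in this section.
CHECKS = [
    ["audit", "structured", "--root_dir"],
    ["validate", "structured", "--output_dir"],
    ["validate", "multimodal", "--output_dir"],
    ["validate", "acceptance", "--output_dir"],
]


def build_commands(output_root):
    """Build one argument list per check, in the order shown above."""
    return [["porosdata-designer", *check, str(output_root)] for check in CHECKS]


def run_checks(output_root):
    """Run the checks in order and return the first failing command, or None.

    Assumption: a nonzero exit code means the check failed.
    """
    for cmd in build_commands(output_root):
        if subprocess.run(cmd).returncode != 0:
            return cmd
    return None
```

Passing argument lists rather than a shell string keeps paths with spaces, such as data/Designed Database, intact without extra quoting.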
Python Usage
You can also use the package directly in Python:
from porosdata_designer import DataMiningMapper, MultimodalInterleaver, TextAggregator
aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()
A more complete text-side example:
from porosdata_designer import DataMiningMapper, TextAggregator
content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]
aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)
mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})
print(structured_text)
print(datamining_view.pure_text_stream)
print(datamining_view.structured_json["sections"])
Expected outcome:
- structured_text contains Poros tags such as <poros_doc>, <poros_section_*>, and <poros_paragraph>.
- pure_text_stream removes the structure tags while keeping readable text.
- structured_json exposes mined fields such as sections, formulas, chemical_formulas, and asset_refs.
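To make the relationship between the tagged text and the plain stream concrete, tag removal can be approximated with a regular expression. This is an illustration, not the mapper's implementation: strip_poros_tags is a hypothetical helper, and the closing-tag form </poros_...> is an assumption, since only the opening tags are named here.

```python
import re

# Matches opening or closing tags whose name starts with "poros_".
# The closing-tag form is assumed, not documented.
POROS_TAG = re.compile(r"</?poros_[A-Za-z0-9_]+[^>]*>")


def strip_poros_tags(text):
    """Replace Poros-style tags with spaces, then collapse whitespace."""
    without_tags = POROS_TAG.sub(" ", text)
    return " ".join(without_tags.split())


sample = (
    "<poros_doc><poros_section_1>Abstract</poros_section_1>"
    "<poros_paragraph>This work studies a Cu-Zr metallic glass system."
    "</poros_paragraph></poros_doc>"
)
print(strip_poros_tags(sample))
# prints: Abstract This work studies a Cu-Zr metallic glass system.
```

Because the pattern only matches names beginning with poros_, ordinary angle brackets in scientific text (for example "x < y") are left untouched. In real use, prefer datamining_view.pure_text_stream, which the package produces itself.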