Structured delivery toolkit: MinerU content_list to per-doc layout (content/structure/assets under one doc_id directory).

PorosData-Designer

PorosData-Designer converts MinerU-generated document parses into a unified per-document layout under {output_root}/{doc_id}/: structure-aware training text (*.content.json / *.content.txt), a datamining view (*.structure.json), and multimodal delivery (*.assets.index.json and images/).

It is designed for scientific data processing, structure-aware training preparation, and atomic document design centered on paragraphs, formulas, chemical expressions, and figure assets.

What It Does

  • Builds a structure-aware full-text view from *_content_list.json.
  • Maps document sections, formulas, chemical expressions, and asset references into *.structure.json.
  • Extracts image-caption-mention relationships with copied assets and Markdown cards under images/.

Install

Recommended Python version: 3.12.6 (validated on Windows win32 10.0.26200). The minimum supported version is 3.8; Python 3.9-3.11 are syntax-compatible but not fully regression-tested in this repository.

pip install porosdata-designer

A runtime install pulls in a single direct dependency: loguru>=0.7.0.

For development in this repository:

pip install -e ".[dev]"

Quick Start

--input_dir must point to a directory tree that contains MinerU outputs; the tree is searched recursively for *_content_list.json files. In this repository the conventional input root is data/Processed Database.
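
Before running the pipeline it can help to confirm that the input root actually contains MinerU outputs. The following is a minimal sketch using only the standard library; find_content_lists is a hypothetical helper, not part of porosdata-designer, and it only approximates the recursive discovery described above.

```python
from pathlib import Path
from typing import List


def find_content_lists(input_dir: str) -> List[Path]:
    """Return every *_content_list.json under input_dir, sorted for stable output.

    Hypothetical helper for a pre-flight check; the package's own discovery
    logic may differ.
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"input_dir is not a directory: {root}")
    # rglob walks the whole tree, so nested MinerU output folders are included.
    return sorted(p for p in root.rglob("*_content_list.json") if p.is_file())


if __name__ == "__main__":
    try:
        matches = find_content_lists("data/Processed Database")
    except NotADirectoryError as exc:
        print(exc)
    else:
        print(f"{len(matches)} content list file(s) found")
```

An empty result usually means --input_dir points at the wrong level of the tree.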

Recommended in this repo (full pipeline):

./scripts/run_designeddataset.sh
# or explicit paths:
./scripts/run_designeddataset.sh "data/Processed Database" "data/Designed Database" logs

The script exports PYTHONPATH and runs python -m src.porosdata_designer.cli run all with the same defaults.

After editable install:

porosdata-designer run all \
  --input_dir "data/Processed Database" \
  --output_dir "data/Designed Database" \
  --log_dir logs

Module mode (equivalent to the installed console command):

python -m porosdata_designer run all \
  --input_dir "data/Processed Database" \
  --output_dir "data/Designed Database" \
  --log_dir logs

Unpacked source without install (equivalent to the shell script):

export PYTHONPATH="${PWD}:${PWD}/src${PYTHONPATH:+:${PYTHONPATH}}"
python -m src.porosdata_designer.cli run all \
  --input_dir "data/Processed Database" \
  --output_dir "data/Designed Database" \
  --log_dir logs

If you omit --output_dir, the default is data/Designed Database under the project root (see DEFAULT_DESIGNED_OUTPUT_DIR_NAME in src/porosdata_designer/runtime/config.py).

Stage Commands

Run text structuring only:

porosdata-designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"

Run multimodal extraction only (in this repository it reads the same input tree as the text stage):

porosdata-designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"

Outputs

Each document is written to {output_root}/{doc_id}/:

path/to/output_root/
└── {doc_id}/
    ├── {doc_id}.content.json
    ├── {doc_id}.content.txt
    ├── {doc_id}.structure.json
    ├── {doc_id}.assets.index.json
    └── images/
        ├── fig_1.jpg
        └── fig_1.md
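
The layout above can be spot-checked with a few lines of standard-library Python. This is a sketch, not a replacement for the validation commands: missing_outputs is a hypothetical helper, and it assumes that a fully processed document has all four files plus an images/ directory (a run of only one stage may legitimately produce fewer).

```python
from pathlib import Path
from typing import List

# Suffixes taken from the layout above; each file is named {doc_id}<suffix>.
EXPECTED_SUFFIXES = (
    ".content.json",
    ".content.txt",
    ".structure.json",
    ".assets.index.json",
)


def missing_outputs(doc_dir: str) -> List[str]:
    """List the expected entries that are absent from one {doc_id}/ directory."""
    doc_path = Path(doc_dir)
    doc_id = doc_path.name  # the directory name doubles as the file prefix
    missing = [
        f"{doc_id}{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (doc_path / f"{doc_id}{suffix}").is_file()
    ]
    if not (doc_path / "images").is_dir():
        missing.append("images/")
    return missing
```

Calling missing_outputs("path/to/output_root/some_doc_id") returns an empty list when the directory matches the layout.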

Validation

Audit outputs (default root is data/Designed Database when --root_dir is omitted):

porosdata-designer audit structured --root_dir "path/to/output_root"

Validate *.content.json files:

porosdata-designer validate structured --output_dir "path/to/output_root"

Validate multimodal indexes:

porosdata-designer validate multimodal --output_dir "path/to/output_root"

Run final acceptance validation:

porosdata-designer validate acceptance --output_dir "path/to/output_root"
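
To run the three validate commands in sequence from a script, something like the following may work. It is a sketch under two assumptions that the text above does not confirm: that porosdata-designer is on PATH, and that each command exits with a non-zero status when validation fails. The function names are hypothetical.

```python
import subprocess
from typing import List


def validation_commands(output_root: str) -> List[List[str]]:
    """Build the argv lists for the three validate subcommands shown above."""
    return [
        ["porosdata-designer", "validate", stage, "--output_dir", output_root]
        for stage in ("structured", "multimodal", "acceptance")
    ]


def run_validations(output_root: str) -> bool:
    """Run each validator in order and stop at the first failure.

    Assumption: a non-zero exit code signals a failed validation.
    """
    for argv in validation_commands(output_root):
        result = subprocess.run(argv, check=False)
        if result.returncode != 0:
            print("validation failed:", " ".join(argv))
            return False
    return True
```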

Python Usage

You can also use the package directly in Python:

from porosdata_designer import DataMiningMapper, MultimodalInterleaver, TextAggregator

aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()

A more complete text-side example:

from porosdata_designer import DataMiningMapper, TextAggregator

content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]

aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)

mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})

print(structured_text)
print(datamining_view.pure_text_stream)
print(datamining_view.structured_json["sections"])

Expected outcome:

  • structured_text contains Poros tags such as <poros_doc>, <poros_section_*>, and <poros_paragraph>.
  • pure_text_stream removes the structure tags while keeping readable text.
  • structured_json exposes mined fields such as sections, formulas, chemical_formulas, and asset_refs.
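
To make the relationship between the tagged text and pure_text_stream concrete, here is an illustrative sketch of tag stripping. It is not the library's implementation: it assumes Poros tags are angle-bracket tags whose names start with poros_ (with <poros_section_*> standing for names such as poros_section_1), and strip_poros_tags is a hypothetical function.

```python
import re

# Assumption: every Poros tag looks like <poros_name ...> or </poros_name>.
_POROS_TAG = re.compile(r"</?poros_[A-Za-z0-9_]+[^>]*>")


def strip_poros_tags(tagged_text: str) -> str:
    """Drop Poros tags and return the remaining text, one block per line."""
    # Replace each tag with a newline so adjacent blocks do not run together.
    without_tags = _POROS_TAG.sub("\n", tagged_text)
    lines = (line.strip() for line in without_tags.splitlines())
    # Discard the empty lines left behind by consecutive tags.
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = (
        "<poros_doc><poros_section_1>Abstract</poros_section_1>"
        "<poros_paragraph>This work studies a Cu-Zr metallic glass system."
        "</poros_paragraph></poros_doc>"
    )
    print(strip_poros_tags(sample))
```

In practice, prefer datamining_view.pure_text_stream, which comes from the package itself.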
