Python bindings for sdsconv — SDS ↔ MHLW standard JSON converter

sdsconv

Python-first, Rust-powered toolkit for converting Safety Data Sheets to Japan MHLW standard JSON — with schema validation, GHS/CAS checks, and corpus-scale quality evaluation.

日本語 | 中文


Install

pip install sdsconv                   # Python bindings
pip install "sdsconv[analysis]"       # + causasv quality analysis
cargo install sdsconv                 # CLI / GUI binary

Quick Start — Python

import sdsconv

# Extract raw text (no LLM)
text = sdsconv.extract_text("sample.pdf")

# Convert from URL
data, report = sdsconv.to_json_url_with_report(
    "https://example.com/sds.pdf", lang="ja",
)

# Convert SDS document → MHLW standard JSON
data, report = sdsconv.to_json_with_report(
    "sample.pdf",
    lang="ja",
    strict_mhlw=True,
)

# Validate and get structured findings
findings = sdsconv.validate(data, strict_mhlw=True)

high_count = sum(1 for f in findings if f["level"] == "HIGH")
print(f"Sections populated: {len(report['populated_sections'])}")
print(f"Findings: {len(findings)} ({high_count} HIGH)")

# Save MHLW JSON
sdsconv.write_json(data, "output.json")
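
The findings list can be tallied by severity before deciding whether a file needs attention. The sketch below runs on a hand-written sample rather than real sdsconv output: only the `level` key comes from the snippet above, while the `rule` and `message` keys, the severities assigned to the sample rule IDs, and both helper names are assumptions made for illustration.

```python
from collections import Counter


def summarize_findings(findings):
    """Count findings per severity level (for example CRIT, HIGH, MED)."""
    return Counter(f["level"] for f in findings)


def blocks_strict_mode(findings):
    """Return True if any finding is HIGH or CRIT."""
    return any(f["level"] in {"HIGH", "CRIT"} for f in findings)


# Made-up sample data; every key except "level" is a hypothetical name,
# and the severities attached to these rule IDs are invented.
sample_findings = [
    {"level": "HIGH", "rule": "S2-HAZARD-NO-PICTOGRAM", "message": "sample"},
    {"level": "MED", "rule": "S14-NO-SHIPPING-NAME", "message": "sample"},
    {"level": "MED", "rule": "S15-ZHCN-NO-GB", "message": "sample"},
]

counts = summarize_findings(sample_findings)
print(dict(counts))
print("strict mode would reject:", blocks_strict_mode(sample_findings))
```

Keeping the tally separate from the strict-mode check makes it easy to log severity counts even for files that pass.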

Corpus-scale evaluation (no manual review needed):

from sdsconv.eval import eval_corpus

df = eval_corpus(
    input_dir="data/sds_raw",
    output_dir="runs/eval_001",
    jobs=8,
)
print(df[["filename", "overall_score", "grade", "high_count"]].head(20))

Examples

MHLW official sample SDS — allyl chloride (塩化アリル):

export ANTHROPIC_API_KEY=sk-ant-...
python examples/mhlw_allyl_chloride/convert.py

See examples/mhlw_allyl_chloride/ for expected.json, expected_report.json, and source attribution.


Why sdsconv

  • MHLW-native: Converts directly to the Japanese Ministry of Health, Labour and Welfare SDS data exchange format v1.0 (SDS_Schema_v1.0.json), validated against the official schema.
  • Evidence-based extraction: Uses an LLM to map free-form SDS text to ~200 nested schema fields. Source-text cross-checks detect hallucinations at the field level.
  • Corpus-scale quality evaluation: eval_corpus processes hundreds of SDS documents and outputs per-rule failure counts, section scores, and causasv_features.csv for root-cause analysis — without any human review.
  • No lock-in: Supports Anthropic Claude, OpenAI GPT, Google Gemini, Mistral, Groq, Cohere, and any OpenAI-compatible local endpoint. Bring your own model.
  • Rust core: Extraction, schema validation, GHS/CAS checks, and DOCX/HTML generation run in native code. Thin Python bindings on top.
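
To make the idea of a field-level source cross-check concrete, here is a minimal sketch: every string leaf in the extracted JSON must appear in the normalized source text, otherwise its path is reported. This is an illustration of the general technique only, not sdsconv's actual algorithm; the function names and the field names in the sample (`Name`, `Components`, `CAS`) are hypothetical and are not taken from the MHLW schema.

```python
import unicodedata


def normalize(text):
    """Apply NFKC, lowercase, and collapse whitespace so superficial differences do not matter."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def iter_leaves(node, path=""):
    """Yield (path, value) for every string leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}[{index}]")
    elif isinstance(node, str):
        yield path, node


def unsupported_fields(extracted, source_text):
    """Return the paths of string values that cannot be found in the source text."""
    haystack = normalize(source_text)
    return [
        path
        for path, value in iter_leaves(extracted)
        if normalize(value) not in haystack
    ]


source = "Product name: Allyl chloride   CAS No. 107-05-1"
extracted = {
    "Name": "allyl chloride",
    "Components": [{"CAS": "107-05-1"}, {"CAS": "7732-18-5"}],
}
print(unsupported_fields(extracted, source))
```

A plain substring test is deliberately strict; a real checker would likely need fuzzier matching for reformatted numbers and translated terms.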

MHLW Compliance

sdsconv targets the MHLW SDS data exchange format v1.0 published 2025-03-31.

| Rule | Behaviour |
|---|---|
| Schema validation | Validates against SDS_Schema_v1.0.json |
| Empty-field removal | Removes "", null, [], {} per §3.3 |
| AdditionalInfo | Content outside the official schema is written to AdditionalInfo.FullText |
| --strict-mhlw | Exits 1 (CLI) / raises ValueError (Python) if any HIGH or CRIT finding |
| CRIT/HIGH/MED findings | Structured validation report with rule ID, severity, path, message |
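
The empty-field removal rule can be pictured as a recursive prune over the JSON tree. The sketch below is an assumption-laden illustration, not the library's implementation: `prune_empty` is a hypothetical helper, and it assumes that a container left empty after pruning is itself removed and that `0` and `False` count as real values.

```python
def _is_empty(value):
    """True for None and for zero-length strings, lists, and dicts."""
    return value is None or (isinstance(value, (str, list, dict)) and len(value) == 0)


def prune_empty(value):
    """Recursively drop "", None, [] and {} from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not _is_empty(item)}
    if isinstance(value, list):
        cleaned = [prune_empty(item) for item in value]
        return [item for item in cleaned if not _is_empty(item)]
    return value


document = {
    "a": "",
    "b": {"c": None, "d": []},
    "e": [{}, "x"],
    "f": 0,
    "g": False,
}
print(prune_empty(document))
```

Pruning children before testing the parent is what lets a dict such as `b` disappear once all of its members are gone.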

Validation rules include: GHS H/P-code validity (GHS Rev.10), CAS format and check-digit, Section 2 GHS completeness (H-codes ↔ pictograms ↔ signal word), Section 3 component row alignment (name/CAS/concentration), UN number completeness, concentration range bounds, duplicate code detection, and more.
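
As an example of one rule in this family, the CAS check digit follows a simple public algorithm: the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the weighted sum modulo 10 must equal the final digit. The function below is a standalone sketch of that algorithm; `is_valid_cas` is a hypothetical name and is not part of the sdsconv API.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^([0-9]{2,7})-([0-9]{2})-([0-9])$")


def is_valid_cas(cas):
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


for candidate in ("107-05-1", "7732-18-5", "107-05-2", "107051"):
    print(candidate, is_valid_cas(candidate))
```

For allyl chloride (107-05-1) the weighted sum is 5·1 + 0·2 + 7·3 + 0·4 + 1·5 = 31, and 31 mod 10 = 1 matches the check digit.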

Quality baseline (30-file random sample, seed=42):

CRIT=0 · avg score 89.6 · top issues: S2-HAZARD-NO-PICTOGRAM, S15-ZHCN-NO-GB, S14-NO-SHIPPING-NAME

Full rule catalogue → docs/quality-check.md


Corpus Evaluation

Evaluate an entire directory of SDS files without human review:

from sdsconv.eval import eval_corpus

df = eval_corpus("data/sds_raw", "runs/eval_001", jobs=8)

Outputs (per-file artifacts plus run-level CSV summaries):

| File | Contents |
|---|---|
| generated/<stem>.json | MHLW standard JSON |
| reports/<stem>.json | ConversionReport (language, populated sections, warnings) |
| findings/<stem>.json | Structured validation findings |
| summary.csv | Per-file scores and grades |
| failures_by_rule.csv | Rule frequency and affected file counts |
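
A run summary can be inspected with the standard library alone. The sketch below assumes that summary.csv carries the same column names as the DataFrame shown in the Quick Start (`filename`, `overall_score`, `grade`, `high_count`); that is an assumption, as are the `summarize` helper and the sample rows, which are invented so the example runs without a real evaluation.

```python
import csv
import io
from statistics import mean


def summarize(csv_file, worst_n=3):
    """Return file count, average score, and the lowest-scoring filenames."""
    rows = list(csv.DictReader(csv_file))
    scores = [float(row["overall_score"]) for row in rows]
    worst = sorted(rows, key=lambda row: float(row["overall_score"]))[:worst_n]
    return {
        "files": len(rows),
        "avg_score": round(mean(scores), 1),
        "worst": [row["filename"] for row in worst],
    }


# Invented sample standing in for runs/eval_001/summary.csv.
sample_csv = (
    "filename,overall_score,grade,high_count\n"
    "a.pdf,92.5,A,0\n"
    "b.pdf,71.0,C,3\n"
    "c.pdf,88.0,B,1\n"
)
print(summarize(io.StringIO(sample_csv), worst_n=2))
```

To read a real run, pass an open file handle instead of the `io.StringIO` object.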

Root-cause analysis with causasv:

from sdsconv.causasv_bridge import print_ranking
print_ranking("runs/eval_001/causasv_features.csv")

CLI

# PDF/DOCX/XLSX/HTML/URL → MHLW JSON
sdsconv to-json --input input.pdf --output output.json --lang ja

# With correction pass and PubChem enrichment
sdsconv to-json --input input.pdf --output output.json --correct --enrich

# JSON → Word document (16 JIS Z 7253 sections)
sdsconv to-docx --input output.json --output result.docx --lang ja

# JSON → HTML (printable, A4, inline CSS)
sdsconv to-html --input output.json --output result.html --lang ja

# Validate with strict MHLW mode
sdsconv validate --input output.json --strict-mhlw

# Batch: process a directory
sdsconv to-json --input-dir data/ --output-dir out/ --jobs 8

# Corpus evaluation
sdsconv eval-corpus --input-dir data/sds_raw --output-dir runs/eval_001 --jobs 8

Full CLI reference → sdsconv/README.md
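
When the CLI is driven from a script, building the argument list explicitly avoids shell-quoting problems. The sketch below only uses the `to-json` flags shown above; the helper names are hypothetical, and passing `--lang` together with `--correct` and `--enrich` in one call is an assumption, since the examples show them separately.

```python
import subprocess


def build_to_json_cmd(input_path, output_path, lang=None, correct=False, enrich=False):
    """Assemble an argv list for `sdsconv to-json` from the documented flags."""
    cmd = ["sdsconv", "to-json", "--input", str(input_path), "--output", str(output_path)]
    if lang is not None:
        cmd += ["--lang", lang]
    if correct:
        cmd.append("--correct")
    if enrich:
        cmd.append("--enrich")
    return cmd


def run_cli(cmd):
    """Run the command and return its exit code (needs the sdsconv binary on PATH)."""
    return subprocess.run(cmd, check=False).returncode


print(build_to_json_cmd("input.pdf", "output.json", lang="ja", correct=True))
```

`run_cli` is not called here so that the example runs without the binary installed.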


REST API

# Start server (binds to 127.0.0.1:3000 by default)
SDS_SERVER_TOKEN=secret sdsconv-server

# Convert a PDF
curl -X POST http://localhost:3000/api/to-json \
  -H "Authorization: Bearer secret" \
  -F "file=@input.pdf"

Endpoints: POST /api/to-json · POST /api/to-docx · POST /api/to-html · POST /api/validate · GET /api/health
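
The curl call above can be mirrored from Python with only the standard library. The sketch below builds the multipart request without sending it; the endpoint, the `file` form field, and the Bearer header come from the curl example, while the helper name, the `application/pdf` part content type, and the absence of any response handling are assumptions (the response format is not documented here).

```python
import urllib.request
import uuid


def build_to_json_request(file_bytes, filename, token, base_url="http://localhost:3000"):
    """Build a multipart/form-data POST for /api/to-json with a single `file` part."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/api/to-json",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_to_json_request(b"%PDF-1.4 placeholder", "input.pdf", "secret")
print(request.get_method(), request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```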


GUI

Run sdsconv without arguments to open the graphical interface:

sdsconv

Five tabs: Convert · Generate · Validate · Extract Text · Settings

Download the desktop app: macOS · Windows · brew install --cask sdsconv


Supported Inputs, Languages, and Backends

Input formats: PDF (text, CID/Shift-JIS, scanned) · DOCX · XLSX · TXT · HTML · URL

Source languages: ja (JIS Z 7253) · en (GHS/OSHA HazCom) · zh-cn (GB/T 16483) · zh-tw (CNS 15030)

LLM backends: Anthropic Claude · OpenAI GPT · Google Gemini · Mistral · Groq · Cohere · Local (any OpenAI-compatible endpoint)


For Developers

Rust library:

[dependencies]
sdsconv-core = "0.3"

See sdsconv_core/README.md for the Rust API.

Crates: sdsconv · sdsconv-core

Python package: sdsconv on PyPI — pip install sdsconv


Security & Privacy

  • Cloud LLM caution: When using a cloud LLM backend, SDS document text is sent to the API provider. Avoid sending confidential or trade-secret SDS documents to cloud APIs.
  • Local operation: Use --backend local with any OpenAI-compatible endpoint (e.g. Ollama, LM Studio) for fully offline operation. No data leaves your machine.
  • Raw SDS corpus: Add corpus/raw/ and data/sds_raw/ to .gitignore. Only corpus/manifest.jsonl (URLs + sha256 hashes) is safe to commit.
  • REST server: Bearer token authentication with timing-safe comparison, SSRF protection (full IPv6 coverage), redirect-disabled HTTP client, 50 MB upload cap.
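
Two of the REST server protections named above, timing-safe token comparison and SSRF address filtering, can be illustrated in a few lines. This is a conceptual Python sketch, not the server's Rust implementation; the function names are hypothetical, and treating every non-global address as blocked is an assumed policy.

```python
import hmac
import ipaddress


def token_matches(presented, expected):
    """Compare tokens in constant time so timing does not leak how many bytes matched."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def is_blocked_address(host_ip):
    """Return True for loopback, private, link-local, and other non-global addresses."""
    ip = ipaddress.ip_address(host_ip)
    # Unwrap IPv4-mapped IPv6 (for example ::ffff:127.0.0.1) before classifying.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return not ip.is_global


for address in ("127.0.0.1", "10.0.0.5", "169.254.169.254", "::1", "::ffff:127.0.0.1", "8.8.8.8"):
    print(address, "blocked" if is_blocked_address(address) else "allowed")
```

A full SSRF guard would also resolve hostnames and check every resolved address, which is why disabling redirects in the HTTP client matters as well.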

Comparison

docs/comparison.md



License

MIT OR Apache-2.0 — at your option.

Download files

Source Distribution

sdsconv-0.1.5.tar.gz (178.0 kB)

Built Distributions

| File | Platform | Size |
|---|---|---|
| sdsconv-0.1.5-cp39-abi3-win_amd64.whl | CPython 3.9+, Windows x86-64 | 5.4 MB |
| sdsconv-0.1.5-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | CPython 3.9+, manylinux: glibc 2.17+ x86-64 | 5.6 MB |
| sdsconv-0.1.5-cp39-abi3-macosx_11_0_arm64.whl | CPython 3.9+, macOS 11.0+ ARM64 | 5.0 MB |
