Python bindings for sdsconv — SDS ↔ MHLW standard JSON converter

sdsconv

Python-first, Rust-powered toolkit for converting Safety Data Sheets to Japan MHLW standard JSON — with schema validation, GHS/CAS checks, and corpus-scale quality evaluation.


Install

pip install sdsconv                   # Python bindings
pip install "sdsconv[analysis]"       # + causasv quality analysis
cargo install sdsconv                 # CLI / GUI binary

Quick Start — Python

import sdsconv

# Extract raw text (no LLM)
text = sdsconv.extract_text("sample.pdf")

# Convert from URL
data, report = sdsconv.to_json_url_with_report(
    "https://example.com/sds.pdf", lang="ja",
)

# Convert SDS document → MHLW standard JSON
data, report = sdsconv.to_json_with_report(
    "sample.pdf",
    lang="ja",
    strict_mhlw=True,
)

# Validate and get structured findings
findings = sdsconv.validate(data, strict_mhlw=True)

print(f"Sections populated: {len(report['populated_sections'])}")
print(f"Findings: {len(findings)} ({sum(1 for f in findings if f['level']=='HIGH')} HIGH)")

# Save MHLW JSON
sdsconv.write_json(data, "output.json")
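
The findings list can be summarized before deciding whether to accept a conversion. The sketch below assumes only what the snippet above shows, namely that each finding is a dict with a 'level' key. The helper names count_by_level and has_blocking, the 'message' key, and the sample findings are hypothetical and not part of sdsconv.

```python
from collections import Counter

# Hypothetical findings shaped like the output of sdsconv.validate().
# Only the "level" key comes from the Quick Start; the rest is illustrative.
sample_findings = [
    {"level": "HIGH", "message": "example high-severity finding"},
    {"level": "MED", "message": "example medium-severity finding"},
    {"level": "MED", "message": "another medium-severity finding"},
]


def count_by_level(findings):
    """Return a Counter mapping severity level to number of findings."""
    return Counter(f["level"] for f in findings)


def has_blocking(findings, blocking=("CRIT", "HIGH")):
    """True if any finding has a severity listed in `blocking`."""
    return any(f["level"] in blocking for f in findings)


counts = count_by_level(sample_findings)
print(dict(counts))
print("blocking:", has_blocking(sample_findings))
```

A check like has_blocking mirrors the HIGH/CRIT condition described for strict mode, but it runs on findings you already hold rather than raising inside the library.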

Corpus-scale evaluation (no manual review needed):

from sdsconv.eval import eval_corpus

df = eval_corpus(
    input_dir="data/sds_raw",
    output_dir="runs/eval_001",
    jobs=8,
)
print(df[["filename", "overall_score", "grade", "high_count"]].head(20))

Examples

MHLW official sample SDS — allyl chloride (塩化アリル):

export ANTHROPIC_API_KEY=sk-ant-...
python examples/mhlw_allyl_chloride/convert.py

See examples/mhlw_allyl_chloride/ for expected.json, expected_report.json, and source attribution.


Why sdsconv

  • MHLW-native: Converts directly to the Japanese Ministry of Health, Labour and Welfare SDS data exchange format v1.0 (SDS_Schema_v1.0.json), validated against the official schema.
  • Evidence-based extraction: Uses an LLM to map free-form SDS text to ~200 nested schema fields. Source-text cross-checks detect hallucinations at the field level.
  • Corpus-scale quality evaluation: eval_corpus processes hundreds of SDS documents and outputs per-rule failure counts, section scores, and causasv_features.csv for root-cause analysis — without any human review.
  • No lock-in: Supports Anthropic Claude, OpenAI GPT, Google Gemini, Mistral, Groq, Cohere, and any OpenAI-compatible local endpoint. Bring your own model.
  • Rust core: Extraction, schema validation, GHS/CAS checks, and DOCX/HTML generation run in native code. Thin Python bindings on top.

MHLW Compliance

sdsconv targets the MHLW SDS data exchange format v1.0 published 2025-03-31.

| Rule | Behaviour |
|---|---|
| Schema validation | Validates against SDS_Schema_v1.0.json |
| Empty-field removal | Removes "", null, [], {} per §3.3 |
| AdditionalInfo | Content outside the official schema is written to AdditionalInfo.FullText |
| --strict-mhlw | Exits 1 (CLI) / raises ValueError (Python) if any HIGH or CRIT finding |
| CRIT/HIGH/MED findings | Structured validation report with rule ID, severity, path, message |
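
To illustrate the empty-field removal rule, here is a minimal sketch of recursive pruning in plain Python. It is not sdsconv's implementation (which runs in the Rust core). The function names and sample keys are hypothetical, and dropping containers that become empty after pruning is an assumption about how the rule is applied.

```python
def _is_empty(value):
    # Only "", None, [] and {} count as empty; 0 and False are real values.
    return value is None or value == "" or value == [] or value == {}


def prune_empty(value):
    """Recursively drop "", None, [] and {} from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {k: prune_empty(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if not _is_empty(v)}
    if isinstance(value, list):
        cleaned = [prune_empty(v) for v in value]
        return [v for v in cleaned if not _is_empty(v)]
    return value


# Hypothetical document fragment; the key names are illustrative only.
doc = {
    "ProductName": "Allyl chloride",
    "Synonyms": [],
    "Supplier": {"Fax": "", "Phone": None},
    "Flags": [0, "", {}],
}
pruned = prune_empty(doc)
print(pruned)
```

Children are pruned before their parent is tested, so a dict holding only empty values disappears along with them.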

Validation rules include: GHS H/P-code validity (GHS Rev.10), CAS format and check-digit, Section 2 GHS completeness (H-codes ↔ pictograms ↔ signal word), Section 3 component row alignment (name/CAS/concentration), UN number completeness, concentration range bounds, duplicate code detection, and more.
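
As an illustration of the CAS format and check-digit rule, the sketch below applies the standard CAS Registry Number checksum: the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit. The function name is_valid_cas is hypothetical, and sdsconv's own check lives in the Rust core and may differ in detail.

```python
import re

# 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit (e.g. 107-05-1).
_CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas: str) -> bool:
    """Check CAS number format and verify its check digit."""
    match = _CAS_RE.fullmatch(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


print(is_valid_cas("107-05-1"))   # allyl chloride
print(is_valid_cas("107-05-2"))   # same body, wrong check digit
```

For 107-05-1 the weighted sum is 5×1 + 0×2 + 7×3 + 0×4 + 1×5 = 31, and 31 mod 10 = 1 matches the final digit.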

Quality baseline (30-file random sample, seed=42):

CRIT=0 · avg score 89.6 · top issues: S2-HAZARD-NO-PICTOGRAM, S15-ZHCN-NO-GB, S14-NO-SHIPPING-NAME

Full rule catalogue → docs/quality-check.md


Corpus Evaluation

Run without human review:

from sdsconv.eval import eval_corpus

df = eval_corpus("data/sds_raw", "runs/eval_001", jobs=8)

Outputs written to the output directory:

| File | Contents |
|---|---|
| generated/<stem>.json | MHLW standard JSON |
| reports/<stem>.json | ConversionReport (language, populated sections, warnings) |
| findings/<stem>.json | Structured validation findings |
| summary.csv | Per-file scores and grades |
| failures_by_rule.csv | Rule frequency and affected file counts |
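
A run's summary.csv can be post-processed with the standard library alone. The sketch below is hedged: it assumes summary.csv carries the same columns as the DataFrame shown in the Quick Start (filename, overall_score, grade, high_count), which the source does not confirm. The sample rows, the threshold, and the function name flag_low_quality are invented for illustration.

```python
import csv
import io

# Hypothetical stand-in for runs/eval_001/summary.csv. Column names are
# borrowed from the Quick Start DataFrame; the rows are invented.
SAMPLE_SUMMARY = """filename,overall_score,grade,high_count
a.pdf,95.0,A,0
b.pdf,72.5,C,3
c.pdf,88.0,B,1
"""


def flag_low_quality(csv_text, min_score=80.0):
    """Return filenames scoring below min_score or having any HIGH finding."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row["filename"]
        for row in rows
        if float(row["overall_score"]) < min_score or int(row["high_count"]) > 0
    ]


flagged = flag_low_quality(SAMPLE_SUMMARY)
print(flagged)
```

For a real run, replace the sample string with the contents of the summary.csv in your output directory, after checking its actual header row.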

Root-cause analysis with causasv:

from sdsconv.causasv_bridge import print_ranking
print_ranking("runs/eval_001/causasv_features.csv")

CLI

# PDF/DOCX/XLSX/HTML/URL → MHLW JSON
sdsconv to-json --input input.pdf --output output.json --lang ja

# With correction pass and PubChem enrichment
sdsconv to-json --input input.pdf --output output.json --correct --enrich

# JSON → Word document (16 JIS Z 7253 sections)
sdsconv to-docx --input output.json --output result.docx --lang ja

# JSON → HTML (printable, A4, inline CSS)
sdsconv to-html --input output.json --output result.html --lang ja

# Validate with strict MHLW mode
sdsconv validate --input output.json --strict-mhlw

# Batch: process a directory
sdsconv to-json --input-dir data/ --output-dir out/ --jobs 8

# Corpus evaluation
sdsconv eval-corpus --input-dir data/sds_raw --output-dir runs/eval_001 --jobs 8

Full CLI reference → sdsconv/README.md


REST API

# Start server (binds to 127.0.0.1:3000 by default)
SDS_SERVER_TOKEN=secret sdsconv-server

# Convert a PDF
curl -X POST http://localhost:3000/api/to-json \
  -H "Authorization: Bearer secret" \
  -F "file=@input.pdf"
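
The same request can be prepared from Python with only the standard library. The sketch below builds, but does not send, a multipart POST equivalent to the curl call above. The endpoint, the "file" field name, and the bearer token come from that example. The function name build_to_json_request is hypothetical, and the response format is not documented here, so it is left unparsed.

```python
import urllib.request
import uuid


def build_to_json_request(pdf_bytes, filename, token,
                          base_url="http://localhost:3000"):
    """Build (but do not send) a multipart POST to /api/to-json."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/api/to-json",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


# Placeholder bytes stand in for a real PDF read from disk.
req = build_to_json_request(b"%PDF-1.4 placeholder", "input.pdf", "secret")
print(req.get_method(), req.full_url)
# To send once sdsconv-server is running:
#   raw = urllib.request.urlopen(req).read()
```

Building the request separately from sending it keeps the example testable without a running server.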

Endpoints: POST /api/to-json · POST /api/to-docx · POST /api/to-html · POST /api/validate · GET /api/health


GUI

Run sdsconv without arguments to open the graphical interface:

sdsconv

Five tabs: Convert · Generate · Validate · Extract Text · Settings

Download the desktop app: macOS · Windows · brew install --cask sdsconv


Supported Inputs, Languages, and Backends

Input formats: PDF (text, CID/Shift-JIS, scanned) · DOCX · XLSX · TXT · HTML · URL

Source languages: ja (JIS Z 7253) · en (GHS/OSHA HazCom) · zh-cn (GB/T 16483) · zh-tw (CNS 15030)

LLM backends: Anthropic Claude · OpenAI GPT · Google Gemini · Mistral · Groq · Cohere · Local (any OpenAI-compatible endpoint)


For Developers

Rust library:

[dependencies]
sdsconv-core = "0.3"

See sdsconv_core/README.md for the Rust API.

Crates: sdsconv · sdsconv-core

Python package: sdsconv on PyPI — pip install sdsconv


Security & Privacy

  • Cloud LLM caution: When using a cloud LLM backend, SDS document text is sent to the API provider. Avoid sending confidential or trade-secret SDS documents to cloud APIs.
  • Local operation: Use --backend local with any OpenAI-compatible endpoint (e.g. Ollama, LM Studio) for fully offline operation. No data leaves your machine.
  • Raw SDS corpus: Add corpus/raw/ and data/sds_raw/ to .gitignore. Only corpus/manifest.jsonl (URLs + sha256 hashes) is safe to commit.
  • REST server: Bearer token authentication with timing-safe comparison, SSRF protection (full IPv6 coverage), redirect-disabled HTTP client, 50 MB upload cap.
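
Two of the REST server protections listed above can be sketched in Python for readers unfamiliar with them. This is not the server's code (which is Rust). The function names are hypothetical, and the exact set of address ranges the server blocks is an assumption.

```python
import hmac
import ipaddress


def token_matches(presented: str, expected: str) -> bool:
    """Compare bearer tokens in constant time to avoid timing side channels."""
    return hmac.compare_digest(presented.encode(), expected.encode())


def is_blocked_address(host_ip: str) -> bool:
    """True if the IP is loopback, private, link-local, or otherwise non-public."""
    addr = ipaddress.ip_address(host_ip)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so it cannot slip past.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


print(token_matches("secret", "secret"))
print(is_blocked_address("169.254.169.254"))  # cloud metadata address
print(is_blocked_address("8.8.8.8"))
```

A real SSRF guard would also resolve hostnames and check every resolved address, and it would refuse redirects so that a public URL cannot bounce to an internal one.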

Comparison

docs/comparison.md


License

MIT OR Apache-2.0 — at your option.
