
Python bindings for sdsconv — SDS ↔ MHLW standard JSON converter


sdsconv

Python-first, Rust-powered toolkit for converting Safety Data Sheets to Japan MHLW standard JSON — with schema validation, GHS/CAS checks, and corpus-scale quality evaluation.



Install

pip install sdsconv                   # Python bindings
pip install "sdsconv[analysis]"       # + causasv quality analysis
cargo install sdsconv                 # CLI / GUI binary

Quick Start — Python

import sdsconv

# Extract raw text (no LLM)
text = sdsconv.extract_text("sample.pdf")

# Convert from URL
data, report = sdsconv.to_json_url_with_report(
    "https://example.com/sds.pdf", lang="ja",
)

# Convert SDS document → MHLW standard JSON
data, report = sdsconv.to_json_with_report(
    "sample.pdf",
    lang="ja",
    strict_mhlw=True,
)

# Validate and get structured findings
findings = sdsconv.validate(data, strict_mhlw=True)

print(f"Sections populated: {len(report['populated_sections'])}")
print(f"Findings: {len(findings)} ({sum(1 for f in findings if f['level']=='HIGH')} HIGH)")

# Save MHLW JSON
sdsconv.write_json(data, "output.json")
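
The findings returned by the validation step can also be tallied by severity. The following is a minimal sketch, assuming each finding is a dict with a 'level' key as in the print statement above. The helper name count_by_level, the 'rule' key, and the sample findings are hypothetical illustrations, not part of the sdsconv API.

```python
from collections import Counter


def count_by_level(findings):
    """Tally findings by their 'level' value (for example CRIT, HIGH, MED)."""
    return Counter(f.get("level", "UNKNOWN") for f in findings)


# Hypothetical findings; the 'rule' key name is an assumption for illustration.
sample = [
    {"level": "HIGH", "rule": "S2-HAZARD-NO-PICTOGRAM"},
    {"level": "MED", "rule": "S14-NO-SHIPPING-NAME"},
    {"level": "HIGH", "rule": "S15-ZHCN-NO-GB"},
]
counts = count_by_level(sample)
print(dict(counts))
```

A Counter returns 0 for severities that never occur, so counts["CRIT"] can be checked without a KeyError.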

Corpus-scale evaluation (no manual review needed):

from sdsconv.eval import eval_corpus

df = eval_corpus(
    input_dir="data/sds_raw",
    output_dir="runs/eval_001",
    jobs=8,
)
print(df[["filename", "overall_score", "grade", "high_count"]].head(20))

Examples

MHLW official sample SDS — allyl chloride (塩化アリル):

export ANTHROPIC_API_KEY=sk-ant-...
python examples/mhlw_allyl_chloride/convert.py

See examples/mhlw_allyl_chloride/ for expected.json, expected_report.json, and source attribution.


Why sdsconv

  • MHLW-native: Converts directly to the Japanese Ministry of Health, Labour and Welfare SDS data exchange format v1.0 (SDS_Schema_v1.0.json), validated against the official schema.
  • Evidence-based extraction: Uses an LLM to map free-form SDS text to ~200 nested schema fields. Source-text cross-checks detect hallucinations at the field level.
  • Corpus-scale quality evaluation: eval_corpus processes hundreds of SDS documents and outputs per-rule failure counts, section scores, and causasv_features.csv for root-cause analysis — without any human review.
  • No lock-in: Supports Anthropic Claude, OpenAI GPT, Google Gemini, Mistral, Groq, Cohere, and any OpenAI-compatible local endpoint. Bring your own model.
  • Rust core: Extraction, schema validation, GHS/CAS checks, and DOCX/HTML generation run in native code. Thin Python bindings on top.

MHLW Compliance

sdsconv targets the MHLW SDS data exchange format v1.0 published 2025-03-31.

| Rule | Behaviour |
|---|---|
| Schema validation | Validates against SDS_Schema_v1.0.json |
| Empty-field removal | Removes "", null, [], {} per §3.3 |
| AdditionalInfo | Content outside the official schema is written to AdditionalInfo.FullText |
| --strict-mhlw | Exits 1 (CLI) / raises ValueError (Python) if any HIGH or CRIT finding |
| CRIT/HIGH/MED findings | Structured validation report with rule ID, severity, path, message |
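
The empty-field removal rule in the table above can be pictured as a recursive prune over the JSON tree. This is a sketch only: the function names and the sample field names are hypothetical, and it assumes that containers left empty after pruning are removed as well, which the table does not state explicitly. sdsconv performs this step in its Rust core.

```python
def _is_empty(value):
    # Only "", None, [] and {} count as empty; 0 and False are kept.
    return value is None or (
        isinstance(value, (str, list, dict)) and len(value) == 0
    )


def prune_empty(value):
    """Recursively drop "", None, [] and {} from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {k: prune_empty(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if not _is_empty(v)}
    if isinstance(value, list):
        cleaned = [prune_empty(v) for v in value]
        return [v for v in cleaned if not _is_empty(v)]
    return value


# Hypothetical document fragment; the keys are not taken from the MHLW schema.
doc = {
    "Name": "allyl chloride",
    "Synonyms": [],
    "Notes": "",
    "Ids": {"CAS": "107-05-1", "EC": None},
    "Nested": {"a": {"b": []}},
    "Count": 0,
}
pruned = prune_empty(doc)
print(pruned)
```

Children are pruned before their parent is tested, so a branch such as "Nested" disappears entirely once everything under it is empty.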

Validation rules include: GHS H/P-code validity (GHS Rev.10), CAS format and check-digit, Section 2 GHS completeness (H-codes ↔ pictograms ↔ signal word), Section 3 component row alignment (name/CAS/concentration), UN number completeness, concentration range bounds, duplicate code detection, and more.
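
As an illustration of one of these rules, the CAS check digit can be verified in a few lines. The sketch below follows the published CAS Registry Number scheme (digits before the check digit are weighted 1, 2, 3, ... from the right, and the sum modulo 10 must equal the check digit). The function name is_valid_cas is hypothetical, and this is not sdsconv's own implementation.

```python
import re

# 2 to 7 digits, then 2 digits, then a single check digit
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight each digit by its position counted from the right, starting at 1.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


for candidate in ("107-05-1", "7732-18-5", "107-05-2", "107051"):
    print(candidate, is_valid_cas(candidate))
```

For allyl chloride (107-05-1) the weighted sum is 5×1 + 0×2 + 7×3 + 0×4 + 1×5 = 31, and 31 mod 10 = 1 matches the final digit.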

Quality baseline (30-file random sample, seed=42):

CRIT=0 · avg score 89.6 · top issues: S2-HAZARD-NO-PICTOGRAM, S15-ZHCN-NO-GB, S14-NO-SHIPPING-NAME

Full rule catalogue → docs/quality-check.md


Corpus Evaluation

Run without human review:

from sdsconv.eval import eval_corpus

df = eval_corpus("data/sds_raw", "runs/eval_001", jobs=8)

Outputs per file:

| File | Contents |
|---|---|
| generated/<stem>.json | MHLW standard JSON |
| reports/<stem>.json | ConversionReport (language, populated sections, warnings) |
| findings/<stem>.json | Structured validation findings |
| summary.csv | Per-file scores and grades |
| failures_by_rule.csv | Rule frequency and affected file counts |
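
Once a run has finished, summary.csv can be inspected with the standard library alone. The sketch below assumes that summary.csv carries the filename, overall_score, and grade columns shown for the DataFrame in the Quick Start; the actual column layout is not documented here, and the helper names are hypothetical.

```python
import csv
from collections import Counter


def grade_distribution(summary_path):
    """Count how many files received each grade."""
    with open(summary_path, newline="", encoding="utf-8") as fh:
        return Counter(row["grade"] for row in csv.DictReader(fh))


def worst_files(summary_path, n=5):
    """Return the n lowest-scoring files as (filename, score) pairs."""
    with open(summary_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    rows.sort(key=lambda row: float(row["overall_score"]))
    return [(row["filename"], float(row["overall_score"])) for row in rows[:n]]
```

For example, grade_distribution("runs/eval_001/summary.csv") would give a quick view of how a run is distributed across grades, and worst_files on the same path lists the documents to inspect first.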

Root-cause analysis with causasv:

from sdsconv.causasv_bridge import print_ranking
print_ranking("runs/eval_001/causasv_features.csv")

CLI

# PDF/DOCX/XLSX/HTML/URL → MHLW JSON
sdsconv to-json --input input.pdf --output output.json --lang ja

# With correction pass and PubChem enrichment
sdsconv to-json --input input.pdf --output output.json --correct --enrich

# JSON → Word document (16 JIS Z 7253 sections)
sdsconv to-docx --input output.json --output result.docx --lang ja

# JSON → HTML (printable, A4, inline CSS)
sdsconv to-html --input output.json --output result.html --lang ja

# Validate with strict MHLW mode
sdsconv validate --input output.json --strict-mhlw

# Batch: process a directory
sdsconv to-json --input-dir data/ --output-dir out/ --jobs 8

# Corpus evaluation
sdsconv eval-corpus --input-dir data/sds_raw --output-dir runs/eval_001 --jobs 8

Full CLI reference → sdsconv/README.md


REST API

# Start server (binds to 127.0.0.1:3000 by default)
SDS_SERVER_TOKEN=secret sdsconv-server

# Convert a PDF
curl -X POST http://localhost:3000/api/to-json \
  -H "Authorization: Bearer secret" \
  -F "file=@input.pdf"

Endpoints: POST /api/to-json · POST /api/to-docx · POST /api/to-html · POST /api/validate · GET /api/health
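
The same endpoints can be reached from Python. The sketch below uses only the standard library to build a request for GET /api/health with the bearer token from the curl example. The helper name build_health_request is hypothetical, and whether the health endpoint actually requires the token is an assumption.

```python
import urllib.request


def build_health_request(base_url="http://localhost:3000", token="secret"):
    """Build (but do not send) a GET /api/health request with a bearer token."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/health",
        headers={"Authorization": "Bearer " + token},
        method="GET",
    )


request = build_health_request()
print(request.get_method(), request.full_url)
```

With the server running, the request can be sent with urllib.request.urlopen(request, timeout=10). The format of the response body is not documented in this README.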


GUI

Run sdsconv without arguments to open the graphical interface:

sdsconv

Five tabs: Convert · Generate · Validate · Extract Text · Settings

Download the desktop app: macOS · Windows · brew install --cask sdsconv


Supported Inputs, Languages, and Backends

Input formats: PDF (text, CID/Shift-JIS, scanned) · DOCX · XLSX · TXT · HTML · URL

Source languages: ja (JIS Z 7253) · en (GHS/OSHA HazCom) · zh-cn (GB/T 16483) · zh-tw (CNS 15030)

LLM backends: Anthropic Claude · OpenAI GPT · Google Gemini · Mistral · Groq · Cohere · Local (any OpenAI-compatible endpoint)


For Developers

Rust library:

[dependencies]
sdsconv-core = "0.3"

See sdsconv_core/README.md for the Rust API.

Crates: sdsconv · sdsconv-core

Python package: sdsconv on PyPI — pip install sdsconv


Security & Privacy

  • Cloud LLM caution: When using a cloud LLM backend, SDS document text is sent to the API provider. Avoid sending confidential or trade-secret SDS documents to cloud APIs.
  • Local operation: Use --backend local with any OpenAI-compatible endpoint (e.g. Ollama, LM Studio) for fully offline operation. No data leaves your machine.
  • Raw SDS corpus: Add corpus/raw/ and data/sds_raw/ to .gitignore. Only corpus/manifest.jsonl (URLs + sha256 hashes) is safe to commit.
  • REST server: Bearer token authentication with timing-safe comparison, SSRF protection (full IPv6 coverage), redirect-disabled HTTP client, 50 MB upload cap.
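
To illustrate what timing-safe comparison means for the bearer token, here is a minimal Python sketch using hmac.compare_digest, whose running time does not depend on where the two values first differ. The sdsconv server itself is written in Rust; the function name token_matches is hypothetical and this is not its actual code.

```python
import hmac


def token_matches(authorization_header: str, expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' value in constant time."""
    prefix = "Bearer "
    if not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    # compare_digest avoids leaking the position of the first mismatch.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected_token.encode("utf-8")
    )


print(token_matches("Bearer secret", "secret"))
print(token_matches("Bearer guess", "secret"))
```

A plain == comparison can return as soon as the first differing character is found, which in principle lets an attacker recover a token piece by piece from response timings.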

Comparison

docs/comparison.md



License

MIT OR Apache-2.0 — at your option.
