Skip to main content

Python bindings for sdsconv — SDS ↔ MHLW standard JSON converter

Project description

sdsconv

Python-first, Rust-powered toolkit for converting Safety Data Sheets to Japan MHLW standard JSON — with schema validation, GHS/CAS checks, and corpus-scale quality evaluation.

日本語 | 中文


Install

pip install sdsconv                   # Python bindings
pip install "sdsconv[analysis]"       # + causasv quality analysis
cargo install sdsconv                 # CLI / GUI binary

Quick Start — Python

import sdsconv

# Extract raw text (no LLM)
text = sdsconv.extract_text("sample.pdf")

# Convert from URL
data, report = sdsconv.to_json_url_with_report(
    "https://example.com/sds.pdf", lang="ja",
)

# Convert SDS document → MHLW standard JSON
data, report = sdsconv.to_json_with_report(
    "sample.pdf",
    lang="ja",
    strict_mhlw=True,
)

# Validate and get structured findings
findings = sdsconv.validate(data, strict_mhlw=True)

print(f"Sections populated: {len(report['populated_sections'])}")
print(f"Findings: {len(findings)} ({sum(1 for f in findings if f['level']=='HIGH')} HIGH)")

# Save MHLW JSON
sdsconv.write_json(data, "output.json")

Corpus-scale evaluation (no manual review needed):

from sdsconv.eval import eval_corpus

df = eval_corpus(
    input_dir="data/sds_raw",
    output_dir="runs/eval_001",
    jobs=8,
)
print(df[["filename", "overall_score", "grade", "high_count"]].head(20))

Examples

MHLW official sample SDS — allyl chloride (塩化アリル):

export ANTHROPIC_API_KEY=sk-ant-...
python examples/mhlw_allyl_chloride/convert.py

See examples/mhlw_allyl_chloride/ for expected.json, expected_report.json, and source attribution.


Why sdsconv

  • MHLW-native: Converts directly to the Japanese Ministry of Health, Labour and Welfare SDS data exchange format v1.0 (SDS_Schema_v1.0.json), validated against the official schema.
  • Evidence-based extraction: Uses LLM to map free-form SDS text to ~200 nested schema fields. Source-text cross-checks detect hallucinations at the field level.
  • Corpus-scale quality evaluation: eval_corpus processes hundreds of SDS documents and outputs per-rule failure counts, section scores, and causasv_features.csv for root-cause analysis — without any human review.
  • No lock-in: Supports Anthropic Claude, OpenAI GPT, Google Gemini, Mistral, Groq, Cohere, and any OpenAI-compatible local endpoint. Bring your own model.
  • Rust core: Extraction, schema validation, GHS/CAS checks, and DOCX/HTML generation run in native code. Thin Python bindings on top.

MHLW Compliance

sdsconv targets the MHLW SDS data exchange format v1.0 published 2025-03-31.

Rule Behaviour
Schema validation Validates against SDS_Schema_v1.0.json
Empty-field removal Removes "", null, [], {} per §3.3
AdditionalInfo Content outside the official schema is written to AdditionalInfo.FullText
--strict-mhlw Exits 1 (CLI) / raises ValueError (Python) if any HIGH or CRIT finding
CRIT/HIGH/MED findings Structured validation report with rule ID, severity, path, message

Validation rules include: GHS H/P-code validity (GHS Rev.10), CAS format and check-digit, Section 2 GHS completeness (H-codes ↔ pictograms ↔ signal word), Section 3 component row alignment (name/CAS/concentration), UN number completeness, concentration range bounds, duplicate code detection, and more.

Quality baseline (30-file random sample, seed=42):

CRIT=0 · avg score 89.6 · top issues: S2-HAZARD-NO-PICTOGRAM, S15-ZHCN-NO-GB, S14-NO-SHIPPING-NAME

Full rule catalogue → docs/quality-check.md


Corpus Evaluation

Run without human review:

from sdsconv.eval import eval_corpus

df = eval_corpus("data/sds_raw", "runs/eval_001", jobs=8)

Outputs per file:

File Contents
generated/<stem>.json MHLW standard JSON
reports/<stem>.json ConversionReport (language, populated sections, warnings)
findings/<stem>.json Structured validation findings
summary.csv Per-file scores and grades
failures_by_rule.csv Rule frequency and affected file counts

Root-cause analysis with causasv:

from sdsconv.causasv_bridge import print_ranking
print_ranking("runs/eval_001/causasv_features.csv")

CLI

# PDF/DOCX/XLSX/HTML/URL → MHLW JSON
sdsconv to-json --input input.pdf --output output.json --lang ja

# With correction pass and PubChem enrichment
sdsconv to-json --input input.pdf --output output.json --correct --enrich

# JSON → Word document (16 JIS Z 7253 sections)
sdsconv to-docx --input output.json --output result.docx --lang ja

# JSON → HTML (printable, A4, inline CSS)
sdsconv to-html --input output.json --output result.html --lang ja

# Validate with strict MHLW mode
sdsconv validate --input output.json --strict-mhlw

# Batch: process a directory
sdsconv to-json --input-dir data/ --output-dir out/ --jobs 8

# Corpus evaluation
sdsconv eval-corpus --input-dir data/sds_raw --output-dir runs/eval_001 --jobs 8

Full CLI reference → sdsconv/README.md


REST API

# Start server (binds to 127.0.0.1:3000 by default)
SDS_SERVER_TOKEN=secret sdsconv-server

# Convert a PDF
curl -X POST http://localhost:3000/api/to-json \
  -H "Authorization: Bearer secret" \
  -F "file=@input.pdf"

Endpoints: POST /api/to-json · POST /api/to-docx · POST /api/to-html · POST /api/validate · GET /api/health


GUI

Run sdsconv without arguments to open the graphical interface:

sdsconv

Five tabs: Convert · Generate · Validate · Extract Text · Settings

Download the desktop app: macOS · Windows · brew install --cask sdsconv


Supported Inputs, Languages, and Backends

Input formats: PDF (text, CID/Shift-JIS, scanned) · DOCX · XLSX · TXT · HTML · URL

Source languages: ja (JIS Z 7253) · en (GHS/OSHA HazCom) · zh-cn (GB/T 16483) · zh-tw (CNS 15030)

LLM backends: Anthropic Claude · OpenAI GPT · Google Gemini · Mistral · Groq · Cohere · Local (any OpenAI-compatible endpoint)


For Developers

Rust library:

[dependencies]
sdsconv-core = "0.3"

See sdsconv_core/README.md for the Rust API.

Crates: sdsconv · sdsconv-core

Python package: sdsconv on PyPI — pip install sdsconv


Security & Privacy

  • Cloud LLM caution: When using a cloud LLM backend, SDS document text is sent to the API provider. Avoid sending confidential or trade-secret SDS documents to cloud APIs.
  • Local operation: Use --backend local with any OpenAI-compatible endpoint (e.g. Ollama, LM Studio) for fully offline operation. No data leaves your machine.
  • Raw SDS corpus: Add corpus/raw/ and data/sds_raw/ to .gitignore. Only corpus/manifest.jsonl (URLs + sha256 hashes) is safe to commit.
  • REST server: Bearer token authentication with timing-safe comparison, SSRF protection (full IPv6 coverage), redirect-disabled HTTP client, 50 MB upload cap.

Comparison

docs/comparison.md


References


License

MIT OR Apache-2.0 — at your option.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sdsconv-0.1.7.tar.gz (179.2 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sdsconv-0.1.7-cp39-abi3-win_amd64.whl (5.4 MB view details)

Uploaded CPython 3.9+Windows x86-64

sdsconv-0.1.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.6 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

sdsconv-0.1.7-cp39-abi3-macosx_11_0_arm64.whl (5.0 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

File details

Details for the file sdsconv-0.1.7.tar.gz.

File metadata

  • Download URL: sdsconv-0.1.7.tar.gz
  • Upload date:
  • Size: 179.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sdsconv-0.1.7.tar.gz
Algorithm Hash digest
SHA256 2f4ea7ce06756452566aea9c2aafe31da879c588d6714f0c4d7276195fb8b0ca
MD5 a594c4388584dbdc6513616ff99dae21
BLAKE2b-256 fc4d4e8f8efdbeb8622e2d00e8926e527a5f6c0a4824e989e006774cdf5029e6

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdsconv-0.1.7.tar.gz:

Publisher: python-wheels.yml on kent-tokyo/sdsconv

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sdsconv-0.1.7-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: sdsconv-0.1.7-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 5.4 MB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sdsconv-0.1.7-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 e2c8dac40d39e25efa7e5e3d00d1d04bb573c4eef717b710e30b565b11ee435f
MD5 cc6ef01aad99fb54cf03d9eff2356dae
BLAKE2b-256 8715a7558680b99d6d461eba53ed94b716f6dcd48c0d58f7950d28b0a66b6900

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdsconv-0.1.7-cp39-abi3-win_amd64.whl:

Publisher: python-wheels.yml on kent-tokyo/sdsconv

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sdsconv-0.1.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for sdsconv-0.1.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b0916e472f3dbc2915822dc7b4105ffc6df2ec3a7df043acca375311ad149e62
MD5 28a10eda794590e49c7ef9fe9cf7831a
BLAKE2b-256 b20b4969f60be09cb385d3601d43aca098cb80639e365cbcfad80f0e501953ad

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdsconv-0.1.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: python-wheels.yml on kent-tokyo/sdsconv

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sdsconv-0.1.7-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for sdsconv-0.1.7-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d772c4a6b2bf26936c796a2b1c1ee0000415be4e9d87ce66600db04324bd68db
MD5 354ef6d529e30ada4f547fc3c40ce8e4
BLAKE2b-256 a6fc39d55e32b69a451952d53f88b45a8ddecf50c92fa6ba136aaca4aaec1313

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdsconv-0.1.7-cp39-abi3-macosx_11_0_arm64.whl:

Publisher: python-wheels.yml on kent-tokyo/sdsconv

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page