Python bindings for sdsconv — SDS ↔ MHLW standard JSON converter

sdsconv

Python-first, Rust-powered toolkit for converting Safety Data Sheets to Japan MHLW standard JSON — with schema validation, GHS/CAS checks, and corpus-scale quality evaluation.

日本語 | 中文


Install

pip install sdsconv                   # Python bindings
pip install "sdsconv[analysis]"       # + causasv quality analysis
cargo install sdsconv                 # CLI / GUI binary

Quick Start — Python

import sdsconv

# Extract raw text (no LLM)
text = sdsconv.extract_text("sample.pdf")

# Convert from URL
data, report = sdsconv.to_json_url_with_report(
    "https://example.com/sds.pdf", lang="ja",
)

# Convert SDS document → MHLW standard JSON
data, report = sdsconv.to_json_with_report(
    "sample.pdf",
    lang="ja",
    strict_mhlw=True,
)

# Validate and get structured findings
findings = sdsconv.validate(data, strict_mhlw=True)

print(f"Sections populated: {len(report['populated_sections'])}")
print(f"Findings: {len(findings)} ({sum(1 for f in findings if f['level']=='HIGH')} HIGH)")

# Save MHLW JSON
sdsconv.write_json(data, "output.json")
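
The findings list can be summarised by severity before deciding whether to keep an output. The following is a minimal sketch, assuming each finding is a dict with a "level" key as in the snippet above. The "rule" key, the helper names, and the sample values are hypothetical stand-ins, not real sdsconv output:

```python
from collections import Counter

# Hypothetical findings shaped like the dicts used in the Quick Start.
# Only the "level" key comes from the example above; "rule" is an assumption.
sample_findings = [
    {"level": "HIGH", "rule": "S2-HAZARD-NO-PICTOGRAM"},
    {"level": "MED", "rule": "S14-NO-SHIPPING-NAME"},
    {"level": "HIGH", "rule": "S2-HAZARD-NO-PICTOGRAM"},
]


def count_by_level(findings):
    """Return a Counter mapping severity level to number of findings."""
    return Counter(f["level"] for f in findings)


def is_blocking(findings):
    """True if any finding is HIGH or CRIT, the levels strict mode rejects."""
    return any(f["level"] in {"HIGH", "CRIT"} for f in findings)


counts = count_by_level(sample_findings)
print(dict(counts))
print("blocking:", is_blocking(sample_findings))
```

With real data, pass the list returned by sdsconv.validate instead of sample_findings.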

Corpus-scale evaluation (no manual review needed):

from sdsconv.eval import eval_corpus

df = eval_corpus(
    input_dir="data/sds_raw",
    output_dir="runs/eval_001",
    jobs=8,
)
print(df[["filename", "overall_score", "grade", "high_count"]].head(20))

Examples

MHLW official sample SDS — allyl chloride (塩化アリル):

export ANTHROPIC_API_KEY=sk-ant-...
python examples/mhlw_allyl_chloride/convert.py

See examples/mhlw_allyl_chloride/ for expected.json, expected_report.json, and source attribution.


Why sdsconv

  • MHLW-native: Converts directly to the Japanese Ministry of Health, Labour and Welfare SDS data exchange format v1.0 (SDS_Schema_v1.0.json), validated against the official schema.
  • Evidence-based extraction: Uses an LLM to map free-form SDS text to ~200 nested schema fields. Source-text cross-checks detect hallucinations at the field level.
  • Corpus-scale quality evaluation: eval_corpus processes hundreds of SDS documents and outputs per-rule failure counts, section scores, and causasv_features.csv for root-cause analysis — without any human review.
  • No lock-in: Supports Anthropic Claude, OpenAI GPT, Google Gemini, Mistral, Groq, Cohere, and any OpenAI-compatible local endpoint. Bring your own model.
  • Rust core: Extraction, schema validation, GHS/CAS checks, and DOCX/HTML generation run in native code. Thin Python bindings on top.

MHLW Compliance

sdsconv targets the MHLW SDS data exchange format v1.0 published 2025-03-31.

Rule | Behaviour
--- | ---
Schema validation | Validates against SDS_Schema_v1.0.json
Empty-field removal | Removes "", null, [], {} per §3.3
AdditionalInfo | Content outside the official schema is written to AdditionalInfo.FullText
--strict-mhlw | Exits 1 (CLI) / raises ValueError (Python) if any HIGH or CRIT finding
CRIT/HIGH/MED findings | Structured validation report with rule ID, severity, path, message
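
The empty-field removal rule can be illustrated with a small recursive function. This is a sketch of the idea only, not the Rust implementation. It assumes that a container which becomes empty after its children are pruned is removed as well, which the rule description above does not state. The field names in the sample are hypothetical.

```python
def _is_empty(value):
    """True for the four empties named by the rule: "", None, [] and {}."""
    # 0 and False are real values and must be kept, so test types explicitly.
    return value is None or (
        isinstance(value, (str, list, dict)) and len(value) == 0
    )


def prune_empty(value):
    """Recursively drop empty values from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not _is_empty(item)}
    if isinstance(value, list):
        cleaned = [prune_empty(item) for item in value]
        return [item for item in cleaned if not _is_empty(item)]
    return value


# Hypothetical field names, for illustration only.
document = {
    "ProductName": "Allyl chloride",
    "Synonyms": [],
    "Supplier": {"Fax": "", "Phone": None},
    "FlashPoint": 0,
}
print(prune_empty(document))
```

Pruning bottom-up means the Supplier entry disappears once both of its children are removed, while the numeric 0 is kept.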

Validation rules include: GHS H/P-code validity (GHS Rev.10), CAS format and check-digit, Section 2 GHS completeness (H-codes ↔ pictograms ↔ signal word), Section 3 component row alignment (name/CAS/concentration), UN number completeness, concentration range bounds, duplicate code detection, and more.
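
As an illustration of the CAS format and check-digit rule, the standard CAS Registry Number checksum can be written in a few lines of Python. This is a sketch of the public algorithm, not sdsconv's own code, and the function name is hypothetical. The final digit must equal the weighted sum of the preceding digits modulo 10, with weights counted 1, 2, 3, ... from the right.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # The rightmost body digit has weight 1, the next one 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("107-05-1"))  # allyl chloride: True
print(is_valid_cas("107-05-2"))  # wrong check digit: False
```

For 107-05-1 the sum is 5·1 + 0·2 + 7·3 + 0·4 + 1·5 = 31, and 31 mod 10 = 1 matches the final digit.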

Quality baseline (30-file random sample, seed=42):

CRIT=0 · avg score 89.6 · top issues: S2-HAZARD-NO-PICTOGRAM, S15-ZHCN-NO-GB, S14-NO-SHIPPING-NAME

Full rule catalogue → docs/quality-check.md


Corpus Evaluation

Run without human review:

from sdsconv.eval import eval_corpus

df = eval_corpus("data/sds_raw", "runs/eval_001", jobs=8)

Outputs written to the output directory:

File | Contents
--- | ---
generated/<stem>.json | MHLW standard JSON
reports/<stem>.json | ConversionReport (language, populated sections, warnings)
findings/<stem>.json | Structured validation findings
summary.csv | Per-file scores and grades
failures_by_rule.csv | Rule frequency and affected file counts
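
The summary file can be post-processed with the standard library alone. The sketch below assumes summary.csv carries the same columns as the DataFrame printed in the Quick Start (filename, overall_score, grade, high_count). The sample rows, the grade letters, and the helper name are made up for illustration.

```python
import csv
import io

# Stand-in for runs/eval_001/summary.csv. Column names are assumed to match
# the DataFrame columns shown in the Quick Start; the rows are invented.
SAMPLE_SUMMARY = """filename,overall_score,grade,high_count
a.pdf,95.0,A,0
b.pdf,71.5,C,3
c.pdf,88.0,B,1
"""


def files_needing_attention(csv_file, min_score=80.0):
    """Return filenames scoring below min_score or with any HIGH finding."""
    flagged = []
    for row in csv.DictReader(csv_file):
        low_score = float(row["overall_score"]) < min_score
        has_high = int(row["high_count"]) > 0
        if low_score or has_high:
            flagged.append(row["filename"])
    return flagged


print(files_needing_attention(io.StringIO(SAMPLE_SUMMARY)))
```

Against a real run, pass open("runs/eval_001/summary.csv", newline="") in place of the in-memory sample.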

Root-cause analysis with causasv:

from sdsconv.causasv_bridge import print_ranking
print_ranking("runs/eval_001/causasv_features.csv")

CLI

# PDF/DOCX/XLSX/HTML/URL → MHLW JSON
sdsconv to-json --input input.pdf --output output.json --lang ja

# With correction pass and PubChem enrichment
sdsconv to-json --input input.pdf --output output.json --correct --enrich

# JSON → Word document (16 JIS Z 7253 sections)
sdsconv to-docx --input output.json --output result.docx --lang ja

# JSON → HTML (printable, A4, inline CSS)
sdsconv to-html --input output.json --output result.html --lang ja

# Validate with strict MHLW mode
sdsconv validate --input output.json --strict-mhlw

# Batch: process a directory
sdsconv to-json --input-dir data/ --output-dir out/ --jobs 8

# Corpus evaluation
sdsconv eval-corpus --input-dir data/sds_raw --output-dir runs/eval_001 --jobs 8

Full CLI reference → sdsconv/README.md


REST API

# Start server (binds to 127.0.0.1:3000 by default)
SDS_SERVER_TOKEN=secret sdsconv-server

# Convert a PDF
curl -X POST http://localhost:3000/api/to-json \
  -H "Authorization: Bearer secret" \
  -F "file=@input.pdf"

Endpoints: POST /api/to-json · POST /api/to-docx · POST /api/to-html · POST /api/validate · GET /api/health
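
The curl call above can also be issued from Python without third-party packages. This is a minimal sketch using only the standard library. The helper name is hypothetical, the part's content type is a guess, and the response format is not documented here, so the sketch builds the request and leaves sending it as a commented step.

```python
import urllib.request
import uuid


def build_to_json_request(pdf_bytes, filename, token,
                          base_url="http://localhost:3000"):
    """Build (but do not send) a multipart POST similar to the curl call."""
    boundary = uuid.uuid4().hex
    disposition = (
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
    )
    body = b"".join([
        f"--{boundary}\r\n".encode("utf-8"),
        disposition.encode("utf-8"),
        b"Content-Type: application/octet-stream\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode("utf-8"),
    ])
    return urllib.request.Request(
        f"{base_url}/api/to-json",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_to_json_request(b"%PDF-1.4 placeholder", "input.pdf", "secret")
print(request.get_method(), request.full_url)

# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```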


GUI

Run sdsconv without arguments to open the graphical interface:

sdsconv

Five tabs: Convert · Generate · Validate · Extract Text · Settings

Download the desktop app: macOS · Windows · brew install --cask sdsconv


Supported Inputs, Languages, and Backends

Input formats: PDF (text, CID/Shift-JIS, scanned) · DOCX · XLSX · TXT · HTML · URL

Source languages: ja (JIS Z 7253) · en (GHS/OSHA HazCom) · zh-cn (GB/T 16483) · zh-tw (CNS 15030)

LLM backends: Anthropic Claude · OpenAI GPT · Google Gemini · Mistral · Groq · Cohere · Local (any OpenAI-compatible endpoint)


For Developers

Rust library:

[dependencies]
sdsconv-core = "0.3"

See sdsconv_core/README.md for the Rust API.

Crates: sdsconv · sdsconv-core

Python package: sdsconv on PyPI — pip install sdsconv


Security & Privacy

  • Cloud LLM caution: When using a cloud LLM backend, SDS document text is sent to the API provider. Avoid sending confidential or trade-secret SDS documents to cloud APIs.
  • Local operation: Use --backend local with any OpenAI-compatible endpoint (e.g. Ollama, LM Studio) for fully offline operation. No data leaves your machine.
  • Raw SDS corpus: Add corpus/raw/ and data/sds_raw/ to .gitignore. Only corpus/manifest.jsonl (URLs + sha256 hashes) is safe to commit.
  • REST server: Bearer token authentication with timing-safe comparison, SSRF protection (full IPv6 coverage), redirect-disabled HTTP client, 50 MB upload cap.
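
Two of the REST server safeguards listed above have direct Python analogues. The sketch below illustrates the techniques only. The server itself is native code, and the function names and the exact set of blocked address classes are assumptions.

```python
import hmac
import ipaddress


def token_matches(presented: str, expected: str) -> bool:
    """Compare bearer tokens in time independent of where they differ."""
    return hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    )


def is_blocked_address(host_ip: str) -> bool:
    """True for addresses an SSRF filter would typically refuse to fetch."""
    ip = ipaddress.ip_address(host_ip)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the check.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


print(token_matches("secret", "secret"))
print(is_blocked_address("::ffff:10.0.0.1"))
```

A plain == comparison can return as soon as the first differing character is found, which leaks timing information. hmac.compare_digest avoids that.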

Comparison

See docs/comparison.md.


License

MIT OR Apache-2.0 — at your option.

Download files

Source distribution

sdsconv-0.1.3.tar.gz (177.5 kB; source)
    SHA256       ccde5e770e423efaf7a701cc58d8b3187aed49074bd88c739adbcedb20310932
    MD5          dbdcca7de22646818e8905b15f8583a4
    BLAKE2b-256  9a4319b60454442fce41f5a9ad72472171fa027243b3662faede909a344f67b9

Built distributions

sdsconv-0.1.3-cp39-abi3-win_amd64.whl (5.4 MB; CPython 3.9+, Windows x86-64)
    SHA256       2a37c0889176fe6f524e9cf3c6627eedf9d8835da351211537e9ece8228934a8
    MD5          ca56a0029f5552f0e7702e90cacb71fa
    BLAKE2b-256  e074c27acebdd9afb138899828025105c8915a02fa67da7fb395c52e847887ee

sdsconv-0.1.3-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.6 MB; CPython 3.9+, manylinux: glibc 2.17+, x86-64)
    SHA256       af0fb448f081911af3dda9de3c4755bb4e8d0b757fa7f7b28d8efc919bae93e1
    MD5          fde74c5e5de8a9ce7ba0ebcff862ff4a
    BLAKE2b-256  2958e0295730ed21b4e023004813429be2b3fdad394c42d0714c1174e84bf304

sdsconv-0.1.3-cp39-abi3-macosx_11_0_arm64.whl (5.0 MB; CPython 3.9+, macOS 11.0+, ARM64)
    SHA256       0a0799a2571c8b2bddc54a9a657aa9dd7f77dc5fb353961bf78203b1b0e7e424
    MD5          bc27c02b73eca87ce2b1eb3db7d24bf3
    BLAKE2b-256  0010b055df5ab9c5ed2191ccd7f7fb831defa0fc56c6f3411f2a1a734cda46e7

The source distribution and the Windows wheel were uploaded via twine/6.1.0 (CPython/3.13.12), without Trusted Publishing.
