Python bindings for sdsconv — SDS ↔ MHLW standard JSON converter
Project description
sdsconv
Python-first, Rust-powered toolkit for converting Safety Data Sheets to Japan MHLW standard JSON — with schema validation, GHS/CAS checks, and corpus-scale quality evaluation.
Install
pip install sdsconv # Python bindings
pip install "sdsconv[analysis]" # + causasv quality analysis
cargo install sdsconv # CLI / GUI binary
Quick Start — Python
import sdsconv
# Extract raw text (no LLM)
text = sdsconv.extract_text("sample.pdf")
# Convert from URL
data, report = sdsconv.to_json_url_with_report(
"https://example.com/sds.pdf", lang="ja",
)
# Convert SDS document → MHLW standard JSON
data, report = sdsconv.to_json_with_report(
"sample.pdf",
lang="ja",
strict_mhlw=True,
)
# Validate and get structured findings
findings = sdsconv.validate(data, strict_mhlw=True)
print(f"Sections populated: {len(report['populated_sections'])}")
print(f"Findings: {len(findings)} ({sum(1 for f in findings if f['level']=='HIGH')} HIGH)")
# Save MHLW JSON
sdsconv.write_json(data, "output.json")
Corpus-scale evaluation (no manual review needed):
from sdsconv.eval import eval_corpus
df = eval_corpus(
input_dir="data/sds_raw",
output_dir="runs/eval_001",
jobs=8,
)
print(df[["filename", "overall_score", "grade", "high_count"]].head(20))
Examples
MHLW official sample SDS — allyl chloride (塩化アリル):
export ANTHROPIC_API_KEY=sk-ant-...
python examples/mhlw_allyl_chloride/convert.py
See examples/mhlw_allyl_chloride/ for
expected.json, expected_report.json, and source attribution.
Why sdsconv
- MHLW-native: Converts directly to the Japanese Ministry of Health, Labour and Welfare SDS data exchange format v1.0 (
SDS_Schema_v1.0.json), validated against the official schema. - Evidence-based extraction: Uses LLM to map free-form SDS text to ~200 nested schema fields. Source-text cross-checks detect hallucinations at the field level.
- Corpus-scale quality evaluation:
eval_corpusprocesses hundreds of SDS documents and outputs per-rule failure counts, section scores, andcausasv_features.csvfor root-cause analysis — without any human review. - No lock-in: Supports Anthropic Claude, OpenAI GPT, Google Gemini, Mistral, Groq, Cohere, and any OpenAI-compatible local endpoint. Bring your own model.
- Rust core: Extraction, schema validation, GHS/CAS checks, and DOCX/HTML generation run in native code. Thin Python bindings on top.
MHLW Compliance
sdsconv targets the MHLW SDS data exchange format v1.0 published 2025-03-31.
| Rule | Behaviour |
|---|---|
| Schema validation | Validates against SDS_Schema_v1.0.json |
| Empty-field removal | Removes "", null, [], {} per §3.3 |
| AdditionalInfo | Content outside the official schema is written to AdditionalInfo.FullText |
--strict-mhlw |
Exits 1 (CLI) / raises ValueError (Python) if any HIGH or CRIT finding |
| CRIT/HIGH/MED findings | Structured validation report with rule ID, severity, path, message |
Validation rules include: GHS H/P-code validity (GHS Rev.10), CAS format and check-digit, Section 2 GHS completeness (H-codes ↔ pictograms ↔ signal word), Section 3 component row alignment (name/CAS/concentration), UN number completeness, concentration range bounds, duplicate code detection, and more.
Quality baseline (30-file random sample, seed=42):
CRIT=0 · avg score 89.6 · top issues:
S2-HAZARD-NO-PICTOGRAM,S15-ZHCN-NO-GB,S14-NO-SHIPPING-NAME
Full rule catalogue → docs/quality-check.md
Corpus Evaluation
Run without human review:
from sdsconv.eval import eval_corpus
df = eval_corpus("data/sds_raw", "runs/eval_001", jobs=8)
Outputs per file:
| File | Contents |
|---|---|
generated/<stem>.json |
MHLW standard JSON |
reports/<stem>.json |
ConversionReport (language, populated sections, warnings) |
findings/<stem>.json |
Structured validation findings |
summary.csv |
Per-file scores and grades |
failures_by_rule.csv |
Rule frequency and affected file counts |
Root-cause analysis with causasv:
from sdsconv.causasv_bridge import print_ranking
print_ranking("runs/eval_001/causasv_features.csv")
CLI
# PDF/DOCX/XLSX/HTML/URL → MHLW JSON
sdsconv to-json --input input.pdf --output output.json --lang ja
# With correction pass and PubChem enrichment
sdsconv to-json --input input.pdf --output output.json --correct --enrich
# JSON → Word document (16 JIS Z 7253 sections)
sdsconv to-docx --input output.json --output result.docx --lang ja
# JSON → HTML (printable, A4, inline CSS)
sdsconv to-html --input output.json --output result.html --lang ja
# Validate with strict MHLW mode
sdsconv validate --input output.json --strict-mhlw
# Batch: process a directory
sdsconv to-json --input-dir data/ --output-dir out/ --jobs 8
# Corpus evaluation
sdsconv eval-corpus --input-dir data/sds_raw --output-dir runs/eval_001 --jobs 8
Full CLI reference → sdsconv/README.md
REST API
# Start server (binds to 127.0.0.1:3000 by default)
SDS_SERVER_TOKEN=secret sdsconv-server
# Convert a PDF
curl -X POST http://localhost:3000/api/to-json \
-H "Authorization: Bearer secret" \
-F "file=@input.pdf"
Endpoints: POST /api/to-json · POST /api/to-docx · POST /api/to-html · POST /api/validate · GET /api/health
GUI
Run sdsconv without arguments to open the graphical interface:
sdsconv
Five tabs: Convert · Generate · Validate · Extract Text · Settings
Download the desktop app: macOS · Windows · brew install --cask sdsconv
Supported Inputs, Languages, and Backends
Input formats: PDF (text, CID/Shift-JIS, scanned) · DOCX · XLSX · TXT · HTML · URL
Source languages: ja (JIS Z 7253) · en (GHS/OSHA HazCom) · zh-cn (GB/T 16483) · zh-tw (CNS 15030)
LLM backends: Anthropic Claude · OpenAI GPT · Google Gemini · Mistral · Groq · Cohere · Local (any OpenAI-compatible endpoint)
For Developers
Rust library:
[dependencies]
sdsconv-core = "0.3"
See sdsconv_core/README.md for the Rust API.
Crates: sdsconv · sdsconv-core
Python package: sdsconv on PyPI — pip install sdsconv
Security & Privacy
- Cloud LLM caution: When using a cloud LLM backend, SDS document text is sent to the API provider. Avoid sending confidential or trade-secret SDS documents to cloud APIs.
- Local operation: Use
--backend localwith any OpenAI-compatible endpoint (e.g. Ollama, LM Studio) for fully offline operation. No data leaves your machine. - Raw SDS corpus: Add
corpus/raw/anddata/sds_raw/to.gitignore. Onlycorpus/manifest.jsonl(URLs + sha256 hashes) is safe to commit. - REST server: Bearer token authentication with timing-safe comparison, SSRF protection (full IPv6 coverage), redirect-disabled HTTP client, 50 MB upload cap.
Comparison
References
- MHLW — SDS Standard Data Exchange Format (official) (Japanese)
- SDS Data Exchange Format Developer Manual (PDF) (Japanese)
- Quality Check Rule Catalogue — 53 rules by section (日本語 / 中文)
- CHANGELOG
License
MIT OR Apache-2.0 — at your option.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sdsconv-0.1.3.tar.gz.
File metadata
- Download URL: sdsconv-0.1.3.tar.gz
- Upload date:
- Size: 177.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ccde5e770e423efaf7a701cc58d8b3187aed49074bd88c739adbcedb20310932
|
|
| MD5 |
dbdcca7de22646818e8905b15f8583a4
|
|
| BLAKE2b-256 |
9a4319b60454442fce41f5a9ad72472171fa027243b3662faede909a344f67b9
|
File details
Details for the file sdsconv-0.1.3-cp39-abi3-win_amd64.whl.
File metadata
- Download URL: sdsconv-0.1.3-cp39-abi3-win_amd64.whl
- Upload date:
- Size: 5.4 MB
- Tags: CPython 3.9+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2a37c0889176fe6f524e9cf3c6627eedf9d8835da351211537e9ece8228934a8
|
|
| MD5 |
ca56a0029f5552f0e7702e90cacb71fa
|
|
| BLAKE2b-256 |
e074c27acebdd9afb138899828025105c8915a02fa67da7fb395c52e847887ee
|
File details
Details for the file sdsconv-0.1.3-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: sdsconv-0.1.3-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 5.6 MB
- Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
af0fb448f081911af3dda9de3c4755bb4e8d0b757fa7f7b28d8efc919bae93e1
|
|
| MD5 |
fde74c5e5de8a9ce7ba0ebcff862ff4a
|
|
| BLAKE2b-256 |
2958e0295730ed21b4e023004813429be2b3fdad394c42d0714c1174e84bf304
|
File details
Details for the file sdsconv-0.1.3-cp39-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: sdsconv-0.1.3-cp39-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 5.0 MB
- Tags: CPython 3.9+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0a0799a2571c8b2bddc54a9a657aa9dd7f77dc5fb353961bf78203b1b0e7e424
|
|
| MD5 |
bc27c02b73eca87ce2b1eb3db7d24bf3
|
|
| BLAKE2b-256 |
0010b055df5ab9c5ed2191ccd7f7fb831defa0fc56c6f3411f2a1a734cda46e7
|