dsjframe
dsjframe is a Python library for accurately reading, writing, and validating Dataset-JSON with Apache Arrow.
It supports plain JSON, NDJSON, and DSJC (.json, .ndjson, and .dsjc).
Installation
Install the base package:

```shell
pip install dsjframe
```

Install the optional dependencies to support additional file formats:

```shell
pip install dsjframe[file-support]
```
Python 3.10 or newer is required.
Quick Start
```python
import dsjframe

# Read
table = dsjframe.read_dataset("adsl.json")

# Write
metadata = {
    "datasetJSONVersion": "1.1.0",
    "label": "Subject Level Analysis Dataset",
}
dsjframe.write_dataset(table, "adsl.ndjson", metadata)

# Validate
report = dsjframe.validate_dataset("adsl.dsjc")
```
See example.ipynb for a longer walkthrough.
Metadata for Writing
Dataset-JSON allows extensions, but dsjframe targets the standard structure by default.
Unexpected metadata fields are treated as errors.
When writing, metadata is merged in this priority order:
- Explicit metadata
- define.xml
- Embedded Arrow schema metadata
- readstat_meta
- Library defaults such as datasetJSONVersion and itemGroupOID
datasetJSONCreationDateTime is filled automatically.
In many cases, you can omit most or all of metadata if define.xml, embedded schema metadata, or readstat_meta already provide the required fields.
When define.xml is used, metaDataRef is set to define.xml automatically.
If you need a different path or reference value, set it explicitly in metadata.
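The priority order above can be sketched in a few lines of plain Python. This is an illustration of the first-source-wins behavior described here, not dsjframe's internal implementation; the field names are taken from the examples in this document.

```python
def merge_metadata(*sources):
    """Merge metadata dicts; the first source that defines a key wins."""
    merged = {}
    for source in sources:
        for key, value in source.items():
            # setdefault keeps a key from an earlier, higher-priority source
            merged.setdefault(key, value)
    return merged

explicit = {"label": "Subject Level Analysis Dataset"}
define_xml_meta = {"label": "ADSL", "itemGroupOID": "IG.ADSL"}
defaults = {"datasetJSONVersion": "1.1.0"}

merged = merge_metadata(explicit, define_xml_meta, defaults)
# The explicit label wins; itemGroupOID and datasetJSONVersion fall
# through from the lower-priority sources.
```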
When writing from a pyarrow.Table or pandas DataFrame, each column's dataType is derived from the actual column type in the frame, except for text-backed temporal columns declared as date, time, or datetime.
Provided column metadata is still used for fields such as label, length, displayFormat, and keySequence, but it does not override the real data type.
Compatible targetDataType values are preserved where allowed. Decimal exports are normalized to "targetDataType": "decimal", and targetDataType: "integer" is ignored for text-backed temporal columns, so partial dates and other ISO 8601 text values remain text-backed.
Missing Values
dsjframe follows Arrow conventions and represents missing values as null.
In practice, especially when Dataset-JSON is used as an XPT replacement, character missing values are often written as "" instead of null.
Because Arrow is better aligned with nulls for missing data, empty strings in string-like columns are converted to null by default when reading.
If you need to preserve the distinction between null and an empty string, you can disable that conversion with an option.
More Than Metadata Validation
Because Dataset-JSON is a text format, its readability and editability are often treated as advantages. Those same properties can also make type drift, malformed values, or accidental edits harder to catch.
To address that, dsjframe includes strict validation for both metadata and row data.
In addition to JSON Schema validation for metadata, dsjframe also checks combinations of dataType and targetDataType, row structure, value conversion, record counts, and consistency between file content and file extension.
String-backed temporal values are also checked as ISO 8601 text.
Reduced-precision partial dates such as YYYY and YYYY-MM are accepted for temporal text values.
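One way to express the reduced-precision date rule is a simple pattern check; this regex is an illustration of the accepted shapes (YYYY, YYYY-MM, YYYY-MM-DD), not dsjframe's actual validator, which also covers time and datetime text.

```python
import re

# Full-precision dates plus the reduced-precision forms mentioned above.
PARTIAL_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

assert PARTIAL_DATE.match("2024")
assert PARTIAL_DATE.match("2024-07")
assert PARTIAL_DATE.match("2024-07-15")
assert not PARTIAL_DATE.match("2024-7")   # month must be two digits
```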
DSJC Support
DSJC is treated as gzip-compressed NDJSON.
The current implementation follows the available examples rather than the still-evolving compressed Dataset-JSON v1.1 wording. The specification may change, so DSJC behavior may need to change with it.
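Since DSJC is treated as gzip-compressed NDJSON, a round trip can be sketched with only the standard library. This shows the layout assumed above (one JSON value per line, the whole stream gzipped), not dsjframe's reader:

```python
import gzip
import io
import json

rows = [{"USUBJID": "01-001"}, {"USUBJID": "01-002"}]

# Write: one JSON document per line, gzip-compressed.
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read: decompress, then parse line by line.
buf.seek(0)
with gzip.open(buf, "rt", encoding="utf-8") as f:
    decoded = [json.loads(line) for line in f]
```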
API Reference
read_dataset(source, *, as_pandas=False, out_metadata=False, empty_to_null=True)
Read Dataset-JSON, NDJSON, or DSJC from a path, bytes object, or file-like object.
It returns a pyarrow.Table by default, or a pandas DataFrame when as_pandas=True.
Set out_metadata=True to also receive pyreadstat-compatible metadata.
By default, empty strings in string-like columns are converted to null on read; pass empty_to_null=False to keep them as empty strings.
write_dataset(frame, destination, metadata=None, *, output_format=None, define_xml=None, readstat_meta=None, compression_level=6, json_indent=2)
Write a pyarrow.Table or pandas DataFrame as JSON, NDJSON, or DSJC.
The output format is inferred from the file suffix unless you pass output_format explicitly.
Metadata can come from metadata, define.xml, Arrow schema metadata, or readstat_meta.
detect_format(source)
Inspect a source and return a lightweight report describing the detected format. This is useful when you want to check the input before reading or validating it.
validate_dataset(source, *, validate_metadata=True, validate_data=True)
Validate a dataset and return a diagnostic report. You can validate only metadata, or validate both metadata and row data.
build_metadata(*, frame=None, metadata=None, define_xml=None, readstat_meta=None)
Merge metadata from the available sources and return a validated Dataset-JSON metadata dictionary. Use this when you want to inspect or prepare export metadata before writing rows.
build_schema(source_or_dataset)
Build a pyarrow.Schema from a Dataset-JSON source or from a Dataset-JSON payload dictionary.
Use this when you need the inferred Arrow schema without loading the dataset rows.
Common Errors
| Case | Example message |
|---|---|
| Unsupported input source | unsupported input source |
| Empty file or payload | empty input payload |
| Format detection failed | could not determine dataset format |
| Invalid DSJC payload | invalid DSJC payload |
| Missing required metadata | missing required field: label |
| JSON Schema validation failed | validation failed at records: ... |
| Decimal column missing required targetDataType | targetDataType is required for dataType decimal |
| Invalid targetDataType and dataType combination | targetDataType is not allowed for dataType float |
| Invalid ISO 8601 temporal text value during read or validation | failed to convert value for column AESTDTC |
| Row shape does not match columns | row does not match columns schema |
| records does not match the actual row count | records does not match actual row count |
| Value conversion failed | failed to convert value for column TRTSDT |
| Unsupported Dataset-JSON dataType | unsupported column type: binary |
| pandas output requested without pandas installed | pandas support is not installed |
Common Export Errors
| Case | Example message |
|---|---|
| metadata is not a dictionary | metadata must be a dictionary |
| Required export metadata is missing | missing required export metadata: label |
| Unexpected metadata key | unexpected metadata fields: unexpected |
| Invalid datasetJSONCreationDateTime | datasetJSONCreationDateTime has invalid format |
| Invalid datasetJSONVersion | datasetJSONVersion must be 1.1.x |
| Incomplete sourceSystem object | sourceSystem requires name and version |
| Unexpected column metadata key | unexpected column metadata fields: badField |
| Unsupported targetDataType | unsupported targetDataType: float |
| Invalid ISO 8601 temporal text value during export | failed to convert value for column AESTDTC |
| Decimal value exceeds configured length | decimal value exceeds configured length |
| Unsupported Arrow or pandas type | unsupported Arrow type |
| Invalid Define-XML syntax | failed to parse define.xml |
| Missing itemGroupOID in Define-XML | itemGroupOID not found in define.xml |
Development
See AGENTS.md and CHANGELOG.md.
License
This project is licensed under the AGPL.
Code from this repository that is provided to an AI system, and code produced from that input, is treated as a derivative work. Redistributing an AI-based reimplementation without preserving this license is considered copyright infringement.
This project is developed and maintained independently.
To help keep it maintained, consider supporting it through sponsorship or by engaging me for contract work.
Contact: info@knworx.com.
tests/data/official_example and dsjframe/schema are taken from https://github.com/cdisc-org/DataExchange-DatasetJson (Copyright (c) 2022 cdisc) under the MIT license.