dsjframe
dsjframe is a Python library for accurately reading, writing, and validating Dataset-JSON with Apache Arrow.
It supports plain JSON, NDJSON, and DSJC (.json, .ndjson, and .dsjc).
Installation
Install the base package:

```shell
pip install dsjframe
```

Install the optional dependencies to support additional file formats:

```shell
pip install dsjframe[file-support]
```
Python 3.10 or newer is required.
Quick Start
```python
import dsjframe

# Read
table = dsjframe.read_dataset("adsl.json")

# Write
metadata = {
    "datasetJSONVersion": "1.1.0",
    "label": "Subject Level Analysis Dataset",
}
dsjframe.write_dataset(table, "adsl.ndjson", metadata)

# Validate
report = dsjframe.validate_dataset("adsl.dsjc")
```
See example.ipynb for a longer walkthrough.
Metadata for Writing
Dataset-JSON allows extensions, but dsjframe targets the standard structure by default.
Unexpected metadata fields are treated as errors.
When writing, metadata is merged in this priority order:
- Explicit metadata
- define.xml
- Embedded Arrow schema metadata
- readstat_meta
- Library defaults such as datasetJSONVersion and itemGroupOID
datasetJSONCreationDateTime is filled automatically.
In many cases, you can omit most or all of metadata if define.xml, embedded schema metadata, or readstat_meta already provide the required fields.
When define.xml is used, metaDataRef is set to define.xml automatically.
If you need a different path or reference value, set it explicitly in metadata.
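The priority order above can be sketched in a few lines of plain Python. This is an illustration of the first-source-wins behavior described here, not dsjframe's internal implementation; the field names are taken from the examples in this document.

```python
def merge_metadata(*sources):
    """Merge metadata dicts; the first source that defines a key wins."""
    merged = {}
    for source in sources:
        for key, value in source.items():
            # setdefault keeps a key from an earlier, higher-priority source
            merged.setdefault(key, value)
    return merged

explicit = {"label": "Subject Level Analysis Dataset"}
define_xml_meta = {"label": "ADSL", "itemGroupOID": "IG.ADSL"}
defaults = {"datasetJSONVersion": "1.1.0"}

merged = merge_metadata(explicit, define_xml_meta, defaults)
# The explicit label wins; itemGroupOID and datasetJSONVersion fall
# through from the lower-priority sources.
```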
When writing from a pyarrow.Table or pandas DataFrame, each column's dataType is derived from the actual column type in the frame, except for text-backed temporal columns declared as date, time, or datetime.
Provided column metadata is still used for fields such as label, length, displayFormat, and keySequence, but it does not override the real data type.
Compatible targetDataType values are preserved where allowed. Decimal exports are normalized to "targetDataType": "decimal", and targetDataType: "integer" is ignored for text-backed temporal columns, so partial dates and other ISO 8601 text values remain text-backed.
Missing Values
dsjframe follows Arrow conventions and represents missing values as null.
In practice, especially when Dataset-JSON is used as an XPT replacement, character missing values are often written as "" instead of null.
Because Arrow is better aligned with nulls for missing data, empty strings in string-like columns are converted to null by default when reading.
If you need to preserve the distinction between null and an empty string, you can disable that conversion with an option.
More Than Metadata Validation
Because Dataset-JSON is a text format, its readability and editability are often treated as advantages. Those same properties can also make type drift, malformed values, or accidental edits harder to catch.
To address that, dsjframe includes strict validation for both metadata and row data.
In addition to JSON Schema validation for metadata, dsjframe also checks combinations of dataType and targetDataType, row structure, value conversion, record counts, and consistency between file content and file extension.
String-backed temporal values are also checked as ISO 8601 text.
Reduced-precision partial dates such as YYYY and YYYY-MM are accepted for temporal text values.
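One way to express the reduced-precision date rule is a simple pattern check; this regex is an illustration of the accepted shapes (YYYY, YYYY-MM, YYYY-MM-DD), not dsjframe's actual validator, which also covers time and datetime text.

```python
import re

# Full-precision dates plus the reduced-precision forms mentioned above.
PARTIAL_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

assert PARTIAL_DATE.match("2024")
assert PARTIAL_DATE.match("2024-07")
assert PARTIAL_DATE.match("2024-07-15")
assert not PARTIAL_DATE.match("2024-7")   # month must be two digits
```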
DSJC Support
DSJC is treated as gzip-compressed NDJSON.
The current implementation follows the available examples rather than the still-evolving compressed Dataset-JSON v1.1 wording. The specification may change, so DSJC behavior may need to change with it.
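Since DSJC is treated as gzip-compressed NDJSON, a round trip can be sketched with only the standard library. This shows the layout assumed above (one JSON value per line, the whole stream gzipped), not dsjframe's reader:

```python
import gzip
import io
import json

rows = [{"USUBJID": "01-001"}, {"USUBJID": "01-002"}]

# Write: one JSON document per line, gzip-compressed.
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read: decompress, then parse line by line.
buf.seek(0)
with gzip.open(buf, "rt", encoding="utf-8") as f:
    decoded = [json.loads(line) for line in f]
```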
API Reference
read_dataset(source, *, as_pandas=False, out_metadata=False, empty_to_null=True)
Read Dataset-JSON, NDJSON, or DSJC from a path, bytes object, or file-like object.
It returns a pyarrow.Table by default, or a pandas DataFrame when as_pandas=True.
Set out_metadata=True to also receive pyreadstat-compatible metadata.
By default, empty strings in string-like columns are converted to null on read; pass empty_to_null=False to keep them as empty strings.
write_dataset(frame, destination, metadata=None, *, output_format=None, define_xml=None, readstat_meta=None, compression_level=6, json_indent=2)
Write a pyarrow.Table or pandas DataFrame as JSON, NDJSON, or DSJC.
The output format is inferred from the file suffix unless you pass output_format explicitly.
Metadata can come from metadata, define.xml, Arrow schema metadata, or readstat_meta.
detect_format(source)
Inspect a source and return a lightweight report describing the detected format. This is useful when you want to check the input before reading or validating it.
validate_dataset(source, *, validate_metadata=True, validate_data=True)
Validate a dataset and return a diagnostic report. You can validate only metadata, or validate both metadata and row data.
build_metadata(*, frame=None, metadata=None, define_xml=None, readstat_meta=None)
Merge metadata from the available sources and return a validated Dataset-JSON metadata dictionary. Use this when you want to inspect or prepare export metadata before writing rows.
build_schema(source_or_dataset)
Build a pyarrow.Schema from a Dataset-JSON source or from a Dataset-JSON payload dictionary.
Use this when you need the inferred Arrow schema without loading the dataset rows.
Common Errors
| Case | Example message |
|---|---|
| Unsupported input source | unsupported input source |
| Empty file or payload | empty input payload |
| Format detection failed | could not determine dataset format |
| Invalid DSJC payload | invalid DSJC payload |
| Missing required metadata | missing required field: label |
| JSON Schema validation failed | validation failed at records: ... |
| Decimal column missing required targetDataType | targetDataType is required for dataType decimal |
| Invalid targetDataType and dataType combination | targetDataType is not allowed for dataType float |
| Invalid ISO 8601 temporal text value during read or validation | failed to convert value for column AESTDTC |
| Row shape does not match columns | row does not match columns schema |
| records does not match the actual row count | records does not match actual row count |
| Value conversion failed | failed to convert value for column TRTSDT |
| Unsupported Dataset-JSON dataType | unsupported column type: binary |
| pandas output requested without pandas installed | pandas support is not installed |
Common Export Errors
| Case | Example message |
|---|---|
| metadata is not a dictionary | metadata must be a dictionary |
| Required export metadata is missing | missing required export metadata: label |
| Unexpected metadata key | unexpected metadata fields: unexpected |
| Invalid datasetJSONCreationDateTime | datasetJSONCreationDateTime has invalid format |
| Invalid datasetJSONVersion | datasetJSONVersion must be 1.1.x |
| Incomplete sourceSystem object | sourceSystem requires name and version |
| Unexpected column metadata key | unexpected column metadata fields: badField |
| Unsupported targetDataType | unsupported targetDataType: float |
| Invalid ISO 8601 temporal text value during export | failed to convert value for column AESTDTC |
| Decimal value exceeds configured length | decimal value exceeds configured length |
| Unsupported Arrow or pandas type | unsupported Arrow type |
| Invalid Define-XML syntax | failed to parse define.xml |
| Missing itemGroupOID in Define-XML | itemGroupOID not found in define.xml |
Development
See AGENTS.md and CHANGELOG.md.
License
This project is licensed under the AGPL.
Code from this repository that is provided to an AI system, and code produced from that input, is treated as a derivative work. Redistributing an AI-based reimplementation without preserving this license is considered copyright infringement.
This project is developed and maintained independently.
To help keep it maintained, consider supporting it through sponsorship or by engaging me for contract work.
Contact: info@knworx.com.
tests/data/official_example and dsjframe/schema are taken from https://github.com/cdisc-org/DataExchange-DatasetJson (Copyright (c) 2022 cdisc) under the MIT license.