
dsjframe

dsjframe is a Python library for accurately reading, writing, and validating Dataset-JSON with Apache Arrow. It supports plain JSON, NDJSON, and DSJC (.json, .ndjson, and .dsjc).

Installation

Install the base package:

pip install dsjframe

Install the optional dependencies to support additional file formats:

pip install dsjframe[file-support]

Python 3.10 or newer is required.

Quick Start

import dsjframe
# Read
table = dsjframe.read_dataset("adsl.json")

# Write
metadata = {
    "datasetJSONVersion": "1.1.0",
    "label": "Subject Level Analysis Dataset",
}

dsjframe.write_dataset(table, "adsl.ndjson", metadata)

# Validate
report = dsjframe.validate_dataset("adsl.dsjc")

See example.ipynb for a longer walkthrough.

Metadata for Writing

Dataset-JSON allows extensions, but dsjframe targets the standard structure by default. Unexpected metadata fields are treated as errors.

When writing, metadata is merged in this priority order:

  1. Explicit metadata
  2. define.xml
  3. Embedded Arrow schema metadata
  4. readstat_meta
  5. Library defaults such as datasetJSONVersion and itemGroupOID

datasetJSONCreationDateTime is filled automatically. In many cases, you can omit most or all of metadata if define.xml, embedded schema metadata, or readstat_meta already provide the required fields. When define.xml is used, metaDataRef is set to define.xml automatically. If you need a different path or reference value, set it explicitly in metadata.
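The precedence order above can be pictured as a chain of dictionaries where an earlier source wins and later sources only fill in keys that are still missing. This is an illustrative sketch with made-up field values, not dsjframe's actual merge code:

```python
from collections import ChainMap

# Hypothetical metadata from each source; earlier maps take precedence.
explicit = {"label": "Subject Level Analysis Dataset"}
define_xml = {"label": "From Define-XML", "itemGroupOID": "IG.ADSL"}
schema_meta = {"datasetJSONVersion": "1.1.0"}
defaults = {"datasetJSONVersion": "1.1.0", "itemGroupOID": "IG.DATASET"}

# ChainMap looks up each key left to right, so explicit metadata wins,
# define.xml fills what remains, and library defaults come last.
merged = dict(ChainMap(explicit, define_xml, schema_meta, defaults))
```

Here the explicit label overrides the Define-XML label, while itemGroupOID still comes from define.xml because no earlier source provided it.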

When writing from a pyarrow.Table or pandas DataFrame, column dataType is derived from the actual column type in the frame. Provided column metadata is still used for fields such as label, length, displayFormat, and keySequence, but it does not override the real data type. Compatible targetDataType values are preserved where allowed, and decimal exports are normalized to "targetDataType": "decimal".
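The derivation of dataType from the frame's column types can be sketched as a lookup table. The mapping below is a simplified, hypothetical subset for illustration; dsjframe's real derivation inspects pyarrow types directly and covers more cases:

```python
# Hypothetical sketch: Arrow type name -> Dataset-JSON dataType.
# The real library works on pyarrow type objects, not name strings.
ARROW_TO_DATASET_JSON = {
    "string": "string",
    "int64": "integer",
    "double": "double",
    "bool": "boolean",
    "date32": "date",
    "timestamp": "datetime",
    "decimal128": "decimal",
}

def derive_data_type(arrow_type_name: str) -> str:
    # Column metadata may add label, length, etc., but the dataType
    # itself always comes from the actual column type.
    try:
        return ARROW_TO_DATASET_JSON[arrow_type_name]
    except KeyError:
        raise ValueError(f"unsupported column type: {arrow_type_name}")
```

An unmapped type such as binary raises an error, matching the "unsupported column type" case in the error table below.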

Missing Values

dsjframe follows Arrow conventions and represents missing values as null. In practice, especially when Dataset-JSON is used as an XPT replacement, character missing values are often written as "" instead of null. Because Arrow is better aligned with nulls for missing data, empty strings in string-like columns are converted to null by default when reading. If you need to preserve the distinction between null and an empty string, you can disable that conversion with an option.
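Conceptually, the default conversion on read behaves like the following sketch over a single string column (a plain-Python illustration, not the Arrow-level implementation):

```python
def empty_strings_to_null(values):
    # Mirror the documented default for string-like columns on read:
    # "" becomes a missing value (null); existing nulls stay null.
    return [None if v == "" else v for v in values]

cleaned = empty_strings_to_null(["A", "", None, "B"])
# → ["A", None, None, "B"]
```

With the conversion disabled (empty_to_null=False on read_dataset), the "" value would survive unchanged and remain distinguishable from null.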

More Than Metadata Validation

Because Dataset-JSON is a text format, its readability and editability are often treated as advantages. The same properties, however, make it easy to introduce type drift, malformed values, or accidental edits that go unnoticed.

To address that, dsjframe includes strict validation for both metadata and row data.

In addition to JSON Schema validation for metadata, dsjframe also checks combinations of dataType and targetDataType, row structure, value conversion, record counts, and consistency between file content and file extension.

DSJC Support

DSJC is treated as gzip-compressed NDJSON.

The current implementation follows the available examples rather than the still-evolving compressed Dataset-JSON v1.1 wording. The specification may change, so DSJC behavior may need to change with it.
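Since DSJC is gzip-compressed NDJSON, the compression layer can be reproduced with the standard library alone. The sketch below round-trips two made-up JSON lines; it shows only the gzip layer, not the Dataset-JSON metadata/row line structure inside the NDJSON:

```python
import gzip
import json

# Two illustrative JSON lines (real Dataset-JSON NDJSON has its own
# line structure; this only demonstrates the gzip wrapping).
rows = [{"USUBJID": "01-001", "AGE": 34}, {"USUBJID": "01-002", "AGE": 41}]
ndjson = "\n".join(json.dumps(r) for r in rows) + "\n"

# A .dsjc-style payload is just the NDJSON bytes passed through gzip.
payload = gzip.compress(ndjson.encode("utf-8"))

# Reading reverses the two steps: gunzip, then parse line by line.
decoded = [
    json.loads(line)
    for line in gzip.decompress(payload).decode("utf-8").splitlines()
]
```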

API Reference

read_dataset(source, *, as_pandas=False, out_metadata=False, empty_to_null=True)

Read Dataset-JSON, NDJSON, or DSJC from a path, bytes object, or file-like object. It returns a pyarrow.Table by default, or a pandas DataFrame when as_pandas=True. Set out_metadata=True to also receive pyreadstat-compatible metadata. By default, empty strings in string-like columns are converted to null on read; pass empty_to_null=False to keep them as empty strings.

write_dataset(frame, destination, metadata=None, *, output_format=None, define_xml=None, readstat_meta=None, compression_level=6, json_indent=2)

Write a pyarrow.Table or pandas DataFrame as JSON, NDJSON, or DSJC. The output format is inferred from the file suffix unless you pass output_format explicitly. Metadata can come from metadata, define.xml, Arrow schema metadata, or readstat_meta.

detect_format(source)

Inspect a source and return a lightweight report describing the detected format. This is useful when you want to check the input before reading or validating it.

validate_dataset(source, *, validate_metadata=True, validate_data=True)

Validate a dataset and return a diagnostic report. You can validate only metadata, or validate both metadata and row data.

build_metadata(*, frame=None, metadata=None, define_xml=None, readstat_meta=None)

Merge metadata from the available sources and return a validated Dataset-JSON metadata dictionary. Use this when you want to inspect or prepare export metadata before writing rows.

build_schema(source_or_dataset)

Build a pyarrow.Schema from a Dataset-JSON source or from a Dataset-JSON payload dictionary. Use this when you need the inferred Arrow schema without loading the dataset rows.

Common Errors

Case                                             | Example message
Unsupported input source                         | unsupported input source
Empty file or payload                            | empty input payload
Format detection failed                          | could not determine dataset format
Invalid DSJC payload                             | invalid DSJC payload
Missing required metadata                        | missing required field: label
JSON Schema validation failed                    | validation failed at records: ...
Decimal column missing required targetDataType   | targetDataType is required for dataType decimal
Invalid targetDataType and dataType combination  | targetDataType is not allowed for dataType float
Row shape does not match columns                 | row does not match columns schema
records does not match the actual row count      | records does not match actual row count
Value conversion failed                          | failed to convert value for column TRTSDT
Unsupported Dataset-JSON dataType                | unsupported column type: binary
pandas output requested without pandas installed | pandas support is not installed

Common Export Errors

Case                                    | Example message
metadata is not a dictionary            | metadata must be a dictionary
Required export metadata is missing     | missing required export metadata: label
Unexpected metadata key                 | unexpected metadata fields: unexpected
Invalid datasetJSONCreationDateTime     | datasetJSONCreationDateTime has invalid format
Invalid datasetJSONVersion              | datasetJSONVersion must be 1.1.x
Incomplete sourceSystem object          | sourceSystem requires name and version
Unexpected column metadata key          | unexpected column metadata fields: badField
Unsupported targetDataType              | unsupported targetDataType: float
Decimal value exceeds configured length | decimal value exceeds configured length
Unsupported Arrow or pandas type        | unsupported Arrow type
Invalid Define-XML syntax               | failed to parse define.xml
Missing itemGroupOID in Define-XML      | itemGroupOID not found in define.xml

Development

See AGENTS.md.

License

This project is licensed under the AGPL.

Code from this repository that is provided to an AI system, and code produced from that input, is treated as a derivative work. Redistributing an AI-based reimplementation without preserving this license is considered copyright infringement.

This project is developed and maintained independently.
To help keep it maintained, consider supporting it through sponsorship or by engaging me for contract work.
Contact: info@knworx.com.

tests/data/official_example and dsjframe/schema are taken from https://github.com/cdisc-org/DataExchange-DatasetJson (Copyright (c) 2022 cdisc) under the MIT license.
