Convert SAS datasets to Dataset-JSON v1.1 format

dsjconvert

dsjconvert is a Python package and CLI tool for bidirectional conversion between SAS V5 XPORT (XPT) and Dataset-JSON v1.1. It can also convert SAS7BDAT files to Dataset-JSON. Both JSON and NDJSON (newline-delimited JSON) are supported; NDJSON is the default because it can be streamed row by row.

Features

  • Bidirectional Conversion:
    • Convert XPT and SAS7BDAT files to Dataset-JSON (forward)
    • Convert Dataset-JSON back to XPT format (reverse)
  • Multiple Input Formats: XPT, SAS7BDAT, JSON, and NDJSON
  • Dual JSON Formats: JSON and NDJSON (default)
  • Flexible Metadata: Use Define-XML metadata or auto-infer from source data
  • Schema Validation: Built-in validation against Dataset-JSON schemas
  • Roundtrip Support: Full XPT → JSON → XPT conversion cycle
  • Comprehensive Logging: Configurable logging levels for debugging
  • Python Package: Use as a library in your Python code
  • CLI Tool: Command-line interface for batch conversions
  • Object-Oriented Design: Clean, maintainable codebase built from single-responsibility classes

Installation

From PyPI

pip install dsjconvert

From Source

git clone https://github.com/swhume/dataset-json.git
cd dataset-json
pip install -e .

Dependencies

  • Python 3.7+
  • pandas
  • pyreadstat
  • linkml
  • jsonschema

Quick Start

As a CLI Tool

Forward Conversion (SAS to Dataset-JSON)

Convert XPT files using defaults (NDJSON format):

dsjconvert -v -x

Convert SAS7BDAT files to JSON format:

dsjconvert -v -b --format json

Convert without Define-XML (auto-infer metadata):

dsjconvert -v -x --no-define

Reverse Conversion (Dataset-JSON to XPT)

Convert NDJSON files to XPT:

dsjconvert -v --to-xpt --input-format ndjson

Convert JSON files to XPT:

dsjconvert -v --to-xpt --input-format json

Roundtrip Conversion

# Step 1: XPT to NDJSON
dsjconvert -v -x --format ndjson -s data/xpt -p data/json

# Step 2: NDJSON back to XPT
dsjconvert -v --to-xpt --input-format ndjson -s data/json -p data/xpt_roundtrip

As a Python Library

Forward Conversion (SAS to Dataset-JSON)

from dsjconvert import XPTConverter, MetadataExtractor

# With Define-XML metadata
extractor = MetadataExtractor('path/to/define.xml')
converter = XPTConverter(
    metadata_extractor=extractor,
    output_format='ndjson',
    skip_validation=True
)
converter.convert_dataset('input.xpt', 'output_dir')

# Without Define-XML (auto-infer metadata)
converter = XPTConverter(output_format='ndjson')
converter.convert_dataset('input.xpt', 'output_dir')

Reverse Conversion (Dataset-JSON to XPT)

from dsjconvert import DatasetJSONToXPTConverter

# Convert NDJSON to XPT
converter = DatasetJSONToXPTConverter(input_format='ndjson')
converter.convert_dataset('input.ndjson', 'output_dir')

# Convert JSON to XPT
converter = DatasetJSONToXPTConverter(input_format='json')
converter.convert_dataset('input.json', 'output_dir')

# Using convenience function
from dsjconvert import convert_json_to_xpt

convert_json_to_xpt('input.ndjson', 'output_dir')

Roundtrip Conversion

from dsjconvert import XPTConverter, DatasetJSONToXPTConverter

# Step 1: XPT → Dataset-JSON
forward = XPTConverter(output_format='ndjson')
json_path = forward.convert_dataset('data/dm.xpt', 'output/json')

# Step 2: Dataset-JSON → XPT
reverse = DatasetJSONToXPTConverter(input_format='ndjson')
xpt_path = reverse.convert_dataset(json_path, 'output/xpt')

CLI Usage

Command-Line Options

| Flag | Name | Description |
|------|------|-------------|
| -h | --help | Show help message and exit |
| -p | --dsj-path | Directory for output files (default: ./data) |
| -d | --define | Path to Define-XML file (optional, forward only) |
| -s | --sas-path | Directory containing source files (default: ./data) |
| | --to-xpt | Reverse conversion: Dataset-JSON to XPT |
| -x | --xpt | Process XPT files (forward conversion) |
| -b | --sas | Process SAS7BDAT files (forward conversion) |
| -f | --format | Output format for forward conversion: 'json' or 'ndjson' (default: ndjson) |
| | --input-format | Input format for reverse conversion: 'json' or 'ndjson' (default: ndjson) |
| | --no-define | Skip Define-XML and infer metadata from data |
| | --validate | Enable schema validation (default) |
| | --no-validate | Disable schema validation |
| -v | --verbose | Enable verbose output (DEBUG level) |
| | --log-level | Set log level: DEBUG, INFO, WARNING, ERROR |
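
To make the flag combinations concrete, here is a hypothetical argparse layout that mirrors the options listed above. It is an illustrative sketch only, not the package's actual parser: the function name build_parser and the attribute names argparse derives from the flags are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not dsjconvert's own code)."""
    parser = argparse.ArgumentParser(prog="dsjconvert")  # -h/--help is added automatically
    parser.add_argument("-p", "--dsj-path", default="./data")
    parser.add_argument("-d", "--define")
    parser.add_argument("-s", "--sas-path", default="./data")
    parser.add_argument("--to-xpt", action="store_true")
    parser.add_argument("-x", "--xpt", action="store_true")
    parser.add_argument("-b", "--sas", action="store_true")
    parser.add_argument("-f", "--format", choices=["json", "ndjson"], default="ndjson")
    parser.add_argument("--input-format", choices=["json", "ndjson"], default="ndjson")
    parser.add_argument("--no-define", action="store_true")
    # --validate and --no-validate share one destination; validation is on by default.
    parser.add_argument("--validate", dest="validate", action="store_true", default=True)
    parser.add_argument("--no-validate", dest="validate", action="store_false")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


# Reverse conversion of JSON input with validation switched off.
args = build_parser().parse_args(["--to-xpt", "--input-format", "json", "--no-validate"])
print(args.to_xpt, args.input_format, args.validate)
```

Pairing --validate and --no-validate on a single destination keeps the default explicit while still letting scripts state their intent either way.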

Examples

Forward Conversion (SAS to Dataset-JSON)

Basic conversion with verbose output:

dsjconvert -v

Convert XPT files with Define-XML:

dsjconvert -v -x -d /path/to/define.xml

Convert SAS7BDAT to JSON format:

dsjconvert -v -b --format json

Custom paths:

dsjconvert -v -x \
  -d /path/to/define.xml \
  -s /path/to/sas/files \
  -p /path/to/output

Convert without Define-XML:

dsjconvert -v -x --no-define

Reverse Conversion (Dataset-JSON to XPT)

Convert NDJSON files to XPT:

dsjconvert -v --to-xpt --input-format ndjson

Convert JSON files to XPT with custom paths:

dsjconvert -v --to-xpt \
  --input-format json \
  -s /path/to/json/files \
  -p /path/to/xpt/output

Disable validation during reverse conversion:

dsjconvert -v --to-xpt --input-format ndjson --no-validate

Roundtrip Example

# Convert XPT to NDJSON
dsjconvert -v -x --format ndjson -s data/xpt -p data/json

# Convert NDJSON back to XPT
dsjconvert -v --to-xpt --input-format ndjson -s data/json -p data/xpt_roundtrip

# Compare original and roundtrip files
# Both should contain identical data

Output Formats

JSON Format

Traditional JSON format with all data in a single object:

{
  "datasetJSONCreationDateTime": "2025-01-04T16:23:52",
  "datasetJSONVersion": "1.1.0",
  "name": "DM",
  "label": "Demographics",
  "columns": [{"...": "..."}],
  "rows": [
    ["value1", "value2", "..."],
    ["value1", "value2", "..."]
  ]
}

NDJSON Format (Default)

Newline-delimited JSON optimized for streaming:

{"datasetJSONCreationDateTime":"2025-01-04T16:23:52","datasetJSONVersion":"1.1.0","name":"DM","columns":[...]}
["value1", "value2", "..."]
["value1", "value2", "..."]

Line 1 contains the metadata; each subsequent line contains one row as a JSON array. This format allows large datasets to be streamed without loading everything into memory.
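
The line-oriented layout can be consumed with the standard library alone. The following is a minimal sketch of a streaming reader, assuming the layout described above (metadata on line 1, one JSON array per row after that). The function name read_ndjson and the sample values are illustrative and not part of dsjconvert.

```python
import io
import json
from typing import Any, Dict, Iterator, List, TextIO, Tuple


def read_ndjson(stream: TextIO) -> Tuple[Dict[str, Any], Iterator[List[Any]]]:
    """Return (metadata, row iterator) for a Dataset-JSON NDJSON stream."""
    metadata = json.loads(stream.readline())  # line 1: dataset metadata

    def rows() -> Iterator[List[Any]]:
        for line in stream:  # remaining lines: one JSON array per row
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)

    return metadata, rows()


# Illustrative two-row dataset held in memory; a real file handle works the same way.
sample = io.StringIO(
    '{"datasetJSONVersion":"1.1.0","name":"DM",'
    '"columns":[{"name":"USUBJID"},{"name":"AGE"}]}\n'
    '["01-001", 34]\n'
    '["01-002", null]\n'
)
metadata, row_iter = read_ndjson(sample)
names = [column["name"] for column in metadata["columns"]]
records = [dict(zip(names, row)) for row in row_iter]
print(records)
```

Because rows are yielded one at a time, a caller that processes each row inside the loop never holds more than one row in memory; the list built here is only for demonstration.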

Working Without Define-XML

If Define-XML is not available, dsjconvert will automatically infer metadata from the source dataset:

  • Column names: Extracted from the dataset
  • Column labels: From SAS variable labels (if available)
  • Data types: Inferred from actual data values
  • Dataset name: Derived from filename

To explicitly skip Define-XML:

dsjconvert -v -x --no-define
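
The inference rules listed above can be sketched in a few lines. This is one plausible heuristic, not the package's actual logic. The function names, the choice to treat integer-valued floats as integer, the fallback to string for all-null columns, and the upper-casing of the file stem are all assumptions made for illustration.

```python
import math
from pathlib import Path
from typing import Any, Iterable


def infer_data_type(values: Iterable[Any]) -> str:
    """Infer 'string', 'integer', or 'double' from a column's values (hypothetical rule)."""
    saw_number = False
    saw_fraction = False
    for value in values:
        # Nulls and NaN carry no type information.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        if isinstance(value, str):
            return "string"
        saw_number = True
        if isinstance(value, float) and not value.is_integer():
            saw_fraction = True
    if saw_fraction:
        return "double"
    # Assumption: a column with no usable values falls back to string.
    return "integer" if saw_number else "string"


def dataset_name_from_path(path: str) -> str:
    """Derive a dataset name from the file name, e.g. 'data/dm.xpt' gives 'DM'."""
    return Path(path).stem.upper()


print(infer_data_type([34.0, 41.0, None]), dataset_name_from_path("data/dm.xpt"))
```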

Library Usage

Basic Conversion

from dsjconvert import XPTConverter

# Create converter
converter = XPTConverter(output_format='ndjson')

# Convert a single file
output_path = converter.convert_dataset(
    input_path='data/dm.xpt',
    output_dir='output',
    dataset_name='DM'  # Optional, inferred from filename if omitted
)

With Define-XML Metadata

from dsjconvert import XPTConverter, MetadataExtractor

# Initialize metadata extractor
extractor = MetadataExtractor('data/define.xml')

# Create converter with metadata
converter = XPTConverter(
    metadata_extractor=extractor,
    output_format='ndjson',
    skip_validation=False
)

# Convert
output_path = converter.convert_dataset('data/dm.xpt', 'output')

Convert Multiple Files

import os
from dsjconvert import SAS7BDATConverter

converter = SAS7BDATConverter(output_format='json')

# Get all SAS files
sas_dir = 'data'
sas_files = [f for f in os.listdir(sas_dir) if f.endswith('.sas7bdat')]

# Convert each file
for sas_file in sas_files:
    input_path = os.path.join(sas_dir, sas_file)
    output_path = converter.convert_dataset(input_path, 'output')
    print(f"Converted: {output_path}")

Architecture

The dsjconvert package follows object-oriented design principles:

Core Classes

Forward Conversion (SAS to Dataset-JSON)

  • DatasetConverter: Abstract base class for all converters
  • XPTConverter: Converts SAS V5 XPORT files to Dataset-JSON
  • SAS7BDATConverter: Converts SAS7BDAT files to Dataset-JSON
  • MetadataExtractor: Extracts/infers metadata from Define-XML or data
  • WriterFactory: Creates format-specific writers
  • JSONWriter: Writes traditional JSON format
  • NDJSONWriter: Writes NDJSON format

Reverse Conversion (Dataset-JSON to XPT)

  • DatasetJSONToXPTConverter: Converts Dataset-JSON files to XPT
  • ReaderFactory: Creates format-specific readers
  • JSONReader: Reads traditional JSON format
  • NDJSONReader: Reads NDJSON format
  • XPTWriter: Writes SAS V5 XPORT files using pyreadstat

Common Components

  • DatasetValidator: Validates output against schemas

Data Type Conversion

SAS values map to Dataset-JSON data types as follows:

| SAS Type | Representation | Dataset-JSON Type |
|----------|----------------|-------------------|
| Date | Days since 1960-01-01 | double |
| DateTime | Seconds since 1960-01-01 00:00:00 | double |
| Time | Seconds since midnight | double |
| Integer | Integer value | integer |
| Numeric | Float value | double |
| Character | String value | string |

Examples:

  • SAS date 0 = 1960-01-01
  • SAS datetime 43200 = 1960-01-01 12:00:00
  • SAS time 43200 = 12:00:00
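
Because dates are written as plain numbers, consumers often need to turn them back into calendar dates. The sketch below covers the date case only, using the 1960-01-01 epoch stated above. The helper names are illustrative and are not part of dsjconvert.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # day 0 for SAS date values


def sas_date_to_iso(days: int) -> str:
    """Convert a SAS date value (days since 1960-01-01) to an ISO 8601 date string."""
    return (SAS_EPOCH + timedelta(days=days)).isoformat()


def iso_to_sas_date(text: str) -> int:
    """Convert an ISO 8601 date string back to a SAS date value."""
    return (date.fromisoformat(text) - SAS_EPOCH).days


print(sas_date_to_iso(0))             # 1960-01-01
print(iso_to_sas_date("1970-01-01"))  # 3653
```

Negative values are valid and represent dates before the epoch.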

Logging

Control logging verbosity:

# Verbose mode (DEBUG level)
dsjconvert -v -x

# Explicit log level
dsjconvert --log-level INFO -x

In Python:

import logging
logging.basicConfig(level=logging.DEBUG)

Error Handling

The package provides detailed error messages:

  • DatasetReadError: Cannot read source file
  • DefineXMLParseError: Invalid Define-XML
  • SchemaValidationError: Output doesn't match schema
  • DatasetConversionError: General conversion failure

Errors are logged with context for debugging.

Testing

Run tests with existing test datasets:

# Test XPT conversion
dsjconvert -v -x -s tests -p output/test

# Test SAS7BDAT conversion
dsjconvert -v -b -s tests -p output/test

Project Structure

dataset-json/
├── src/
│   └── dsjconvert/
│       ├── __init__.py          # Package initialization
│       ├── __main__.py          # Module entry point
│       ├── cli.py               # Command-line interface
│       ├── converter.py         # Dataset converters
│       ├── metadata.py          # Metadata extraction
│       ├── writers.py           # Output writers
│       ├── validators.py        # Schema validation
│       ├── utils.py             # Utility functions
│       ├── exceptions.py        # Custom exceptions
│       └── schemas/             # JSON schemas
│           ├── dataset.schema.json
│           └── dataset-ndjson-schema.json
├── setup.py                     # Package setup
├── requirements.txt             # Dependencies
├── README.md                    # This file
├── data/                        # Default data directory
│   └── define.xml               # Define-XML metadata
├── tests/                       # Test datasets
│   └── unit/                    # Unit tests
└── docs/                        # Documentation

Reverse Conversion Details

Dataset-JSON to XPT Conversion

The reverse conversion process reads Dataset-JSON files (JSON or NDJSON format) and creates SAS V5 XPORT files:

  1. Read Dataset-JSON: Parses JSON or NDJSON file to extract metadata and row data
  2. Validate (optional): Validates against Dataset-JSON v1.1 schema
  3. Convert to DataFrame: Creates pandas DataFrame from row data with proper column names
  4. Write XPT: Uses pyreadstat to write XPT file with metadata (table name, labels, etc.)
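
Steps 1 and 3 can be illustrated without any third-party dependency. The sketch below parses a document in the JSON format shown earlier, checks that every row has one value per column, and returns column-oriented data, which is a shape that pandas.DataFrame accepts directly. The function name is hypothetical and this is not the package's implementation.

```python
import json
from typing import Any, Dict, List


def dataset_json_to_columns(text: str) -> Dict[str, List[Any]]:
    """Parse a Dataset-JSON document and return {column name: list of values}."""
    dataset = json.loads(text)
    names = [column["name"] for column in dataset["columns"]]
    rows = dataset.get("rows", [])
    for index, row in enumerate(rows):
        # A row with the wrong width would silently misalign columns, so fail early.
        if len(row) != len(names):
            raise ValueError(
                f"row {index} has {len(row)} values, expected {len(names)}"
            )
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


document = json.dumps({
    "name": "DM",
    "label": "Demographics",
    "columns": [{"name": "USUBJID"}, {"name": "AGE"}],
    "rows": [["01-001", 34], ["01-002", None]],
})
columns = dataset_json_to_columns(document)
print(columns)
```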

Metadata Preservation

The following metadata is preserved during reverse conversion:

  • Dataset name: Used as XPT table name
  • Dataset label: Used as XPT file label
  • Column names: Preserved exactly as in Dataset-JSON
  • Column labels: Preserved as variable labels in XPT
  • Data values: All data values are preserved with type integrity
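
The SAS V5 XPORT format itself limits dataset and variable names to 8 characters and labels to 40 characters, so names and labels can only be preserved exactly when they fit within those limits. How dsjconvert treats longer values is not described here; the following pre-check is a hypothetical helper for spotting such cases before conversion.

```python
from typing import Any, Dict, List

MAX_NAME_LENGTH = 8    # V5 XPORT dataset and variable names
MAX_LABEL_LENGTH = 40  # V5 XPORT dataset and variable labels


def find_v5_limit_violations(metadata: Dict[str, Any]) -> List[str]:
    """List names and labels in Dataset-JSON metadata that exceed V5 XPORT limits."""
    problems: List[str] = []

    def check(kind: str, name: str, label: str) -> None:
        if len(name) > MAX_NAME_LENGTH:
            problems.append(
                f"{kind} name {name!r} exceeds {MAX_NAME_LENGTH} characters"
            )
        if len(label) > MAX_LABEL_LENGTH:
            problems.append(
                f"{kind} {name!r} label exceeds {MAX_LABEL_LENGTH} characters"
            )

    check("dataset", metadata.get("name", ""), metadata.get("label") or "")
    for column in metadata.get("columns", []):
        check("column", column.get("name", ""), column.get("label") or "")
    return problems


metadata = {
    "name": "DM",
    "label": "Demographics",
    "columns": [
        {"name": "USUBJID", "label": "Unique Subject Identifier"},
        {"name": "VERYLONGNAME", "label": "Too long for V5"},
    ],
}
problems = find_v5_limit_violations(metadata)
print(problems)
```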

Data Type Handling

| Dataset-JSON Type | XPT Storage |
|-------------------|-------------|
| string | Character variable |
| integer | Numeric variable |
| double | Numeric variable |
| float | Numeric variable |

Note: SAS date/time conversions (if needed) can be handled by the metadata or post-processing.
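
The type handling above amounts to a small lookup from a column's dataType attribute. The sketch below encodes only the four types listed; the function name and the decision to raise an error for any other type are assumptions, since the behavior for other Dataset-JSON types is not documented here.

```python
from typing import Any, Dict

# Only the mappings documented above.
XPT_STORAGE_BY_DATA_TYPE = {
    "string": "character",
    "integer": "numeric",
    "double": "numeric",
    "float": "numeric",
}


def xpt_storage_for(column: Dict[str, Any]) -> str:
    """Return 'character' or 'numeric' for a Dataset-JSON column definition."""
    data_type = column.get("dataType")
    if data_type not in XPT_STORAGE_BY_DATA_TYPE:
        raise ValueError(f"No XPT storage documented for dataType {data_type!r}")
    return XPT_STORAGE_BY_DATA_TYPE[data_type]


print(xpt_storage_for({"name": "AGE", "dataType": "integer"}))  # numeric
```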

Roundtrip Fidelity

The package supports full roundtrip conversions (XPT → JSON → XPT) with high fidelity:

  • ✅ Row data is preserved exactly
  • ✅ Column names and order are preserved
  • ✅ Column labels are preserved
  • ✅ Null values are preserved
  • ✅ Numeric precision is preserved (within XPT format limitations)
  • ✅ String data is preserved
  • ⚠️ Some XPT-specific metadata may not roundtrip (e.g., formats, informats)

See the roundtrip tests in tests/unit/test_roundtrip.py for detailed examples.
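
A roundtrip check along these lines can be sketched independently of the package. The helper below compares two row sets, treating nulls as equal only to nulls and comparing numbers within a relative tolerance to allow for XPT's floating-point representation. The function names and the default tolerance are illustrative choices, not values taken from the package's tests.

```python
import math
from typing import Any, Optional, Sequence, Tuple


def values_match(a: Any, b: Any, rel_tol: float = 1e-12) -> bool:
    """Compare two cell values: nulls must match nulls, numbers within a tolerance."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def first_mismatch(
    original: Sequence[Sequence[Any]],
    roundtrip: Sequence[Sequence[Any]],
    rel_tol: float = 1e-12,
) -> Optional[Tuple[int, int]]:
    """Return (row, column) of the first differing cell, or None if all cells match."""
    if len(original) != len(roundtrip):
        raise ValueError("row counts differ")
    for r, (row_a, row_b) in enumerate(zip(original, roundtrip)):
        if len(row_a) != len(row_b):
            raise ValueError(f"row {r} has a different number of values")
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if not values_match(a, b, rel_tol):
                return (r, c)
    return None


before = [["01-001", 34, None], ["01-002", 1.0, "M"]]
after = [["01-001", 34.0, None], ["01-002", 1.0000000000000002, "M"]]
print(first_mismatch(before, after))
```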

Limitations

  • No support for ADaM targetDataType integer dates (coming soon)
  • Not optimized for very large datasets, >1GB (coming soon)
  • XPT format-specific metadata (formats, informats) may not be preserved in roundtrip

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

MIT License - see LICENSE.md for details

Changelog

Version 0.9.1 (Current)

  • Refactored to object-oriented design
  • Added NDJSON format support (now default)
  • Replaced XSLT with Python code
  • Added comprehensive logging
  • Made Define-XML optional
  • Improved error handling
  • Runs as a Python package or CLI tool
  • Added CLI enhancements
  • Reduced method complexity and nesting
  • Added bidirectional conversion: Dataset-JSON to XPT (reverse conversion)
  • Added roundtrip support (XPT → JSON → XPT)
  • Added comprehensive unit tests

Version 0.8.0

  • Initial release
  • Basic XPT/SAS7BDAT to JSON conversion
  • XSLT-based metadata extraction
  • Required Define-XML
