A schema conversion toolkit for JSON, Spark, PyIceberg and SQL formats.
SchemaWorks
SchemaWorks is a Python library for converting between different schema definitions, such as JSON Schema, Spark DataTypes, SQL type strings, and more. It aims to simplify working with structured data across multiple data engineering and analytics platforms.
📣 New in 1.2.0
Added support for creating Iceberg schemas for use with PyIceberg.
🔧 Features
- Convert JSON Schema to:
- Apache Spark StructType
- SQL column type strings
- Python dtypes dictionaries
- Iceberg types (using PyIceberg)
- Convert Spark schemas and dtypes to JSON Schema
- Generate JSON Schemas from example data
- Flatten nested schemas for easier inspection or mapping
- Utilities for handling Decimal encoding and schema inference
🚀 Use Cases
- Building pipelines that consume or produce data in multiple formats
- Ensuring schema consistency across Spark, SQL, and data validation layers
- Automating schema generation from sample data for prototyping
- Simplifying developer tooling with schema introspection
🔍 Validation Support
SchemaWorks includes custom schema validation support through extended JSON Schema validators. It supports standard types like string, integer, array, and also recognises additional types common in data engineering workflows:
- Extended support for: float, bool, long, date, datetime, time, map
Validation is performed using an enhanced version of jsonschema.Draft202012Validator that integrates these type checks.
🚚 Installation
You can install SchemaWorks using pip or poetry, depending on your preference.
Using pip
Make sure you’re using Python 3.10 or later.
pip install schemaworks
This will install the package along with its core dependencies.
Using Poetry
If you use Poetry for dependency management:
poetry add schemaworks
To install development dependencies as well (for testing and linting):
poetry install --with dev
Cloning the Repository (For Development)
If you want to clone and develop the package locally:
git clone https://github.com/anatol-ju/schemaworks.git
cd schemaworks
poetry install --with dev
pre-commit install # optional: enable linting and formatting checks
To run the test suite:
poetry run pytest
🧱 Quick Example
from schemaworks.converter import JsonSchemaConverter
# Load a JSON schema
schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "purchase": {
            "type": "object",
            "properties": {
                "item": {"type": "string"},
                "price": {"type": "number"}
            }
        }
    }
}
converter = JsonSchemaConverter(schema=schema)
# Convert to Spark schema
spark_schema = converter.to_spark_schema()
print(spark_schema)
# Convert to SQL string
sql_schema = converter.to_sql_string()
print(sql_schema)
📖 Documentation
- JSON ↔ Spark conversions: Map JSON Schema types to Spark StructTypes and back.
- Schema flattening: Flatten nested schemas into dot notation for easier access and mapping.
- Data-driven schema inference: Automatically generate JSON Schemas from raw data samples.
- Decimal compatibility: A custom JSON encoder to handle decimal.Decimal values safely.
- Schema validation: Validate schemas and make data conform if needed.
🧪 Testing
Run unit tests using pytest:
poetry run pytest
⭐ Examples
✅ Convert JSON schema to Spark StructType
When working with data pipelines, it’s common to receive schemas in JSON format — whether from APIs, data contracts, or auto-generated metadata. But tools like Apache Spark and PySpark require their own schema definitions in the form of StructType. Manually translating between these formats is error-prone, time-consuming, and doesn’t scale. This function bridges that gap by automatically converting standard JSON Schemas into Spark-compatible schemas, saving hours of manual effort and reducing the risk of type mismatches in production pipelines.
from schemaworks import JsonSchemaConverter
json_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number"}
    }
}
converter = JsonSchemaConverter(schema=json_schema)
spark_schema = converter.to_spark_schema()
print(spark_schema)
✅ Infer schema from example JSON data
When working with dynamic or loosely structured data sources, manually writing a schema can be tedious and error-prone—especially when dealing with deeply nested or inconsistent inputs. This function allows you to infer a valid JSON Schema directly from real example data, making it much faster to prototype, validate, or document your datasets. It’s particularly useful when onboarding new datasets or integrating third-party APIs, where a formal schema may be missing or outdated.
import json
from pprint import pprint
from schemaworks.utils import generate_schema
with open("example_data.json", "r") as f:
    example_data = json.load(f)
schema = generate_schema(example_data, add_required=True)
pprint(schema)
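For intuition, the core idea behind inference can be sketched in plain Python. This is an illustrative stand-in, not the actual generate_schema implementation, which also handles merging inconsistent records and the add_required flag:

```python
def infer_schema(value):
    # Map a Python value to a JSON Schema fragment by inspecting its type.
    if isinstance(value, bool):  # bool must be checked before int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        # Use the first element as a representative item type.
        items = infer_schema(value[0]) if value else {}
        return {"type": "array", "items": items}
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer_schema(v) for k, v in value.items()},
        }
    return {}

example = {"id": 1, "tags": ["a", "b"], "meta": {"score": 0.5}}
print(infer_schema(example))
```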
✅ Flatten a nested schema
Flattening a nested JSON schema makes it easier to map fields to flat tabular structures, such as SQL tables or Spark DataFrames. It simplifies downstream processing, column selection, and validation—especially when working with deeply nested APIs or hierarchical datasets.
converter.json_schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "contact": {
            "type": "object",
            "properties": {
                "email": {"type": "string"},
                "phone": {"type": "string"}
            }
        },
        "active": {"type": "boolean"}
    },
    "required": ["user_id", "email"]
}
flattened = converter.to_flat()
pprint(flattened)
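To illustrate what dot-notation flattening does conceptually, here is a hand-rolled sketch. It is not SchemaWorks' own implementation, and the exact key and value format that to_flat() produces may differ:

```python
def flatten_properties(schema, prefix=""):
    # Walk nested "object" schemas and emit {"a.b.c": type} pairs.
    flat = {}
    for name, sub in schema.get("properties", {}).items():
        key = f"{prefix}{name}"
        if sub.get("type") == "object":
            flat.update(flatten_properties(sub, prefix=f"{key}."))
        else:
            flat[key] = sub.get("type")
    return flat

nested = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "contact": {
            "type": "object",
            "properties": {
                "email": {"type": "string"},
                "phone": {"type": "string"}
            }
        }
    }
}
print(flatten_properties(nested))
# {'user_id': 'integer', 'contact.email': 'string', 'contact.phone': 'string'}
```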
✅ Convert inferred schema to SQL column types
After inferring or converting a schema, it's often necessary to express it in SQL-friendly syntax—for example, when creating tables or validating incoming data. This method translates a JSON schema into a SQL column type definition string, which is especially helpful for building integration scripts, automating ETL jobs, or generating documentation.
pprint(converter.to_sql_string())
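The underlying idea is a straightforward type mapping. A rough sketch follows; the type names and output format that to_sql_string() actually produces may differ:

```python
# Hypothetical mapping from JSON Schema scalar types to SQL column types.
SQL_TYPES = {
    "integer": "BIGINT",
    "number": "DOUBLE",
    "string": "STRING",
    "boolean": "BOOLEAN",
}

def to_sql_columns(schema):
    # Render a flat object schema as "name TYPE, ..." column definitions.
    parts = []
    for name, sub in schema.get("properties", {}).items():
        parts.append(f"{name} {SQL_TYPES.get(sub.get('type'), 'STRING')}")
    return ", ".join(parts)

flat_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "price": {"type": "number"},
        "name": {"type": "string"}
    }
}
print(to_sql_columns(flat_schema))  # id BIGINT, price DOUBLE, name STRING
```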
✅ Convert to Apache Iceberg Schema
You can now (as of version 1.2.0) convert a JSON Schema directly into an Iceberg-compatible schema using PyIceberg:
from schemaworks.converter import JsonSchemaConverter
json_schema = {
    "type": "object",
    "properties": {
        "uid": {"type": "string"},
        "details": {
            "type": "object",
            "properties": {
                "score": {"type": "number"},
                "active": {"type": "boolean"}
            },
            "required": ["score"]
        }
    },
    "required": ["uid"]
}
converter = JsonSchemaConverter(json_schema)
iceberg_schema = converter.to_iceberg_schema()
✅ Handle decimals in JSON safely
SchemaWorks ships a custom encoder that converts Decimal objects to int or float for JSON serialization, avoiding the serialization errors that the standard encoder raises for Decimal values. Note that it does not preserve full precision, since conversion uses the built-in float and int types.
from schemaworks.utils import DecimalEncoder
from decimal import Decimal
import json
data = {"price": Decimal("19.99")}
print(json.dumps(data, cls=DecimalEncoder)) # Output: {"price": 19.99}
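If you want the same behaviour without the package, an equivalent encoder is easy to write with the standard library. This is a sketch; SchemaWorks' DecimalEncoder may differ in detail:

```python
import json
from decimal import Decimal

class SimpleDecimalEncoder(json.JSONEncoder):
    # Fall back to int for whole-number Decimals, float otherwise.
    def default(self, o):
        if isinstance(o, Decimal):
            return int(o) if o == o.to_integral_value() else float(o)
        return super().default(o)

print(json.dumps({"price": Decimal("19.99"), "qty": Decimal("3")},
                 cls=SimpleDecimalEncoder))
# {"price": 19.99, "qty": 3}
```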
✅ Validate data
from schemaworks.validators import PythonTypeValidator
schema = {
    "type": "object",
    "properties": {
        "created_at": {"type": "datetime"},
        "price": {"type": "float"},
        "active": {"type": "bool"}
    }
}
data = {
    "created_at": "2023-01-01T00:00:00",
    "price": 10.5,
    "active": True
}
validator = PythonTypeValidator()
validator.validate(data, schema)
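Conceptually, the extended type checks boil down to predicates like these. This is an illustrative sketch in plain Python, not the validator's actual internals:

```python
from datetime import datetime

def is_datetime(value):
    # Accept ISO-8601 strings such as "2023-01-01T00:00:00".
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical predicates for a few of the extended type names.
TYPE_CHECKS = {
    "float": lambda v: isinstance(v, float),
    "bool": lambda v: isinstance(v, bool),
    "datetime": is_datetime,
}

def check_types(data, schema):
    # Return the names of fields whose values fail their declared type.
    failures = []
    for name, sub in schema.get("properties", {}).items():
        check = TYPE_CHECKS.get(sub.get("type"))
        if name in data and check is not None and not check(data[name]):
            failures.append(name)
    return failures

sample_schema = {"type": "object", "properties": {
    "created_at": {"type": "datetime"},
    "price": {"type": "float"},
    "active": {"type": "bool"},
}}
sample_data = {"created_at": "2023-01-01T00:00:00", "price": 10.5, "active": True}
print(check_types(sample_data, sample_schema))  # [] -> all fields pass
```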
✅ Make data conform to schema
You can also use .conform() to enforce schema types and fill in missing values with sensible defaults:
conformed_data = validator.conform(data, schema, fill_missing=True)
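The effect of fill_missing can be pictured like this. It is a hand-rolled sketch with assumed defaults; the real conform() may choose different defaults and perform type coercion as well:

```python
# Hypothetical per-type defaults applied when fill_missing is enabled.
DEFAULTS = {"string": "", "integer": 0, "float": 0.0, "bool": False}

def conform_sketch(data, schema, fill_missing=False):
    # Copy the input and add defaults for any missing declared fields.
    result = dict(data)
    if fill_missing:
        for name, sub in schema.get("properties", {}).items():
            result.setdefault(name, DEFAULTS.get(sub.get("type")))
    return result

price_schema = {"type": "object", "properties": {
    "price": {"type": "float"},
    "active": {"type": "bool"},
}}
print(conform_sketch({"price": 10.5}, price_schema, fill_missing=True))
# {'price': 10.5, 'active': False}
```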
📄 License
This project is licensed under the MIT License.
You are free to use, modify, and distribute this software, provided that you include the original copyright notice and this permission notice in all copies or substantial portions of the software.
For full terms, see the MIT license.
🧑‍💻 Author
Anatol Jurenkow
Cloud Data Engineer | AWS Enthusiast | Iceberg Fan