A Polars plugin for JSON schema inference using genson-rs.

Polars Genson

A Polars plugin for JSON schema inference from string columns using genson-rs. Infer both JSON schemas and Polars schemas directly from JSON data.

Installation

pip install "polars-genson[polars]"

On older CPUs that the standard Polars build does not support, install the LTS CPU variant instead:

pip install "polars-genson[polars-lts-cpu]"

Features

  • JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
  • Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
  • Multiple JSON Objects: Handle columns with varying JSON schemas across rows
  • Complex Types: Support for nested objects, arrays, and mixed types
  • Flexible Input: Support for both single JSON objects and arrays of objects
  • Polars Integration: Native Polars plugin with familiar API

Usage

The plugin adds a genson namespace to Polars DataFrames for schema inference.

Quick Start

import polars as pl
import polars_genson
import json

# Create a DataFrame with JSON strings
df = pl.DataFrame({
    "json_data": [
        '{"name": "Alice", "age": 30, "scores": [95, 87]}',
        '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
        '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
    ]
})

print("Input DataFrame:")
print(df)
shape: (3, 1)
┌─────────────────────────────────┐
│ json_data                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ {"name": "Alice", "age": 30, "… │
│ {"name": "Bob", "age": 25, "ci… │
│ {"name": "Charlie", "age": 35,… │
└─────────────────────────────────┘

JSON Schema Inference

# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")

print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))
{
  "$schema": "http://json-schema.org/schema#",
  "properties": {
    "active": {
      "type": "boolean"
    },
    "age": {
      "type": "integer"
    },
    "city": {
      "type": "string"
    },
    "metadata": {
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": [
        "role"
      ],
      "type": "object"
    },
    "name": {
      "type": "string"
    },
    "scores": {
      "items": {
        "type": "integer"
      },
      "type": "array"
    }
  },
  "required": [
    "age",
    "name"
  ],
  "type": "object"
}

Note that the fields you get back in both the properties and required subkeys are alphabetised.
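
The merged schema above lists every key seen in any row under properties, but only age and name under required, since those are the only keys present in all three rows. A minimal standard-library sketch of that union/intersection rule follows. It is an illustration only, not the plugin's implementation, and rows, all_keys and required are names chosen here:

```python
import json

# The three JSON strings from the Quick Start DataFrame.
rows = [
    '{"name": "Alice", "age": 30, "scores": [95, 87]}',
    '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
    '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}',
]

# Top-level keys of each parsed row.
key_sets = [set(json.loads(row)) for row in rows]

# Keys seen in at least one row: these appear under "properties".
all_keys = sorted(set.union(*key_sets))
# Keys seen in every row: these appear under "required".
required = sorted(set.intersection(*key_sets))

print(all_keys)
print(required)
```

Sorting both lists mirrors the alphabetised order noted above.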

Polars Schema Inference

New! Directly infer Polars data types and schemas:

# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")

print("Inferred Polars schema:")
print(polars_schema)
Schema({
    'active': Boolean,
    'age': Int64,
    'city': String,
    'metadata': Struct({'role': String}),
    'name': String,
    'scores': List(Int64),
})

The Polars schema inference automatically handles:

  • Complex nested structures with proper Struct types
  • Typed arrays like List(Int64), List(String)
  • Mixed data types (integers, floats, booleans, strings)
  • Optional fields present in some but not all objects
  • Deep nesting with multiple levels of structure
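
Comparing the JSON schema and Polars schema outputs above suggests how the types correspond: integer to Int64, string to String, boolean to Boolean, object to Struct and array to List. The sketch below walks a JSON schema dict and renders that correspondence as text, using only the standard library. It illustrates the mapping seen in this README and is not the plugin's conversion code; describe_dtype is a hypothetical helper, and types not shown above (such as number) are reported as unmapped rather than guessed.

```python
# Scalar JSON Schema types that appear in the outputs above.
SCALARS = {"boolean": "Boolean", "integer": "Int64", "string": "String"}


def describe_dtype(schema: dict) -> str:
    """Render the Polars dtype name suggested by a JSON schema node."""
    kind = schema.get("type")
    if kind in SCALARS:
        return SCALARS[kind]
    if kind == "array":
        inner = describe_dtype(schema.get("items", {}))
        return f"List({inner})"
    if kind == "object":
        properties = schema.get("properties", {})
        fields = ", ".join(
            f"{name!r}: {describe_dtype(sub)}"
            for name, sub in sorted(properties.items())
        )
        return "Struct({" + fields + "})"
    # Anything not seen in this README is left unmapped on purpose.
    return f"<unmapped: {kind}>"


# A cut-down version of the inferred JSON schema shown earlier.
json_schema = {
    "type": "object",
    "properties": {
        "age": {"type": "integer"},
        "metadata": {
            "type": "object",
            "properties": {"role": {"type": "string"}},
        },
        "scores": {"type": "array", "items": {"type": "integer"}},
    },
}

for name, sub in sorted(json_schema["properties"].items()):
    print(f"{name}: {describe_dtype(sub)}")
```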

Advanced Usage

JSON Schema Options

# Use the expression directly for more control
result = df.select(
    polars_genson.infer_json_schema(
        pl.col("json_data"),
        merge_schemas=False,  # Get individual schemas instead of merged
    ).alias("individual_schemas")
)

# Or use with different options
schema = df.genson.infer_json_schema(
    "json_data",
    ignore_outer_array=False,  # Treat top-level arrays as arrays
    ndjson=True,              # Handle newline-delimited JSON
    schema_uri="AUTO",        # Specify a schema URI
    merge_schemas=True        # Merge all schemas (default)
)

Polars Schema Options

# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
    "json_data",
    ignore_outer_array=True,  # Treat top-level arrays as streams of objects
    ndjson=False,            # Not newline-delimited JSON
    debug=False              # Disable debug output
)

# Note: merge_schemas=False not yet supported for Polars schemas

Method Reference

The genson namespace provides two main methods:

infer_json_schema(column, **kwargs) -> dict

Returns a JSON schema (as a Python dict) following the JSON Schema specification.

Parameters:

  • column: Name of the column containing JSON strings
  • ignore_outer_array: Whether to treat top-level arrays as streams of objects (default: True)
  • ndjson: Whether to treat input as newline-delimited JSON (default: False)
  • merge_schemas: Whether to merge schemas from all rows (default: True)
  • schema_uri: Schema URI to use for the inferred schema (for example "AUTO")
  • debug: Whether to print debug information (default: False)

infer_polars_schema(column, **kwargs) -> pl.Schema

Returns a Polars schema with native data types for direct use with Polars operations.

Parameters:

  • column: Name of the column containing JSON strings
  • ignore_outer_array: Whether to treat top-level arrays as streams of objects (default: True)
  • ndjson: Whether to treat input as newline-delimited JSON (default: False)
  • debug: Whether to print debug information (default: False)

Note: merge_schemas=False is not yet supported for Polars schema inference.
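
To make the ignore_outer_array and ndjson parameters listed above more concrete, here is a standard-library sketch of how one string cell could be split into the JSON values that feed schema inference. It only illustrates the documented meaning of the two flags; it is not the plugin's parser, and iter_values is a hypothetical helper.

```python
import json
from typing import Any, Iterator


def iter_values(cell: str, ignore_outer_array: bool = True,
                ndjson: bool = False) -> Iterator[Any]:
    """Yield the JSON values that a single string cell contributes."""
    # With ndjson, every non-empty line is its own JSON document.
    documents = cell.splitlines() if ndjson else [cell]
    for document in documents:
        if not document.strip():
            continue
        value = json.loads(document)
        if ignore_outer_array and isinstance(value, list):
            # Treat a top-level array as a stream of its elements.
            yield from value
        else:
            yield value


array_cell = '[{"a": 1}, {"a": 2}]'
ndjson_cell = '{"a": 1}\n{"b": 2}\n'

print(list(iter_values(array_cell)))
print(list(iter_values(array_cell, ignore_outer_array=False)))
print(list(iter_values(ndjson_cell, ndjson=True)))
```

With the default ignore_outer_array=True, the array cell contributes two objects; with False it contributes one array value.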

Examples

Working with Complex JSON

# Complex nested JSON with arrays of objects
df = pl.DataFrame({
    "complex_json": [
        '{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
        '{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
    ]
})

schema = df.genson.infer_polars_schema("complex_json")
print(schema)
Schema({
    'posts': List(Struct({'likes': Int64, 'title': String})),
    'user': Struct({
        'profile': Struct({
            'name': String, 
            'preferences': Struct({'theme': String})
        })
    }),
})

Using Inferred Schema

# You can use the inferred schema for validation or DataFrame operations
# (df here is the Quick Start DataFrame with the "json_data" column)
inferred_schema = df.genson.infer_polars_schema("json_data")

# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
    print(f"  {name}: {dtype}")
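
As one illustration of using an inferred schema for validation, the sketch below checks a parsed record against a JSON schema dict of the shape shown earlier, looking only at required keys and top-level type values. It uses the standard library alone and is far from a full JSON Schema validator; check_record and PY_TYPES are hypothetical names introduced here, and the sample records are made up.

```python
import json

# Python types accepted for each JSON Schema type used in this README.
PY_TYPES = {
    "boolean": bool,
    "integer": int,
    "string": str,
    "object": dict,
    "array": list,
}


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, value in record.items():
        expected = schema.get("properties", {}).get(key, {}).get("type")
        if expected not in PY_TYPES:
            continue  # unknown field or unhandled type: skipped in this sketch
        ok = isinstance(value, PY_TYPES[expected])
        # bool is a subclass of int in Python, so reject True/False for integers.
        if expected == "integer" and isinstance(value, bool):
            ok = False
        if not ok:
            problems.append(f"{key}: expected {expected}")
    return problems


schema = {
    "type": "object",
    "properties": {"age": {"type": "integer"}, "name": {"type": "string"}},
    "required": ["age", "name"],
}

print(check_record(json.loads('{"name": "Dana", "age": 41}'), schema))
print(check_record(json.loads('{"name": "Eve", "age": "unknown"}'), schema))
print(check_record(json.loads('{"age": 7}'), schema))
```

For anything beyond a quick sanity check, a dedicated JSON Schema validation library is the safer choice.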

Standalone CLI Tool

The project also includes a standalone command-line tool for JSON schema inference:

cd genson-cli
cargo run -- input.json

Or from stdin:

echo '{"name": "test", "value": 42}' | cargo run

Development

To build the project:

  1. Build the core library:

    cd genson-core
    cargo build
    
  2. Build the CLI tool:

    cd genson-cli
    cargo build
    
  3. Build the Python bindings:

    cd polars-genson-py
    maturin develop
    

See DEVELOPMENT.md for specifics.

License

MIT License
