Polars Genson

A Polars plugin for JSON schema inference from string columns using genson-rs. Infer both JSON schemas and Polars schemas directly from JSON data.

Installation

pip install polars-genson[polars]

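On older CPUs that lack the instruction sets required by the default Polars build (such as AVX2), install the polars-lts-cpu variant instead: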

pip install polars-genson[polars-lts-cpu]

Features

  • JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
  • Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
  • Multiple JSON Objects: Handle columns with varying JSON schemas across rows
  • Complex Types: Support for nested objects, arrays, and mixed types
  • Flexible Input: Support for both single JSON objects and arrays of objects
  • Polars Integration: Native Polars plugin with familiar API

Usage

The plugin adds a genson namespace to Polars DataFrames for schema inference.

Quick Start

import polars as pl
import polars_genson
import json

# Create a DataFrame with JSON strings
df = pl.DataFrame({
    "json_data": [
        '{"name": "Alice", "age": 30, "scores": [95, 87]}',
        '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
        '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
    ]
})

print("Input DataFrame:")
print(df)
shape: (3, 1)
┌─────────────────────────────────┐
│ json_data                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ {"name": "Alice", "age": 30, "… │
│ {"name": "Bob", "age": 25, "ci… │
│ {"name": "Charlie", "age": 35,… │
└─────────────────────────────────┘

JSON Schema Inference

# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")

print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))
{
  "$schema": "http://json-schema.org/schema#",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "scores": {
      "items": {
        "type": "integer"
      },
      "type": "array"
    },
    "city": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "metadata": {
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": [
        "role"
      ],
      "type": "object"
    }
  },
  "required": [
    "age",
    "name"
  ],
  "type": "object"
}
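
The required list in the output above contains only age and name because those are the only keys present in every row; the other properties appear in some rows only. The following is a simplified pure-Python sketch of that rule for top-level keys. It is not the plugin's implementation (the inference itself is done by genson-rs), and the helper name merge_required is hypothetical.

```python
import json

rows = [
    '{"name": "Alice", "age": 30, "scores": [95, 87]}',
    '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
    '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}',
]


def merge_required(json_rows):
    """Return (all top-level keys seen, keys present in every row).

    Hypothetical helper for illustration only.
    """
    all_keys = set()
    required = None
    for raw in json_rows:
        keys = set(json.loads(raw))
        all_keys |= keys  # union: every key becomes a property
        # intersection: only keys seen in every row stay required
        required = keys if required is None else required & keys
    return sorted(all_keys), sorted(required or [])


properties, required = merge_required(rows)
print(properties)
print(required)
```

The union of keys gives the properties of the merged schema, while the intersection gives its required list.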

Polars Schema Inference

Directly infer Polars data types and schemas:

# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")

print("Inferred Polars schema:")
print(polars_schema)
Schema({
    'name': String,
    'age': Int64,
    'scores': List(Int64),
    'city': String,
    'active': Boolean,
    'metadata': Struct({'role': String}),
})

The Polars schema inference automatically handles:

  • Complex nested structures with proper Struct types
  • Typed arrays like List(Int64), List(String)
  • Mixed data types (integers, floats, booleans, strings)
  • Optional fields present in some but not all objects
  • Deep nesting with multiple levels of structure
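
To make the correspondence between the two outputs above concrete, here is a rough sketch of how a JSON Schema node could be mapped to a Polars dtype name. It works on plain strings so it runs without Polars installed. The function polars_type_name is hypothetical, and the number to Float64 mapping is an assumption; only String, Int64, Boolean, List and Struct appear in the outputs shown in this document.

```python
def polars_type_name(schema):
    """Map a simplified JSON Schema node to a Polars dtype name (as a string).

    Hypothetical helper for illustration; not part of polars-genson.
    """
    scalar = {
        "string": "String",
        "integer": "Int64",
        "number": "Float64",  # assumption: not shown in the README output
        "boolean": "Boolean",
    }
    kind = schema.get("type")
    if isinstance(kind, str) and kind in scalar:
        return scalar[kind]
    if kind == "array":
        # Arrays become List(<item type>)
        return f"List({polars_type_name(schema.get('items', {}))})"
    if kind == "object":
        # Objects become Struct({<field>: <type>, ...})
        fields = ", ".join(
            f"{name!r}: {polars_type_name(sub)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return f"Struct({{{fields}}})"
    return "Unknown"  # cases this sketch does not cover


json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "scores": {"type": "array", "items": {"type": "integer"}},
        "metadata": {
            "type": "object",
            "properties": {"role": {"type": "string"}},
        },
    },
}

for field, node in json_schema["properties"].items():
    print(f"{field}: {polars_type_name(node)}")
```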

Advanced Usage

JSON Schema Options

# Use the expression directly for more control
result = df.select(
    polars_genson.infer_json_schema(
        pl.col("json_data"),
        merge_schemas=False,  # Get individual schemas instead of merged
    ).alias("individual_schemas")
)

# Or use with different options
schema = df.genson.infer_json_schema(
    "json_data",
    ignore_outer_array=False,  # Treat top-level arrays as arrays
    ndjson=True,              # Handle newline-delimited JSON
    schema_uri="AUTO",        # Specify a schema URI
    merge_schemas=True        # Merge all schemas (default)
)

Polars Schema Options

# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
    "json_data",
    ignore_outer_array=True,  # Treat top-level arrays as streams of objects
    ndjson=False,            # Not newline-delimited JSON
    debug=False              # Disable debug output
)

# Note: merge_schemas=False not yet supported for Polars schemas

Method Reference

The genson namespace provides two main methods:

infer_json_schema(column, **kwargs) -> dict

Returns a JSON schema (as a Python dict) following the JSON Schema specification.

Parameters:

  • column: Name of the column containing JSON strings
  • ignore_outer_array: Whether to treat top-level arrays as streams of objects (default: True)
  • ndjson: Whether to treat input as newline-delimited JSON (default: False)
  • merge_schemas: Whether to merge schemas from all rows (default: True)
  • schema_uri: Schema URI to record in the schema's $schema field (the JSON Schema Options example passes "AUTO")
  • debug: Whether to print debug information (default: False)

infer_polars_schema(column, **kwargs) -> pl.Schema

Returns a Polars schema with native data types for direct use with Polars operations.

Parameters:

  • column: Name of the column containing JSON strings
  • ignore_outer_array: Whether to treat top-level arrays as streams of objects (default: True)
  • ndjson: Whether to treat input as newline-delimited JSON (default: False)
  • debug: Whether to print debug information (default: False)

Note: merge_schemas=False is not yet supported for Polars schema inference.
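
The ignore_outer_array and ndjson parameters listed above control how each string is split into JSON documents before inference. The sketch below illustrates the idea in pure Python. The function iter_documents is hypothetical, and how the plugin combines the two flags internally is an assumption here.

```python
import json


def iter_documents(text, ignore_outer_array=True, ndjson=False):
    """Yield the JSON values that inference would conceptually look at.

    Hypothetical helper for illustration; not part of polars-genson.
    """
    if ndjson:
        # Newline-delimited JSON: one document per non-empty line.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
        return
    value = json.loads(text)
    if ignore_outer_array and isinstance(value, list):
        # A top-level array is treated as a stream of its elements.
        yield from value
    else:
        # Otherwise the value is a single document, array or not.
        yield value


array_text = '[{"a": 1}, {"a": 2}]'
as_stream = list(iter_documents(array_text))
as_array = list(iter_documents(array_text, ignore_outer_array=False))
from_lines = list(iter_documents('{"a": 1}\n{"b": 2}\n', ndjson=True))

print(len(as_stream), len(as_array), len(from_lines))
```

With ignore_outer_array=True the array yields two object documents, so the inferred schema describes the objects; with ignore_outer_array=False it yields one array document, so the inferred schema describes an array.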

Examples

Working with Complex JSON

# Complex nested JSON with arrays of objects
df = pl.DataFrame({
    "complex_json": [
        '{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
        '{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
    ]
})

schema = df.genson.infer_polars_schema("complex_json")
print(schema)
Schema({
    'user': Struct({
        'profile': Struct({
            'name': String, 
            'preferences': Struct({'theme': String})
        })
    }),
    'posts': List(Struct({'likes': Int64, 'title': String})),
})

Using Inferred Schema

# You can use the inferred schema for validation or DataFrame operations
# (df here is the Quick Start DataFrame with the "json_data" column)
inferred_schema = df.genson.infer_polars_schema("json_data")

# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
    print(f"  {name}: {dtype}")
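
As one example of the validation use mentioned above, the required list of an inferred JSON schema can be used to flag incoming rows that lack a required key. This is a minimal sketch using only the standard library; the helper missing_required is hypothetical, and full validation would need a dedicated JSON Schema validator.

```python
import json

# Subset of the JSON schema inferred in the Quick Start example.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["age", "name"],
}


def missing_required(raw, schema):
    """Return the required keys absent from one JSON object string.

    Hypothetical helper for illustration; checks key presence only.
    """
    doc = json.loads(raw)
    return [key for key in schema.get("required", []) if key not in doc]


complete = missing_required('{"name": "Dana", "age": 41}', schema)
incomplete = missing_required('{"name": "Eve"}', schema)
print(complete, incomplete)
```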

Standalone CLI Tool

The project also includes a standalone command-line tool for JSON schema inference:

cd genson-cli
cargo run -- input.json

Or from stdin:

echo '{"name": "test", "value": 42}' | cargo run

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

MIT License

  • Contains a vendored and slightly adapted copy of the Apache 2.0 licensed fork of the genson-rs crate
