Polars Genson

A Polars plugin for JSON schema inference from string columns using genson-rs. Infer both JSON schemas and Polars schemas directly from JSON data.

Installation

pip install polars-genson[polars]

On older CPUs, install the LTS CPU build instead:

pip install polars-genson[polars-lts-cpu]

Features

  • JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
  • Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
  • Multiple JSON Objects: Handle columns with varying JSON schemas across rows
  • Complex Types: Support for nested objects, arrays, and mixed types
  • Flexible Input: Support for both single JSON objects and arrays of objects
  • Polars Integration: Native Polars plugin with familiar API

Usage

The plugin adds a genson namespace to Polars DataFrames for schema inference.

Quick Start

import polars as pl
import polars_genson
import json

# Create a DataFrame with JSON strings
df = pl.DataFrame({
    "json_data": [
        '{"name": "Alice", "age": 30, "scores": [95, 87]}',
        '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
        '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
    ]
})

print("Input DataFrame:")
print(df)
shape: (3, 1)
┌─────────────────────────────────┐
│ json_data                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ {"name": "Alice", "age": 30, "… │
│ {"name": "Bob", "age": 25, "ci… │
│ {"name": "Charlie", "age": 35,… │
└─────────────────────────────────┘

JSON Schema Inference

# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")

print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))
{
  "$schema": "http://json-schema.org/schema#",
  "properties": {
    "active": {
      "type": "boolean"
    },
    "age": {
      "type": "integer"
    },
    "city": {
      "type": "string"
    },
    "metadata": {
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": [
        "role"
      ],
      "type": "object"
    },
    "name": {
      "type": "string"
    },
    "scores": {
      "items": {
        "type": "integer"
      },
      "type": "array"
    }
  },
  "required": [
    "age",
    "name"
  ],
  "type": "object"
}

Note that the keys under properties and the entries in required are both returned in alphabetical order, not in the order they appear in the input.
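In the output above, required lists only age and name: these are the only keys present in all three rows, while every other key appears in just one row. The standard-library sketch below reproduces that rule for top-level keys. It illustrates the behaviour visible in the output and is not the genson-rs implementation; the helper names required_keys and all_keys are hypothetical.

```python
import json

rows = [
    '{"name": "Alice", "age": 30, "scores": [95, 87]}',
    '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
    '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}',
]


def required_keys(json_rows):
    """Return the sorted keys present in every top-level object (hypothetical helper)."""
    key_sets = [set(json.loads(row)) for row in json_rows]
    common = set.intersection(*key_sets) if key_sets else set()
    return sorted(common)


def all_keys(json_rows):
    """Return the sorted union of top-level keys, mirroring the "properties" listing."""
    seen = set()
    for row in json_rows:
        seen.update(json.loads(row))
    return sorted(seen)


required = required_keys(rows)
properties = all_keys(rows)
print(required)
print(properties)
```

The same reasoning explains why role is required inside metadata: only one row contains metadata, and that row includes role.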

Polars Schema Inference

New! Directly infer Polars data types and schemas:

# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")

print("Inferred Polars schema:")
print(polars_schema)
Schema({
    'active': Boolean,
    'age': Int64,
    'city': String,
    'metadata': Struct({'role': String}),
    'name': String,
    'scores': List(Int64),
})

The Polars schema inference automatically handles:

  • Complex nested structures with proper Struct types
  • Typed arrays like List(Int64), List(String)
  • Mixed data types (integers, floats, booleans, strings)
  • Optional fields present in some but not all objects
  • Deep nesting with multiple levels of structure
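Comparing the two outputs above suggests a simple correspondence between JSON Schema types and Polars data types: integer becomes Int64, string becomes String, boolean becomes Boolean, object becomes Struct, and array becomes List. The sketch below walks a JSON schema dict and renders each property as a dtype string in that style. It is a standard-library illustration only, not the plugin's conversion code. The helper to_polars_repr is hypothetical, and the Float64 mapping for number and the String fallback for unknown types are assumptions not confirmed by the examples above.

```python
def to_polars_repr(node):
    """Render a JSON Schema node as a Polars-style dtype string (hypothetical helper)."""
    scalars = {
        "integer": "Int64",
        "string": "String",
        "boolean": "Boolean",
        "number": "Float64",  # assumption: not shown in the outputs above
    }
    kind = node.get("type")
    if kind == "object":
        # Nested objects become Struct({...}) with one entry per property.
        fields = ", ".join(
            f"'{name}': {to_polars_repr(sub)}"
            for name, sub in node.get("properties", {}).items()
        )
        return f"Struct({{{fields}}})"
    if kind == "array":
        # Arrays become List(<item type>).
        return f"List({to_polars_repr(node.get('items', {}))})"
    return scalars.get(kind, "String")  # assumption: fall back to String


# Trimmed copy of the JSON schema inferred earlier.
json_schema = {
    "type": "object",
    "properties": {
        "active": {"type": "boolean"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
        "metadata": {
            "type": "object",
            "properties": {"role": {"type": "string"}},
            "required": ["role"],
        },
        "name": {"type": "string"},
        "scores": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["age", "name"],
}

columns = {
    name: to_polars_repr(sub) for name, sub in json_schema["properties"].items()
}
for name, dtype in columns.items():
    print(f"{name}: {dtype}")
```

The required list plays no part in this mapping: optional fields such as city still get a column type.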

Advanced Usage

JSON Schema Options

# Use the expression directly for more control
result = df.select(
    polars_genson.infer_json_schema(
        pl.col("json_data"),
        merge_schemas=False,  # Get individual schemas instead of merged
    ).alias("individual_schemas")
)

# Or use with different options
schema = df.genson.infer_json_schema(
    "json_data",
    ignore_outer_array=False,  # Treat top-level arrays as arrays
    ndjson=True,              # Handle newline-delimited JSON
    schema_uri="AUTO",        # Specify a schema URI
    merge_schemas=True        # Merge all schemas (default)
)

Polars Schema Options

# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
    "json_data",
    ignore_outer_array=True,  # Treat top-level arrays as streams of objects
    ndjson=False,            # Not newline-delimited JSON
    debug=False              # Disable debug output
)

# Note: merge_schemas=False not yet supported for Polars schemas

Method Reference

The genson namespace provides two main methods:

infer_json_schema(column, **kwargs) -> dict

Returns a JSON schema (as a Python dict) following the JSON Schema specification.

Parameters:

  • column: Name of the column containing JSON strings
  • ignore_outer_array: Whether to treat top-level arrays as streams of objects (default: True)
  • ndjson: Whether to treat input as newline-delimited JSON (default: False)
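  • merge_schemas: Whether to merge schemas from all rows (default: True)
  • schema_uri: Schema URI to use for the output schema, e.g. "AUTO" as in the JSON Schema Options example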
  • debug: Whether to print debug information (default: False)

infer_polars_schema(column, **kwargs) -> pl.Schema

Returns a Polars schema with native data types for direct use with Polars operations.

Parameters:

  • column: Name of the column containing JSON strings
  • ignore_outer_array: Whether to treat top-level arrays as streams of objects (default: True)
  • ndjson: Whether to treat input as newline-delimited JSON (default: False)
  • debug: Whether to print debug information (default: False)

Note: merge_schemas=False is not yet supported for Polars schema inference.
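The ignore_outer_array and ndjson parameters control how the string in each cell is split into JSON documents before inference. The sketch below shows one reading of those descriptions using only the standard library. It is not the plugin's code, and the helper iter_documents is hypothetical.

```python
import json


def iter_documents(cell, ignore_outer_array=True, ndjson=False):
    """Yield the JSON values a schema builder would see for one cell (hypothetical helper)."""
    if ndjson:
        # Newline-delimited JSON: every non-empty line is its own document.
        chunks = [line for line in cell.splitlines() if line.strip()]
    else:
        chunks = [cell]
    for chunk in chunks:
        value = json.loads(chunk)
        if ignore_outer_array and isinstance(value, list):
            # Treat a top-level array as a stream of its elements.
            yield from value
        else:
            yield value


array_cell = '[{"id": 1}, {"id": 2}]'
ndjson_cell = '{"id": 1}\n{"id": 2, "tag": "x"}'

as_stream = list(iter_documents(array_cell))
as_array = list(iter_documents(array_cell, ignore_outer_array=False))
as_lines = list(iter_documents(ndjson_cell, ndjson=True))

print(as_stream)
print(as_array)
print(as_lines)
```

Under this reading, ignore_outer_array=True yields a schema for the objects inside the array, while ignore_outer_array=False yields a schema whose top-level type is the array itself.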

Examples

Working with Complex JSON

# Complex nested JSON with arrays of objects
df = pl.DataFrame({
    "complex_json": [
        '{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
        '{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
    ]
})

schema = df.genson.infer_polars_schema("complex_json")
print(schema)
Schema({
    'posts': List(Struct({'likes': Int64, 'title': String})),
    'user': Struct({
        'profile': Struct({
            'name': String, 
            'preferences': Struct({'theme': String})
        })
    }),
})

Using Inferred Schema

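# You can use the inferred schema for validation or DataFrame operations.
# Here df is the Quick Start DataFrame with the "json_data" column,
# not the "complex_json" DataFrame from the previous example.
inferred_schema = df.genson.infer_polars_schema("json_data")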

# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
    print(f"  {name}: {dtype}")

Standalone CLI Tool

The project also includes a standalone command-line tool for JSON schema inference:

cd genson-cli
cargo run -- input.json

Or from stdin:

echo '{"name": "test", "value": 42}' | cargo run

Development

To build the project (each step starts from the repository root):

  1. Build the core library:

    cd genson-core
    cargo build
    
  2. Build the CLI tool:

    cd genson-cli
    cargo build
    
  3. Build the Python bindings:

    cd polars-genson-py
    maturin develop
    

See DEVELOPMENT.md for specifics.

License

MIT License
