Skip to main content

A Polars plugin for JSON schema inference using genson-rs.

Project description

Polars Genson

PyPI crates.io: genson-core crates.io: polars-jsonschema-bridge Supported Python versions pre-commit.ci status

A Polars plugin for working with JSON schemas. Infer schemas from JSON data and convert between JSON Schema and Polars schema formats.

Installation

pip install polars-genson[polars]

On older CPUs run:

pip install polars-genson[polars-lts-cpu]

Features

Schema Inference

  • JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
  • Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
  • Multiple JSON Objects: Handle columns with varying JSON schemas across rows
  • Complex Types: Support for nested objects, arrays, and mixed types
  • Flexible Input: Support for both single JSON objects and arrays of objects

Schema Conversion

  • Polars → JSON Schema: Convert existing DataFrame schemas to JSON Schema format
  • JSON Schema → Polars: Convert JSON schemas to equivalent Polars schemas
  • Round-trip Support: Full bidirectional conversion with validation
  • Schema Manipulation: Validate, transform, and standardize schemas

Usage

The plugin adds a genson namespace to Polars DataFrames for schema inference and conversion.

import polars as pl
import polars_genson
import json

# Create a DataFrame with JSON strings
df = pl.DataFrame({
    "json_data": [
        '{"name": "Alice", "age": 30, "scores": [95, 87]}',
        '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
        '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
    ]
})

print("Input DataFrame:")
print(df)
shape: (3, 1)
┌─────────────────────────────────┐
 json_data                       
 ---                             
 str                             
╞═════════════════════════════════╡
 {"name": "Alice", "age": 30, "… │
 {"name": "Bob", "age": 25, "ci… │
 {"name": "Charlie", "age": 35, 
└─────────────────────────────────┘

JSON Schema Inference

# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")

print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))
{
  "$schema": "http://json-schema.org/schema#",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "scores": {
      "items": {
        "type": "integer"
      },
      "type": "array"
    }
    "city": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "metadata": {
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": [
        "role"
      ],
      "type": "object"
    },
  },
  "required": [
    "age",
    "name"
  ],
  "type": "object"
}

Polars Schema Inference

Directly infer Polars data types and schemas:

# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")

print("Inferred Polars schema:")
print(polars_schema)
Schema({
    'name': String,
    'age': Int64,
    'scores': List(Int64),
    'city': String,
    'active': Boolean,
    'metadata': Struct({'role': String}),
})

The Polars schema inference automatically handles:

  • Complex nested structures with proper Struct types
  • Typed arrays like List(Int64), List(String)
  • Mixed data types (integers, floats, booleans, strings)
  • Optional fields present in some but not all objects
  • Deep nesting with multiple levels of structure

Map vs Record Inference Control

For objects with varying keys, you can control whether they're inferred as Maps (dynamic key-value pairs) or Records (fixed fields) using the map_threshold and map_max_required_keys parameters:

# Data with different key patterns
df = pl.DataFrame({
    "json_data": [
        '{"user": {"id": 1, "name": "Alice"}, "attributes": {"source": "web", "campaign": "summer"}}',
        '{"user": {"id": 2, "name": "Bob"}, "attributes": {"source": "mobile"}}'
    ]
})

# Default: both user and attributes become Records
schema_default = df.genson.infer_json_schema("json_data")

# Lower thresholds: distinguish structured Records from dynamic Maps
schema_controlled = df.genson.infer_json_schema("json_data", 
    map_threshold=2,           # Objects with ≥2 keys can be Maps
    map_max_required_keys=1    # Maps can have ≤1 required key
)

In the controlled example:

  • user has 2 required keys (id, name) > 1 → Record (structured)
  • attributes has 1 required key (source) ≤ 1 → Map (dynamic)

This gives you fine-grained control over how objects with different key stability patterns are classified.

Root Wrapping (wrap_root)

By default, inferred schemas treat each JSON object as the root.
Sometimes you may want to wrap the schema in an extra record layer — for example, to make Avro schemas compatible with systems that require a named top-level record.

You can control this behavior with the wrap_root option:

  • wrap_root="true" → Wraps using the column name as the record name
  • wrap_root="<string>" → Wraps using the given string as the record name
  • wrap_root=None (default) → No wrapping (root is just "document" for Avro)

Example: Avro schema with wrap_root

df = pl.DataFrame({
    "json_data": [
        '{"value": "A"}',
        '{"value": "B"}'
    ]
})

schema = df.genson.infer_json_schema("json_data", avro=True, wrap_root="payload")

print(json.dumps(schema, indent=2))
{
  "type": "record",
  "name": "document",
  "namespace": "genson",
  "fields": [
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "payload",
        "namespace": "genson.document_types",
        "fields": [
          {
            "name": "value",
            "type": "string"
          }
        ]
      }
    }
  ]
}

This is especially useful when:

  • Exporting Avro to systems that require a named top-level record
  • Keeping schema names consistent with your column names or domain models

Normalisation

In addition to schema inference, polars-genson can normalise JSON columns so that every row conforms to a single, consistent Avro schema.

This is especially useful for semi-structured data where fields may be missing, empty arrays/maps may need to collapse to null, or numeric/boolean values may sometimes be encoded as strings.

Features

  • Converts empty arrays/maps to null (default)
  • Preserves empties with empty_as_null=False
  • Ensures missing fields are inserted with null
  • Supports per-field coercion of numeric/boolean strings via coerce_strings=True
  • Supports top-level schema evolution with wrap_root

Example: Map Encoding in Polars

By default, Polars cannot store a dynamic JSON object ({"en":"Hello","fr":"Bonjour"}) without exploding it into a struct with fixed fields padded with nulls.
polars-genson solves this by normalising maps to a list of key/value structs:

This representation is schema-stable and preserves all map keys without null-padding. It matches how Arrow/Parquet model Avro map types internally.

import polars as pl
import polars_genson

df = pl.DataFrame({
    "json_data": [
        '{"id": 123, "tags": [], "labels": {}, "active": true}',
        '{"id": 456, "tags": ["x","y"], "labels": {"fr":"Bonjour"}, "active": false}',
        '{"id": 789, "labels": {"en": "Hi", "es": "Hola"}}'
    ]
})

print(df.genson.normalise_json("json_data", map_threshold=0))

Output:

shape: (3, 4)
┌─────┬────────────┬──────────────────────────────┬────────┐
│ id  ┆ tags       ┆ labels                       ┆ active │
│ --- ┆ ---        ┆ ---                          ┆ ---    │
│ i64 ┆ list[str]  ┆ list[struct[2]]              ┆ bool   │
╞═════╪════════════╪══════════════════════════════╪════════╡
│ 123 ┆ null       ┆ null                         ┆ true   │
│ 456 ┆ ["x", "y"] ┆ [{"fr","Bonjour"}]           ┆ false  │
│ 789 ┆ null       ┆ [{"en","Hi"}, {"es","Hola"}] ┆ null   │
└─────┴────────────┴──────────────────────────────┴────────┘

In the example above, normalise_json reshaped jagged JSON into a consistent, schema-aligned form:

  • Row 1

    • tags was present but empty ([]) → normalised to null (this prevents row elimination when exploding the column)
    • labels was present but empty ({}) → normalised to null
    • active stayed true
  • Row 2

    • tags had two values (["x","y"]) → preserved as a list of strings
    • labels had one entry ({"fr":"Bonjour"}) → normalised to a list of one key:value struct
    • active stayed false
  • Row 3

    • tags was missing entirely → injected as null
    • labels had two entries ({"en":"Hi","es":"Hola"}) → normalised to a list of two key:value structs
    • active was missing → injected as null

Example: Empty Arrays

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})

out = df.genson.normalise_json("json_data")
print(out)

Output:

shape: (2, 1)
┌─────────────────────────────┐
│ normalised                  │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ {"labels": null}            │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: Preserving Empty Arrays

out = df.genson.normalise_json("json_data", empty_as_null=False)
print(out)

Output:

┌─────────────────────────────┐
│ normalised                  │
╞═════════════════════════════╡
│ {"labels": []}              │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: String Coercion

df = pl.DataFrame({
    "json_data": [
        '{"id": "42", "active": "true"}',
        '{"id": 7, "active": false}'
    ]
})

# Default: no coercion
print(df.genson.normalise_json("json_data").to_list())
# ['{"id": null, "active": null}', '{"id": 7, "active": false}']

# With coercion
print(df.genson.normalise_json("json_data", coerce_strings=True).to_list())
# ['{"id": 42, "active": true}', '{"id": 7, "active": false}']

Advanced Usage

Per-Row Schema Processing

  • Only available with JSON schema currently (per-row/unmerged Polars schemas TODO)
# Get individual schemas and process them
df = pl.DataFrame({
    "ABCs": [
        '{"a": 1, "b": 2}',
        '{"a": 1, "c": true}',
    ]
})

# Analyze schema variations
individual_schemas = df.genson.infer_json_schema("ABCs", merge_schemas=False)

The result is a list of one schema per row. With merge_schemas=True you would get all 3 keys (a, b, c) in a single schema.

[{'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
  'required': ['a', 'b'],
  'type': 'object'},
 {'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'c': {'type': 'boolean'}},
  'required': ['a', 'c'],
  'type': 'object'}]

JSON Schema Options

# Use the expression directly for more control
result = df.select(
    polars_genson.infer_json_schema(
        pl.col("json_data"),
        merge_schemas=False,  # Get individual schemas instead of merged
    ).alias("individual_schemas")
)

# Or use with different options
schema = df.genson.infer_json_schema(
    "json_data",
    ignore_outer_array=False,  # Treat top-level arrays as arrays
    ndjson=True,               # Handle newline-delimited JSON
    schema_uri="https://json-schema.org/draft/2020-12/schema",  # Specify a schema URI
    merge_schemas=True         # Merge all schemas (default)
)

Polars Schema Options

# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
    "json_data",
    ignore_outer_array=True,  # Treat top-level arrays as streams of objects
    ndjson=False,            # Not newline-delimited JSON
    debug=False              # Disable debug output
)

# Note: merge_schemas=False not yet supported for Polars schemas

Method Reference

The genson namespace provides three main methods:

infer_json_schema(column, **kwargs) -> dict | list[dict]

Infers a JSON Schema (or Avro, if requested) from a string column.

Parameters:

  • column: Name of the column containing JSON strings

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • schema_uri: Schema URI to embed in the output (default: "http://json-schema.org/schema#"). Ignored by some consumers when avro=True.

  • merge_schemas: Merge schemas from all rows (default: True). If False, returns one schema per row as a list.

  • debug: Print debug information (default: False)

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • map_max_required_keys: Maximum required keys for Map inference (default: None). Objects with more required keys will be forced to Record type. If None, no gating based on required key count.

  • force_field_types: Dict of per-field overrides, values must be "map" or "record". Example: {"labels": "map", "claims": "record"}

  • avro: Output Avro schema instead of JSON Schema (default: False)

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • dict when merge_schemas=True
  • list[dict] when merge_schemas=False

infer_polars_schema(column, **kwargs) -> pl.Schema

Infers a native Polars schema from a string column.

Parameters:

  • column: Name of the column containing JSON strings

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • merge_schemas: Merge schemas from all rows (default: True). (Currently the only supported mode.)

  • debug: Print debug information (default: False)

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • map_max_required_keys: Maximum required keys for Map inference (default: None). Objects with more required keys will be forced to Record type. If None, no gating based on required key count.

  • force_field_types: Dict of per-field overrides, values must be "map" or "record"

  • avro: Infer using Avro semantics (unions, maps, nullability) instead of pure JSON Schema semantics (default: False)

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • pl.Schema

Note: merge_schemas=False is not supported for Polars schema inference.

normalise_json(column, **kwargs) -> pl.DataFrame | pl.Series

Normalises each JSON string in the column against a single, inferred Avro schema. Ensures every row matches the same structure and datatypes.

Parameters:

  • column: Name of the column containing JSON strings

  • decode: If True, decode to native Polars types (default: True)

  • unnest: If decode=True, expand the decoded struct into separate columns (default: True)

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • empty_as_null: Convert empty arrays/maps to null (default: True)

  • coerce_strings: Coerce numeric/boolean strings (e.g. "42", "true") into numbers/booleans where the schema expects them (default: False)

  • map_encoding: Encoding for Avro maps: "kv" (default), "mapping", or "entries"

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • map_max_required_keys: Maximum required keys for Map inference (default: None). Objects with more required keys will be forced to Record type. If None, no gating based on required key count.

  • force_field_types: Dict of per-field overrides ("map"/"record")

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • If decode=True:

    • unnest=Truepl.DataFrame with one column per schema field
    • unnest=Falsepl.DataFrame with a single struct column
  • If decode=Falsepl.Series of normalised JSON strings

Example:

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})
out = df.genson.normalise_json("json_data")
print(out.to_list())
# ['{"labels": null}', '{"labels": {"en": "Hello"}}']

Examples

Working with Complex JSON

# Complex nested JSON with arrays of objects
df = pl.DataFrame({
    "complex_json": [
        '{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
        '{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
    ]
})

schema = df.genson.infer_polars_schema("complex_json")
print(schema)
Schema({
    'user': Struct({
        'profile': Struct({
            'name': String, 
            'preferences': Struct({'theme': String})
        })
    }),
    'posts': List(Struct({'likes': Int64, 'title': String})),
})

Using Inferred Schema

# You can use the inferred schema for validation or DataFrame operations
inferred_schema = df.genson.infer_polars_schema("json_data")

# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
    print(f"  {name}: {dtype}")

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

MIT License

  • Contains vendored and slightly adapted copy of the Apache 2.0 licensed fork of genson-rs crate

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polars_genson-0.2.0.tar.gz (23.5 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_genson-0.2.0-cp39-abi3-win_amd64.whl (6.3 MB view details)

Uploaded CPython 3.9+Windows x86-64

polars_genson-0.2.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.1 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

polars_genson-0.2.0-cp39-abi3-macosx_11_0_arm64.whl (5.1 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

polars_genson-0.2.0-cp39-abi3-macosx_10_12_x86_64.whl (5.9 MB view details)

Uploaded CPython 3.9+macOS 10.12+ x86-64

File details

Details for the file polars_genson-0.2.0.tar.gz.

File metadata

  • Download URL: polars_genson-0.2.0.tar.gz
  • Upload date:
  • Size: 23.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for polars_genson-0.2.0.tar.gz
Algorithm Hash digest
SHA256 ba726887d5eccafaa61929fa51e621c36caee23768b0284276a5c4a8463b8bde
MD5 94e04bad17d1fa2ff028fdff8fd31797
BLAKE2b-256 bd38b3526ac8ecf83941e1931f292d4c44755078e57f009ebf943e0afca48a7d

See more details on using hashes here.

File details

Details for the file polars_genson-0.2.0-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for polars_genson-0.2.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 008f458e79f98edced64189ffbc025723ed8573e1ce0bde936f1b7bb90177117
MD5 359a9e7ae413acead5fc4fe5bb183b80
BLAKE2b-256 cc85e241e873873db8a2266f5b9379aefb738ef9f6b21d77dd5fd8bd6e831f7f

See more details on using hashes here.

File details

Details for the file polars_genson-0.2.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for polars_genson-0.2.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5d9c5ae09ecc4cb77cb2a57914006edcb773c20f1f8a7a412afcf152e1bebe8b
MD5 28fdc765e4a064f4b4b6c79c61a41670
BLAKE2b-256 bbf36a2f277bef7b0e9541a0766bdcafc0c469d47e967c736e155ca4df7897ba

See more details on using hashes here.

File details

Details for the file polars_genson-0.2.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_genson-0.2.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fb92cb439eaeb4625101414e7f5170cae6cbc27e622ed7ebe40a1a6ceea89b8b
MD5 4ace2ba1a05d30f5d2a0672b951af8c2
BLAKE2b-256 600c052cc5977f22f519dbc007e6f0dfd1e7bfb876773074274dfac99b3cb1a7

See more details on using hashes here.

File details

Details for the file polars_genson-0.2.0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for polars_genson-0.2.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 42224614934555b0aec4aa994c24db05026c9f2246deca95a216900a0a253357
MD5 6a40989d9079bfdf54260ca10763c69a
BLAKE2b-256 c0ef0dabdb0ea9e449595c3cf029c7fdcd4bc0e01246174a3eeb58603fb6c8b1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page