A Polars plugin for JSON schema inference using genson-rs.

Polars Genson

A Polars plugin for working with JSON schemas. Infer schemas from JSON data and convert between JSON Schema and Polars schema formats.

Installation

pip install polars-genson[polars]

On older CPUs run:

pip install polars-genson[polars-lts-cpu]

Features

Schema Inference

  • JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
  • Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
  • Multiple JSON Objects: Handle columns with varying JSON schemas across rows
  • Complex Types: Support for nested objects, arrays, and mixed types
  • Flexible Input: Support for both single JSON objects and arrays of objects

Schema Conversion

  • Polars → JSON Schema: Convert existing DataFrame schemas to JSON Schema format
  • JSON Schema → Polars: Convert JSON schemas to equivalent Polars schemas
  • Round-trip Support: Full bidirectional conversion with validation
  • Schema Manipulation: Validate, transform, and standardize schemas

Usage

The plugin adds a genson namespace to Polars DataFrames for schema inference and conversion.

import polars as pl
import polars_genson
import json

# Create a DataFrame with JSON strings
df = pl.DataFrame({
    "json_data": [
        '{"name": "Alice", "age": 30, "scores": [95, 87]}',
        '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
        '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
    ]
})

print("Input DataFrame:")
print(df)
shape: (3, 1)
┌─────────────────────────────────┐
│ json_data                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ {"name": "Alice", "age": 30, "… │
│ {"name": "Bob", "age": 25, "ci… │
│ {"name": "Charlie", "age": 35,… │
└─────────────────────────────────┘

JSON Schema Inference

# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")

print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))
{
  "$schema": "http://json-schema.org/schema#",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "scores": {
      "items": {
        "type": "integer"
      },
      "type": "array"
    },
    "city": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "metadata": {
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": [
        "role"
      ],
      "type": "object"
    }
  },
  "required": [
    "age",
    "name"
  ],
  "type": "object"
}
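
In a merged schema, "required" lists only the keys seen in every row, so the remaining entries in "properties" are the optional fields. A minimal sketch of that reading, using a trimmed copy of the schema above as a plain dict (the helper name optional_fields is illustrative and not part of the plugin):

```python
def optional_fields(schema: dict) -> list[str]:
    """Return property names that are not listed as required, sorted."""
    required = set(schema.get("required", []))
    return sorted(name for name in schema.get("properties", {}) if name not in required)


# Trimmed copy of the inferred schema shown above.
schema = {
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "scores": {"type": "array", "items": {"type": "integer"}},
        "city": {"type": "string"},
        "active": {"type": "boolean"},
        "metadata": {"type": "object"},
    },
    "required": ["age", "name"],
    "type": "object",
}

print(optional_fields(schema))
# ['active', 'city', 'metadata', 'scores']
```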

Polars Schema Inference

Directly infer Polars data types and schemas:

# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")

print("Inferred Polars schema:")
print(polars_schema)
Schema({
    'name': String,
    'age': Int64,
    'scores': List(Int64),
    'city': String,
    'active': Boolean,
    'metadata': Struct({'role': String}),
})

The Polars schema inference automatically handles:

  • Complex nested structures with proper Struct types
  • Typed arrays like List(Int64), List(String)
  • Mixed data types (integers, floats, booleans, strings)
  • Optional fields present in some but not all objects
  • Deep nesting with multiple levels of structure

Map vs Record Inference Control

For objects with varying keys, you can control whether they're inferred as Maps (dynamic key-value pairs) or Records (fixed fields) using the map_threshold and map_max_required_keys parameters:

# Data with different key patterns
df = pl.DataFrame({
    "json_data": [
        '{"user": {"id": 1, "name": "Alice"}, "attributes": {"source": "web", "campaign": "summer"}}',
        '{"user": {"id": 2, "name": "Bob"}, "attributes": {"source": "mobile"}}'
    ]
})

# Default: both user and attributes become Records
schema_default = df.genson.infer_json_schema("json_data")

# Lower thresholds: distinguish structured Records from dynamic Maps
schema_controlled = df.genson.infer_json_schema(
    "json_data",
    map_threshold=2,           # Objects with ≥2 keys can be Maps
    map_max_required_keys=1,   # Maps can have ≤1 required key
)

In the controlled example:

  • user has 2 required keys (id, name) > 1 → Record (structured)
  • attributes has 1 required key (source) ≤ 1 → Map (dynamic)

This gives you fine-grained control over how objects with different key stability patterns are classified.
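
The decision described above can be sketched in plain Python. This is only an illustration of the two thresholds as the example explains them, not the plugin's implementation: the function name classify_field is hypothetical, the comparison uses ≥ as in the code comment above, and the real rule may apply further checks (for instance on value types).

```python
import json


def classify_field(objects: list[dict], map_threshold: int, map_max_required_keys: int) -> str:
    """Classify one object-valued field from the objects observed for it."""
    # Every key seen in at least one object.
    all_keys = set().union(*(obj.keys() for obj in objects))
    # A key counts as "required" when every observed object contains it.
    required = set(objects[0]).intersection(*(obj.keys() for obj in objects[1:]))
    if len(all_keys) >= map_threshold and len(required) <= map_max_required_keys:
        return "map"
    return "record"


rows = [
    json.loads('{"user": {"id": 1, "name": "Alice"}, "attributes": {"source": "web", "campaign": "summer"}}'),
    json.loads('{"user": {"id": 2, "name": "Bob"}, "attributes": {"source": "mobile"}}'),
]

for field in ("user", "attributes"):
    observed = [row[field] for row in rows]
    print(field, classify_field(observed, map_threshold=2, map_max_required_keys=1))
# user record
# attributes map
```

With the documented default map_threshold of 20, both fields fall below the threshold in this sketch and stay Records.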

Root Wrapping (wrap_root)

By default, inferred schemas treat each JSON object as the root.
Sometimes you may want to wrap the schema in an extra record layer — for example, to make Avro schemas compatible with systems that require a named top-level record.

You can control this behavior with the wrap_root option:

  • wrap_root=True → Wraps using the column name as the record name
  • wrap_root="<string>" → Wraps using the given string as the record name
  • wrap_root=None (default) → No wrapping (root is just "document" for Avro)

Example: Avro schema with wrap_root

df = pl.DataFrame({
    "json_data": [
        '{"value": "A"}',
        '{"value": "B"}'
    ]
})

schema = df.genson.infer_json_schema("json_data", avro=True, wrap_root="payload")

print(json.dumps(schema, indent=2))
{
  "type": "record",
  "name": "document",
  "namespace": "genson",
  "fields": [
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "payload",
        "namespace": "genson.document_types",
        "fields": [
          {
            "name": "value",
            "type": "string"
          }
        ]
      }
    }
  ]
}

This is especially useful when:

  • Exporting Avro to systems that require a named top-level record
  • Keeping schema names consistent with your column names or domain models
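
To make the shape of the wrapped output concrete, here is a small sketch that builds it by hand from an unwrapped Avro record. It is not the plugin's code: the helper wrap_avro_root is hypothetical, the unwrapped schema is an assumption based on the note that the default root is named "document", and the nested namespace pattern is simply chosen to reproduce the output above.

```python
def wrap_avro_root(schema: dict, name: str) -> dict:
    """Nest the fields of an Avro record under a single named field."""
    inner = {
        "type": "record",
        "name": name,
        # Assumed pattern, chosen to match "genson.document_types" above.
        "namespace": f"{schema['namespace']}.{schema['name']}_types",
        "fields": schema["fields"],
    }
    return {
        "type": "record",
        "name": schema["name"],
        "namespace": schema["namespace"],
        "fields": [{"name": name, "type": inner}],
    }


# Assumed unwrapped schema for the two {"value": ...} rows.
unwrapped = {
    "type": "record",
    "name": "document",
    "namespace": "genson",
    "fields": [{"name": "value", "type": "string"}],
}

wrapped = wrap_avro_root(unwrapped, "payload")
print(wrapped["fields"][0]["type"]["fields"])
# [{'name': 'value', 'type': 'string'}]
```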

Normalisation

In addition to schema inference, polars-genson can normalise JSON columns so that every row conforms to a single, consistent Avro schema.

This is especially useful for semi-structured data where fields may be missing, empty arrays/maps may need to collapse to null, or numeric/boolean values may sometimes be encoded as strings.

Features

  • Converts empty arrays/maps to null (default)
  • Preserves empties with empty_as_null=False
  • Ensures missing fields are inserted with null
  • Supports per-field coercion of numeric/boolean strings via coerce_strings=True
  • Supports top-level schema evolution with wrap_root

Example: Map Encoding in Polars

By default, Polars cannot store a dynamic JSON object ({"en":"Hello","fr":"Bonjour"}) without exploding it into a struct with fixed fields padded with nulls.
polars-genson solves this by normalising maps to a list of key/value structs.

This representation is schema-stable and preserves all map keys without null-padding. It matches how Arrow/Parquet model Avro map types internally.

import polars as pl
import polars_genson

df = pl.DataFrame({
    "json_data": [
        '{"id": 123, "tags": [], "labels": {}, "active": true}',
        '{"id": 456, "tags": ["x","y"], "labels": {"fr":"Bonjour"}, "active": false}',
        '{"id": 789, "labels": {"en": "Hi", "es": "Hola"}}'
    ]
})

print(df.genson.normalise_json("json_data", map_threshold=0))

Output:

shape: (3, 4)
┌─────┬────────────┬──────────────────────────────┬────────┐
│ id  ┆ tags       ┆ labels                       ┆ active │
│ --- ┆ ---        ┆ ---                          ┆ ---    │
│ i64 ┆ list[str]  ┆ list[struct[2]]              ┆ bool   │
╞═════╪════════════╪══════════════════════════════╪════════╡
│ 123 ┆ null       ┆ null                         ┆ true   │
│ 456 ┆ ["x", "y"] ┆ [{"fr","Bonjour"}]           ┆ false  │
│ 789 ┆ null       ┆ [{"en","Hi"}, {"es","Hola"}] ┆ null   │
└─────┴────────────┴──────────────────────────────┴────────┘

In the example above, normalise_json reshaped jagged JSON into a consistent, schema-aligned form:

  • Row 1

    • tags was present but empty ([]) → normalised to null (this prevents row elimination when exploding the column)
    • labels was present but empty ({}) → normalised to null
    • active stayed true
  • Row 2

    • tags had two values (["x","y"]) → preserved as a list of strings
    • labels had one entry ({"fr":"Bonjour"}) → normalised to a list of one key:value struct
    • active stayed false
  • Row 3

    • tags was missing entirely → injected as null
    • labels had two entries ({"en":"Hi","es":"Hola"}) → normalised to a list of two key:value structs
    • active was missing → injected as null
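
The row-by-row behaviour above can be imitated with the standard library alone. The sketch below is an illustration of the three rules (missing → null, empty → null, map → list of key/value structs), not the plugin's implementation; the field list, the set of map fields, and the struct field names "key" and "value" are assumptions made for the example.

```python
import json

FIELDS = ["id", "tags", "labels", "active"]  # field order of the target schema
MAP_FIELDS = {"labels"}                      # fields treated as maps


def normalise_row(raw: str) -> dict:
    """Align one JSON object to FIELDS, applying the three rules."""
    obj = json.loads(raw)
    out = {}
    for field in FIELDS:
        value = obj.get(field)        # missing field -> None
        if value in ([], {}):         # empty array/map -> None
            value = None
        elif field in MAP_FIELDS and value is not None:
            # map -> list of key/value structs
            value = [{"key": k, "value": v} for k, v in value.items()]
        out[field] = value
    return out


rows = [
    '{"id": 123, "tags": [], "labels": {}, "active": true}',
    '{"id": 456, "tags": ["x","y"], "labels": {"fr":"Bonjour"}, "active": false}',
    '{"id": 789, "labels": {"en": "Hi", "es": "Hola"}}',
]
normalised = [normalise_row(r) for r in rows]
print(normalised[2]["labels"])
# [{'key': 'en', 'value': 'Hi'}, {'key': 'es', 'value': 'Hola'}]
```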

Example: Empty Arrays

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})

out = df.genson.normalise_json("json_data", decode=False)  # keep normalised JSON strings
print(out)

Output:

shape: (2, 1)
┌─────────────────────────────┐
│ normalised                  │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ {"labels": null}            │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: Preserving Empty Arrays

out = df.genson.normalise_json("json_data", decode=False, empty_as_null=False)
print(out)

Output:

┌─────────────────────────────┐
│ normalised                  │
╞═════════════════════════════╡
│ {"labels": []}              │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: String Coercion

df = pl.DataFrame({
    "json_data": [
        '{"id": "42", "active": "true"}',
        '{"id": 7, "active": false}'
    ]
})

# Default: no coercion
print(df.genson.normalise_json("json_data", decode=False).to_list())
# ['{"id": null, "active": null}', '{"id": 7, "active": false}']

# With coercion
print(df.genson.normalise_json("json_data", decode=False, coerce_strings=True).to_list())
# ['{"id": 42, "active": true}', '{"id": 7, "active": false}']
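
The effect of coerce_strings on a single value can be sketched as follows. This is a plain-Python illustration that reproduces the outputs above under the assumption that the schema expects an integer for id and a boolean for active; the function normalise_scalar is hypothetical and the plugin's actual coercion rules may accept more or fewer string forms.

```python
def normalise_scalar(value, expected: str, coerce_strings: bool = False):
    """Return value if it fits `expected` ("integer" or "boolean"), else None."""
    if expected == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        if coerce_strings and isinstance(value, str):
            try:
                return int(value)
            except ValueError:
                return None
        return None
    if expected == "boolean":
        if isinstance(value, bool):
            return value
        if coerce_strings and isinstance(value, str) and value in ("true", "false"):
            return value == "true"
        return None
    raise ValueError(f"unsupported type: {expected}")


print(normalise_scalar("42", "integer"))                         # None
print(normalise_scalar("42", "integer", coerce_strings=True))    # 42
print(normalise_scalar("true", "boolean", coerce_strings=True))  # True
```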

Schema-Aware Decoding

The decode parameter can be either a boolean or a schema.

  • decode=True → Infer a schema automatically, then decode JSON into native Polars types.
  • decode=False → Leave values as normalised JSON strings.
  • decode=pl.Schema | pl.Struct → Use your own schema for decoding (skip re-inference).

import polars as pl
import polars_genson

df = pl.DataFrame({
    "json_data": [
        '{"id": 1, "active": true}',
        '{"id": 2, "active": false}'
    ]
})

# Explicit schema
schema = pl.Struct({
    "id": pl.Int64,
    "active": pl.Boolean,
})

# Use schema directly for decoding
decoded = df.genson.normalise_json("json_data", decode=schema)
print(decoded)

Output:

shape: (2, 2)
┌─────┬────────┐
│ id  ┆ active │
│ --- ┆ ---    │
│ i64 ┆ bool   │
╞═════╪════════╡
│ 1   ┆ true   │
│ 2   ┆ false  │
└─────┴────────┘

Note: Normalisation always aligns rows to a consistent schema internally. Passing your own schema skips the extra inference step, which can improve performance. If your schema doesn't match the data, however, you'll hit a decoding error (polars.exceptions.ComputeError from .str.json_decode), which may be exactly the failure you want to halt on.

For the best of both worlds, you can run with decode=True once, capture the resulting .schema, and then reuse it in future calls.
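
One possible way to arrange that reuse is a small wrapper that remembers the schema from the first decoded call. The class below is a hypothetical helper, not part of polars-genson; it relies only on the normalise_json call and the decode parameter described above, and assumes the result exposes .schema as a Polars DataFrame does.

```python
from typing import Any, Optional


class SchemaCache:
    """Remember the schema from the first decoded call and reuse it afterwards."""

    def __init__(self) -> None:
        self.schema: Optional[Any] = None  # filled in by the first call

    def normalise(self, df: Any, column: str, **kwargs: Any) -> Any:
        # First call: decode=True lets the plugin infer the schema.
        # Later calls: pass the captured schema so inference is skipped.
        decode = True if self.schema is None else self.schema
        out = df.genson.normalise_json(column, decode=decode, **kwargs)
        if self.schema is None:
            self.schema = out.schema
        return out
```

Usage would look like cache = SchemaCache() followed by cache.normalise(df, "json_data") for each batch, after importing polars_genson so that the genson namespace is registered. Batches that no longer match the cached schema would then fail with the decoding error mentioned in the note.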

Advanced Usage

Per-Row Schema Processing

  • Currently only available for JSON schemas (per-row/unmerged Polars schemas are still a TODO)

# Get individual schemas and process them
df = pl.DataFrame({
    "ABCs": [
        '{"a": 1, "b": 2}',
        '{"a": 1, "c": true}',
    ]
})

# Analyze schema variations
individual_schemas = df.genson.infer_json_schema("ABCs", merge_schemas=False)

The result is a list of one schema per row. With merge_schemas=True you would get all 3 keys (a, b, c) in a single schema.

[{'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
  'required': ['a', 'b'],
  'type': 'object'},
 {'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'c': {'type': 'boolean'}},
  'required': ['a', 'c'],
  'type': 'object'}]
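
Once you have the per-row schemas, a simple way to analyse the variation is to count how many rows declare each key. The sketch below works on the list shown above as plain dicts (with "$schema" omitted for brevity) and uses only the standard library; the variable names are illustrative.

```python
from collections import Counter

# Per-row schemas as returned with merge_schemas=False.
individual_schemas = [
    {"properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
     "required": ["a", "b"], "type": "object"},
    {"properties": {"a": {"type": "integer"}, "c": {"type": "boolean"}},
     "required": ["a", "c"], "type": "object"},
]

# Count in how many row schemas each key appears.
key_counts = Counter(key for s in individual_schemas for key in s["properties"])
n_rows = len(individual_schemas)

always = sorted(k for k, n in key_counts.items() if n == n_rows)
sometimes = sorted(k for k, n in key_counts.items() if n < n_rows)

print(always)     # ['a']
print(sometimes)  # ['b', 'c']
```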

JSON Schema Options

# Use the expression directly for more control
result = df.select(
    polars_genson.infer_json_schema(
        pl.col("json_data"),
        merge_schemas=False,  # Get individual schemas instead of merged
    ).alias("individual_schemas")
)

# Or use with different options
schema = df.genson.infer_json_schema(
    "json_data",
    ignore_outer_array=False,  # Treat top-level arrays as arrays
    ndjson=True,               # Handle newline-delimited JSON
    schema_uri="https://json-schema.org/draft/2020-12/schema",  # Specify a schema URI
    merge_schemas=True         # Merge all schemas (default)
)
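
As a rough guide to what ignore_outer_array and ndjson mean for the input, the sketch below shows how one string cell could be split into documents under each flag. It reflects only the option descriptions in this README, not the plugin's parser; the function iter_documents is hypothetical, and how the two flags interact in the plugin is not covered here.

```python
import json
from typing import Any, Iterator


def iter_documents(text: str, ignore_outer_array: bool = True, ndjson: bool = False) -> Iterator[Any]:
    """Yield the JSON documents that one string cell contributes."""
    if ndjson:
        # One document per non-empty line.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
        return
    doc = json.loads(text)
    if ignore_outer_array and isinstance(doc, list):
        # A top-level array is treated as a stream of its elements.
        yield from doc
    else:
        # Otherwise the value is a single document, array or not.
        yield doc


cell = '[{"a": 1}, {"a": 2}]'
print(list(iter_documents(cell)))
# [{'a': 1}, {'a': 2}]
print(list(iter_documents(cell, ignore_outer_array=False)))
# [[{'a': 1}, {'a': 2}]]
```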

Polars Schema Options

# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
    "json_data",
    ignore_outer_array=True,  # Treat top-level arrays as streams of objects
    ndjson=False,            # Not newline-delimited JSON
    debug=False              # Disable debug output
)

# Note: merge_schemas=False not yet supported for Polars schemas

Method Reference

The genson namespace provides three main methods:

infer_json_schema(column, **kwargs) -> dict | list[dict]

Infers a JSON Schema (or Avro, if requested) from a string column.

Parameters:

  • column: Name of the column containing JSON strings

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • schema_uri: Schema URI to embed in the output (default: "http://json-schema.org/schema#"). Ignored by some consumers when avro=True.

  • merge_schemas: Merge schemas from all rows (default: True). If False, returns one schema per row as a list.

  • debug: Print debug information (default: False)

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • map_max_required_keys: Maximum required keys for Map inference (default: None). Objects with more required keys will be forced to Record type. If None, no gating based on required key count.

  • force_field_types: Dict of per-field overrides, values must be "map" or "record". Example: {"labels": "map", "claims": "record"}

  • avro: Output Avro schema instead of JSON Schema (default: False)

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • dict when merge_schemas=True
  • list[dict] when merge_schemas=False

infer_polars_schema(column, **kwargs) -> pl.Schema

Infers a native Polars schema from a string column.

Parameters:

  • column: Name of the column containing JSON strings

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • merge_schemas: Merge schemas from all rows (default: True). (Currently the only supported mode.)

  • debug: Print debug information (default: False)

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • map_max_required_keys: Maximum required keys for Map inference (default: None). Objects with more required keys will be forced to Record type. If None, no gating based on required key count.

  • force_field_types: Dict of per-field overrides, values must be "map" or "record"

  • avro: Infer using Avro semantics (unions, maps, nullability) instead of pure JSON Schema semantics (default: False)

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • pl.Schema

Note: merge_schemas=False is not supported for Polars schema inference.

normalise_json(column, **kwargs) -> pl.DataFrame | pl.Series

Normalises each JSON string in the column against a single, inferred Avro schema. Ensures every row matches the same structure and datatypes.

Parameters:

  • column: Name of the column containing JSON strings

  • decode: If True, decode to native Polars types (default: True)

  • unnest: If decode=True, expand the decoded struct into separate columns (default: True)

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • empty_as_null: Convert empty arrays/maps to null (default: True)

  • coerce_strings: Coerce numeric/boolean strings (e.g. "42", "true") into numbers/booleans where the schema expects them (default: False)

  • map_encoding: Encoding for Avro maps: "kv" (default), "mapping", or "entries"

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • map_max_required_keys: Maximum required keys for Map inference (default: None). Objects with more required keys will be forced to Record type. If None, no gating based on required key count.

  • force_field_types: Dict of per-field overrides ("map"/"record")

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • If decode=True:

    • unnest=True → pl.DataFrame with one column per schema field
    • unnest=False → pl.DataFrame with a single struct column
  • If decode=False → pl.Series of normalised JSON strings

Example:

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})
out = df.genson.normalise_json("json_data", decode=False)  # Series of JSON strings
print(out.to_list())
# ['{"labels": null}', '{"labels": {"en": "Hello"}}']

Examples

Working with Complex JSON

# Complex nested JSON with arrays of objects
df = pl.DataFrame({
    "complex_json": [
        '{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
        '{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
    ]
})

schema = df.genson.infer_polars_schema("complex_json")
print(schema)
Schema({
    'user': Struct({
        'profile': Struct({
            'name': String, 
            'preferences': Struct({'theme': String})
        })
    }),
    'posts': List(Struct({'likes': Int64, 'title': String})),
})

Using Inferred Schema

# You can use the inferred schema for validation or DataFrame operations
inferred_schema = df.genson.infer_polars_schema("json_data")

# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
    print(f"  {name}: {dtype}")

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

MIT License

  • Contains a vendored and slightly adapted copy of the Apache 2.0 licensed fork of the genson-rs crate
