Skip to main content

A Polars plugin for JSON schema inference using genson-rs.

Project description

Polars Genson

PyPI crates.io: genson-core crates.io: polars-jsonschema-bridge Supported Python versions pre-commit.ci status

A Polars plugin for working with JSON schemas. Infer schemas from JSON data and convert between JSON Schema and Polars schema formats.

Installation

pip install polars-genson[polars]

On older CPUs run:

pip install polars-genson[polars-lts-cpu]

Features

Schema Inference

  • JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
  • Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
  • Multiple JSON Objects: Handle columns with varying JSON schemas across rows
  • Complex Types: Support for nested objects, arrays, and mixed types
  • Flexible Input: Support for both single JSON objects and arrays of objects

Schema Conversion

  • Polars → JSON Schema: Convert existing DataFrame schemas to JSON Schema format
  • JSON Schema → Polars: Convert JSON schemas to equivalent Polars schemas
  • Round-trip Support: Full bidirectional conversion with validation
  • Schema Manipulation: Validate, transform, and standardize schemas

Usage

The plugin adds a genson namespace to Polars DataFrames for schema inference and conversion.

import polars as pl
import polars_genson
import json

# Create a DataFrame with JSON strings
df = pl.DataFrame({
    "json_data": [
        '{"name": "Alice", "age": 30, "scores": [95, 87]}',
        '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
        '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
    ]
})

print("Input DataFrame:")
print(df)
shape: (3, 1)
┌─────────────────────────────────┐
 json_data                       
 ---                             
 str                             
╞═════════════════════════════════╡
 {"name": "Alice", "age": 30, "… │
 {"name": "Bob", "age": 25, "ci… │
 {"name": "Charlie", "age": 35, 
└─────────────────────────────────┘

JSON Schema Inference

# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")

print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))
{
  "$schema": "http://json-schema.org/schema#",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "scores": {
      "items": {
        "type": "integer"
      },
      "type": "array"
    }
    "city": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "metadata": {
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": [
        "role"
      ],
      "type": "object"
    },
  },
  "required": [
    "age",
    "name"
  ],
  "type": "object"
}

Polars Schema Inference

Directly infer Polars data types and schemas:

# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")

print("Inferred Polars schema:")
print(polars_schema)
Schema({
    'name': String,
    'age': Int64,
    'scores': List(Int64),
    'city': String,
    'active': Boolean,
    'metadata': Struct({'role': String}),
})

The Polars schema inference automatically handles:

  • Complex nested structures with proper Struct types
  • Typed arrays like List(Int64), List(String)
  • Mixed data types (integers, floats, booleans, strings)
  • Optional fields present in some but not all objects
  • Deep nesting with multiple levels of structure

Root Wrapping (wrap_root)

By default, inferred schemas treat each JSON object as the root.
Sometimes you may want to wrap the schema in an extra record layer — for example, to make Avro schemas compatible with systems that require a named top-level record.

You can control this behavior with the wrap_root option:

  • wrap_root="true" → Wraps using the column name as the record name
  • wrap_root="<string>" → Wraps using the given string as the record name
  • wrap_root=None (default) → No wrapping (root is just "document" for Avro)

Example: Avro schema with wrap_root

df = pl.DataFrame({
    "json_data": [
        '{"value": "A"}',
        '{"value": "B"}'
    ]
})

schema = df.genson.infer_json_schema("json_data", avro=True, wrap_root="payload")

print(json.dumps(schema, indent=2))
{
  "type": "record",
  "name": "document",
  "namespace": "genson",
  "fields": [
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "payload",
        "namespace": "genson.document_types",
        "fields": [
          {
            "name": "value",
            "type": "string"
          }
        ]
      }
    }
  ]
}

This is especially useful when:

  • Exporting Avro to systems that require a named top-level record
  • Keeping schema names consistent with your column names or domain models

Normalisation

In addition to schema inference, polars-genson can normalise JSON columns so that every row conforms to a single, consistent Avro schema.

This is especially useful for semi-structured data where fields may be missing, empty arrays/maps may need to collapse to null, or numeric/boolean values may sometimes be encoded as strings.

Features

  • Converts empty arrays/maps to null (default)
  • Preserves empties with empty_as_null=False
  • Ensures missing fields are inserted with null
  • Supports per-field coercion of numeric/boolean strings via coerce_strings=True
  • Supports top-level schema evolution with wrap_root

Example: Map Encoding in Polars

By default, Polars cannot store a dynamic JSON object ({"en":"Hello","fr":"Bonjour"}) without exploding it into a struct with fixed fields padded with nulls.
polars-genson solves this by normalising maps to a list of key/value structs:

This representation is schema-stable and preserves all map keys without null-padding. It matches how Arrow/Parquet model Avro map types internally.

import polars as pl
import polars_genson

df = pl.DataFrame({
    "json_data": [
        '{"id": 123, "tags": [], "labels": {}, "active": true}',
        '{"id": 456, "tags": ["x","y"], "labels": {"fr":"Bonjour"}, "active": false}',
        '{"id": 789, "labels": {"en": "Hi", "es": "Hola"}}'
    ]
})

print(df.genson.normalise_json("json_data", map_threshold=0))

Output:

shape: (3, 4)
┌─────┬────────────┬──────────────────────────────┬────────┐
│ id  ┆ tags       ┆ labels                       ┆ active │
│ --- ┆ ---        ┆ ---                          ┆ ---    │
│ i64 ┆ list[str]  ┆ list[struct[2]]              ┆ bool   │
╞═════╪════════════╪══════════════════════════════╪════════╡
│ 123 ┆ null       ┆ null                         ┆ true   │
│ 456 ┆ ["x", "y"] ┆ [{"fr","Bonjour"}]           ┆ false  │
│ 789 ┆ null       ┆ [{"en","Hi"}, {"es","Hola"}] ┆ null   │
└─────┴────────────┴──────────────────────────────┴────────┘

In the example above, normalise_json reshaped jagged JSON into a consistent, schema-aligned form:

  • Row 1

    • tags was present but empty ([]) → normalised to null (this prevents row elimination when exploding the column)
    • labels was present but empty ({}) → normalised to null
    • active stayed true
  • Row 2

    • tags had two values (["x","y"]) → preserved as a list of strings
    • labels had one entry ({"fr":"Bonjour"}) → normalised to a list of one key:value struct
    • active stayed false
  • Row 3

    • tags was missing entirely → injected as null
    • labels had two entries ({"en":"Hi","es":"Hola"}) → normalised to a list of two key:value structs
    • active was missing → injected as null

Example: Empty Arrays

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})

out = df.genson.normalise_json("json_data")
print(out)

Output:

shape: (2, 1)
┌─────────────────────────────┐
│ normalised                  │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ {"labels": null}            │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: Preserving Empty Arrays

out = df.genson.normalise_json("json_data", empty_as_null=False)
print(out)

Output:

┌─────────────────────────────┐
│ normalised                  │
╞═════════════════════════════╡
│ {"labels": []}              │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: String Coercion

df = pl.DataFrame({
    "json_data": [
        '{"id": "42", "active": "true"}',
        '{"id": 7, "active": false}'
    ]
})

# Default: no coercion
print(df.genson.normalise_json("json_data").to_list())
# ['{"id": null, "active": null}', '{"id": 7, "active": false}']

# With coercion
print(df.genson.normalise_json("json_data", coerce_strings=True).to_list())
# ['{"id": 42, "active": true}', '{"id": 7, "active": false}']

Advanced Usage

Per-Row Schema Processing

  • Only available with JSON schema currently (per-row/unmerged Polars schemas TODO)
# Get individual schemas and process them
df = pl.DataFrame({
    "ABCs": [
        '{"a": 1, "b": 2}',
        '{"a": 1, "c": true}',
    ]
})

# Analyze schema variations
individual_schemas = df.genson.infer_json_schema("ABCs", merge_schemas=False)

The result is a list of one schema per row. With merge_schemas=True you would get all 3 keys (a, b, c) in a single schema.

[{'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
  'required': ['a', 'b'],
  'type': 'object'},
 {'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'c': {'type': 'boolean'}},
  'required': ['a', 'c'],
  'type': 'object'}]

JSON Schema Options

# Use the expression directly for more control
result = df.select(
    polars_genson.infer_json_schema(
        pl.col("json_data"),
        merge_schemas=False,  # Get individual schemas instead of merged
    ).alias("individual_schemas")
)

# Or use with different options
schema = df.genson.infer_json_schema(
    "json_data",
    ignore_outer_array=False,  # Treat top-level arrays as arrays
    ndjson=True,               # Handle newline-delimited JSON
    schema_uri="https://json-schema.org/draft/2020-12/schema",  # Specify a schema URI
    merge_schemas=True         # Merge all schemas (default)
)

Polars Schema Options

# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
    "json_data",
    ignore_outer_array=True,  # Treat top-level arrays as streams of objects
    ndjson=False,            # Not newline-delimited JSON
    debug=False              # Disable debug output
)

# Note: merge_schemas=False not yet supported for Polars schemas

Method Reference

The genson namespace provides three main methods:

infer_json_schema(column, **kwargs) -> dict | list[dict]

Infers a JSON Schema (or Avro, if requested) from a string column.

Parameters:

  • column: Name of the column containing JSON strings

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • schema_uri: Schema URI to embed in the output (default: "http://json-schema.org/schema#"). Ignored by some consumers when avro=True.

  • merge_schemas: Merge schemas from all rows (default: True). If False, returns one schema per row as a list.

  • debug: Print debug information (default: False)

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • force_field_types: Dict of per-field overrides, values must be "map" or "record". Example: {"labels": "map", "claims": "record"}

  • avro: Output Avro schema instead of JSON Schema (default: False)

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • dict when merge_schemas=True
  • list[dict] when merge_schemas=False

infer_polars_schema(column, **kwargs) -> pl.Schema

Infers a native Polars schema from a string column.

Parameters:

  • column: Name of the column containing JSON strings

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • merge_schemas: Merge schemas from all rows (default: True). (Currently the only supported mode.)

  • debug: Print debug information (default: False)

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • force_field_types: Dict of per-field overrides, values must be "map" or "record"

  • avro: Infer using Avro semantics (unions, maps, nullability) instead of pure JSON Schema semantics (default: False)

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • pl.Schema

Note: merge_schemas=False is not supported for Polars schema inference.

normalise_json(column, **kwargs) -> pl.DataFrame | pl.Series

Normalises each JSON string in the column against a single, inferred Avro schema. Ensures every row matches the same structure and datatypes.

Parameters:

  • column: Name of the column containing JSON strings

  • decode: If True, decode to native Polars types (default: True)

  • unnest: If decode=True, expand the decoded struct into separate columns (default: True)

  • ignore_outer_array: Treat top-level arrays as streams of objects (default: True)

  • ndjson: Treat input as newline-delimited JSON (default: False)

  • empty_as_null: Convert empty arrays/maps to null (default: True)

  • coerce_strings: Coerce numeric/boolean strings (e.g. "42", "true") into numbers/booleans where the schema expects them (default: False)

  • map_encoding: Encoding for Avro maps: "kv" (default), "mapping", or "entries"

  • map_threshold: Detect maps when object has more than N keys (default: 20)

  • force_field_types: Dict of per-field overrides ("map"/"record")

  • wrap_root: Control root wrapping.

    • True → wrap using the column name
    • str → wrap using the given name
    • None → no wrapping (default)

Returns:

  • If decode=True:

    • unnest=Truepl.DataFrame with one column per schema field
    • unnest=Falsepl.DataFrame with a single struct column
  • If decode=Falsepl.Series of normalised JSON strings

Example:

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})
out = df.genson.normalise_json("json_data")
print(out.to_list())
# ['{"labels": null}', '{"labels": {"en": "Hello"}}']

Examples

Working with Complex JSON

# Complex nested JSON with arrays of objects
df = pl.DataFrame({
    "complex_json": [
        '{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
        '{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
    ]
})

schema = df.genson.infer_polars_schema("complex_json")
print(schema)
Schema({
    'user': Struct({
        'profile': Struct({
            'name': String, 
            'preferences': Struct({'theme': String})
        })
    }),
    'posts': List(Struct({'likes': Int64, 'title': String})),
})

Using Inferred Schema

# You can use the inferred schema for validation or DataFrame operations
inferred_schema = df.genson.infer_polars_schema("json_data")

# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
    print(f"  {name}: {dtype}")

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

MIT License

  • Contains vendored and slightly adapted copy of the Apache 2.0 licensed fork of genson-rs crate

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polars_genson-0.1.8.tar.gz (23.5 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_genson-0.1.8-cp39-abi3-win_amd64.whl (6.3 MB view details)

Uploaded CPython 3.9+Windows x86-64

polars_genson-0.1.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.1 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

polars_genson-0.1.8-cp39-abi3-macosx_11_0_arm64.whl (5.1 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

polars_genson-0.1.8-cp39-abi3-macosx_10_12_x86_64.whl (5.9 MB view details)

Uploaded CPython 3.9+macOS 10.12+ x86-64

File details

Details for the file polars_genson-0.1.8.tar.gz.

File metadata

  • Download URL: polars_genson-0.1.8.tar.gz
  • Upload date:
  • Size: 23.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for polars_genson-0.1.8.tar.gz
Algorithm Hash digest
SHA256 a92bfd421c1c8524d95c33ce2abc6e3a88d8c81a75336d06fc36a6779a1cf456
MD5 80968ce3fd128f3d59d1d47a1e2864c2
BLAKE2b-256 03c8f15f1470bfc775db4719152b0c03092f66c7039dec2a3e31d2ebc6b33aff

See more details on using hashes here.

File details

Details for the file polars_genson-0.1.8-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 e48ce551c90749fa1dbd729788e8bafe037bec81fe67265f0674efa3061ff8c3
MD5 d947cd49a438c15d484b6e4c3eb2b981
BLAKE2b-256 1d5023c2ed9d36b91843d6b25e8bce1a3cadb22e9179a5211c12ed4d4851a42b

See more details on using hashes here.

File details

Details for the file polars_genson-0.1.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 47a4606509c7d74701823e171cfe3ca3353147913ad5afd9c6f2f35a6a7c5738
MD5 a93128b820e0681688c678c0ca460a93
BLAKE2b-256 5f578bc50319811e520517ab47ed6c39fbd4887b8c29c4f1969d5f0743858d5d

See more details on using hashes here.

File details

Details for the file polars_genson-0.1.8-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c7d00dcac9a15cfbfd736f7805dd169515163fb749627dab1e43e458c659a02b
MD5 bab0e1fedda1e870a5eb774a3488de66
BLAKE2b-256 4194d744c68999d62e0a59bbec733a956e6c6fb24d30738f2e50494e113097c2

See more details on using hashes here.

File details

Details for the file polars_genson-0.1.8-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 e4b802b5d7104f4e7b902c9d77613e2076cdaf92790b43a88b04a205af3d9c11
MD5 d4e28dabb2c71175a8127a415f615a0e
BLAKE2b-256 28f735950be01b29c55ca7e47ea4f372856e2b4257bf6c47dc5709dfdaa5011c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page