A Polars plugin for JSON schema inference using genson-rs.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

lmmx

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Programming Language

Project description

Polars Genson

A Polars plugin for working with JSON schemas. Infer schemas from JSON data and convert between JSON Schema and Polars schema formats.

Installation

pip install polars-genson[polars]

On older CPUs, install the polars-lts-cpu variant instead:

pip install polars-genson[polars-lts-cpu]

Features

Schema Inference

JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
Multiple JSON Objects: Handle columns with varying JSON schemas across rows
Complex Types: Support for nested objects, arrays, and mixed types
Flexible Input: Support for both single JSON objects and arrays of objects

Schema Conversion

Polars → JSON Schema: Convert existing DataFrame schemas to JSON Schema format
JSON Schema → Polars: Convert JSON schemas to equivalent Polars schemas
Round-trip Support: Full bidirectional conversion with validation
Schema Manipulation: Validate, transform, and standardize schemas

Usage

The plugin adds a genson namespace to Polars DataFrames for schema inference and conversion.

import polars as pl
import polars_genson
import json

# Create a DataFrame with JSON strings
df = pl.DataFrame({
    "json_data": [
        '{"name": "Alice", "age": 30, "scores": [95, 87]}',
        '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
        '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
    ]
})

print("Input DataFrame:")
print(df)

shape: (3, 1)
┌─────────────────────────────────┐
│ json_data                       │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ {"name": "Alice", "age": 30, "… │
│ {"name": "Bob", "age": 25, "ci… │
│ {"name": "Charlie", "age": 35,… │
└─────────────────────────────────┘

JSON Schema Inference

# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")

print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))

{
  "$schema": "http://json-schema.org/schema#",
  "properties": {
    "name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "scores": {
      "items": {
        "type": "integer"
      },
      "type": "array"
    },
    "city": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "metadata": {
      "properties": {
        "role": {
          "type": "string"
        }
      },
      "required": [
        "role"
      ],
      "type": "object"
    }
  },
  "required": [
    "age",
    "name"
  ],
  "type": "object"
}
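Why only age and name end up under "required" can be reproduced with the standard library alone. The following is a minimal sketch, not the plugin's implementation: it assumes a merged schema lists every key seen in any row under "properties" and only the keys present in every row under "required". The helper name merged_keys is hypothetical.

```python
import json

rows = [
    '{"name": "Alice", "age": 30, "scores": [95, 87]}',
    '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
    '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}',
]

def merged_keys(json_rows):
    """Return (keys seen in any row, keys seen in every row) for top-level objects."""
    key_sets = [set(json.loads(row)) for row in json_rows]
    all_keys = set().union(*key_sets)          # candidates for "properties"
    required = set.intersection(*key_sets)     # candidates for "required"
    return sorted(all_keys), sorted(required)

all_keys, required = merged_keys(rows)
print(all_keys)   # ['active', 'age', 'city', 'metadata', 'name', 'scores']
print(required)   # ['age', 'name']
```

The optional fields (scores, city, active, metadata) each appear in only one row, so they are described in the schema but not required.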

Polars Schema Inference

Directly infer Polars data types and schemas:

# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")

print("Inferred Polars schema:")
print(polars_schema)

Schema({
    'name': String,
    'age': Int64,
    'scores': List(Int64),
    'city': String,
    'active': Boolean,
    'metadata': Struct({'role': String}),
})

The Polars schema inference automatically handles:

✅ Complex nested structures with proper Struct types
✅ Typed arrays like List(Int64), List(String)
✅ Mixed data types (integers, floats, booleans, strings)
✅ Optional fields present in some but not all objects
✅ Deep nesting with multiple levels of structure
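The correspondence between JSON Schema types and the Polars types listed above can be sketched as a small recursive mapping. This is an illustration only: it returns type names as plain strings rather than real Polars dtypes, the helper to_polars_type_name is hypothetical, and the "number" to Float64 entry is an assumption not shown in the output above.

```python
def to_polars_type_name(schema):
    """Map a JSON Schema fragment to the name of a Polars dtype (as a string)."""
    scalar = {
        "string": "String",
        "integer": "Int64",
        "number": "Float64",   # assumption: floats map to Float64
        "boolean": "Boolean",
    }
    kind = schema.get("type")
    if kind == "array":
        # Arrays become List(<item type>)
        return f"List({to_polars_type_name(schema.get('items', {}))})"
    if kind == "object":
        # Nested objects become Struct({...}) with one entry per property
        inner = ", ".join(
            f"'{name}': {to_polars_type_name(sub)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return f"Struct({{{inner}}})"
    # Anything this sketch does not recognise is reported as "Unknown"
    return scalar.get(kind, "Unknown") if isinstance(kind, str) else "Unknown"

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "scores": {"type": "array", "items": {"type": "integer"}},
        "metadata": {"type": "object", "properties": {"role": {"type": "string"}}},
    },
}
names = {k: to_polars_type_name(v) for k, v in schema["properties"].items()}
print(names)
```

For the three fields in this sample, the names line up with the schema printed earlier: String, List(Int64), and Struct({'role': String}).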

Root Wrapping (`wrap_root`)

By default, inferred schemas treat each JSON object as the root.
Sometimes you may want to wrap the schema in an extra record layer — for example, to make Avro schemas compatible with systems that require a named top-level record.

You can control this behavior with the wrap_root option:

wrap_root=True → Wraps using the column name as the record name
wrap_root="<string>" → Wraps using the given string as the record name
wrap_root=None (default) → No wrapping (root is just "document" for Avro)

Example: Avro schema with wrap_root

df = pl.DataFrame({
    "json_data": [
        '{"value": "A"}',
        '{"value": "B"}'
    ]
})

schema = df.genson.infer_json_schema("json_data", avro=True, wrap_root="payload")

print(json.dumps(schema, indent=2))

{
  "type": "record",
  "name": "document",
  "namespace": "genson",
  "fields": [
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "payload",
        "namespace": "genson.document_types",
        "fields": [
          {
            "name": "value",
            "type": "string"
          }
        ]
      }
    }
  ]
}

This is especially useful when:

Exporting Avro to systems that require a named top-level record
Keeping schema names consistent with your column names or domain models
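To make the wrapped shape concrete, here is a minimal sketch that rebuilds the structure of the Avro output above from a list of fields. It is not the plugin's code: wrap_avro_root is a hypothetical helper, and the "document" root name, "genson" namespace, and "<root>_types" sub-namespace are simply copied from the example output.

```python
import json

def wrap_avro_root(fields, wrap_name, root_name="document", namespace="genson"):
    """Nest an Avro record's fields under a single named field of an outer record."""
    inner = {
        "type": "record",
        "name": wrap_name,
        "namespace": f"{namespace}.{root_name}_types",
        "fields": fields,
    }
    return {
        "type": "record",
        "name": root_name,
        "namespace": namespace,
        "fields": [{"name": wrap_name, "type": inner}],
    }

wrapped = wrap_avro_root([{"name": "value", "type": "string"}], "payload")
print(json.dumps(wrapped, indent=2))
```

The original fields move one level down, under a single field named after the wrap_root value.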

Normalisation

In addition to schema inference, polars-genson can normalise JSON columns so that every row conforms to a single, consistent Avro schema.

This is especially useful for semi-structured data where fields may be missing, empty arrays/maps may need to collapse to null, or numeric/boolean values may sometimes be encoded as strings.

Features

Converts empty arrays/maps to null (default)
Preserves empties with empty_as_null=False
Ensures missing fields are inserted with null
Supports per-field coercion of numeric/boolean strings via coerce_strings=True
Supports top-level schema evolution with wrap_root

Example: Map Encoding in Polars

By default, Polars cannot store a dynamic JSON object ({"en":"Hello","fr":"Bonjour"}) without exploding it into a struct with fixed fields padded with nulls.
polars-genson solves this by normalising maps to a list of key/value structs. This representation is schema-stable and preserves every map key without null-padding, and it matches how Arrow/Parquet model Avro map types internally.

import polars as pl
import polars_genson

df = pl.DataFrame({
    "json_data": [
        '{"id": 123, "tags": [], "labels": {}, "active": true}',
        '{"id": 456, "tags": ["x","y"], "labels": {"fr":"Bonjour"}, "active": false}',
        '{"id": 789, "labels": {"en": "Hi", "es": "Hola"}}'
    ]
})

print(df.genson.normalise_json("json_data", map_threshold=0))

Output:

shape: (3, 4)
┌─────┬────────────┬──────────────────────────────┬────────┐
│ id  ┆ tags       ┆ labels                       ┆ active │
│ --- ┆ ---        ┆ ---                          ┆ ---    │
│ i64 ┆ list[str]  ┆ list[struct[2]]              ┆ bool   │
╞═════╪════════════╪══════════════════════════════╪════════╡
│ 123 ┆ null       ┆ null                         ┆ true   │
│ 456 ┆ ["x", "y"] ┆ [{"fr","Bonjour"}]           ┆ false  │
│ 789 ┆ null       ┆ [{"en","Hi"}, {"es","Hola"}] ┆ null   │
└─────┴────────────┴──────────────────────────────┴────────┘

In the example above, normalise_json reshaped jagged JSON into a consistent, schema-aligned form:

Row 1
- tags was present but empty ([]) → normalised to null (this prevents row elimination when exploding the column)
- labels was present but empty ({}) → normalised to null
- active stayed true
Row 2
- tags had two values (["x","y"]) → preserved as a list of strings
- labels had one entry ({"fr":"Bonjour"}) → normalised to a list of one key:value struct
- active stayed false
Row 3
- tags was missing entirely → injected as null
- labels had two entries ({"en":"Hi","es":"Hola"}) → normalised to a list of two key:value structs
- active was missing → injected as null
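The key/value encoding of labels described above can be sketched without Polars. This is an illustration only: encode_map_kv is a hypothetical helper, and the struct field names "key" and "value" are assumptions, since the printed output does not show the field names.

```python
def encode_map_kv(mapping):
    """Turn a dynamic JSON object into a list of key/value structs (None if empty or missing)."""
    if not mapping:
        return None
    return [{"key": k, "value": v} for k, v in mapping.items()]

print(encode_map_kv({}))
# None
print(encode_map_kv({"fr": "Bonjour"}))
# [{'key': 'fr', 'value': 'Bonjour'}]
print(encode_map_kv({"en": "Hi", "es": "Hola"}))
# [{'key': 'en', 'value': 'Hi'}, {'key': 'es', 'value': 'Hola'}]
```

Each row produces a list whose element type never changes, however many distinct keys appear across the column.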

Example: Empty Arrays

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})

out = df.genson.normalise_json("json_data")
print(out)

Output:

shape: (2, 1)
┌─────────────────────────────┐
│ normalised                  │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ {"labels": null}            │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: Preserving Empty Arrays

out = df.genson.normalise_json("json_data", empty_as_null=False)
print(out)

Output:

┌─────────────────────────────┐
│ normalised                  │
╞═════════════════════════════╡
│ {"labels": []}              │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘

Example: String Coercion

df = pl.DataFrame({
    "json_data": [
        '{"id": "42", "active": "true"}',
        '{"id": 7, "active": false}'
    ]
})

# Default: no coercion (decode=False returns the normalised JSON strings as a Series)
print(df.genson.normalise_json("json_data", decode=False).to_list())
# ['{"id": null, "active": null}', '{"id": 7, "active": false}']

# With coercion
print(df.genson.normalise_json("json_data", decode=False, coerce_strings=True).to_list())
# ['{"id": 42, "active": true}', '{"id": 7, "active": false}']
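The difference between the two outputs above can be mimicked per value. This is a minimal sketch under stated assumptions, not the plugin's logic: coerce_value is a hypothetical helper, it only handles the "integer" and "boolean" cases, and it returns None for a type mismatch because that is what the example output shows.

```python
def coerce_value(value, expected, coerce_strings=False):
    """Return value if it already has the expected type, optionally coercing strings."""
    checks = {
        # bool is a subclass of int in Python, so exclude it explicitly
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "boolean": lambda v: isinstance(v, bool),
    }
    if checks[expected](value):
        return value
    if coerce_strings and isinstance(value, str):
        if expected == "integer":
            try:
                return int(value)
            except ValueError:
                return None
        if expected == "boolean" and value in ("true", "false"):
            return value == "true"
    return None  # mismatch without a usable coercion

print(coerce_value("42", "integer"))                         # None
print(coerce_value("42", "integer", coerce_strings=True))    # 42
print(coerce_value("true", "boolean", coerce_strings=True))  # True
print(coerce_value(7, "integer"))                            # 7
```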

Advanced Usage

Per-Row Schema Processing

Per-row schemas are currently available only for JSON Schema inference; per-row (unmerged) Polars schemas are not yet supported.

# Get individual schemas and process them
df = pl.DataFrame({
    "ABCs": [
        '{"a": 1, "b": 2}',
        '{"a": 1, "c": true}',
    ]
})

# Analyze schema variations
individual_schemas = df.genson.infer_json_schema("ABCs", merge_schemas=False)

The result is a list of one schema per row. With merge_schemas=True you would get all 3 keys (a, b, c) in a single schema.

[{'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
  'required': ['a', 'b'],
  'type': 'object'},
 {'$schema': 'http://json-schema.org/schema#',
  'properties': {'a': {'type': 'integer'}, 'c': {'type': 'boolean'}},
  'required': ['a', 'c'],
  'type': 'object'}]
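One way to analyse the variation is to compare the property names of the per-row schemas. The sketch below works on the list printed above, copied in as a literal so that it runs without the plugin; the variable names are illustrative.

```python
individual_schemas = [
    {"$schema": "http://json-schema.org/schema#",
     "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
     "required": ["a", "b"],
     "type": "object"},
    {"$schema": "http://json-schema.org/schema#",
     "properties": {"a": {"type": "integer"}, "c": {"type": "boolean"}},
     "required": ["a", "c"],
     "type": "object"},
]

# Property names per row
key_sets = [set(schema["properties"]) for schema in individual_schemas]

common = sorted(set.intersection(*key_sets))             # present in every row
varying = sorted(set().union(*key_sets) - set(common))   # present in only some rows

print(common)   # ['a']
print(varying)  # ['b', 'c']
```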

JSON Schema Options

# Use the expression directly for more control
result = df.select(
    polars_genson.infer_json_schema(
        pl.col("json_data"),
        merge_schemas=False,  # Get individual schemas instead of merged
    ).alias("individual_schemas")
)

# Or use with different options
schema = df.genson.infer_json_schema(
    "json_data",
    ignore_outer_array=False,  # Treat top-level arrays as arrays
    ndjson=True,               # Handle newline-delimited JSON
    schema_uri="https://json-schema.org/draft/2020-12/schema",  # Specify a schema URI
    merge_schemas=True         # Merge all schemas (default)
)
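The effect of ignore_outer_array and ndjson on how a cell is split into documents can be sketched as follows. This reflects one reading of the option descriptions, not the plugin's parser, and iter_documents is a hypothetical helper.

```python
import json

def iter_documents(text, ignore_outer_array=True, ndjson=False):
    """Yield the JSON documents contained in one string cell."""
    if ndjson:
        # One JSON value per non-empty line
        chunks = [line for line in text.splitlines() if line.strip()]
    else:
        chunks = [text]
    for chunk in chunks:
        value = json.loads(chunk)
        if ignore_outer_array and isinstance(value, list):
            # Treat a top-level array as a stream of its elements
            yield from value
        else:
            yield value

print(list(iter_documents('[{"a": 1}, {"a": 2}]')))
# [{'a': 1}, {'a': 2}]
print(list(iter_documents('[{"a": 1}]', ignore_outer_array=False)))
# [[{'a': 1}]]
print(list(iter_documents('{"a": 1}\n{"b": 2}\n', ndjson=True)))
# [{'a': 1}, {'b': 2}]
```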

Polars Schema Options

# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
    "json_data",
    ignore_outer_array=True,  # Treat top-level arrays as streams of objects
    ndjson=False,            # Not newline-delimited JSON
    debug=False              # Disable debug output
)

# Note: merge_schemas=False is not yet supported for Polars schemas

Method Reference

The genson namespace provides three main methods:

`infer_json_schema(column, **kwargs) -> dict | list[dict]`

Infers a JSON Schema (or Avro, if requested) from a string column.

Parameters:

column: Name of the column containing JSON strings
ignore_outer_array: Treat top-level arrays as streams of objects (default: True)
ndjson: Treat input as newline-delimited JSON (default: False)
schema_uri: Schema URI to embed in the output (default: "http://json-schema.org/schema#"). Ignored by some consumers when avro=True.
merge_schemas: Merge schemas from all rows (default: True). If False, returns one schema per row as a list.
debug: Print debug information (default: False)
map_threshold: Detect maps when object has more than N keys (default: 20)
force_field_types: Dict of per-field overrides, values must be "map" or "record". Example: {"labels": "map", "claims": "record"}
avro: Output Avro schema instead of JSON Schema (default: False)
wrap_root: Control root wrapping.
- True → wrap using the column name
- str → wrap using the given name
- None → no wrapping (default)

Returns:

dict when merge_schemas=True
list[dict] when merge_schemas=False
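How map_threshold and force_field_types might interact can be sketched as a simple decision rule. This is a simplified reading of the parameter descriptions above, and the real heuristic may take more into account; classify_object is a hypothetical helper.

```python
def classify_object(name, obj, map_threshold=20, force_field_types=None):
    """Decide whether a JSON object field is treated as a "map" or a "record"."""
    forced = (force_field_types or {}).get(name)
    if forced in ("map", "record"):
        return forced  # an explicit override takes precedence
    # Otherwise: more than map_threshold keys -> map
    return "map" if len(obj) > map_threshold else "record"

labels = {"fr": "Bonjour"}
print(classify_object("labels", labels))                                      # record
print(classify_object("labels", labels, map_threshold=0))                     # map
print(classify_object("labels", labels, force_field_types={"labels": "map"})) # map
```

Under this reading, map_threshold=0 makes any non-empty object a map, which is consistent with the Map Encoding example where labels is encoded as a list of key/value structs.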

`infer_polars_schema(column, **kwargs) -> pl.Schema`

Infers a native Polars schema from a string column.

Parameters:

column: Name of the column containing JSON strings
ignore_outer_array: Treat top-level arrays as streams of objects (default: True)
ndjson: Treat input as newline-delimited JSON (default: False)
merge_schemas: Merge schemas from all rows (default: True). (Currently the only supported mode.)
debug: Print debug information (default: False)
map_threshold: Detect maps when object has more than N keys (default: 20)
force_field_types: Dict of per-field overrides, values must be "map" or "record"
avro: Infer using Avro semantics (unions, maps, nullability) instead of pure JSON Schema semantics (default: False)
wrap_root: Control root wrapping.
- True → wrap using the column name
- str → wrap using the given name
- None → no wrapping (default)

Returns:

pl.Schema

Note: merge_schemas=False is not supported for Polars schema inference.

`normalise_json(column, **kwargs) -> pl.DataFrame | pl.Series`

Normalises each JSON string in the column against a single, inferred Avro schema. Ensures every row matches the same structure and datatypes.

Parameters:

column: Name of the column containing JSON strings
decode: If True, decode to native Polars types (default: True)
unnest: If decode=True, expand the decoded struct into separate columns (default: True)
ignore_outer_array: Treat top-level arrays as streams of objects (default: True)
ndjson: Treat input as newline-delimited JSON (default: False)
empty_as_null: Convert empty arrays/maps to null (default: True)
coerce_strings: Coerce numeric/boolean strings (e.g. "42", "true") into numbers/booleans where the schema expects them (default: False)
map_encoding: Encoding for Avro maps: "kv" (default), "mapping", or "entries"
map_threshold: Detect maps when object has more than N keys (default: 20)
force_field_types: Dict of per-field overrides ("map"/"record")
wrap_root: Control root wrapping.
- True → wrap using the column name
- str → wrap using the given name
- None → no wrapping (default)

Returns:

If decode=True:
- unnest=True → pl.DataFrame with one column per schema field
- unnest=False → pl.DataFrame with a single struct column
If decode=False → pl.Series of normalised JSON strings

Example:

df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})
out = df.genson.normalise_json("json_data", decode=False)  # decode=False returns the normalised JSON strings
print(out.to_list())
# ['{"labels": null}', '{"labels": {"en": "Hello"}}']

Examples

Working with Complex JSON

# Complex nested JSON with arrays of objects
df = pl.DataFrame({
    "complex_json": [
        '{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
        '{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
    ]
})

schema = df.genson.infer_polars_schema("complex_json")
print(schema)

Schema({
    'user': Struct({
        'profile': Struct({
            'name': String, 
            'preferences': Struct({'theme': String})
        })
    }),
    'posts': List(Struct({'likes': Int64, 'title': String})),
})

Using Inferred Schema

# You can use the inferred schema for validation or DataFrame operations
inferred_schema = df.genson.infer_polars_schema("json_data")

# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
    print(f"  {name}: {dtype}")

Contributing

This crate is part of the polars-genson project. See the main repository for the contribution and development docs.

License

MIT License

Contains a vendored and slightly adapted copy of the Apache 2.0 licensed fork of the genson-rs crate.

Release history

0.7.4

Jan 17, 2026

0.7.3

Jan 15, 2026

0.7.1

Jan 3, 2026

0.7.0

Oct 11, 2025

0.6.8

Oct 10, 2025

0.6.7

Oct 9, 2025

0.6.6

Oct 9, 2025

0.6.5

Oct 8, 2025

0.6.4

Oct 8, 2025

0.6.3

Oct 8, 2025

0.6.2

Oct 8, 2025

0.6.1

Oct 8, 2025

0.6.0

Oct 8, 2025

0.5.8

Oct 8, 2025

0.5.7

Oct 7, 2025

0.5.6

Oct 4, 2025

0.5.5

Oct 4, 2025

0.5.4

Oct 4, 2025

0.5.3

Oct 3, 2025

0.5.2

Oct 3, 2025

0.5.1

Oct 3, 2025

0.5.0

Oct 3, 2025

0.4.7

Oct 2, 2025

0.4.6

Oct 2, 2025

0.4.5

Oct 2, 2025

0.4.4

Oct 1, 2025

0.4.3

Oct 1, 2025

0.4.2

Sep 30, 2025

0.4.1

Sep 26, 2025

0.4.0

Sep 25, 2025

0.3.0

Sep 23, 2025

0.2.6

Sep 20, 2025

0.2.5

Sep 18, 2025

0.2.4

Sep 17, 2025

0.2.3

Sep 16, 2025

0.2.2

Sep 16, 2025

0.2.1

Sep 15, 2025

0.2.0

Sep 15, 2025

0.1.10

Sep 11, 2025

0.1.9

Sep 10, 2025

This version

0.1.8

Sep 10, 2025

0.1.7

Sep 9, 2025

0.1.6

Sep 9, 2025

0.1.5

Sep 9, 2025

0.1.4

Sep 8, 2025

0.1.2

Aug 20, 2025

0.1.1

Aug 20, 2025

0.1.0

Aug 19, 2025

Download files

Download the file for your platform.

Source Distribution

polars_genson-0.1.8.tar.gz (23.5 MB)

Uploaded Sep 10, 2025 Source

Built Distributions

polars_genson-0.1.8-cp39-abi3-win_amd64.whl (6.3 MB)

Uploaded Sep 10, 2025 CPython 3.9+, Windows x86-64

polars_genson-0.1.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.1 MB)

Uploaded Sep 10, 2025 CPython 3.9+, manylinux: glibc 2.17+ x86-64

polars_genson-0.1.8-cp39-abi3-macosx_11_0_arm64.whl (5.1 MB)

Uploaded Sep 10, 2025 CPython 3.9+, macOS 11.0+ ARM64

polars_genson-0.1.8-cp39-abi3-macosx_10_12_x86_64.whl (5.9 MB)

Uploaded Sep 10, 2025 CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file polars_genson-0.1.8.tar.gz.

File metadata

Download URL: polars_genson-0.1.8.tar.gz
Upload date: Sep 10, 2025
Size: 23.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: maturin/1.9.4

File hashes

Hashes for polars_genson-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`a92bfd421c1c8524d95c33ce2abc6e3a88d8c81a75336d06fc36a6779a1cf456`
MD5	`80968ce3fd128f3d59d1d47a1e2864c2`
BLAKE2b-256	`03c8f15f1470bfc775db4719152b0c03092f66c7039dec2a3e31d2ebc6b33aff`

File details

Details for the file polars_genson-0.1.8-cp39-abi3-win_amd64.whl.

File metadata

Download URL: polars_genson-0.1.8-cp39-abi3-win_amd64.whl
Upload date: Sep 10, 2025
Size: 6.3 MB
Tags: CPython 3.9+, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: maturin/1.9.4

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`e48ce551c90749fa1dbd729788e8bafe037bec81fe67265f0674efa3061ff8c3`
MD5	`d947cd49a438c15d484b6e4c3eb2b981`
BLAKE2b-256	`1d5023c2ed9d36b91843d6b25e8bce1a3cadb22e9179a5211c12ed4d4851a42b`

File details

Details for the file polars_genson-0.1.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: polars_genson-0.1.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Sep 10, 2025
Size: 6.1 MB
Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: maturin/1.9.4

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`47a4606509c7d74701823e171cfe3ca3353147913ad5afd9c6f2f35a6a7c5738`
MD5	`a93128b820e0681688c678c0ca460a93`
BLAKE2b-256	`5f578bc50319811e520517ab47ed6c39fbd4887b8c29c4f1969d5f0743858d5d`

File details

Details for the file polars_genson-0.1.8-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: polars_genson-0.1.8-cp39-abi3-macosx_11_0_arm64.whl
Upload date: Sep 10, 2025
Size: 5.1 MB
Tags: CPython 3.9+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: maturin/1.9.4

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`c7d00dcac9a15cfbfd736f7805dd169515163fb749627dab1e43e458c659a02b`
MD5	`bab0e1fedda1e870a5eb774a3488de66`
BLAKE2b-256	`4194d744c68999d62e0a59bbec733a956e6c6fb24d30738f2e50494e113097c2`

File details

Details for the file polars_genson-0.1.8-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

Download URL: polars_genson-0.1.8-cp39-abi3-macosx_10_12_x86_64.whl
Upload date: Sep 10, 2025
Size: 5.9 MB
Tags: CPython 3.9+, macOS 10.12+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: maturin/1.9.4

File hashes

Hashes for polars_genson-0.1.8-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm	Hash digest
SHA256	`e4b802b5d7104f4e7b902c9d77613e2076cdaf92790b43a88b04a205af3d9c11`
MD5	`d4e28dabb2c71175a8127a415f615a0e`
BLAKE2b-256	`28f735950be01b29c55ca7e47ea4f372856e2b4257bf6c47dc5709dfdaa5011c`

polars-genson 0.1.8
