A Polars plugin for JSON schema inference using genson-rs.
Polars Genson
A Polars plugin for working with JSON schemas. Infer schemas from JSON data and convert between JSON Schema and Polars schema formats.
Installation
pip install polars-genson[polars]
On older CPUs run:
pip install polars-genson[polars-lts-cpu]
Features
Schema Inference
- JSON Schema Inference: Generate JSON schemas from JSON strings in Polars columns
- Polars Schema Inference: Directly infer Polars data types and schemas from JSON data
- Multiple JSON Objects: Handle columns with varying JSON schemas across rows
- Complex Types: Support for nested objects, arrays, and mixed types
- Flexible Input: Support for both single JSON objects and arrays of objects
Schema Conversion
- Polars → JSON Schema: Convert existing DataFrame schemas to JSON Schema format
- JSON Schema → Polars: Convert JSON schemas to equivalent Polars schemas
- Round-trip Support: Full bidirectional conversion with validation
- Schema Manipulation: Validate, transform, and standardize schemas
Usage
The plugin adds a genson namespace to Polars DataFrames for schema inference and conversion.
import polars as pl
import polars_genson
import json
# Create a DataFrame with JSON strings
df = pl.DataFrame({
"json_data": [
'{"name": "Alice", "age": 30, "scores": [95, 87]}',
'{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
'{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}'
]
})
print("Input DataFrame:")
print(df)
shape: (3, 1)
┌─────────────────────────────────┐
│ json_data │
│ --- │
│ str │
╞═════════════════════════════════╡
│ {"name": "Alice", "age": 30, "… │
│ {"name": "Bob", "age": 25, "ci… │
│ {"name": "Charlie", "age": 35,… │
└─────────────────────────────────┘
JSON Schema Inference
# Infer JSON schema from the JSON column
schema = df.genson.infer_json_schema("json_data")
print("Inferred JSON schema:")
print(json.dumps(schema, indent=2))
{
"$schema": "http://json-schema.org/schema#",
"properties": {
"name": {
"type": "string"
},
"age": {
"type": "integer"
},
"scores": {
"items": {
"type": "integer"
},
"type": "array"
},
"city": {
"type": "string"
},
"active": {
"type": "boolean"
},
"metadata": {
"properties": {
"role": {
"type": "string"
}
},
"required": [
"role"
],
"type": "object"
}
},
"required": [
"age",
"name"
],
"type": "object"
}
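In the schema above, only age and name are listed under required, because they are the only keys present in every row. A minimal sketch of that merge rule follows, assuming a simplified model that looks at top-level keys only. The helper names row_schema and merge are hypothetical and this is not the plugin's implementation.

```python
import json

# Map Python types (as produced by json.loads) to JSON Schema type names.
TYPE_NAMES = {str: "string", bool: "boolean", int: "integer",
              float: "number", list: "array", dict: "object",
              type(None): "null"}


def row_schema(obj):
    """Describe one JSON object: top-level keys only, all marked required."""
    return {
        "properties": {key: {"type": TYPE_NAMES[type(value)]}
                       for key, value in obj.items()},
        "required": sorted(obj),
    }


def merge(schemas):
    """Union the properties; keep a key required only if every row has it."""
    properties = {}
    required = None
    for schema in schemas:
        # Simplification: a later row's type overwrites an earlier one.
        properties.update(schema["properties"])
        keys = set(schema["required"])
        required = keys if required is None else required & keys
    return {"properties": properties,
            "required": sorted(required or ()),
            "type": "object"}


rows = [
    '{"name": "Alice", "age": 30, "scores": [95, 87]}',
    '{"name": "Bob", "age": 25, "city": "NYC", "active": true}',
    '{"name": "Charlie", "age": 35, "metadata": {"role": "admin"}}',
]
merged = merge(row_schema(json.loads(row)) for row in rows)
print(merged["required"])  # ['age', 'name']
```

The sketch does not handle conflicting types for the same key or nested objects, both of which a real schema builder has to resolve.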
Polars Schema Inference
Directly infer Polars data types and schemas:
# Infer Polars schema from the JSON column
polars_schema = df.genson.infer_polars_schema("json_data")
print("Inferred Polars schema:")
print(polars_schema)
Schema({
'name': String,
'age': Int64,
'scores': List(Int64),
'city': String,
'active': Boolean,
'metadata': Struct({'role': String}),
})
The Polars schema inference automatically handles:
- ✅ Complex nested structures with proper Struct types
- ✅ Typed arrays like List(Int64), List(String)
- ✅ Mixed data types (integers, floats, booleans, strings)
- ✅ Optional fields present in some but not all objects
- ✅ Deep nesting with multiple levels of structure
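The correspondence between the JSON Schema shown earlier and the Polars schema printed above can be sketched as a small recursive mapping. This is only an illustration that returns dtype names as strings: the function dtype_name is hypothetical, and the number to Float64 mapping is an assumption, since the outputs above only show integer, string, boolean, array and object.

```python
def dtype_name(node):
    """Return a Polars-style dtype name for a JSON Schema node (illustrative)."""
    scalars = {
        "string": "String",
        "integer": "Int64",
        "number": "Float64",   # assumption: not shown in the outputs above
        "boolean": "Boolean",
    }
    kind = node.get("type")
    if kind == "array":
        # Arrays become List(<item dtype>).
        items = node.get("items", {})
        return "List(" + dtype_name(items) + ")"
    if kind == "object":
        # Objects become Struct({<field>: <dtype>, ...}).
        fields = ", ".join(
            repr(name) + ": " + dtype_name(sub)
            for name, sub in node.get("properties", {}).items()
        )
        return "Struct({" + fields + "})"
    return scalars.get(kind, "Unknown")


json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "scores": {"type": "array", "items": {"type": "integer"}},
        "metadata": {"type": "object",
                     "properties": {"role": {"type": "string"}}},
    },
}
columns = {name: dtype_name(sub)
           for name, sub in json_schema["properties"].items()}
print(columns["scores"])    # List(Int64)
print(columns["metadata"])  # Struct({'role': String})
```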
Normalisation
In addition to schema inference, polars-genson can normalise JSON columns so that every row conforms to a single, consistent Avro schema.
This is especially useful for semi-structured data where fields may be missing, empty arrays/maps may need to collapse to null, or numeric/boolean values may sometimes be encoded as strings.
Features
- Converts empty arrays/maps to null (default)
- Preserves empties with empty_as_null=False
- Ensures missing fields are inserted with null
- Supports per-field coercion of numeric/boolean strings via coerce_string=True
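The rules listed above can be sketched in plain Python. This is a simplified model, not the plugin's code: the helpers normalise_row and coerce are hypothetical, and the target field types are passed in by hand as field_kinds, whereas the plugin infers them from the data.

```python
import json


def coerce(value, kind, coerce_string):
    """Return value if it fits kind, optionally parsing strings, else None."""
    if kind == "int":
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        if coerce_string and isinstance(value, str):
            try:
                return int(value)
            except ValueError:
                return None
        return None
    if kind == "bool":
        if isinstance(value, bool):
            return value
        if coerce_string and isinstance(value, str) and value in ("true", "false"):
            return value == "true"
        return None
    return value  # kind "any": pass through unchanged


def normalise_row(row, field_kinds, empty_as_null=True, coerce_string=False):
    """Normalise one JSON string so it has exactly the keys in field_kinds."""
    obj = json.loads(row)
    out = {}
    for name, kind in field_kinds.items():
        value = obj.get(name)  # missing fields become None (JSON null)
        if empty_as_null and (value == [] or value == {}):
            value = None
        out[name] = coerce(value, kind, coerce_string)
    return json.dumps(out)


kinds = {"id": "int", "active": "bool"}
row = '{"id": "42", "active": "true"}'
print(normalise_row(row, kinds))                      # {"id": null, "active": null}
print(normalise_row(row, kinds, coerce_string=True))  # {"id": 42, "active": true}
print(normalise_row('{"labels": []}', {"labels": "any"}))  # {"labels": null}
```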
Example: Empty Arrays
df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})
out = df.genson.normalise_json("json_data")
print(out)
Output:
shape: (2, 1)
┌─────────────────────────────┐
│ normalised │
│ --- │
│ str │
╞═════════════════════════════╡
│ {"labels": null} │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘
Example: Preserving Empty Arrays
out = df.genson.normalise_json("json_data", empty_as_null=False)
print(out)
Output:
┌─────────────────────────────┐
│ normalised │
╞═════════════════════════════╡
│ {"labels": []} │
│ {"labels": {"en": "Hello"}} │
└─────────────────────────────┘
Example: String Coercion
df = pl.DataFrame({
"json_data": [
'{"id": "42", "active": "true"}',
'{"id": 7, "active": false}'
]
})
# Default: no coercion
print(df.genson.normalise_json("json_data").to_list())
# ['{"id": null, "active": null}', '{"id": 7, "active": false}']
# With coercion
print(df.genson.normalise_json("json_data", coerce_string=True).to_list())
# ['{"id": 42, "active": true}', '{"id": 7, "active": false}']
Advanced Usage
Per-Row Schema Processing
- Currently available only for JSON Schema inference (per-row/unmerged Polars schemas are TODO)
# Get individual schemas and process them
df = pl.DataFrame({
"ABCs": [
'{"a": 1, "b": 2}',
'{"a": 1, "c": true}',
]
})
# Analyze schema variations
individual_schemas = df.genson.infer_json_schema("ABCs", merge_schemas=False)
The result is a list of one schema per row. With merge_schemas=True you would
get all 3 keys (a, b, c) in a single schema.
[{'$schema': 'http://json-schema.org/schema#',
'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
'required': ['a', 'b'],
'type': 'object'},
{'$schema': 'http://json-schema.org/schema#',
'properties': {'a': {'type': 'integer'}, 'c': {'type': 'boolean'}},
'required': ['a', 'c'],
'type': 'object'}]
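One way to analyse the variation between rows is to count how often each key appears across the per-row schemas. The sketch below works on the list shown above, copied in as a literal so that it runs without the plugin; the variable names are illustrative.

```python
from collections import Counter

# The per-row schemas from the output above, copied as a literal.
individual_schemas = [
    {"$schema": "http://json-schema.org/schema#",
     "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
     "required": ["a", "b"],
     "type": "object"},
    {"$schema": "http://json-schema.org/schema#",
     "properties": {"a": {"type": "integer"}, "c": {"type": "boolean"}},
     "required": ["a", "c"],
     "type": "object"},
]

# Count in how many row schemas each property name occurs.
key_counts = Counter(
    key for schema in individual_schemas for key in schema["properties"]
)
n_rows = len(individual_schemas)
always_present = sorted(k for k, c in key_counts.items() if c == n_rows)
sometimes_present = sorted(k for k, c in key_counts.items() if c < n_rows)

print(always_present)     # ['a']
print(sometimes_present)  # ['b', 'c']
```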
JSON Schema Options
# Use the expression directly for more control
result = df.select(
polars_genson.infer_json_schema(
pl.col("json_data"),
merge_schemas=False, # Get individual schemas instead of merged
).alias("individual_schemas")
)
# Or use with different options
schema = df.genson.infer_json_schema(
"json_data",
ignore_outer_array=False, # Treat top-level arrays as arrays
ndjson=True, # Handle newline-delimited JSON
schema_uri="https://json-schema.org/draft/2020-12/schema", # Specify a schema URI
merge_schemas=True # Merge all schemas (default)
)
Polars Schema Options
# Infer Polars schema with options
polars_schema = df.genson.infer_polars_schema(
"json_data",
ignore_outer_array=True, # Treat top-level arrays as streams of objects
ndjson=False, # Not newline-delimited JSON
debug=False # Disable debug output
)
# Note: merge_schemas=False not yet supported for Polars schemas
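The effect of ignore_outer_array and ndjson on a single cell can be pictured as follows. This is a sketch based only on the option descriptions in this document. The helper iter_documents is hypothetical and the plugin's actual parsing may differ.

```python
import json


def iter_documents(cell, ignore_outer_array=True, ndjson=False):
    """Yield the JSON values that one string cell contributes to inference."""
    if ndjson:
        # Newline-delimited JSON: one document per non-empty line.
        chunks = [line for line in cell.splitlines() if line.strip()]
    else:
        chunks = [cell]
    for chunk in chunks:
        value = json.loads(chunk)
        if ignore_outer_array and isinstance(value, list):
            # Treat a top-level array as a stream of its elements.
            yield from value
        else:
            yield value


cell = '[{"a": 1}, {"a": 2}]'
print(list(iter_documents(cell)))
# [{'a': 1}, {'a': 2}]
print(list(iter_documents(cell, ignore_outer_array=False)))
# [[{'a': 1}, {'a': 2}]]
print(list(iter_documents('{"a": 1}\n{"b": 2}', ndjson=True)))
# [{'a': 1}, {'b': 2}]
```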
Method Reference
The genson namespace provides three main methods:
infer_json_schema(column, **kwargs) -> dict
Returns a JSON Schema (as a Python dict) following the JSON Schema specification.
Parameters:
- column: Name of the column containing JSON strings
- ignore_outer_array: Treat top-level arrays as streams of objects (default: True)
- ndjson: Treat input as newline-delimited JSON (default: False)
- schema_uri: Schema URI to embed in the output (default: "http://json-schema.org/schema#")
- merge_schemas: Merge schemas from all rows (default: True)
- map_threshold: Detect maps when object has more than N keys (default: 20)
- force_field_types: Explicitly force fields to "map" or "record"
- avro: Output Avro schema instead of JSON Schema (default: False)
- debug: Print debug information (default: False)
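The interaction between map_threshold and force_field_types can be sketched as below. The sketch applies only the key-count rule stated in the parameter description, and it assumes force_field_types is a mapping from field name to "map" or "record". Both the assumed shape and the helper classify_object are illustrative, and the plugin's real heuristic may take more into account.

```python
def classify_object(field_name, obj, map_threshold=20, force_field_types=None):
    """Decide whether an object-valued field is a "map" or a "record"."""
    forced = (force_field_types or {}).get(field_name)
    if forced in ("map", "record"):
        # An explicit override wins over the key-count heuristic.
        return forced
    # More than map_threshold keys: treat as a map with dynamic keys.
    return "map" if len(obj) > map_threshold else "record"


small = {"en": "Hello", "fr": "Bonjour"}
large = {"key_" + str(i): i for i in range(25)}

print(classify_object("labels", small))   # record
print(classify_object("counts", large))   # map
print(classify_object("labels", small, force_field_types={"labels": "map"}))  # map
```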
infer_polars_schema(column, **kwargs) -> pl.Schema
Returns a Polars schema with native data types for direct use in Polars.
Parameters:
- column: Name of the column containing JSON strings
- ignore_outer_array: Treat top-level arrays as streams of objects (default: True)
- ndjson: Treat input as newline-delimited JSON (default: False)
- map_threshold: Detect maps when object has more than N keys (default: 20)
- force_field_types: Explicitly force fields to "map" or "record"
- debug: Print debug information (default: False)
Note: merge_schemas=False is not yet supported for Polars schema inference.
normalise_json(column, **kwargs) -> pl.Series
Normalises each JSON string in the column against a globally inferred Avro schema. Every row is transformed to match the same schema, with consistent handling of missing fields, empty values, and type coercion.
Parameters:
- column: Name of the column containing JSON strings
- ignore_outer_array: Treat top-level arrays as streams of objects (default: True)
- ndjson: Treat input as newline-delimited JSON (default: False)
- empty_as_null: Convert empty arrays/maps to null (default: True)
- coerce_string: Coerce numeric/boolean strings to numbers/booleans (default: False)
- map_threshold: Detect maps when object has more than N keys (default: 20)
- force_field_types: Explicitly force fields to "map" or "record"
- debug: Print debug information (default: False)
Returns:
A new pl.Series of strings, one per input row, with each row normalised to the same Avro schema.
Example:
df = pl.DataFrame({"json_data": ['{"labels": []}', '{"labels": {"en": "Hello"}}']})
out = df.genson.normalise_json("json_data")
print(out.to_list())
# ['{"labels": null}', '{"labels": {"en": "Hello"}}']
Examples
Working with Complex JSON
# Complex nested JSON with arrays of objects
df = pl.DataFrame({
"complex_json": [
'{"user": {"profile": {"name": "Alice", "preferences": {"theme": "dark"}}}, "posts": [{"title": "Hello", "likes": 5}]}',
'{"user": {"profile": {"name": "Bob", "preferences": {"theme": "light"}}}, "posts": [{"title": "World", "likes": 3}, {"title": "Test", "likes": 1}]}'
]
})
schema = df.genson.infer_polars_schema("complex_json")
print(schema)
Schema({
'user': Struct({
'profile': Struct({
'name': String,
'preferences': Struct({'theme': String})
})
}),
'posts': List(Struct({'likes': Int64, 'title': String})),
})
Using Inferred Schema
# You can use the inferred schema for validation or DataFrame operations
inferred_schema = df.genson.infer_polars_schema("json_data")
# Use with other Polars operations
print(f"Schema has {len(inferred_schema)} fields:")
for name, dtype in inferred_schema.items():
print(f" {name}: {dtype}")
Contributing
This crate is part of the polars-genson project. See the main repository for the contribution and development docs.
License
MIT License
- Contains a vendored and slightly adapted copy of the Apache 2.0 licensed fork of the genson-rs crate