Spec-driven data sanitization for CSV, JSON, JSONL, XML, Parquet, and Python objects.

schema-sanitizer

Version 0.1.2: this project is still in a testing phase. Expect the core behavior to be exercised heavily before treating it as a stable production dependency.

The extension is currently being tuned and tested for generating Parquet files and schemas used by BigQuery external tables.

schema-sanitizer turns extremely messy semistructured data into stable, consistent tables. It is built for CSV, JSON, JSON Lines, XML, Parquet, and Python rows whose real-world values do not agree on one neat schema: fields appear late, arrays and objects change shape, timestamps arrive in several formats, scalars collide with nested values, and malformed records still need a place to go.

The library's main purpose is to make ingestion predictable before data reaches analytics engines, warehouses, or incremental pipelines. It scans source data, infers a reconciled Arrow schema, converts compatible values into that schema, and isolates rows that cannot be represented cleanly. The result is a table that downstream tools can consume without rediscovering schema drift on every run.

The hard parts are handled explicitly:

Turning messy semistructured data into tables: mixed scalar, list, struct, null, date/time, and string values are reconciled into stable columns.
Schema reconciliation for incremental pipelines: base_schema lets later batches align to a previous PyArrow schema. Additive mode keeps known field types while accepting newly observed fields, and strict mode is available when the schema must not drift.
Memory safety: readers and converters use bounded batches, streaming writers, spill-to-disk paths where needed, depth limits, row-size budgets, and quarantine output so large or malformed inputs do not require loading the whole cleaned dataset into memory.
Max depth enforcement: Arrow and Parquet depth budgets can cap deeply nested records before they exceed downstream limits such as warehouse nesting constraints.
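
The bounded-batch idea behind the memory-safety point can be pictured with a small stdlib-only sketch. It is not the library's implementation: `batch_by_budget` is a hypothetical helper, and estimating row size from its JSON encoding is an assumption made only for illustration.

```python
import json
from typing import Iterable, Iterator


def batch_by_budget(rows: Iterable[dict], budget_bytes: int) -> Iterator[list[dict]]:
    """Group rows into batches whose estimated size fits within budget_bytes.

    Hypothetical helper: size is estimated from the JSON encoding of each row.
    A single row larger than the budget is still emitted, alone in its batch.
    """
    batch: list[dict] = []
    used = 0
    for row in rows:
        size = len(json.dumps(row).encode("utf-8"))
        # Flush the current batch before it would exceed the budget.
        if batch and used + size > budget_bytes:
            yield batch
            batch, used = [], 0
        batch.append(row)
        used += size
    if batch:
        yield batch


rows = [{"id": i, "payload": "x" * 20} for i in range(5)]
batches = list(batch_by_budget(rows, budget_bytes=100))
print([len(b) for b in batches])
```

Because the helper is a generator, only one batch is resident at a time, which is the property that lets a streaming writer handle inputs larger than memory.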

Every public reader and converter returns a Result object with clean data, bad rows, and stats.

It has two public workflows:

In-Memory Analytics: read_* functions return a Result whose clean_data is PyArrow, pandas, Polars, or DuckDB data.
File-To-File Converters: to_* functions stream sanitized files to CSV, JSON Lines, or Parquet and return a Result whose clean_data is None.

import schema_sanitizer as ss

events = ss.read_jsonl("raw/events.jsonl")
customers = ss.read_csv("raw/customers.csv", output_format="pandas")

table = events.clean_data
df = customers.clean_data

ss.to_parquet("raw/events.jsonl", "clean/events.parquet")

Index

Install
In-Memory Analytics
File-To-File Converters
Result Object
Error Handling
Schema Control
Timestamp Precision
Custom Tokens and Date/Time Patterns
In-Memory Analytics Options
- read_csv(path, ...)
- read_json(path, ...)
- read_json_folder(path, ...)
- read_jsonl(path, ...)
- read_xml(path, ...)
- read_xml_folder(path, ...)
- read_parquet(path, ...)
- read_python(rows, ...)
File-To-File Converter Options
- to_csv(input_path, output_path, ...)
- to_jsonl(input_path, output_path, ...)
- to_parquet(input_path, output_path, ...)
Schema Inference Heuristics
Base Schema Enforcement
Max Depth Enforcement
Quarantine Rows Pipeline
Memory Safety Measures
PyArrow Filesystem Integration
Supported Inputs
Unsupported Inputs
Examples
Platform Notes
Development
License

Install

schema-sanitizer requires Python 3.11 or later.

For Arrow reads and file-to-file converters:

pip install 'schema-sanitizer[pyarrow]'

Install adapter extras for the in-memory analytics tools you use:

pip install 'schema-sanitizer[pyarrow,pandas]'
pip install 'schema-sanitizer[pyarrow,polars]'
pip install 'schema-sanitizer[pyarrow,duckdb]'
pip install 'schema-sanitizer[all]'

Import with an underscore:

import schema_sanitizer as ss

In-Memory Analytics

Use read_* when you want clean data back in Python with stats.

Function	Input	Typical use
`read_csv(path, ...)`	Local or PyArrow FS `.csv` file	Inspect or analyze CSV data.
`read_json(path, ...)`	Local or PyArrow FS `.json` file	Read JSON files into a table.
`read_json_folder(path, ...)`	Local or PyArrow FS folder of `.json` files	Read the `.json` files directly inside a folder (non-recursive) as JSONL rows.
`read_jsonl(path, ...)`	Local or PyArrow FS `.jsonl` / `.ndjson` file	Read JSON Lines or NDJSON event and log data.
`read_xml(path, ...)`	Local or PyArrow FS `.xml` file	Read XML documents through the native sanitizer pipeline.
`read_xml_folder(path, ...)`	Local or PyArrow FS folder of `.xml` files	Read the `.xml` files directly inside a folder (non-recursive) as XML document rows.
`read_parquet(path, ...)`	Local or PyArrow FS `.parquet` / `.pq` file	Read Parquet through the same cleaning pipeline.
`read_python(rows, ...)`	`list[dict]`	Clean rows already in memory.

Readers always return a Result. By default, result.clean_data is a PyArrow table.

result = ss.read_jsonl("data/events.jsonl")

print(result.clean_data.schema)
print(result.clean_data.num_rows)
print(result.stats)

Choose another in-memory analytics target with output_format.

pandas_result = ss.read_csv("data/customers.csv", output_format="pandas")
polars_result = ss.read_csv("data/customers.csv", output_format="polars")
duckdb_result = ss.read_csv("data/customers.csv", output_format="duckdb")

pandas_df = pandas_result.clean_data
polars_df = polars_result.clean_data
duckdb_rel = duckdb_result.clean_data

Accepted output_format values are pyarrow, pandas, polars, and duckdb.

Use read_python for rows that are already in memory.

rows = [
    {"id": 1, "active": "yes", "score": "10.5"},
    {"id": 2, "active": "no", "score": 8},
]

result = ss.read_python(
    rows,
    true_tokens=("yes",),
    false_tokens=("no",),
)

table = result.clean_data
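
As a rough mental model of what true_tokens and false_tokens do to a string column, the stdlib-only sketch below maps configured tokens to booleans. `coerce_bool` is a hypothetical name, and exact, case-sensitive matching is an assumption here; the library's actual matching rules may differ.

```python
def coerce_bool(value, true_tokens=(), false_tokens=()):
    """Map a string to True/False when it matches a configured token.

    Toy model: exact, case-sensitive matching is an assumption made here.
    Values that match no token are returned unchanged.
    """
    if isinstance(value, str):
        if value in true_tokens:
            return True
        if value in false_tokens:
            return False
    return value


rows = [
    {"id": 1, "active": "yes"},
    {"id": 2, "active": "no"},
    {"id": 3, "active": "maybe"},
]
flags = [
    coerce_bool(row["active"], true_tokens=("yes",), false_tokens=("no",))
    for row in rows
]
print(flags)
```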

File-To-File Converters

Use to_* when you want a sanitized output file and do not need clean data in memory. These functions stream sanitized output and return a Result with clean_data set to None, plus bad rows and stats.

Function	Output	Typical use
`to_csv(input_path, output_path, ...)`	CSV	Produce a flat file for spreadsheets or downstream text tools.
`to_jsonl(input_path, output_path, ...)`	JSON Lines	Produce one cleaned JSON object per line.
`to_parquet(input_path, output_path, ...)`	Parquet	Produce a typed columnar file for analytics systems.

result = ss.to_parquet("raw/orders.csv", "clean/orders.parquet")

assert result.clean_data is None
print(result.stats)

ss.to_csv("raw/events.jsonl", "clean/events.csv")
ss.to_jsonl("raw/orders.parquet", "clean/orders.jsonl")

Converters infer the input format from the input file extension. If the input path has no useful extension, pass input_format.

ss.to_parquet("raw/events", "clean/events.parquet", input_format="jsonl")

Accepted input_format values are auto, csv, json, jsonl, ndjson, xml, and parquet.
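
The extension-based selection described above can be sketched as a lookup table. This is an illustration rather than the library's code: `guess_input_format` is a hypothetical helper, the mapping is assembled from the extensions this README mentions, and lowercasing the extension and raising ValueError on an unknown one are assumptions.

```python
from pathlib import PurePath

# Extension-to-format table assembled from the extensions this README mentions.
_EXTENSION_FORMATS = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "ndjson",
    ".xml": "xml",
    ".parquet": "parquet",
    ".pq": "parquet",
}


def guess_input_format(path: str, input_format: str = "auto") -> str:
    """Return an explicit format, or guess one from the file extension."""
    if input_format != "auto":
        return input_format
    suffix = PurePath(path).suffix.lower()
    try:
        return _EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(
            f"cannot infer input format from {path!r}; pass input_format"
        ) from None


print(guess_input_format("raw/orders.csv"))
print(guess_input_format("raw/events", input_format="jsonl"))
```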

Result Object

All public read_* and to_* functions return schema_sanitizer.Result.

For readers, result.clean_data contains the requested clean in-memory output. For converters, clean data is written to output_path, so result.clean_data is always None.

result = ss.read_csv("data/customers.csv", output_format="pandas")

df = result.clean_data
stats = result.stats
bad_rows = result.bad_rows

Property or method	What it returns
`clean_data`	Clean data in the requested reader `output_format`: PyArrow table, pandas DataFrame, Polars DataFrame, or DuckDB relation. Always `None` for `to_*` converters.
`stats`	Dictionary of counters such as rows inferred, rows materialized, batches, skipped rows, quarantined rows, warnings, and errors.
`bad_rows`	Quarantined rows as a `pyarrow.Table`. The table may be empty when no rows were quarantined.

Result Stats

result.stats is a plain dict. All values are integers and default to 0 when the runtime did not report that counter.

Key	What it means
`inferred_rows`	Rows scanned while inferring the input schema.
`inferred_bytes`	Approximate input bytes scanned while inferring the schema.
`arrow_schema_depth`	Maximum Arrow container depth found during inference. Struct and list containers count; scalar leaves and top-level field wrappers do not.
`parquet_schema_depth`	Maximum Parquet/BigQuery RECORD depth found during inference. Struct containers count; list containers and scalar leaves do not.
`materialized_rows`	Clean rows materialized for `read_*` results or written by `to_*` converters.
`batches`	Number of output batches materialized or written.
`flattened_fields`	Nested fields flattened by the selected flattening options.
`scalar_wrappings`	Scalar values wrapped to fit list or struct-like output shapes.
`skipped_rows`	Rows dropped by `on_error="skip_row"`.
`quarantined_rows`	Rows dropped from clean output and stored in `result.bad_rows`.
`warnings`	Non-fatal warnings reported by the runtime.
`errors`	Fatal errors reported by the runtime.
`soft_errors`	Recoverable row or value errors handled by policy.
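
The difference between arrow_schema_depth and parquet_schema_depth is easier to see on a concrete value. The sketch below is one reading of the definitions in the table, applied to plain Python objects (dict standing in for a struct, list for a list); the function names are hypothetical and the library measures Arrow schemas, not Python values.

```python
def arrow_depth(value) -> int:
    """Count nested containers: both dicts (structs) and lists add one level."""
    if isinstance(value, dict):
        return 1 + max((arrow_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((arrow_depth(v) for v in value), default=0)
    return 0  # scalar leaf


def parquet_depth(value) -> int:
    """Count nested dicts (structs) only; lists do not add a level."""
    if isinstance(value, dict):
        return 1 + max((parquet_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((parquet_depth(v) for v in value), default=0)
    return 0  # scalar leaf


def row_depth(row: dict, depth_fn) -> int:
    """The row itself is the top-level field wrapper, so it is not counted."""
    return max((depth_fn(v) for v in row.values()), default=0)


row = {"id": 1, "items": [{"sku": "a", "tags": ["x", "y"]}]}
print(row_depth(row, arrow_depth), row_depth(row, parquet_depth))
```

Under this reading, `items` contributes list, struct, list to the Arrow depth but only the single struct to the Parquet/BigQuery RECORD depth.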

Error Handling

By default, rows that fail materialization are kept as null rows (on_error="emit_null_row"). Choose a different policy with on_error.

Policy	Behavior
`stop`	Raise an error as soon as a row cannot be processed.
`skip_row`	Drop bad rows from the output.
`emit_null_row`	Keep row count stable by emitting a null row.
`quarantine`	Drop bad rows from the output and keep them in `result.bad_rows`.

result = ss.read_jsonl(
    "data/events.jsonl",
    on_error="quarantine",
)

clean = result.clean_data
print(result.stats)

bad_rows = result.bad_rows
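
To make the four policies concrete, here is a toy model over plain Python rows. It is a sketch under stated assumptions, not the library's pipeline: `apply_policy` and `to_int_id` are hypothetical, a "bad row" is simply one whose conversion raises ValueError, and a null row is modelled as the same keys with None values.

```python
POLICIES = ("stop", "skip_row", "emit_null_row", "quarantine")


def apply_policy(rows, convert, on_error="emit_null_row"):
    """Toy model of the four on_error policies; returns (clean, bad_rows)."""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    clean, bad_rows = [], []
    for row in rows:
        try:
            clean.append(convert(row))
        except ValueError:
            if on_error == "stop":
                raise  # fail fast on the first bad row
            if on_error == "emit_null_row":
                clean.append({key: None for key in row})  # keep row count stable
            elif on_error == "quarantine":
                bad_rows.append(row)  # keep the original row for inspection
            # "skip_row": drop the row silently
    return clean, bad_rows


def to_int_id(row):
    """Hypothetical converter: the id must parse as an integer."""
    return {"id": int(row["id"])}


rows = [{"id": "1"}, {"id": "oops"}, {"id": "3"}]
for policy in ("skip_row", "emit_null_row", "quarantine"):
    print(policy, apply_policy(rows, to_int_id, on_error=policy))
```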

Converters return the same Result shape as readers. Because the clean data is written to output_path, result.clean_data is always None for converters.

result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    on_error="quarantine",
)

print(result.stats)
bad_rows = result.bad_rows
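
A pipeline will often want to fail a run when too many rows are quarantined. The helper below is a minimal sketch over a stats-shaped dict; `quarantine_ratio` is a hypothetical name, and treating materialized_rows and quarantined_rows as disjoint counts is an assumption based on the stats descriptions above.

```python
def quarantine_ratio(stats: dict) -> float:
    """Share of processed rows that were quarantined; missing counters count as 0."""
    quarantined = stats.get("quarantined_rows", 0)
    materialized = stats.get("materialized_rows", 0)
    total = quarantined + materialized
    return quarantined / total if total else 0.0


# Example stats dict with made-up numbers, shaped like result.stats.
stats = {"materialized_rows": 98, "quarantined_rows": 2}
if quarantine_ratio(stats) > 0.05:
    raise RuntimeError("too many quarantined rows; inspect result.bad_rows")
print(quarantine_ratio(stats))
```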

Schema Control

Pass base_schema when the output must match or evolve from an expected contract.

import pyarrow as pa
import schema_sanitizer as ss

schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=schema,
    schema_mode="strict",
    on_error="quarantine",
)

table = result.clean_data

Mode	Behavior
`strict`	Output exactly `base_schema`. Requires `base_schema`; inference is skipped.
`additive`	Keep `base_schema` field types and add newly observed fields.

column_order defaults to base_schema_first. Use column_order="sorted" for lexicographic field ordering.
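
The two schema modes and the two column orders can be modelled with plain dicts of field name to type name. This is only a toy: `reconcile` is hypothetical, real schemas are pyarrow.Schema objects, and placing newly observed fields after the base fields in observation order is an assumption about base_schema_first.

```python
def reconcile(base, observed, schema_mode="additive", column_order="base_schema_first"):
    """Toy reconciliation over {field_name: type_name} dicts (not real Arrow schemas)."""
    if schema_mode == "strict":
        # Output exactly the base fields; observed fields are ignored.
        merged = dict(base)
    elif schema_mode == "additive":
        merged = dict(base)  # base field types win
        for name, type_name in observed.items():
            merged.setdefault(name, type_name)  # only unseen fields are added
    else:
        raise ValueError(f"unknown schema_mode: {schema_mode!r}")
    if column_order == "sorted":
        return dict(sorted(merged.items()))
    return merged


base = {"id": "int64", "email": "string"}
observed = {"id": "string", "country": "string", "age": "int64"}

print(reconcile(base, observed))
print(reconcile(base, observed, column_order="sorted"))
print(reconcile(base, observed, schema_mode="strict"))
```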

Timestamp Precision

Timestamp strings are parsed internally with nanosecond precision, then written to the output Arrow schema using timestamp_precision.

result = ss.read_jsonl(
    "data/events.jsonl",
    timestamp_precision="TIMESTAMP_MICROS",
)

ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    timestamp_precision="TIMESTAMP_MICROS",
)

Accepted values are TIMESTAMP_MILLIS, TIMESTAMP_MICROS, and TIMESTAMP_NANOS. The default is TIMESTAMP_MICROS because it is compatible with BigQuery Parquet external tables. Selecting TIMESTAMP_NANOS preserves nanosecond Arrow/Parquet timestamps, but some downstream engines, including BigQuery, do not support Parquet TIMESTAMP_NANOS.

When parsed timestamp strings contain finer precision than the selected output unit, the value is truncated to that unit. Integer values coerced into timestamp fields are interpreted as already being in the selected output unit.
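
Both rules come down to integer arithmetic on epoch offsets. The sketch below uses hypothetical helper names and non-negative values only; how the library rounds pre-1970 timestamps is not stated here, so that case is left out.

```python
from datetime import datetime, timedelta, timezone

UNITS_PER_SECOND = {
    "TIMESTAMP_MILLIS": 1_000,
    "TIMESTAMP_MICROS": 1_000_000,
    "TIMESTAMP_NANOS": 1_000_000_000,
}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def truncate_nanos(epoch_nanos: int, timestamp_precision: str = "TIMESTAMP_MICROS") -> int:
    """Convert non-negative epoch nanoseconds to the output unit, dropping finer digits."""
    nanos_per_unit = 1_000_000_000 // UNITS_PER_SECOND[timestamp_precision]
    return epoch_nanos // nanos_per_unit


def integer_as_datetime(value: int, timestamp_precision: str = "TIMESTAMP_MICROS") -> datetime:
    """Interpret an integer as a count of output units since the epoch."""
    micros = value * 1_000_000 // UNITS_PER_SECOND[timestamp_precision]
    return EPOCH + timedelta(microseconds=micros)


parsed = 1_700_000_000_123_456_789  # nanoseconds since the epoch
print(truncate_nanos(parsed, "TIMESTAMP_MICROS"))
print(truncate_nanos(parsed, "TIMESTAMP_MILLIS"))

# An epoch-seconds integer read as microseconds lands in 1970, not 2023.
print(integer_as_datetime(1_700_000_000, "TIMESTAMP_MICROS"))
```

The last line shows why integer columns holding epoch seconds should be rescaled before they are coerced into a timestamp field.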

Custom Tokens and Date/Time Patterns

Use true_tokens and false_tokens when boolean values use domain-specific strings. Use temporal regex options when dates or times do not match the built-in parsers.

result = ss.read_csv(
    "data/events.csv",
    true_tokens=("yes", "enabled", "1"),
    false_tokens=("no", "disabled", "0"),
    timestamp_patterns=(
        r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$",
    ),
    date_patterns=(
        r"^(\d{4})\.(\d{2})\.(\d{2})$",
    ),
    time_patterns=(
        r"^(\d{2})h(\d{2})m(\d{2})s$",
    ),
)

table = result.clean_data

For timestamp_patterns, capture groups 1-6 are year, month, day, hour, minute, and second. Optional group 7 may contain fractions, and group 8 may contain a timezone. For date_patterns, groups 1-3 are year, month, and day. For time_patterns, groups 1-3 are hour, minute, and second.
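
The group layout can be checked with Python's own re module before a pattern is handed to the library. The sketch below extends the slash-separated pattern from the example with optional groups 7 and 8; it assumes group 7 captures fraction digits without the dot and group 8 captures `Z` or a `+hh:mm` offset, and it says nothing about which regex dialect the library itself uses. `parse_timestamp` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

PATTERN = re.compile(
    r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})"
    r"(?:\.(\d{1,9}))?(Z|[+-]\d{2}:\d{2})?$"
)


def parse_timestamp(text: str):
    """Build a datetime from groups 1-6, optional fraction (7) and timezone (8)."""
    match = PATTERN.match(text)
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    fraction, zone = match.group(7), match.group(8)
    # datetime stores microseconds, so keep the first six fraction digits.
    micros = int(fraction.ljust(9, "0")[:6]) if fraction else 0
    tzinfo = None
    if zone == "Z":
        tzinfo = timezone.utc
    elif zone:
        sign = 1 if zone[0] == "+" else -1
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
        tzinfo = timezone(sign * offset)
    return datetime(year, month, day, hour, minute, second, micros, tzinfo=tzinfo)


print(parse_timestamp("2024/03/05 14:30:07"))
print(parse_timestamp("2024/03/05T14:30:07.123456789+02:00"))
print(parse_timestamp("05-03-2024"))
```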

In-Memory Analytics Options

Each reader accepts the parameters listed in its section.

`read_csv(path, ...)`

Parameter	Default	Accepted values	What it controls
`path`	required	`str` or path-like object	Local or PyArrow FS CSV file to read.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers. Groups 1-6 map to year, month, day, hour, minute, second; group 7 may hold fractions and group 8 timezone.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers. Groups 1-3 map to year, month, day.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers. Groups 1-3 map to hour, minute, second.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`csv_has_header`	`True`	`bool`	Whether the first CSV row is a header.
`csv_delimiter`	`,`	single-character string	CSV delimiter.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode CSV bytes.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for streaming CSV reads.

`read_json(path, ...)`

Parameter	Default	Accepted values	What it controls
`path`	required	`str` or path-like object	Local or PyArrow FS JSON file to read.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode JSON bytes.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for streaming JSON reads.
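
The scalar_object_key option matters when one field holds objects in some records and bare scalars in others. The sketch below shows one plausible shape of that wrapping on plain Python values; `wrap_scalar_collisions` is hypothetical and the library's actual reconciliation rules may be richer.

```python
def wrap_scalar_collisions(values, scalar_object_key="default_key"):
    """If any value is an object (dict), wrap the other non-null values as objects too."""
    if not any(isinstance(v, dict) for v in values):
        return list(values)  # no collision: leave scalars alone
    return [
        v if isinstance(v, dict) or v is None else {scalar_object_key: v}
        for v in values
    ]


# The "address" field across three records: object, bare string, null.
addresses = [{"city": "Oslo"}, "unknown", None]
print(wrap_scalar_collisions(addresses))
```

After wrapping, every non-null value is an object, so the column can be typed as a single struct.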

`read_json_folder(path, ...)`

read_json_folder reads the direct .json children of a local folder or PyArrow filesystem folder URI in deterministic filename order. Folder exploration is not recursive. Each source file must contain one JSON document; the reader compacts those documents into a temporary JSON Lines stream and then runs the same sanitizer path used by read_json.

Parameter	Default	Accepted values	What it controls
`path`	required	`str` or path-like object	Local folder or PyArrow FS folder URI containing `.json` files.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode each source JSON file.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-document and per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for the compacted JSON Lines stream.
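
The folder compaction step described for read_json_folder can be approximated with the standard library. This is a sketch, not the library's code: `compact_json_folder` is a hypothetical helper, it assumes "deterministic filename order" means sorted by name, and it loads each document fully, which the native reader may avoid.

```python
import json
import tempfile
from pathlib import Path


def compact_json_folder(folder, out_path) -> int:
    """Write each direct .json child of folder as one line of out_path; return the count."""
    files = sorted(
        p for p in Path(folder).iterdir() if p.is_file() and p.suffix == ".json"
    )
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            document = json.loads(path.read_text(encoding="utf-8"))
            # Compact separators keep each document on a single line.
            out.write(json.dumps(document, separators=(",", ":")) + "\n")
    return len(files)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "src")
    (src / "nested").mkdir(parents=True)
    (src / "b.json").write_text('{"id": 2}', encoding="utf-8")
    (src / "a.json").write_text('{\n  "id": 1\n}', encoding="utf-8")
    (src / "nested" / "c.json").write_text('{"id": 3}', encoding="utf-8")  # not a direct child
    (src / "notes.txt").write_text("ignored", encoding="utf-8")
    out_file = Path(tmp, "compact.jsonl")
    count = compact_json_folder(src, out_file)
    lines = out_file.read_text(encoding="utf-8").splitlines()

print(count, lines)
```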

`read_jsonl(path, ...)`

Parameter	Default	Accepted values	What it controls
`path`	required	`str` or path-like object	Local or PyArrow FS JSON Lines or NDJSON file to read.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode JSON Lines or NDJSON bytes.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for streaming JSON Lines or NDJSON reads.

`read_xml(path, ...)`

read_xml parses a local XML document in the native C++ frontend and sends the resulting rows through the same schema inference, cleaning, quarantine, and output adapter pipeline as the JSON and CSV readers.

By default, the root element is treated as one row, like a single JSON object. Pass xml_row_tag="row" when a file contains repeated direct child elements that should become separate rows; the XML scanner then streams each matching row element. Attributes become fields prefixed with @, repeated child tags become lists, and mixed element text is stored under #text.

result = ss.read_xml(
    "raw/orders.xml",
    xml_row_tag="order",
    read_chunk_bytes=1024 * 1024,
    batch_memory_limit_bytes=256 * 1024 * 1024,
)
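
The attribute, repeated-tag, and #text conventions can be tried out with xml.etree.ElementTree. The sketch below is a non-streaming approximation with hypothetical function names; details such as returning a bare string for a leaf element without attributes are assumptions, not documented behavior of the native frontend.

```python
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Convert an element to a scalar string or a dict, following the conventions above."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text or None  # plain leaf element
    row = {f"@{name}": value for name, value in element.attrib.items()}
    grouped = {}
    for child in children:
        grouped.setdefault(child.tag, []).append(element_to_value(child))
    for tag, values in grouped.items():
        # Repeated child tags become lists; a single child stays a single value.
        row[tag] = values if len(values) > 1 else values[0]
    if text:
        row["#text"] = text
    return row


def rows_for_tag(xml_text: str, xml_row_tag: str):
    """Return one row per direct child of the root whose tag matches xml_row_tag."""
    root = ET.fromstring(xml_text)
    return [element_to_value(child) for child in root if child.tag == xml_row_tag]


doc = """<orders>
  <order id="1"><item>pen</item><item>ink</item><note priority="high">rush</note></order>
  <order id="2"><item>cup</item></order>
</orders>"""
rows = rows_for_tag(doc, "order")
print(rows)
```

Note that `item` is a list in the first row and a scalar in the second; this is the kind of shape drift the schema reconciliation step exists to absorb.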

Parameter	Default	Accepted values	What it controls
`path`	required	`str` or path-like object	Local or PyArrow FS XML file to read.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode XML bytes when transcoding is needed.
`xml_row_tag`	`None`	XML element tag name or `None`	Direct child element tag to stream as separate rows. `None` treats the whole document as one row.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for streaming text input reads.

`read_xml_folder(path, ...)`

read_xml_folder reads the direct .xml children of a local folder or PyArrow filesystem folder URI in deterministic filename order. Folder exploration is not recursive. Each source file must contain one XML document, and all documents must use the same root tag unless you pass that tag explicitly as xml_row_tag. The reader wraps those documents in a temporary XML stream and then runs the same sanitizer path used by read_xml.

result = ss.read_xml_folder(
    "raw/order-events",
    xml_row_tag="order",
    batch_memory_limit_bytes=256 * 1024 * 1024,
)

Parameter	Default	Accepted values	What it controls
`path`	required	`str` or path-like object	Local folder or PyArrow FS folder URI containing `.xml` files.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode each source XML file.
`xml_row_tag`	`None`	XML element tag name or `None`	Expected XML document root tag. `None` infers it from the first file.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-document-row memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for the wrapped XML stream.

`read_parquet(path, ...)`

Parameter	Default	Accepted values	What it controls
`path`	required	`str` or path-like object	Local or PyArrow FS Parquet file to read.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.

`read_python(rows, ...)`

Parameter	Default	Accepted values	What it controls
`rows`	required	`list[dict]`	In-memory rows to normalize.
`output_format`	`pyarrow`	`pyarrow`, `pandas`, `polars`, `duckdb`	Type stored in `Result.clean_data`.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort memory budget for the already-resident Python payload.

File-To-File Converter Options

Converter input and output paths can be local paths or PyArrow FS URI strings. Converters infer the input format from the input file extension unless you pass input_format.

`to_csv(input_path, output_path, ...)`

Parameter	Default	Accepted values	What it controls
`input_path`	required	`str` or path-like object	Local file or PyArrow FS URI to sanitize.
`output_path`	required	`str` or path-like object	Local or PyArrow FS URI CSV file to create.
`input_format`	`auto`	`auto`, `csv`, `json`, `jsonl`, `ndjson`, `xml`, `parquet`	Input format selector.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`csv_has_header`	`True`	`bool`	Whether CSV input has a header.
`csv_delimiter`	`,`	single-character string	CSV input delimiter.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input.
`xml_row_tag`	`None`	XML element tag name or `None`	Direct child XML element tag to stream as separate rows when reading XML input.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for streaming text input reads.

`to_jsonl(input_path, output_path, ...)`

Parameter	Default	Accepted values	What it controls
`input_path`	required	`str` or path-like object	Local file or PyArrow FS URI to sanitize.
`output_path`	required	`str` or path-like object	Local or PyArrow FS URI JSON Lines file to create.
`input_format`	`auto`	`auto`, `csv`, `json`, `jsonl`, `ndjson`, `xml`, `parquet`	Input format selector.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`csv_has_header`	`True`	`bool`	Whether CSV input has a header.
`csv_delimiter`	`,`	single-character string	CSV input delimiter.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input.
`xml_row_tag`	`None`	XML element tag name or `None`	Direct child XML element tag to stream as separate rows when reading XML input.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for streaming text input reads.

`to_parquet(input_path, output_path, ...)`

Parameter	Default	Accepted values	What it controls
`input_path`	required	`str` or path-like object	Local file or PyArrow FS URI to sanitize.
`output_path`	required	`str` or path-like object	Local or PyArrow FS URI Parquet file to create.
`input_format`	`auto`	`auto`, `csv`, `json`, `jsonl`, `ndjson`, `xml`, `parquet`	Input format selector.
`base_schema`	`None`	`pyarrow.Schema` or `None`	Optional base output contract.
`schema_mode`	`additive`	`additive`, `strict`	How inferred fields reconcile with `base_schema`.
`column_order`	`base_schema_first`	`base_schema_first`, `sorted`	Output field ordering.
`timestamp_precision`	`TIMESTAMP_MICROS`	`TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `TIMESTAMP_NANOS`	Output Arrow/Parquet timestamp unit.
`parse_integers`	`True`	`bool`	Parse integer-looking strings as integers.
`parse_floats`	`True`	`bool`	Parse float-looking strings as floats.
`true_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `true`.
`false_tokens`	`()`	sequence of strings	String tokens interpreted as boolean `false`.
`timestamp_patterns`	`()`	sequence of regex strings	Extra timestamp parsers.
`date_patterns`	`()`	sequence of regex strings	Extra date parsers.
`time_patterns`	`()`	sequence of regex strings	Extra time parsers.
`arrow_max_depth`	`32`	integer `>= 0`	Maximum Arrow container depth for object and array expansion.
`parquet_max_depth`	`15`	integer `>= 0`	Maximum Parquet/BigQuery RECORD depth for object expansion.
`scalar_object_key`	`default_key`	string	Key used when a scalar must be wrapped as an object.
`csv_has_header`	`True`	`bool`	Whether CSV input has a header.
`csv_delimiter`	`,`	single-character string	CSV input delimiter.
`input_text_encoding`	`utf-8`	text encoding name	Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input.
`xml_row_tag`	`None`	XML element tag name or `None`	Direct child XML element tag to stream as separate rows when reading XML input.
`on_error`	`emit_null_row`	`stop`, `skip_row`, `emit_null_row`, `quarantine`	Row-level error policy.
`batch_memory_limit_bytes`	`None`	positive integer bytes or `None`	Best-effort per-batch memory budget.
`read_chunk_bytes`	`1048576`	positive integer bytes	Chunk size for streaming text input reads.

Schema Inference Heuristics

Schema inference scans the full source before materialization whenever inference runs. It is not a sample-based inference step: in inferred mode and additive base_schema mode, every source row is consumed during inference and counted in Result.stats["inferred_rows"].

For each inferred row, the sanitizer applies two internal passes:

The shape pass discovers structural paths: field names, objects, arrays, and fields that must be flattened by depth limits.
The statistics pass collects scalar type evidence for the discovered shape: booleans, integers, floats, timestamps, dates, times, strings, nulls, and mixed-type conflicts.

When schema_mode="strict" is used with an explicit base_schema, the sanitizer skips inference and uses the schema contract directly; in that fast path, inferred_rows is 0. Strict mode only works with base_schema. Passing schema_mode="strict" without base_schema raises an exception.

Separating shape discovery from scalar statistics keeps list and struct decisions stable across messy inputs. If one row has an object and another row has a scalar at the same field, the structural shape wins and the scalar is wrapped under scalar_object_key (default_key by default). If one row has a list and another row has a scalar at the same field, the list shape wins and the scalar is wrapped as a single list element.
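The two wrapping rules above can be sketched in plain Python. This is only an illustration of the described behavior, not the library's implementation: `wrap_for_shape` is a hypothetical helper, the shape labels `"struct"` and `"list"` are invented for the sketch, and passing nulls through unchanged is an assumption.

```python
from typing import Any


def wrap_for_shape(value: Any, shape: str, scalar_object_key: str = "default_key") -> Any:
    """Coerce a scalar to the structural shape that won for its field."""
    if value is None:
        return None  # assumption: nulls pass through unchanged
    is_scalar = not isinstance(value, (dict, list))
    if shape == "struct" and is_scalar:
        # Object shape wins: the scalar is wrapped under scalar_object_key.
        return {scalar_object_key: value}
    if shape == "list" and is_scalar:
        # List shape wins: the scalar becomes a single list element.
        return [value]
    # Values that already are containers are left untouched in this sketch.
    return value


rows = [{"owner": {"name": "Ada"}}, {"owner": "Ada"}]
wrapped = [wrap_for_shape(row["owner"], "struct") for row in rows]
print(wrapped)
```

Both rows end up with an object at `owner`, so the column can keep one struct type.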

Scalar inference is conservative:

Nulls do not choose a type by themselves.
Boolean JSON values infer bool.
Numeric JSON values infer int64 or float64.
Strings can infer booleans, integers, floats, timestamps, dates, or times when the configured token and parser options match.
Mixed scalar kinds fall back to string.
Objects or arrays observed where a scalar is required are stringified.

List inference is stricter than top-level object inference. Lists remain typed only when their element shape is conflict-free. Lists of scalars and lists of structs are supported; nested lists or conflicts inside a list element fall back to list<string> so each list column has one stable element type.
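As a rough illustration of the scalar and list rules above, the sketch below classifies already-parsed JSON values. It is not the library's code: the function names are hypothetical, string parsing (tokens, timestamps, numeric strings) is left out, and conflicts between struct fields inside list elements are not modeled. The text does not say how an integer/float mix is treated, so the sketch simply treats any two distinct kinds as mixed; returning `list<string>` when no element evidence exists is also an assumption.

```python
from typing import Any, Iterable, List, Optional


def scalar_kind(value: Any) -> Optional[str]:
    """Classify one parsed JSON value; None means "no type evidence"."""
    if value is None:
        return None
    if isinstance(value, bool):  # bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    return "container"  # dict or list


def infer_scalar_column(values: Iterable[Any]) -> Optional[str]:
    kinds = {kind for kind in map(scalar_kind, values) if kind is not None}
    if not kinds:
        return None  # only nulls seen: nulls do not choose a type
    if len(kinds) == 1 and "container" not in kinds:
        return next(iter(kinds))
    # Mixed kinds, or containers where a scalar is required: use string.
    return "string"


def infer_list_column(lists: Iterable[List[Any]]) -> str:
    kinds = set()
    for items in lists:
        for item in items:
            if isinstance(item, list):
                return "list<string>"  # nested lists fall back
            if isinstance(item, dict):
                kinds.add("struct")
            elif scalar_kind(item) is not None:
                kinds.add(scalar_kind(item))
    if len(kinds) == 1:
        return f"list<{next(iter(kinds))}>"
    # Conflicting element kinds (or, by assumption, no evidence at all).
    return "list<string>"


print(infer_scalar_column([1, None, 2]))
print(infer_list_column([[1], ["a"]]))
```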

Base Schema Enforcement

base_schema is an output contract and only accepts a pyarrow.Schema. The sanitizer converts it to the same internal logical schema representation used by inference before planning materialization.

import pyarrow as pa
import schema_sanitizer as ss

user_schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=user_schema,
    schema_mode="strict",
)

schema_mode="strict" uses base_schema as the complete output schema. The inference loop is skipped, so Result.stats["inferred_rows"] is 0. Strict mode only works when base_schema is provided; otherwise the call raises an exception before materialization. Extra source fields are rejected because they are not present in the strict contract.

schema_mode="additive" requires inference. The full source is scanned, then the inferred schema is reconciled with base_schema: fields already present in base_schema keep their declared types and nullable flags, while newly observed fields are added from inference. Row values that cannot be coerced into the declared base field type are handled by on_error.

For fields present in base_schema, the base type wins even when source rows contain conflicting values. A value such as "unknown" in a base int64 field, an object in a base scalar field, or a scalar in a base struct/list field is a materialization conflict. Depending on on_error, the conflict stops processing, or the row is skipped, quarantined, or replaced with a null row. For fields not present in base_schema, conflicts are resolved by the normal inference heuristics before the field is added: mixed scalar kinds fall back to string, object/scalar and list/scalar conflicts use the wrapping rules, and conflicting list element shapes fall back to list<string>.

column_order controls only output field order after reconciliation. column_order="base_schema_first" preserves base fields first, then appends new fields. column_order="sorted" emits fields in lexicographic order.
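The additive reconciliation and the two column orders can be pictured with plain dictionaries that map field names to type names. This is a simplified sketch under stated assumptions: `reconcile` is a hypothetical helper, real schemas are pyarrow.Schema objects with nullability, and appending new fields in the order they were observed is an assumption.

```python
from typing import Dict


def reconcile(
    base: Dict[str, str],
    inferred: Dict[str, str],
    column_order: str = "base_schema_first",
) -> Dict[str, str]:
    merged = dict(base)  # base fields keep their declared types
    for name, dtype in inferred.items():
        merged.setdefault(name, dtype)  # only newly observed fields are added
    if column_order == "base_schema_first":
        return merged  # base fields first, new fields appended
    if column_order == "sorted":
        return dict(sorted(merged.items()))  # lexicographic by field name
    raise ValueError(f"unknown column_order: {column_order!r}")


base = {"id": "int64", "email": "string"}
inferred = {"email": "int64", "age": "int64", "id": "string"}

print(reconcile(base, inferred))
print(list(reconcile(base, inferred, column_order="sorted")))
```

Note that the inferred `email: int64` does not override the base `email: string`; only `age` is new.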

Max Depth Enforcement

Depth enforcement uses two independent limits because Arrow and Parquet/BigQuery count nested data differently:

arrow_max_depth defaults to 32. It counts Arrow container depth: struct and list containers count, while scalar leaves and top-level field wrappers do not.
parquet_max_depth defaults to 15. It counts Parquet/BigQuery RECORD depth: struct containers count, while list containers, scalar leaves, and top-level field wrappers do not.

The sanitizer flattens a named field to <name>_flattened when keeping that field's full nested value would exceed either limit. The flattened value is stored as a string.

Depth examples:

Shape	`arrow_schema_depth`	`parquet_schema_depth`
`id: int64`	0	0
`user: struct<id: int64>`	1	1
`tags: list<string>`	1	0
`authors: list<struct<name: string>>`	2	1
`asset: struct<authors: list<struct<name: string>>>`	3	2

Use arrow_max_depth as a defensive complexity limit for Arrow/Parquet container nesting. Use parquet_max_depth=15 when the output Parquet will be read by BigQuery external tables, where the practical limit is nested RECORD depth rather than physical list wrapper depth.

The reported Result.stats["arrow_schema_depth"] and Result.stats["parquet_schema_depth"] use the same counting rules as the enforcement options.
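The counting rules can be checked against the depth examples with a small recursive function. This is a sketch over a hypothetical mini type model (strings for scalars, tuples for containers), not the library's code and not PyArrow types.

```python
from typing import Tuple, Union

# Hypothetical mini type model used only for this sketch:
#   scalar -> a type name string such as "int64"
#   list   -> ("list", element_type)
#   struct -> ("struct", {field_name: field_type})
TypeSpec = Union[str, Tuple[str, object]]


def schema_depth(spec: TypeSpec, count_lists: bool) -> int:
    """Depth of the deepest container chain; scalar leaves count as 0."""
    if isinstance(spec, str):
        return 0
    kind, inner = spec
    if kind == "list":
        return (1 if count_lists else 0) + schema_depth(inner, count_lists)
    if kind == "struct":
        deepest = max(
            (schema_depth(child, count_lists) for child in inner.values()),
            default=0,
        )
        return 1 + deepest
    raise ValueError(f"unknown container kind: {kind!r}")


def arrow_depth(spec: TypeSpec) -> int:
    return schema_depth(spec, count_lists=True)  # structs and lists count


def parquet_depth(spec: TypeSpec) -> int:
    return schema_depth(spec, count_lists=False)  # only structs count


authors = ("list", ("struct", {"name": "string"}))
shapes = {
    "id": "int64",
    "user": ("struct", {"id": "int64"}),
    "tags": ("list", "string"),
    "authors": authors,
    "asset": ("struct", {"authors": authors}),
}
depths = {name: (arrow_depth(s), parquet_depth(s)) for name, s in shapes.items()}
print(depths)
```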

Quarantine Rows Pipeline

Use on_error="quarantine" when you want clean output to continue while keeping failed rows for inspection or replay. Rows that fail materialization are dropped from clean_data or the converter output file and appended to Result.bad_rows.

bad_rows is a PyArrow table with diagnostic metadata:

Column	What it contains
`row_index`	Zero-based source row index.
`source_offset`	Byte offset or source-relative offset when available.
`code` / `code_str`	Machine-readable diagnostic code.
`path_id`	Internal field path id associated with the failure.
`detail`	Human-readable error detail.
`context_snippet`	Short preview of the offending source row.
`raw_row`	Full raw source row text when available.

For in-memory reads, inspect result.bad_rows directly:

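import schema_sanitizer as ss

# event_schema is assumed to be a pyarrow.Schema defined earlier,
# built the same way as user_schema in the strict-mode example.
result = ss.read_jsonl(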
    "data/events.jsonl",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

clean = result.clean_data
bad_rows = result.bad_rows
print(result.stats["quarantined_rows"])

For file-to-file converters, the clean output is written to output_path and the same Result.bad_rows table carries quarantined rows:

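# event_schema is the same pyarrow.Schema used for the in-memory read.
# For converters, result.clean_data is None; clean rows go to the output file.
result = ss.to_parquet(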
    "raw/events.jsonl",
    "clean/events.parquet",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

bad_rows = result.bad_rows

Quarantine is row-level. If one field in a row cannot be coerced into the output schema, the whole row is excluded from clean output and recorded once in bad_rows. In contrast, on_error="skip_row" drops the row without retaining it, and on_error="emit_null_row" keeps row count stable by writing a null row instead of recording it in bad_rows.
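The difference between the four on_error policies can be sketched with a toy row loop. This is illustrative only: `materialize` and `to_int_id` are hypothetical helpers, the null row is represented as `None`, and only a few of the bad_rows columns are modeled.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def materialize(
    rows: List[Row],
    convert: Callable[[Row], Row],
    on_error: str = "emit_null_row",
) -> Tuple[List[Any], List[Row]]:
    clean: List[Any] = []
    bad_rows: List[Row] = []
    for row_index, row in enumerate(rows):  # zero-based, like row_index
        try:
            clean.append(convert(row))
        except (TypeError, ValueError) as exc:
            if on_error == "stop":
                raise  # abort the whole run
            if on_error == "skip_row":
                continue  # drop the row without retaining it
            if on_error == "emit_null_row":
                clean.append(None)  # keep the row count stable
            elif on_error == "quarantine":
                bad_rows.append(
                    {"row_index": row_index, "detail": str(exc), "raw_row": row}
                )
            else:
                raise ValueError(f"unknown on_error policy: {on_error!r}")
    return clean, bad_rows


def to_int_id(row: Row) -> Row:
    return {"id": int(row["id"])}  # int("unknown") raises ValueError


rows = [{"id": "1"}, {"id": "unknown"}, {"id": "3"}]
clean, bad_rows = materialize(rows, to_int_id, on_error="quarantine")
print(clean)
print([bad["row_index"] for bad in bad_rows])
```

With `on_error="quarantine"` the failing row is recorded once and excluded from the clean output; the other policies differ only in what happens inside the `except` branch.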

Memory Safety Measures

The sanitizer is designed to process large local files and PyArrow filesystem URI inputs without requiring the whole clean dataset to live in Python memory.

File-to-file converters stream sanitized batches directly to the output file. Result.clean_data is None for converters, so the clean table is not materialized in memory.
PyArrow filesystem file inputs are opened as seekable streams. CSV, JSON, JSON Lines, NDJSON, and XML URI inputs are not copied to a temporary file; their bytes are read by the same chunked native scanner used for local files.
PyArrow filesystem outputs are opened with the filesystem's open_output_stream method (pyarrow.fs.FileSystem.open_output_stream). CSV, JSON Lines, and Parquet converters write incrementally to that stream instead of staging the full output in a local temporary file.
CSV, JSON, JSON Lines, and NDJSON readers use read_chunk_bytes to bound input chunks while scanning.
XML without xml_row_tag is parsed into a native document tree before row emission, so batch_memory_limit_bytes limits the accumulated document size before the tree is built.
XML with xml_row_tag streams matching direct child elements. The scanner reads bounded chunks, discards completed row slices, and raises SchemaSanitizerResourceError if the active XML buffer exceeds batch_memory_limit_bytes.
Local and PyArrow filesystem folder readers (read_json_folder and read_xml_folder) list direct child files only, then compact one source document at a time into a local temporary JSON Lines or XML stream. The temp file is the bridge that lets many single-document files reuse the normal streaming sanitizer pipeline without building one large Python object.
Folder temp streams contain only the compacted input representation, not the final clean dataset. With batch_memory_limit_bytes, each source document is checked before it is decoded and added to that stream. If a PyArrow filesystem does not report a child file size, the child is read in bounded chunks and the reader stops at batch_memory_limit_bytes + 1 bytes before raising SchemaSanitizerResourceError.
Folder temp files are deleted when the read finishes, and partially written temp files are deleted if compaction raises an exception. If the Python process is killed externally, for example with SIGKILL, the operating system may not give schema-sanitizer a chance to run that cleanup.
Parquet inputs are decoded by PyArrow into record batches and exposed to the native JSON frontend through a seekable JSON Lines byte reader. Rows are produced incrementally; the Parquet-to-JSONL adapter does not stage a full conversion file.
XML DTD and entity declarations are rejected. The XML frontend does not load external entities or expand document-defined entities.
batch_memory_limit_bytes maps to the native per-batch memory_limit_bytes budget. It reduces inference and output batch sizes instead of changing the final schema.
For already-resident Python inputs, batch_memory_limit_bytes is enforced as a preflight resource guard. If the Python payload is already larger than the configured limit, the call raises SchemaSanitizerResourceError before native ingestion starts.
arrow_max_depth and parquet_max_depth cap nested expansion. Values beyond those limits are flattened to strings, preventing unbounded container nesting from creating very wide or deeply nested Arrow/Parquet schemas.
Native parsing and materialization use owned streams, arenas, and Arrow C Data resources that are closed when the Result, stream, or sink is closed or dropped. Table-producing readers force stream materialization and close native resources before returning.

Configured resource-limit failures raise SchemaSanitizerResourceError and include limit_name="memory_limit_bytes" in their detail payload when available. True allocator failures are reported separately as SchemaSanitizerOutOfMemoryError.
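The guard described above for folder children whose size is not reported (read in bounded chunks, stop at batch_memory_limit_bytes + 1 bytes) follows a common pattern that can be sketched with the standard library. This is not the library's reader: `read_bounded` is a hypothetical helper and `ResourceLimitExceeded` is a stand-in for SchemaSanitizerResourceError.

```python
import io
from typing import BinaryIO


class ResourceLimitExceeded(Exception):
    """Stand-in for SchemaSanitizerResourceError in this sketch."""


def read_bounded(stream: BinaryIO, limit_bytes: int, chunk_bytes: int = 1 << 20) -> bytes:
    """Read a stream of unknown size, never consuming more than limit_bytes + 1."""
    buffer = bytearray()
    while True:
        # Ask only for what is still needed to reach limit_bytes + 1.
        want = min(chunk_bytes, limit_bytes + 1 - len(buffer))
        chunk = stream.read(want)
        if not chunk:
            return bytes(buffer)  # end of stream, within the limit
        buffer += chunk
        if len(buffer) > limit_bytes:
            raise ResourceLimitExceeded(
                f"input exceeds memory_limit_bytes={limit_bytes}"
            )


small = read_bounded(io.BytesIO(b"abcd"), limit_bytes=4, chunk_bytes=3)

big_stream = io.BytesIO(b"0123456789")
try:
    read_bounded(big_stream, limit_bytes=4, chunk_bytes=3)
    too_big = False
except ResourceLimitExceeded:
    too_big = True
consumed = big_stream.tell()  # how far the oversized stream was read
```

Reading one byte past the limit is enough to prove the input is too large without pulling the rest of it into memory.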

PyArrow Filesystem Integration

When PyArrow is installed, every file reader and file-to-file converter can use pyarrow.fs URI strings. This covers read_csv, read_json, read_json_folder, read_jsonl, read_xml, read_xml_folder, read_parquet, to_csv, to_jsonl, and to_parquet. Supported URI input extensions include csv, json, jsonl, ndjson, xml, parquet, and pq. Supported URI converter output extensions include csv, jsonl, and parquet.

For normal local files, prefer a regular path:

events = ss.read_jsonl("/home/user/data/events.jsonl")

Regular local paths are the simplest and usually best choice for local disk access. They avoid PyArrow URI parsing and filesystem dispatch.

file:// is PyArrow's local-filesystem URI scheme. On Linux and WSL, absolute local paths use three slashes: file:///home/user/data/events.jsonl. That URI points to the same file as /home/user/data/events.jsonl, but it is opened through pyarrow.fs.LocalFileSystem. Use it when you specifically want to test the PyArrow filesystem route or when your code passes filesystem URIs consistently across local and cloud storage. Do not write file://home/user/...; that form has home in the URI host position instead of being an absolute local path.

Local form	Example	Opens through	Best use
Regular local path	`/home/user/data/events.jsonl`	schema-sanitizer local path handling	Default for local disk files.
Local PyArrow URI	`file:///home/user/data/events.jsonl`	`pyarrow.fs.LocalFileSystem`	Testing or URI-only code paths.
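The host-position pitfall is easy to see with the standard library's URL parser. The sketch below uses only `urllib.parse` and does not touch PyArrow; it shows how the two spellings split and one way to build the three-slash form from an absolute POSIX path.

```python
from urllib.parse import quote, urlsplit

good = urlsplit("file:///home/user/data/events.jsonl")
bad = urlsplit("file://home/user/data/events.jsonl")

# Three slashes: empty host, full absolute path.
print(repr(good.netloc), good.path)
# Two slashes: "home" is parsed as the host and drops out of the path.
print(repr(bad.netloc), bad.path)

# Building the URI from an absolute POSIX path keeps the three slashes.
local_path = "/home/user/data/events.jsonl"
uri = "file://" + quote(local_path)
print(uri)
```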

Common URI forms:

Storage	Example URI
Local file through PyArrow	`file:///home/user/data/events.jsonl`
Amazon S3	`s3://raw-bucket/events/2026-06-12.jsonl`
Amazon S3 folder	`s3://raw-bucket/events/2026-06-12/`
Google Cloud Storage	`gs://raw-bucket/assets/2026-06-12.parquet`
Google Cloud Storage folder	`gs://raw-bucket/assets/2026-06-12/`
Google Cloud Storage alias	`gcs://raw-bucket/assets/2026-06-12.xml`
Azure Data Lake Storage Gen2	`abfs://container@account.dfs.core.windows.net/events/2026-06-12.jsonl`
Azure Data Lake Storage Gen2 folder	`abfs://container@account.dfs.core.windows.net/events/2026-06-12/`

Cloud URI support depends on the installed PyArrow build and the normal provider credentials/configuration available to PyArrow.

import schema_sanitizer as ss

events = ss.read_jsonl("s3://raw-bucket/events/2026-06-12.jsonl")
assets = ss.read_parquet("gs://raw-bucket/assets/2026-06-12.parquet")
daily_events = ss.read_json_folder("s3://raw-bucket/events/2026-06-12/")

ss.to_parquet(
    "s3://raw-bucket/events/2026-06-12.jsonl",
    "gs://clean-bucket/events/2026-06-12.parquet",
)

URI file inputs are opened as seekable PyArrow files. CSV, JSON, JSON Lines, NDJSON, and XML bytes are fed directly to the native chunk scanner. Parquet is decoded with pyarrow.parquet into batches, converted incrementally to JSON Lines bytes, and then fed to the same native sanitizer path. No single-file URI input is copied to a temporary file by schema-sanitizer.

Folder URI inputs are listed with non-recursive pyarrow.fs.FileSelector. read_json_folder filters direct .json child files and read_xml_folder filters direct .xml child files. The matching children are sorted by filename, then compacted one document at a time into a local temporary stream before the normal sanitizer pipeline reads that stream.
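For a local folder, the listing rule (direct children only, filtered by extension, sorted by filename) can be sketched with pathlib. This is an analogy rather than the library's code: `direct_children` is a hypothetical helper, it does not use pyarrow.fs.FileSelector, and the case-sensitive suffix match is an assumption.

```python
import tempfile
from pathlib import Path
from typing import List


def direct_children(folder: Path, suffix: str) -> List[Path]:
    """List matching files directly inside folder, sorted by filename."""
    matches = [p for p in folder.iterdir() if p.is_file() and p.suffix == suffix]
    return sorted(matches, key=lambda p: p.name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.json", "a.json", "notes.txt"):
        (root / name).write_text("{}", encoding="utf-8")
    # A nested file must be ignored because the listing is non-recursive.
    (root / "nested").mkdir()
    (root / "nested" / "c.json").write_text("{}", encoding="utf-8")
    names = [p.name for p in direct_children(root, ".json")]

print(names)
```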

URI outputs are opened with the filesystem's open_output_stream method (pyarrow.fs.FileSystem.open_output_stream). CSV and Parquet writers stream Arrow batches to that output stream, and JSON Lines writes UTF-8 bytes incrementally. The output URI is not staged through a local temporary file.

Supported Inputs

Supported inputs are intentionally file-oriented:

Normal local file paths for read_csv, read_json, read_jsonl, read_xml, read_parquet, to_csv, to_jsonl, and to_parquet.
PyArrow filesystem file URI strings for the same single-file readers and converters when PyArrow is installed and can open the URI.
Normal local folders for read_json_folder and read_xml_folder.
PyArrow filesystem folder URI strings for read_json_folder and read_xml_folder; folder exploration is non-recursive.
Already-resident list[dict] rows through read_python.

Unsupported Inputs

Unsupported inputs include raw JSON or XML strings, bytes payloads, opened files, io.BytesIO, io.StringIO, custom reader objects, URLs that PyArrow cannot open as files, and recursive folder scans. Write those inputs to a local file first, or use read_python for in-memory list[dict] rows.
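Both workarounds can be prepared with the standard library. The sketch below starts from an in-memory `io.BytesIO` payload (an example value, not library input) and shows the two routes; the actual schema-sanitizer calls appear only as comments, since this sketch does not import the library.

```python
import io
import json
import os
import tempfile

payload = io.BytesIO(b'{"id": 1}\n{"id": 2}\n')

# Option 1: persist the bytes so a file reader can open them by path.
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as handle:
    handle.write(payload.getvalue())
    local_path = handle.name
try:
    # A call such as ss.read_jsonl(local_path) would go here.
    with open(local_path, "rb") as check:
        on_disk = check.read()
finally:
    os.remove(local_path)  # clean up the temporary file afterwards

# Option 2: decode into list[dict] rows, the shape read_python accepts.
text = payload.getvalue().decode("utf-8")
rows = [json.loads(line) for line in text.splitlines() if line.strip()]
# A call such as ss.read_python(rows) would go here.
print(rows)
```

Option 2 keeps everything in memory, so it suits small payloads; for large inputs the file route lets the streaming readers bound memory use.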

Examples

The examples/ directory contains tutorial notebooks and one cloud pipeline CLI example:

01_ingestion_and_core_api.ipynb
02_options_and_stats.ipynb
03_adapters_and_converters.ipynb
04_streaming_large_csv_to_parquet.ipynb
05_full_options_catalog_sweep.ipynb
06_xml_reading_and_memory.ipynb
07_gcs_jsonl_to_bigquery_parquet.py: GCS JSONL to BigQuery-compatible Parquet using an external table schema fetched through ADBC as base_schema, then creating or replacing the Hive-partitioned external table

Platform Notes

The published Linux wheel targets glibc-based x86-64 environments (tagged manylinux_2_27 and manylinux_2_28); wheels are also published for Windows x86-64 and macOS. Alpine Linux uses musl, so Alpine users should use a glibc-based Python environment or build from source.

Development

Install the project for local development:

pip install -e ".[dev]"

Run the tests:

pytest

Build the native core directly with CMake:

cmake -S . -B build/dev -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build/dev

License

schema-sanitizer is licensed under the Apache License 2.0. See LICENSE.

Project details

Maintainers

bgallan

Release history

This version

0.1.2

Jun 16, 2026

0.1.1

Jun 15, 2026

Download files

Source Distribution

schema_sanitizer-0.1.2.tar.gz (222.6 kB)

Uploaded Jun 16, 2026 Source

Built Distributions

schema_sanitizer-0.1.2-cp311-abi3-win_amd64.whl (510.8 kB)

Uploaded Jun 16, 2026. Tags: CPython 3.11+, Windows x86-64

schema_sanitizer-0.1.2-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (402.6 kB)

Uploaded Jun 16, 2026. Tags: CPython 3.11+, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

schema_sanitizer-0.1.2-cp311-abi3-macosx_11_0_arm64.whl (318.7 kB)

Uploaded Jun 16, 2026. Tags: CPython 3.11+, macOS 11.0+ ARM64

schema_sanitizer-0.1.2-cp311-abi3-macosx_10_9_x86_64.whl (331.4 kB)

Uploaded Jun 16, 2026. Tags: CPython 3.11+, macOS 10.9+ x86-64

File details

Details for the file schema_sanitizer-0.1.2.tar.gz.

File metadata

Download URL: schema_sanitizer-0.1.2.tar.gz
Upload date: Jun 16, 2026
Size: 222.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for schema_sanitizer-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`591c210f553a2342086395347aa147b7a7005fe44b5c2bd4ea02fd17875aa017`
MD5	`535f91b021171bd8d6fa907cba82c7e2`
BLAKE2b-256	`38eeb0e76920f7785c8924f00a47bece10d3d3cec304c5e620ac0f5cafd5ee8c`

File details

Details for the file schema_sanitizer-0.1.2-cp311-abi3-win_amd64.whl.

File metadata

Download URL: schema_sanitizer-0.1.2-cp311-abi3-win_amd64.whl
Upload date: Jun 16, 2026
Size: 510.8 kB
Tags: CPython 3.11+, Windows x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for schema_sanitizer-0.1.2-cp311-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`b5a2581727ff12d8dd5b0e0d75c89012a00a18418fb05c8651ff605b8191e42c`
MD5	`a66ab7fd8ad64775ad79fe470a89830f`
BLAKE2b-256	`1744e0eac9cad8010dcf77e1d0d62ec066c2a48a04cf40126a47f2238a94a3d0`

File details

Details for the file schema_sanitizer-0.1.2-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: schema_sanitizer-0.1.2-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Upload date: Jun 16, 2026
Size: 402.6 kB
Tags: CPython 3.11+, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for schema_sanitizer-0.1.2-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`217920ea3e809d404d3d63061b10639e795ad95aeffc4cfeb5feb5f6e704a674`
MD5	`3f18386160d7b01ffeb7e6ccd8b03d07`
BLAKE2b-256	`1add5cff465840edb698fdbb61b8df18912525ed7e2fdbf77255afa0b9e932b1`

File details

Details for the file schema_sanitizer-0.1.2-cp311-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: schema_sanitizer-0.1.2-cp311-abi3-macosx_11_0_arm64.whl
Upload date: Jun 16, 2026
Size: 318.7 kB
Tags: CPython 3.11+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for schema_sanitizer-0.1.2-cp311-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`1c0092c42eb52af4f9f4ad6cd01569f70c40ad796439a78010bbe5eeb9cd5777`
MD5	`ef9b62981fda5ef35327a916524b0618`
BLAKE2b-256	`a3550d70c380b5c2054660bea256c5fb42255e954000f855f063785fe41ce4da`

File details

Details for the file schema_sanitizer-0.1.2-cp311-abi3-macosx_10_9_x86_64.whl.

File metadata

Download URL: schema_sanitizer-0.1.2-cp311-abi3-macosx_10_9_x86_64.whl
Upload date: Jun 16, 2026
Size: 331.4 kB
Tags: CPython 3.11+, macOS 10.9+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for schema_sanitizer-0.1.2-cp311-abi3-macosx_10_9_x86_64.whl
Algorithm	Hash digest
SHA256	`847b9ae31f1cad504a6670b8e71a70179008b87689176ef9c234506d75a55796`
MD5	`5a283960593b5ca68163ef564db3cc2a`
BLAKE2b-256	`4b7873c96277e2014e2c484464818d72e6368d7cc654fb631febd5c0488fb3cd`
