
Spec-driven data sanitization for CSV, JSON, JSONL, XML, Parquet, and Python objects.


schema-sanitizer

Version 0.1.1: this project is still in a testing phase. Expect the core behavior to be exercised heavily before treating it as a stable production dependency.

schema-sanitizer turns extremely messy semistructured data into stable, consistent tables. It is built for CSV, JSON, JSON Lines, XML, Parquet, and Python rows whose real-world values do not agree on one neat schema: fields appear late, arrays and objects change shape, timestamps arrive in several formats, scalars collide with nested values, and malformed records still need a place to go.

The library's main purpose is to make ingestion predictable before data reaches analytics engines, warehouses, or incremental pipelines. It scans source data, infers a reconciled Arrow schema, converts compatible values into that schema, and isolates rows that cannot be represented cleanly. The result is a table that downstream tools can consume without rediscovering schema drift on every run.

The hard parts are handled explicitly:

  • Turning messy semistructured data into tables: mixed scalar, list, struct, null, date/time, and string values are reconciled into stable columns.
  • Schema reconciliation for incremental pipelines: base_schema lets later batches align to a previous PyArrow schema. Additive mode keeps known field types while accepting newly observed fields, and strict mode is available when the schema must not drift.
  • Memory safety: readers and converters use bounded batches, streaming writers, spill-to-disk paths where needed, depth limits, row-size budgets, and quarantine output so large or malformed inputs do not require loading the whole cleaned dataset into memory.
  • Max depth enforcement: Arrow and Parquet depth budgets can cap deeply nested records before they exceed downstream limits such as warehouse nesting constraints.

Every public reader and converter returns a Result object with clean data, bad rows, and stats.

It has two public workflows:

  • In-Memory Analytics: read_* functions return a Result whose clean_data is PyArrow, pandas, Polars, or DuckDB data.
  • File-To-File Converters: to_* functions stream sanitized files to CSV, JSON Lines, or Parquet and return a Result whose clean_data is None.

import schema_sanitizer as ss

events = ss.read_jsonl("raw/events.jsonl")
customers = ss.read_csv("raw/customers.csv", output_format="pandas")

table = events.clean_data
df = customers.clean_data

ss.to_parquet("raw/events.jsonl", "clean/events.parquet")


Install

schema-sanitizer supports Python >=3.11.

For Arrow reads and file-to-file converters:

pip install 'schema-sanitizer[pyarrow]'

Install adapter extras for the in-memory analytics tools you use:

pip install 'schema-sanitizer[pyarrow,pandas]'
pip install 'schema-sanitizer[pyarrow,polars]'
pip install 'schema-sanitizer[pyarrow,duckdb]'
pip install 'schema-sanitizer[all]'

Import with an underscore:

import schema_sanitizer as ss

In-Memory Analytics

Use read_* when you want clean data back in Python with stats.

| Function | Input | Typical use |
| --- | --- | --- |
| read_csv(path, ...) | Local or PyArrow FS .csv file | Inspect or analyze CSV data. |
| read_json(path, ...) | Local or PyArrow FS .json file | Read JSON files into a table. |
| read_json_folder(path, ...) | Local or PyArrow FS folder of .json files | Read direct JSON file children as JSONL rows. |
| read_jsonl(path, ...) | Local or PyArrow FS .jsonl / .ndjson file | Read JSON Lines or NDJSON event and log data. |
| read_xml(path, ...) | Local or PyArrow FS .xml file | Read XML documents through the native sanitizer pipeline. |
| read_xml_folder(path, ...) | Local or PyArrow FS folder of .xml files | Read direct XML file children as XML document rows. |
| read_parquet(path, ...) | Local or PyArrow FS .parquet / .pq file | Read Parquet through the same cleaning pipeline. |
| read_python(rows, ...) | list[dict] | Clean rows already in memory. |

Readers always return a Result. By default, result.clean_data is a PyArrow table.

result = ss.read_jsonl("data/events.jsonl")

print(result.clean_data.schema)
print(result.clean_data.num_rows)
print(result.stats)

Choose another in-memory analytics target with output_format.

pandas_result = ss.read_csv("data/customers.csv", output_format="pandas")
polars_result = ss.read_csv("data/customers.csv", output_format="polars")
duckdb_result = ss.read_csv("data/customers.csv", output_format="duckdb")

pandas_df = pandas_result.clean_data
polars_df = polars_result.clean_data
duckdb_rel = duckdb_result.clean_data

Accepted output_format values are pyarrow, pandas, polars, and duckdb.

Use read_python for rows that are already in memory.

rows = [
    {"id": 1, "active": "yes", "score": "10.5"},
    {"id": 2, "active": "no", "score": 8},
]

result = ss.read_python(
    rows,
    true_tokens=("yes",),
    false_tokens=("no",),
)

table = result.clean_data

File-To-File Converters

Use to_* when you want a sanitized output file and do not need clean data in memory. These functions stream sanitized output and return a Result with clean_data set to None, plus bad rows and stats.

| Function | Output | Typical use |
| --- | --- | --- |
| to_csv(input_path, output_path, ...) | CSV | Produce a flat file for spreadsheets or downstream text tools. |
| to_jsonl(input_path, output_path, ...) | JSON Lines | Produce one cleaned JSON object per line. |
| to_parquet(input_path, output_path, ...) | Parquet | Produce a typed columnar file for analytics systems. |

result = ss.to_parquet("raw/orders.csv", "clean/orders.parquet")

assert result.clean_data is None
print(result.stats)

ss.to_csv("raw/events.jsonl", "clean/events.csv")
ss.to_jsonl("raw/orders.parquet", "clean/orders.jsonl")

Converters infer the input format from the input file extension. If the input path has no useful extension, pass input_format.

ss.to_parquet("raw/events", "clean/events.parquet", input_format="jsonl")

Accepted input_format values are auto, csv, json, jsonl, ndjson, xml, and parquet.
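
As a rough illustration of extension-based detection, the sketch below maps a path suffix to one of the accepted input_format values and falls back to an explicit value when one is passed. The helper name guess_input_format and the suffix table are assumptions made for this example; they are not the library's internals, and the real detection rules may differ.

```python
from pathlib import PurePosixPath

# Hypothetical suffix table, built from the extensions named in this README.
_SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "ndjson",
    ".xml": "xml",
    ".parquet": "parquet",
    ".pq": "parquet",
}


def guess_input_format(path: str, input_format: str = "auto") -> str:
    """Return the explicit format, or guess one from the file suffix."""
    if input_format != "auto":
        return input_format
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in _SUFFIX_TO_FORMAT:
        raise ValueError(f"cannot infer format for {path!r}; pass input_format")
    return _SUFFIX_TO_FORMAT[suffix]
```

A path such as "raw/events" has no suffix, so this sketch raises unless input_format is given, which mirrors the advice above to pass input_format for extensionless inputs.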

Result Object

All public read_* and to_* functions return schema_sanitizer.Result.

For readers, result.clean_data contains the requested clean in-memory output. For converters, clean data is written to output_path, so result.clean_data is always None.

result = ss.read_csv("data/customers.csv", output_format="pandas")

df = result.clean_data
stats = result.stats
bad_rows = result.bad_rows

| Property or method | What it returns |
| --- | --- |
| clean_data | Clean data in the requested reader output_format: PyArrow table, pandas DataFrame, Polars DataFrame, or DuckDB relation. Always None for to_* converters. |
| stats | Dictionary of counters such as rows inferred, rows materialized, batches, skipped rows, quarantined rows, warnings, and errors. |
| bad_rows | Quarantined rows as a pyarrow.Table. The table may be empty when no rows were quarantined. |

Result Stats

result.stats is a plain dict. All values are integers, and a counter defaults to 0 when the runtime did not report it.

| Property | What it means |
| --- | --- |
| inferred_rows | Rows scanned while inferring the input schema. |
| inferred_bytes | Approximate input bytes scanned while inferring the schema. |
| arrow_schema_depth | Maximum Arrow container depth found during inference. Struct and list containers count; scalar leaves and top-level field wrappers do not. |
| parquet_schema_depth | Maximum Parquet/BigQuery RECORD depth found during inference. Struct containers count; list containers and scalar leaves do not. |
| materialized_rows | Clean rows materialized for read_* results or written by to_* converters. |
| batches | Number of output batches materialized or written. |
| flattened_fields | Nested fields flattened by the selected flattening options. |
| scalar_wrappings | Scalar values wrapped to fit list or struct-like output shapes. |
| skipped_rows | Rows dropped by on_error="skip_row". |
| quarantined_rows | Rows dropped from clean output and stored in result.bad_rows. |
| warnings | Non-fatal warnings reported by the runtime. |
| errors | Fatal errors reported by the runtime. |
| soft_errors | Recoverable row or value errors handled by policy. |
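
To make the two depth counters concrete, here is a minimal sketch that computes both for one plain Python row. It assumes dicts stand in for structs, that the row dict itself is the uncounted top-level wrapper, and that the function names are hypothetical; the library's own counting may differ in edge cases.

```python
def arrow_depth(value) -> int:
    """Count nested struct (dict) and list containers below this value."""
    if isinstance(value, dict):
        return 1 + max((arrow_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((arrow_depth(v) for v in value), default=0)
    return 0  # scalar leaf


def parquet_depth(value) -> int:
    """Count nested struct (dict) containers only; lists add no depth."""
    if isinstance(value, dict):
        return 1 + max((parquet_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((parquet_depth(v) for v in value), default=0)
    return 0  # scalar leaf


def row_depths(row: dict) -> tuple[int, int]:
    """Return (arrow, parquet) depth; the row dict itself is not counted."""
    arrow = max((arrow_depth(v) for v in row.values()), default=0)
    parquet = max((parquet_depth(v) for v in row.values()), default=0)
    return arrow, parquet


row = {"id": 1, "order": {"items": [{"sku": "A1", "tags": ["x"]}]}}
print(row_depths(row))  # (4, 2)
```

The order field nests struct, list, struct, list, so the Arrow-style count is 4, while only the two structs count toward the Parquet-style depth.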

Error Handling

By default (on_error="emit_null_row"), a row that fails materialization is replaced with a null row so the row count stays stable. Choose a different policy with on_error.

| Policy | Behavior |
| --- | --- |
| stop | Raise an error as soon as a row cannot be processed. |
| skip_row | Drop bad rows from the output. |
| emit_null_row | Keep row count stable by emitting a null row. |
| quarantine | Drop bad rows from the output and keep them in result.bad_rows. |

result = ss.read_jsonl(
    "data/events.jsonl",
    on_error="quarantine",
)

clean = result.clean_data
print(result.stats)

bad_rows = result.bad_rows

Converters return the same Result shape as readers. Because the clean data is written to output_path, converter results always have clean_data set to None.

result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    on_error="quarantine",
)

print(result.stats)
bad_rows = result.bad_rows
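
The four policies can be pictured with a toy model over plain Python rows. This is only a sketch: apply_on_error and to_int_id are hypothetical names, and the null row is represented here as a dict of None values, which is an assumption about its shape.

```python
def apply_on_error(rows, convert, on_error="emit_null_row"):
    """Toy model of the four row-level policies; not the library's code."""
    clean, bad = [], []
    for row in rows:
        try:
            clean.append(convert(row))
        except (TypeError, ValueError):
            if on_error == "stop":
                raise  # surface the first failure
            if on_error == "emit_null_row":
                clean.append({key: None for key in row})  # keep row count
            elif on_error == "quarantine":
                bad.append(row)  # keep the original row for inspection
            elif on_error != "skip_row":
                raise ValueError(f"unknown on_error policy: {on_error!r}")
    return clean, bad


def to_int_id(row):
    """Example converter: fails when id is not integer-like."""
    return {"id": int(row["id"])}


rows = [{"id": "1"}, {"id": "unknown"}, {"id": "3"}]
clean, bad = apply_on_error(rows, to_int_id, on_error="quarantine")
print(clean)  # [{'id': 1}, {'id': 3}]
print(bad)    # [{'id': 'unknown'}]
```

Only quarantine fills the second list; skip_row discards the failing row, and emit_null_row keeps three output rows.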

Schema Control

Pass base_schema when the output must match or evolve from an expected contract.

import pyarrow as pa
import schema_sanitizer as ss

schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=schema,
    schema_mode="strict",
    on_error="quarantine",
)

table = result.clean_data

| Mode | Behavior |
| --- | --- |
| strict | Output exactly base_schema. Requires base_schema; inference is skipped. |
| additive | Keep base_schema field types and add newly observed fields. |

column_order defaults to base_schema_first. Use column_order="sorted" for lexicographic field ordering.
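
A small sketch of what the two orderings could look like on plain field-name lists. The order_columns helper is hypothetical, and placing newly observed fields after the base fields in first-seen order is an assumption; the text above only states that base fields come first.

```python
def order_columns(base_fields, observed_fields, column_order="base_schema_first"):
    """Order output field names under the two documented options."""
    # De-duplicate while keeping first-seen order.
    all_fields = list(dict.fromkeys([*base_fields, *observed_fields]))
    if column_order == "sorted":
        return sorted(all_fields)  # lexicographic ordering
    if column_order == "base_schema_first":
        extras = [name for name in all_fields if name not in base_fields]
        return [*base_fields, *extras]
    raise ValueError(f"unknown column_order: {column_order!r}")


print(order_columns(["id", "email"], ["zip", "email", "age"]))
# ['id', 'email', 'zip', 'age']
print(order_columns(["id", "email"], ["zip", "email", "age"], "sorted"))
# ['age', 'email', 'id', 'zip']
```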

Custom Tokens and Date/Time Patterns

Use true_tokens and false_tokens when boolean values use domain-specific strings. Use temporal regex options when dates or times do not match the built-in parsers.

result = ss.read_csv(
    "data/events.csv",
    true_tokens=("yes", "enabled", "1"),
    false_tokens=("no", "disabled", "0"),
    timestamp_patterns=(
        r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$",
    ),
    date_patterns=(
        r"^(\d{4})\.(\d{2})\.(\d{2})$",
    ),
    time_patterns=(
        r"^(\d{2})h(\d{2})m(\d{2})s$",
    ),
)

table = result.clean_data

For timestamp_patterns, capture groups 1-6 are year, month, day, hour, minute, and second. Optional group 7 may contain fractions, and group 8 may contain a timezone. For date_patterns, groups 1-3 are year, month, and day. For time_patterns, groups 1-3 are hour, minute, and second.
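
Before passing a pattern to the library, it can help to check its capture groups with the standard re module. The sketch below reuses the three patterns from the example above; parse_with is a hypothetical helper, and it only covers patterns without the optional fraction and timezone groups.

```python
import re
from datetime import date, datetime, time

TIMESTAMP = re.compile(r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$")
DATE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})$")
TIME = re.compile(r"^(\d{2})h(\d{2})m(\d{2})s$")


def parse_with(pattern, text, build):
    """Return build(*integer groups) when the pattern matches, else None."""
    match = pattern.match(text)
    if match is None:
        return None
    return build(*(int(group) for group in match.groups()))


print(parse_with(TIMESTAMP, "2024/03/05 14:30:00", datetime))  # 2024-03-05 14:30:00
print(parse_with(DATE, "2024.03.05", date))                    # 2024-03-05
print(parse_with(TIME, "14h30m00s", time))                     # 14:30:00
```

Because the groups arrive in year, month, day, hour, minute, second order, they can be passed positionally to the datetime, date, and time constructors.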

In-Memory Analytics Options

Each reader accepts the parameters listed in its section.

read_csv(path, ...)

Parameter Default Accepted values What it controls
path required str or path-like object Local CSV file to read.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers. Groups 1-6 map to year, month, day, hour, minute, second; group 7 may hold fractions and group 8 timezone.
date_patterns () sequence of regex strings Extra date parsers. Groups 1-3 map to year, month, day.
time_patterns () sequence of regex strings Extra time parsers. Groups 1-3 map to hour, minute, second.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
csv_has_header True bool Whether the first CSV row is a header.
csv_delimiter , single-character string CSV delimiter.
input_text_encoding utf-8 text encoding name Encoding used to decode CSV bytes.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for streaming CSV reads.

read_json(path, ...)

Parameter Default Accepted values What it controls
path required str or path-like object Local JSON file to read.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
input_text_encoding utf-8 text encoding name Encoding used to decode JSON bytes.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for streaming JSON reads.

read_json_folder(path, ...)

read_json_folder reads the direct .json children of a local folder or PyArrow filesystem folder URI in deterministic filename order. Folder exploration is not recursive. Each source file must contain one JSON document; the reader compacts those documents into a temporary JSON Lines stream and then runs the same sanitizer path used by read_json.
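
The compaction step can be approximated in a few lines of standard-library Python. This is a sketch for local folders only: compact_json_folder is a hypothetical name, and sorting by path is an assumption about what "deterministic filename order" means.

```python
import json
from pathlib import Path


def compact_json_folder(folder: str, out_path: str) -> int:
    """Write each direct .json child as one JSON Lines row; return the count."""
    files = sorted(
        p for p in Path(folder).iterdir()  # iterdir() does not recurse
        if p.is_file() and p.suffix == ".json"
    )
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            document = json.loads(path.read_text(encoding="utf-8"))
            # Compact separators keep each document on a single line.
            out.write(json.dumps(document, separators=(",", ":")) + "\n")
    return len(files)
```

Files in subfolders and files with other extensions are ignored, which matches the non-recursive behavior described above.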

Parameter Default Accepted values What it controls
path required str or path-like object Local folder or PyArrow FS folder URI containing .json files.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
input_text_encoding utf-8 text encoding name Encoding used to decode each source JSON file.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-document and per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for the compacted JSON Lines stream.

read_jsonl(path, ...)

Parameter Default Accepted values What it controls
path required str or path-like object Local JSON Lines or NDJSON file to read.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
input_text_encoding utf-8 text encoding name Encoding used to decode JSON Lines or NDJSON bytes.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for streaming JSON Lines or NDJSON reads.

read_xml(path, ...)

read_xml parses a local XML document in the native C++ frontend and sends the resulting rows through the same schema inference, cleaning, quarantine, and output adapter pipeline as the JSON and CSV readers.

By default, the root element is treated as one row, like a single JSON object. Pass xml_row_tag="row" when a file contains repeated direct child elements that should become separate rows; the XML scanner then streams each matching row element. Attributes become fields prefixed with @, repeated child tags become lists, and mixed element text is stored under #text.
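
The mapping rules above can be previewed with the standard library's ElementTree. This is a simplified sketch, not the native C++ frontend: element_to_value is a hypothetical helper, it ignores tail text, and a tag that appears once stays a scalar here.

```python
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Convert an element to a dict, or to its text when it is a plain leaf."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text or None
    # Attributes become fields prefixed with "@".
    record = {f"@{name}": value for name, value in element.attrib.items()}
    for child in children:
        value = element_to_value(child)
        if child.tag in record:
            existing = record[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                record[child.tag] = [existing, value]  # repeated tag -> list
        else:
            record[child.tag] = value
    if text:
        record["#text"] = text  # text next to attributes or children
    return record


xml = (
    '<orders><order id="1"><item>pen</item><item>ink</item>'
    '<note priority="low">rush</note></order>'
    '<order id="2"><item>cap</item></order></orders>'
)
rows = [element_to_value(el) for el in ET.fromstring(xml).findall("order")]
print(rows[0])
# {'@id': '1', 'item': ['pen', 'ink'], 'note': {'@priority': 'low', '#text': 'rush'}}
```

Selecting the order children with findall plays the role of xml_row_tag="order" in this sketch.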

result = ss.read_xml(
    "raw/orders.xml",
    xml_row_tag="order",
    read_chunk_bytes=1024 * 1024,
    batch_memory_limit_bytes=256 * 1024 * 1024,
)

Parameter Default Accepted values What it controls
path required str or path-like object Local XML file to read.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
input_text_encoding utf-8 text encoding name Encoding used to decode XML bytes when transcoding is needed.
xml_row_tag None XML element tag name or None Direct child element tag to stream as separate rows. None treats the whole document as one row.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for streaming text input reads.

read_xml_folder(path, ...)

read_xml_folder reads the direct .xml children of a local folder or PyArrow filesystem folder URI in deterministic filename order. Folder exploration is not recursive. Each source file must contain one XML document, and all documents must use the same root tag unless you pass that tag explicitly as xml_row_tag. The reader wraps those documents in a temporary XML stream and then runs the same sanitizer path used by read_xml.

result = ss.read_xml_folder(
    "raw/order-events",
    xml_row_tag="order",
    batch_memory_limit_bytes=256 * 1024 * 1024,
)

Parameter Default Accepted values What it controls
path required str or path-like object Local folder or PyArrow FS folder URI containing .xml files.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
input_text_encoding utf-8 text encoding name Encoding used to decode each source XML file.
xml_row_tag None XML element tag name or None Expected XML document root tag. None infers it from the first file.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-document-row memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for the compacted XML stream.

read_parquet(path, ...)

Parameter Default Accepted values What it controls
path required str or path-like object Local Parquet file to read.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.

read_python(rows, ...)

Parameter Default Accepted values What it controls
rows required list[dict] In-memory rows to normalize.
output_format pyarrow pyarrow, pandas, polars, duckdb Type stored in Result.clean_data.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort memory budget for the already-resident Python payload.

File-To-File Converter Options

Converters accept local or PyArrow FS URI output paths. Inputs can be local paths or PyArrow FS URI strings. They infer input format from the input extension unless you pass input_format.

to_csv(input_path, output_path, ...)

Parameter Default Accepted values What it controls
input_path required str or path-like object Local file or PyArrow FS URI to sanitize.
output_path required str or path-like object Local or PyArrow FS URI CSV file to create.
input_format auto auto, csv, json, jsonl, ndjson, xml, parquet Input format selector.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
csv_has_header True bool Whether CSV input has a header.
csv_delimiter , single-character string CSV input delimiter.
input_text_encoding utf-8 text encoding name Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input.
xml_row_tag None XML element tag name or None Direct child XML element tag to stream as separate rows when reading XML input.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for streaming text input reads.

to_jsonl(input_path, output_path, ...)

Parameter Default Accepted values What it controls
input_path required str or path-like object Local file or PyArrow FS URI to sanitize.
output_path required str or path-like object Local or PyArrow FS URI JSON Lines file to create.
input_format auto auto, csv, json, jsonl, ndjson, xml, parquet Input format selector.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
csv_has_header True bool Whether CSV input has a header.
csv_delimiter , single-character string CSV input delimiter.
input_text_encoding utf-8 text encoding name Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input.
xml_row_tag None XML element tag name or None Direct child XML element tag to stream as separate rows when reading XML input.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for streaming text input reads.

to_parquet(input_path, output_path, ...)

Parameter Default Accepted values What it controls
input_path required str or path-like object Local file or PyArrow FS URI to sanitize.
output_path required str or path-like object Local or PyArrow FS URI Parquet file to create.
input_format auto auto, csv, json, jsonl, ndjson, xml, parquet Input format selector.
base_schema None pyarrow.Schema or None Optional base output contract.
schema_mode additive additive, strict How inferred fields reconcile with base_schema.
column_order base_schema_first base_schema_first, sorted Output field ordering.
parse_integers True bool Parse integer-looking strings as integers.
parse_floats True bool Parse float-looking strings as floats.
true_tokens () sequence of strings String tokens interpreted as boolean true.
false_tokens () sequence of strings String tokens interpreted as boolean false.
timestamp_patterns () sequence of regex strings Extra timestamp parsers.
date_patterns () sequence of regex strings Extra date parsers.
time_patterns () sequence of regex strings Extra time parsers.
arrow_max_depth 32 integer >= 0 Maximum Arrow container depth for object and array expansion.
parquet_max_depth 15 integer >= 0 Maximum Parquet/BigQuery RECORD depth for object expansion.
scalar_object_key default_key string Key used when a scalar must be wrapped as an object.
csv_has_header True bool Whether CSV input has a header.
csv_delimiter , single-character string CSV input delimiter.
input_text_encoding utf-8 text encoding name Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input.
xml_row_tag None XML element tag name or None Direct child XML element tag to stream as separate rows when reading XML input.
on_error emit_null_row stop, skip_row, emit_null_row, quarantine Row-level error policy.
batch_memory_limit_bytes None positive integer bytes or None Best-effort per-batch memory budget.
read_chunk_bytes 1048576 positive integer bytes Chunk size for streaming text input reads.

Schema Inference Heuristics

Schema inference scans the full source before materialization whenever inference runs. It is not a sample-based inference step: in inferred mode and additive base_schema mode, every source row is consumed during inference and counted in Result.stats["inferred_rows"].

For each inferred row, the sanitizer applies two internal passes:

  1. The shape pass discovers structural paths: field names, objects, arrays, and fields that must be flattened by depth limits.
  2. The statistics pass collects scalar type evidence for the discovered shape: booleans, integers, floats, timestamps, dates, times, strings, nulls, and mixed-type conflicts.
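
The two passes listed above can be sketched over plain Python rows. The function names, the "[]" path marker for list elements, and the use of Python type names as evidence are all assumptions for illustration; the library's internal representation is not described here.

```python
from collections import Counter, defaultdict


def shape_pass(value, path, shapes):
    """Record which structural kinds (struct, list, scalar) appear per path."""
    if isinstance(value, dict):
        shapes[path].add("struct")
        for key, child in value.items():
            shape_pass(child, path + (key,), shapes)
    elif isinstance(value, list):
        shapes[path].add("list")
        for child in value:
            shape_pass(child, path + ("[]",), shapes)
    elif value is not None:
        shapes[path].add("scalar")


def stats_pass(value, path, counts):
    """Count scalar type names at each leaf path."""
    if isinstance(value, dict):
        for key, child in value.items():
            stats_pass(child, path + (key,), counts)
    elif isinstance(value, list):
        for child in value:
            stats_pass(child, path + ("[]",), counts)
    else:
        counts[path][type(value).__name__] += 1


rows = [
    {"user": {"id": 1}, "tags": ["a"]},
    {"user": "anon", "tags": "b"},
]
shapes = defaultdict(set)
counts = defaultdict(Counter)
for row in rows:
    shape_pass(row, (), shapes)
    stats_pass(row, (), counts)

# Paths with more than one structural kind are the conflicts to reconcile.
conflicts = {path: kinds for path, kinds in shapes.items() if len(kinds) > 1}
print(conflicts)
```

In this sample, user is seen as both a struct and a scalar, and tags as both a list and a scalar, so both paths show up as conflicts.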

When schema_mode="strict" is used with an explicit base_schema, the sanitizer skips inference and uses the schema contract directly; in that fast path, inferred_rows is 0. Strict mode only works with base_schema. Passing schema_mode="strict" without base_schema raises an exception.

Separating shape discovery from scalar statistics keeps list and struct decisions stable across messy inputs. If one row has an object and another row has a scalar at the same field, the structural shape wins and the scalar is wrapped under scalar_object_key (default_key by default). If one row has a list and another row has a scalar at the same field, the list shape wins and the scalar is wrapped as a single list element.
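
A minimal sketch of the two wrapping rules, assuming the winning shape for a field is already known. fit_to_shape is a hypothetical helper, and leaving None untouched is an assumption.

```python
def fit_to_shape(value, shape, scalar_object_key="default_key"):
    """Wrap a scalar so it fits the winning container shape of its field."""
    if value is None or isinstance(value, (dict, list)):
        return value  # only scalars are wrapped in this sketch
    if shape == "struct":
        return {scalar_object_key: value}
    if shape == "list":
        return [value]
    return value


print(fit_to_shape("anon", "struct"))  # {'default_key': 'anon'}
print(fit_to_shape("b", "list"))       # ['b']
print(fit_to_shape({"id": 1}, "struct"))  # {'id': 1}
```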

Scalar inference is conservative:

  • Nulls do not choose a type by themselves.
  • Boolean JSON values infer bool.
  • Numeric JSON values infer int64 or float64.
  • Strings can infer booleans, integers, floats, timestamps, dates, or times when the configured token and parser options match.
  • Mixed scalar kinds fall back to string.
  • Objects or arrays observed where a scalar is required are stringified.
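The conservative rules above can be approximated in a few lines of plain Python. This is a simplified sketch, not the library's code: scalar_kind and infer_scalar_type are hypothetical names, the string-parsing rule (strings that look like booleans, numbers, or timestamps) is left out because it depends on the configured token and parser options, and every pair of different kinds is treated as a conflict. Whether the library widens int64 to float64 instead of falling back to string is not covered here.

```python
from typing import Any, Iterable, Optional


def scalar_kind(value: Any) -> Optional[str]:
    """Map one parsed JSON value to a type label, or None for null."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    # Strings stay strings here; objects and arrays in a scalar position
    # are treated as stringified values.
    return "string"


def infer_scalar_type(values: Iterable[Any]) -> Optional[str]:
    """Return one type label for a column of values, or None if only nulls were seen."""
    kinds = {kind for kind in map(scalar_kind, values) if kind is not None}
    if not kinds:
        return None  # nulls do not choose a type by themselves
    if len(kinds) == 1:
        return kinds.pop()
    return "string"  # mixed scalar kinds fall back to string


print(infer_scalar_type([1, None, 2]))
print(infer_scalar_type([1, "unknown"]))
```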

List inference is stricter than top-level object inference. Lists remain typed only when their element shape is conflict-free. Lists of scalars and lists of structs are supported; nested lists or conflicts inside a list element fall back to list<string> so each list column has one stable element type.

Base Schema Enforcement

base_schema is an output contract and only accepts a pyarrow.Schema. The sanitizer converts it to the same internal logical schema representation used by inference before planning materialization.

import pyarrow as pa
import schema_sanitizer as ss

user_schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=user_schema,
    schema_mode="strict",
)

schema_mode="strict" uses base_schema as the complete output schema. The inference loop is skipped, so Result.stats["inferred_rows"] is 0. Strict mode only works when base_schema is provided; otherwise the call raises an exception before materialization. Extra source fields are rejected because they are not present in the strict contract.

schema_mode="additive" requires inference. The full source is scanned, then the inferred schema is reconciled with base_schema: fields already present in base_schema keep their declared types and nullable flags, while newly observed fields are added from inference. Row values that cannot be coerced into the declared base field type are handled by on_error.

For fields present in base_schema, the base type wins even when source rows contain conflicting values. A value such as "unknown" in a base int64 field, an object in a base scalar field, or a scalar in a base struct/list field is a materialization conflict. Depending on on_error, processing stops, or the row is skipped, quarantined, or replaced with a null row. For fields not present in base_schema, conflicts are resolved by the normal inference heuristics before the field is added: mixed scalar kinds fall back to string, object/scalar and list/scalar conflicts use the wrapping rules, and conflicting list element shapes fall back to list<string>.

column_order controls only output field order after reconciliation. column_order="base_schema_first" preserves base fields first, then appends new fields. column_order="sorted" emits fields in lexicographic order.
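The additive reconciliation and the two column_order values can be pictured with plain dictionaries. The sketch below is only an illustration under stated assumptions: reconcile_additive is a hypothetical helper, types are shown as strings rather than PyArrow types, nullable flags are ignored, and new fields are appended in the order they were observed, which the text above does not specify.

```python
from typing import Dict, List, Tuple


def reconcile_additive(
    base: Dict[str, str],
    inferred: Dict[str, str],
    column_order: str,
) -> List[Tuple[str, str]]:
    """Merge an inferred schema into a base schema without changing base types."""
    merged = dict(base)  # fields already in the base schema keep their declared types
    for name, inferred_type in inferred.items():
        merged.setdefault(name, inferred_type)  # only newly observed fields are added

    if column_order == "sorted":
        names = sorted(merged)  # lexicographic order
    elif column_order == "base_schema_first":
        names = list(merged)  # dicts keep insertion order: base fields, then new fields
    else:
        raise ValueError(f"unknown column_order: {column_order!r}")
    return [(name, merged[name]) for name in names]


base = {"id": "int64", "email": "string"}
inferred = {"email": "int64", "id": "string", "country": "string", "age": "int64"}

print(reconcile_additive(base, inferred, "base_schema_first"))
print(reconcile_additive(base, inferred, "sorted"))
```

Note that the inferred types for id and email are discarded in both orderings, because the base type wins for any field the base schema already declares.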

Max Depth Enforcement

Depth enforcement uses two independent limits because Arrow and Parquet/BigQuery count nested data differently:

  • arrow_max_depth defaults to 32. It counts Arrow container depth: struct and list containers count, while scalar leaves and top-level field wrappers do not.
  • parquet_max_depth defaults to 15. It counts Parquet/BigQuery RECORD depth: struct containers count, while list containers, scalar leaves, and top-level field wrappers do not.

The sanitizer flattens a named field to <name>_flattened when keeping that field's full nested value would exceed either limit. The flattened value is stored as a string.

Depth examples:

| Shape | arrow_schema_depth | parquet_schema_depth |
| --- | --- | --- |
| id: int64 | 0 | 0 |
| user: struct<id: int64> | 1 | 1 |
| tags: list<string> | 1 | 0 |
| authors: list<struct<name: string>> | 2 | 1 |
| asset: struct<authors: list<struct<name: string>>> | 3 | 2 |

Use arrow_max_depth as a defensive complexity limit for Arrow/Parquet container nesting. Use parquet_max_depth=15 when the output Parquet will be read by BigQuery external tables, where the practical limit is nested RECORD depth rather than physical list wrapper depth.

The reported Result.stats["arrow_schema_depth"] and Result.stats["parquet_schema_depth"] use the same counting rules as the enforcement options.
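The two counting rules can be checked against the depth examples with a small recursive function. This is a sketch of the counting rules only, not the library's code: the tuple-based type notation and the function names are hypothetical and exist just for this illustration.

```python
from typing import Any

# Hypothetical mini notation for this sketch:
#   "int64"                     -> scalar leaf
#   ("list", element_type)      -> list container
#   ("struct", {name: type})    -> struct container


def container_depth(spec: Any, count_lists: bool) -> int:
    """Count container nesting for one field type."""
    if isinstance(spec, str):
        return 0  # scalar leaves never add depth
    kind, inner = spec
    if kind == "list":
        # Lists count for the Arrow rule but not for the Parquet/BigQuery rule.
        return (1 if count_lists else 0) + container_depth(inner, count_lists)
    if kind == "struct":
        children = [container_depth(child, count_lists) for child in inner.values()]
        return 1 + max(children, default=0)
    raise ValueError(f"unknown container kind: {kind!r}")


def arrow_depth(spec: Any) -> int:
    return container_depth(spec, count_lists=True)


def parquet_depth(spec: Any) -> int:
    return container_depth(spec, count_lists=False)


authors = ("list", ("struct", {"name": "string"}))
asset = ("struct", {"authors": authors})

print(arrow_depth(asset), parquet_depth(asset))
```

For a whole schema, the reported depth under this sketch would be the maximum over its top-level fields, since top-level field wrappers do not count.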

Quarantine Rows Pipeline

Use on_error="quarantine" when you want clean output to continue while keeping failed rows for inspection or replay. Rows that fail materialization are dropped from clean_data or the converter output file and appended to Result.bad_rows.

bad_rows is a PyArrow table with diagnostic metadata:

| Column | What it contains |
| --- | --- |
| row_index | Zero-based source row index. |
| source_offset | Byte offset or source-relative offset when available. |
| code / code_str | Machine-readable diagnostic code. |
| path_id | Internal field path id associated with the failure. |
| detail | Human-readable error detail. |
| context_snippet | Short preview of the offending source row. |
| raw_row | Full raw source row text when available. |

For in-memory reads, inspect result.bad_rows directly:

result = ss.read_jsonl(
    "data/events.jsonl",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

clean = result.clean_data
bad_rows = result.bad_rows
print(result.stats["quarantined_rows"])

For file-to-file converters, the clean output is written to output_path and the same Result.bad_rows table carries quarantined rows:

result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

bad_rows = result.bad_rows

Quarantine is row-level. If one field in a row cannot be coerced into the output schema, the whole row is excluded from clean output and recorded once in bad_rows. In contrast, on_error="skip_row" drops the row without retaining it, and on_error="emit_null_row" keeps row count stable by writing a null row instead of recording it in bad_rows.
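To make the difference between the four policies concrete, here is a minimal pure-Python sketch of a row-level error policy. It does not use schema-sanitizer: materialize_rows and coerce_event are hypothetical, and the quarantine records carry only three of the bad_rows columns.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
POLICIES = {"stop", "skip_row", "emit_null_row", "quarantine"}


def materialize_rows(
    rows: List[Row],
    coerce: Callable[[Row], Row],
    columns: List[str],
    on_error: str,
) -> Tuple[List[Row], List[Row]]:
    """Coerce rows one by one and apply a row-level error policy."""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    clean: List[Row] = []
    bad_rows: List[Row] = []
    for row_index, row in enumerate(rows):
        try:
            clean.append(coerce(row))
        except (TypeError, ValueError) as exc:
            if on_error == "stop":
                raise  # abort the whole run on the first failing row
            if on_error == "emit_null_row":
                # Keep the row count stable with an all-null row.
                clean.append({name: None for name in columns})
            elif on_error == "quarantine":
                # One failing field excludes the whole row, recorded once.
                bad_rows.append(
                    {"row_index": row_index, "detail": str(exc), "raw_row": json.dumps(row)}
                )
            # skip_row: drop the row and retain nothing
    return clean, bad_rows


def coerce_event(row: Row) -> Row:
    return {"id": int(row["id"])}  # raises ValueError for values like "unknown"


rows = [{"id": "1"}, {"id": "unknown"}, {"id": "3"}]
clean, bad_rows = materialize_rows(rows, coerce_event, ["id"], "quarantine")
print(len(clean), len(bad_rows))
```

Running the same input with "skip_row" yields the same clean rows and an empty bad-row list, while "emit_null_row" yields three clean rows with a null in the middle.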

Memory Safety Measures

The sanitizer is designed to process large local files and PyArrow filesystem URI inputs without requiring the whole clean dataset to live in Python memory.

  • File-to-file converters stream sanitized batches directly to the output file. Result.clean_data is None for converters, so the clean table is not materialized in memory.
  • PyArrow filesystem file inputs are opened as seekable streams. CSV, JSON, JSON Lines, NDJSON, and XML URI inputs are not copied to a temporary file; their bytes are read by the same chunked native scanner used for local files.
  • PyArrow filesystem outputs are opened with pyarrow.fs.open_output_stream. CSV, JSON Lines, and Parquet converters write incrementally to that stream instead of staging the full output in a local temporary file.
  • CSV, JSON, JSON Lines, and NDJSON readers use read_chunk_bytes to bound input chunks while scanning.
  • XML without xml_row_tag is parsed into a native document tree before row emission, so batch_memory_limit_bytes limits the accumulated document size before the tree is built.
  • XML with xml_row_tag streams matching direct child elements. The scanner reads bounded chunks, discards completed row slices, and raises SchemaSanitizerResourceError if the active XML buffer exceeds batch_memory_limit_bytes.
  • Local and PyArrow filesystem folder readers (read_json_folder and read_xml_folder) list direct child files only, then compact one source document at a time into a local temporary JSON Lines or XML stream. The temp file is the bridge that lets many single-document files reuse the normal streaming sanitizer pipeline without building one large Python object.
  • Folder temp streams contain only the compacted input representation, not the final clean dataset. With batch_memory_limit_bytes, each source document is checked before it is decoded and added to that stream. If a PyArrow filesystem does not report a child file size, the child is read in bounded chunks and the reader stops at batch_memory_limit_bytes + 1 bytes before raising SchemaSanitizerResourceError.
  • Folder temp files are deleted when the read finishes, and partially written temp files are deleted if compaction raises an exception. If the Python process is killed externally, for example with SIGKILL, the operating system may not give schema-sanitizer a chance to run that cleanup.
  • Parquet inputs are decoded by PyArrow into record batches and exposed to the native JSON frontend through a seekable JSON Lines byte reader. Rows are produced incrementally; the Parquet-to-JSONL adapter does not stage a full conversion file.
  • XML DTD and entity declarations are rejected. The XML frontend does not load external entities or expand document-defined entities.
  • batch_memory_limit_bytes maps to the native per-batch memory_limit_bytes budget. It reduces inference and output batch sizes instead of changing the final schema.
  • For already-resident Python inputs, batch_memory_limit_bytes is enforced as a preflight resource guard. If the Python payload is already larger than the configured limit, the call raises SchemaSanitizerResourceError before native ingestion starts.
  • arrow_max_depth and parquet_max_depth cap nested expansion. Values beyond those limits are flattened to strings, preventing unbounded container nesting from creating very wide or deeply nested Arrow/Parquet schemas.
  • Native parsing and materialization use owned streams, arenas, and Arrow C Data resources that are closed when the Result, stream, or sink is closed or dropped. Table-producing readers force stream materialization and close native resources before returning.

Configured resource-limit failures raise SchemaSanitizerResourceError and include limit_name="memory_limit_bytes" in their detail payload when available. True allocator failures are reported separately as SchemaSanitizerOutOfMemoryError.
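The "stop at batch_memory_limit_bytes + 1 bytes" behavior described for folder children of unknown size is a common bounded-read pattern. The following sketch shows the idea with an in-memory stream. It is an assumption-laden illustration: read_bounded is hypothetical, and ResourceLimitError is only a stand-in for SchemaSanitizerResourceError.

```python
import io
from typing import BinaryIO


class ResourceLimitError(Exception):
    """Stand-in for SchemaSanitizerResourceError in this sketch."""


def read_bounded(stream: BinaryIO, limit_bytes: int, chunk_bytes: int = 1 << 20) -> bytes:
    """Read a stream of unknown size, never buffering more than limit_bytes + 1 bytes."""
    buffer = bytearray()
    while True:
        # Request at most what is needed to prove the limit was exceeded.
        remaining = limit_bytes + 1 - len(buffer)
        chunk = stream.read(min(chunk_bytes, remaining))
        if not chunk:
            return bytes(buffer)  # end of stream reached within the limit
        buffer.extend(chunk)
        if len(buffer) > limit_bytes:
            raise ResourceLimitError(
                f"input exceeds memory_limit_bytes={limit_bytes}"
            )


small = read_bounded(io.BytesIO(b"x" * 10), limit_bytes=10, chunk_bytes=4)
print(len(small))

oversized = io.BytesIO(b"x" * 100)
try:
    read_bounded(oversized, limit_bytes=9, chunk_bytes=4)
except ResourceLimitError as exc:
    print("rejected:", exc)
```

Reading one byte past the limit is enough to distinguish "exactly at the limit" from "too large" without pulling the rest of the file into memory.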

PyArrow Filesystem Integration

When PyArrow is installed, every file reader and file-to-file converter can use pyarrow.fs URI strings. This covers read_csv, read_json, read_json_folder, read_jsonl, read_xml, read_xml_folder, read_parquet, to_csv, to_jsonl, and to_parquet. Supported URI input extensions include csv, json, jsonl, ndjson, xml, parquet, and pq. Supported URI converter output extensions include csv, jsonl, and parquet.

For normal local files, prefer a regular path:

events = ss.read_jsonl("/home/user/data/events.jsonl")

Regular local paths are the simplest and usually best choice for local disk access. They avoid PyArrow URI parsing and filesystem dispatch.

file:// is PyArrow's local-filesystem URI scheme. On Linux and WSL, absolute local paths use three slashes: file:///home/user/data/events.jsonl. That URI points to the same file as /home/user/data/events.jsonl, but it is opened through pyarrow.fs.LocalFileSystem. Use it when you specifically want to test the PyArrow filesystem route or when your code passes filesystem URIs consistently across local and cloud storage. Do not write file://home/user/...; that form has home in the URI host position instead of being an absolute local path.

| Local form | Example | Opens through | Best use |
| --- | --- | --- | --- |
| Regular local path | /home/user/data/events.jsonl | schema-sanitizer local path handling | Default for local disk files. |
| Local PyArrow URI | file:///home/user/data/events.jsonl | pyarrow.fs.LocalFileSystem | Testing or URI-only code paths. |
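The difference between the three-slash and two-slash file:// forms is easy to see with a generic URI parser. The sketch below uses only urllib.parse from the Python standard library; it does not exercise PyArrow's own URI handling, and local_path_to_file_uri is a hypothetical helper for absolute POSIX paths.

```python
from urllib.parse import quote, urlsplit


def local_path_to_file_uri(absolute_posix_path: str) -> str:
    """Build a file:// URI with an empty host from an absolute POSIX path."""
    if not absolute_posix_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # "file://" + "/home/..." gives the three-slash form; quote() keeps "/" as-is.
    return "file://" + quote(absolute_posix_path)


good = urlsplit(local_path_to_file_uri("/home/user/data/events.jsonl"))
bad = urlsplit("file://home/user/data/events.jsonl")

print(repr(good.netloc), good.path)  # empty host, full absolute path
print(repr(bad.netloc), bad.path)    # "home" lands in the host position
```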

Common URI forms:

| Storage | Example URI |
| --- | --- |
| Local file through PyArrow | file:///home/user/data/events.jsonl |
| Amazon S3 | s3://raw-bucket/events/2026-06-12.jsonl |
| Amazon S3 folder | s3://raw-bucket/events/2026-06-12/ |
| Google Cloud Storage | gs://raw-bucket/assets/2026-06-12.parquet |
| Google Cloud Storage folder | gs://raw-bucket/assets/2026-06-12/ |
| Google Cloud Storage alias | gcs://raw-bucket/assets/2026-06-12.xml |
| Azure Data Lake Storage Gen2 | abfs://container@account.dfs.core.windows.net/events/2026-06-12.jsonl |
| Azure Data Lake Storage Gen2 folder | abfs://container@account.dfs.core.windows.net/events/2026-06-12/ |

Cloud URI support depends on the installed PyArrow build and the normal provider credentials/configuration available to PyArrow.

import schema_sanitizer as ss

events = ss.read_jsonl("s3://raw-bucket/events/2026-06-12.jsonl")
assets = ss.read_parquet("gs://raw-bucket/assets/2026-06-12.parquet")
daily_events = ss.read_json_folder("s3://raw-bucket/events/2026-06-12/")

ss.to_parquet(
    "s3://raw-bucket/events/2026-06-12.jsonl",
    "gs://clean-bucket/events/2026-06-12.parquet",
)

URI file inputs are opened as seekable PyArrow files. CSV, JSON, JSON Lines, NDJSON, and XML bytes are fed directly to the native chunk scanner. Parquet is decoded with pyarrow.parquet into batches, converted incrementally to JSON Lines bytes, and then fed to the same native sanitizer path. No single-file URI input is copied to a temporary file by schema-sanitizer.

Folder URI inputs are listed with non-recursive pyarrow.fs.FileSelector. read_json_folder filters direct .json child files and read_xml_folder filters direct .xml child files. The matching children are sorted by filename, then compacted one document at a time into a local temporary stream before the normal sanitizer pipeline reads that stream.
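For a local folder, the list, filter, sort, and compact steps can be sketched with pathlib and json. This is an approximation rather than the library's code: the helper names are hypothetical, suffix matching is assumed to be case-sensitive, and each source document is assumed to become exactly one JSON Lines row.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def list_direct_children(folder: Path, suffix: str = ".json") -> List[Path]:
    """Non-recursive listing: direct child files with the given suffix, sorted by filename."""
    children = [p for p in folder.iterdir() if p.is_file() and p.suffix == suffix]
    return sorted(children, key=lambda p: p.name)


def compact_to_jsonl(folder: Path, out_path: Path) -> int:
    """Write each source document as one compact line; only one document is in memory at a time."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for child in list_direct_children(folder):
            with child.open("r", encoding="utf-8") as handle:
                document = json.load(handle)
            out.write(json.dumps(document, separators=(",", ":")) + "\n")
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input"
    source.mkdir()
    (source / "b.json").write_text('{"id": 2}', encoding="utf-8")
    (source / "a.json").write_text('{\n  "id": 1\n}', encoding="utf-8")
    (source / "notes.txt").write_text("ignored", encoding="utf-8")
    (source / "nested").mkdir()
    (source / "nested" / "c.json").write_text('{"id": 3}', encoding="utf-8")

    out_file = Path(tmp) / "compacted.jsonl"
    written = compact_to_jsonl(source, out_file)
    lines = out_file.read_text(encoding="utf-8").splitlines()

print(written, lines)
```

The nested c.json and the notes.txt file are both left out, which mirrors the "direct child files only" rule.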

URI outputs are opened with pyarrow.fs.open_output_stream. CSV and Parquet writers stream Arrow batches to that output stream, and JSON Lines writes UTF-8 bytes incrementally. The output URI is not staged through a local temporary file.

Supported Inputs

Supported inputs are intentionally file-oriented:

  • Normal local file paths for read_csv, read_json, read_jsonl, read_xml, read_parquet, to_csv, to_jsonl, and to_parquet.
  • PyArrow filesystem file URI strings for the same single-file readers and converters when PyArrow is installed and can open the URI.
  • Normal local folders for read_json_folder and read_xml_folder.
  • PyArrow filesystem folder URI strings for read_json_folder and read_xml_folder; folder exploration is non-recursive.
  • Already-resident list[dict] rows through read_python.

Unsupported Inputs

Unsupported inputs include raw JSON or XML strings, bytes payloads, opened files, io.BytesIO, io.StringIO, custom reader objects, URLs that PyArrow cannot open as files, and recursive folder scans. Write those inputs to a local file first, or use read_python for in-memory list[dict] rows.
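When the data is already in memory as a bytes payload, one option is to decode it into list[dict] rows first. The helper below is a hypothetical sketch for UTF-8 JSON Lines payloads; the resulting rows are the kind of input read_python is described as accepting. Unlike the library's readers, this sketch raises on the first malformed line instead of applying an on_error policy.

```python
import json
from typing import Any, Dict, List


def rows_from_jsonl_bytes(payload: bytes) -> List[Dict[str, Any]]:
    """Decode a UTF-8 JSON Lines payload into a list of dict rows."""
    rows: List[Dict[str, Any]] = []
    # Split on b"\n" only, so unusual line separators inside JSON strings are preserved.
    for line_number, raw_line in enumerate(payload.split(b"\n"), start=1):
        line = raw_line.decode("utf-8").strip()
        if not line:
            continue  # tolerate blank lines and a trailing newline
        value = json.loads(line)
        if not isinstance(value, dict):
            raise ValueError(f"line {line_number} is not a JSON object")
        rows.append(value)
    return rows


payload = b'{"id": 1}\n\n{"id": 2, "tags": ["a"]}\n'
rows = rows_from_jsonl_bytes(payload)
print(rows)
```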

Examples

The examples/ directory contains tutorial notebooks and one cloud pipeline CLI example:

  • 01_ingestion_and_core_api.ipynb
  • 02_options_and_stats.ipynb
  • 03_adapters_and_converters.ipynb
  • 04_streaming_large_csv_to_parquet.ipynb
  • 05_full_options_catalog_sweep.ipynb
  • 06_xml_reading_and_memory.ipynb
  • 07_gcs_jsonl_to_silver_parquet.py: GCS JSONL to GCS Parquet using a BigQuery external table schema fetched through ADBC as base_schema

Platform Notes

The Linux wheels published on PyPI target glibc-based environments (manylinux_2_28). Alpine Linux uses musl, so Alpine users should use a glibc-based Python environment or build from source.

Development

Install the project for local development:

pip install -e ".[dev]"

Run the tests:

pytest

Build the native core directly with CMake:

cmake -S . -B build/dev -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build/dev

License

schema-sanitizer is licensed under the Apache License 2.0. See LICENSE.
