Spec-driven data sanitization for CSV, JSON, JSONL, XML, Parquet, and Python objects.

schema-sanitizer

Version 0.1.2: this project is still in a testing phase. Expect the core behavior to be exercised heavily before treating it as a stable production dependency.

The extension is currently being tuned and tested for generating Parquet files and schemas used by BigQuery external tables.

schema-sanitizer turns extremely messy semistructured data into stable, consistent tables. It is built for CSV, JSON, JSON Lines, XML, Parquet, and Python rows whose real-world values do not agree on one neat schema: fields appear late, arrays and objects change shape, timestamps arrive in several formats, scalars collide with nested values, and malformed records still need a place to go.

The library's main purpose is to make ingestion predictable before data reaches analytics engines, warehouses, or incremental pipelines. It scans source data, infers a reconciled Arrow schema, converts compatible values into that schema, and isolates rows that cannot be represented cleanly. The result is a table that downstream tools can consume without rediscovering schema drift on every run.

The hard parts are handled explicitly:

  • Turning messy semistructured data into tables: mixed scalar, list, struct, null, date/time, and string values are reconciled into stable columns.
  • Schema reconciliation for incremental pipelines: base_schema lets later batches align to a previous PyArrow schema. Additive mode keeps known field types while accepting newly observed fields, and strict mode is available when the schema must not drift.
  • Memory safety: readers and converters use bounded batches, streaming writers, spill-to-disk paths where needed, depth limits, row-size budgets, and quarantine output so large or malformed inputs do not require loading the whole cleaned dataset into memory.
  • Max depth enforcement: Arrow and Parquet depth budgets can cap deeply nested records before they exceed downstream limits such as warehouse nesting constraints.

Every public reader and converter returns a Result object with clean data, bad rows, and stats.

It has two public workflows:

  • In-Memory Analytics: read_* functions return a Result whose clean_data is PyArrow, pandas, Polars, or DuckDB data.
  • File-To-File Converters: to_* functions stream sanitized files to CSV, JSON Lines, or Parquet and return a Result whose clean_data is None.

import schema_sanitizer as ss

events = ss.read_jsonl("raw/events.jsonl")
customers = ss.read_csv("raw/customers.csv", output_format="pandas")

table = events.clean_data
df = customers.clean_data

ss.to_parquet("raw/events.jsonl", "clean/events.parquet")

Install

schema-sanitizer supports Python >=3.11.

For Arrow reads and file-to-file converters:

pip install 'schema-sanitizer[pyarrow]'

Install adapter extras for the in-memory analytics tools you use:

pip install 'schema-sanitizer[pyarrow,pandas]'
pip install 'schema-sanitizer[pyarrow,polars]'
pip install 'schema-sanitizer[pyarrow,duckdb]'
pip install 'schema-sanitizer[all]'

Import with an underscore:

import schema_sanitizer as ss

In-Memory Analytics

Use read_* when you want clean data back in Python with stats.

| Function | Input | Typical use |
| --- | --- | --- |
| read_csv(path, ...) | Local or PyArrow FS .csv file | Inspect or analyze CSV data. |
| read_json(path, ...) | Local or PyArrow FS .json file | Read JSON files into a table. |
| read_json_folder(path, ...) | Local or PyArrow FS folder of .json files | Read direct JSON file children as JSONL rows. |
| read_jsonl(path, ...) | Local or PyArrow FS .jsonl / .ndjson file | Read JSON Lines or NDJSON event and log data. |
| read_xml(path, ...) | Local or PyArrow FS .xml file | Read XML documents through the native sanitizer pipeline. |
| read_xml_folder(path, ...) | Local or PyArrow FS folder of .xml files | Read direct XML file children as XML document rows. |
| read_parquet(path, ...) | Local or PyArrow FS .parquet / .pq file | Read Parquet through the same cleaning pipeline. |
| read_python(rows, ...) | list[dict] | Clean rows already in memory. |

Readers always return a Result. By default, result.clean_data is a PyArrow table.

result = ss.read_jsonl("data/events.jsonl")

print(result.clean_data.schema)
print(result.clean_data.num_rows)
print(result.stats)

Choose another in-memory analytics target with output_format.

pandas_result = ss.read_csv("data/customers.csv", output_format="pandas")
polars_result = ss.read_csv("data/customers.csv", output_format="polars")
duckdb_result = ss.read_csv("data/customers.csv", output_format="duckdb")

pandas_df = pandas_result.clean_data
polars_df = polars_result.clean_data
duckdb_rel = duckdb_result.clean_data

Accepted output_format values are pyarrow, pandas, polars, and duckdb.

Use read_python for rows that are already in memory.

rows = [
    {"id": 1, "active": "yes", "score": "10.5"},
    {"id": 2, "active": "no", "score": 8},
]

result = ss.read_python(
    rows,
    true_tokens=("yes",),
    false_tokens=("no",),
)

table = result.clean_data

File-To-File Converters

Use to_* when you want a sanitized output file and do not need clean data in memory. These functions stream sanitized output and return a Result with clean_data set to None, plus bad rows and stats.

| Function | Output | Typical use |
| --- | --- | --- |
| to_csv(input_path, output_path, ...) | CSV | Produce a flat file for spreadsheets or downstream text tools. |
| to_jsonl(input_path, output_path, ...) | JSON Lines | Produce one cleaned JSON object per line. |
| to_parquet(input_path, output_path, ...) | Parquet | Produce a typed columnar file for analytics systems. |

result = ss.to_parquet("raw/orders.csv", "clean/orders.parquet")

assert result.clean_data is None
print(result.stats)

ss.to_csv("raw/events.jsonl", "clean/events.csv")
ss.to_jsonl("raw/orders.parquet", "clean/orders.jsonl")

Converters infer the input format from the input file extension. If the input path has no useful extension, pass input_format.

ss.to_parquet("raw/events", "clean/events.parquet", input_format="jsonl")

Accepted input_format values are auto, csv, json, jsonl, ndjson, xml, and parquet.
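The extension-based selection described above can be pictured with a small sketch. This is not the library's implementation: guess_input_format and the extension table are hypothetical names, and the exact mapping (for example, whether .ndjson resolves to ndjson or jsonl internally) is an assumption based on the file extensions listed for the readers.

```python
from pathlib import PurePosixPath

# Hypothetical extension table; the real mapping is not documented here.
_EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "ndjson",
    ".xml": "xml",
    ".parquet": "parquet",
    ".pq": "parquet",
}


def guess_input_format(path: str, input_format: str = "auto") -> str:
    """Return an explicit input_format, or guess one from the file extension."""
    if input_format != "auto":
        return input_format
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in _EXTENSION_TO_FORMAT:
        # No useful extension: the caller has to say what the input is.
        raise ValueError(f"cannot infer input format from {path!r}; pass input_format")
    return _EXTENSION_TO_FORMAT[suffix]


print(guess_input_format("raw/events.jsonl"))
print(guess_input_format("raw/events", input_format="jsonl"))
```

A path such as raw/events has no suffix, which is why the converter call above passes input_format="jsonl" explicitly.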

Result Object

All public read_* and to_* functions return schema_sanitizer.Result.

For readers, result.clean_data contains the requested clean in-memory output. For converters, clean data is written to output_path, so result.clean_data is always None.

result = ss.read_csv("data/customers.csv", output_format="pandas")

df = result.clean_data
stats = result.stats
bad_rows = result.bad_rows

| Property or method | What it returns |
| --- | --- |
| clean_data | Clean data in the requested reader output_format: PyArrow table, pandas DataFrame, Polars DataFrame, or DuckDB relation. Always None for to_* converters. |
| stats | Dictionary of counters such as rows inferred, rows materialized, batches, skipped rows, quarantined rows, warnings, and errors. |
| bad_rows | Quarantined rows as a pyarrow.Table. The table may be empty when no rows were quarantined. |

Result Stats

result.stats is a plain dict. All values are integers, and any counter the runtime did not report defaults to 0.

| Property | What it means |
| --- | --- |
| inferred_rows | Rows scanned while inferring the input schema. |
| inferred_bytes | Approximate input bytes scanned while inferring the schema. |
| arrow_schema_depth | Maximum Arrow container depth found during inference. Struct and list containers count; scalar leaves and top-level field wrappers do not. |
| parquet_schema_depth | Maximum Parquet/BigQuery RECORD depth found during inference. Struct containers count; list containers and scalar leaves do not. |
| materialized_rows | Clean rows materialized for read_* results or written by to_* converters. |
| batches | Number of output batches materialized or written. |
| flattened_fields | Nested fields flattened by the selected flattening options. |
| scalar_wrappings | Scalar values wrapped to fit list or struct-like output shapes. |
| skipped_rows | Rows dropped by on_error="skip_row". |
| quarantined_rows | Rows dropped from clean output and stored in result.bad_rows. |
| warnings | Non-fatal warnings reported by the runtime. |
| errors | Fatal errors reported by the runtime. |
| soft_errors | Recoverable row or value errors handled by policy. |
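The two depth counters differ only in whether list containers count. The sketch below applies that rule to plain Python values, using dicts as structs and lists as lists. The helper names are hypothetical, and treating the row dict itself as the uncounted top-level wrapper is one reading of the definitions above, not confirmed library behavior.

```python
def arrow_depth(value) -> int:
    """Count nested struct (dict) and list containers; scalars add nothing."""
    if isinstance(value, dict):
        return 1 + max((arrow_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((arrow_depth(v) for v in value), default=0)
    return 0


def parquet_depth(value) -> int:
    """Count nested struct (dict) containers only; lists are transparent."""
    if isinstance(value, dict):
        return 1 + max((parquet_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((parquet_depth(v) for v in value), default=0)
    return 0


def row_depths(row: dict) -> tuple[int, int]:
    """Return (arrow depth, parquet depth); the row wrapper itself is not counted."""
    return (
        max((arrow_depth(v) for v in row.values()), default=0),
        max((parquet_depth(v) for v in row.values()), default=0),
    )


row = {"id": 1, "items": [{"sku": "a", "tags": ["x", "y"]}]}
print(row_depths(row))  # (3, 1): list -> struct -> list versus one struct
```

Under this reading, a list-heavy record can reach a high Arrow depth while staying shallow for Parquet/BigQuery RECORD purposes.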

Error Handling

By default (on_error="emit_null_row"), rows that fail materialization are kept as null rows so the row count stays stable. Choose a different policy with on_error.

| Policy | Behavior |
| --- | --- |
| stop | Raise an error as soon as a row cannot be processed. |
| skip_row | Drop bad rows from the output. |
| emit_null_row | Keep row count stable by emitting a null row. |
| quarantine | Drop bad rows from the output and keep them in result.bad_rows. |

result = ss.read_jsonl(
    "data/events.jsonl",
    on_error="quarantine",
)

clean = result.clean_data
print(result.stats)

bad_rows = result.bad_rows
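To make the four policies concrete, here is a minimal pure-Python sketch of how each one routes a failing row. apply_policy and convert are hypothetical names used only for illustration; the library's own pipeline is native code, and a null row is simplified to None here.

```python
def apply_policy(rows, convert, on_error="emit_null_row"):
    """Illustrative only: route rows that fail `convert` according to on_error."""
    if on_error not in ("stop", "skip_row", "emit_null_row", "quarantine"):
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    clean, bad = [], []
    stats = {"skipped_rows": 0, "quarantined_rows": 0}
    for row in rows:
        try:
            clean.append(convert(row))
        except (TypeError, ValueError):
            if on_error == "stop":
                raise                      # fail on the first bad row
            if on_error == "skip_row":
                stats["skipped_rows"] += 1
            elif on_error == "emit_null_row":
                clean.append(None)         # placeholder keeps the row count stable
            else:                          # quarantine
                bad.append(row)
                stats["quarantined_rows"] += 1
    return clean, bad, stats


rows = [{"n": "1"}, {"n": "oops"}, {"n": "3"}]
to_int = lambda r: {"n": int(r["n"])}

print(apply_policy(rows, to_int, "emit_null_row"))
print(apply_policy(rows, to_int, "quarantine"))
```

The quarantine branch is the only one that keeps the original bad row around, which is why it pairs naturally with result.bad_rows.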

Converters return the same Result shape as readers. Because the clean data is written to output_path, result.clean_data is always None for converters.

result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    on_error="quarantine",
)

print(result.stats)
bad_rows = result.bad_rows

Schema Control

Pass base_schema when the output must match or evolve from an expected contract.

import pyarrow as pa
import schema_sanitizer as ss

schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=schema,
    schema_mode="strict",
    on_error="quarantine",
)

table = result.clean_data

| Mode | Behavior |
| --- | --- |
| strict | Output exactly base_schema. Requires base_schema; inference is skipped. |
| additive | Keep base_schema field types and add newly observed fields. |

column_order defaults to base_schema_first. Use column_order="sorted" for lexicographic field ordering.
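The interaction between schema_mode and column_order can be sketched with plain dictionaries that map field names to type names. The reconcile helper is hypothetical, and appending new fields in observation order under base_schema_first is an assumption; only the mode semantics come from the table above.

```python
def reconcile(base, observed, schema_mode="additive", column_order="base_schema_first"):
    """Merge {field: type_name} mappings the way the mode table describes."""
    if schema_mode == "strict":
        if base is None:
            raise ValueError("strict mode requires base_schema")
        merged = dict(base)                      # output is exactly the base contract
    elif schema_mode == "additive":
        merged = dict(base or {})                # known fields keep their base types
        for name, type_name in observed.items():
            merged.setdefault(name, type_name)   # only unseen fields are added
    else:
        raise ValueError(f"unknown schema_mode: {schema_mode!r}")
    if column_order == "sorted":
        return dict(sorted(merged.items()))      # lexicographic field order
    return merged                                # base fields first, then new ones


base = {"id": "int64", "email": "string"}
observed = {"email": "int64", "id": "string", "country": "string"}

print(reconcile(base, observed))
print(reconcile(base, observed, column_order="sorted"))
print(reconcile(base, observed, schema_mode="strict"))
```

In additive mode the conflicting observed types for id and email are ignored in favor of the base types, while country is appended as a new field.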

Timestamp Precision

Timestamp strings are parsed internally with nanosecond precision, then written to the output Arrow schema using timestamp_precision.

result = ss.read_jsonl(
    "data/events.jsonl",
    timestamp_precision="TIMESTAMP_MICROS",
)

ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    timestamp_precision="TIMESTAMP_MICROS",
)

Accepted values are TIMESTAMP_MILLIS, TIMESTAMP_MICROS, and TIMESTAMP_NANOS. The default is TIMESTAMP_MICROS because it is compatible with BigQuery Parquet external tables. Selecting TIMESTAMP_NANOS preserves nanosecond Arrow/Parquet timestamps, but some downstream engines, including BigQuery, do not support Parquet TIMESTAMP_NANOS.

When parsed timestamp strings contain finer precision than the selected output unit, the value is truncated to that unit. Integer values coerced into timestamp fields are interpreted as already being in the selected output unit.
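The truncation and integer rules amount to simple arithmetic, sketched below. The helper names are hypothetical, values are assumed to be non-negative counts since the Unix epoch, and UTC is used only to display the result.

```python
from datetime import datetime, timedelta, timezone

UNITS_PER_SECOND = {
    "TIMESTAMP_MILLIS": 1_000,
    "TIMESTAMP_MICROS": 1_000_000,
    "TIMESTAMP_NANOS": 1_000_000_000,
}


def truncate_nanos(ns: int, precision: str) -> int:
    """Convert non-negative epoch nanoseconds to the output unit, dropping finer digits."""
    return ns // (1_000_000_000 // UNITS_PER_SECOND[precision])


def integer_to_datetime(value: int, precision: str) -> datetime:
    """Interpret an integer as a count of the selected output unit since the epoch."""
    per_second = UNITS_PER_SECOND[precision]
    seconds, remainder = divmod(value, per_second)
    micros = remainder * 1_000_000 // per_second
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=seconds, microseconds=micros)


parsed_ns = 1_700_000_000_123_456_789           # parsed with nanosecond precision
print(truncate_nanos(parsed_ns, "TIMESTAMP_MICROS"))   # 1700000000123456
print(truncate_nanos(parsed_ns, "TIMESTAMP_MILLIS"))   # 1700000000123

# The same integer means different instants depending on the selected unit.
print(integer_to_datetime(1_700_000_000_123_456, "TIMESTAMP_MICROS"))  # 2023-11-14 22:13:20.123456+00:00
print(integer_to_datetime(1_700_000_000, "TIMESTAMP_MICROS"))          # 1970-01-01 00:28:20+00:00
```

The last line shows the practical consequence of the integer rule: an epoch value in seconds that lands in a microsecond timestamp field is read as microseconds, so it may be worth rescaling such integers before ingestion.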

Custom Tokens and Date/Time Patterns

Use true_tokens and false_tokens when boolean values use domain-specific strings. Use temporal regex options when dates or times do not match the built-in parsers.

result = ss.read_csv(
    "data/events.csv",
    true_tokens=("yes", "enabled", "1"),
    false_tokens=("no", "disabled", "0"),
    timestamp_patterns=(
        r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$",
    ),
    date_patterns=(
        r"^(\d{4})\.(\d{2})\.(\d{2})$",
    ),
    time_patterns=(
        r"^(\d{2})h(\d{2})m(\d{2})s$",
    ),
)

table = result.clean_data

For timestamp_patterns, capture groups 1-6 are year, month, day, hour, minute, and second. Optional group 7 may contain fractions, and group 8 may contain a timezone. For date_patterns, groups 1-3 are year, month, and day. For time_patterns, groups 1-3 are hour, minute, and second.
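Before passing a pattern to the library, it can help to check with Python's re module that the capture groups land in the documented positions. The sketch below reuses the three patterns from the example. The EXTENDED pattern with groups 7 and 8 is a hypothetical illustration, and how the library interprets the fraction digits and timezone text is not specified here.

```python
import re
from datetime import date, datetime, time

TIMESTAMP = re.compile(r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$")
DATE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})$")
TIME = re.compile(r"^(\d{2})h(\d{2})m(\d{2})s$")

# Hypothetical pattern: optional group 7 (fraction) and group 8 (timezone).
EXTENDED = re.compile(
    r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})"
    r"(?:\.(\d{1,9}))?(Z|[+-]\d{2}:\d{2})?$"
)


def groups_as_ints(pattern, text):
    """Return the capture groups as integers, or None when the text does not match."""
    match = pattern.match(text)
    return None if match is None else tuple(int(g) for g in match.groups())


print(datetime(*groups_as_ints(TIMESTAMP, "2024/03/05 14:07:09")))  # 2024-03-05 14:07:09
print(date(*groups_as_ints(DATE, "2024.03.05")))                    # 2024-03-05
print(time(*groups_as_ints(TIME, "14h07m09s")))                     # 14:07:09
print(EXTENDED.match("2024/03/05T14:07:09.250Z").groups()[6:])      # ('250', 'Z')
```

Because the groups are positional, a pattern that captures day before month would silently swap the two, so a quick check like this is cheap insurance.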

In-Memory Analytics Options

Each reader accepts the parameters listed in its section.

read_csv(path, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| path | required | str or path-like object | Local CSV file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. Groups 1-6 map to year, month, day, hour, minute, second; group 7 may hold fractions and group 8 timezone. |
| date_patterns | () | sequence of regex strings | Extra date parsers. Groups 1-3 map to year, month, day. |
| time_patterns | () | sequence of regex strings | Extra time parsers. Groups 1-3 map to hour, minute, second. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether the first CSV row is a header. |
| csv_delimiter | , | single-character string | CSV delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV bytes. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming CSV reads. |

read_json(path, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| path | required | str or path-like object | Local JSON file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode JSON bytes. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming JSON reads. |

read_json_folder(path, ...)

read_json_folder reads the direct .json children of a local folder or PyArrow filesystem folder URI in deterministic filename order. Folder exploration is not recursive. Each source file must contain one JSON document; the reader compacts those documents into a temporary JSON Lines stream and then runs the same sanitizer path used by read_json.
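The folder handling described above (direct children only, deterministic filename order, one document per file, compaction into a temporary JSON Lines stream) can be approximated in a few lines of standard-library Python. compact_json_folder is a hypothetical helper for local folders only; it loads each document fully, whereas the library applies its own memory budgets.

```python
import json
import tempfile
from pathlib import Path


def compact_json_folder(folder: str) -> Path:
    """Write each direct .json child of `folder` as one line of a temporary JSON Lines file."""
    sources = sorted(
        (p for p in Path(folder).iterdir() if p.is_file() and p.suffix == ".json"),
        key=lambda p: p.name,  # deterministic filename order; subfolders are not visited
    )
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf-8"
    ) as out:
        for source in sources:
            document = json.loads(source.read_text(encoding="utf-8"))
            # Compact separators and no indent keep each document on a single line.
            out.write(json.dumps(document, separators=(",", ":")) + "\n")
    return Path(out.name)
```

The returned path could then be handed to a JSON Lines reader; the caller is responsible for deleting the temporary file.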

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| path | required | str or path-like object | Local folder or PyArrow FS folder URI containing .json files. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode each source JSON file. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-document and per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for the compacted JSON Lines stream. |

read_jsonl(path, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| path | required | str or path-like object | Local JSON Lines or NDJSON file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode JSON Lines or NDJSON bytes. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming JSON Lines or NDJSON reads. |

read_xml(path, ...)

read_xml parses a local XML document in the native C++ frontend and sends the resulting rows through the same schema inference, cleaning, quarantine, and output adapter pipeline as the JSON and CSV readers.

By default, the root element is treated as one row, like a single JSON object. Pass xml_row_tag="row" when a file contains repeated direct child elements that should become separate rows; the XML scanner then streams each matching row element. Attributes become fields prefixed with @, repeated child tags become lists, and mixed element text is stored under #text.
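The XML-to-row mapping rules above can be approximated with the standard library to see what shape of record to expect. This is a sketch, not the native scanner: element_to_value and rows_from_xml are hypothetical names, the whole document is parsed at once rather than streamed, and returning plain text for a leaf element without attributes is an assumption.

```python
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Convert one element to a text scalar or a dict following the rules above."""
    text = (element.text or "").strip()
    if not element.attrib and len(element) == 0:
        return text or None                       # plain leaf: just its text
    value = {f"@{name}": attr for name, attr in element.attrib.items()}
    for child in element:
        child_value = element_to_value(child)
        if child.tag in value:                    # repeated tag: collect into a list
            if not isinstance(value[child.tag], list):
                value[child.tag] = [value[child.tag]]
            value[child.tag].append(child_value)
        else:
            value[child.tag] = child_value
    if text:
        value["#text"] = text                     # mixed element text
    return value


def rows_from_xml(xml_text, xml_row_tag=None):
    """Return one row for the root, or one row per matching direct child."""
    root = ET.fromstring(xml_text)
    if xml_row_tag is None:
        return [element_to_value(root)]
    return [element_to_value(child) for child in root if child.tag == xml_row_tag]


doc = """<orders>
  <order id="1"><item>pen</item><item>ink</item></order>
  <order id="2">rush<item>cap</item></order>
</orders>"""

for row in rows_from_xml(doc, xml_row_tag="order"):
    print(row)
```

Note that item is a list in the first row and a scalar in the second; this is the kind of shape disagreement the sanitizer's schema reconciliation is meant to absorb.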

result = ss.read_xml(
    "raw/orders.xml",
    xml_row_tag="order",
    read_chunk_bytes=1024 * 1024,
    batch_memory_limit_bytes=256 * 1024 * 1024,
)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| path | required | str or path-like object | Local XML file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode XML bytes when transcoding is needed. |
| xml_row_tag | None | XML element tag name or None | Direct child element tag to stream as separate rows. None treats the whole document as one row. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |

read_xml_folder(path, ...)

read_xml_folder reads the direct .xml children of a local folder or PyArrow filesystem folder URI in deterministic filename order. Folder exploration is not recursive. Each source file must contain one XML document, and all documents must use the same root tag unless you pass that tag explicitly as xml_row_tag. The reader wraps those documents in a temporary XML stream and then runs the same sanitizer path used by read_xml.

result = ss.read_xml_folder(
    "raw/order-events",
    xml_row_tag="order",
    batch_memory_limit_bytes=256 * 1024 * 1024,
)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| path | required | str or path-like object | Local folder or PyArrow FS folder URI containing .xml files. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode each source XML file. |
| xml_row_tag | None | XML element tag name or None | Expected XML document root tag. None infers it from the first file. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-document-row memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for the compacted XML stream. |

read_parquet(path, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| path | required | str or path-like object | Local Parquet file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |

read_python(rows, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| rows | required | list[dict] | In-memory rows to normalize. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort memory budget for the already-resident Python payload. |

File-To-File Converter Options

Converters accept local paths or PyArrow FS URI strings for both input and output. They infer the input format from the input file extension unless you pass input_format.

to_csv(input_path, output_path, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| input_path | required | str or path-like object | Local file or PyArrow FS URI to sanitize. |
| output_path | required | str or path-like object | Local or PyArrow FS URI CSV file to create. |
| input_format | auto | auto, csv, json, jsonl, ndjson, xml, parquet | Input format selector. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether CSV input has a header. |
| csv_delimiter | , | single-character string | CSV input delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| xml_row_tag | None | XML element tag name or None | Direct child XML element tag to stream as separate rows when reading XML input. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |

to_jsonl(input_path, output_path, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| input_path | required | str or path-like object | Local file or PyArrow FS URI to sanitize. |
| output_path | required | str or path-like object | Local or PyArrow FS URI JSON Lines file to create. |
| input_format | auto | auto, csv, json, jsonl, ndjson, xml, parquet | Input format selector. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether CSV input has a header. |
| csv_delimiter | , | single-character string | CSV input delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| xml_row_tag | None | XML element tag name or None | Direct child XML element tag to stream as separate rows when reading XML input. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |

to_parquet(input_path, output_path, ...)

| Parameter | Default | Accepted values | What it controls |
| --- | --- | --- | --- |
| input_path | required | str or path-like object | Local file or PyArrow FS URI to sanitize. |
| output_path | required | str or path-like object | Local or PyArrow FS URI Parquet file to create. |
| input_format | auto | auto, csv, json, jsonl, ndjson, xml, parquet | Input format selector. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| timestamp_precision | TIMESTAMP_MICROS | TIMESTAMP_MILLIS, TIMESTAMP_MICROS, TIMESTAMP_NANOS | Output Arrow/Parquet timestamp unit. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether CSV input has a header. |
| csv_delimiter | , | single-character string | CSV input delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| xml_row_tag | None | XML element tag name or None | Direct child XML element tag to stream as separate rows when reading XML input. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |
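As a rough illustration of how the accepted values above constrain a call, the following sketch checks a keyword dictionary before it would be passed to to_parquet. The helper check_to_parquet_options and the ALLOWED mapping are hypothetical and not part of schema-sanitizer; only the option names and accepted values come from the table.

```python
# Hypothetical pre-flight check; schema-sanitizer does its own validation.
ALLOWED = {
    "input_format": {"auto", "csv", "json", "jsonl", "ndjson", "xml", "parquet"},
    "schema_mode": {"additive", "strict"},
    "column_order": {"base_schema_first", "sorted"},
    "timestamp_precision": {"TIMESTAMP_MILLIS", "TIMESTAMP_MICROS", "TIMESTAMP_NANOS"},
    "on_error": {"stop", "skip_row", "emit_null_row", "quarantine"},
}


def check_to_parquet_options(options):
    """Return a list of problems found in a to_parquet keyword dict."""
    problems = []
    for name, allowed in ALLOWED.items():
        if name in options and options[name] not in allowed:
            problems.append(f"{name}={options[name]!r} is not an accepted value")
    delimiter = options.get("csv_delimiter", ",")
    if not (isinstance(delimiter, str) and len(delimiter) == 1):
        problems.append("csv_delimiter must be a single-character string")
    for name in ("arrow_max_depth", "parquet_max_depth"):
        value = options.get(name, 0)
        if not (isinstance(value, int) and value >= 0):
            problems.append(f"{name} must be an integer >= 0")
    # Strict mode only works together with base_schema.
    if options.get("schema_mode") == "strict" and options.get("base_schema") is None:
        problems.append("schema_mode='strict' requires base_schema")
    return problems


options = {
    "input_format": "jsonl",
    "schema_mode": "additive",
    "on_error": "quarantine",
    "parquet_max_depth": 15,
}
print(check_to_parquet_options(options))
```

A dictionary that passes this check could then be expanded into the call with ** after the two path arguments.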

Schema Inference Heuristics

Whenever inference runs, it scans the full source before materialization. Inference is not sample-based: both when the schema is fully inferred and when base_schema is used in additive mode, every source row is consumed during inference and counted in Result.stats["inferred_rows"].

For each inferred row, the sanitizer applies two internal passes:

  1. The shape pass discovers structural paths: field names, objects, arrays, and fields that must be flattened by depth limits.
  2. The statistics pass collects scalar type evidence for the discovered shape: booleans, integers, floats, timestamps, dates, times, strings, nulls, and mixed-type conflicts.

When schema_mode="strict" is used with an explicit base_schema, the sanitizer skips inference and uses the schema contract directly; in that fast path, inferred_rows is 0. Strict mode requires base_schema: passing schema_mode="strict" without it raises an exception.

Separating shape discovery from scalar statistics keeps list and struct decisions stable across messy inputs. If one row has an object and another row has a scalar at the same field, the structural shape wins and the scalar is wrapped under scalar_object_key (default_key by default). If one row has a list and another row has a scalar at the same field, the list shape wins and the scalar is wrapped as a single list element.
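The wrapping rules can be sketched in plain Python. This is a simplified model for one field, not the library's native implementation: winning_shape, wrap_to_shape, and reconcile_field are hypothetical names, and passing nulls through unchanged is an assumption of the sketch.

```python
def winning_shape(values):
    """Shape pass for one field: a structural observation beats scalars."""
    saw_object = any(isinstance(v, dict) for v in values)
    saw_list = any(isinstance(v, list) for v in values)
    if saw_object and saw_list:
        # Object-versus-list conflicts are not described above, so the
        # sketch refuses to guess.
        raise NotImplementedError("object/list conflicts are outside this sketch")
    if saw_object:
        return "object"
    if saw_list:
        return "list"
    return "scalar"


def wrap_to_shape(value, shape, scalar_object_key="default_key"):
    """Coerce one value to the winning shape of its field."""
    if value is None:
        return None  # assumption: nulls pass through unchanged
    if shape == "object" and not isinstance(value, dict):
        return {scalar_object_key: value}  # scalar wrapped under the key
    if shape == "list" and not isinstance(value, list):
        return [value]  # scalar wrapped as a single list element
    return value


def reconcile_field(values, scalar_object_key="default_key"):
    shape = winning_shape(values)
    return [wrap_to_shape(v, shape, scalar_object_key) for v in values]


owners = reconcile_field([{"id": 1}, "alice", None])
tags = reconcile_field([["a", "b"], "c"])
print(owners)
print(tags)
```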

Scalar inference is conservative:

  • Nulls do not choose a type by themselves.
  • Boolean JSON values infer bool.
  • Numeric JSON values infer int64 or float64.
  • Strings can infer booleans, integers, floats, timestamps, dates, or times when the configured token and parser options match.
  • Mixed scalar kinds fall back to string.
  • Objects or arrays observed where a scalar is required are stringified.
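A minimal sketch of these rules for already-decoded JSON values follows. It is a simplified model under stated assumptions: scalar_kind and infer_column_kind are hypothetical names, string parsing (integer-looking strings, tokens, timestamps) is left out entirely, and an int/float mix is treated as a plain mixed-kind conflict because the rules above do not say whether numerics are widened.

```python
def scalar_kind(value):
    """Classify one decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    # Strings stay strings here; objects or arrays observed where a scalar
    # is required are treated as stringified.
    return "string"


def infer_column_kind(values):
    """Combine per-value evidence into one column kind."""
    kinds = {scalar_kind(v) for v in values} - {"null"}
    if not kinds:
        return None  # nulls alone do not choose a type
    if len(kinds) == 1:
        return kinds.pop()
    return "string"  # mixed scalar kinds fall back to string


print(infer_column_kind([1, None, 2]))
print(infer_column_kind([True, "yes"]))
print(infer_column_kind([None, None]))
```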

List inference is stricter than top-level object inference. Lists remain typed only when their element shape is conflict-free. Lists of scalars and lists of structs are supported; nested lists or conflicts inside a list element fall back to list<string> so each list column has one stable element type.
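The list rule can be illustrated with a small sketch. It only looks at the top-level kind of each element, so conflicts inside struct elements are not modeled, and skipping null elements is an assumption. The names element_kind and infer_list_type are hypothetical.

```python
def element_kind(item):
    """Top-level kind of one non-null, non-list element."""
    if isinstance(item, dict):
        return "struct"
    if isinstance(item, bool):
        return "bool"
    if isinstance(item, int):
        return "int64"
    if isinstance(item, float):
        return "float64"
    return "string"


def infer_list_type(lists):
    """Pick one element type for a column of list values."""
    kinds = set()
    for items in lists:
        for item in items:
            if item is None:
                continue  # assumption: nulls carry no type evidence
            if isinstance(item, list):
                return "list<string>"  # nested lists fall back
            kinds.add(element_kind(item))
    if not kinds:
        return None  # no evidence at all; left undecided in this sketch
    if len(kinds) == 1:
        return f"list<{kinds.pop()}>"
    return "list<string>"  # conflicting element shapes fall back


print(infer_list_type([[1, 2], [3]]))
print(infer_list_type([[1], [[2]]]))
```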

Base Schema Enforcement

base_schema is an output contract and only accepts a pyarrow.Schema. The sanitizer converts it to the same internal logical schema representation used by inference before planning materialization.

import pyarrow as pa
import schema_sanitizer as ss

user_schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)

result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=user_schema,
    schema_mode="strict",
)

schema_mode="strict" uses base_schema as the complete output schema. The inference loop is skipped, so Result.stats["inferred_rows"] is 0. Strict mode only works when base_schema is provided; otherwise the call raises an exception before materialization. Extra source fields are rejected because they are not present in the strict contract.

schema_mode="additive" requires inference. The full source is scanned, then the inferred schema is reconciled with base_schema: fields already present in base_schema keep their declared types and nullable flags, while newly observed fields are added from inference. Row values that cannot be coerced into the declared base field type are handled by on_error.

For fields present in base_schema, the base type wins even when source rows contain conflicting values. A value such as "unknown" in a base int64 field, an object in a base scalar field, or a scalar in a base struct/list field is a materialization conflict. Depending on on_error, processing stops, or the row is skipped, quarantined, or replaced with a null row.

For fields not present in base_schema, conflicts are resolved by the normal inference heuristics before the field is added: mixed scalar kinds fall back to string, object/scalar and list/scalar conflicts use the wrapping rules, and conflicting list element shapes fall back to list<string>.

column_order controls only output field order after reconciliation. column_order="base_schema_first" preserves base fields first, then appends new fields. column_order="sorted" emits fields in lexicographic order.
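Additive reconciliation and column ordering can be modeled with plain dictionaries that map field names to type names. This is a conceptual sketch rather than the library's logic: reconcile_additive is a hypothetical name, and appending new fields in the order they were observed is an assumption.

```python
def reconcile_additive(base_fields, inferred_fields, column_order="base_schema_first"):
    """Merge two name -> type mappings the way additive mode is described."""
    merged = dict(base_fields)  # base fields keep their declared types
    for name, type_name in inferred_fields.items():
        merged.setdefault(name, type_name)  # only unseen fields are added
    if column_order == "sorted":
        return dict(sorted(merged.items()))  # lexicographic field order
    return merged  # base fields first, then new fields


base = {"id": "int64", "email": "string"}
inferred = {"email": "int64", "age": "int64", "id": "string"}

base_first = reconcile_additive(base, inferred)
sorted_order = reconcile_additive(base, inferred, column_order="sorted")
print(list(base_first.items()))
print(list(sorted_order))
```

In both orderings the types of id and email come from the base mapping; only the position of the fields changes.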

Max Depth Enforcement

Depth enforcement uses two independent limits because Arrow and Parquet/BigQuery count nested data differently:

  • arrow_max_depth defaults to 32. It counts Arrow container depth: struct and list containers count, while scalar leaves and top-level field wrappers do not.
  • parquet_max_depth defaults to 15. It counts Parquet/BigQuery RECORD depth: struct containers count, while list containers, scalar leaves, and top-level field wrappers do not.

The sanitizer flattens a named field to <name>_flattened when keeping that field's full nested value would exceed either limit. The flattened value is stored as a string.

Depth examples:

| Shape | arrow_schema_depth | parquet_schema_depth |
| --- | --- | --- |
| id: int64 | 0 | 0 |
| user: struct<id: int64> | 1 | 1 |
| tags: list<string> | 1 | 0 |
| authors: list<struct<name: string>> | 2 | 1 |
| asset: struct<authors: list<struct<name: string>>> | 3 | 2 |

Use arrow_max_depth as a defensive complexity limit for Arrow/Parquet container nesting. Use parquet_max_depth=15 when the output Parquet will be read by BigQuery external tables, where the practical limit is nested RECORD depth rather than physical list wrapper depth.

The reported Result.stats["arrow_schema_depth"] and Result.stats["parquet_schema_depth"] use the same counting rules as the enforcement options.
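The two counting rules can be reproduced with a short recursive sketch. The tuple-based type encoding and the function names arrow_depth and parquet_depth are assumptions made for illustration; the library works on Arrow schemas natively.

```python
# Type encoding used by this sketch only:
#   "int64"                      -> scalar leaf
#   ("list", element_type)       -> list container
#   ("struct", {name: type})     -> struct container


def arrow_depth(type_):
    """Struct and list containers each add one level."""
    if isinstance(type_, tuple) and type_[0] == "struct":
        return 1 + max((arrow_depth(child) for child in type_[1].values()), default=0)
    if isinstance(type_, tuple) and type_[0] == "list":
        return 1 + arrow_depth(type_[1])
    return 0  # scalar leaves do not count


def parquet_depth(type_):
    """Only struct (RECORD) containers add a level; lists are transparent."""
    if isinstance(type_, tuple) and type_[0] == "struct":
        return 1 + max((parquet_depth(child) for child in type_[1].values()), default=0)
    if isinstance(type_, tuple) and type_[0] == "list":
        return parquet_depth(type_[1])
    return 0


authors = ("list", ("struct", {"name": "string"}))
asset = ("struct", {"authors": authors})

print(arrow_depth(authors), parquet_depth(authors))
print(arrow_depth(asset), parquet_depth(asset))
```

The printed pairs match the authors and asset rows of the depth examples.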

Quarantine Rows Pipeline

Use on_error="quarantine" when you want clean output to continue while keeping failed rows for inspection or replay. Rows that fail materialization are dropped from clean_data or the converter output file and appended to Result.bad_rows.

bad_rows is a PyArrow table with diagnostic metadata:

| Column | What it contains |
| --- | --- |
| row_index | Zero-based source row index. |
| source_offset | Byte offset or source-relative offset when available. |
| code / code_str | Machine-readable diagnostic code. |
| path_id | Internal field path id associated with the failure. |
| detail | Human-readable error detail. |
| context_snippet | Short preview of the offending source row. |
| raw_row | Full raw source row text when available. |

For in-memory reads, inspect result.bad_rows directly:

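import schema_sanitizer as ss

# event_schema is a pyarrow.Schema for the expected event columns,
# built the same way as user_schema in the base schema example.
result = ss.read_jsonl(
    "data/events.jsonl",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

clean = result.clean_data  # rows that materialized cleanly
bad_rows = result.bad_rows  # PyArrow table of quarantined rows
print(result.stats["quarantined_rows"])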

For file-to-file converters, the clean output is written to output_path and the same Result.bad_rows table carries quarantined rows:

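# Clean rows are written to clean/events.parquet; for converters,
# result.clean_data is None.
result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)

bad_rows = result.bad_rows  # quarantined rows, same layout as for readers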

Quarantine is row-level. If one field in a row cannot be coerced into the output schema, the whole row is excluded from clean output and recorded once in bad_rows. In contrast, on_error="skip_row" drops the row without retaining it, and on_error="emit_null_row" keeps row count stable by writing a null row instead of recording it in bad_rows.
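The difference between the four policies can be shown with a small pure-Python model. It is only a sketch: materialize and to_event are hypothetical names, None stands in for the null row, and the quarantine records carry just a few of the bad_rows columns.

```python
POLICIES = {"stop", "skip_row", "emit_null_row", "quarantine"}


def materialize(rows, convert, on_error="emit_null_row"):
    """Apply a row-level error policy while converting rows."""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    clean, bad_rows = [], []
    for row_index, row in enumerate(rows):
        try:
            clean.append(convert(row))
        except ValueError as exc:
            if on_error == "stop":
                raise  # abort on the first failing row
            if on_error == "emit_null_row":
                clean.append(None)  # stand-in for the null row
            elif on_error == "quarantine":
                bad_rows.append(
                    {"row_index": row_index, "detail": str(exc), "raw_row": row}
                )
            # skip_row: drop the row without retaining it
    return clean, bad_rows


def to_event(row):
    return {"id": int(row["id"])}  # int("unknown") raises ValueError


rows = [{"id": "1"}, {"id": "unknown"}, {"id": "3"}]
clean, bad_rows = materialize(rows, to_event, on_error="quarantine")
print(clean)
print(bad_rows[0]["row_index"])
```

Only emit_null_row keeps the output row count equal to the input row count; quarantine and skip_row both shrink the clean output, but only quarantine keeps the failed row.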

Memory Safety Measures

The sanitizer is designed to process large local files and PyArrow filesystem URI inputs without requiring the whole clean dataset to live in Python memory.

  • File-to-file converters stream sanitized batches directly to the output file. Result.clean_data is None for converters, so the clean table is not materialized in memory.
  • PyArrow filesystem file inputs are opened as seekable streams. CSV, JSON, JSON Lines, NDJSON, and XML URI inputs are not copied to a temporary file; their bytes are read by the same chunked native scanner used for local files.
  • PyArrow filesystem outputs are opened with pyarrow.fs.open_output_stream. CSV, JSON Lines, and Parquet converters write incrementally to that stream instead of staging the full output in a local temporary file.
  • CSV, JSON, JSON Lines, and NDJSON readers use read_chunk_bytes to bound input chunks while scanning.
  • XML without xml_row_tag is parsed into a native document tree before row emission, so batch_memory_limit_bytes limits the accumulated document size before the tree is built.
  • XML with xml_row_tag streams matching direct child elements. The scanner reads bounded chunks, discards completed row slices, and raises SchemaSanitizerResourceError if the active XML buffer exceeds batch_memory_limit_bytes.
  • Local and PyArrow filesystem folder readers (read_json_folder and read_xml_folder) list direct child files only, then compact one source document at a time into a local temporary JSON Lines or XML stream. The temp file is the bridge that lets many single-document files reuse the normal streaming sanitizer pipeline without building one large Python object.
  • Folder temp streams contain only the compacted input representation, not the final clean dataset. With batch_memory_limit_bytes, each source document is checked before it is decoded and added to that stream. If a PyArrow filesystem does not report a child file size, the child is read in bounded chunks and the reader stops at batch_memory_limit_bytes + 1 bytes before raising SchemaSanitizerResourceError.
  • Folder temp files are deleted when the read finishes, and partially written temp files are deleted if compaction raises an exception. If the Python process is killed externally, for example with SIGKILL, the operating system may not give schema-sanitizer a chance to run that cleanup.
  • Parquet inputs are decoded by PyArrow into record batches and exposed to the native JSON frontend through a seekable JSON Lines byte reader. Rows are produced incrementally; the Parquet-to-JSONL adapter does not stage a full conversion file.
  • XML DTD and entity declarations are rejected. The XML frontend does not load external entities or expand document-defined entities.
  • batch_memory_limit_bytes maps to the native per-batch memory_limit_bytes budget. It reduces inference and output batch sizes instead of changing the final schema.
  • For already-resident Python inputs, batch_memory_limit_bytes is enforced as a preflight resource guard. If the Python payload is already larger than the configured limit, the call raises SchemaSanitizerResourceError before native ingestion starts.
  • arrow_max_depth and parquet_max_depth cap nested expansion. Values beyond those limits are flattened to strings, preventing unbounded container nesting from creating very wide or deeply nested Arrow/Parquet schemas.
  • Native parsing and materialization use owned streams, arenas, and Arrow C Data resources that are closed when the Result, stream, or sink is closed or dropped. Table-producing readers force stream materialization and close native resources before returning.

Configured resource-limit failures raise SchemaSanitizerResourceError and include limit_name="memory_limit_bytes" in their detail payload when available. True allocator failures are reported separately as SchemaSanitizerOutOfMemoryError.
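The bounded read described for folder children whose size is unknown (stop at batch_memory_limit_bytes + 1 bytes, then raise) can be sketched with the standard library. ResourceLimitError and read_bounded are hypothetical stand-ins, not the library's classes or functions.

```python
import io


class ResourceLimitError(Exception):
    """Stand-in for SchemaSanitizerResourceError in this sketch."""


def read_bounded(stream, limit_bytes, chunk_bytes=1048576):
    """Read a whole stream, but never consume more than limit_bytes + 1 bytes."""
    buffer = bytearray()
    while True:
        # One byte past the limit is enough to prove the input is too large.
        remaining = limit_bytes + 1 - len(buffer)
        chunk = stream.read(min(chunk_bytes, remaining))
        if not chunk:
            return bytes(buffer)
        buffer.extend(chunk)
        if len(buffer) > limit_bytes:
            raise ResourceLimitError(
                f"input exceeds memory_limit_bytes={limit_bytes}"
            )


small = read_bounded(io.BytesIO(b"x" * 10), limit_bytes=10, chunk_bytes=4)
print(len(small))

big_stream = io.BytesIO(b"x" * 100)
try:
    read_bounded(big_stream, limit_bytes=9, chunk_bytes=4)
except ResourceLimitError as exc:
    print(exc, "after", big_stream.tell(), "bytes")
```

Reading one byte past the limit distinguishes an input that is exactly at the limit from one that exceeds it, without buffering the rest of the file.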

PyArrow Filesystem Integration

When PyArrow is installed, every file reader and file-to-file converter can use pyarrow.fs URI strings. This covers read_csv, read_json, read_json_folder, read_jsonl, read_xml, read_xml_folder, read_parquet, to_csv, to_jsonl, and to_parquet. Supported URI input extensions include csv, json, jsonl, ndjson, xml, parquet, and pq. Supported URI converter output extensions include csv, jsonl, and parquet.

For normal local files, prefer a regular path:

events = ss.read_jsonl("/home/user/data/events.jsonl")

Regular local paths are the simplest and usually best choice for local disk access. They avoid PyArrow URI parsing and filesystem dispatch.

file:// is PyArrow's local-filesystem URI scheme. On Linux and WSL, absolute local paths use three slashes: file:///home/user/data/events.jsonl. That URI points to the same file as /home/user/data/events.jsonl, but it is opened through pyarrow.fs.LocalFileSystem. Use it when you specifically want to test the PyArrow filesystem route or when your code passes filesystem URIs consistently across local and cloud storage. Do not write file://home/user/...; that form has home in the URI host position instead of being an absolute local path.

| Local form | Example | Opens through | Best use |
| --- | --- | --- | --- |
| Regular local path | /home/user/data/events.jsonl | schema-sanitizer local path handling | Default for local disk files. |
| Local PyArrow URI | file:///home/user/data/events.jsonl | pyarrow.fs.LocalFileSystem | Testing or URI-only code paths. |
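The two-slash versus three-slash difference is easy to see by splitting both URIs. The sketch below uses Python's standard urllib.parse rather than PyArrow's own URI parser, so it only illustrates the general URI structure; describe_file_uri is a hypothetical helper.

```python
from urllib.parse import urlsplit


def describe_file_uri(uri):
    """Split a URI into the parts that matter for local paths."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


good = describe_file_uri("file:///home/user/data/events.jsonl")
bad = describe_file_uri("file://home/user/data/events.jsonl")

print(good)  # empty host, full absolute path
print(bad)   # "home" lands in the host position and the path loses it
```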

Common URI forms:

| Storage | Example URI |
| --- | --- |
| Local file through PyArrow | file:///home/user/data/events.jsonl |
| Amazon S3 | s3://raw-bucket/events/2026-06-12.jsonl |
| Amazon S3 folder | s3://raw-bucket/events/2026-06-12/ |
| Google Cloud Storage | gs://raw-bucket/assets/2026-06-12.parquet |
| Google Cloud Storage folder | gs://raw-bucket/assets/2026-06-12/ |
| Google Cloud Storage alias | gcs://raw-bucket/assets/2026-06-12.xml |
| Azure Data Lake Storage Gen2 | abfs://container@account.dfs.core.windows.net/events/2026-06-12.jsonl |
| Azure Data Lake Storage Gen2 folder | abfs://container@account.dfs.core.windows.net/events/2026-06-12/ |

Cloud URI support depends on the installed PyArrow build and the normal provider credentials/configuration available to PyArrow.

import schema_sanitizer as ss

events = ss.read_jsonl("s3://raw-bucket/events/2026-06-12.jsonl")
assets = ss.read_parquet("gs://raw-bucket/assets/2026-06-12.parquet")
daily_events = ss.read_json_folder("s3://raw-bucket/events/2026-06-12/")

ss.to_parquet(
    "s3://raw-bucket/events/2026-06-12.jsonl",
    "gs://clean-bucket/events/2026-06-12.parquet",
)

URI file inputs are opened as seekable PyArrow files. CSV, JSON, JSON Lines, NDJSON, and XML bytes are fed directly to the native chunk scanner. Parquet is decoded with pyarrow.parquet into batches, converted incrementally to JSON Lines bytes, and then fed to the same native sanitizer path. No single-file URI input is copied to a temporary file by schema-sanitizer.

Folder URI inputs are listed with non-recursive pyarrow.fs.FileSelector. read_json_folder filters direct .json child files and read_xml_folder filters direct .xml child files. The matching children are sorted by filename, then compacted one document at a time into a local temporary stream before the normal sanitizer pipeline reads that stream.
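For a local folder, the same selection rule (direct children only, filtered by extension, sorted by filename) can be sketched with pathlib. This is an illustration and not the library's code path, which goes through pyarrow.fs for URIs; list_direct_children is a hypothetical name, and matching the suffix case-sensitively is an assumption.

```python
import tempfile
from pathlib import Path


def list_direct_children(folder, suffix):
    """Return direct child files with the given suffix, sorted by filename."""
    children = [
        entry
        for entry in Path(folder).iterdir()  # iterdir() does not recurse
        if entry.is_file() and entry.suffix == suffix
    ]
    return sorted(children, key=lambda entry: entry.name)


with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    for name in ("b.json", "a.json", "notes.txt"):
        (root / name).write_text("{}", encoding="utf-8")
    (root / "nested").mkdir()
    (root / "nested" / "c.json").write_text("{}", encoding="utf-8")
    names = [entry.name for entry in list_direct_children(root, ".json")]

print(names)
```

The file under nested/ is ignored because the listing never descends into subfolders.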

URI outputs are opened with pyarrow.fs.open_output_stream. CSV and Parquet writers stream Arrow batches to that output stream, and JSON Lines writes UTF-8 bytes incrementally. The output URI is not staged through a local temporary file.

Supported Inputs

Supported inputs are intentionally file-oriented:

  • Normal local file paths for read_csv, read_json, read_jsonl, read_xml, read_parquet, to_csv, to_jsonl, and to_parquet.
  • PyArrow filesystem file URI strings for the same single-file readers and converters when PyArrow is installed and can open the URI.
  • Normal local folders for read_json_folder and read_xml_folder.
  • PyArrow filesystem folder URI strings for read_json_folder and read_xml_folder; folder exploration is non-recursive.
  • Already-resident list[dict] rows through read_python.

Unsupported Inputs

Unsupported inputs include raw JSON or XML strings, bytes payloads, opened files, io.BytesIO, io.StringIO, custom reader objects, URLs that PyArrow cannot open as files, and recursive folder scans. Write those inputs to a local file first, or use read_python for in-memory list[dict] rows.
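One way to bridge an in-memory payload to the file-oriented readers is to spill it to a temporary JSON Lines file first. The sketch below uses only the standard library; spill_rows_to_jsonl is a hypothetical helper, and it assumes the payload is a JSON array of objects.

```python
import json
import os
import tempfile


def spill_rows_to_jsonl(payload_text):
    """Write a raw JSON array payload to a temp JSON Lines file; return its path."""
    rows = json.loads(payload_text)
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")  # one JSON object per line
    return path


path = spill_rows_to_jsonl('[{"id": 1}, {"id": 2}]')
with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
os.remove(path)  # the caller owns cleanup of the temp file

print(lines)
```

The resulting path is a normal local file that read_jsonl can open. When the payload is small, decoding it with json.loads and passing the resulting list[dict] to read_python avoids the temporary file altogether.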

Examples

The examples/ directory contains tutorial notebooks and one cloud pipeline CLI example:

  • 01_ingestion_and_core_api.ipynb
  • 02_options_and_stats.ipynb
  • 03_adapters_and_converters.ipynb
  • 04_streaming_large_csv_to_parquet.ipynb
  • 05_full_options_catalog_sweep.ipynb
  • 06_xml_reading_and_memory.ipynb
  • 07_gcs_jsonl_to_bigquery_parquet.py: GCS JSONL to BigQuery-compatible Parquet using an external table schema fetched through ADBC as base_schema, then creating or replacing the Hive-partitioned external table

Platform Notes

The Linux wheels published on PyPI target glibc-based environments (manylinux_2_28). Alpine Linux uses musl, so Alpine users should use a glibc-based Python environment or build from source.

Development

Install the project for local development:

pip install -e ".[dev]"

Run the tests:

pytest

Build the native core directly with CMake:

cmake -S . -B build/dev -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build/dev

License

schema-sanitizer is licensed under the Apache License 2.0. See LICENSE.
