Spec-driven data sanitization for CSV, JSON, JSONL, XML, Parquet, and Python objects.
schema-sanitizer
Version 0.1.1: this project is still in a testing phase. Exercise the core behavior heavily against your own data before treating it as a stable production dependency.
schema-sanitizer turns extremely messy semistructured data into stable,
consistent tables. It is built for CSV, JSON, JSON Lines, XML, Parquet, and
Python rows whose real-world values do not agree on one neat schema: fields
appear late, arrays and objects change shape, timestamps arrive in several
formats, scalars collide with nested values, and malformed records still need a
place to go.
The library's main purpose is to make ingestion predictable before data reaches analytics engines, warehouses, or incremental pipelines. It scans source data, infers a reconciled Arrow schema, converts compatible values into that schema, and isolates rows that cannot be represented cleanly. The result is a table that downstream tools can consume without rediscovering schema drift on every run.
The hard parts are handled explicitly:
- Turning messy semistructured data into tables: mixed scalar, list, struct, null, date/time, and string values are reconciled into stable columns.
- Schema reconciliation for incremental pipelines: base_schema lets later batches align to a previous PyArrow schema. Additive mode keeps known field types while accepting newly observed fields, and strict mode is available when the schema must not drift.
- Memory safety: readers and converters use bounded batches, streaming writers, spill-to-disk paths where needed, depth limits, row-size budgets, and quarantine output so large or malformed inputs do not require loading the whole cleaned dataset into memory.
- Max depth enforcement: Arrow and Parquet depth budgets can cap deeply nested records before they exceed downstream limits such as warehouse nesting constraints.
Every public reader and converter returns a Result object with clean data,
bad rows, and stats.
It has two public workflows:
- In-Memory Analytics: read_* functions return a Result whose clean_data is PyArrow, pandas, Polars, or DuckDB data.
- File-To-File Converters: to_* functions stream sanitized files to CSV, JSON Lines, or Parquet and return a Result whose clean_data is None.
import schema_sanitizer as ss
events = ss.read_jsonl("raw/events.jsonl")
customers = ss.read_csv("raw/customers.csv", output_format="pandas")
table = events.clean_data
df = customers.clean_data
ss.to_parquet("raw/events.jsonl", "clean/events.parquet")
Index
- Install
- In-Memory Analytics
- File-To-File Converters
- Result Object
- Error Handling
- Schema Control
- Custom Tokens and Date/Time Patterns
- In-Memory Analytics Options
- File-To-File Converter Options
- Schema Inference Heuristics
- Base Schema Enforcement
- Max Depth Enforcement
- Quarantine Rows Pipeline
- Memory Safety Measures
- PyArrow Filesystem Integration
- Supported Inputs
- Unsupported Inputs
- Examples
- Platform Notes
- Development
- License
Install
schema-sanitizer supports Python >=3.11.
For Arrow reads and file-to-file converters:
pip install 'schema-sanitizer[pyarrow]'
Install adapter extras for the in-memory analytics tools you use:
pip install 'schema-sanitizer[pyarrow,pandas]'
pip install 'schema-sanitizer[pyarrow,polars]'
pip install 'schema-sanitizer[pyarrow,duckdb]'
pip install 'schema-sanitizer[all]'
Import with an underscore:
import schema_sanitizer as ss
In-Memory Analytics
Use read_* when you want clean data back in Python with stats.
| Function | Input | Typical use |
|---|---|---|
| read_csv(path, ...) | Local or PyArrow FS .csv file | Inspect or analyze CSV data. |
| read_json(path, ...) | Local or PyArrow FS .json file | Read JSON files into a table. |
| read_json_folder(path, ...) | Local or PyArrow FS folder of .json files | Read direct JSON file children as JSONL rows. |
| read_jsonl(path, ...) | Local or PyArrow FS .jsonl / .ndjson file | Read JSON Lines or NDJSON event and log data. |
| read_xml(path, ...) | Local or PyArrow FS .xml file | Read XML documents through the native sanitizer pipeline. |
| read_xml_folder(path, ...) | Local or PyArrow FS folder of .xml files | Read direct XML file children as XML document rows. |
| read_parquet(path, ...) | Local or PyArrow FS .parquet / .pq file | Read Parquet through the same cleaning pipeline. |
| read_python(rows, ...) | list[dict] | Clean rows already in memory. |
Readers always return a Result. By default, result.clean_data is a PyArrow table.
result = ss.read_jsonl("data/events.jsonl")
print(result.clean_data.schema)
print(result.clean_data.num_rows)
print(result.stats)
Choose another in-memory analytics target with output_format.
pandas_result = ss.read_csv("data/customers.csv", output_format="pandas")
polars_result = ss.read_csv("data/customers.csv", output_format="polars")
duckdb_result = ss.read_csv("data/customers.csv", output_format="duckdb")
pandas_df = pandas_result.clean_data
polars_df = polars_result.clean_data
duckdb_rel = duckdb_result.clean_data
Accepted output_format values are pyarrow, pandas, polars, and duckdb.
Use read_python for rows that are already in memory.
rows = [
{"id": 1, "active": "yes", "score": "10.5"},
{"id": 2, "active": "no", "score": 8},
]
result = ss.read_python(
rows,
true_tokens=("yes",),
false_tokens=("no",),
)
table = result.clean_data
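As a rough mental model of what true_tokens and false_tokens do to the rows above, the sketch below coerces string scalars in plain Python. It is not the library's implementation: coerce_scalar is a hypothetical helper, and exact-match token comparison and the order of the checks are assumptions.

```python
def coerce_scalar(value, true_tokens=(), false_tokens=()):
    """Hypothetical helper: coerce one string scalar; non-strings pass through."""
    if not isinstance(value, str):
        return value
    # Assumption: tokens are compared by exact string match.
    if value in true_tokens:
        return True
    if value in false_tokens:
        return False
    # Loosely mirrors parse_integers / parse_floats: try int, then float.
    for parser in (int, float):
        try:
            return parser(value)
        except ValueError:
            continue
    return value  # anything else stays a string


rows = [
    {"id": 1, "active": "yes", "score": "10.5"},
    {"id": 2, "active": "no", "score": 8},
]
cleaned = [
    {key: coerce_scalar(val, ("yes",), ("no",)) for key, val in row.items()}
    for row in rows
]
print(cleaned)
```

In this sketch the score column ends up holding a float and an integer; reconciling those into one column type is the job of schema inference, which the sketch does not model.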
File-To-File Converters
Use to_* when you want a sanitized output file and do not need clean data in
memory. These functions stream sanitized output and return a Result with
clean_data set to None, plus bad rows and stats.
| Function | Output | Typical use |
|---|---|---|
| to_csv(input_path, output_path, ...) | CSV | Produce a flat file for spreadsheets or downstream text tools. |
| to_jsonl(input_path, output_path, ...) | JSON Lines | Produce one cleaned JSON object per line. |
| to_parquet(input_path, output_path, ...) | Parquet | Produce a typed columnar file for analytics systems. |
result = ss.to_parquet("raw/orders.csv", "clean/orders.parquet")
assert result.clean_data is None
print(result.stats)
ss.to_csv("raw/events.jsonl", "clean/events.csv")
ss.to_jsonl("raw/orders.parquet", "clean/orders.jsonl")
Converters infer the input format from the input file extension. If the input
path has no useful extension, pass input_format.
ss.to_parquet("raw/events", "clean/events.parquet", input_format="jsonl")
Accepted input_format values are auto, csv, json, jsonl, ndjson,
xml, and parquet.
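The extension lookup can be pictured with the sketch below. guess_input_format and its extension map are hypothetical, assembled from the extensions named in this README; case-insensitive matching and the error raised for unknown extensions are assumptions, not documented behavior.

```python
from pathlib import PurePosixPath

# Assumed extension map, assembled from the extensions named in this README.
_EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "ndjson",
    ".xml": "xml",
    ".parquet": "parquet",
    ".pq": "parquet",
}


def guess_input_format(path, input_format="auto"):
    """Hypothetical helper: resolve the input format for a converter call."""
    if input_format != "auto":
        return input_format  # an explicit choice always wins
    suffix = PurePosixPath(str(path)).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"cannot infer format from {path!r}; pass input_format explicitly"
        ) from None


print(guess_input_format("raw/events.jsonl"))
print(guess_input_format("raw/events", input_format="jsonl"))
```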
Result Object
All public read_* and to_* functions return schema_sanitizer.Result.
For readers, result.clean_data contains the requested clean in-memory output.
For converters, clean data is written to output_path, so result.clean_data
is always None.
result = ss.read_csv("data/customers.csv", output_format="pandas")
df = result.clean_data
stats = result.stats
bad_rows = result.bad_rows
| Property or method | What it returns |
|---|---|
| clean_data | Clean data in the requested reader output_format: PyArrow table, pandas DataFrame, Polars DataFrame, or DuckDB relation. Always None for to_* converters. |
| stats | Dictionary of counters such as rows inferred, rows materialized, batches, skipped rows, quarantined rows, warnings, and errors. |
| bad_rows | Quarantined rows as a pyarrow.Table. The table may be empty when no rows were quarantined. |
Result Stats
result.stats is a plain dict. All values are integers and default to
0 when the runtime did not report that counter.
| Property | What it means |
|---|---|
| inferred_rows | Rows scanned while inferring the input schema. |
| inferred_bytes | Approximate input bytes scanned while inferring the schema. |
| arrow_schema_depth | Maximum Arrow container depth found during inference. Struct and list containers count; scalar leaves and top-level field wrappers do not. |
| parquet_schema_depth | Maximum Parquet/BigQuery RECORD depth found during inference. Struct containers count; list containers and scalar leaves do not. |
| materialized_rows | Clean rows materialized for read_* results or written by to_* converters. |
| batches | Number of output batches materialized or written. |
| flattened_fields | Nested fields flattened by the selected flattening options. |
| scalar_wrappings | Scalar values wrapped to fit list or struct-like output shapes. |
| skipped_rows | Rows dropped by on_error="skip_row". |
| quarantined_rows | Rows dropped from clean output and stored in result.bad_rows. |
| warnings | Non-fatal warnings reported by the runtime. |
| errors | Fatal errors reported by the runtime. |
| soft_errors | Recoverable row or value errors handled by policy. |
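The two depth counters differ only in whether list containers count. The sketch below illustrates that rule on a toy schema encoding (dict for a struct, list for a list, a string for a scalar type). The encoding and function names are hypothetical, and the library's own counting may differ in edge cases.

```python
def arrow_depth(node):
    """Struct and list containers each add one level; scalars add none."""
    if isinstance(node, dict):
        return 1 + max((arrow_depth(child) for child in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((arrow_depth(child) for child in node), default=0)
    return 0


def parquet_depth(node):
    """Only struct containers add a level; lists and scalars do not."""
    if isinstance(node, dict):
        return 1 + max((parquet_depth(child) for child in node.values()), default=0)
    if isinstance(node, list):
        return max((parquet_depth(child) for child in node), default=0)
    return 0


def schema_depths(fields):
    """The top-level field mapping is a wrapper and is not counted."""
    return (
        max((arrow_depth(t) for t in fields.values()), default=0),
        max((parquet_depth(t) for t in fields.values()), default=0),
    )


fields = {
    "id": "int64",
    "order": {"items": [{"sku": "string"}]},  # struct<items: list<struct<sku>>>
}
print(schema_depths(fields))  # (3, 2)
```

Here the order field contributes struct, list, struct to the Arrow depth, while only the two structs count toward the Parquet depth.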
Error Handling
By default, rows that fail materialization are kept as null rows. Choose a
policy with on_error.
| Policy | Behavior |
|---|---|
| stop | Raise an error as soon as a row cannot be processed. |
| skip_row | Drop bad rows from the output. |
| emit_null_row | Keep row count stable by emitting a null row. |
| quarantine | Drop bad rows from the output and keep them in result.bad_rows. |
result = ss.read_jsonl(
"data/events.jsonl",
on_error="quarantine",
)
clean = result.clean_data
print(result.stats)
bad_rows = result.bad_rows
Converters return the same Result shape as readers. Because the clean data is
written to output_path, converter results always have clean_data set to None.
result = ss.to_parquet(
"raw/events.jsonl",
"clean/events.parquet",
on_error="quarantine",
)
print(result.stats)
bad_rows = result.bad_rows
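To make the four policies concrete, here is a minimal pure-Python sketch of how they differ for a row converter that can fail. apply_policy and to_amount are hypothetical names, None stands in for a null row, and none of this reflects how the native pipeline is actually written.

```python
POLICIES = ("stop", "skip_row", "emit_null_row", "quarantine")


def apply_policy(rows, convert, on_error="emit_null_row"):
    """Hypothetical sketch: run convert on each row under an on_error policy."""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    clean, bad_rows = [], []
    stats = {"skipped_rows": 0, "quarantined_rows": 0}
    for row in rows:
        try:
            clean.append(convert(row))
        except ValueError:
            if on_error == "stop":
                raise  # fail fast on the first bad row
            elif on_error == "skip_row":
                stats["skipped_rows"] += 1  # row is dropped silently
            elif on_error == "quarantine":
                bad_rows.append(row)  # row is dropped but kept for inspection
                stats["quarantined_rows"] += 1
            else:
                clean.append(None)  # emit_null_row keeps the row count stable
    return clean, bad_rows, stats


def to_amount(row):
    """Toy converter: fails with ValueError on non-numeric amounts."""
    return {"amount": float(row["amount"])}


raw_rows = [{"amount": "1.5"}, {"amount": "n/a"}, {"amount": "2"}]
print(apply_policy(raw_rows, to_amount, on_error="quarantine"))
print(apply_policy(raw_rows, to_amount, on_error="emit_null_row"))
```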
Schema Control
Pass base_schema when the output must match or evolve from an expected
contract.
import pyarrow as pa
import schema_sanitizer as ss
schema = pa.schema(
[
pa.field("id", pa.int64(), nullable=False),
pa.field("email", pa.string()),
]
)
result = ss.read_jsonl(
"data/users.jsonl",
base_schema=schema,
schema_mode="strict",
on_error="quarantine",
)
table = result.clean_data
| Mode | Behavior |
|---|---|
| strict | Output exactly base_schema. Requires base_schema; inference is skipped. |
| additive | Keep base_schema field types and add newly observed fields. |
column_order defaults to base_schema_first. Use column_order="sorted" for
lexicographic field ordering.
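The difference between the two modes and the two orderings can be modeled with plain dictionaries. In the sketch below, schemas are dicts mapping field names to type names, and reconcile is a hypothetical function. Treating strict mode as ignoring column_order, and appending new fields in observed order, are assumptions.

```python
def reconcile(base, observed, schema_mode="additive", column_order="base_schema_first"):
    """Hypothetical model of base_schema reconciliation on name -> type dicts."""
    if schema_mode == "strict":
        return dict(base)  # output is exactly the base contract
    if schema_mode != "additive":
        raise ValueError(f"unknown schema_mode: {schema_mode!r}")
    merged = dict(base)  # known fields keep their base types
    for name, dtype in observed.items():
        merged.setdefault(name, dtype)  # only newly observed fields are added
    if column_order == "sorted":
        return dict(sorted(merged.items()))  # lexicographic field order
    return merged  # base fields first, then new fields


base = {"id": "int64", "email": "string"}
observed = {"zip": "string", "id": "string", "age": "int64"}

print(reconcile(base, observed))
print(reconcile(base, observed, column_order="sorted"))
print(reconcile(base, observed, schema_mode="strict"))
```

Note that id stays int64 in additive mode even though a later batch observed it as a string: the base type wins and only zip and age are added.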
Custom Tokens and Date/Time Patterns
Use true_tokens and false_tokens when boolean values use domain-specific
strings. Use temporal regex options when dates or times do not match the built-in
parsers.
result = ss.read_csv(
"data/events.csv",
true_tokens=("yes", "enabled", "1"),
false_tokens=("no", "disabled", "0"),
timestamp_patterns=(
r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$",
),
date_patterns=(
r"^(\d{4})\.(\d{2})\.(\d{2})$",
),
time_patterns=(
r"^(\d{2})h(\d{2})m(\d{2})s$",
),
)
table = result.clean_data
For timestamp_patterns, capture groups 1-6 are year, month, day, hour,
minute, and second. Optional group 7 may contain fractions, and group 8 may
contain a timezone. For date_patterns, groups 1-3 are year, month, and day.
For time_patterns, groups 1-3 are hour, minute, and second.
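The group numbering can be checked with Python's own re module before passing a pattern to the library. The sketch below reuses two patterns from the example above. parse_timestamp and parse_time are hypothetical helpers, optional groups 7 and 8 are not handled, and the regex dialect the library itself accepts is not stated here, so treat this only as a way to sanity-check group order.

```python
import re
from datetime import datetime, time

TIMESTAMP_PATTERN = r"^(\d{4})/(\d{2})/(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$"
TIME_PATTERN = r"^(\d{2})h(\d{2})m(\d{2})s$"


def parse_timestamp(text, pattern=TIMESTAMP_PATTERN):
    """Groups 1-6 map to year, month, day, hour, minute, second."""
    match = re.match(pattern, text)
    if match is None:
        return None  # pattern does not apply to this value
    return datetime(*(int(match.group(i)) for i in range(1, 7)))


def parse_time(text, pattern=TIME_PATTERN):
    """Groups 1-3 map to hour, minute, second."""
    match = re.match(pattern, text)
    if match is None:
        return None
    return time(*(int(match.group(i)) for i in range(1, 4)))


print(parse_timestamp("2024/03/05 14:07:09"))
print(parse_time("14h07m09s"))
```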
In-Memory Analytics Options
Each reader accepts the parameters listed in its section.
read_csv(path, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| path | required | str or path-like object | Local CSV file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. Groups 1-6 map to year, month, day, hour, minute, second; group 7 may hold fractions and group 8 timezone. |
| date_patterns | () | sequence of regex strings | Extra date parsers. Groups 1-3 map to year, month, day. |
| time_patterns | () | sequence of regex strings | Extra time parsers. Groups 1-3 map to hour, minute, second. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether the first CSV row is a header. |
| csv_delimiter | , | single-character string | CSV delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV bytes. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming CSV reads. |
read_json(path, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| path | required | str or path-like object | Local JSON file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode JSON bytes. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming JSON reads. |
read_json_folder(path, ...)
read_json_folder reads the direct .json children of a local folder or
PyArrow filesystem folder URI in deterministic filename order. Folder
exploration is not recursive. Each source file must contain one JSON document;
the reader compacts those documents into a temporary JSON Lines stream and then
runs the same sanitizer path used by read_json.
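The compaction step described above can be approximated in a few lines of standard-library Python. compact_json_folder is a hypothetical helper that only mirrors the documented behavior (direct children, filename order, one document per file); the real reader streams through a temporary file and the native sanitizer, which this sketch does not attempt.

```python
import json
import tempfile
from pathlib import Path


def compact_json_folder(folder, encoding="utf-8"):
    """Yield one compact JSON line per direct .json child, in filename order."""
    for path in sorted(Path(folder).glob("*.json")):  # glob is not recursive
        if not path.is_file():
            continue
        with path.open(encoding=encoding) as handle:
            document = json.load(handle)  # each file must hold one JSON document
        yield json.dumps(document, separators=(",", ":"))


# Small demonstration on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "b.json").write_text('{"id": 2}', encoding="utf-8")
    (root / "a.json").write_text('{"id": 1,\n "tags": ["x"]}', encoding="utf-8")
    (root / "nested").mkdir()
    (root / "nested" / "c.json").write_text('{"id": 3}', encoding="utf-8")
    lines = list(compact_json_folder(root))

print(lines)
```

The file under nested/ is ignored because only direct children are read, and a.json comes before b.json regardless of creation order.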
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| path | required | str or path-like object | Local folder or PyArrow FS folder URI containing .json files. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode each source JSON file. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-document and per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for the compacted JSON Lines stream. |
read_jsonl(path, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| path | required | str or path-like object | Local JSON Lines or NDJSON file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode JSON Lines or NDJSON bytes. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming JSON Lines or NDJSON reads. |
read_xml(path, ...)
read_xml parses a local XML document in the native C++ frontend and sends the
resulting rows through the same schema inference, cleaning, quarantine, and
output adapter pipeline as the JSON and CSV readers.
By default, the root element is treated as one row, like a single JSON object.
Pass xml_row_tag="row" when a file contains repeated direct child elements
that should become separate rows; the XML scanner then streams each matching
row element. Attributes become fields prefixed with @, repeated child tags
become lists, and mixed element text is stored under #text.
result = ss.read_xml(
"raw/orders.xml",
xml_row_tag="order",
read_chunk_bytes=1024 * 1024,
batch_memory_limit_bytes=256 * 1024 * 1024,
)
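The mapping rules described above (attributes prefixed with @, repeated child tags as lists, mixed text under #text) can be previewed with the standard library. This is a simplified sketch using xml.etree.ElementTree, not the native C++ frontend: element_to_value and iter_rows are hypothetical names, the whole document is parsed in memory rather than streamed, tail text is ignored, and returning plain text for leaf elements is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_to_value(element):
    """Hypothetical mapping of one XML element to a Python value."""
    text = (element.text or "").strip()
    if not element.attrib and len(element) == 0:
        return text or None  # leaf element: plain scalar text
    row = {f"@{name}": value for name, value in element.attrib.items()}
    tag_counts = Counter(child.tag for child in element)
    for child in element:
        value = element_to_value(child)
        if tag_counts[child.tag] > 1:
            row.setdefault(child.tag, []).append(value)  # repeated tag -> list
        else:
            row[child.tag] = value
    if text:
        row["#text"] = text  # text mixed with attributes or children
    return row


def iter_rows(xml_text, xml_row_tag=None):
    """Yield one row for the root, or one per matching direct child."""
    root = ET.fromstring(xml_text)
    if xml_row_tag is None:
        yield element_to_value(root)
        return
    for child in root:  # direct children only
        if child.tag == xml_row_tag:
            yield element_to_value(child)


doc = """<orders>
  <order id="1"><item>pen</item><item>ink</item></order>
  <order id="2" note="rush">gift<item>card</item></order>
</orders>"""

rows = list(iter_rows(doc, xml_row_tag="order"))
print(rows)
```

The two rows disagree on the shape of item (a list in the first, a scalar in the second), which is the kind of drift the schema reconciliation step is meant to absorb.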
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| path | required | str or path-like object | Local XML file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode XML bytes when transcoding is needed. |
| xml_row_tag | None | XML element tag name or None | Direct child element tag to stream as separate rows. None treats the whole document as one row. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |
read_xml_folder(path, ...)
read_xml_folder reads the direct .xml children of a local folder or PyArrow
filesystem folder URI in deterministic filename order. Folder exploration is
not recursive. Each source file must contain one XML document, and all
documents must use the same root tag unless you pass that tag explicitly as
xml_row_tag. The reader wraps those documents in a temporary XML stream and
then runs the same sanitizer path used by read_xml.
result = ss.read_xml_folder(
"raw/order-events",
xml_row_tag="order",
batch_memory_limit_bytes=256 * 1024 * 1024,
)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| path | required | str or path-like object | Local folder or PyArrow FS folder URI containing .xml files. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode each source XML file. |
| xml_row_tag | None | XML element tag name or None | Expected XML document root tag. None infers it from the first file. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-document-row memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for the compacted XML stream. |
read_parquet(path, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| path | required | str or path-like object | Local Parquet file to read. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
read_python(rows, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| rows | required | list[dict] | In-memory rows to normalize. |
| output_format | pyarrow | pyarrow, pandas, polars, duckdb | Type stored in Result.clean_data. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort memory budget for the already-resident Python payload. |
File-To-File Converter Options
Converters accept local paths or PyArrow FS URI strings for both input and
output. They infer the input format from the input file extension unless you
pass input_format.
to_csv(input_path, output_path, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| input_path | required | str or path-like object | Local file or PyArrow FS URI to sanitize. |
| output_path | required | str or path-like object | Local or PyArrow FS URI CSV file to create. |
| input_format | auto | auto, csv, json, jsonl, ndjson, xml, parquet | Input format selector. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether CSV input has a header. |
| csv_delimiter | , | single-character string | CSV input delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| xml_row_tag | None | XML element tag name or None | Direct child XML element tag to stream as separate rows when reading XML input. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |
to_jsonl(input_path, output_path, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| input_path | required | str or path-like object | Local file or PyArrow FS URI to sanitize. |
| output_path | required | str or path-like object | Local or PyArrow FS URI JSON Lines file to create. |
| input_format | auto | auto, csv, json, jsonl, ndjson, xml, parquet | Input format selector. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether CSV input has a header. |
| csv_delimiter | , | single-character string | CSV input delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| xml_row_tag | None | XML element tag name or None | Direct child XML element tag to stream as separate rows when reading XML input. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |
to_parquet(input_path, output_path, ...)
| Parameter | Default | Accepted values | What it controls |
|---|---|---|---|
| input_path | required | str or path-like object | Local file or PyArrow FS URI to sanitize. |
| output_path | required | str or path-like object | Local or PyArrow FS URI Parquet file to create. |
| input_format | auto | auto, csv, json, jsonl, ndjson, xml, parquet | Input format selector. |
| base_schema | None | pyarrow.Schema or None | Optional base output contract. |
| schema_mode | additive | additive, strict | How inferred fields reconcile with base_schema. |
| column_order | base_schema_first | base_schema_first, sorted | Output field ordering. |
| parse_integers | True | bool | Parse integer-looking strings as integers. |
| parse_floats | True | bool | Parse float-looking strings as floats. |
| true_tokens | () | sequence of strings | String tokens interpreted as boolean true. |
| false_tokens | () | sequence of strings | String tokens interpreted as boolean false. |
| timestamp_patterns | () | sequence of regex strings | Extra timestamp parsers. |
| date_patterns | () | sequence of regex strings | Extra date parsers. |
| time_patterns | () | sequence of regex strings | Extra time parsers. |
| arrow_max_depth | 32 | integer >= 0 | Maximum Arrow container depth for object and array expansion. |
| parquet_max_depth | 15 | integer >= 0 | Maximum Parquet/BigQuery RECORD depth for object expansion. |
| scalar_object_key | default_key | string | Key used when a scalar must be wrapped as an object. |
| csv_has_header | True | bool | Whether CSV input has a header. |
| csv_delimiter | , | single-character string | CSV input delimiter. |
| input_text_encoding | utf-8 | text encoding name | Encoding used to decode CSV, JSON, JSON Lines, NDJSON, or XML input. |
| xml_row_tag | None | XML element tag name or None | Direct child XML element tag to stream as separate rows when reading XML input. |
| on_error | emit_null_row | stop, skip_row, emit_null_row, quarantine | Row-level error policy. |
| batch_memory_limit_bytes | None | positive integer bytes or None | Best-effort per-batch memory budget. |
| read_chunk_bytes | 1048576 | positive integer bytes | Chunk size for streaming text input reads. |
Schema Inference Heuristics
Schema inference scans the full source before materialization whenever inference
runs. It is not a sample-based inference step: in inferred mode and additive
base_schema mode, every source row is consumed during inference and counted in
Result.stats["inferred_rows"].
For each inferred row, the sanitizer applies two internal passes:
- The shape pass discovers structural paths: field names, objects, arrays, and fields that must be flattened by depth limits.
- The statistics pass collects scalar type evidence for the discovered shape: booleans, integers, floats, timestamps, dates, times, strings, nulls, and mixed-type conflicts.
When schema_mode="strict" is used with an explicit base_schema, the
sanitizer skips inference and uses the schema contract directly; in that fast
path, inferred_rows is 0. Strict mode only works with base_schema. Passing
schema_mode="strict" without base_schema raises an exception.
Separating shape discovery from scalar statistics keeps list and struct
decisions stable across messy inputs. If one row has an object and another row
has a scalar at the same field, the structural shape wins and the scalar is
wrapped under scalar_object_key (default_key by default). If one row has a
list and another row has a scalar at the same field, the list shape wins and the
scalar is wrapped as a single list element.
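The wrapping rules above can be sketched in plain Python. This is only an illustration of the described behavior, not the library's implementation: `coerce_to_shape` and the shape labels `"struct"`, `"list"`, and `"scalar"` are hypothetical names, and cases the text does not describe (for example a list arriving where a struct won) are passed through unchanged.

```python
from typing import Any

SCALAR_OBJECT_KEY = "default_key"  # documented default of scalar_object_key


def coerce_to_shape(value: Any, shape: str, scalar_object_key: str = SCALAR_OBJECT_KEY) -> Any:
    """Fit one value into the structural shape chosen for its field.

    `shape` is a hypothetical label for this sketch: "struct", "list", or "scalar".
    """
    is_scalar = value is not None and not isinstance(value, (dict, list))
    if shape == "struct" and is_scalar:
        # Other rows had an object here, so the scalar is wrapped under the key.
        return {scalar_object_key: value}
    if shape == "list" and is_scalar:
        # Other rows had a list here, so the scalar becomes a single element.
        return [value]
    return value  # nulls and already-matching values pass through


rows = [
    {"owner": {"name": "ada"}, "tags": ["a", "b"]},
    {"owner": "bob", "tags": "c"},
]
shapes = {"owner": "struct", "tags": "list"}  # structural shape wins per field
aligned = [{k: coerce_to_shape(v, shapes[k]) for k, v in row.items()} for row in rows]
```

After alignment, both rows carry an object under `owner` and a list under `tags`, which is what lets the column keep one stable type.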
Scalar inference is conservative:
- Nulls do not choose a type by themselves.
- Boolean JSON values infer bool.
- Numeric JSON values infer int64 or float64.
- Strings can infer booleans, integers, floats, timestamps, dates, or times when the configured token and parser options match.
- Mixed scalar kinds fall back to string.
- Objects or arrays observed where a scalar is required are stringified.
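A minimal sketch of the conservative rules, assuming values have already been parsed from JSON into Python objects. The helper names `kind_of` and `infer_scalar_type` are hypothetical. String parsing through tokens and patterns is not modeled, and whether the library promotes an int64/float64 mix rather than falling back to string is not stated here, so the sketch treats it like any other mix.

```python
from typing import Any, Iterable, Optional


def kind_of(value: Any) -> Optional[str]:
    """Map one parsed JSON value to a type label; None means "no evidence"."""
    if value is None:
        return None  # nulls never choose a type
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    return "string"  # strings, and objects/arrays that would be stringified


def infer_scalar_type(values: Iterable[Any]) -> Optional[str]:
    """Return the single observed kind, "string" on a mix, or None for all nulls."""
    kinds = {kind for kind in map(kind_of, values) if kind is not None}
    if not kinds:
        return None
    if len(kinds) == 1:
        return next(iter(kinds))
    return "string"  # mixed scalar kinds fall back to string
```

For example, `infer_scalar_type([1, None, 2])` yields `"int64"`, while `infer_scalar_type([1, "n/a"])` yields `"string"`.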
List inference is stricter than top-level object inference. Lists remain typed
only when their element shape is conflict-free. Lists of scalars and lists of
structs are supported; nested lists or conflicts inside a list element fall back
to list<string> so each list column has one stable element type.
Base Schema Enforcement
base_schema is an output contract and only accepts a pyarrow.Schema. The
sanitizer converts it to the same internal logical schema representation used by
inference before planning materialization.
import pyarrow as pa
import schema_sanitizer as ss

user_schema = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("email", pa.string()),
    ]
)
result = ss.read_jsonl(
    "data/users.jsonl",
    base_schema=user_schema,
    schema_mode="strict",
)
schema_mode="strict" uses base_schema as the complete output schema. The
inference loop is skipped, so Result.stats["inferred_rows"] is 0. Strict
mode only works when base_schema is provided; otherwise the call raises an
exception before materialization. Extra source fields are rejected because they
are not present in the strict contract.
schema_mode="additive" requires inference. The full source is scanned, then
the inferred schema is reconciled with base_schema: fields already present in
base_schema keep their declared types and nullable flags, while newly observed
fields are added from inference. Row values that cannot be coerced into the
declared base field type are handled by on_error.
For fields present in base_schema, the base type wins even when source rows
contain conflicting values. A value such as "unknown" in a base int64 field,
an object in a base scalar field, or a scalar in a base struct/list field is a
materialization conflict. The row is stopped, skipped, quarantined, or replaced
with a null row according to on_error. For fields not present in
base_schema, conflicts are resolved by the normal inference heuristics before
the field is added: mixed scalar kinds fall back to string, object/scalar and
list/scalar conflicts use the wrapping rules, and conflicting list element
shapes fall back to list<string>.
column_order controls only output field order after reconciliation.
column_order="base_schema_first" preserves base fields first, then appends new
fields. column_order="sorted" emits fields in lexicographic order.
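The reconciliation and ordering rules can be illustrated with plain dictionaries standing in for schemas. This is a sketch under stated assumptions: `reconcile` is a hypothetical helper, type labels are plain strings rather than PyArrow types, new fields are appended in the order inference saw them, and `ValueError` stands in for whatever exception the library actually raises.

```python
from typing import Dict, List, Optional, Tuple

Schema = Dict[str, str]  # field name -> type label, in insertion order


def reconcile(
    base: Optional[Schema],
    inferred: Schema,
    schema_mode: str = "additive",
    column_order: str = "base_schema_first",
) -> List[Tuple[str, str]]:
    """Merge an inferred schema into a base contract and order the fields."""
    if schema_mode == "strict":
        if base is None:
            raise ValueError("schema_mode='strict' requires base_schema")
        merged = dict(base)  # the contract is the complete output schema
    elif schema_mode == "additive":
        merged = dict(base or {})
        for name, dtype in inferred.items():
            merged.setdefault(name, dtype)  # base type wins; new fields are added
    else:
        raise ValueError(f"unknown schema_mode: {schema_mode!r}")
    fields = list(merged.items())
    if column_order == "sorted":
        fields.sort(key=lambda field: field[0])  # lexicographic by field name
    return fields


base = {"id": "int64", "email": "string"}
inferred = {"email": "int64", "age": "int64", "id": "string"}
additive_fields = reconcile(base, inferred)
```

Here `additive_fields` keeps `id` and `email` with their base types and appends `age`; the conflicting inferred types for `id` and `email` are ignored at the schema level and only matter later, when individual rows fail coercion and `on_error` applies.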
Max Depth Enforcement
Depth enforcement uses two independent limits because Arrow and Parquet/BigQuery count nested data differently:
- arrow_max_depth defaults to 32. It counts Arrow container depth: struct and list containers count, while scalar leaves and top-level field wrappers do not.
- parquet_max_depth defaults to 15. It counts Parquet/BigQuery RECORD depth: struct containers count, while list containers, scalar leaves, and top-level field wrappers do not.
The sanitizer flattens a named field to <name>_flattened when keeping that
field's full nested value would exceed either limit. The flattened value is
stored as a string.
Depth examples:
| Shape | arrow_schema_depth | parquet_schema_depth |
|---|---|---|
| id: int64 | 0 | 0 |
| user: struct<id: int64> | 1 | 1 |
| tags: list<string> | 1 | 0 |
| authors: list<struct<name: string>> | 2 | 1 |
| asset: struct<authors: list<struct<name: string>>> | 3 | 2 |
Use arrow_max_depth as a defensive complexity limit for Arrow/Parquet
container nesting. Use parquet_max_depth=15 when the output Parquet will be
read by BigQuery external tables, where the practical limit is nested RECORD
depth rather than physical list wrapper depth.
The reported Result.stats["arrow_schema_depth"] and
Result.stats["parquet_schema_depth"] use the same counting rules as the
enforcement options.
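The two counting rules can be reproduced with a short recursive function. This is a sketch, not the library's code: `schema_depths` is a hypothetical helper, and types are modeled with plain Python values (a `dict` for a struct, a one-element `list` for a list, a string for a scalar leaf).

```python
from typing import Any, Tuple


def schema_depths(dtype: Any) -> Tuple[int, int]:
    """Return (arrow_depth, parquet_depth) for one field type."""
    if isinstance(dtype, dict):
        # struct: counts toward both the Arrow and the Parquet/RECORD depth
        children = [schema_depths(child) for child in dtype.values()]
        arrow = max((a for a, _ in children), default=0)
        parquet = max((p for _, p in children), default=0)
        return arrow + 1, parquet + 1
    if isinstance(dtype, list):
        # list: counts toward Arrow container depth only
        arrow, parquet = schema_depths(dtype[0])
        return arrow + 1, parquet
    return 0, 0  # scalar leaf


examples = {
    "id": "int64",
    "user": {"id": "int64"},
    "tags": ["string"],
    "authors": [{"name": "string"}],
    "asset": {"authors": [{"name": "string"}]},
}
depths = {name: schema_depths(dtype) for name, dtype in examples.items()}
```

The resulting `depths` match the rows of the table above, for instance `(3, 2)` for `asset`.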
Quarantine Rows Pipeline
Use on_error="quarantine" when you want clean output to continue while keeping
failed rows for inspection or replay. Rows that fail materialization are dropped
from clean_data or the converter output file and appended to
Result.bad_rows.
bad_rows is a PyArrow table with diagnostic metadata:
| Column | What it contains |
|---|---|
| row_index | Zero-based source row index. |
| source_offset | Byte offset or source-relative offset when available. |
| code / code_str | Machine-readable diagnostic code. |
| path_id | Internal field path id associated with the failure. |
| detail | Human-readable error detail. |
| context_snippet | Short preview of the offending source row. |
| raw_row | Full raw source row text when available. |
For in-memory reads, inspect result.bad_rows directly:
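# event_schema is a pyarrow.Schema built beforehand, like user_schema above.
result = ss.read_jsonl(
    "data/events.jsonl",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)
clean = result.clean_data  # rows that materialized successfully
bad_rows = result.bad_rows  # quarantined rows with diagnostic columns
print(result.stats["quarantined_rows"])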
For file-to-file converters, the clean output is written to output_path and
the same Result.bad_rows table carries quarantined rows:
result = ss.to_parquet(
    "raw/events.jsonl",
    "clean/events.parquet",
    base_schema=event_schema,
    schema_mode="strict",
    on_error="quarantine",
)
bad_rows = result.bad_rows
Quarantine is row-level. If one field in a row cannot be coerced into the output
schema, the whole row is excluded from clean output and recorded once in
bad_rows. In contrast, on_error="skip_row" drops the row without retaining
it, and on_error="emit_null_row" keeps row count stable by writing a null row
instead of recording it in bad_rows.
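The difference between the four policies can be modeled in a few lines. The sketch below is an illustration only: `apply_on_error`, `to_event`, and the reduced set of `bad_rows` columns are hypothetical, and plain Python lists stand in for the Arrow tables the library returns.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, Any]
POLICIES = ("stop", "skip_row", "emit_null_row", "quarantine")


def apply_on_error(
    rows: Iterable[Row],
    convert: Callable[[Row], Row],
    columns: Sequence[str],
    on_error: str = "emit_null_row",
) -> Tuple[List[Row], List[Row]]:
    """Return (clean_rows, bad_rows) under one row-level error policy."""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    clean: List[Row] = []
    bad: List[Row] = []
    for index, row in enumerate(rows):
        try:
            clean.append(convert(row))
        except (TypeError, ValueError) as exc:
            if on_error == "stop":
                raise  # abort on the first bad row
            if on_error == "emit_null_row":
                clean.append({name: None for name in columns})  # row count stays stable
            elif on_error == "quarantine":
                bad.append({"row_index": index, "detail": str(exc), "raw_row": json.dumps(row)})
            # "skip_row": the row is dropped and nothing is retained
    return clean, bad


def to_event(row: Row) -> Row:
    return {"id": int(row["id"])}  # int("unknown") raises ValueError


rows = [{"id": "1"}, {"id": "unknown"}, {"id": "3"}]
clean, bad = apply_on_error(rows, to_event, columns=("id",), on_error="quarantine")
```

With `on_error="quarantine"` the second row is absent from `clean` and recorded once in `bad`; switching to `"emit_null_row"` keeps three rows in `clean` and leaves `bad` empty.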
Memory Safety Measures
The sanitizer is designed to process large local files and PyArrow filesystem URI inputs without requiring the whole clean dataset to live in Python memory.
- File-to-file converters stream sanitized batches directly to the output file. Result.clean_data is None for converters, so the clean table is not materialized in memory.
- PyArrow filesystem file inputs are opened as seekable streams. CSV, JSON, JSON Lines, NDJSON, and XML URI inputs are not copied to a temporary file; their bytes are read by the same chunked native scanner used for local files.
- PyArrow filesystem outputs are opened with pyarrow.fs.open_output_stream. CSV, JSON Lines, and Parquet converters write incrementally to that stream instead of staging the full output in a local temporary file.
- CSV, JSON, JSON Lines, and NDJSON readers use read_chunk_bytes to bound input chunks while scanning.
- XML without xml_row_tag is parsed into a native document tree before row emission, so batch_memory_limit_bytes limits the accumulated document size before the tree is built.
- XML with xml_row_tag streams matching direct child elements. The scanner reads bounded chunks, discards completed row slices, and raises SchemaSanitizerResourceError if the active XML buffer exceeds batch_memory_limit_bytes.
- Local and PyArrow filesystem folder readers (read_json_folder and read_xml_folder) list direct child files only, then compact one source document at a time into a local temporary JSON Lines or XML stream. The temp file is the bridge that lets many single-document files reuse the normal streaming sanitizer pipeline without building one large Python object.
- Folder temp streams contain only the compacted input representation, not the final clean dataset. With batch_memory_limit_bytes, each source document is checked before it is decoded and added to that stream. If a PyArrow filesystem does not report a child file size, the child is read in bounded chunks and the reader stops at batch_memory_limit_bytes + 1 bytes before raising SchemaSanitizerResourceError.
- Folder temp files are deleted when the read finishes, and partially written temp files are deleted if compaction raises an exception. If the Python process is killed externally, for example with SIGKILL, the operating system may not give schema-sanitizer a chance to run that cleanup.
- Parquet inputs are decoded by PyArrow into record batches and exposed to the native JSON frontend through a seekable JSON Lines byte reader. Rows are produced incrementally; the Parquet-to-JSONL adapter does not stage a full conversion file.
- XML DTD and entity declarations are rejected. The XML frontend does not load external entities or expand document-defined entities.
- batch_memory_limit_bytes maps to the native per-batch memory_limit_bytes budget. It reduces inference and output batch sizes instead of changing the final schema.
- For already-resident Python inputs, batch_memory_limit_bytes is enforced as a preflight resource guard. If the Python payload is already larger than the configured limit, the call raises SchemaSanitizerResourceError before native ingestion starts.
- arrow_max_depth and parquet_max_depth cap nested expansion. Values beyond those limits are flattened to strings, preventing unbounded container nesting from creating very wide or deeply nested Arrow/Parquet schemas.
- Native parsing and materialization use owned streams, arenas, and Arrow C Data resources that are closed when the Result, stream, or sink is closed or dropped. Table-producing readers force stream materialization and close native resources before returning.
Configured resource-limit failures raise SchemaSanitizerResourceError and
include limit_name="memory_limit_bytes" in their detail payload when
available. True allocator failures are reported separately as
SchemaSanitizerOutOfMemoryError.
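The "stop at batch_memory_limit_bytes + 1 bytes" rule for sources of unknown size is a general technique: reading one byte past the limit is enough to prove the source is too large without buffering the rest. A minimal sketch, where `read_bounded` is a hypothetical helper and `ResourceLimitError` is a stand-in for SchemaSanitizerResourceError:

```python
import io
from typing import BinaryIO


class ResourceLimitError(Exception):
    """Stand-in for SchemaSanitizerResourceError in this sketch."""


def read_bounded(stream: BinaryIO, limit_bytes: int, chunk_bytes: int = 1 << 20) -> bytes:
    """Read a stream of unknown size, buffering at most limit_bytes + 1 bytes."""
    buffer = bytearray()
    while True:
        # Never request more than what is needed to reach limit_bytes + 1.
        want = min(chunk_bytes, limit_bytes + 1 - len(buffer))
        chunk = stream.read(want)
        if not chunk:
            return bytes(buffer)  # end of stream, still within the limit
        buffer += chunk
        if len(buffer) > limit_bytes:
            raise ResourceLimitError(f"source exceeds limit of {limit_bytes} bytes")


small = read_bounded(io.BytesIO(b"abcd"), limit_bytes=4, chunk_bytes=3)
```

Because each request is capped, an oversized source costs at most `limit_bytes + 1` bytes of buffer before the error is raised.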
PyArrow Filesystem Integration
When PyArrow is installed, every file reader and file-to-file converter can use
pyarrow.fs URI strings. This covers read_csv, read_json,
read_json_folder, read_jsonl, read_xml, read_xml_folder,
read_parquet, to_csv, to_jsonl, and to_parquet. Supported URI input
extensions include csv, json, jsonl, ndjson, xml, parquet, and
pq. Supported URI converter output extensions include csv, jsonl, and
parquet.
For normal local files, prefer a regular path:
events = ss.read_jsonl("/home/user/data/events.jsonl")
Regular local paths are the simplest and usually best choice for local disk access. They avoid PyArrow URI parsing and filesystem dispatch.
file:// is PyArrow's local-filesystem URI scheme. On Linux and WSL, absolute
local paths use three slashes: file:///home/user/data/events.jsonl. That URI
points to the same file as /home/user/data/events.jsonl, but it is opened
through pyarrow.fs.LocalFileSystem. Use it when you specifically want to test
the PyArrow filesystem route or when your code passes filesystem URIs
consistently across local and cloud storage. Do not write file://home/user/...;
that form has home in the URI host position instead of being an absolute local
path.
| Local form | Example | Opens through | Best use |
|---|---|---|---|
| Regular local path | /home/user/data/events.jsonl | schema-sanitizer local path handling | Default for local disk files. |
| Local PyArrow URI | file:///home/user/data/events.jsonl | pyarrow.fs.LocalFileSystem | Testing or URI-only code paths. |
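The two-slash mistake is easy to see by splitting both forms with Python's standard URL parser. This uses `urllib.parse`, not PyArrow's own URI handling, so it only illustrates how the generic URI grammar assigns the host and path components:

```python
from urllib.parse import urlsplit

# Three slashes: empty host, absolute path.
good = urlsplit("file:///home/user/data/events.jsonl")
# Two slashes: "home" lands in the host position and is lost from the path.
bad = urlsplit("file://home/user/data/events.jsonl")

print(repr(good.netloc), good.path)
print(repr(bad.netloc), bad.path)
```

In the first form the host is empty and the path is the full `/home/user/data/events.jsonl`; in the second, `home` is parsed as the host and the path shrinks to `/user/data/events.jsonl`.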
Common URI forms:
| Storage | Example URI |
|---|---|
| Local file through PyArrow | file:///home/user/data/events.jsonl |
| Amazon S3 | s3://raw-bucket/events/2026-06-12.jsonl |
| Amazon S3 folder | s3://raw-bucket/events/2026-06-12/ |
| Google Cloud Storage | gs://raw-bucket/assets/2026-06-12.parquet |
| Google Cloud Storage folder | gs://raw-bucket/assets/2026-06-12/ |
| Google Cloud Storage alias | gcs://raw-bucket/assets/2026-06-12.xml |
| Azure Data Lake Storage Gen2 | abfs://container@account.dfs.core.windows.net/events/2026-06-12.jsonl |
| Azure Data Lake Storage Gen2 folder | abfs://container@account.dfs.core.windows.net/events/2026-06-12/ |
Cloud URI support depends on the installed PyArrow build and the normal provider credentials/configuration available to PyArrow.
import schema_sanitizer as ss

events = ss.read_jsonl("s3://raw-bucket/events/2026-06-12.jsonl")
assets = ss.read_parquet("gs://raw-bucket/assets/2026-06-12.parquet")
daily_events = ss.read_json_folder("s3://raw-bucket/events/2026-06-12/")
ss.to_parquet(
    "s3://raw-bucket/events/2026-06-12.jsonl",
    "gs://clean-bucket/events/2026-06-12.parquet",
)
URI file inputs are opened as seekable PyArrow files. CSV, JSON, JSON Lines,
NDJSON, and XML bytes are fed directly to the native chunk scanner. Parquet is
decoded with pyarrow.parquet into batches, converted incrementally to JSON
Lines bytes, and then fed to the same native sanitizer path. No single-file URI
input is copied to a temporary file by schema-sanitizer.
Folder URI inputs are listed with non-recursive pyarrow.fs.FileSelector.
read_json_folder filters direct .json child files and read_xml_folder
filters direct .xml child files. The matching children are sorted by
filename, then compacted one document at a time into a local temporary stream
before the normal sanitizer pipeline reads that stream.
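For a local folder, the same selection rule (direct children only, filtered by extension, sorted by filename) can be sketched with `pathlib`. This is an analogue of the behavior described above, not the library's listing code: `list_direct_children` is a hypothetical helper, and the case-sensitive suffix match is an assumption.

```python
from pathlib import Path
from typing import List


def list_direct_children(folder: str, suffix: str = ".json") -> List[Path]:
    """Non-recursive listing: direct child files with `suffix`, sorted by filename."""
    children = [
        path
        for path in Path(folder).iterdir()  # iterdir() does not descend into subfolders
        if path.is_file() and path.suffix == suffix
    ]
    return sorted(children, key=lambda path: path.name)
```

Files in nested subfolders are ignored, so a layout such as `events/2026-06-12/part/extra.json` would not be picked up when listing `events/2026-06-12/`.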
URI outputs are opened with pyarrow.fs.open_output_stream. CSV and Parquet
writers stream Arrow batches to that output stream, and JSON Lines writes UTF-8
bytes incrementally. The output URI is not staged through a local temporary file.
Supported Inputs
Supported inputs are intentionally file-oriented:
- Normal local file paths for read_csv, read_json, read_jsonl, read_xml, read_parquet, to_csv, to_jsonl, and to_parquet.
- PyArrow filesystem file URI strings for the same single-file readers and converters when PyArrow is installed and can open the URI.
- Normal local folders for read_json_folder and read_xml_folder.
- PyArrow filesystem folder URI strings for read_json_folder and read_xml_folder; folder exploration is non-recursive.
- Already-resident list[dict] rows through read_python.
Unsupported Inputs
Unsupported inputs include raw JSON or XML strings, bytes payloads, opened
files, io.BytesIO, io.StringIO, custom reader objects, URLs that PyArrow
cannot open as files, and recursive folder scans. Write those inputs to a local
file first, or use read_python for in-memory list[dict] rows.
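One way to follow that advice for an in-memory payload such as `io.BytesIO` is to spill it to a temporary file and pass the path on. The helper below is a hypothetical sketch built only on the standard library; the commented `ss.read_jsonl` call shows where the documented reader would be used.

```python
import io
import os
import shutil
import tempfile
from typing import BinaryIO


def spill_to_temp_file(stream: BinaryIO, suffix: str = ".jsonl", chunk_bytes: int = 1 << 20) -> str:
    """Copy a binary stream to a named temp file in bounded chunks; return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as out:
            shutil.copyfileobj(stream, out, chunk_bytes)
    except BaseException:
        os.unlink(path)  # do not leave a partial file behind
        raise
    return path


payload = io.BytesIO(b'{"id": 1}\n{"id": 2}\n')
path = spill_to_temp_file(payload)
# result = ss.read_jsonl(path)  # sanitize the spilled file as a normal local path
# os.unlink(path)               # the caller removes the temp file when done
```

The suffix matters when relying on extension-based format detection, so it is chosen to match the payload's format.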
Examples
The examples/ directory contains tutorial notebooks and one cloud pipeline
CLI example:
- 01_ingestion_and_core_api.ipynb
- 02_options_and_stats.ipynb
- 03_adapters_and_converters.ipynb
- 04_streaming_large_csv_to_parquet.ipynb
- 05_full_options_catalog_sweep.ipynb
- 06_xml_reading_and_memory.ipynb
- 07_gcs_jsonl_to_silver_parquet.py: GCS JSONL to GCS Parquet using a BigQuery external table schema fetched through ADBC as base_schema
Platform Notes
Published PyPI wheels target glibc-based Linux environments
(manylinux_2_28). Alpine Linux uses musl, so Alpine users should use a
glibc-based Python environment or build from source.
Development
Install the project for local development:
pip install -e .[dev]
Run the tests:
pytest
Build the native core directly with CMake:
cmake -S . -B build/dev -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build/dev
License
schema-sanitizer is licensed under the Apache License 2.0. See
LICENSE.