MVP: Detect and infer schemas from files/dirs/DataFrames; emit YAML/JSON/TXT/Spark StructType

Project description

schema-classifier

PySchemaClassifier is a Python library and CLI that detects file, table, and DataFrame formats, infers or extracts their schemas, and emits them as a Spark StructType-like dict, YAML, JSON, or TXT. The MVP focuses on single-level compression, core formats (CSV/JSON/XML/Parquet/Avro/ORC plus Delta/Iceberg/Hudi metadata), sampling policies, and robust exceptions.
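
To make the "Spark StructType-like dict" output concrete, here is a minimal sketch of inferring such a dict from a few in-memory records. The function names (`infer_field_type`, `to_struct_dict`) and the type-widening rules are assumptions for illustration, not the library's API; only the output shape (`type`, `fields`, `name`, `nullable`, `metadata`) follows what Spark's `StructType.jsonValue()` produces.

```python
from typing import Any

# Hypothetical mapping from Python value types to Spark SQL type names.
_SPARK_TYPES = {bool: "boolean", int: "long", float: "double", str: "string"}


def infer_field_type(values: list[Any]) -> str:
    """Pick one Spark type name for a column; fall back to string on mixed types."""
    kinds = {type(v) for v in values if v is not None}
    if not kinds:
        return "string"  # all nulls: string is an arbitrary fallback
    if kinds == {int, float}:
        return "double"  # widen integers to double
    if len(kinds) == 1:
        return _SPARK_TYPES.get(kinds.pop(), "string")
    return "string"


def to_struct_dict(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a dict shaped like the output of Spark's StructType.jsonValue()."""
    names: list[str] = []
    for rec in records:  # preserve first-seen column order
        for key in rec:
            if key not in names:
                names.append(key)
    fields = []
    for name in names:
        values = [rec.get(name) for rec in records]
        fields.append({
            "name": name,
            "type": infer_field_type(values),
            "nullable": any(v is None for v in values),
            "metadata": {},
        })
    return {"type": "struct", "fields": fields}
```

Calling `to_struct_dict([{"id": 1, "price": 9.5}, {"id": 2, "price": 3}])` yields a `struct` with a `long` field `id` and a `double` field `price`, both non-nullable.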

Status

This is a design-locked skeleton for MVP implementation. Modules are scaffolded with docstrings and TODO markers.

Quick Start

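# create and activate a venv
python -m venv .venv
source .venv/bin/activate
# or, on Windows
.\.venv\Scripts\activate

# editable install
pip install -e .
# optional: editable install with the ORC extra (quoted so the shell does not expand the brackets)
pip install -e ".[orc]"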

# run CLI (prints skeleton info)
schema-detect --help

# Try the commands below to test the framework.
# Defaults when a flag is omitted:
#   --fmt yaml
#   --output-dir .
#   --output-file schema.yml

schema-detect tests/data/csv/sales_header.csv
schema-detect tests/data/csv/sales_no_header.csv --fmt yaml --output-file schema_no_header.yml
schema-detect tests/data/csv/very_wide.csv --fmt yaml --output-file schema_wide.yml
schema-detect tests/data/csv/sales_utf8_sig.csv --fmt yaml --output-file schema_utf8.yml
schema-detect tests/data/orc/TestOrcFile.testDate1900.orc --fmt yaml --output-file schema_orc.yml
schema-detect tests/data/avro/weather.avro --fmt yaml --output-file schema_avro.yml
schema-detect tests/data/parquet/v0.7.1.all-named-index.parquet --fmt yaml --output-file schema_pqt.yml
schema-detect tests/data/delta/people_countries_delta_dask/ --fmt yaml --output-file schema_delta.yml
schema-detect tests/data/json/events.ndjson --fmt yaml --output-file schema_json.yml
schema-detect tests/data/xml/books.xml --fmt yaml --output-file schema_xml.yml
schema-detect tests/data/csv/ --multi-file-fmt txt
schema-detect tests/data/csv/sales_20250101.csv --fmt json --output-file schema_date.json

## To print the schema to the terminal
schema-detect tests/data/json/events.ndjson --fmt dict

## To test the Python APIs (Windows path shown)
python .\tests\unit\combine_run_schema.py

## To build the image (PowerShell; pick one target)
.\build.ps1 -Target [test|prod]

CLI Overview (MVP)

Single command: schema-detect <path> with write options and detection/sampling knobs.

Key flags (subset):

--detection-mode {trust_hint,verify_hint,auto_detect} (default: trust_hint)
--coverage-mode {any,max,full} (default: max)
--sample-records (default: 500)
--sample-bytes (default: 5MB for coverage mode any; capped at 100MB for coverage mode full)
--output-dir, --output-file, --fmt {yaml,json,txt,dict}
--zip-max-size (default: 500MB), --zip-max-members (default: 100)
--max-file-size (default: 50GB)
--sample-total-bytes-cap (soft cap default: 1GB)
--max-workers (default: os.cpu_count())
--retries (default: 3), --timeout-seconds (default: 180)
--log-json (opt-in), -v/--verbose
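
As an illustration of how the flags above could be declared, here is a minimal `argparse` sketch. It covers only a subset of the flags, uses the defaults stated in this README, and is not the project's actual `cli.py`; treating `-v` as a repeat counter is an assumption.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a schema-detect parser using the defaults listed in the README."""
    parser = argparse.ArgumentParser(prog="schema-detect")
    parser.add_argument("path", help="file, directory, or table root to inspect")
    parser.add_argument(
        "--detection-mode",
        choices=["trust_hint", "verify_hint", "auto_detect"],
        default="trust_hint",
    )
    parser.add_argument("--coverage-mode", choices=["any", "max", "full"], default="max")
    parser.add_argument("--sample-records", type=int, default=500)
    parser.add_argument("--fmt", choices=["yaml", "json", "txt", "dict"], default="yaml")
    parser.add_argument("--output-dir", default=".")
    parser.add_argument("--output-file", default="schema.yml")
    parser.add_argument("--max-workers", type=int, default=os.cpu_count())
    parser.add_argument("--retries", type=int, default=3)
    parser.add_argument("--timeout-seconds", type=int, default=180)
    parser.add_argument("--log-json", action="store_true")  # opt-in JSON logs
    # Assumption: -v may be repeated to raise verbosity.
    parser.add_argument("-v", "--verbose", action="count", default=0)
    return parser
```

For example, `build_parser().parse_args(["tests/data/csv/sales_header.csv", "--fmt", "json"])` returns a namespace with `fmt="json"` and every other option at its default.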

CSV knobs (MVP): --csv.header {auto,true,false} (auto flips to true when confidence ≥ 0.80), --csv.delimiter, --csv.quote, --csv.escape, --encoding (utf-8/utf-8-sig/utf-16le/utf-16be).
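
The `auto` header rule can be pictured with a small sketch. The scoring below is a hypothetical heuristic (the README does not say how confidence is computed): among columns whose data rows are all numeric, it measures the share whose first-row cell is not numeric, then applies the 0.80 threshold. All function names are illustrative.

```python
import csv
import io

HEADER_CONFIDENCE_THRESHOLD = 0.80  # threshold stated in the README


def read_rows(text: str) -> list[list[str]]:
    """Parse CSV text into rows of strings."""
    return list(csv.reader(io.StringIO(text)))


def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def header_confidence(rows: list[list[str]]) -> float:
    """Hypothetical score: share of numeric columns whose first cell is not numeric."""
    if len(rows) < 2:
        return 0.0
    first, body = rows[0], rows[1:]
    votes = 0
    numeric_columns = 0
    for col, cell in enumerate(first):
        data = [r[col] for r in body if col < len(r) and r[col] != ""]
        if data and all(_is_number(v) for v in data):
            numeric_columns += 1
            if not _is_number(cell):
                votes += 1  # text sitting on top of numeric data suggests a header
    return votes / numeric_columns if numeric_columns else 0.0


def resolve_header(mode: str, rows: list[list[str]]) -> bool:
    """Apply the {auto,true,false} setting; auto uses the confidence threshold."""
    if mode == "true":
        return True
    if mode == "false":
        return False
    return header_confidence(rows) >= HEADER_CONFIDENCE_THRESHOLD
```

A file made up entirely of text columns gives this heuristic no evidence either way, so it scores 0.0 and `auto` resolves to no header; a real implementation would need additional signals for that case.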

Project Layout

schema-classifier/
├─ README.md
├─ LICENSE
├─ pyproject.toml                   # packaging, deps, console script
├─ .gitignore
├─ src/
│  └─ pyschemaclassifier/          # library: prefer 'PySchemaClassifier' (or 'open_pyschemaclassifier' if name taken)
│     ├─ __init__.py
│     ├─ cli.py                    # CLI: schema-detect entrypoint
│     ├─ infer.py                  # Orchestrator: classify → detect → normalize → emit
│     ├─ config.py                 # Config model + load/merge logic (flags override YAML)
│     ├─ logging_utils.py          # Colored logs, JSON logs, verbosity levels
│     ├─ exceptions.py             # ArgumentError + taxonomy (DetectionError, etc.)
│     ├─ models/
│     │  ├─ __init__.py
│     │  └─ schema.py              # Normalized schema model + Spark StructType JSON conversion
│     ├─ detection/
│     │  ├─ __init__.py
│     │  ├─ classifier.py          # extension/magic bytes / table markers (delta/_delta_log, iceberg metadata.json, .hoodie)
│     │  ├─ compression.py         # gzip/bz2/xz/zstd/zip one-level handling; size/member caps; corruption checks
│     │  ├─ sampling.py            # Sampling state machine (records/bytes, coverage_mode, error budget)
│     │  ├─ csv.py                 # Basic delimiter/quote/escape/BOM/encoding; header auto w/ ≥0.80
│     │  ├─ json.py                # NDJSON vs JSON object/array; recursive inference; unions off by default
│     │  ├─ xml.py                 # Basic element→object; arrays via repeated elements (iterparse)
│     │  ├─ parquet.py             # Footer-based extraction via pyarrow; logical type mapping
│     │  ├─ avro.py                # Schema extraction via fastavro
│     │  ├─ orc.py                 # Schema via pyorc
│     │  ├─ delta.py               # Latest snapshot from _delta_log JSON (names in schema, IDs in metadata)
│     │  ├─ iceberg.py             # Parse metadata.json; partition transforms to metadata
│     │  └─ hudi.py                # COW support; raise TableFormatError for MOR
│     ├─ writers/
│     │  ├─ __init__.py
│     │  ├─ yaml.py                # schema.yml writer (default)
│     │  ├─ json.py                # Pretty JSON (schema + meta)
│     │  ├─ txt.py                 # Human-friendly text summary
│     │  └─ struct.py              # Return dict exactly matching Spark StructType.jsonValue()
│     ├─ dataframe/
│     │  ├─ __init__.py
│     │  ├─ pandas.py              # detect_schema_from_df(pd.DataFrame)
│     │  └─ spark.py               # detect_schema_from_df(Spark DataFrame)
│     └─ utils/
│        ├─ __init__.py
│        ├─ io.py                  # Safe open/stream, retries (3), timeouts (180s), size pre-checks (50 GB)
│        ├─ path.py                # Path utilities, dir traversal, per-file sampling selection
│        └─ metrics.py             # Confidence scoring; delimiter stability; provenance/meta helpers
├─ tests/
│  ├─ conftest.py
│  ├─ unit/
│  │  ├─ test_cli.py
│  │  ├─ test_config.py
│  │  ├─ test_exceptions.py
│  │  ├─ test_sampling.py
│  │  ├─ test_csv.py
│  │  ├─ test_json.py
│  │  ├─ test_xml.py
│  │  ├─ test_parquet.py
│  │  ├─ test_avro.py
│  │  ├─ test_orc.py
│  │  ├─ test_delta.py
│  │  ├─ test_iceberg.py
│  │  └─ test_hudi.py
│  ├─ integration/
│  │  ├─ test_directory_mode.py
│  │  ├─ test_zip_container.py
│  │  └─ test_parallel_sampling.py
│  ├─ data/                        # Small fixtures per format + compression (MVP-focused)
│  │  ├─ csv/
│  │  │  ├─ sales_header.csv
│  │  │  ├─ sales_no_header.csv
│  │  │  ├─ sales_utf8_sig.csv
│  │  │  ├─ sales.csv.gz
│  │  │  └─ very_wide.csv
│  │  ├─ json/
│  │  │  ├─ events.ndjson
│  │  │  └─ events.ndjson.gz
│  │  ├─ xml/
│  │  │  └─ books.xml
│  │  ├─ parquet/
│  │  │  └─ sample.parquet
│  │  ├─ avro/
│  │  │  └─ sample.avro
│  │  ├─ orc/
│  │  │  └─ sample.orc
│  │  ├─ delta/
│  │  │  └─ _delta_log/           # minimal commit/checkpoint JSONs
│  │  ├─ iceberg/
│  │  │  └─ metadata.json
│  │  ├─ hudi/
│  │  │  └─ .hoodie/              # COW minimal markers
│  │  └─ containers/
│  │     └─ sample.zip            # Small multi-entry zip within 500MB limit
│  └─ golden/
│     ├─ csv_sales_header.schema.yml
│     ├─ csv_sales_no_header.schema.yml
│     ├─ json_events_ndjson.schema.yml
│     ├─ parquet_sample.schema.yml
│     └─ … (one per format/compression)
├─ examples/
│  ├─ configs/
│  │  └─ mvp_defaults.yml         # Shows all overridable knobs; flags override
│  ├─ cli/
│  │  ├─ detect_csv.sh            # Illustrative commands (no actual code execution here)
│  │  ├─ detect_json.sh
│  │  └─ detect_parquet.sh
│  └─ api/
│     ├─ detect_from_path.md      # Usage examples (text only)
│     └─ detect_from_df.md        # Spark/pandas API examples (text only)
└─ docs/
   ├─ mvp_overview.md             # Short doc; full docs post-MVP
   └─ config_reference.md         # Flags and YAML keys
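
The classifier step described for `detection/classifier.py` (magic bytes plus table markers) could look roughly like the sketch below. The byte signatures are the standard leading bytes of each format; the function names, the return labels, and the Iceberg check (a `metadata.json` file as in the fixture layout, or `metadata/*.metadata.json`) are assumptions for illustration.

```python
from pathlib import Path

# Standard leading bytes for the file and compression formats named above.
_MAGIC_PREFIXES = [
    (b"PAR1", "parquet"),
    (b"Obj\x01", "avro"),
    (b"ORC", "orc"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"PK\x03\x04", "zip"),
]

# Directory markers that identify a table format.
_TABLE_MARKERS = [("_delta_log", "delta"), (".hoodie", "hudi")]


def classify_bytes(head: bytes) -> str:
    """Return a format label from the first bytes of a file, or 'unknown'."""
    for prefix, name in _MAGIC_PREFIXES:
        if head.startswith(prefix):
            return name
    return "unknown"  # text formats (CSV/JSON/XML) need content sniffing instead


def classify(path: Path) -> str:
    """Classify a directory by table markers, or a file by its magic bytes."""
    if path.is_dir():
        for marker, name in _TABLE_MARKERS:
            if (path / marker).is_dir():
                return name
        # Assumption: either layout counts as Iceberg metadata.
        if (path / "metadata.json").is_file() or any(path.glob("metadata/*.metadata.json")):
            return "iceberg"
        return "directory"
    with path.open("rb") as fh:
        return classify_bytes(fh.read(8))
```

Checking magic bytes before trusting the extension is what makes a `verify_hint`-style mode possible: a file named `data.csv` that starts with `PAR1` can be flagged as a mismatch.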

Release history

1.0.1 (Jan 10, 2026)
1.0.0 (Jan 10, 2026)
0.1.1 (Jan 2, 2026)
0.1.0 (Jan 2, 2026), this version

Download files

Source Distribution

schema_classifier-0.1.0.tar.gz (18.1 kB)

Uploaded Jan 2, 2026 Source

Built Distribution

schema_classifier-0.1.0-py3-none-any.whl (25.3 kB)

Uploaded Jan 2, 2026 Python 3

File details

Details for the file schema_classifier-0.1.0.tar.gz.

File metadata

Download URL: schema_classifier-0.1.0.tar.gz
Upload date: Jan 2, 2026
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for schema_classifier-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`02329509de250e908a75c2773114328442678cd2e9878eccaf6ab41c78e3a3af`
MD5	`467cccdc750bf360161aa27b9f3290bb`
BLAKE2b-256	`014d2ebd77d42d234734c1681aac61a5a3395edaa8acf12b5b522d760d047486`

File details

Details for the file schema_classifier-0.1.0-py3-none-any.whl.

File metadata

Download URL: schema_classifier-0.1.0-py3-none-any.whl
Upload date: Jan 2, 2026
Size: 25.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for schema_classifier-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7a762a0f277935bca1c7749913bf09467cf04f13c00b150a57a48232ac44040b`
MD5	`60f7cd6ccf81d12ae30d886f390ccd82`
BLAKE2b-256	`0c19af0ce4f05d762095de2dccfd81971babb9c7f156dd280357b8f2c7670ec2`
