MVP: Detect and infer schemas from files/dirs/DataFrames; emit YAML/JSON/TXT/Spark StructType
Project description
schema-classifier
PySchemaClassifier is a Python library and CLI that detects file, table, and DataFrame formats, infers or extracts schemas, and emits them as a Spark StructType-like dict, YAML, JSON, or TXT. The MVP focuses on single-level compression, core formats (CSV/JSON/XML/Parquet/Avro/ORC plus Delta/Iceberg/Hudi metadata), sampling policies, and robust exceptions.
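As a rough illustration of format detection, the sketch below matches the leading bytes of a file against well-known signatures. It is not the library's implementation: `MAGIC_PREFIXES` and `sniff_format` are hypothetical names, and text formats such as CSV, JSON, and XML carry no signature, so they would need extension hints or content heuristics instead.

```python
from typing import Optional

# Well-known leading byte signatures. The table and function names are
# illustrative only and are not taken from the library.
MAGIC_PREFIXES = [
    (b"PAR1", "parquet"),
    (b"Obj\x01", "avro"),
    (b"ORC", "orc"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"PK\x03\x04", "zip"),
]


def sniff_format(head: bytes) -> Optional[str]:
    """Return a format label for the first bytes of a file, or None if unknown."""
    for prefix, label in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return label
    return None
```

Passing the first few bytes read from a file is enough; a `None` result means the caller has to fall back to another signal.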
Status
This is a design-locked skeleton for MVP implementation. Modules are scaffolded with docstrings and TODO markers.
Quick Start
# create and activate venv
python -m venv .venv
source .venv/bin/activate
# or, on Windows
.\.venv\Scripts\activate
# editable install
pip install -e .
# optional: include the orc extra (quoted so shells such as zsh do not expand the brackets)
pip install -e ".[orc]"
# run CLI (prints skeleton info)
schema-detect --help
# Try the commands below to test the framework.
# Defaults: --fmt yaml, --output-dir ., --output-file schema.yml
schema-detect tests/data/csv/sales_header.csv
schema-detect tests/data/csv/sales_no_header.csv --fmt yaml --output-file schema_no_header.yml
schema-detect tests/data/csv/very_wide.csv --fmt yaml --output-file schema_wide.yml
schema-detect tests/data/csv/sales_utf8_sig.csv --fmt yaml --output-file schema_utf8.yml
schema-detect tests/data/orc/TestOrcFile.testDate1900.orc --fmt yaml --output-file schema_orc.yml
schema-detect tests/data/avro/weather.avro --fmt yaml --output-file schema_avro.yml
schema-detect tests/data/parquet/v0.7.1.all-named-index.parquet --fmt yaml --output-file schema_pqt.yml
schema-detect tests/data/delta/people_countries_delta_dask/ --fmt yaml --output-file schema_delta.yml
schema-detect tests/data/json/events.ndjson --fmt yaml --output-file schema_json.yml
schema-detect tests/data/xml/books.xml --fmt yaml --output-file schema_xml.yml
schema-detect tests/data/csv/ --multi-file-fmt txt
schema-detect tests/data/csv/sales_20250101.csv --fmt json --output-file schema_date.json
## To print the schema on the command line
schema-detect tests/data/json/events.ndjson --fmt dict
## To test Python APIs
python .\tests\unit\combine_run_schema.py
## To build the image
.\build.ps1 -Target [test|prod]
CLI Overview (MVP)
Single command: schema-detect <path> with write options and detection/sampling knobs.
Key flags (subset):
- --detection-mode {trust_hint,verify_hint,auto_detect} (default: trust_hint)
- --coverage-mode {any,max,full} (default: max)
- --sample-records (default: 500)
- --sample-bytes (default: 5MB for any; full capped at 100MB)
- --output-dir, --output-file, --fmt {yaml,json,txt,dict}
- --zip-max-size (default: 500MB), --zip-max-members (default: 100)
- --max-file-size (default: 50GB)
- --sample-total-bytes-cap (soft cap default: 1GB)
- --max-workers (default: os.cpu_count())
- --retries (default: 3), --timeout-seconds (default: 180)
- --log-json (opt-in), -v/--verbose
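To make the interaction between --sample-records and --sample-bytes concrete, here is a minimal sketch of a sampler that stops at whichever limit is hit first. The function name `sample_lines`, the reading of 5MB as 5 * 1024 * 1024 bytes, and the stop-at-first-limit behavior are all assumptions; the README does not say how the coverage modes combine the two limits.

```python
import io
from typing import BinaryIO, List

# Assumed interpretation of the documented defaults (500 records, 5MB).
DEFAULT_MAX_RECORDS = 500
DEFAULT_MAX_BYTES = 5 * 1024 * 1024


def sample_lines(
    stream: BinaryIO,
    max_records: int = DEFAULT_MAX_RECORDS,
    max_bytes: int = DEFAULT_MAX_BYTES,
) -> List[bytes]:
    """Collect whole lines until the record limit or the byte budget is reached."""
    sample: List[bytes] = []
    consumed = 0
    for line in stream:
        # Stop before exceeding either limit so the sample never overshoots.
        if len(sample) >= max_records or consumed + len(line) > max_bytes:
            break
        sample.append(line)
        consumed += len(line)
    return sample
```

Sampling whole lines keeps every sampled record parseable, at the cost of sometimes using slightly less than the byte budget.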
CSV knobs (MVP): --csv.header {auto,true,false} (auto flips to true when confidence ≥ 0.80), --csv.delimiter, --csv.quote, --csv.escape, --encoding (utf-8/utf-8-sig/utf-16le/utf-16be).
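The README gives the 0.80 threshold for --csv.header auto but not how the confidence is computed. The sketch below shows one possible heuristic under that assumption: for every column whose body values are all numeric, a non-numeric first cell counts as a vote for a header. The names `header_confidence` and `has_header` are hypothetical.

```python
import csv
import io
from typing import List


def _is_number(text: str) -> bool:
    """True when the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def header_confidence(rows: List[List[str]]) -> float:
    """Fraction of all-numeric columns whose first cell is not numeric."""
    if len(rows) < 2:
        return 0.0
    first, body = rows[0], rows[1:]
    votes = []
    for idx, cell in enumerate(first):
        column = [row[idx] for row in body if idx < len(row)]
        # Only columns with purely numeric bodies give usable evidence.
        if column and all(_is_number(value) for value in column):
            votes.append(not _is_number(cell))
    return sum(votes) / len(votes) if votes else 0.0


def has_header(rows: List[List[str]], threshold: float = 0.80) -> bool:
    """Apply the documented 0.80 threshold to the illustrative score."""
    return header_confidence(rows) >= threshold
```

A file made up only of string columns yields no votes and scores 0.0 here, so a real implementation would need further signals; the standard library's `csv.Sniffer().has_header` is another reference point.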
schema-classifier/
├─ README.md
├─ LICENSE
├─ pyproject.toml                      # packaging, deps, console script
├─ .gitignore
├─ src/
│  └─ pyschemaclassifier/              # library: prefer 'PySchemaClassifier' (or 'open_pyschemaclassifier' if name taken)
│     ├─ __init__.py
│     ├─ cli.py                        # CLI: schema-detect entrypoint
│     ├─ infer.py                      # Orchestrator: classify → detect → normalize → emit
│     ├─ config.py                     # Config model + load/merge logic (flags override YAML)
│     ├─ logging_utils.py              # Colored logs, JSON logs, verbosity levels
│     ├─ exceptions.py                 # ArgumentError + taxonomy (DetectionError, etc.)
│     ├─ models/
│     │  ├─ __init__.py
│     │  └─ schema.py                  # Normalized schema model + Spark StructType JSON conversion
│     ├─ detection/
│     │  ├─ __init__.py
│     │  ├─ classifier.py              # extension/magic bytes / table markers (delta/_delta_log, iceberg metadata.json, .hoodie)
│     │  ├─ compression.py             # gzip/bz2/xz/zstd/zip one-level handling; size/member caps; corruption checks
│     │  ├─ sampling.py                # Sampling state machine (records/bytes, coverage_mode, error budget)
│     │  ├─ csv.py                     # Basic delimiter/quote/escape/BOM/encoding; header auto w/ ≥0.80
│     │  ├─ json.py                    # NDJSON vs JSON object/array; recursive inference; unions off by default
│     │  ├─ xml.py                     # Basic element→object; arrays via repeated elements (iterparse)
│     │  ├─ parquet.py                 # Footer-based extraction via pyarrow; logical type mapping
│     │  ├─ avro.py                    # Schema extraction via fastavro
│     │  ├─ orc.py                     # Schema via pyorc
│     │  ├─ delta.py                   # Latest snapshot from _delta_log JSON (names in schema, IDs in metadata)
│     │  ├─ iceberg.py                 # Parse metadata.json; partition transforms to metadata
│     │  └─ hudi.py                    # COW support; raise TableFormatError for MOR
│     ├─ writers/
│     │  ├─ __init__.py
│     │  ├─ yaml.py                    # schema.yml writer (default)
│     │  ├─ json.py                    # Pretty JSON (schema + meta)
│     │  ├─ txt.py                     # Human-friendly text summary
│     │  └─ struct.py                  # Return dict exactly matching Spark StructType.jsonValue()
│     ├─ dataframe/
│     │  ├─ __init__.py
│     │  ├─ pandas.py                  # detect_schema_from_df(pd.DataFrame)
│     │  └─ spark.py                   # detect_schema_from_df(Spark DataFrame)
│     └─ utils/
│        ├─ __init__.py
│        ├─ io.py                      # Safe open/stream, retries (3), timeouts (180s), size pre-checks (50 GB)
│        ├─ path.py                    # Path utilities, dir traversal, per-file sampling selection
│        └─ metrics.py                 # Confidence scoring; delimiter stability; provenance/meta helpers
├─ tests/
│  ├─ conftest.py
│  ├─ unit/
│  │  ├─ test_cli.py
│  │  ├─ test_config.py
│  │  ├─ test_exceptions.py
│  │  ├─ test_sampling.py
│  │  ├─ test_csv.py
│  │  ├─ test_json.py
│  │  ├─ test_xml.py
│  │  ├─ test_parquet.py
│  │  ├─ test_avro.py
│  │  ├─ test_orc.py
│  │  ├─ test_delta.py
│  │  ├─ test_iceberg.py
│  │  └─ test_hudi.py
│  ├─ integration/
│  │  ├─ test_directory_mode.py
│  │  ├─ test_zip_container.py
│  │  └─ test_parallel_sampling.py
│  ├─ data/                            # Small fixtures per format + compression (MVP-focused)
│  │  ├─ csv/
│  │  │  ├─ sales_header.csv
│  │  │  ├─ sales_no_header.csv
│  │  │  ├─ sales_utf8_sig.csv
│  │  │  ├─ sales.csv.gz
│  │  │  └─ very_wide.csv
│  │  ├─ json/
│  │  │  ├─ events.ndjson
│  │  │  └─ events.ndjson.gz
│  │  ├─ xml/
│  │  │  └─ books.xml
│  │  ├─ parquet/
│  │  │  └─ sample.parquet
│  │  ├─ avro/
│  │  │  └─ sample.avro
│  │  ├─ orc/
│  │  │  └─ sample.orc
│  │  ├─ delta/
│  │  │  └─ _delta_log/                # minimal commit/checkpoint JSONs
│  │  ├─ iceberg/
│  │  │  └─ metadata.json
│  │  ├─ hudi/
│  │  │  └─ .hoodie/                   # COW minimal markers
│  │  └─ containers/
│  │     └─ sample.zip                 # Small multi-entry zip within 500MB limit
│  └─ golden/
│     ├─ csv_sales_header.schema.yml
│     ├─ csv_sales_no_header.schema.yml
│     ├─ json_events_ndjson.schema.yml
│     ├─ parquet_sample.schema.yml
│     └─ … (one per format/compression)
├─ examples/
│  ├─ configs/
│  │  └─ mvp_defaults.yml              # Shows all overridable knobs; flags override
│  ├─ cli/
│  │  ├─ detect_csv.sh                 # Illustrative commands (no actual code execution here)
│  │  ├─ detect_json.sh
│  │  └─ detect_parquet.sh
│  └─ api/
│     ├─ detect_from_path.md           # Usage examples (text only)
│     └─ detect_from_df.md             # Spark/pandas API examples (text only)
└─ docs/
   ├─ mvp_overview.md                  # Short doc; full docs post-MVP
   └─ config_reference.md              # Flags and YAML keys
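The layout describes writers/struct.py as returning a dict that matches Spark's StructType.jsonValue(). The sketch below shows what that dict shape looks like when built from a single sample record. The function `to_struct_dict`, the Python-to-Spark type mapping, and the choice to mark every field nullable are assumptions for illustration, not the library's API.

```python
from typing import Any, Dict

# Assumed mapping from Python value types to Spark type names.
_SPARK_TYPES = {bool: "boolean", int: "long", float: "double", str: "string"}


def to_struct_dict(record: Dict[str, Any]) -> Dict[str, Any]:
    """Build a dict in the StructType.jsonValue() layout from one record."""
    fields = []
    for name, value in record.items():
        fields.append(
            {
                "name": name,
                # Unmapped types (including None) fall back to string in this sketch.
                "type": _SPARK_TYPES.get(type(value), "string"),
                "nullable": True,
                "metadata": {},
            }
        )
    return {"type": "struct", "fields": fields}
```

With PySpark installed, a dict of this shape should be accepted by `StructType.fromJson`, which is a convenient way to check that an emitted schema round-trips.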
File details
Details for the file schema_classifier-0.1.0.tar.gz.
File metadata
- Download URL: schema_classifier-0.1.0.tar.gz
- Upload date:
- Size: 18.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 02329509de250e908a75c2773114328442678cd2e9878eccaf6ab41c78e3a3af |
| MD5 | 467cccdc750bf360161aa27b9f3290bb |
| BLAKE2b-256 | 014d2ebd77d42d234734c1681aac61a5a3395edaa8acf12b5b522d760d047486 |
File details
Details for the file schema_classifier-0.1.0-py3-none-any.whl.
File metadata
- Download URL: schema_classifier-0.1.0-py3-none-any.whl
- Upload date:
- Size: 25.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7a762a0f277935bca1c7749913bf09467cf04f13c00b150a57a48232ac44040b |
| MD5 | 60f7cd6ccf81d12ae30d886f390ccd82 |
| BLAKE2b-256 | 0c19af0ce4f05d762095de2dccfd81971babb9c7f156dd280357b8f2c7670ec2 |