
MVP: Detect and infer schemas from files/dirs/DataFrames; emit YAML/JSON/TXT/Spark StructType

Project description

schema-classifier

PySchemaClassifier is a Python library and CLI that detects file, table, and DataFrame formats, infers or extracts their schemas, and emits them as a Spark StructType-like dict, YAML, JSON, or TXT. The MVP focuses on single-level compression, core formats (CSV/JSON/XML/Parquet/Avro/ORC plus Delta/Iceberg/Hudi metadata), sampling policies, and robust exceptions.
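
To make the "Spark StructType-like dict" output concrete, here is a minimal sketch of the shape that PySpark's StructType.jsonValue() produces. The helper name to_struct_dict and the sample columns are hypothetical and are not part of this package's API; only the dict layout (type, fields, name, nullable, metadata) follows Spark's JSON representation.

```python
import json


def to_struct_dict(columns):
    """Build a dict shaped like pyspark's StructType.jsonValue() output.

    `columns` is an iterable of (name, spark_type_name, nullable) tuples.
    This helper is illustrative only, not the library's writer.
    """
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": dtype, "nullable": nullable, "metadata": {}}
            for name, dtype, nullable in columns
        ],
    }


# Hypothetical columns for a sales file.
schema = to_struct_dict([("order_id", "long", False), ("amount", "double", True)])
print(json.dumps(schema, indent=2))
```

A dict in this layout can be turned back into a Spark schema with StructType.fromJson, which is why matching the field names exactly matters.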

Status

This is a design-locked skeleton for MVP implementation. Modules are scaffolded with docstrings and TODO markers.

Quick Start

# create and activate a virtual environment
python -m venv .venv
# Linux/macOS
source .venv/bin/activate
# Windows
.\.venv\Scripts\activate

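# editable install
pip install -e .
# editable install with the optional orc extra (quoted so shells such as zsh do not expand the brackets)
pip install -e ".[orc]"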

# run CLI (prints skeleton info)
schema-detect --help

# Try the commands below to exercise the CLI.
# Defaults when a flag is omitted:
#   --fmt yaml
#   --output-dir .
#   --output-file schema.yml

schema-detect tests/data/csv/sales_header.csv
schema-detect tests/data/csv/sales_no_header.csv --fmt yaml --output-file schema_no_header.yml
schema-detect tests/data/csv/very_wide.csv --fmt yaml --output-file schema_wide.yml
schema-detect tests/data/csv/sales_utf8_sig.csv --fmt yaml --output-file schema_utf8.yml
schema-detect tests/data/orc/TestOrcFile.testDate1900.orc --fmt yaml --output-file schema_orc.yml
schema-detect tests/data/avro/weather.avro --fmt yaml --output-file schema_avro.yml
schema-detect tests/data/parquet/v0.7.1.all-named-index.parquet --fmt yaml --output-file schema_pqt.yml
schema-detect tests/data/delta/people_countries_delta_dask/ --fmt yaml --output-file schema_delta.yml
schema-detect tests/data/json/events.ndjson --fmt yaml --output-file schema_json.yml
schema-detect tests/data/xml/books.xml --fmt yaml --output-file schema_xml.yml
schema-detect tests/data/csv/ --multi-file-fmt txt
schema-detect tests/data/csv/sales_20250101.csv --fmt json --output-file schema_date.json

# print the schema to the terminal
schema-detect tests/data/json/events.ndjson --fmt dict
# test the Python APIs
python .\tests\unit\combine_run_schema.py
# build the image (choose one target: test or prod)
.\build.ps1 -Target [test|prod]

CLI Overview (MVP)

Single command: schema-detect <path> with write options and detection/sampling knobs.

Key flags (subset):

  • --detection-mode {trust_hint,verify_hint,auto_detect} (default: trust_hint)
  • --coverage-mode {any,max,full} (default: max)
  • --sample-records (default: 500)
  • --sample-bytes (default: 5MB for coverage mode any; capped at 100MB for coverage mode full)
  • --output-dir, --output-file, --fmt {yaml,json,txt,dict}
  • --zip-max-size (default: 500MB), --zip-max-members (default: 100)
  • --max-file-size (default: 50GB)
  • --sample-total-bytes-cap (soft cap default: 1GB)
  • --max-workers (default: os.cpu_count())
  • --retries (default: 3), --timeout-seconds (default: 180)
  • --log-json (opt-in), -v/--verbose
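
As a rough illustration of how the flags above fit together, the following sketch declares a subset of them with argparse, using the defaults listed in this section. It is an assumption about structure, not the contents of the package's cli.py; build_parser is a hypothetical name, and treating -v/--verbose as a simple on/off switch is also an assumption.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented schema-detect flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="schema-detect")
    parser.add_argument("path", help="file, directory, or table root to inspect")
    parser.add_argument(
        "--detection-mode",
        choices=["trust_hint", "verify_hint", "auto_detect"],
        default="trust_hint",
    )
    parser.add_argument("--coverage-mode", choices=["any", "max", "full"], default="max")
    parser.add_argument("--sample-records", type=int, default=500)
    parser.add_argument("--fmt", choices=["yaml", "json", "txt", "dict"], default="yaml")
    parser.add_argument("--output-dir", default=".")
    parser.add_argument("--output-file", default="schema.yml")
    parser.add_argument("--max-workers", type=int, default=os.cpu_count())
    parser.add_argument("--retries", type=int, default=3)
    parser.add_argument("--timeout-seconds", type=int, default=180)
    # Assumption: verbosity is modeled as a boolean switch here.
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["tests/data/csv/sales_header.csv", "--fmt", "json"])
print(args.detection_mode, args.coverage_mode, args.fmt, args.sample_records)
```

Flags that are omitted fall back to the documented defaults, so the example resolves to trust_hint detection, max coverage, and 500 sampled records while only the output format is overridden.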

CSV knobs (MVP): --csv.header {auto,true,false} (auto flips to true when confidence ≥ 0.80), --csv.delimiter, --csv.quote, --csv.escape, --encoding (utf-8/utf-8-sig/utf-16le/utf-16be).
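
The README does not say how the header confidence is computed, so the sketch below is one hypothetical heuristic, not the library's method: among columns whose body values are all numeric, it counts the fraction whose first-row cell is non-numeric. The names header_confidence and HEADER_THRESHOLD are assumptions; only the 0.80 cutoff comes from the text above.

```python
import csv
import io

HEADER_THRESHOLD = 0.80  # cutoff taken from the --csv.header auto rule


def _is_number(value: str) -> bool:
    """Return True when the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def header_confidence(text: str, delimiter: str = ",") -> float:
    """Score in [0, 1]: share of numeric columns whose first cell is non-numeric."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if len(rows) < 2:
        return 0.0
    first, body = rows[0], rows[1:]
    votes = 0
    numeric_columns = 0
    for index, cell in enumerate(first):
        column = [row[index] for row in body if index < len(row) and row[index] != ""]
        if column and all(_is_number(value) for value in column):
            numeric_columns += 1
            if not _is_number(cell):
                votes += 1
    return votes / numeric_columns if numeric_columns else 0.0


with_header = "order_id,amount,region\n1,9.5,EU\n2,3.0,US\n"
no_header = "1,9.5,EU\n2,3.0,US\n3,4.2,EU\n"

for sample in (with_header, no_header):
    score = header_confidence(sample)
    print(score, score >= HEADER_THRESHOLD)
```

A file with only text columns yields no numeric columns and therefore a score of 0.0 under this heuristic, which is one reason a real implementation would likely combine several signals.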

schema-classifier/
├─ README.md
├─ LICENSE
├─ pyproject.toml                   # packaging, deps, console script
├─ .gitignore
├─ src/
│  └─ pyschemaclassifier/          # library: prefer 'PySchemaClassifier' (or 'open_pyschemaclassifier' if name taken)
│     ├─ __init__.py
│     ├─ cli.py                    # CLI: schema-detect entrypoint
│     ├─ infer.py                  # Orchestrator: classify → detect → normalize → emit
│     ├─ config.py                 # Config model + load/merge logic (flags override YAML)
│     ├─ logging_utils.py          # Colored logs, JSON logs, verbosity levels
│     ├─ exceptions.py             # ArgumentError + taxonomy (DetectionError, etc.)
│     ├─ models/
│     │  ├─ __init__.py
│     │  └─ schema.py              # Normalized schema model + Spark StructType JSON conversion
│     ├─ detection/
│     │  ├─ __init__.py
│     │  ├─ classifier.py          # extension/magic bytes / table markers (delta/_delta_log, iceberg metadata.json, .hoodie)
│     │  ├─ compression.py         # gzip/bz2/xz/zstd/zip one-level handling; size/member caps; corruption checks
│     │  ├─ sampling.py            # Sampling state machine (records/bytes, coverage_mode, error budget)
│     │  ├─ csv.py                 # Basic delimiter/quote/escape/BOM/encoding; header auto w/ ≥0.80
│     │  ├─ json.py                # NDJSON vs JSON object/array; recursive inference; unions off by default
│     │  ├─ xml.py                 # Basic element→object; arrays via repeated elements (iterparse)
│     │  ├─ parquet.py             # Footer-based extraction via pyarrow; logical type mapping
│     │  ├─ avro.py                # Schema extraction via fastavro
│     │  ├─ orc.py                 # Schema via pyorc
│     │  ├─ delta.py               # Latest snapshot from _delta_log JSON (names in schema, IDs in metadata)
│     │  ├─ iceberg.py             # Parse metadata.json; partition transforms to metadata
│     │  └─ hudi.py                # COW support; raise TableFormatError for MOR
│     ├─ writers/
│     │  ├─ __init__.py
│     │  ├─ yaml.py                # schema.yml writer (default)
│     │  ├─ json.py                # Pretty JSON (schema + meta)
│     │  ├─ txt.py                 # Human-friendly text summary
│     │  └─ struct.py              # Return dict exactly matching Spark StructType.jsonValue()
│     ├─ dataframe/
│     │  ├─ __init__.py
│     │  ├─ pandas.py              # detect_schema_from_df(pd.DataFrame)
│     │  └─ spark.py               # detect_schema_from_df(Spark DataFrame)
│     └─ utils/
│        ├─ __init__.py
│        ├─ io.py                  # Safe open/stream, retries (3), timeouts (180s), size pre-checks (50 GB)
│        ├─ path.py                # Path utilities, dir traversal, per-file sampling selection
│        └─ metrics.py             # Confidence scoring; delimiter stability; provenance/meta helpers
├─ tests/
│  ├─ conftest.py
│  ├─ unit/
│  │  ├─ test_cli.py
│  │  ├─ test_config.py
│  │  ├─ test_exceptions.py
│  │  ├─ test_sampling.py
│  │  ├─ test_csv.py
│  │  ├─ test_json.py
│  │  ├─ test_xml.py
│  │  ├─ test_parquet.py
│  │  ├─ test_avro.py
│  │  ├─ test_orc.py
│  │  ├─ test_delta.py
│  │  ├─ test_iceberg.py
│  │  └─ test_hudi.py
│  ├─ integration/
│  │  ├─ test_directory_mode.py
│  │  ├─ test_zip_container.py
│  │  └─ test_parallel_sampling.py
│  ├─ data/                        # Small fixtures per format + compression (MVP-focused)
│  │  ├─ csv/
│  │  │  ├─ sales_header.csv
│  │  │  ├─ sales_no_header.csv
│  │  │  ├─ sales_utf8_sig.csv
│  │  │  ├─ sales.csv.gz
│  │  │  └─ very_wide.csv
│  │  ├─ json/
│  │  │  ├─ events.ndjson
│  │  │  └─ events.ndjson.gz
│  │  ├─ xml/
│  │  │  └─ books.xml
│  │  ├─ parquet/
│  │  │  └─ sample.parquet
│  │  ├─ avro/
│  │  │  └─ sample.avro
│  │  ├─ orc/
│  │  │  └─ sample.orc
│  │  ├─ delta/
│  │  │  └─ _delta_log/           # minimal commit/checkpoint JSONs
│  │  ├─ iceberg/
│  │  │  └─ metadata.json
│  │  ├─ hudi/
│  │  │  └─ .hoodie/              # COW minimal markers
│  │  └─ containers/
│  │     └─ sample.zip            # Small multi-entry zip within 500MB limit
│  └─ golden/
│     ├─ csv_sales_header.schema.yml
│     ├─ csv_sales_no_header.schema.yml
│     ├─ json_events_ndjson.schema.yml
│     ├─ parquet_sample.schema.yml
│     └─ … (one per format/compression)
├─ examples/
│  ├─ configs/
│  │  └─ mvp_defaults.yml         # Shows all overridable knobs; flags override
│  ├─ cli/
│  │  ├─ detect_csv.sh            # Illustrative commands (no actual code execution here)
│  │  ├─ detect_json.sh
│  │  └─ detect_parquet.sh
│  └─ api/
│     ├─ detect_from_path.md      # Usage examples (text only)
│     └─ detect_from_df.md        # Spark/pandas API examples (text only)
└─ docs/
   ├─ mvp_overview.md             # Short doc; full docs post-MVP
   └─ config_reference.md         # Flags and YAML keys
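
The classifier.py entry in the layout above mentions extension, magic-byte, and table-marker checks. A minimal sketch of that idea follows, assuming the standard file signatures for each format; the function name classify, the lookup order, and the fallback labels are hypothetical and may differ from the actual module. The Iceberg check mirrors the metadata.json marker named in the layout.

```python
import tempfile
from pathlib import Path

# Leading byte signatures for the formats named in this README.
MAGIC = [
    (b"PAR1", "parquet"),
    (b"Obj\x01", "avro"),
    (b"ORC", "orc"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"PK\x03\x04", "zip"),
]

# Directory entries that mark a table format, as listed in the layout.
TABLE_MARKERS = [
    ("_delta_log", "delta"),
    (".hoodie", "hudi"),
    ("metadata.json", "iceberg"),
]


def classify(path) -> str:
    """Guess a format from table markers, then magic bytes, then the extension."""
    p = Path(path)
    if p.is_dir():
        for marker, fmt in TABLE_MARKERS:
            if (p / marker).exists():
                return fmt
        return "directory"
    with p.open("rb") as handle:
        head = handle.read(8)
    for magic, fmt in MAGIC:
        if head.startswith(magic):
            return fmt
    # Text formats such as CSV have no signature, so fall back to the suffix.
    return p.suffix.lstrip(".").lower() or "unknown"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data.bin").write_bytes(b"PAR1" + b"\x00" * 8)
    (root / "sales.csv").write_text("id,amount\n1,9.5\n")
    (root / "table" / "_delta_log").mkdir(parents=True)
    results = [
        classify(root / "data.bin"),
        classify(root / "sales.csv"),
        classify(root / "table"),
    ]

print(results)
```

Checking content before the extension means a Parquet file with a misleading name is still recognized, which is the situation the verify_hint and auto_detect modes appear designed for.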

