Detect and infer schemas from files/dirs/DataFrames; emit YAML/JSON/TXT/Spark StructType

These details have not been verified by PyPI

Project description

schema-classifier

PySchemaClassifier is a Python library and CLI that detects file/table/DataFrame formats, infers or extracts schemas, and emits them as a Spark StructType-like dict, YAML, JSON, or TXT. The MVP focuses on single-level compression, core formats (CSV/JSON/XML/Parquet/Avro/ORC plus Delta/Iceberg/Hudi metadata), sampling policies, and robust exceptions.
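
To make "Spark StructType-like dict" concrete, here is a minimal sketch of inferring such a dict from a few in-memory records. It is not the library's implementation: the function name `infer_struct`, the type mapping, and the widening rules are assumptions for illustration only. The output shape follows what Spark's `StructType.jsonValue()` produces (`type`, `fields`, and per-field `name`, `type`, `nullable`, `metadata`).

```python
from typing import Any, Dict, List, Set

# Hypothetical mapping from Python value types to Spark type names.
_SPARK_TYPES = {bool: "boolean", int: "long", float: "double", str: "string"}


def infer_struct(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a dict shaped like Spark's StructType.jsonValue() from sample records."""
    seen: Dict[str, Set[str]] = {}
    has_null: Dict[str, bool] = {}
    for rec in records:
        for name, value in rec.items():
            seen.setdefault(name, set())
            if value is None:
                has_null[name] = True
            else:
                # Unknown Python types fall back to "string" (an assumption).
                seen[name].add(_SPARK_TYPES.get(type(value), "string"))

    fields = []
    for name, types in seen.items():
        # A field missing from some records is treated as nullable.
        missing = any(name not in rec for rec in records)
        if types == {"long", "double"}:
            spark_type = "double"  # widen mixed int/float columns
        elif len(types) == 1:
            spark_type = next(iter(types))
        else:
            spark_type = "string"  # conflicting or all-null columns
        fields.append(
            {
                "name": name,
                "type": spark_type,
                "nullable": has_null.get(name, False) or missing,
                "metadata": {},
            }
        )
    return {"type": "struct", "fields": fields}


records = [
    {"id": 1, "price": 9.5, "tag": "a"},
    {"id": 2, "price": 3, "tag": None},
]
schema = infer_struct(records)
```

With these two records, `price` widens to `double` and `tag` is marked nullable because one value is `None`.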

Status

This is a design-locked skeleton for MVP implementation. Modules are scaffolded with docstrings and TODO markers.

Quick Start

# create and activate venv
python -m venv .venv
source .venv/bin/activate
## or, on Windows
.\.venv\Scripts\activate

# editable install (quote the extras so shells such as zsh do not expand the brackets)
pip install -e .
pip install -e ".[orc]"
pip install -e ".[dataframe]"

# run CLI (prints skeleton info)
schema-detect --help

# Try the commands below to exercise the framework

## schema-detect defaults
# --fmt: yaml
# --output-dir .
# --output-file schema.yml

schema-detect tests/data/csv/sales_header.csv
schema-detect tests/data/csv/sales_no_header.csv --fmt yaml --output-file schema_no_header.yml
schema-detect tests/data/csv/very_wide.csv --fmt yaml --output-file schema_wide.yml
schema-detect tests/data/csv/sales_utf8_sig.csv --fmt yaml --output-file schema_utf8.yml
schema-detect tests/data/orc/TestOrcFile.testDate1900.orc --fmt yaml --output-file schema_orc.yml
schema-detect tests/data/avro/weather.avro --fmt yaml --output-file schema_avro.yml
schema-detect tests/data/parquet/v0.7.1.all-named-index.parquet --fmt yaml --output-file schema_pqt.yml
schema-detect tests/data/delta/people_countries_delta_dask/ --fmt yaml --output-file schema_delta.yml
schema-detect tests/data/json/events.ndjson --fmt yaml --output-file schema_json.yml
schema-detect tests/data/xml/books.xml --fmt yaml --output-file schema_xml.yml
schema-detect tests/data/csv/ --multi-file-fmt txt
schema-detect tests/data/csv/sales_20250101.csv --fmt json --output-file schema_date.json

## To print the schema on CLI
schema-detect tests/data/json/events.ndjson --fmt dict

## To verify the schemas
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_20260101.csv --fmt txt --output verify_sales_2025_vs_2026.txt
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_20260101.csv --fmt json --output verify_sales_2025_vs_2026.json
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_20260101.csv --fmt yaml --output verify_sales_2025_vs_2026.yml
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_header.csv --fmt json --output verify_sales_2025_vs_header.json
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_header.csv --fmt yaml --output verify_sales_2025_vs_header.yml
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_header.csv --fmt txt --output verify_sales_2025_vs_header.txt
schema-verify ./cli/schema_2025.yml ./cli/schema_2026.yml --fmt txt
schema-verify ./api/schema_orc.yml ./api/schema_avro.yml --fmt json

## To test Python APIs
python .\tests\unit\combined_api_detect_write.py
python .\tests\unit\combined_api_verify.py
python .\tests\unit\detect_df_schemas.py
python .\tests\unit\verify_df_schemas.py

## To build the image (build.ps1 on Windows; build.sh is provided for macOS/Linux)
.\build.ps1 -Target [test|prod]

Key Features

Supports single-file or directory mode (multi-file detection).

Configurable via:

YAML config (--config)
Environment variables (PYSCH_*)
CLI flags (highest precedence)

Outputs schema in multiple formats: yaml, json, txt, or raw dict.
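
The three configuration sources can be pictured as a layered merge. The sketch below is an illustration, not the code in config.py: the function name `merge_config`, the rule that `PYSCH_FMT` maps to the key `fmt`, and the ordering of YAML below environment variables are all assumptions. The source only states that CLI flags have the highest precedence.

```python
from typing import Any, Dict, Mapping, Optional

ENV_PREFIX = "PYSCH_"  # prefix taken from the README; the key mapping is assumed


def merge_config(
    yaml_cfg: Mapping[str, Any],
    env: Mapping[str, str],
    cli_args: Mapping[str, Optional[Any]],
) -> Dict[str, Any]:
    """Merge settings so that later sources override earlier ones."""
    merged: Dict[str, Any] = dict(yaml_cfg)  # lowest precedence (assumed)

    # Hypothetical convention: PYSCH_SAMPLE_RECORDS -> "sample_records".
    for key, value in env.items():
        if key.startswith(ENV_PREFIX):
            merged[key[len(ENV_PREFIX):].lower()] = value

    # CLI flags win, but only when the user actually supplied them.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


effective = merge_config(
    {"fmt": "yaml", "sample_records": 500},
    {"PYSCH_FMT": "json", "HOME": "/home/user"},
    {"fmt": None, "sample_records": 100},
)
```

Here `fmt` comes from the environment because the CLI left it unset, while `sample_records` comes from the CLI. In real use one would pass `os.environ` as `env`.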

Key flags:

--config CONFIG Path to YAML config file for defaults.

--detection-mode {trust_hint,verify_hint,auto_detect} Detection strategy (default: trust_hint).

--coverage-mode {any,max,full} Sampling coverage (default: max).

--sample-records SAMPLE_RECORDS Number of records to sample (default: 500).

--sample-bytes SAMPLE_BYTES Byte-based sampling limit (default: 5MB).

--output-dir OUTPUT_DIR Directory for schema outputs.

--output-file OUTPUT_FILE File name for single-file mode (default: schema.yml).

--fmt {yaml,json,txt,dict} Output format (default: yaml).

--multi-file-fmt MULTI_FILE_FMT Optional suffix for multi-file outputs (e.g., schema → filename.schema.yaml).
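
As a rough picture of how --sample-records and --sample-bytes could interact, the sketch below stops at whichever limit is reached first. That stopping rule, the helper name `sample_lines`, and reading "5MB" as 5 * 1024 * 1024 bytes are assumptions; the README does not spell out the exact policy.

```python
from typing import Iterable, Iterator

DEFAULT_SAMPLE_RECORDS = 500              # default quoted for --sample-records
DEFAULT_SAMPLE_BYTES = 5 * 1024 * 1024    # "5MB", interpreted as MiB (assumption)


def sample_lines(
    lines: Iterable[bytes],
    max_records: int = DEFAULT_SAMPLE_RECORDS,
    max_bytes: int = DEFAULT_SAMPLE_BYTES,
) -> Iterator[bytes]:
    """Yield lines until either the record cap or the byte cap would be exceeded."""
    taken = 0
    used = 0
    for line in lines:
        if taken >= max_records or used + len(line) > max_bytes:
            break
        yield line
        taken += 1
        used += len(line)


by_records = list(sample_lines([b"a\n"] * 10, max_records=3))
by_bytes = list(sample_lines([b"abcd\n"] * 10, max_bytes=12))
```

Because it is a generator, the sketch never reads past the sample, which matters when the input is a large file opened in binary mode.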

Performance & Limits

--zip-max-size ZIP_MAX_SIZE (default: 500MB)

--zip-max-members ZIP_MAX_MEMBERS (default: 100)

--sample-total-bytes-cap SAMPLE_TOTAL_BYTES_CAP (default: 1GB)

--max-file-size MAX_FILE_SIZE (default: 50GB)

--max-workers MAX_WORKERS (default: os.cpu_count())

--retries RETRIES (default: 3)

--timeout-seconds TIMEOUT_SECONDS (default: 180)
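
The zip caps above (--zip-max-size and --zip-max-members) can be checked before any member is extracted, using only the archive's central directory. The following is a sketch under assumptions: `check_zip_limits` and `ZipLimitError` are hypothetical names, and it treats the size cap as a limit on total uncompressed size, which the README does not confirm.

```python
import io
import zipfile
from typing import BinaryIO, Tuple, Union


class ZipLimitError(Exception):
    """Hypothetical error raised when an archive exceeds a configured cap."""


def check_zip_limits(
    source: Union[str, BinaryIO],
    max_size: int = 500 * 1024 * 1024,  # "500MB" default, read as MiB (assumption)
    max_members: int = 100,             # default quoted for --zip-max-members
) -> Tuple[int, int]:
    """Return (member_count, total_uncompressed_bytes) or raise ZipLimitError."""
    with zipfile.ZipFile(source) as zf:
        infos = zf.infolist()  # reads metadata only, no decompression

    if len(infos) > max_members:
        raise ZipLimitError(f"{len(infos)} members exceeds cap of {max_members}")

    total = sum(info.file_size for info in infos)  # uncompressed sizes
    if total > max_size:
        raise ZipLimitError(f"{total} bytes exceeds cap of {max_size}")
    return len(infos), total


# Build a small in-memory archive to exercise the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for i in range(3):
        zf.writestr(f"part{i}.csv", "x" * 10)

stats = check_zip_limits(buf)
```

Checking declared sizes up front is a common guard against oversized archives, although declared sizes can be forged, so a real implementation would likely also bound the bytes actually read.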

Logging

--log-json → Structured JSON logs.

-v, --verbose → Increase verbosity.

CSV-Specific Knobs

--csv.header {auto,true,false} Header detection (auto flips to true if confidence ≥ 0.80).

--csv.delimiter CSV.DELIMITER Custom delimiter.

--csv.quote CSV.QUOTE Quote character.

--csv.escape CSV.ESCAPE Escape character.

--encoding {utf-8,utf-8-sig,utf-16le,utf-16be} File encoding.
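
The `auto` setting of --csv.header depends on a confidence score compared against 0.80. The README does not say how that score is computed, so the sketch below uses a simple stand-in heuristic: among columns whose body values are all numeric, count how many have a non-numeric first cell. The names `header_confidence` and `resolve_header` are hypothetical.

```python
from typing import List

HEADER_THRESHOLD = 0.80  # threshold quoted in the --csv.header description


def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def header_confidence(rows: List[List[str]]) -> float:
    """Fraction of numeric columns whose first cell is not numeric (assumed heuristic)."""
    if len(rows) < 2:
        return 0.0
    first, body = rows[0], rows[1:]
    numeric_cols = 0
    votes = 0
    for i, cell in enumerate(first):
        column = [row[i] for row in body if i < len(row)]
        if column and all(_is_number(v) for v in column):
            numeric_cols += 1
            if not _is_number(cell):
                votes += 1  # text label above numeric data suggests a header
    return votes / numeric_cols if numeric_cols else 0.0


def resolve_header(mode: str, rows: List[List[str]]) -> bool:
    """Map the {auto,true,false} flag value to a final yes/no decision."""
    if mode == "true":
        return True
    if mode == "false":
        return False
    return header_confidence(rows) >= HEADER_THRESHOLD


with_header = [["id", "amount", "region"], ["1", "9.5", "EU"], ["2", "3.0", "US"]]
without_header = [["1", "9.5", "EU"], ["2", "3.0", "US"]]
```

Explicit `true` or `false` bypasses the heuristic entirely, which is useful for files with all-text columns where this kind of scoring has nothing to go on.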

Test cases

## See this repository for CLI and API examples, with the schema files in their respective folders
git clone https://github.com/aashish72it/schema-classifier-test-cases

schema-classifier/
├─ README.md                        # Project overview, installation, usage
├─ LICENSE                          # License details
├─ build.ps1                        # Build script for Windows
├─ build.sh                         # Build script for macOS/Linux
├─ pyproject.toml                   # Packaging metadata, dependencies, console script entrypoint
├─ .gitignore                       # Ignore build artifacts, venv, etc.
├─ src/
│  └─ pyschemaclassifier/           # Core library code
│     ├─ __init__.py
│     ├─ cli.py                     # CLI entrypoint: parses args, merges config, calls infer
│     ├─ infer.py                   # Orchestrator: classify → detect → normalize → emit
│     ├─ cli_verify.py              # CLI entrypoint for schema verify: parses args, merges config, calls infer
│     ├─ verify.py                  # Orchestrator: verify the schemas of two inputs (files/dirs/pandas or Spark DataFrames)
│     ├─ config.py                  # Config model + load/merge logic (YAML/env/CLI precedence)
│     ├─ logging_utils.py           # Logging helpers (colored, JSON, verbosity)
│     ├─ exceptions.py              # Custom exceptions (ArgumentError, DetectionError, etc.)
│     ├─ models/
│     │  ├─ __init__.py
│     │  └─ schema.py               # Normalized schema model + Spark StructType JSON conversion
│     ├─ detection/                 # Format-specific schema detection
│     │  ├─ __init__.py
│     │  ├─ classifier.py           # Detect file type by extension/magic bytes/table markers
│     │  ├─ compression.py          # Handle gzip/bz2/xz/zstd/zip; size/member caps; corruption checks
│     │  ├─ sampling.py             # Sampling logic (records/bytes, coverage_mode, error budget)
│     │  ├─ csv.py                  # CSV detection: delimiter, header inference, BOM handling
│     │  ├─ json.py                 # NDJSON vs JSON array/object; recursive inference
│     │  ├─ xml.py                  # XML detection: element→object mapping, arrays via repeated tags
│     │  ├─ parquet.py              # Extract schema via PyArrow footer; logical type mapping
│     │  ├─ avro.py                 # Schema extraction via fastavro
│     │  ├─ orc.py                  # Schema extraction via pyorc
│     │  ├─ delta.py                # Delta Lake: parse _delta_log JSON for latest snapshot
│     │  ├─ iceberg.py              # Iceberg: parse metadata.json; partition transforms
│     │  └─ hudi.py                 # Hudi: COW support; raise TableFormatError for MOR
│     ├─ writers/                   # Output schema in different formats
│     │  ├─ __init__.py
│     │  ├─ yaml.py                 # Default writer: schema.yml
│     │  ├─ json.py                 # Pretty JSON output (schema + meta)
│     │  ├─ txt.py                  # Human-readable text summary
│     │  └─ struct.py               # Dict matching Spark StructType.jsonValue()
│     ├─ dataframe/                 # Detect schema from in-memory data
│     │  ├─ __init__.py
│     │  ├─ pandas.py               # detect_schema_from_df(pd.DataFrame)
│     │  └─ spark.py                # detect_schema_from_df(Spark DataFrame)
│     └─ utils/                     # Generic helpers
│        ├─ __init__.py
│        ├─ io.py                   # Safe file I/O, retries, timeouts, size checks
│        ├─ path.py                 # Path utilities, directory traversal, sampling selection
│        └─ metrics.py              # Confidence scoring, delimiter stability, provenance/meta helpers
├─ tests/
│  ├─ conftest.py                   # Pytest setup
│  ├─ unit/                         # Unit tests for individual modules
│  │  ├─ combined_cli_verify.py
│  │  ├─ combined_api_verify.py
│  │  ├─ combined_cli_detect.py
│  │  ├─ detect_df_schemas.py
│  │  ├─ verify_df_schemas.py
│  │  └─ combined_api_detect_write.py
│  ├─ integration/                  # End-to-end scenarios
│  │  ├─ test_directory_mode.py
│  │  ├─ test_zip_container.py
│  │  └─ test_parallel_sampling.py
│  ├─ data/                         # Test fixtures for all formats
│  │  ├─ csv/                       # CSV samples (header/no-header, BOM, wide)
│  │  ├─ json/                      # NDJSON samples
│  │  ├─ xml/                       # XML samples
│  │  ├─ parquet/                   # Parquet samples
│  │  ├─ avro/                      # Avro samples
│  │  ├─ orc/                       # ORC samples
│  │  ├─ delta/                     # Delta Lake minimal _delta_log
│  │  ├─ iceberg/                   # Iceberg metadata.json
│  │  ├─ hudi/                      # Hudi .hoodie markers
│  │  └─ containers/                # Compressed multi-entry zip
│  └─ golden/                       # Expected schema outputs for regression tests
├─ examples/
│  ├─ configs/
│  │  └─ mvp_defaults.yml           # Centralized defaults for detection/writing
│  ├─ cli/
│  │  ├─ detect_csv.sh              # Sample CLI commands for CSV
│  │  ├─ detect_json.sh             # Sample CLI commands for JSON
│  │  └─ detect_parquet.sh          # Sample CLI commands for Parquet
│  └─ api/
│     ├─ detect_from_path.md        # Programmatic usage: detect from file path
│     └─ detect_from_df.md          # Programmatic usage: detect from Spark/Pandas DataFrame
└─ docs/
   ├─ mvp_overview.md               # MVP architecture and goals
   └─ config_reference.md           # YAML keys, CLI flags, precedence
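
The tree describes classifier.py as detecting file type by "extension/magic bytes/table markers". The magic-byte part of that can be sketched as a prefix lookup on the first few bytes of a file. The signatures below are the well-known ones for each format; the function name `classify_head` and the return labels are assumptions, and the real module may order or combine its checks differently.

```python
from typing import List, Optional, Tuple

# (signature, label) pairs; each signature is the format's leading magic bytes.
MAGIC_SIGNATURES: List[Tuple[bytes, str]] = [
    (b"PAR1", "parquet"),
    (b"Obj\x01", "avro"),          # Avro object container file
    (b"ORC", "orc"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"PK\x03\x04", "zip"),
]


def classify_head(head: bytes) -> Optional[str]:
    """Return a format label if the leading bytes match a known signature."""
    for signature, label in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return label
    return None  # text formats (CSV/JSON/XML) need content-based detection


def classify_file(path: str) -> Optional[str]:
    """Read only the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return classify_head(fh.read(8))


labels = [
    classify_head(b"PAR1\x15\x04"),
    classify_head(b"Obj\x01\x04\x14"),
    classify_head(b"\x1f\x8b\x08\x00"),
    classify_head(b"id,amount\n"),
]
```

Text formats carry no signature, so they return `None` here and would fall through to extension hints or content sniffing.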

Release history

1.0.1 (this version): Jan 10, 2026
1.0.0: Jan 10, 2026
0.1.1: Jan 2, 2026
0.1.0: Jan 2, 2026

Download files

Source Distribution

schema_classifier-1.0.1.tar.gz (26.4 kB), uploaded Jan 10, 2026, Source

Built Distribution

schema_classifier-1.0.1-py3-none-any.whl (34.1 kB), uploaded Jan 10, 2026, Python 3

File details

Details for the file schema_classifier-1.0.1.tar.gz.

File metadata

Download URL: schema_classifier-1.0.1.tar.gz
Upload date: Jan 10, 2026
Size: 26.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for schema_classifier-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`a77067098a6e33cd80dec53f0f7832b2b0b0cd9399ef7bf6b52cf7144912b4e3`
MD5	`3bcbb8fc135141ecb0f31854c851ae3c`
BLAKE2b-256	`3582d34ff45a14ea4a9361eb18f11ef772277f70e9f6e55eeeb4e6d429e1ec84`


File details

Details for the file schema_classifier-1.0.1-py3-none-any.whl.

File metadata

Download URL: schema_classifier-1.0.1-py3-none-any.whl
Upload date: Jan 10, 2026
Size: 34.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for schema_classifier-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f6d8063757747535ce8e728e01c078d7fd60306117628eb09d4ec413de6d49f5`
MD5	`05a9a19152d05dfffe26c997008f3f0c`
BLAKE2b-256	`b49e2dd9513d6b5f739d0990f5b98583aa1ea507f073ef3ca673a06c6f385a5f`
