Skip to main content

Detect and infer schemas from files/dirs/DataFrames; emit YAML/JSON/TXT/Spark StructType

Project description

schema-classifier

PySchemaClassifier — a Python library and CLI to detect file/table/dataframe formats, infer/extract schemas, and emit schemas (Spark StructType-like dict, YAML, JSON, TXT).MVP focuses on single-level compression, core formats (CSV/JSON/XML/Parquet/Avro/ORC + Delta/Iceberg/Hudi metadata), sampling policies, and robust exceptions.

Status

This is a design-locked skeleton for MVP implementation. Modules are scaffolded with docstrings and TODO markers.

Quick Start

# create and activate venvpip install -e .
 
python -m venv .venv
source .venv/bin/activate 
##or 
.\.venv\Scripts\activate

# editable install
pip install -e .
pip install -e .[orc]
pip install -e .[dataframe]

# run CLI (prints skeleton info)
schema-detect --help

# Try running below commands to test this framework

## schema-detect defaults
# --fmt: yaml
# --output-dir .
# --output-file schema.yml

schema-detect tests/data/csv/sales_header.csv
schema-detect tests/data/csv/sales_no_header.csv --fmt yaml --output-file schema_no_header.yml
schema-detect tests/data/csv/very_wide.csv --fmt yaml --output-file schema_wide.yml
schema-detect tests/data/csv/sales_utf8_sig.csv --fmt yaml --output-file schema_utf8.yml
schema-detect tests/data/orc/TestOrcFile.testDate1900.orc --fmt yaml --output-file schema_orc.yml
schema-detect tests/data/avro/weather.avro --fmt yaml --output-file schema_avro.yml
schema-detect tests/data/parquet/v0.7.1.all-named-index.parquet --fmt yaml --output-file schema_pqt.yml
schema-detect tests/data/delta/people_countries_delta_dask/ --fmt yaml --output-file schema_delta.yml
schema-detect tests/data/json/events.ndjson --fmt yaml --output-file schema_json.yml
schema-detect tests/data/xml/books.xml --fmt yaml --output-file schema_xml.yml
schema-detect tests/data/csv/ --multi-file-fmt txt
schema-detect tests/data/csv/sales_20250101.csv --fmt json --output-file schema_date.json

## To print the schema on CLI
schema-detect tests/data/json/events.ndjson --fmt dict

## To verify the schemas
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_20260101.csv --fmt txt --output verify_sales_2025_vs_2026.txt
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_20260101.csv --fmt json --output verify_sales_2025_vs_2026.json
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_20260101.csv --fmt yaml --output verify_sales_2025_vs_2026.yml
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_header.csv --fmt json --output verify_sales_2025_vs_header.json
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_header.csv --fmt yaml --output verify_sales_2025_vs_header.yml
schema-verify ./tests/data/csv/sales_20250101.csv ./tests/data/csv/sales_header.csv --fmt txt --output verify_sales_2025_vs_header.txt
schema-verify ./cli/schema_2025.yml ./cli/schema_2026.yml --fmt txt
schema-verify ./api/schema_orc.yml ./api/schema_avro.yml --fmt json
## To test Python APIs
python .\tests\unit\combined_api_detect_write.py
python .\tests\unit\combined_api_verify.py
python .\tests\unit\detect_df_schemas.py
python .\tests\unit\verify_df_schemas.py
## To build the image
.\build.ps1 -Target [test|prod]
##or 

Key Features

Supports single file or directory mode (multi-file detection). Configurable via:

  • YAML config (--config)
  • Environment variables (PYSCH_*)
  • CLI flags (highest precedence)
  • Outputs schema in multiple formats: yaml, json, txt, or raw dict.

Key flags :

--config CONFIG Path to YAML config file for defaults.

--detection-mode {trust_hint,verify_hint,auto_detect} Detection strategy (default: trust_hint).

--coverage-mode {any,max,full} Sampling coverage (default: max).

--sample-records SAMPLE_RECORDS Number of records to sample (default: 500).

--sample-bytes SAMPLE_BYTES Byte-based sampling limit (default: 5MB).

--output-dir OUTPUT_DIR Directory for schema outputs.

--output-file OUTPUT_FILE File name for single-file mode (default: schema.yml).

--fmt {yaml,json,txt,dict} Output format (default: yaml).

--multi-file-fmt MULTI_FILE_FMT Optional suffix for multi-file outputs (e.g., schema → filename.schema.yaml).

Performance & Limits

--zip-max-size ZIP_MAX_SIZE (default: 500MB)

--zip-max-members ZIP_MAX_MEMBERS (default: 100)

--sample-total-bytes-cap SAMPLE_TOTAL_BYTES_CAP (default: 1GB)

--max-file-size MAX_FILE_SIZE (default: 50GB)

--max-workers MAX_WORKERS (default: os.cpu_count())

--retries RETRIES (default: 3)

--timeout-seconds TIMEOUT_SECONDS (default: 180)

Logging

--log-json → Structured JSON logs.

-v, --verbose → Increase verbosity.

CSV-Specific Knobs

--csv.header {auto,true,false} Header detection (auto flips to true if confidence ≥ 0.80).

--csv.delimiter CSV.DELIMITER Custom delimiter.

--csv.quote CSV.QUOTE Quote character.

--csv.escape CSV.ESCAPE Escape character.

--encoding {utf-8,utf-8-sig,utf-16le,utf-16be} File encoding.

Test cases

## Check this repository for examples for CLI & API and the schema files in respective folders
git checkout https://github.com/aashish72it/schema-classifier-test-cases
schema-classifier/
├─ README.md                        # Project overview, installation, usage
├─ LICENSE                          # License details
├─ build.ps1                        # build in windows
├─ build.sh                         # build in mac/linux
├─ pyproject.toml                   # Packaging metadata, dependencies, console script entrypoint
├─ .gitignore                       # Ignore build artifacts, venv, etc.
├─ src/
│  └─ pyschemaclassifier/           # Core library code
│     ├─ __init__.py
│     ├─ cli.py                     # CLI entrypoint: parses args, merges config, calls infer
│     ├─ infer.py                   # Orchestrator: classify → detect → normalize → emit
│     ├─ cli_verify.py              # CLI entrypoint for schema verify: parses args, merges config, calls infer
│     ├─ verify.py                  # Orchestrator: verify the schemas for 2 inputs(files/dir/pandas df/spark df)
│     ├─ config.py                  # Config model + load/merge logic (YAML/env/CLI precedence)
│     ├─ logging_utils.py           # Logging helpers (colored, JSON, verbosity)
│     ├─ exceptions.py              # Custom exceptions (ArgumentError, DetectionError, etc.)
│     ├─ models/
│     │  ├─ __init__.py
│     │  └─ schema.py               # Normalized schema model + Spark StructType JSON conversion
│     ├─ detection/                 # Format-specific schema detection
│     │  ├─ __init__.py
│     │  ├─ classifier.py           # Detect file type by extension/magic bytes/table markers
│     │  ├─ compression.py          # Handle gzip/bz2/xz/zstd/zip; size/member caps; corruption checks
│     │  ├─ sampling.py             # Sampling logic (records/bytes, coverage_mode, error budget)
│     │  ├─ csv.py                  # CSV detection: delimiter, header inference, BOM handling
│     │  ├─ json.py                 # NDJSON vs JSON array/object; recursive inference
│     │  ├─ xml.py                  # XML detection: element→object mapping, arrays via repeated tags
│     │  ├─ parquet.py              # Extract schema via PyArrow footer; logical type mapping
│     │  ├─ avro.py                 # Schema extraction via fastavro
│     │  ├─ orc.py                  # Schema extraction via pyorc
│     │  ├─ delta.py                # Delta Lake: parse _delta_log JSON for latest snapshot
│     │  ├─ iceberg.py              # Iceberg: parse metadata.json; partition transforms
│     │  └─ hudi.py                 # Hudi: COW support; raise TableFormatError for MOR
│     ├─ writers/                   # Output schema in different formats
│     │  ├─ __init__.py
│     │  ├─ yaml.py                 # Default writer: schema.yml
│     │  ├─ json.py                 # Pretty JSON output (schema + meta)
│     │  ├─ txt.py                  # Human-readable text summary
│     │  └─ struct.py               # Dict matching Spark StructType.jsonValue()
│     ├─ dataframe/                 # Detect schema from in-memory data
│     │  ├─ __init__.py
│     │  ├─ pandas.py               # detect_schema_from_df(pd.DataFrame)
│     │  └─ spark.py                # detect_schema_from_df(Spark DataFrame)
│     └─ utils/                     # Generic helpers
│        ├─ __init__.py
│        ├─ io.py                   # Safe file I/O, retries, timeouts, size checks
│        ├─ path.py                 # Path utilities, directory traversal, sampling selection
│        └─ metrics.py              # Confidence scoring, delimiter stability, provenance/meta helpers
├─ tests/
│  ├─ conftest.py                   # Pytest setup
│  ├─ unit/                         # Unit tests for individual modules
│  │  ├─ combined_cli_verify.py
│  │  ├─ combined_api_verify.py
│  │  ├─ combined_cli_detect.py
│  │  ├─ detect_df_schemas.py
│  │  ├─ verify_df_schemas.py
│  │  └─ combined_api_detect_write.py
│  ├─ integration/                  # End-to-end scenarios
│  │  ├─ test_directory_mode.py
│  │  ├─ test_zip_container.py
│  │  └─ test_parallel_sampling.py
│  ├─ data/                         # Test fixtures for all formats
│  │  ├─ csv/                       # CSV samples (header/no-header, BOM, wide)
│  │  ├─ json/                      # NDJSON samples
│  │  ├─ xml/                       # XML samples
│  │  ├─ parquet/                   # Parquet samples
│  │  ├─ avro/                      # Avro samples
│  │  ├─ orc/                       # ORC samples
│  │  ├─ delta/                     # Delta Lake minimal _delta_log
│  │  ├─ iceberg/                   # Iceberg metadata.json
│  │  ├─ hudi/                      # Hudi .hoodie markers
│  │  └─ containers/                # Compressed multi-entry zip
│  └─ golden/                       # Expected schema outputs for regression tests
├─ examples/
│  ├─ configs/
│  │  └─ mvp_defaults.yml           # Centralized defaults for detection/writing
│  ├─ cli/
│  │  ├─ detect_csv.sh              # Sample CLI commands for CSV
│  │  ├─ detect_json.sh             # Sample CLI commands for JSON
│  │  └─ detect_parquet.sh          # Sample CLI commands for Parquet
│  └─ api/
│     ├─ detect_from_path.md        # Programmatic usage: detect from file path
│     └─ detect_from_df.md          # Programmatic usage: detect from Spark/Pandas DataFrame
└─ docs/
   ├─ mvp_overview.md               # MVP architecture and goals
   └─ config_reference.md           # YAML keys, CLI flags, precedence



Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schema_classifier-1.0.1.tar.gz (26.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

schema_classifier-1.0.1-py3-none-any.whl (34.1 kB view details)

Uploaded Python 3

File details

Details for the file schema_classifier-1.0.1.tar.gz.

File metadata

  • Download URL: schema_classifier-1.0.1.tar.gz
  • Upload date:
  • Size: 26.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for schema_classifier-1.0.1.tar.gz
Algorithm Hash digest
SHA256 a77067098a6e33cd80dec53f0f7832b2b0b0cd9399ef7bf6b52cf7144912b4e3
MD5 3bcbb8fc135141ecb0f31854c851ae3c
BLAKE2b-256 3582d34ff45a14ea4a9361eb18f11ef772277f70e9f6e55eeeb4e6d429e1ec84

See more details on using hashes here.

File details

Details for the file schema_classifier-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for schema_classifier-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f6d8063757747535ce8e728e01c078d7fd60306117628eb09d4ec413de6d49f5
MD5 05a9a19152d05dfffe26c997008f3f0c
BLAKE2b-256 b49e2dd9513d6b5f739d0990f5b98583aa1ea507f073ef3ca673a06c6f385a5f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page