pyde-toolkit

A growing toolkit of data-engineering helper functions and CLI commands. The first tool is schema inference: standardise column names, infer data types from sample data, and emit ready-to-use schema definitions and CREATE TABLE DDL — either Pandas/ANSI SQL or PySpark/Spark SQL (with bronze/silver/gold layer support for Databricks / Unity Catalog workflows).

Install

pip install pyde-toolkit

# with Excel support (.xlsx, .xls, .xlsm, .xlsb, .ods)
pip install "pyde-toolkit[excel]"

# with the pre-flight memory check for large full-file reads
pip install "pyde-toolkit[memcheck]"

# everything
pip install "pyde-toolkit[all]"

Already have it installed and want the latest release?

pip install --upgrade pyde-toolkit

Quick start — Python

from pyde_toolkit import infer_file          # top-level convenience re-export
# or, namespaced (recommended as the toolkit grows):
from pyde_toolkit.schema_inferencer import infer_file

result = infer_file(
    "sales.csv",
    pyspark=True,
    casing="pascal",
    table_name="sales_fact",
    header_row=0,        # skip junk title rows if needed, e.g. header_row=4
    type_threshold=0.95, # tolerate a few dirty values before falling back to string
    encoding=None,       # auto-detected; override only if it picks the wrong one
)

print(result["schema"])         # PySpark StructType or pandas dtype dict
print(result["create_table"])   # SQL DDL
print(result["rename_code"])    # copy-paste column rename snippet
print(result["cleaning_code"])  # copy-paste SAP-style numeric cleanup snippet, if needed
print(result["report"])         # full formatted text report
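The type_threshold argument above is described as tolerating a few dirty values before falling back to string. As a rough illustration of that idea only (this is not the toolkit's actual implementation; infer_column_type and its helpers are hypothetical names, and only int and float are considered), a conformance-threshold check might look like this:

```python
from typing import Callable, Iterable, Tuple


def _is_int(value: str) -> bool:
    """Return True if the string parses as a base-10 integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_float(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Candidate types, narrowest first. Hypothetical; the real tool also handles
# bool, date and datetime according to its feature list.
_CANDIDATES: Tuple[Tuple[str, Callable[[str], bool]], ...] = (
    ("int", _is_int),
    ("float", _is_float),
)


def infer_column_type(values: Iterable[str], type_threshold: float = 0.95) -> str:
    """Pick the narrowest type whose share of conforming non-empty values
    meets the threshold; otherwise fall back to "string"."""
    non_empty = [v.strip() for v in values if v is not None and v.strip()]
    if not non_empty:
        return "string"
    for type_name, check in _CANDIDATES:
        conforming = sum(1 for v in non_empty if check(v))
        if conforming / len(non_empty) >= type_threshold:
            return type_name
    return "string"


# 19 of 20 values are integers, so the column is 95% conforming.
sample = [str(n) for n in range(19)] + ["N/A"]
print(infer_column_type(sample, type_threshold=0.95))
print(infer_column_type(sample, type_threshold=0.99))
```

With the looser threshold the single "N/A" is tolerated and the column is reported as an integer; with the stricter one it falls back to string.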

Quick start — CLI

pyde-toolkit schema-infer sales.csv
pyde-toolkit schema-infer sales.csv --pyspark true --case pascal --layer bronze --catalog prod
pyde-toolkit schema-infer sales.xlsx --sheet Sheet2 --layer silver
pyde-toolkit schema-infer messy.csv --header-row 4 --type-threshold 0.80
pyde-toolkit schema-infer legacy.csv --encoding cp1252
pyde-toolkit --version

Run pyde-toolkit schema-infer --help for the full flag reference, or see the module docstring in pyde_toolkit/schema_inferencer/core.py.
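
The --encoding override shown above exists because auto-detection can guess wrong. The sketch below illustrates one plausible shape for a "BOM sniffing, then fallback chain" detector using only the standard library. It is an assumption-laden illustration, not the toolkit's code: detect_encoding and the fallback order are hypothetical, and the optional charset-normalizer/chardet step is deliberately left out.

```python
import codecs
from typing import Optional, Sequence

# BOMs are checked longest-first so a UTF-32 LE BOM (ff fe 00 00)
# is not mistaken for a UTF-16 LE BOM (ff fe).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(
    raw: bytes,
    override: Optional[str] = None,
    fallbacks: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> str:
    """Hypothetical detector: explicit override, then BOM, then the first
    fallback codec that decodes the bytes without error."""
    if override:
        return override
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    for name in fallbacks:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so it is a safe last resort.
    return "latin-1"


print(detect_encoding("Müller".encode("cp1252")))
print(detect_encoding(codecs.BOM_UTF8 + b"id,name"))
```

A fallback chain like this never raises, but it can still pick a codec that decodes cleanly to the wrong characters, which is the case an explicit override is for.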

Features

  • Column standardisation — camel, pascal, snake, screaming, kebab, or skip casing, with symbol expansion (/ → or, % → pct, etc.)
  • Type inference — bool, int32/int64, float, date, datetime, string, with a configurable conformance threshold (--type-threshold) to tolerate dirty data
  • Header offset — --header-row to skip junk/title rows above the real header, for both CSV and Excel
  • Robust encoding handling — CSV text encoding is auto-detected (BOM sniffing → optional charset-normalizer/chardet → fallback chain), so Windows/Excel/SAP exports in cp1252 no longer crash with a UnicodeDecodeError; override with --encoding if needed
  • SAP-style numeric cleanup — thousands-separator commas and trailing minus signs (e.g. "20,900.73", "130,166.00-") are detected and a copy-pasteable cleaning snippet is generated (result["cleaning_code"])
  • Dual output modes — Pandas dtypes + ANSI SQL, or PySpark StructType + Spark SQL
  • Layered outputs — bronze, parquet_bronze, silver, gold, gold_vw (view), or all five at once
  • Table types — managed Delta, external, or external Delta tables
  • Flexible input — CSV/TSV (delimiter auto-detected), Excel (.xlsx .xls .xlsm .xlsb .ods), or a pandas DataFrame directly
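
Two of the bullets above, column standardisation and SAP-style numeric cleanup, can be sketched in a few lines of plain Python. Everything here is illustrative: to_snake_case, clean_sap_number and the SYMBOL_WORDS mapping are hypothetical names, and the snippet the toolkit actually generates in result["cleaning_code"] may look quite different.

```python
import re
from typing import Optional

# Assumed symbol-to-word mapping, based on the examples in the feature list.
SYMBOL_WORDS = {"/": " or ", "%": " pct "}


def to_snake_case(name: str) -> str:
    """Expand known symbols, split camelCase boundaries, and join the
    remaining alphanumeric words with underscores."""
    for symbol, word in SYMBOL_WORDS.items():
        name = name.replace(symbol, word)
    # Insert a space at lower/digit-to-upper boundaries: "unitPrice" -> "unit Price".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)


def clean_sap_number(text: Optional[str]) -> Optional[float]:
    """Parse strings such as "20,900.73" or "130,166.00-" into floats.
    Empty or missing values become None."""
    if text is None:
        return None
    value = text.strip()
    if not value:
        return None
    negative = value.endswith("-")      # trailing minus marks a negative amount
    if negative:
        value = value[:-1].strip()
    value = value.replace(",", "")      # drop thousands separators
    number = float(value)
    return -number if negative else number


print(to_snake_case("Ship/Bill To"))
print(to_snake_case("Discount %"))
print(clean_sap_number("20,900.73"))
print(clean_sap_number("130,166.00-"))
```

This sketch assumes a dot as the decimal separator; exports that use a decimal comma would need the two separators swapped before parsing.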

Project structure (for contributors)

src/pyde_toolkit/
├── __init__.py            # top-level re-exports + __version__
├── cli.py                 # top-level CLI dispatcher (registers subcommands)
└── schema_inferencer/     # one subpackage per feature
    ├── __init__.py        # public API for this feature
    ├── core.py            # logic only, no argparse
    └── cli.py             # add_arguments(parser) + run(args) for this feature

Adding a new feature later: create pyde_toolkit/<your_feature>/ with the same three-file shape, then register it with one line in pyde_toolkit/cli.py's build_parser(). No other files need to change.
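
To make the "one line in build_parser()" idea concrete, here is a self-contained sketch of that dispatcher pattern using argparse subparsers. Only the add_arguments(parser) / run(args) shape and the build_parser() name come from the text above; the feature object, its arguments, and the registration loop are assumptions made for illustration.

```python
import argparse
from types import SimpleNamespace


def _add_arguments(parser: argparse.ArgumentParser) -> None:
    """Declare this feature's flags (a reduced, illustrative set)."""
    parser.add_argument("path")
    parser.add_argument("--header-row", type=int, default=0)


def _run(args: argparse.Namespace) -> int:
    """Run the feature; a real implementation would call into core.py."""
    print(f"inferring schema for {args.path} (header row {args.header_row})")
    return 0


# Stand-in for a feature's cli.py module, which exposes the same two callables.
schema_infer_cli = SimpleNamespace(add_arguments=_add_arguments, run=_run)

# Registering a new feature means adding one entry here.
FEATURES = (
    ("schema-infer", schema_infer_cli),
)


def build_parser() -> argparse.ArgumentParser:
    """Build the top-level parser and attach one subcommand per feature."""
    parser = argparse.ArgumentParser(prog="pyde-toolkit")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, feature in FEATURES:
        sub = subparsers.add_parser(name)
        feature.add_arguments(sub)
        sub.set_defaults(func=feature.run)
    return parser


args = build_parser().parse_args(["schema-infer", "sales.csv", "--header-row", "4"])
exit_code = args.func(args)
```

Keeping argparse out of core.py, as the layout above does, means the same logic stays importable from Python without going through the CLI.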

Releasing a new version

Version lives in one place (pyproject.toml); the installed package's __version__ is read live from package metadata, so there's nothing else to keep in sync.

python scripts/bump_version.py patch   # or minor / major / an exact X.Y.Z
python -m build
twine upload dist/*

Anyone with it already installed just runs pip install --upgrade pyde-toolkit — no need to uninstall first.

License

MIT — see LICENSE.
