A growing toolkit of data-engineering helper functions and CLI commands — starting with schema inference (column standardisation, type inference, schema + DDL generation for Pandas/ANSI SQL or PySpark/Spark SQL).
pyde-toolkit
A growing toolkit of data-engineering helper functions and CLI commands.
The first tool is schema inference: standardise column names, infer
data types from sample data, and emit ready-to-use schema definitions and
CREATE TABLE DDL — either Pandas/ANSI SQL or PySpark/Spark SQL (with
bronze/silver/gold layer support for Databricks / Unity Catalog workflows).
Install
pip install pyde-toolkit
# with Excel support (.xlsx, .xls, .xlsm, .xlsb, .ods)
pip install "pyde-toolkit[excel]"
# with the pre-flight memory check for large full-file reads
pip install "pyde-toolkit[memcheck]"
# everything
pip install "pyde-toolkit[all]"
Already have it installed and want the latest release?
pip install --upgrade pyde-toolkit
Quick start — Python
from pyde_toolkit import infer_file # top-level convenience re-export
# or, namespaced (recommended as the toolkit grows):
from pyde_toolkit.schema_inferencer import infer_file
result = infer_file(
    "sales.csv",
    pyspark=True,
    casing="pascal",
    table_name="sales_fact",
    header_row=0,         # skip junk title rows if needed, e.g. header_row=4
    type_threshold=0.95,  # tolerate a few dirty values before falling back to string
)
print(result["schema"]) # PySpark StructType or pandas dtype dict
print(result["create_table"]) # SQL DDL
print(result["rename_code"]) # copy-paste column rename snippet
print(result["report"]) # full formatted text report
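The type_threshold argument above is described as tolerating a few dirty values before falling back to string. One way to picture such a conformance threshold is the minimal sketch below. It is not pyde-toolkit's implementation: the helper name infer_with_threshold, the two candidate types, and the handling of blank values are all assumptions chosen for illustration.

```python
def _is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_with_threshold(values, threshold=0.95):
    """Return 'int', 'float' or 'string' for a list of raw string values.

    Hypothetical helper: a type is accepted when at least `threshold`
    of the non-blank values parse as that type; otherwise fall back to string.
    """
    non_blank = [v.strip() for v in values if v is not None and v.strip()]
    if not non_blank:
        return "string"
    # Try the narrowest type first, then widen.
    for name, check in (("int", _is_int), ("float", _is_float)):
        share = sum(check(v) for v in non_blank) / len(non_blank)
        if share >= threshold:
            return name
    return "string"


sample = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "n/a"]
print(infer_with_threshold(sample, threshold=0.95))  # string (only 9 of 10 conform)
print(infer_with_threshold(sample, threshold=0.80))  # int
```

With one dirty value in ten, the column stays a string at 0.95 but is accepted as an integer at 0.80, which is the trade-off the threshold is meant to expose.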
Quick start — CLI
pyde-toolkit schema-infer sales.csv
pyde-toolkit schema-infer sales.csv --pyspark true --case pascal --layer bronze --catalog prod
pyde-toolkit schema-infer sales.xlsx --sheet Sheet2 --layer silver
pyde-toolkit schema-infer messy.csv --header-row 4 --type-threshold 0.80
pyde-toolkit --version
Run pyde-toolkit schema-infer --help for the full flag reference, or see
the module docstring in pyde_toolkit/schema_inferencer/core.py.
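To make the --header-row idea concrete, here is a small standard-library sketch of reading a CSV whose real header sits below some title rows. It assumes the row index is 0-based (consistent with header_row=0 in the Python quick start, but not confirmed here), and the function name read_with_header_offset is hypothetical rather than part of pyde-toolkit.

```python
import csv
import io


def read_with_header_offset(text, header_row=0):
    """Split CSV text into (header, data_rows).

    Row `header_row` (0-based, an assumption) is treated as the header
    and every row above it is discarded.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if header_row >= len(rows):
        raise ValueError("header_row is beyond the end of the file")
    return rows[header_row], rows[header_row + 1:]


messy = "Quarterly export\nGenerated by finance\n\nid,amount\n1,9.50\n2,12.00\n"
header, data = read_with_header_offset(messy, header_row=3)
print(header)  # ['id', 'amount']
print(data)    # [['1', '9.50'], ['2', '12.00']]
```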
Features
- Column standardisation: camel, pascal, snake, screaming, kebab, or skip casing, with symbol expansion (/ → or, % → pct, etc.)
- Type inference: bool, int32/int64, float, date, datetime, string, with a configurable conformance threshold (--type-threshold) to tolerate dirty data
- Header offset: --header-row to skip junk/title rows above the real header, for both CSV and Excel
- Dual output modes: Pandas dtypes + ANSI SQL, or PySpark StructType + Spark SQL
- Layered outputs: bronze, parquet_bronze, silver, gold, gold_vw (view), or all five at once
- Table types: managed Delta, external, or external Delta tables
- Flexible input: CSV/TSV (delimiter auto-detected), Excel (.xlsx .xls .xlsm .xlsb .ods), or a pandas DataFrame directly
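As an illustration of the column standardisation feature, the sketch below expands symbols and then applies a casing style. It is an approximation written for this README, not the toolkit's own code: the function names, the word-splitting rules, and the SYMBOLS table (beyond the two expansions named above) are assumptions, and the skip casing is left out.

```python
import re

# Only "/" -> "or" and "%" -> "pct" are named in the feature list;
# how the toolkit handles other symbols is not covered here.
SYMBOLS = {"/": " or ", "%": " pct "}


def split_words(name):
    """Expand symbols, then split a raw column name into lowercase words."""
    for symbol, word in SYMBOLS.items():
        name = name.replace(symbol, word)
    # Break camelCase boundaries, then split on anything non-alphanumeric.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", name) if w]


def standardise(name, casing="snake"):
    """Hypothetical helper covering five of the casing styles listed above."""
    words = split_words(name)
    if not words:
        return ""
    if casing == "snake":
        return "_".join(words)
    if casing == "screaming":
        return "_".join(words).upper()
    if casing == "kebab":
        return "-".join(words)
    if casing == "pascal":
        return "".join(w.capitalize() for w in words)
    if casing == "camel":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    raise ValueError(f"unknown casing: {casing}")


print(standardise("Discount %"))             # discount_pct
print(standardise("Yes/No Flag", "pascal"))  # YesOrNoFlag
print(standardise("orderID", "camel"))       # orderId
```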
Project structure (for contributors)
src/pyde_toolkit/
├── __init__.py            # top-level re-exports + __version__
├── cli.py                 # top-level CLI dispatcher (registers subcommands)
└── schema_inferencer/     # one subpackage per feature
    ├── __init__.py        # public API for this feature
    ├── core.py            # logic only, no argparse
    └── cli.py             # add_arguments(parser) + run(args) for this feature
Adding a new feature later: create pyde_toolkit/<your_feature>/ with
the same three-file shape, then register it with one line in
pyde_toolkit/cli.py's build_parser(). No other files need to change.
Releasing a new version
Version lives in one place (pyproject.toml); the installed package's
__version__ is read live from package metadata, so there's nothing else
to keep in sync.
python scripts/bump_version.py patch # or minor / major / an exact X.Y.Z
python -m build
twine upload dist/*
Anyone with it already installed just runs pip install --upgrade pyde-toolkit
— no need to uninstall first.
License
MIT — see LICENSE.
File details
Details for the file pyde_toolkit-1.2.1.tar.gz.
File metadata
- Download URL: pyde_toolkit-1.2.1.tar.gz
- Upload date:
- Size: 30.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8b6c535ee59f71b6b4f7a0d60a2348e225fc26947117e4682b3b48dc27e331fd |
| MD5 | 0a0363bfa6da605e5c31000e7b08e4a4 |
| BLAKE2b-256 | fab7abeb5bc018c77b451adb1a1e222595587e1c9ce76e734daae375a756f33e |
File details
Details for the file pyde_toolkit-1.2.1-py3-none-any.whl.
File metadata
- Download URL: pyde_toolkit-1.2.1-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0ee289534ebf06836c4169338f23fc2906d18a822f4c6ea06bb9a3dcce57c4f4 |
| MD5 | d429bdee13852ba992f04bc921219d4a |
| BLAKE2b-256 | c8578001a59d43f11244b17dd2598ab58b542c924508550290dc5a62dcf83ba5 |