A growing toolkit of data-engineering helper functions and CLI commands — starting with schema inference (column standardisation, type inference, schema + DDL generation for Pandas/ANSI SQL or PySpark/Spark SQL).

cds-pyde-toolkit

A growing toolkit of data-engineering helper functions and CLI commands. The first tool is schema inference: it standardises column names, infers data types from sample data, and emits ready-to-use schema definitions and CREATE TABLE DDL. Output targets either Pandas/ANSI SQL or PySpark/Spark SQL, with bronze/silver/gold layer support for Databricks / Unity Catalog workflows.

Install

pip install cds-pyde-toolkit

# with Excel support (.xlsx, .xls, .xlsm, .xlsb, .ods)
pip install "cds-pyde-toolkit[excel]"

# with the pre-flight memory check for large full-file reads
pip install "cds-pyde-toolkit[memcheck]"

# everything
pip install "cds-pyde-toolkit[all]"

Already have it installed and want the latest release?

pip install --upgrade cds-pyde-toolkit

Quick start — Python

from cds_pyde_toolkit import infer_file          # top-level convenience re-export
# or, namespaced (recommended as the toolkit grows):
from cds_pyde_toolkit.schema_inferencer import infer_file

result = infer_file(
    "sales.csv",
    pyspark=True,
    casing="pascal",
    table_name="sales_fact",
    header_row=0,        # skip junk title rows if needed, e.g. header_row=4
    type_threshold=0.95, # tolerate a few dirty values before falling back to string
    encoding=None,       # auto-detected; override only if it picks the wrong one
)

print(result["schema"])         # PySpark StructType or pandas dtype dict
print(result["create_table"])   # SQL DDL
print(result["rename_code"])    # copy-paste column rename snippet
print(result["cleaning_code"])  # copy-paste SAP-style numeric cleanup snippet, if needed
print(result["report"])         # full formatted text report
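
To make the type_threshold argument concrete, here is a minimal standalone sketch of a conformance threshold. It is not the toolkit's implementation: the function name infer_numeric_or_string and the float-only check are assumptions made for illustration. A column is treated as numeric only if the share of non-empty values that parse as numbers reaches the threshold; otherwise it falls back to string.

```python
def _is_number(text):
    """True if the text parses as a Python float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_numeric_or_string(values, type_threshold=0.95):
    """Hypothetical helper: 'float' if enough values conform, else 'string'."""
    # Ignore missing and blank cells when measuring conformance.
    present = [v for v in values if v is not None and v.strip() != ""]
    if not present:
        return "string"
    conformance = sum(_is_number(v) for v in present) / len(present)
    return "float" if conformance >= type_threshold else "string"


mostly_clean = ["1.5"] * 99 + ["n/a"]      # 99% of values parse as numbers
dirty = ["1.5"] * 17 + ["n/a"] * 3         # 85% of values parse as numbers

print(infer_numeric_or_string(mostly_clean))               # float
print(infer_numeric_or_string(dirty))                      # string
print(infer_numeric_or_string(dirty, type_threshold=0.80)) # float
```

Lowering the threshold trades strictness for tolerance: the same dirty column is typed as numeric once 85% conformance is considered good enough.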

Quick start — CLI

cds-pyde-toolkit schema-infer sales.csv
cds-pyde-toolkit schema-infer sales.csv --pyspark true --case pascal --layer bronze --catalog prod
cds-pyde-toolkit schema-infer sales.xlsx --sheet Sheet2 --layer silver
cds-pyde-toolkit schema-infer messy.csv --header-row 4 --type-threshold 0.80
cds-pyde-toolkit schema-infer legacy.csv --encoding cp1252
cds-pyde-toolkit --version

Run cds-pyde-toolkit schema-infer --help for the full flag reference, or see the module docstring in cds_pyde_toolkit/schema_inferencer/core.py.
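
The --encoding override shown above exists because auto-detection can guess wrong. As a rough illustration of how a detection chain of this kind can work, the sketch below checks for a byte-order mark first and then tries a fixed list of fallback encodings. The function name guess_encoding and the fallback order are assumptions, the optional charset-normalizer/chardet step is left out, and the toolkit's real logic may differ.

```python
import codecs

# BOM prefixes, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, fallbacks=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: pick an encoding for a bytes sample."""
    # Step 1: a BOM is the strongest signal available.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # Step 2: return the first fallback that decodes without error.
    for name in fallbacks:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so it is a safe last resort.
    return "latin-1"


sample = "Müller;100".encode("cp1252")   # not valid UTF-8
print(guess_encoding(sample))            # cp1252
```

A chain like this cannot tell cp1252 from other single-byte encodings, which is why an explicit override remains useful.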

Features

Column standardisation — camel, pascal, snake, screaming, kebab, or skip casing, with symbol expansion (/ → or, % → pct, etc.)
Type inference — bool, int32/int64, float, date, datetime, string, with a configurable conformance threshold (--type-threshold) to tolerate dirty data
Header offset — --header-row to skip junk/title rows above the real header, for both CSV and Excel
Robust encoding handling — CSV text encoding is auto-detected (BOM sniffing → optional charset-normalizer/chardet → fallback chain), so Windows/Excel/SAP exports in cp1252 no longer crash with a UnicodeDecodeError; override with --encoding if needed
SAP-style numeric cleanup — thousands-separator commas and trailing minus signs (e.g. "20,900.73", "130,166.00-") are detected and a copy-pasteable cleaning snippet is generated (result["cleaning_code"])
Dual output modes — Pandas dtypes + ANSI SQL, or PySpark StructType + Spark SQL
Layered outputs — bronze, parquet_bronze, silver, gold, gold_vw (view), or all five at once
Table types — managed Delta, external, or external Delta tables
Flexible input — CSV/TSV (delimiter auto-detected), Excel (.xlsx .xls .xlsm .xlsb .ods), or a pandas DataFrame directly
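
Two of the features above, column standardisation and SAP-style numeric cleanup, can be sketched in a few lines of plain Python. This is an illustration only: the names to_snake, to_pascal, and clean_sap_number are hypothetical, the symbol map contains just the two expansions named above, and the code the toolkit generates in result["cleaning_code"] may look different.

```python
import re

# Only "/" -> "or" and "%" -> "pct" are confirmed above; the rest is assumed.
SYMBOLS = {"/": " or ", "%": " pct "}


def to_snake(name):
    """Expand known symbols, then join the alphanumeric words with underscores."""
    for symbol, word in SYMBOLS.items():
        name = name.replace(symbol, word)
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(w.lower() for w in words)


def to_pascal(name):
    """Build PascalCase from the snake_case form."""
    return "".join(part.capitalize() for part in to_snake(name).split("_"))


def clean_sap_number(text):
    """Drop thousands-separator commas and move a trailing minus to the front."""
    text = text.strip().replace(",", "")
    if text.endswith("-"):
        text = "-" + text[:-1]
    return float(text)


print(to_snake("Debit/Credit"))         # debit_or_credit
print(to_pascal("Net Sales %"))         # NetSalesPct
print(clean_sap_number("20,900.73"))    # 20900.73
print(clean_sap_number("130,166.00-"))  # -130166.0
```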

Project structure (for contributors)

src/cds_pyde_toolkit/
├── __init__.py            # top-level re-exports + __version__
├── cli.py                 # top-level CLI dispatcher (registers subcommands)
└── schema_inferencer/     # one subpackage per feature
    ├── __init__.py        # public API for this feature
    ├── core.py            # logic only, no argparse
    └── cli.py             # add_arguments(parser) + run(args) for this feature

Adding a new feature later: create cds_pyde_toolkit/<your_feature>/ with the same three-file shape, then register it with one line in cds_pyde_toolkit/cli.py's build_parser(). No other files need to change.
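
The registration pattern described above can be sketched with argparse subparsers. Only the names add_arguments, run, and build_parser come from this README; the "hello" feature, the FEATURES registry, and the program name are hypothetical stand-ins, and the real cli.py may be organised differently.

```python
import argparse
from types import SimpleNamespace


def _add_arguments(parser):
    """Declare the flags for the hypothetical 'hello' feature."""
    parser.add_argument("path")
    parser.add_argument("--upper", action="store_true")


def _run(args):
    """Execute the feature using the parsed arguments."""
    text = f"hello from {args.path}"
    return text.upper() if args.upper else text


# Stand-in for a feature subpackage's cli.py module.
hello_feature = SimpleNamespace(add_arguments=_add_arguments, run=_run)

# Registering a new feature is one extra entry here.
FEATURES = {"hello": hello_feature}


def build_parser():
    parser = argparse.ArgumentParser(prog="toolkit-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, feature in FEATURES.items():
        sub = subparsers.add_parser(name)
        feature.add_arguments(sub)          # feature owns its own flags
        sub.set_defaults(run=feature.run)   # dispatcher stays generic
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.run(args)


print(main(["hello", "data.csv", "--upper"]))  # HELLO FROM DATA.CSV
```

Keeping argument declaration and execution inside each feature means the dispatcher never needs to know what a subcommand does.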

Releasing a new version

Version lives in one place (pyproject.toml); the installed package's __version__ is read live from package metadata, so there's nothing else to keep in sync.

python scripts/bump_version.py patch   # or minor / major / an exact X.Y.Z
python -m build
twine upload dist/*

Anyone with it already installed just runs pip install --upgrade cds-pyde-toolkit — no need to uninstall first.

License

MIT — see LICENSE.

Release history

1.5.5 (this version): Jun 25, 2026
1.1.1: Jun 22, 2026
1.1.0: Jun 22, 2026

Download files

Source Distribution

cds_pyde_toolkit-1.5.5.tar.gz (37.8 kB)

Uploaded Jun 25, 2026 Source

Built Distribution

cds_pyde_toolkit-1.5.5-py3-none-any.whl (36.5 kB)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file cds_pyde_toolkit-1.5.5.tar.gz.

File metadata

Download URL: cds_pyde_toolkit-1.5.5.tar.gz
Upload date: Jun 25, 2026
Size: 37.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for cds_pyde_toolkit-1.5.5.tar.gz
Algorithm	Hash digest
SHA256	`68b8be87affa82e526517124ed8bde391fb203e2ed2df79bd444b89d22d93a54`
MD5	`10c7f70f93e8f10b9ce989256e4f5740`
BLAKE2b-256	`ad495ead8c62dea6dc351f327f9e65b77b4e0c57bd9298b927caa9d5ef5a1803`

File details

Details for the file cds_pyde_toolkit-1.5.5-py3-none-any.whl.

File metadata

Download URL: cds_pyde_toolkit-1.5.5-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 36.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for cds_pyde_toolkit-1.5.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f6f538b9b0e2b3f427c09850466a9968df0fc9770da7261058439b747bcbd722`
MD5	`2538620feee412d8086ca70e71124a0e`
BLAKE2b-256	`0e0c300b83e04b9a91c49317e1fe7c48dce33e719cb42eb29ec18f3978a42a7d`
