A growing toolkit of data-engineering helper functions and CLI commands — starting with schema inference (column standardisation, type inference, schema + DDL generation for Pandas/ANSI SQL or PySpark/Spark SQL).

pyde-toolkit

A growing toolkit of data-engineering helper functions and CLI commands. Each tool lives in its own submodule so the package can keep expanding without things colliding.

Tools currently included:

Submodule | What it does
pyde_toolkit.schema_inferencer | Infers column names, data types, schema definitions, and CREATE TABLE/CREATE VIEW DDL from a CSV/TSV/Excel file or a pandas DataFrame already in memory. Outputs Pandas/ANSI SQL or PySpark/Spark SQL, with optional Databricks medallion-layer (bronze/silver/gold) support.
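
To give a feel for what type inference over file columns involves, here is a minimal standard-library sketch. It is not the schema inferencer's actual algorithm: the function name infer_type, the order of the checks, and the single date format are all assumptions made for illustration. It tries progressively looser parsers and returns the first Spark SQL type name that fits every non-empty value.

```python
from datetime import datetime


def infer_type(values):
    """Return a Spark SQL type name that fits every non-empty value.

    Hypothetical helper for illustration; not part of pyde-toolkit.
    """
    cleaned = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cleaned:
        return "string"  # nothing to go on, so fall back to string

    def all_parse(parser):
        # True only if the parser accepts every remaining value.
        for v in cleaned:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "bigint"
    if all_parse(float):
        return "double"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


print(infer_type(["12.5", "8.0"]))
print(infer_type(["Mumbai Plant", "Pune Plant"]))
```

Empty cells are ignored rather than forcing a column to string, which is one common design choice. The documented behaviour of the real tool is in docs/schema_inferencer.md.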

Installation

pip install pyde-toolkit

Reading Excel files (for schema_inferencer) needs the optional extra:

pip install "pyde-toolkit[excel]"

Not yet on PyPI? See "Building, releasing & installing upgrades" below to build and install it locally first.

Quick Start — Schema Inferencer

Pass a DataFrame directly — no file I/O required:

import pandas as pd
from pyde_toolkit.schema_inferencer import infer_file

df = pd.DataFrame({
    "Plant Description": ["Mumbai Plant", "Pune Plant"],
    "ZODI/ZLDI":          ["ZODI", "ZLDI"],
    "Cost %":              [12.5, 8.0],
})

result = infer_file(df, pyspark=True, casing="snake", table_name="plant_master")

print(result["schema"])        # PySpark StructType, ready to paste
print(result["create_table"])  # CREATE TABLE ... USING DELTA
print(result["rename_code"])   # df.withColumnRenamed(...) snippet
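
The column names in this example contain spaces, a slash, and a percent sign, which is what the casing option is meant to clean up. The sketch below shows one plausible way snake and pascal casing could be implemented. The helpers split_words, to_snake, and to_pascal are hypothetical, and the real rules (for example how a symbol such as % is treated) may differ; docs/schema_inferencer.md is the authority.

```python
import re


def split_words(name):
    # Treat any run of non-alphanumeric characters as a word separator.
    return [w for w in re.split(r"[^0-9A-Za-z]+", name) if w]


def to_snake(name):
    # Lower-case every word and join with underscores.
    return "_".join(w.lower() for w in split_words(name))


def to_pascal(name):
    # Capitalise the first letter of every word and join without separators.
    return "".join(w[:1].upper() + w[1:].lower() for w in split_words(name))


for column in ["Plant Description", "ZODI/ZLDI", "Cost %"]:
    print(column, "->", to_snake(column), "/", to_pascal(column))
```

Note that this simple version drops the percent sign entirely, so "Cost %" becomes "cost"; a production implementation might map it to a word instead.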

A top-level convenience import also works for the most common function:

from pyde_toolkit import infer_file

Works the same way for a Spark DataFrame in a Databricks notebook: convert it with toPandas() first (this collects the data to the driver):

result = infer_file(spark_df.toPandas(), pyspark=True, casing="snake",
                     table_name="sales_fact", layer="silver", catalog="prod")

Or from a file path:

result = infer_file("Sales1.csv", casing="pascal")   # Pandas + ANSI SQL by default
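
The table_name, layer, and catalog arguments in the calls above suggest a three-part table name. As a rough sketch only, the code below assumes the layer is used as the schema in a catalog.schema.table name and builds a CREATE TABLE ... USING DELTA statement from a list of columns. The functions qualified_name and create_table_ddl are hypothetical and are not the package's API; the DDL that infer_file() actually returns may be laid out differently.

```python
def qualified_name(table_name, layer=None, catalog=None):
    # Join only the parts that were supplied: catalog.layer.table
    # (assumption: the medallion layer doubles as the schema name).
    return ".".join(part for part in (catalog, layer, table_name) if part)


def create_table_ddl(columns, table_name, layer=None, catalog=None):
    # columns is a list of (column_name, spark_sql_type) pairs.
    body = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    target = qualified_name(table_name, layer, catalog)
    return f"CREATE TABLE {target} (\n{body}\n) USING DELTA"


ddl = create_table_ddl(
    [("plant_description", "STRING"), ("cost", "DOUBLE")],
    table_name="plant_master",
    layer="silver",
    catalog="prod",
)
print(ddl)
```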

Command line

The package installs a single pyde-toolkit command. Each tool is a subcommand:

pyde-toolkit schema-infer Sales1.csv
pyde-toolkit schema-infer Sales1.csv --pyspark true --case pascal
pyde-toolkit schema-infer Sales1.csv --pyspark true --layer all --catalog prod
pyde-toolkit schema-infer --help
pyde-toolkit --version

Full documentation

  • docs/schema_inferencer.md — complete reference for the schema inferencer: every flag/parameter, casing rules, type-inference behaviour, sampling, medallion layers, table types, and the full infer_file() return value.
  • docs/RELEASING.md — step-by-step checklist for making a change, bumping the version, building, publishing, and installing the upgrade.

(As more tools are added, each gets its own docs/<tool_name>.md.)

Adding a new tool to the toolkit

The package is structured so new tools drop in without touching existing ones:

  1. Create src/pyde_toolkit/<your_tool>/ with its own core.py (the logic) and cli.py exposing two functions: add_arguments(parser) to register its flags, and run(args) to execute. See schema_inferencer/cli.py for the pattern.
  2. In src/pyde_toolkit/cli.py, register it as a new subcommand — one subparsers.add_parser(...) call plus your_tool_cli.add_arguments(...). Dispatch is generic, so nothing else needs to change.
  3. Optionally re-export its main function from src/pyde_toolkit/__init__.py for a top-level convenience import.
  4. Add docs/<your_tool>.md and tests/<your_tool>/.
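
The add_arguments(parser) / run(args) pattern from steps 1 and 2 can be sketched with argparse alone. This is an illustration under assumptions, not the package's actual cli.py: the demo-tool subcommand, its flags, and the use of set_defaults(func=...) for dispatch are all made up for the example, and SimpleNamespace stands in for an imported tool module.

```python
import argparse
from types import SimpleNamespace


# Stand-in for a tool's own cli.py: it only knows about its own flags.
def add_arguments(parser):
    parser.add_argument("path", help="input file to process")
    parser.add_argument("--upper", action="store_true", help="upper-case the output")


def run(args):
    text = f"processing {args.path}"
    return text.upper() if args.upper else text


# Hypothetical module object exposing the two required functions.
demo_tool_cli = SimpleNamespace(add_arguments=add_arguments, run=run)


def build_parser():
    parser = argparse.ArgumentParser(prog="toolkit-demo")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # Registering a tool: one add_parser call plus the tool's add_arguments.
    sub = subparsers.add_parser("demo-tool", help="example subcommand")
    demo_tool_cli.add_arguments(sub)
    sub.set_defaults(func=demo_tool_cli.run)  # generic dispatch hook
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)  # no per-tool branching needed here


print(main(["demo-tool", "Sales1.csv", "--upper"]))
```

Because each subparser carries its own run function, main() never needs to change when another tool is registered.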

Building, releasing & installing upgrades

Quick version — see docs/RELEASING.md for the full checklist (versioning rules, publishing options, troubleshooting):

pip install -e ".[dev]"              # 1. dev install
pytest                               # 2. test your changes
#    bump version = "X.Y.Z" in pyproject.toml   # 3. one-line version bump
rm -rf build dist src/*.egg-info     #    clear stale build artifacts first
python -m build                      # 4. build dist/*.whl and dist/*.tar.gz
twine upload dist/*                  # 5. publish (PyPI or your private index)
pip install --upgrade pyde-toolkit   # 6. install the new version

pyde_toolkit.__version__ and pyde-toolkit --version are read automatically from whatever version is installed — no need to edit any source file besides pyproject.toml.
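
One common way to get this behaviour is to read the version from the installed distribution's metadata with the standard library. The sketch below assumes that approach; whether pyde-toolkit does exactly this is not stated here, and the helper installed_version and its default value are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name, default="0.0.0+unknown"):
    # Look up the version recorded when the distribution was installed.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Not installed (for example, running from a bare source checkout).
        return default


print(installed_version("pyde-toolkit"))
```

With this pattern the version string lives only in pyproject.toml, and reinstalling the package is what refreshes the reported value.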

License

MIT — see LICENSE.
