pyde_toolkit

Infer column names, data types, schema definitions, and CREATE TABLE / CREATE VIEW DDL from a CSV/TSV/Excel file — or directly from a pandas DataFrame already in memory (including a Spark DataFrame converted via .toPandas()). Outputs either Pandas/ANSI SQL or PySpark/Spark SQL, with optional Databricks medallion-layer (bronze/silver/gold) support.

Installation

pip install pyde_toolkit

Reading Excel files requires the optional excel extra:

pip install "pyde_toolkit[excel]"

Not yet on PyPI? See Building & Publishing below to build and install it locally first.

Quick Start — pass a DataFrame directly

This is the primary intended use case: no file I/O, just hand it a DataFrame.

import pandas as pd
from pyde_toolkit import infer_file

df = pd.DataFrame({
    "Plant Description": ["Mumbai Plant", "Pune Plant"],
    "ZODI/ZLDI":          ["ZODI", "ZLDI"],
    "Cost %":              [12.5, 8.0],
})

result = infer_file(df, pyspark=True, casing="snake", table_name="plant_master")

print(result["schema"])        # PySpark StructType, ready to paste
print(result["create_table"])  # CREATE TABLE ... USING DELTA
print(result["rename_code"])   # df.withColumnRenamed(...) snippet
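To give a feel for what a casing="snake" rename might produce for the columns above, here is a minimal standalone sketch. The to_snake helper is hypothetical and is not part of pyde_toolkit; the library's actual casing rules are documented in docs/USAGE.md and may treat symbols such as % or / differently.

```python
import re


def to_snake(name: str) -> str:
    """Illustrative snake_case normalisation (an assumption, not pyde_toolkit's rules)."""
    # Insert an underscore at camelCase boundaries: "plantCode" -> "plant_Code"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse each run of non-alphanumeric characters into a single underscore
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Drop leading/trailing underscores and lower-case the result
    return name.strip("_").lower()


columns = ["Plant Description", "ZODI/ZLDI", "Cost %"]
renamed = {col: to_snake(col) for col in columns}

for old, new in renamed.items():
    print(f"{old!r} -> {new!r}")
```

Compare this against result["rename_code"] from the real call to see how the toolkit itself maps each column.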

Works the same way from a Spark DataFrame in a Databricks notebook:

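from pyde_toolkit import infer_file

# spark_df is an existing Spark DataFrame in the notebook session
result = infer_file(spark_df.toPandas(), pyspark=True, casing="snake",
                     table_name="sales_fact", layer="silver", catalog="prod")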
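Because .toPandas() collects the whole DataFrame onto the driver, it can help to cap the row count on the caller's side before converting a large table. The sketch below is one way to do that with the standard PySpark limit() and toPandas() methods. The sample_to_pandas helper and its max_rows default are assumptions for illustration; pyde_toolkit has its own sampling options, described in docs/USAGE.md.

```python
def sample_to_pandas(spark_df, max_rows=10_000):
    """Return at most max_rows rows of a Spark DataFrame as a pandas DataFrame.

    Hypothetical helper, not part of pyde_toolkit.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is lazy, so only the capped rows are collected by toPandas()
    return spark_df.limit(max_rows).toPandas()
```

The result can be passed straight to infer_file(sample_to_pandas(spark_df), pyspark=True, ...). Note that limit() takes the first rows rather than a random sample, so types inferred from the subset may not hold for values that appear later in the table.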

Quick Start — pass a file path

from pyde_toolkit import infer_file

result = infer_file("Sales1.csv", casing="pascal")   # Pandas + ANSI SQL by default

Command line

The same engine is also available as a CLI, installed as pyde_toolkit:

pyde_toolkit Sales1.csv
pyde_toolkit Sales1.csv --pyspark true --case pascal
pyde_toolkit Sales1.csv --pyspark true --layer all --catalog prod
pyde_toolkit --help
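If the CLI needs to be driven from a script (for example in a CI job), the commands above can be assembled programmatically. This is a rough sketch that uses only the flags shown in the examples. The build_command and run_cli helpers are hypothetical, and the assumption that the generated output goes to stdout should be checked against pyde_toolkit --help.

```python
import subprocess


def build_command(path, pyspark=False, case=None, layer=None, catalog=None):
    """Assemble a pyde_toolkit argument list from the flags shown in the README."""
    cmd = ["pyde_toolkit", path]
    if pyspark:
        cmd += ["--pyspark", "true"]
    if case:
        cmd += ["--case", case]
    if layer:
        cmd += ["--layer", layer]
    if catalog:
        cmd += ["--catalog", catalog]
    return cmd


def run_cli(path, **options):
    """Run the CLI and return its stdout (assumes the tool prints its output there)."""
    completed = subprocess.run(
        build_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

For example, build_command("Sales1.csv", pyspark=True, layer="all", catalog="prod") reproduces the third command above.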

Full documentation

See docs/USAGE.md for the complete reference: every flag/parameter, casing rules, type-inference behaviour, sampling, medallion layers, table types, and the full infer_file() return value.

Building & Publishing

This repo is set up as a standard pyproject.toml package, so it can be built and installed without needing PyPI:

# Install locally, editable (changes to source take effect immediately)
pip install -e .

# Or build a wheel/sdist you can distribute internally
pip install build
python -m build              # creates dist/*.whl and dist/*.tar.gz
pip install dist/pyde_toolkit-1.0.1-py3-none-any.whl   # use the version you actually built

To publish to PyPI, so that a plain pip install pyde_toolkit works, you need a PyPI account and an API token. Build the distributions as shown above, then upload them:

pip install twine
twine upload dist/*

(Double-check the name pyde_toolkit isn't already taken on PyPI before publishing — rename it in pyproject.toml if it is.)
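The name check can be scripted against PyPI's public JSON endpoint, which returns 404 for a project that does not exist. The sketch below uses only the standard library; normalize and is_name_taken are hypothetical helper names. PyPI compares names in normalized form (PEP 503), so pyde_toolkit and pyde-toolkit count as the same project. A 404 is only a first check, since PyPI can still reject names it considers too similar to existing ones.

```python
import re
import urllib.error
import urllib.request


def normalize(name: str) -> str:
    """Normalize a project name as PyPI does (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def is_name_taken(name: str, timeout: float = 10.0) -> bool:
    """Return True if a project with this (normalized) name exists on PyPI."""
    url = f"https://pypi.org/pypi/{normalize(name)}/json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # no such project registered
        raise  # any other HTTP error is unexpected; surface it


# Example (performs a network request):
#     print(is_name_taken("pyde_toolkit"))
```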

License

MIT — see LICENSE.
