Advanced Delta Lake-related tooling based on Apache Spark
Reason this release was yanked:
sorry folks, this is the only time this will happen, promise.
Project description
hydro 💧
hydro is a collection of Python-based Apache Spark and Delta Lake tooling.
See Key Functionality for concrete use cases.
Installation
pip install delta-hydro
Docs 📖
https://christophergrant.github.io/delta-hydro
Key Functionality 🔑
- De-duplicate a Delta Lake table, in-place, without a full overwrite - hydro.delta.deduplicate
- Correctly perform Slowly Changing Dimensions (SCD) (types 1 or 2) on Delta Lake tables - hydro.delta.scd and hydro.delta.bootstrap_scd2
- Issue queries against Delta Log metadata, quickly and efficiently getting things like partition sizes on huge tables - hydro.delta.partition_stats
- Other quality of life improvements like hydro.delta.detail_enhanced and hydro.spark.fields
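To make the SCD type 2 idea above concrete, here is a minimal pure-Python sketch of the bookkeeping involved: when a key's value changes, the current row is closed out and a new current row is appended. This is not hydro's implementation or API; `Row`, `apply_scd2`, and the `start`/`end` fields are hypothetical names chosen for illustration, and the sketch uses plain lists rather than Spark or Delta Lake.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Row:
    """One version of a dimension record (hypothetical schema)."""
    key: str
    value: str
    start: date
    end: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[Row], updates: Dict[str, str], as_of: date) -> List[Row]:
    """Return a new history with SCD type 2 semantics applied.

    Changed keys get their current row closed at `as_of` and a new
    current row appended; unseen keys are inserted; unchanged keys
    are left untouched.
    """
    result: List[Row] = []
    for row in history:
        new_value = updates.get(row.key)
        is_current = row.end is None
        if is_current and new_value is not None and new_value != row.value:
            result.append(replace(row, end=as_of))          # close the old version
            result.append(Row(row.key, new_value, as_of))   # open the new version
        else:
            result.append(row)

    existing_keys = {row.key for row in history}
    for key, value in updates.items():
        if key not in existing_keys:
            result.append(Row(key, value, as_of))           # brand-new key
    return result
```

A type 1 update would instead overwrite `value` in place and keep no closed-out rows; the sketch only covers type 2.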
Contributions ✨
Contributions are welcome.
However, please create an issue before starting work on a feature to make sure that it aligns with the future of the project.
Naming 🤓
Originally this project was going to be hydrologist, but that's way too long and pretentious, so we shortened to hydro.
A hydrologist is a person who studies water and its movement. Delta Lake, Data Lake, Lakehouse => water.
ChatGPT and LLMs 🤖
Some of this project's code was generated by a Large Language Model (LLM), namely ChatGPT.
We are proud prompt engineers, so we display the prompt that gave us the code in hydro's source (example).
Our take is that the model is very impressive, but not sophisticated enough to be able to write this whole program (yet). A lot of this stuff is very context-dependent and would be difficult to explain to an AI. Plus, ChatGPT isn't aware of newer APIs as it was trained on an older corpus.
We are excited for the future of humanity given recent advancements in artificial intelligence and hope that the technology is used to liberate, rather than accelerate.
Download files
File details
Details for the file delta_hydro-0.2.1.tar.gz.
File metadata
- Download URL: delta_hydro-0.2.1.tar.gz
- Upload date:
- Size: 9.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | b19694e09713e678c97e7186fff344ce0a0627249ea024a2da63bc46d08cc750
MD5 | a844337e8dd975ea37dc002fa67ea026
BLAKE2b-256 | 96dc2838100ac1353d093ffb7a5f0b2778733861e1b0bdba6b1b5fccda56d3df
File details
Details for the file delta_hydro-0.2.1-py2.py3-none-any.whl.
File metadata
- Download URL: delta_hydro-0.2.1-py2.py3-none-any.whl
- Upload date:
- Size: 9.8 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | af6504a021194bda183aeaf15bc7e9fef184eff5e3933f6a0826cd56fc9977e4
MD5 | eed2abfc007b365535f1969b4b214b8a
BLAKE2b-256 | ce1ad2f7172c364492be428b578dfc8b7a19b0185c4c7aaa80cfe768edf0e8ce
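The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and it assumes the file has already been downloaded to a local path under its original filename.

```python
import hashlib
import os

# SHA256 digests copied from the file details on this page.
PUBLISHED_SHA256 = {
    "delta_hydro-0.2.1.tar.gz":
        "b19694e09713e678c97e7186fff344ce0a0627249ea024a2da63bc46d08cc750",
    "delta_hydro-0.2.1-py2.py3-none-any.whl":
        "af6504a021194bda183aeaf15bc7e9fef184eff5e3933f6a0826cd56fc9977e4",
}


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path):
    """Return True if the file's SHA256 matches the digest listed for its name."""
    expected = PUBLISHED_SHA256.get(os.path.basename(path))
    if expected is None:
        return False  # unknown filename: nothing to compare against
    return sha256_of_file(path) == expected
```

For example, `matches_published_hash("delta_hydro-0.2.1.tar.gz")` would return `False` if the local file had been truncated or altered. pip can perform the same check itself through its hash-checking mode (`--require-hashes` with a requirements file).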