Advanced Delta Lake-related tooling based on Apache Spark
Reason this release was yanked:
sorry folks, this is the only time this will happen, promise.
Project description
hydro 💧
hydro is a collection of Python-based Apache Spark and Delta Lake extensions.
See Key Functionality for concrete use cases.
Warning ⚠️
hydro is well tested but not yet battle-hardened. Use it at your own risk.
Installation
pip install delta-hydro
Docs 📖
https://christophergrant.github.io/delta-hydro
Key Functionality 🔑
- De-duplicate a Delta Lake table, in-place, without a full overwrite - hydro.delta.deduplicate
- Correctly apply Slowly Changing Dimension (SCD) updates (type 1 or type 2) to Delta Lake tables - hydro.delta.scd and hydro.delta.bootstrap_scd2
- Issue queries against Delta Log metadata, quickly and efficiently getting things like partition sizes on huge tables - hydro.delta.partition_stats
- Other quality-of-life improvements like hydro.delta.detail_enhanced and hydro.spark.fields
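To make the Delta Log idea above concrete, here is a minimal sketch in plain Python of how partition sizes can be derived from log metadata alone, without reading any data files. It is not hydro's implementation and does not use hydro's API: the function name `partition_sizes` and the sample log entries are hypothetical. It relies only on the `add` and `remove` actions of the Delta transaction log, which record each file's `path`, `partitionValues`, and `size`. A real reader would also need to handle checkpoint files and read the JSON commits in version order.

```python
import json
from collections import defaultdict


def partition_sizes(log_lines):
    """Aggregate live file counts and bytes per partition from Delta log actions.

    `log_lines` is an iterable of JSON strings, one action per line, in commit
    order. This is an illustrative helper, not part of hydro.
    """
    live = {}  # file path -> (partition key, size in bytes)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "add" in action:
            add = action["add"]
            # Use a sorted tuple of (column, value) pairs as a hashable key.
            key = tuple(sorted(add.get("partitionValues", {}).items()))
            live[add["path"]] = (key, add["size"])
        elif "remove" in action:
            # A removed file no longer belongs to the current table state.
            live.pop(action["remove"]["path"], None)

    stats = defaultdict(lambda: {"files": 0, "bytes": 0})
    for key, size in live.values():
        stats[key]["files"] += 1
        stats[key]["bytes"] += size
    return dict(stats)


# Hypothetical log entries, shaped like Delta `add` / `remove` actions.
sample_log = [
    '{"add": {"path": "date=2023-01-01/a.parquet", "partitionValues": {"date": "2023-01-01"}, "size": 100, "modificationTime": 0, "dataChange": true}}',
    '{"add": {"path": "date=2023-01-01/b.parquet", "partitionValues": {"date": "2023-01-01"}, "size": 250, "modificationTime": 0, "dataChange": true}}',
    '{"add": {"path": "date=2023-01-02/c.parquet", "partitionValues": {"date": "2023-01-02"}, "size": 40, "modificationTime": 0, "dataChange": true}}',
    '{"remove": {"path": "date=2023-01-01/a.parquet", "deletionTimestamp": 1, "dataChange": true}}',
]

stats = partition_sizes(sample_log)
for partition, totals in sorted(stats.items()):
    print(dict(partition), totals)
```

Because the log is small relative to the data it describes, this kind of aggregation stays cheap even when the table itself is very large, which is the motivation behind metadata-only queries.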
Contributions ✨
Contributions are welcome.
Please create an issue and discuss before starting work on a feature to make sure that it aligns with the future of the project.
Naming 🤓
hydro is short for hydrologist, a person who studies water and its movement. Delta Lake, Data Lake, Lakehouse => water.
ChatGPT and LLMs 🤖
Some of this project's code and documentation was generated by a Large Language Model (LLM), namely ChatGPT.
We are proud prompt engineers, so we display the prompt that gave us the code in hydro's source (example).
APIs
The topic of SQL vs DataFrames is a hot one in the data space.
SQL certainly has its place in analytic and other ad-hoc use cases, but it is missing the expressive power of an imperative language.
This project is a testament to the mix of imperative and declarative expression that DataFrames offer. Much of this code would be very verbose, or impossible, to express in SQL.
File details
Details for the file delta_hydro-0.4.2.tar.gz.
File metadata
- Download URL: delta_hydro-0.4.2.tar.gz
- Upload date:
- Size: 12.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | dfe78cf648b85d637741fb53d1d754561245c3c6f592d29f9b2a75bed514cb0f
MD5 | d22ebfe50974f7dd031b31eae7307392
BLAKE2b-256 | 3c35730ef3df9537b9c835b1e0902634115b27c3b4806b4889e14576758f303a
File details
Details for the file delta_hydro-0.4.2-py2.py3-none-any.whl.
File metadata
- Download URL: delta_hydro-0.4.2-py2.py3-none-any.whl
- Upload date:
- Size: 12.9 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | b17f9556ed639456ac761bb632c6047a28d3d032cbb6d5d5c6ce0754e7523c9c
MD5 | cbb8a8abab159672ef2df1482224d909
BLAKE2b-256 | f050c458ac0b75228944a96f1cf457ef7d3758a34a2a58841fc9bf894d11d067