data operations related code - extended
tgedr-dataops-ext
Concrete, tested implementations of the tgedr-dataops abstract contracts — PySpark, Delta Lake, and Databricks, all in one place.
motivation
tgedr-dataops-ext builds on top of tgedr-dataops (the abstract contracts layer) and provides concrete, tested implementations for distributed data processing with PySpark and Delta Lake. It covers session management, ETL pipelines, Delta table storage, data validation, and Databricks job integration, all following consistent code quality and structural standards.
installation
pip install tgedr-dataops-ext
package contents
commons
Shared utilities and base classes used across the library.
| Class | Description | Example |
|---|---|---|
| Dataset | Immutable wrapper pairing a Spark DataFrame with its Metadata | test |
| Metadata | Immutable dataclass describing a dataset (name, version, framing, sources) | test |
| UtilsSpark | Utility class for creating and configuring Spark sessions (local, AWS Glue, or active session) and building PySpark schemas from type dictionaries | test |
| UtilsDatabricks | Utility class for retrieving the Databricks dbutils object from the active Spark session | test |
| EtlDatabricks | Abstract intermediate ETL class extending Etl with Databricks job integration: captures run_id, publishes outputs via dbutils.jobs.taskValues, and provides the inject_configuration decorator for auto-wiring method parameters from configuration or upstream task values | test |
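The auto-wiring idea behind inject_configuration can be illustrated with a small, library-independent sketch. This is not the decorator shipped in tgedr-dataops-ext: the attribute name `configuration`, the class `ExampleEtl`, and the method `extract` are hypothetical, and a plain dict stands in for both the configuration and any upstream task values.

```python
import functools
import inspect
from typing import Any, Callable, Mapping


def inject_configuration(func: Callable[..., Any]) -> Callable[..., Any]:
    """Fill arguments the caller did not pass from `self.configuration`.

    Hypothetical sketch: the real decorator may resolve values differently,
    for example by also consulting upstream Databricks task values.
    """
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        # Work out which parameters the caller already supplied.
        bound = signature.bind_partial(self, *args, **kwargs)
        for name, param in signature.parameters.items():
            if name in bound.arguments:
                continue  # explicit arguments always win
            if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
                continue  # *args / **kwargs are never auto-wired
            if name in self.configuration:
                kwargs[name] = self.configuration[name]
        return func(self, *args, **kwargs)

    return wrapper


class ExampleEtl:
    """Hypothetical ETL class; only the `configuration` attribute matters here."""

    def __init__(self, configuration: Mapping[str, Any]) -> None:
        self.configuration = dict(configuration)

    @inject_configuration
    def extract(self, source_path: str, limit: int = 10) -> str:
        return f"reading {limit} rows from {source_path}"


etl = ExampleEtl({"source_path": "/tmp/input", "limit": 5})
print(etl.extract())          # both parameters come from the configuration
print(etl.extract(limit=2))   # an explicit argument overrides the configuration
```

Resolving parameters by name keeps the method signature as the single place where inputs are declared, at the cost of requiring configuration keys to match parameter names exactly.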
quality
Data quality validation backed by Great Expectations.
| Class | Description | Example |
|---|---|---|
| PysparkValidation | GreatExpectationsValidation implementation for validating PySpark DataFrames using the Great Expectations library | test |
source
Implementations of the Source contract for reading data from various backends.
| Class | Description | Example |
|---|---|---|
| DeltaTableSource | Abstract Source base class for reading Delta Lake datasets, returning a pandas DataFrame | test |
| LocalDeltaTable | Concrete Source reading Delta Lake datasets from the local filesystem using pure Python (no PySpark required) | test |
| S3DeltaTable | Concrete Source reading Delta Lake datasets from S3 using pure Python (no PySpark required) | test |
| CatalogFileSource | Source implementation for listing, copying, and retrieving metadata of files in a Databricks-accessible file system (DBFS, S3, ADLS) via dbutils.fs | test |
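For orientation, reading a Delta table into pandas without PySpark can look like the sketch below. It uses the `deltalake` package (delta-rs) directly; whether LocalDeltaTable and S3DeltaTable rely on that package internally is an assumption, and the function name `read_delta_as_pandas` is hypothetical rather than part of this library.

```python
from typing import Any, Dict, Optional, Sequence


def read_delta_as_pandas(
    table_uri: str,
    version: Optional[int] = None,
    columns: Optional[Sequence[str]] = None,
    storage_options: Optional[Dict[str, str]] = None,
) -> Any:
    """Load a Delta table as a pandas DataFrame without a Spark session.

    Hypothetical helper, not the tgedr-dataops-ext API. `table_uri` may be a
    local path or an object-store URI such as s3://bucket/prefix; for remote
    stores, credentials and region go in `storage_options`.
    """
    # Imported inside the function so this module can be loaded even when the
    # optional `deltalake` dependency is not installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri, version=version, storage_options=storage_options)
    return table.to_pandas(columns=list(columns) if columns is not None else None)
```

A call such as `read_delta_as_pandas("/tmp/events", columns=["id", "ts"])` would return only the two named columns; passing `version` reads an earlier snapshot of the table.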
sink
Implementations of the Sink contract for writing and managing data in various backends.
| Class | Description | Example |
|---|---|---|
| CatalogFileSink | Sink implementation for copying and deleting files or directories in a Databricks-accessible file system via dbutils.fs | test |
store
Implementations of the Store contract for persistent, structured data storage.
| Class | Description | Example |
|---|---|---|
| SparkDeltaStore | Store implementation for PySpark distributed processing with Delta Lake format. Supports versioned reads, append/overwrite writes, upserts, partitioning, schema evolution, retention policies, metadata management, and column comments | test |
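As background for the upsert support mentioned above, the sketch below shows how an upsert is commonly expressed with the delta-spark merge API. It does not show SparkDeltaStore's own methods: the helper names `merge_condition` and `upsert_delta` are hypothetical, and the code assumes a Spark session already configured for Delta Lake.

```python
from typing import Any, Sequence


def merge_condition(key_columns: Sequence[str]) -> str:
    """Build the join predicate that matches target rows to source rows."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"target.{col} = source.{col}" for col in key_columns)


def upsert_delta(spark: Any, path: str, updates: Any, key_columns: Sequence[str]) -> None:
    """Merge `updates` (a Spark DataFrame) into the Delta table at `path`.

    Hypothetical helper, not the tgedr-dataops-ext API. Rows whose key columns
    match an existing row replace it; all other rows are inserted.
    """
    # Imported inside the function so this module can be loaded even when
    # delta-spark is not installed.
    from delta.tables import DeltaTable

    (
        DeltaTable.forPath(spark, path)
        .alias("target")
        .merge(updates.alias("source"), merge_condition(key_columns))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


print(merge_condition(["id", "day"]))
```

Building the predicate from a list of key columns keeps composite keys explicit and avoids hand-written SQL strings at each call site.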
development
Requirements:
- uv

# clone
git clone git@github.com:tgedr/dataops-ext
cd dataops-ext
# install dependencies
./helper.sh reqs
# run tests
./helper.sh test