Gamma IO
Extensible I/O, filesystems and dataset layer for data-science projects.
Overview
"Gamma IO" provides an opinionated way to implement a "I/O and datasets" layer to avoid cluttering your "business layer" data pipelines with infra-structured concerns. One of the main goals is to provide a simple way to adapt this layer to your particular needs.
We take an open trunk approach to code: the underlying code should be clean and very easy to understand and extend. You can use the provided code as a library dependency (e.g. `pip install gamma-io`), or you can simply "vendor" the code base by copying it over to your project as a Python module and extending it yourself.
While we provide a decent amount of functionality (see below), we do not expect this project to be able to read, write and manage data from every single possible data source. Our main goal is to provide a consistent and nice way to write glue code for the data storage layer.
Features
- Clean interface to read/write datasets, separating infra configuration from data transformations.
- Nice integration with `gamma-config`. But you can easily bring your own configuration provider!
- Support for Pandas, PyArrow and Polars dataframes.
- Support for reading files from multiple filesystems via fsspec.
- First party support for partitioned Parquet datasets.
Getting started
Using pip, and assuming you're using the optional `gamma-config` integration:

```shell
pip install gamma-io gamma-config[jinja2]
```
You can "scaffold" an initial configuration. In your project folder:
python -m gamma.config.scaffold
Remove the sample files, then create yourself a config/20-datasets.yaml
file
with the contents:
```yaml
datasets:
  source:
    customers_1k:
      location: https://github.com/cjalmeida/gamma-io/raw/main/samples/customers-1000.zip
      format: csv
      compression: zip
    customers_1k_plain:
      location: https://github.com/cjalmeida/gamma-io/raw/main/samples/customers-1000.csv
      format: csv
  raw:
    customers:
      location: "file:///tmp/gamma-io/data/customers"
      format: parquet
      compression: snappy
      partition_by: [cluster]
```
The file above provides two "layers": a `source` layer containing the HTTPS remote `customers_1k` and `customers_1k_plain` datasets, and a `raw` layer containing a `customers` dataset partitioned by the `cluster` column.
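The layer/dataset naming convention is essentially a two-level lookup into the configuration. A minimal pure-Python sketch of that resolution, mirroring the structure of the YAML file above (the `DATASETS` dict and `resolve` helper are illustrative, not part of gamma-io's API):

```python
# Illustrative only: a two-level "layer -> dataset -> entry" lookup,
# mirroring the structure of the 20-datasets.yaml file above.
DATASETS = {
    "source": {
        "customers_1k": {
            "location": "https://github.com/cjalmeida/gamma-io/raw/main/samples/customers-1000.zip",
            "format": "csv",
            "compression": "zip",
        },
    },
    "raw": {
        "customers": {
            "location": "file:///tmp/gamma-io/data/customers",
            "format": "parquet",
            "partition_by": ["cluster"],
        },
    },
}


def resolve(layer: str, name: str) -> dict:
    """Return the config entry for a dataset, failing loudly if missing."""
    try:
        return DATASETS[layer][name]
    except KeyError:
        raise KeyError(f"Unknown dataset: {layer}/{name}") from None


entry = resolve("raw", "customers")
print(entry["format"])  # parquet
```

Keeping the lookup this simple is what makes it easy to swap in your own configuration provider: any function that maps `(layer, name)` to such a dict entry will do.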
In your code (or Jupyter Notebook) you can easily read these datasets as Pandas dataframes:

```python
from gamma.io import read_pandas

df = read_pandas("source", "customers_1k")
```
Note how the details about dataset format and storage infrastructure don't clutter the codebase. Now let's write the dataset as a set of partitioned Parquet files:
```python
from gamma.io import read_pandas, write_pandas

# read it again
df = read_pandas("source", "customers_1k")

# some transformation: let's add the cluster column to the dataset
df["cluster"] = (df["Index"] % 3).astype(str)

# write to our dataset
write_pandas(df, "raw", "customers")
```
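The cluster assignment above is plain modulo bucketing: each row's `Index` maps to one of three clusters, stored as a string. The same logic in pure Python, without pandas:

```python
# Bucket record indices into three clusters, as strings,
# mirroring df["cluster"] = (df["Index"] % 3).astype(str) above.
clusters = [str(i % 3) for i in range(1, 7)]
print(clusters)  # ['1', '2', '0', '1', '2', '0']
```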
You can see it generated the Parquet structure partitioned in the "Hive" format:
```
$ tree /tmp/gamma-io/data
/tmp/gamma-io/data
└── customers
    ├── cluster=0
    │   └── part-0.parquet
    ├── cluster=1
    │   └── part-0.parquet
    └── cluster=2
        └── part-0.parquet
```
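The `cluster=<value>` directory names follow the standard Hive partitioning scheme: each partition column becomes a `column=value` path segment, and rows sharing the same value land in the same directory. A stdlib-only sketch of how such partition keys are derived (the `hive_paths` helper is illustrative, not gamma-io code):

```python
from collections import defaultdict


def hive_paths(rows: list[dict], partition_by: list[str]) -> dict[str, list[dict]]:
    """Group rows under Hive-style 'col=value' directory keys."""
    parts: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        key = "/".join(f"{col}={row[col]}" for col in partition_by)
        parts[key].append(row)
    return dict(parts)


rows = [{"Index": i, "cluster": str(i % 3)} for i in range(6)]
print(sorted(hive_paths(rows, ["cluster"])))  # ['cluster=0', 'cluster=1', 'cluster=2']
```

A nice property of this layout is that readers can prune partitions by path alone: filtering on `cluster == "1"` only needs to open files under `cluster=1/`.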
Configuring the filesystem
In the example above, the `location` configuration key points to where we can find the dataset. The underlying infrastructure is based on fsspec, so it supports many filesystem-like implementations out of the box.
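fsspec picks a filesystem implementation by dispatching on the URL scheme (`file://`, `https://`, `s3://`, ...), with scheme-less paths treated as local files. A stdlib sketch of that scheme extraction using `urllib.parse` (the `fs_protocol` helper is illustrative, not the actual fsspec dispatch code):

```python
from urllib.parse import urlparse


def fs_protocol(location: str) -> str:
    """Return the URL scheme a fsspec-style dispatcher would key on.

    Bare paths have no scheme and fall back to the local filesystem.
    """
    scheme = urlparse(location).scheme
    return scheme or "file"


print(fs_protocol("file:///tmp/gamma-io/data/customers"))  # file
print(fs_protocol("s3://my-bucket/customers"))             # s3
print(fs_protocol("data/local.csv"))                       # file
```

This is why switching a dataset from local disk to, say, cloud object storage is a one-line `location` change in the YAML, with no changes to the pipeline code.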