
Gamma IO


Extensible I/O, filesystems and dataset layer for data-science projects.

Overview

"Gamma IO" provides an opinionated way to implement a "I/O and datasets" layer to avoid cluttering your "business layer" data pipelines with infra-structured concerns. One of the main goals is to provide a simple way to adapt this layer to your particular needs.

We take an open trunk approach to code: the underlying code should be clean and very easy to understand and extend. You can use the provided code as a library dependency (e.g. pip install gamma-io), or you can simply "vendor" the code base by copying it into your project as a Python module and extending it yourself.

While we provide a decent amount of functionality (see below), we do not expect this project to read, write and manage data from every possible data source. Our main goal is to provide a consistent, pleasant way to write glue code for the data storage layer.

Features

  • Clean interface to read/write datasets, separating infra configuration from data transformations.
  • Nice integration with gamma-config. But you can easily bring your own configuration provider!
  • Support for Pandas, PyArrow and Polars dataframes.
  • Support for reading files from multiple filesystems via fsspec.
  • First party support for partitioned Parquet datasets.

Getting started

Install with pip, assuming you're using the optional gamma-config integration:

pip install gamma-io gamma-config[jinja2]

You can "scaffold" an initial configuration. In your project folder:

python -m gamma.config.scaffold

Remove the sample files, then create a config/20-datasets.yaml file with the following contents:

datasets:
    source:
        customers_1k:
            location: https://github.com/cjalmeida/gamma-io/raw/main/samples/customers-1000.zip
            format: csv
            compression: zip

        customers_1k_plain:
            location: https://github.com/cjalmeida/gamma-io/raw/main/samples/customers-1000.csv
            format: csv

    raw:
        customers:
            location: "file:///tmp/gamma-io/data/customers"
            format: parquet
            compression: snappy
            partition_by: [cluster]

The file above provides two "layers": a source layer containing the HTTPS remote customers_1k and customers_1k_plain datasets, and a raw layer containing a customers dataset partitioned by the cluster column.

In your code (or Jupyter notebook) you can easily read these datasets as Pandas dataframes:

from gamma.io import read_pandas

df = read_pandas("source", "customers_1k")

All details about dataset format and storage infrastructure are kept out of the codebase. Now let's write the dataset as a set of partitioned Parquet files:

from gamma.io import read_pandas, write_pandas

# read it again
df = read_pandas("source", "customers_1k")

# some transformation: let's add the cluster column to the dataset
df["cluster"] = (df["Index"] % 3).astype(str)

# write to our dataset
write_pandas(df, "raw", "customers")

You can see it generated the Parquet files partitioned in the "Hive" format:

$ tree /tmp/gamma-io/data

/tmp/gamma-io/data
└── customers
    ├── cluster=0
    │   └── part-0.parquet
    ├── cluster=1
    │   └── part-0.parquet
    └── cluster=2
        └── part-0.parquet
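The layout above follows a simple rule: each distinct value of the partition column becomes a `column=value` directory. A minimal pure-Python sketch of that rule (the `partition_dir` function is illustrative, not part of the gamma-io API):

```python
def partition_dir(base: str, column: str, value: str) -> str:
    """Directory holding rows whose partition column equals `value` (Hive style)."""
    return f"{base}/{column}={value}"

# Mirror the example: Index % 3 yields cluster values "0", "1", "2"
base = "/tmp/gamma-io/data/customers"
dirs = sorted({partition_dir(base, "cluster", str(i % 3)) for i in range(1000)})
print(dirs)
# → ['/tmp/gamma-io/data/customers/cluster=0',
#    '/tmp/gamma-io/data/customers/cluster=1',
#    '/tmp/gamma-io/data/customers/cluster=2']
```

Because the partition value is encoded in the path, readers can skip entire directories when filtering on the partition column.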

Configuring the filesystem

In the example above, the location configuration key points to where we can find the dataset. The underlying infrastructure is based on fsspec, so it supports many filesystem-like implementations out of the box.
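For instance, moving the raw dataset to S3 should mostly be a matter of changing the location URL (a sketch: the bucket name is a placeholder, and you'd need an fsspec S3 implementation such as s3fs installed):

```yaml
datasets:
    raw:
        customers:
            location: "s3://my-bucket/data/customers"
            format: parquet
            compression: snappy
            partition_by: [cluster]
```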
