cloud-dataplug

Pluggable data partitioning for cloud-native scientific workloads

Development Status
- 2 - Pre-Alpha
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3 :: Only

Project description

Dataplug is a client-side-only, extensible Python framework for efficiently partitioning unstructured scientific data stored in object storage (such as Amazon S3) for elastic workloads in the cloud.

Dataplug provides a plug-in interface that enables data partitioning for a multitude of scientific data types, from a variety of domains, stored in S3. Currently, Dataplug supports the following data types:
- Generic
  - CSV
  - Raw Text
- Genomics
  - FASTA
  - FASTQ
  - VCF
- Geospatial
- Metabolomics
  - ImzML
Dataplug follows a read-only cloud-aware pre-processing approach to enable on-the-fly dynamic partitioning of scientific unstructured data.
- It is cloud-aware because it specifically targets cold raw data residing in huge repositories in object storage (e.g. Amazon S3). S3 allows partial reads at high bandwidth through HTTP GET byte-range requests. Dataplug builds indexes around this premise to enable parallel chunked access to unstructured data, compensating for the high latency of object storage with many parallel reads at high bandwidth.
- It is read-only because objects in object storage are immutable. Pre-processing therefore never modifies the data: indexes and metadata are stored decoupled from the data, in a separate object, and the raw cold data is kept as-is in storage. This avoids rewriting the entire dataset for partitioning. Since indexes are several orders of magnitude smaller than the data, data movement is considerably reduced.
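To make the byte-range idea concrete, the following is a minimal sketch of how an object of known size could be split into inclusive byte ranges and turned into HTTP `Range` headers. It is not Dataplug's implementation: the function names `byte_ranges` and `range_header` are hypothetical, and the split is a naive fixed-size one, whereas Dataplug relies on format-aware indexes to place chunk boundaries.

```python
from typing import Dict, List, Tuple


def byte_ranges(object_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split an object of object_size bytes into inclusive (first, last) byte pairs."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    ranges = []
    for first in range(0, object_size, chunk_size):
        # HTTP byte ranges are inclusive on both ends, hence the "- 1".
        last = min(first + chunk_size, object_size) - 1
        ranges.append((first, last))
    return ranges


def range_header(first: int, last: int) -> Dict[str, str]:
    """Build the header of an HTTP GET byte-range request."""
    return {"Range": f"bytes={first}-{last}"}


# A 1000-byte object read in chunks of at most 400 bytes.
ranges = byte_ranges(object_size=1000, chunk_size=400)
headers = [range_header(first, last) for first, last in ranges]
```

Each header can be sent by a different worker, so the ranges are fetched in parallel while the object itself is never rewritten.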
Dataplug allows re-partitioning a dataset at zero cost.
- Dataplug introduces the concept of data slicing. A data slice is a lazily-evaluated partition of a pre-processed dataset in its raw form, present in object storage (1).
- Users can perform different partitioning strategies (2) on the same dataset without actually moving data around (3).
- Data slices are serializable, and can be sent to remote workers using any Python-compatible distributed computing environment (4) (e.g. PySpark, Dask or Ray).
- Data slices are evaluated at the moment the data is accessed, and not before (5). This allows many remote workers to issue many HTTP GET byte-range requests in parallel against the raw dataset, exploiting S3's high bandwidth capabilities.
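The data slicing idea above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical and only mimics the concept: `LazySlice`, `partition_lines`, `fetch_range` and the in-memory `FAKE_STORE` are illustrative names, not Dataplug classes or functions, and a thread pool stands in for remote workers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List

# Stand-in for an object store: key -> immutable bytes (hypothetical).
FAKE_STORE = {"reads.txt": b"AAAA\nCCCC\nGGGG\nTTTT\n"}


def fetch_range(key: str, first: int, last: int) -> bytes:
    """Stand-in for an HTTP GET byte-range request (inclusive bounds)."""
    return FAKE_STORE[key][first:last + 1]


@dataclass(frozen=True)
class LazySlice:
    """Hypothetical slice: holds only a key and byte offsets, never the data."""
    key: str
    first: int
    last: int

    def get(self) -> bytes:
        # The data is read only here, at access time.
        return fetch_range(self.key, self.first, self.last)


def partition_lines(key: str, line_starts: List[int], size: int,
                    lines_per_slice: int) -> List[LazySlice]:
    """Build slices from an index of line start offsets, without reading data."""
    bounds = line_starts[::lines_per_slice] + [size]
    return [LazySlice(key, bounds[i], bounds[i + 1] - 1)
            for i in range(len(bounds) - 1)]


# The "index" is computed once; here it is simply written by hand.
index = [0, 5, 10, 15]
coarse = partition_lines("reads.txt", index, size=20, lines_per_slice=2)
fine = partition_lines("reads.txt", index, size=20, lines_per_slice=1)

# Workers evaluate the slices in parallel; threads stand in for remote workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    chunks = list(pool.map(lambda s: s.get(), coarse))
```

Both `coarse` and `fine` are derived from the same index, so changing the partitioning only produces new offset pairs and moves no data.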

Installation

Dataplug is only available through GitHub. You can use pip to install it directly from the repository:
```
pip install git+https://github.com/CLOUDLAB-URV/dataplug
```

Partitioning text example

```python
import dask.bag as db
from dask.distributed import Client

from dataplug import CloudObject
from dataplug.formats.genomics.fastq import FASTQGZip, partition_reads_batches

# Assign FASTQGZip data type for object in s3://genomics/SRR6052133_1.fastq.gz
co = CloudObject.from_s3(FASTQGZip, "s3://genomics/SRR6052133_1.fastq.gz")

# Data must be pre-processed first ==> This only needs to be done once per dataset
# Preprocessing will create reusable indexes to repartition
# the data many times in many chunk sizes
# Dataplug leverages joblib to deploy preprocessing jobs
co.preprocess(parallel_config={"backend": "dask"})

# Partition the FASTQGZip object into 200 chunks
# This does not move data around, it only creates data slices from the indexes
data_slices = co.partition(partition_reads_batches, num_batches=200)


def process_fastq_data(data_slice):
    # Evaluate the data_slice, which will perform the
    # actual HTTP GET requests to get the FASTQ partition data
    fastq_reads = data_slice.get()
    ...


# Use Dask for deploying a parallel distributed job
client = Client()
# Create a Dask Bag from the data_slices list
dask_bag = db.from_sequence(data_slices)

# Apply the process_fastq_data function to each data slice
# Dask will serialize the data_slices and send them to the workers
dask_bag.map(process_fastq_data).compute()
```
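The body of `process_fastq_data` is left open in the example above. As one possible illustration of what a worker might do with a partition, the sketch below parses FASTQ text into records and computes a small summary. It assumes the partition is available as a single string of standard four-line FASTQ records; the actual return type of `data_slice.get()` is not specified here, and `iter_fastq_records` and `summarize` are hypothetical helpers, not part of Dataplug.

```python
from typing import Dict, Iterator, Tuple


def iter_fastq_records(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield header[1:], sequence, quality


def summarize(text: str) -> Dict[str, int]:
    """Count the reads and the total number of bases in a FASTQ partition."""
    records = list(iter_fastq_records(text))
    return {
        "reads": len(records),
        "bases": sum(len(sequence) for _, sequence, _ in records),
    }


# Two tiny hand-written records standing in for a real partition.
sample = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n"
summary = summarize(sample)
```

Because each partition is summarized independently, the per-slice results can be combined afterwards, for example by summing the counts returned by each worker.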

Documentation

Acknowledgements

This project has been partially funded by the EU Horizon programme under grant agreements No. 101086248, No. 101092644, No. 101092646, No. 101093110.

The logo has been proudly generated using cooltext.com.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Release history
- 1.0.2 (Sep 5, 2025)
- 1.0.1 (May 29, 2025)
- 1.0.0 (May 8, 2025), this version

Download files

Source Distribution

cloud_dataplug-1.0.0.tar.gz (186.2 kB)

Uploaded May 8, 2025 Source

Built Distribution

cloud_dataplug-1.0.0-py3-none-any.whl (44.1 kB)

Uploaded May 8, 2025 Python 3

File details

Details for the file cloud_dataplug-1.0.0.tar.gz.

File metadata

Download URL: cloud_dataplug-1.0.0.tar.gz
Upload date: May 8, 2025
Size: 186.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for cloud_dataplug-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`3ec023c9f1e00cac21f5f05d46957fdeb9a519f6a153fc71f91b28ac94397004`
MD5	`31fbe7fb67159c46876596b40c0c0b6c`
BLAKE2b-256	`858339a4d616e10edfd849f9e00e7403de2b15e39429790ade2fc69e1632ea04`

File details

Details for the file cloud_dataplug-1.0.0-py3-none-any.whl.

File metadata

Download URL: cloud_dataplug-1.0.0-py3-none-any.whl
Upload date: May 8, 2025
Size: 44.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for cloud_dataplug-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`385d4b54b4fd79527b6148e8ac2a263a069fe688bef610d4a8ac07c3a2b33370`
MD5	`e4b7fd2f8603a192e7d2f064bac4a440`
BLAKE2b-256	`8c5f1590dfd0d01ee68747e7752eb853dd83945d5045b4661e0e4cebb005ec65`
