Python pandas dataframe cloud agnostic storage

Project description

pdcloud

pdcloud is a Python package designed to simplify and accelerate the onboarding of data stored in cloud environments. It's built to be cloud-agnostic, allowing seamless access to dataframes stored across various cloud platforms.

Simplifying Cloud Data Access with `pdcloud`

pdcloud offers a unified interface to interact with multiple cloud storage providers, abstracting away the complexities of dealing with different cloud-specific APIs. Key advantages include:

Cloud-Agnostic Interface: One interface intended to work across Azure, AWS, and GCP, so users do not need to learn each provider's API. Azure Blob Storage is supported in this release; AWS S3 and Google Cloud Storage support is planned.
Streamlined Data Operations: Whether reading or writing data, pdcloud provides a consistent, intuitive API, simplifying cloud data operations.
Optimized Data Handling: pdcloud stores data as Parquet and uses PyArrow for in-memory handling, which keeps storage compact and reads and writes fast.

Through pdcloud, users gain a straightforward, efficient path to access and manipulate cloud-stored data, irrespective of the underlying cloud platform.

Benefits of Using PyArrow and Parquet

pdcloud leverages the power of PyArrow and Parquet for data storage and processing, offering several key advantages:

Efficient Data Storage: Parquet stores data in a columnar format, which typically compresses better than row-based storage and lets analytical queries read only the columns they need.
Optimized for Performance: PyArrow's columnar memory format enables fast data access and efficient in-memory computing, which is crucial for analytics.
Cross-platform Support: Parquet is supported across multiple programming languages and platforms, ensuring compatibility and flexibility.
Scalability: Ideal for handling large datasets, Parquet efficiently scales to accommodate massive volumes of data.
Data Compression: Parquet supports various compression techniques, significantly reducing storage costs and improving I/O performance.
Schema Evolution: Parquet datasets tolerate schema changes over time. For example, new columns can be added in newer files without rewriting existing files, as long as the reader reconciles the schemas.

By building on PyArrow and Parquet, pdcloud keeps stored data compact and reads and writes fast, which in turn lowers storage and transfer costs.

Design Choices in `pdcloud`

pdcloud is crafted with the vision of simplifying data access across various cloud platforms. Key design choices include:

Unified API: A single, intuitive interface for all cloud storage operations, regardless of the cloud provider.
Abstraction Layer: Abstracts the complexities of each cloud provider's API, providing a seamless experience.
Cloud-Agnostic Approach: Designed to be adaptable to different cloud environments, ensuring flexibility and broad applicability.
Optimized Data Processing: Integration with PyArrow and Parquet for efficient data handling, suitable for both small and large-scale datasets.
Focus on Performance and Scalability: Ensures efficient data operations, catering to the needs of both individual users and large enterprises.

These design choices reflect our commitment to providing a versatile, efficient, and user-friendly tool for cloud-based data management.
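
One way to picture the unified API and abstraction layer described above is an adapter pattern: a small provider-neutral interface, one adapter per backend, and a library object that only talks to the interface. The sketch below is an assumption about how such a layer could look, not pdcloud's actual internals. `StorageAdapter`, `InMemoryStorageAdapter`, and `Library` are hypothetical names, and an in-memory dictionary stands in for a real cloud backend.

```python
from abc import ABC, abstractmethod


class StorageAdapter(ABC):
    """Hypothetical provider-neutral interface (not pdcloud's real class)."""

    @abstractmethod
    def read_bytes(self, container: str, name: str) -> bytes: ...

    @abstractmethod
    def write_bytes(
        self, container: str, name: str, payload: bytes, overwrite: bool = False
    ) -> None: ...

    @abstractmethod
    def list_names(self, container: str) -> list[str]: ...


class InMemoryStorageAdapter(StorageAdapter):
    """Stand-in backend that keeps objects in a dict instead of a cloud service."""

    def __init__(self) -> None:
        self._blobs: dict[tuple[str, str], bytes] = {}

    def read_bytes(self, container: str, name: str) -> bytes:
        return self._blobs[(container, name)]

    def write_bytes(
        self, container: str, name: str, payload: bytes, overwrite: bool = False
    ) -> None:
        key = (container, name)
        if key in self._blobs and not overwrite:
            raise FileExistsError(f"{container}/{name} already exists")
        self._blobs[key] = payload

    def list_names(self, container: str) -> list[str]:
        return sorted(n for (c, n) in self._blobs if c == container)


class Library:
    """Binds one container to one adapter; callers never see the backend."""

    def __init__(self, container: str, storage: StorageAdapter) -> None:
        self.container = container
        self.storage = storage

    def write(self, name: str, payload: bytes, overwrite: bool = False) -> None:
        self.storage.write_bytes(self.container, name, payload, overwrite)

    def read(self, name: str) -> bytes:
        return self.storage.read_bytes(self.container, name)

    def read_all(self) -> dict[str, bytes]:
        names = self.storage.list_names(self.container)
        return {n: self.read(n) for n in names}


# Swapping the backend means passing a different adapter; Library is unchanged.
storage = InMemoryStorageAdapter()
lib = Library("library", storage)
lib.write("mydata", b"example")
```

With this shape, supporting another provider would mean adding one more adapter class while the calling code stays the same.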

Key Features

Cloud Agnostic: Designed to work with the major cloud providers, so data can be accessed regardless of where it is hosted.
Efficient Data Onboarding: Reduces the steps involved in data transfer and processing, moving away from traditional methods like SFTP/FTP.
Direct Data Access: Facilitates direct access to data through simple cloud configurations and connection strings.
Standardized Data Format: Utilizes Parquet format for data storage and retrieval, ensuring efficiency and uniformity.

Motivation

The goal of pdcloud is to change how data providers share data and how users access it. By removing the separate transfer-and-store step, pdcloud lets users onboard data quickly. After signing the necessary data agreements, users can access a vendor's data directly through the cloud configuration the vendor provides, cutting down the time and resources typically spent on data integration.

Features

Cloud agnostic: Works with Azure Blob Storage, with planned support for AWS S3 and Google Cloud Storage.
Asynchronous and synchronous read/write operations.
Utilizes Apache Arrow for efficient data handling.
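
The source does not show how the synchronous and asynchronous operations are exposed, so the following is only a sketch of one common approach: keep a blocking function and offer an async wrapper that runs it in a worker thread. `read_object`, `read_object_async`, and `read_many` are hypothetical names, and `time.sleep` stands in for a network call.

```python
import asyncio
import time


def read_object(name: str) -> str:
    """Blocking read; the sleep is a placeholder for a storage request."""
    time.sleep(0.05)
    return f"contents of {name}"


async def read_object_async(name: str) -> str:
    """Async variant that runs the blocking read in a worker thread."""
    return await asyncio.to_thread(read_object, name)


async def read_many(names: list[str]) -> list[str]:
    """Read several objects concurrently; results keep the input order."""
    return await asyncio.gather(*(read_object_async(n) for n in names))


# Synchronous callers use read_object; async callers can overlap the waits.
results = asyncio.run(read_many(["a", "b", "c"]))
```

Overlapping the waits this way mainly helps when many objects are read at once, since the total time approaches that of the slowest request instead of the sum of all requests.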

Installation

pip install pdcloud

Usage

Azure Storage Adapter

import pandas as pd

from pdcloud import AzureStorageAdapter
from pdcloud import Lib

# Initialize the Azure Storage Adapter
# (replace the empty string with your Azure Storage connection string)
connection_string = ""
azure_storage = AzureStorageAdapter(connection_string)

# Define the container name
container_name = "library"

# Create an instance of the Lib class
lib = Lib(container=container_name, storage=azure_storage)


# Read and process all data objects from the container
all_data: pd.DataFrame = lib.read_all()
print("All Data:", all_data)

# Read and process a specific data object from the container
data_object_name = "mydata"
specific_data: pd.DataFrame = lib.read(data_object_name)
print("Specific Data Object:", specific_data)

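# Build a small DataFrame to write (illustrative data)
df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Write a DataFrame to the same container
lib.write("mydata", data=df, overwrite=True)

# Write a DataFrame to a different container ("archive" is a placeholder name)
lib.write("mydata", container="archive", data=df, overwrite=True)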
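
The usage example leaves the connection string as an empty placeholder. A minimal sketch of supplying it without hard-coding a secret is to read it from an environment variable and fail early when it is missing. The helper `load_connection_string` and the variable name `AZURE_STORAGE_CONNECTION_STRING` are assumptions chosen for illustration; pdcloud is not documented as reading this variable itself.

```python
import os
from typing import Mapping, Optional

# Assumed variable name for this sketch; pick whatever your deployment uses.
ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def load_connection_string(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the connection string from the environment or raise if unset."""
    source = os.environ if env is None else env
    value = source.get(ENV_VAR, "").strip()
    if not value:
        raise RuntimeError(f"Set {ENV_VAR} before creating the storage adapter")
    return value
```

The returned value can then be passed to `AzureStorageAdapter(...)` in place of the empty string in the example above.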

Contributing

Contributions to pdcloud are welcome! Please read our contributing guidelines for details on how to contribute to the project.

License

This project is licensed under the MIT License.

Project details

Release history

0.4.1: Dec 10, 2023
0.4: Dec 2, 2023
0.3: Dec 2, 2023
0.2: Dec 2, 2023 (this version)
0.1: Dec 2, 2023

Download files

Source Distribution

pdcloud-0.2.tar.gz (10.4 kB)

Uploaded Dec 2, 2023 Source

Built Distribution

pdcloud-0.2-py3-none-any.whl (10.6 kB)

Uploaded Dec 2, 2023 Python 3

File details

Details for the file pdcloud-0.2.tar.gz.

File metadata

Download URL: pdcloud-0.2.tar.gz
Upload date: Dec 2, 2023
Size: 10.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pdcloud-0.2.tar.gz
Algorithm	Hash digest
SHA256	`505d2202d2196955128dc0e5c4079661598bae6eebed91ec9d6e6044041c166b`
MD5	`4e1e922a41f8075e801bfe2ece48a376`
BLAKE2b-256	`3f7809fc678bf1383667585a47808e42cb4f974302808b72a1eb47dfabc1fe0a`

File details

Details for the file pdcloud-0.2-py3-none-any.whl.

File metadata

Download URL: pdcloud-0.2-py3-none-any.whl
Upload date: Dec 2, 2023
Size: 10.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pdcloud-0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c8f8e2d59400935769c5f463661e09fe1f94768ff87b2087cad4209238d17f53`
MD5	`16a47f76e5bb38b77e1b61c933818c8e`
BLAKE2b-256	`2580edff948107e52b24cdd2eca7ba7422ce49b0f00c54d24c28e37695c2d994`
