Cloud-agnostic storage for pandas DataFrames in Python

pdcloud

pdcloud is a Python package designed to simplify and accelerate the onboarding of data stored in cloud environments. It's built to be cloud-agnostic, allowing seamless access to dataframes stored across various cloud platforms.

Simplifying Cloud Data Access with pdcloud

pdcloud offers a unified interface to interact with multiple cloud storage providers, abstracting away the complexities of dealing with different cloud-specific APIs. Key advantages include:

  • Cloud-Agnostic Interface: One interface designed to work across Azure, AWS, GCP, and more, so users do not need to learn each provider's specifics. Azure Blob Storage is supported today; support for AWS S3 and Google Cloud Storage is planned.
  • Streamlined Data Operations: Whether reading or writing data, pdcloud provides a consistent, intuitive API, simplifying cloud data operations.
  • Optimized Data Handling: Leveraging PyArrow and Parquet, pdcloud ensures efficient, fast, and cost-effective data processing.

Through pdcloud, users gain a straightforward, efficient path to access and manipulate cloud-stored data, irrespective of the underlying cloud platform.
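As a rough illustration of what a cloud-agnostic interface can look like, the sketch below defines a small storage protocol and an in-memory stand-in for a cloud service. `StorageBackend`, `InMemoryBackend`, and `copy_object` are hypothetical names invented for this example; they are not part of pdcloud's API and say nothing about how pdcloud is implemented internally.

```python
from typing import Dict, Protocol, Tuple


class StorageBackend(Protocol):
    """Hypothetical minimal interface a storage adapter could expose."""

    def read_bytes(self, container: str, name: str) -> bytes: ...

    def write_bytes(self, container: str, name: str, payload: bytes) -> None: ...


class InMemoryBackend:
    """Stand-in backend that keeps objects in a dict instead of a cloud service."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def read_bytes(self, container: str, name: str) -> bytes:
        return self._objects[(container, name)]

    def write_bytes(self, container: str, name: str, payload: bytes) -> None:
        self._objects[(container, name)] = payload


def copy_object(src: StorageBackend, dst: StorageBackend, container: str, name: str) -> None:
    """Copy one object between backends using only the shared interface."""
    dst.write_bytes(container, name, src.read_bytes(container, name))


# Calling code never needs to know which provider sits behind each backend.
source = InMemoryBackend()
target = InMemoryBackend()
source.write_bytes("library", "mydata", b"example payload")
copy_object(source, target, "library", "mydata")
```

Because `copy_object` depends only on the protocol, a backend for another provider could be swapped in without changing the calling code.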

Benefits of Using PyArrow and Parquet

pdcloud leverages the power of PyArrow and Parquet for data storage and processing, offering several key advantages:

  • Efficient Data Storage: Parquet stores data in a columnar format, which is more space-efficient compared to row-based storage, especially for analytical queries.

  • Optimized for Performance: PyArrow's columnar memory format enables fast data access and efficient in-memory computing, which is crucial for analytics.

  • Cross-platform Support: Parquet is supported across multiple programming languages and platforms, ensuring compatibility and flexibility.

  • Scalability: Ideal for handling large datasets, Parquet efficiently scales to accommodate massive volumes of data.

  • Data Compression: Parquet supports various compression techniques, significantly reducing storage costs and improving I/O performance.

  • Schema Evolution: Parquet supports schema evolution, so compatible changes such as adding columns can be made over time without rewriting existing files.

By building on PyArrow and Parquet, pdcloud aims to store and access data efficiently and cost-effectively.

Design Choices in pdcloud

pdcloud is crafted with the vision of simplifying data access across various cloud platforms. Key design choices include:

  • Unified API: A single, intuitive interface for all cloud storage operations, regardless of the cloud provider.
  • Abstraction Layer: Abstracts the complexities of each cloud provider's API, providing a seamless experience.
  • Cloud-Agnostic Approach: Designed to be adaptable to different cloud environments, ensuring flexibility and broad applicability.
  • Optimized Data Processing: Integration with PyArrow and Parquet for efficient data handling, suitable for both small and large-scale datasets.
  • Focus on Performance and Scalability: Ensures efficient data operations, catering to the needs of both individual users and large enterprises.

These design choices reflect our commitment to providing a versatile, efficient, and user-friendly tool for cloud-based data management.

Key Features

  • Cloud Agnostic: Designed to work with major cloud providers, so data can be accessed regardless of where it is hosted.
  • Efficient Data Onboarding: Reduces the steps involved in data transfer and processing, moving away from traditional methods like SFTP/FTP.
  • Direct Data Access: Facilitates direct access to data through simple cloud configurations and connection strings.
  • Standardized Data Format: Utilizes Parquet format for data storage and retrieval, ensuring efficiency and uniformity.

Motivation

The goal of pdcloud is to change how data providers share data and how users access it. By removing the cumbersome steps of transferring and re-storing data, pdcloud lets users onboard data quickly. Once the necessary data agreements are signed, users can access vendor-provided data directly through vendor-specific cloud configurations, cutting down the time and resources typically spent on data integration.

Features

  • Cloud agnostic: Works with Azure Blob Storage, with planned support for AWS S3 and Google Cloud Storage.
  • Asynchronous and synchronous read/write operations.
  • Utilizes Apache Arrow for efficient data handling.
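One common way to offer both synchronous and asynchronous reads is to wrap a blocking call so it can be awaited. The sketch below shows that pattern with the standard library only (Python 3.9+ for `asyncio.to_thread`). `read_sync`, `read_async`, and `read_many` are hypothetical names for this example; pdcloud's own async API may be shaped differently.

```python
import asyncio
import time
from typing import List


def read_sync(name: str) -> str:
    """Hypothetical blocking read; time.sleep stands in for network latency."""
    time.sleep(0.05)
    return f"contents of {name}"


async def read_async(name: str) -> str:
    """Async variant that runs the blocking call in a worker thread."""
    return await asyncio.to_thread(read_sync, name)


async def read_many(names: List[str]) -> List[str]:
    """Fetch several objects concurrently; results keep the input order."""
    return list(await asyncio.gather(*(read_async(n) for n in names)))


results = asyncio.run(read_many(["a", "b", "c"]))
```

With this approach the three simulated reads overlap instead of running back to back, which is the main reason to reach for the asynchronous variant when loading many objects.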

Installation

pip install pdcloud

Usage

Azure Storage Adapter

import pandas as pd

from pdcloud import AzureStorageAdapter
from pdcloud import Lib

# Initialize the Azure Storage Adapter
# Supply your Azure Storage connection string; the empty value is a placeholder
connection_string = ""
azure_storage = AzureStorageAdapter(connection_string)

# Define the container name
container_name = "library"

# Create an instance of the Lib class
lib = Lib(container=container_name, storage=azure_storage)


# Read and process all data objects from the container
all_data: pd.DataFrame = lib.read_all()
print("All Data:", all_data)

# Read and process a specific data object from the container
data_object_name = "mydata"
specific_data: pd.DataFrame = lib.read(data_object_name)
print("Specific Data Object:", specific_data)

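# Example DataFrame to write (illustrative data, not from a real dataset)
df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# Write a DataFrame to the same container
lib.write("mydata", data=df, overwrite=True)

# Write a DataFrame to a different container
# ("archive" is a placeholder name; use a container that exists in your account)
lib.write("mydata", container="archive", data=df, overwrite=True)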
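Since the connection string carries credentials, it is usually better to keep it out of source code. A minimal sketch, assuming the value is provided through an environment variable: `load_connection_string` is a hypothetical helper written for this example, not a pdcloud function, and the default variable name `AZURE_STORAGE_CONNECTION_STRING` is an assumption that follows a common Azure tooling convention.

```python
import os


def load_connection_string(env_var: str = "AZURE_STORAGE_CONNECTION_STRING") -> str:
    """Return the connection string stored in env_var, or fail with a clear message."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(f"Set the {env_var} environment variable before connecting.")
    return value
```

The returned value can then be passed to the adapter in place of the hard-coded string, for example `AzureStorageAdapter(load_connection_string())`.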

Contributing

Contributions to pdcloud are welcome! Please read our contributing guidelines for details on how to contribute to the project.

License

This project is licensed under the MIT License.

Download files

Source Distribution

pdcloud-0.2.tar.gz (10.4 kB)

Uploaded Source

Built Distribution

pdcloud-0.2-py3-none-any.whl (10.6 kB)

Uploaded Python 3

File details

Details for the file pdcloud-0.2.tar.gz.

File metadata

  • Download URL: pdcloud-0.2.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pdcloud-0.2.tar.gz
Algorithm Hash digest
SHA256 505d2202d2196955128dc0e5c4079661598bae6eebed91ec9d6e6044041c166b
MD5 4e1e922a41f8075e801bfe2ece48a376
BLAKE2b-256 3f7809fc678bf1383667585a47808e42cb4f974302808b72a1eb47dfabc1fe0a

File details

Details for the file pdcloud-0.2-py3-none-any.whl.

File metadata

  • Download URL: pdcloud-0.2-py3-none-any.whl
  • Upload date:
  • Size: 10.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pdcloud-0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 c8f8e2d59400935769c5f463661e09fe1f94768ff87b2087cad4209238d17f53
MD5 16a47f76e5bb38b77e1b61c933818c8e
BLAKE2b-256 2580edff948107e52b24cdd2eca7ba7422ce49b0f00c54d24c28e37695c2d994
