Multi-tiered cloud-backed blob storage system

Project description

Epic bitstore — Multi-tiered cloud-backed blob storage system

What is it?

The epic-bitstore Python library provides client access to a multi-tiered blob storage system based on cloud backends, with the option to use API backends and caching mechanisms as well.

Usage

For example, let's assume you store blobs in the following locations:

  1. In an AWS bucket: s3://aws_customer_data/files/<sha1>
  2. In a GCP bucket: gs://gcp_customer_data/blobs/<sha1>
  3. In another GCP bucket: gs://my_project/more_files/<sha1>

Using epic-bitstore, you can fetch a blob from any of these locations with a single get call. The library iterates over the sources in order and retrieves the data from the first source that contains the blob. Retrieval can also run in parallel over multiple blobs, backed by the ultima parallelization library.

To implement the above strategy, create a Sha1Composite store and append each of your sources in order:

from epic.bitstore import Sha1Composite, Sha1Store, S3Raw, GSRaw

blob_store = Sha1Composite()
blob_store.append_source(Sha1Store(S3Raw(), "s3://aws_customer_data/files/"))
blob_store.append_source(Sha1Store(GSRaw(), "gs://gcp_customer_data/blobs/"))
blob_store.append_source(Sha1Store(GSRaw(), "gs://my_project/more_files/"))

data = blob_store.get("4bc39c7d87318382feb3cc5a684c767fbd913968")

You can then use parallelization to efficiently map an iterator of hashes into an iterator of byte buffers:

from ultima import ultimap

data_iter = ultimap(blob_store.get, iter_hashes, backend='threading', n_workers=16)

API sources and caching layers

Let's also assume that you have an API that can retrieve blobs given their SHA1. You would like to use it for fetching blobs, but only when they're not found in the above "passive" sources.

You can implement a Sha1APISource for your API, and add it to the Composite object:

from epic.bitstore import Sha1APISource

class MyAPIStore(Sha1APISource):
    def __init__(self, api_client):
        super().__init__()
        self.api_client = api_client
    
    def api_get(self, sha1):
        # return None for a blob that can't be found
        return self.api_client.get_bytes(sha1)

# ctd after adding the three passive stores
blob_store.append_source(MyAPIStore(my_api_client))

API sources are often expensive, whether in cost or in latency. You can add a caching store and configure the API source to save fetched blobs into the cache. It is important to append the cache before the API source, so that cached blobs take precedence.

Append the cache and the API source:

from epic.bitstore import Sha1Cache, GSRaw

# ctd after adding the three passive stores
blob_store.append_cache(Sha1Cache(GSRaw(), "gs://cache_for_api/"))
blob_store.append_source(MyAPIStore(my_api_client))

Now, when you retrieve a missing blob for the first time, the API is used; after that, the blob is served from the cache.
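The cache-before-API lookup order can be illustrated with a small, self-contained sketch. Note that this is plain Python with hypothetical names, not epic-bitstore's internals:

```python
# Illustrative sketch of the cache-before-API lookup order.
# CountingAPI and cached_get are hypothetical, not part of epic-bitstore.
class CountingAPI:
    """Stands in for an expensive API backend; counts how often it is hit."""
    def __init__(self, blobs):
        self.blobs = blobs
        self.calls = 0

    def get(self, sha1):
        self.calls += 1
        return self.blobs.get(sha1)  # None when the blob is unknown

def cached_get(sha1, cache, api):
    """Try the cache first; on a miss, ask the API and populate the cache."""
    if sha1 in cache:
        return cache[sha1]
    data = api.get(sha1)
    if data is not None:
        cache[sha1] = data
    return data

api = CountingAPI({"4bc39c7d87318382feb3cc5a684c767fbd913968": b"blob bytes"})
cache = {}
cached_get("4bc39c7d87318382feb3cc5a684c767fbd913968", cache, api)  # hits the API
cached_get("4bc39c7d87318382feb3cc5a684c767fbd913968", cache, api)  # served from cache
print(api.calls)  # 1
```

The same ordering is why the cache must be appended before the API source: the Composite consults sources in append order, so the cache absorbs repeat lookups.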

Download files

Source Distribution

epic_bitstore-1.1.0.tar.gz (6.3 kB)

Uploaded Source

File details

Details for the file epic_bitstore-1.1.0.tar.gz.

File metadata

  • Download URL: epic_bitstore-1.1.0.tar.gz
  • Upload date:
  • Size: 6.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.5 CPython/3.13.1 Darwin/24.2.0

File hashes

Hashes for epic_bitstore-1.1.0.tar.gz
  • SHA256: 9d1fbf53359e36d2bb21cf562d0d31ac51aedc549b1e454907f3ed7979b386b1
  • MD5: a54832675e78f093a52d391898f6846d
  • BLAKE2b-256: 05d3f0dd0fd460d154d29cb28fead6f15cd9c35f2a04691865f403b2ee69d37c
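
To check a downloaded sdist against the published digests, you can stream it through Python's hashlib. This is a generic sketch; the filename assumes the tarball was saved to the current directory:

```python
# Sketch: verify a downloaded file against a published SHA-256 digest.
import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Stream the file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "9d1fbf53359e36d2bb21cf562d0d31ac51aedc549b1e454907f3ed7979b386b1"
# Uncomment after downloading the sdist:
# assert sha256_of("epic_bitstore-1.1.0.tar.gz") == expected
```

Streaming in chunks keeps memory use flat regardless of file size; for a 6.3 kB sdist this hardly matters, but the same helper works for large blobs.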
