Epic bitstore — Multi-tiered cloud-backed blob storage system
What is it?
The epic-bitstore Python library provides client access to a multi-tiered blob storage system built on cloud backends, with optional API backends and caching layers.
Usage
For example, let's assume you store blobs in the following locations:
- In an AWS bucket:
s3://aws_customer_data/files/<sha1>
- In a GCP bucket:
gs://gcp_customer_data/blobs/<sha1>
- In another GCP bucket:
gs://my_project/more_files/<sha1>
Using epic-bitstore, you can fetch a blob from any of these stores with a single get command. The library iterates over the sources in order and retrieves the data from the first source that holds the blob. This can also run in parallel over multiple blobs, backed by the ultima parallelization library.
To implement the above strategy, you create a Composite store and add each of your sources in order:
from epic.bitstore import Sha1Composite, Sha1Store, S3Raw, GSRaw
blob_store = Sha1Composite()
blob_store.append_source(Sha1Store(S3Raw(), "s3://aws_customer_data/files/"))
blob_store.append_source(Sha1Store(GSRaw(), "gs://gcp_customer_data/blobs/"))
blob_store.append_source(Sha1Store(GSRaw(), "gs://my_project/more_files/"))
data = blob_store.get("4bc39c7d87318382feb3cc5a684c767fbd913968")
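To make the lookup order concrete, here is a minimal sketch of the first-match-wins logic described above. It is illustrative only, not Sha1Composite's actual implementation, and it assumes each source returns None for a missing blob:

def composite_get(sources, sha1):
    # Try each source in the order it was appended.
    for source in sources:
        data = source.get(sha1)  # assumed to return None when the blob is absent
        if data is not None:
            return data  # first matching source wins
    return None  # not found in any source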
You can then use parallelization to efficiently map an iterator of hashes into an iterator of byte buffers:
from ultima import ultimap
data_iter = ultimap(blob_store.get, iter_hashes, backend='threading', n_workers=16)
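As a follow-up example, you could pair each hash with its fetched bytes and write the blobs to local files. This sketch assumes ultimap yields results in input order (worth confirming in ultima's documentation); the output directory is illustrative, and iter_hashes and blob_store come from the snippets above:

import os

hashes = list(iter_hashes)
out_dir = "/tmp/blobs"  # illustrative destination
os.makedirs(out_dir, exist_ok=True)
for sha1, data in zip(hashes, ultimap(blob_store.get, hashes, backend='threading', n_workers=16)):
    if data is not None:  # skip hashes not found in any source
        with open(os.path.join(out_dir, sha1), "wb") as f:
            f.write(data)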
API sources and caching layers
Let's also assume that you have an API that can retrieve blobs given their SHA1. You would like to use it for fetching blobs, but only when they're not found in the above "passive" sources.
You can implement a Sha1APISource for your API and add it to the Composite object:
from epic.bitstore import Sha1APISource

class MyAPIStore(Sha1APISource):
    def __init__(self, api_client):
        super().__init__()
        self.api_client = api_client

    def api_get(self, sha1):
        # return None for a blob that can't be found
        return self.api_client.get_bytes(sha1)
# continued after adding the three passive stores
blob_store.append_source(MyAPIStore(my_api_client))
API sources are often expensive, in either monetary cost or latency. You can add a caching store and configure the API source to write fetched blobs into the cache. It is important to append the cache before adding the API source, so that cached blobs take precedence over fresh API fetches.
Append the cache and the API source:
from epic.bitstore import Sha1Cache, GSRaw

# continued after adding the three passive stores
blob_store.append_cache(Sha1Cache(GSRaw(), "gs://cache_for_api/"))
blob_store.append_source(MyAPIStore(my_api_client))
Now, the first time you retrieve a blob that is missing from the passive sources, the API is used; from then on, the blob is served from the cache.
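Putting it together, the resolution order for a get now looks roughly like the sketch below. Again, this only illustrates the behavior described above; cache.put is an assumed cache-write hook, and the library's real internals may differ:

def cached_composite_get(passive_sources, cache, api_source, sha1):
    # 1. Try each passive source in the order it was appended.
    for source in passive_sources:
        data = source.get(sha1)
        if data is not None:
            return data
    # 2. The cache was appended before the API source, so it is consulted first.
    data = cache.get(sha1)
    if data is not None:
        return data
    # 3. Fall back to the expensive API source; on a hit, fill the cache
    #    so the next get for this hash stops at step 2.
    data = api_source.get(sha1)
    if data is not None:
        cache.put(sha1, data)  # assumed cache-write hook; actual API may differ
    return data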
Project details
Download files
Source Distribution
File details
Details for the file epic-bitstore-1.0.zip.
File metadata
- Download URL: epic-bitstore-1.0.zip
- Upload date:
- Size: 37.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6d999a48280a9d1c884d593f5c1d746c4dc8a37a73bac655eb3f5847d3d6c4a7
MD5 | 29d23d1183833da28f8a18cbc2d4a966
BLAKE2b-256 | 8a30fb6f7e14037ac84cf7700c1dfce834c289c78acb5f63d8d116bcfeecbf96