Cloneable (with rclone) content-addressable storage for Python
EverCas
EverCas is a content-addressable file management system. What does that mean? Simply, that EverCas manages a directory where files are saved based on the file's hash.
Typical use cases for this kind of system are ones where:
- Files are written once and never change (e.g. image storage).
- It's desirable to have no duplicate files (e.g. user uploads).
- File metadata is stored elsewhere (e.g. in a database).
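The idea of saving content under its hash can be sketched with the standard library alone. This is only an illustration of content addressing, not EverCas code: the `content_id` and `put` helpers and the in-memory `store` dictionary are hypothetical names standing in for the managed directory.

```python
import hashlib

def content_id(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest that serves as the content's address."""
    return hashlib.new(algorithm, data).hexdigest()

# Hypothetical in-memory stand-in for the managed directory: id -> bytes.
store = {}

def put(data: bytes) -> tuple[str, bool]:
    """Store data under its hash and report whether it was already present."""
    key = content_id(data)
    is_duplicate = key in store
    store.setdefault(key, data)  # identical content is kept only once
    return key, is_duplicate

first = put(b"some content")
second = put(b"some content")  # same bytes, same address, no second copy
```

Because the address is derived from the bytes themselves, storing the same content twice yields the same id and a single stored copy.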
Features
- Files are stored once and never duplicated.
- Uses an efficient folder structure optimized for a large number of files. File paths are based on the content hash and are nested based on the first n characters.
- Can save files from local file paths or readable objects (open file handles, IO buffers, etc.).
- Pluggable put strategies, allowing fine-grained control of how files are added.
- Able to repair the root folder by reindexing all files. Useful if the hashing algorithm or folder structure options change or to initialize existing files.
- Supports any hashing algorithm available via hashlib.new.
- Python 3.10+ compatible.
- Supports hard-linking files into the EverCas-managed directory on compatible filesystems.
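As a rough illustration of how one function can accept either a local file path or a readable object, the sketch below hashes both kinds of source in chunks. The `hexdigest_of` helper is a hypothetical name, not part of EverCas, and it assumes binary streams (a text stream such as StringIO would need encoding first).

```python
import hashlib
import io
from typing import BinaryIO, Union

def hexdigest_of(source: Union[str, BinaryIO], algorithm: str = "sha256",
                 chunk_size: int = 65536) -> str:
    """Hash a file path or an already-open binary stream in fixed-size chunks."""
    hasher = hashlib.new(algorithm)
    opened_here = isinstance(source, str)
    stream = open(source, "rb") if opened_here else source
    try:
        # Read until the stream returns an empty bytes object (end of data).
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    finally:
        if opened_here:
            stream.close()  # only close streams this function opened
    return hasher.hexdigest()

digest = hexdigest_of(io.BytesIO(b"some content"))
```

Reading in chunks keeps memory use flat regardless of file size.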
Links
- Project: https://github.com/weedonandscott/evercas
- Documentation: https://weedonandscott.github.io/evercas/
- PyPI: https://pypi.python.org/pypi/evercas/
Quickstart
Install using pip:
pip install evercas
Initialization
from evercas import EverCas
Designate a root folder for EverCas. If the folder doesn't already exist, it will be created.
# Set `depth` to the number of nested subfolders the file's hash is split into when saving.
# Set `width` to the number of hash characters used for each subfolder name.
fs = EverCas('temp_evercas', depth=4, width=1, algorithm='sha256')
# With depth=4 and width=1, files will be saved in the following pattern:
# temp_evercas/a/b/c/d/efghijklmnopqrstuvwxyz
# With depth=3 and width=2, files will be saved in the following pattern:
# temp_evercas/ab/cd/ef/ghijklmnopqrstuvwxyz
NOTE: The algorithm value should be a valid string argument to hashlib.new().
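The effect of depth and width on the saved path can be reproduced with a few lines of string slicing. This is a sketch of the pattern shown in the comments above, not the library's own implementation; `shard` and `sharded_path` are hypothetical helper names.

```python
import os

def shard(digest: str, depth: int, width: int) -> list[str]:
    """Split a digest into `depth` prefixes of `width` characters plus the remainder."""
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(digest[depth * width:])
    return [part for part in parts if part]  # drop empty pieces for short digests

def sharded_path(root: str, digest: str, depth: int = 4, width: int = 1) -> str:
    """Join the root folder and the sharded digest into a file path."""
    return os.path.join(root, *shard(digest, depth, width))

# The alphabet stands in for a real hex digest, as in the comments above.
alphabet = "abcdefghijklmnopqrstuvwxyz"
deep = shard(alphabet, depth=4, width=1)
wide = shard(alphabet, depth=3, width=2)
```

Here `deep` splits into a, b, c, d and the remaining characters, while `wide` splits into ab, cd, ef and the remainder, matching the two patterns listed above.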
Basic Usage
EverCas supports basic file storage, retrieval, and removal, as well as some more advanced features like file repair.
Storing Content
Add content to the folder using either readable objects (e.g. StringIO) or file paths (e.g. 'a/path/to/some/file').
from io import StringIO
some_content = StringIO('some content')
address = fs.put(some_content)
# Or if you'd like to save the file with an extension...
address = fs.put(some_content, '.txt')
# Put all files in a directory
for srcpath, address in fs.putdir("dir"):
    ...  # process each (srcpath, address) pair
# Put all files in a directory tree recursively
for srcpath, address in fs.putdir("dir", recursive=True):
    ...
# Put all files in a directory tree, keeping their file extensions
for srcpath, address in fs.putdir("dir", extensions=True):
    ...  # address.abspath will have the same file extension as srcpath
# The id of the file (i.e. the hexdigest of its contents).
address.id
# The absolute path where the file was saved.
address.abspath
# The path relative to fs.root.
address.relpath
# Whether the file previously existed.
address.is_duplicate
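To make the relationship between these four attributes concrete, here is a hypothetical stand-in record. It is not the library's HashAddress class; `AddressSketch` and `make_address` are illustrative names, and the only assumption carried over from the text is that abspath is the root folder joined with relpath.

```python
import os
from typing import NamedTuple

class AddressSketch(NamedTuple):
    """Hypothetical record mirroring the attributes described above."""
    id: str                     # hexdigest of the file's contents
    relpath: str                # path relative to the root folder
    abspath: str                # absolute path where the file was saved
    is_duplicate: bool = False  # whether the file previously existed

def make_address(root: str, relpath: str, digest: str,
                 is_duplicate: bool = False) -> AddressSketch:
    """Derive the absolute path from the root folder and the relative path."""
    abspath = os.path.abspath(os.path.join(root, relpath))
    return AddressSketch(digest, relpath, abspath, is_duplicate)

addr = make_address("temp_evercas", os.path.join("a", "b", "cdef.txt"), "abcdef")
```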
Retrieving File Address
Get a file's HashAddress by address ID or path. This address is identical to the address returned by put().
assert fs.get(address.id) == address
assert fs.get(address.relpath) == address
assert fs.get(address.abspath) == address
assert fs.get('invalid') is None
Retrieving Content
Get a BufferedReader handle for an existing file by address ID or path.
fileio = fs.open(address.id)
# Or using the full path...
fileio = fs.open(address.abspath)
# Or using a path relative to fs.root
fileio = fs.open(address.relpath)
NOTE: When getting a file that was saved with an extension, it's not necessary to supply the extension. Extensions are ignored when looking for a file based on the ID or path.
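One way an extension-agnostic lookup could work is to build the expected path from the ID and then match any suffix with a glob. The sketch below is an assumption about the general technique, not a description of how EverCas resolves paths internally; `find_by_id` is a hypothetical helper.

```python
import glob
import os
import tempfile

def find_by_id(root, digest, depth=4, width=1):
    """Locate a stored file by its id, whether or not it has an extension."""
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(digest[depth * width:])
    base = os.path.join(root, *parts)
    if os.path.isfile(base):
        return base  # stored without an extension
    # Otherwise accept any extension appended to the expected path.
    matches = glob.glob(glob.escape(base) + ".*")
    return matches[0] if matches else None

with tempfile.TemporaryDirectory() as root:
    digest = "abcdef0123"  # shortened stand-in for a real hex digest
    target = os.path.join(root, "a", "b", "c", "d", "ef0123.txt")
    os.makedirs(os.path.dirname(target))
    with open(target, "w") as fh:
        fh.write("some content")
    found = find_by_id(root, digest)
    found_matches = found is not None and os.path.samefile(found, target)
    missing = find_by_id(root, "ffffffffff")
```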
Removing Content
Delete a file by address ID or path.
fs.delete(address.id)
fs.delete(address.abspath)
fs.delete(address.relpath)
NOTE: When a file is deleted, any parent directories above the file will also be deleted if they are empty directories.
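The pruning behaviour described in this note can be approximated as follows. This is a standard-library sketch under the assumption that pruning stops at the root folder; `delete_and_prune` is a hypothetical name, not an EverCas function.

```python
import os
import tempfile

def delete_and_prune(path, root):
    """Remove a file, then remove each empty parent folder up to (not including) root."""
    root = os.path.realpath(root)
    parent = os.path.dirname(os.path.realpath(path))
    os.remove(path)
    while parent != root and parent.startswith(root + os.sep):
        try:
            os.rmdir(parent)  # raises OSError if the folder is not empty
        except OSError:
            break
        parent = os.path.dirname(parent)

with tempfile.TemporaryDirectory() as root:
    first = os.path.join(root, "a", "b", "c", "file1")
    second = os.path.join(root, "a", "x", "file2")
    for path in (first, second):
        os.makedirs(os.path.dirname(path))
        with open(path, "w") as fh:
            fh.write("data")
    delete_and_prune(first, root)
    pruned = not os.path.exists(os.path.join(root, "a", "b"))
    sibling_kept = os.path.isfile(second)  # "a" still holds "x", so it stays
    delete_and_prune(second, root)
    root_kept_empty = os.path.isdir(root) and os.listdir(root) == []
```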
Advanced Usage
Below are some of the more advanced features of EverCas.
Repairing Files
The files in an EverCas directory may not always be in sync with its depth, width, or algorithm settings (e.g. if EverCas takes ownership of a directory that wasn't previously stored using content hashes, or if the EverCas settings change). These files can be reindexed using repair().
repaired = fs.repair()
# Or if you want to drop file extensions...
repaired = fs.repair(extensions=False)
WARNING: It's recommended that a backup of the directory be made before repairing just in case something goes wrong.
Walking Corrupted Files
Instead of actually repairing the files, you can iterate over them for custom processing.
for corrupted_path, expected_address in fs.corrupted():
    ...  # do something
WARNING: EverCas.corrupted() is a generator, so be aware that modifying the file system while iterating could have unexpected results.
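Conceptually, a file is out of place when the path derived from its content hash differs from where it currently sits. The sketch below shows that check with the standard library; `expected_relpath` and `find_misplaced` are hypothetical helpers, and the results are collected into a list before any further processing so that later changes to the file system do not interfere with the walk.

```python
import hashlib
import os
import tempfile

def expected_relpath(digest, depth=4, width=1):
    """Relative path a digest maps to under the given depth and width."""
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(digest[depth * width:])
    return os.path.join(*parts)

def find_misplaced(root, depth=4, width=1, algorithm="sha256"):
    """Yield (current_path, expected_path) for files not stored at their hash path."""
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as fh:
                digest = hashlib.new(algorithm, fh.read()).hexdigest()
            expected = os.path.join(root, expected_relpath(digest, depth, width))
            # Ignore the extension when comparing, as lookups do.
            if os.path.splitext(path)[0] != expected:
                yield path, expected

with tempfile.TemporaryDirectory() as root:
    good_digest = hashlib.sha256(b"some content").hexdigest()
    good_path = os.path.join(root, expected_relpath(good_digest))
    os.makedirs(os.path.dirname(good_path))
    with open(good_path, "wb") as fh:
        fh.write(b"some content")
    with open(os.path.join(root, "stray.txt"), "wb") as fh:
        fh.write(b"other content")
    misplaced = list(find_misplaced(root))  # snapshot before modifying anything
```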
Walking All Files
Iterate over files.
for file in fs.files():
    ...  # do something
# Or using the class' iter method...
for file in fs:
    ...  # do something
Iterate over folders that contain files (i.e. ignore the nested subfolders that only contain folders).
for folder in fs.folders():
    ...  # do something
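The distinction between folders that hold files and folders that only hold other folders can be illustrated with os.walk. This is a sketch of the filtering rule only; `folders_with_files` is a hypothetical helper, not the library's implementation.

```python
import os
import tempfile

def folders_with_files(root):
    """Yield only the folders that directly contain at least one file."""
    for folder, _subdirs, files in os.walk(root):
        if files:
            yield folder

with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "a", "b")
    os.makedirs(leaf)
    os.makedirs(os.path.join(root, "c", "d"))  # branch with no files at all
    with open(os.path.join(leaf, "cdef"), "w") as fh:
        fh.write("data")
    found = sorted(os.path.relpath(f, root) for f in folders_with_files(root))
```

In this tree only the folder a/b is reported; a, c, and c/d are skipped because none of them directly contains a file.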
Computing Size
Compute the size in bytes of all files in the root directory.
total_bytes = fs.size()
Count the total number of files.
total_files = fs.count()
# Or via len()...
total_files = len(fs)
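For a sense of what these totals represent, both can be computed in one pass over a directory tree. The `size_and_count` helper below is a hypothetical standard-library equivalent, not the code EverCas runs.

```python
import os
import tempfile

def size_and_count(root):
    """Return (total bytes, number of files) for every file under root."""
    total_bytes = 0
    total_files = 0
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(folder, name))
            total_files += 1
    return total_bytes, total_files

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    with open(os.path.join(root, "a", "b", "one"), "wb") as fh:
        fh.write(b"abc")    # 3 bytes
    with open(os.path.join(root, "a", "two"), "wb") as fh:
        fh.write(b"12345")  # 5 bytes
    totals = size_and_count(root)
```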
Hard-linking files
You can use the built-in "link" put strategy to hard-link files into the EverCas directory if the platform and filesystem support it. This will automatically and silently fall back to copying if a hard-link can't be made, e.g. because the source is on a different device, the EverCas directory is on a filesystem that does not support hard links or the source file already has the operating system's maximum allowed number of hard links to it.
import os
newpath = fs.put("file/path", put_strategy="link").abspath
assert os.path.samefile("file/path", newpath)
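The link-then-fall-back behaviour can be sketched with os.link and shutil.copyfile. This shows the general pattern rather than the built-in "link" strategy itself; `link_or_copy` is a hypothetical helper that also reports which method was used, which the built-in strategy is described as doing silently.

```python
import os
import shutil
import tempfile

def link_or_copy(src, dst):
    """Hard-link src to dst, falling back to a plain copy if linking fails."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        os.link(src, dst)
        return "link"
    except OSError:
        # Raised e.g. across devices or on filesystems without hard links.
        shutil.copyfile(src, dst)
        return "copy"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.bin")
    with open(src, "wb") as fh:
        fh.write(b"some content")
    dst = os.path.join(tmp, "store", "ab", "cdef")
    method = link_or_copy(src, dst)
    with open(dst, "rb") as fh:
        same_content = fh.read() == b"some content"
    linked = os.path.samefile(src, dst)  # True only when a hard link was made
```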
Custom Put Strategy
A custom put strategy gives fine-grained control over how each file or file-like object is stored in the underlying filesystem.
import os
# Implement your own put strategy
def my_put_strategy(evercas, src_stream, dst_path):
    # src_stream is the source data to insert.
    # It is an EverCas.Stream object, which is a Python file-like object.
    # Stream objects also expose the filesystem path of the underlying
    # file via the src_stream.name property.
    # dst_path is the path generated by EverCas, based on the hash of the
    # source data.
    # src_stream.name will be None if there is no underlying file path
    # available (e.g. a StringIO was passed or some other non-file
    # file-like object).
    # It's recommended to check that the name property is available before using it.
    if src_stream.name:
        # Example: rename files instead of copying
        # (be careful with underlying file paths, make sure to test your
        # implementation before using it).
        os.rename(src_stream.name, dst_path)
        # You can also access properties and methods of the EverCas instance
        # using the evercas parameter
        os.chmod(dst_path, evercas.fmode)
    else:
        # The default put strategy is available for use as
        # PutStrategies.copy
        # You can manually call other strategies if you want fallbacks
        # (recommended)
        PutStrategies.copy(evercas, src_stream, dst_path)
# And use it like:
fs.put("myfile", put_strategy=my_put_strategy)
For more details, please see the full documentation at https://weedonandscott.github.io/evercas/.
Acknowledgements
This software is based on HashFS, made by @dgilland with @x11x contributions, and inspired by parts of dud, by @kevin-hanselman.