Skip to main content

A content-addressable file management system.

Project description

version travis coveralls license

HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file’s hash.

Typical use cases for this kind of system are ones where:

  • Files are written once and never change (e.g. image storage).

  • It’s desirable to have no duplicate files (e.g. user uploads).

  • File metadata is stored elsewhere (e.g. in a database).

Extra

this was forked from https://github.com/dgilland/hashfs/ and it supports:

  • init all directory to optimize the performance

  • support put file with hashid

  • change str to pathlib.Path

Features

  • Files are stored once and never duplicated.

  • Uses an efficient folder structure optimized for a large number of files. File paths are based on the content hash and are nested based on the first n number of characters.

  • Can save files from local file paths or readable objects (open file handlers, IO buffers, etc).

  • Able to repair the root folder by reindexing all files. Useful if the hashing algorithm or folder structure options change or to initialize existing files.

  • Supports any hashing algorithm available via hashlib.new.

  • Python 2.7+/3.3+ compatible.

Quickstart

Install using pip:

pip install hashfs

Initialization

from hashfs import HashFS

Designate a root folder for HashFS. If the folder doesn’t already exist, it will be created.

# Set the `depth` to the number of subfolders the file's hash should be split when saving.
# Set the `width` to the desired width of each subfolder.
fs = HashFS('temp_hashfs', depth=4, width=1, algorithm='sha256')

# With depth=4 and width=1, files will be saved in the following pattern:
# temp_hashfs/a/b/c/d/efghijklmnopqrstuvwxyz

# With depth=3 and width=2, files will be saved in the following pattern:
# temp_hashfs/ab/cd/ef/ghijklmnopqrstuvwxyz

NOTE: The algorithm value should be a valid string argument to hashlib.new().

Basic Usage

HashFS supports basic file storage, retrieval, and removal as well as some more advanced features like file repair.

Storing Content

Add content to the folder using either readable objects (e.g. StringIO) or file paths (e.g. 'a/path/to/some/file').

from io import StringIO

some_content = StringIO('some content')

address = fs.put(some_content)

# Or if you'd like to save the file with an extension...
address = fs.put(some_content, '.txt')

# The id of the file (i.e. the hexdigest of its contents).
address.id

# The absolute path where the file was saved.
address.abspath

# The path relative to fs.root.
address.relpath

# Whether the file previously existed.
address.is_duplicate

Retrieving File Address

Get a file’s HashAddress by address ID or path. This address would be identical to the address returned by put().

assert fs.get(address.id) == address
assert fs.get(address.relpath) == address
assert fs.get(address.abspath) == address
assert fs.get('invalid') is None

Retrieving Content

Get a BufferedReader handler for an existing file by address ID or path.

fileio = fs.open(address.id)

# Or using the full path...
fileio = fs.open(address.abspath)

# Or using a path relative to fs.root
fileio = fs.open(address.relpath)

NOTE: When getting a file that was saved with an extension, it’s not necessary to supply the extension. Extensions are ignored when looking for a file based on the ID or path.

Removing Content

Delete a file by address ID or path.

fs.delete(address.id)
fs.delete(address.abspath)
fs.delete(address.relpath)

NOTE: When a file is deleted, any parent directories above the file will also be deleted if they are empty directories.

Advanced Usage

Below are some of the more advanced features of HashFS.

Repairing Files

The HashFS files may not always be in sync with it’s depth, width, or algorithm settings (e.g. if HashFS takes ownership of a directory that wasn’t previously stored using content hashes or if the HashFS settings change). These files can be easily reindexed using repair().

repaired = fs.repair()

# Or if you want to drop file extensions...
repaired = fs.repair(extensions=False)

WARNING: It’s recommended that a backup of the directory be made before repairing just in case something goes wrong.

Walking Corrupted Files

Instead of actually repairing the files, you can iterate over them for custom processing.

for corrupted_path, expected_address in fs.corrupted():
    # do something

WARNING: HashFS.corrupted() is a generator so be aware that modifying the file system while iterating could have unexpected results.

Walking All Files

Iterate over files.

for file in fs.files():
    # do something

# Or using the class' iter method...
for file in fs:
    # do something

Iterate over folders that contain files (i.e. ignore the nested subfolders that only contain folders).

for folder in fs.folders():
    # do something

Computing Size

Compute the size in bytes of all files in the root directory.

total_bytes = fs.size()

Count the total number of files.

total_files = fs.count()

# Or via len()...
total_files = len(fs)

For more details, please see the full documentation at http://hashfs.readthedocs.org.

Changelog

v0.7.2 (2019-10-24)

  • Fix out-of-memory issue when computing file ID hashes of large files.

v0.7.1 (2018-10-13)

  • Replace usage of distutils.dir_util.mkpath with os.path.makedirs.

v0.7.0 (2016-04-19)

  • Use shutil.move instead of shutil.copy to move temporary file created during put operation to HashFS directory.

v0.6.0 (2015-10-19)

  • Add faster scandir package for iterating over files/folders when platform is Python < 3.5. Scandir implementation was added to os module starting with Python 3.5.

v0.5.0 (2015-07-02)

  • Rename private method HashFS.copy to HashFS._copy.

  • Add is_duplicate attribute to HashAddress.

  • Make HashFS.put() return HashAddress with is_duplicate=True when file with same hash already exists on disk.

v0.4.0 (2015-06-03)

  • Add HashFS.size() method that returns the size of all files in bytes.

  • Add HashFS.count()/HashFS.__len__() methods that return the count of all files.

  • Add HashFS.__iter__() method to support iteration. Proxies to HashFS.files().

  • Add HashFS.__contains__() method to support in operator. Proxies to HashFS.exists().

  • Don’t create the root directory (if it doesn’t exist) until at least one file has been added.

  • Fix HashFS.repair() not using extensions argument properly.

v0.3.0 (2015-06-02)

  • Rename HashFS.length parameter/property to width. (breaking change)

v0.2.0 (2015-05-29)

  • Rename HashFS.get to HashFS.open. (breaking change)

  • Add HashFS.get() method that returns a HashAddress or None given a file ID or path.

v0.1.0 (2015-05-28)

  • Add HashFS.get() method that retrieves a reader object given a file ID or path.

  • Add HashFS.delete() method that deletes a file ID or path.

  • Add HashFS.folders() method that returns the folder paths that directly contain files (i.e. subpaths that only contain folders are ignored).

  • Add HashFS.detokenize() method that returns the file ID contained in a file path.

  • Add HashFS.repair() method that reindexes any files under root directory whose file path doesn’t not match its tokenized file ID.

  • Rename Address classs to HashAddress. (breaking change)

  • Rename HashAddress.digest to HashAddress.id. (breaking change)

  • Rename HashAddress.path to HashAddress.abspath. (breaking change)

  • Add HashAddress.relpath which represents path relative to HashFS.root.

v0.0.1 (2015-05-27)

  • First release.

  • Add HashFS class.

  • Add HashFS.put() method that saves a file path or file-like object by content hash.

  • Add HashFS.files() method that returns all files under root directory.

  • Add HashFS.exists() which checks either a file hash or file path for existence.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hashfs2-1.1.0.tar.gz (24.7 kB view details)

Uploaded Source

Built Distribution

hashfs2-1.1.0-py2.py3-none-any.whl (12.7 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file hashfs2-1.1.0.tar.gz.

File metadata

  • Download URL: hashfs2-1.1.0.tar.gz
  • Upload date:
  • Size: 24.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for hashfs2-1.1.0.tar.gz
Algorithm Hash digest
SHA256 04d730ae1584b1b41938342348c07c6ddb5a42cbc04fb3dd6c5b7da486750617
MD5 222ff124a08129d6f9edacb85d34adc6
BLAKE2b-256 180af96b571c7f01d8b6c0b8ba29071f6b7b35767fe378a025994cfdcff824ac

See more details on using hashes here.

File details

Details for the file hashfs2-1.1.0-py2.py3-none-any.whl.

File metadata

  • Download URL: hashfs2-1.1.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 12.7 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for hashfs2-1.1.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 b9fc7ac601482172f0a10828c982969b15962d2a39c06aa34a8aeab31939c2b0
MD5 550c58d5929a8ae30b926480b26e7107
BLAKE2b-256 128d847c8543ec998f7b08034ce01dd4d6f4b2f058c42aca83f4bb3be9056072

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page