imohash

Fast hashing for large files

Project description

imohash is a fast, constant-time hashing library. It uses file size and sampling to calculate hashes quickly, regardless of file size. It was originally released as a Go library.

imosum is a sample application to hash files from the command line, similar to md5sum.

Alternative implementations

Go: https://github.com/kalafut/imohash
Java: https://github.com/dynatrace-oss/hash4j
Rust: https://github.com/hiql/imohash

Installation

pip install imohash

Usage

As a library:

from imohash import hashfile, hashfileobject

hashfile('foo.txt')
'O\x9b\xbd\xd3[\x86\x9dE\x0e3LI\x83\r~\xa3'

hashfile('foo.txt', hexdigest=True)
'a608658926d8aa86b3db8208ad279bfe'

# just hash the whole file if smaller than 200000 bytes. Default is 128K
hashfile('foo.txt', sample_threshhold=200000)
'x86\x9dE\x0e3LI\x83\r~\xa3O\x9b\xbd\xd3[E'

# use samples of 1000 bytes. Default is 16K
hashfile('foo.txt', sample_size=1000)
'E\x0e3LI\x83\r~\xa3O\x9b\xbd\xd3[E\x23\x25'

# hash an already opened file.
# note: the file-like object passed in should be in binary mode. Text mode
#       behavior is undefined (and likely will raise an exception)
f = open('foo.txt', 'rb')
hashfileobject(f)
'O\x9b\xbd\xd3[\x86\x9dE\x0e3LI\x83\r~\xa3'

# hash a file on a remote server
import paramiko
ssh = paramiko.SSHClient()
ssh.load_system_host_keys()  # by default, connect() rejects hosts that are not in known_hosts
ssh.connect('host', username='username', password='verysecurepassword')
ftp = ssh.open_sftp()
hashfileobject(ftp.file('/path/to/remote/file/foo.txt'))
'O\x9b\xbd\xd3[\x86\x9dE\x0e3LI\x83\r~\xa3'

Or from the command line:

imosum *.jpg

Uses

Because imohash only reads a small portion of a file’s data, it is very fast and well suited to file synchronization and deduplication, especially over a fairly slow network. A need to manage media (photos and video) over Wi-Fi between a NAS and multiple family computers is how the library was born.

If you just need to check whether two files are the same, and understand the limitations that sampling imposes (see below), imohash may be a good fit.
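
As a rough illustration of the deduplication use case, the sketch below groups paths by hash and reports groups with more than one member. The names `group_by_hash`, `full_sha256` and `demo` are hypothetical helpers, not part of imohash, and the default hasher is a stdlib stand-in that reads the whole file so the example runs without extra packages.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def full_sha256(path):
    """Stand-in hasher (hypothetical): reads the whole file with the stdlib."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest()


def group_by_hash(paths, hasher=full_sha256):
    """Return lists of paths that share a hash (duplicate candidates)."""
    groups = defaultdict(list)
    for path in paths:
        groups[hasher(path)].append(path)
    # Only hashes produced by two or more paths are interesting.
    return [sorted(group) for group in groups.values() if len(group) > 1]


def demo():
    """Write three small files, two of them identical, and group them."""
    contents = {"a.bin": b"same", "b.bin": b"same", "c.bin": b"different"}
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, data in contents.items():
            path = os.path.join(tmp, name)
            with open(path, "wb") as f:
                f.write(data)
            paths.append(path)
        groups = group_by_hash(paths)
        return [[os.path.basename(p) for p in group] for group in groups]
```

With imohash installed, passing `hasher=hashfile` (imported from `imohash` as in the Usage section) should swap in the sampled hash, since `hashfile` takes a path and returns a digest. Matching groups are then candidates rather than proof of equality.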

Misuses

Because imohash only reads a small portion of a file’s data, it is not suitable for:

file verification or integrity monitoring
cases where fixed-size files are manipulated
anything cryptographic
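
To make the fixed-size limitation concrete, here is a small sketch that does not use imohash at all. It only compares the regions a sampling hash would look at. The helper `sampled_regions` and the exact offsets are simplifying assumptions, not the library's implementation.

```python
SAMPLE_SIZE = 16 * 1024  # default sample size mentioned in this document


def sampled_regions(data, sample_size):
    """Return the beginning, middle and end slices of data.

    Simplified illustration; assumes len(data) is well above 3 * sample_size.
    """
    mid = len(data) // 2
    return (
        data[:sample_size],
        data[mid:mid + sample_size],
        data[-sample_size:],
    )


original = bytes(1024 * 1024)       # 1 MiB of zero bytes
modified = bytearray(original)
modified[100_000] = 0xFF            # one-byte change between the first and middle samples
modified = bytes(modified)

same_size = len(original) == len(modified)
same_samples = (
    sampled_regions(original, SAMPLE_SIZE) == sampled_regions(modified, SAMPLE_SIZE)
)
```

Both `same_size` and `same_samples` are true even though the contents differ, so any hash built only from size and these samples cannot tell the two apart.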

Design

(Note: a more precise description is provided in the algorithm description.)

imohash works by hashing small chunks of data from the beginning, middle and end of a file. It also incorporates the file size into the final 128-bit hash. This approach is based on a few assumptions which will vary by application. First, file size alone tends (1) to be a pretty good differentiator, especially as file size increases. And when people do things to files (such as editing photos), size tends to change. So size is used directly in the hash, and any files that have different sizes will have different hashes.

Size is an effective differentiator but isn’t sufficient. It can show that two files aren’t the same, but to increase confidence that like-size files are the same, a few segments are hashed using murmur3, a fast and effective hashing algorithm. By default, 16K chunks from the beginning, middle and end of the file are used. The ends of files often contain metadata which is more prone to changing without affecting file size. The middle is for good measure. The sample size can be changed for your application.

(1) Try du -a . | sort -nr | less on a sample of your files to check this assertion.

Small file exemption

Small files are more likely to collide on size than large ones. They’re also probably more likely to change in subtle ways that sampling will miss (e.g. editing a large text file). For this reason, imohash will simply hash the entire file if it is less than 128K. This parameter is also configurable.
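
The design and the small file exemption can be sketched together as follows. This is an illustration under stated assumptions, not the library's code: `sketch_hash` is a hypothetical name, `hashlib.blake2b` stands in for murmur3 so the example needs only the stdlib, the way the size is mixed in (overwriting the first 8 bytes) is an illustrative choice rather than imohash's encoding, and the extra `3 * sample_size` guard is added here for safety. Its output will not match imohash.

```python
import hashlib
import io
import os

SAMPLE_THRESHOLD = 128 * 1024  # default from the text: smaller files are hashed whole
SAMPLE_SIZE = 16 * 1024        # default from the text: bytes per sample


def sketch_hash(f, sample_threshold=SAMPLE_THRESHOLD, sample_size=SAMPLE_SIZE):
    """Illustrative 16-byte size-plus-samples hash of a binary file object."""
    f.seek(0, os.SEEK_END)
    size = f.tell()
    f.seek(0)

    # Small file exemption. The 3 * sample_size guard is an assumption of
    # this sketch so the three samples always fit inside the file.
    if size < sample_threshold or size < 3 * sample_size:
        data = f.read()
    else:
        head = f.read(sample_size)
        f.seek(size // 2)
        middle = f.read(sample_size)
        f.seek(-sample_size, os.SEEK_END)
        tail = f.read(sample_size)
        data = head + middle + tail

    # blake2b is a stand-in for murmur3 (assumption, stdlib only).
    digest = hashlib.blake2b(data, digest_size=16).digest()
    # Put the size directly into the result so different sizes always differ.
    return size.to_bytes(8, "big") + digest[8:]


large = bytes(300_000)
tail_edit = bytearray(large)
tail_edit[-1] = 1            # inside the end sample
hidden_edit = bytearray(large)
hidden_edit[50_000] = 1      # outside all three samples

h_large = sketch_hash(io.BytesIO(large))
h_tail = sketch_hash(io.BytesIO(bytes(tail_edit)))
h_hidden = sketch_hash(io.BytesIO(bytes(hidden_edit)))
```

In this sketch `h_tail` differs from `h_large` because the edit falls in a sampled region, while `h_hidden` equals `h_large`, which is the trade-off the sampling approach accepts.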

Performance

The standard hash performance metrics make no sense for imohash since it’s only reading a limited set of the data. That said, the real-world performance is very good. If you are working with large files and/or a slow network, expect huge speedups. (spoiler: reading 48K is quicker than reading 500MB.)
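
The 48K figure follows from the defaults: three samples of 16K each. A minimal sketch of that arithmetic, using a hypothetical helper `bytes_read` that assumes files at or above the threshold are read as exactly three samples:

```python
def bytes_read(file_size, sample_threshold=128 * 1024, sample_size=16 * 1024):
    """Estimate how many bytes a sampling hash reads for a file of file_size bytes."""
    if file_size < sample_threshold:
        return file_size      # small files are read in full
    return 3 * sample_size    # beginning, middle and end samples


small_file = 4 * 1024
large_file = 500 * 1024 * 1024
report = {size: bytes_read(size) for size in (small_file, large_file)}
```

For the 500MB file the estimate is 49152 bytes (48K), independent of how much larger the file gets.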

Name

Inspired by ILS marker beacons.

Credits

The “sparseFingerprints” used in TMSU gave me some confidence in this approach to hashing.
Sébastien Paolacci’s murmur3 library does all of the heavy lifting in the Go version.
As does Hajime Senuma’s mmh3 library for the Python version.

Release history

1.1.0 (this version): Sep 5, 2024
1.0.5: Apr 27, 2023
1.0.4: Jul 29, 2018
1.0.3: Jul 29, 2018
1.0.2: Sep 29, 2017
1.0.1: Sep 29, 2017
1.0.0: Jan 16, 2017

Download files

Source Distribution

imohash-1.1.0.tar.gz (6.1 kB), uploaded Sep 5, 2024, Source

Built Distribution

imohash-1.1.0-py2.py3-none-any.whl (6.6 kB), uploaded Sep 5, 2024, Python 2 and Python 3

File details

Details for the file imohash-1.1.0.tar.gz.

File metadata

Download URL: imohash-1.1.0.tar.gz
Upload date: Sep 5, 2024
Size: 6.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.1

File hashes

Hashes for imohash-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`087a608e88021b13967994ed2888d6f685943717f52afd16bb7f85105184ed6b`
MD5	`43aaa64e37f0c598a390bc4456a9912d`
BLAKE2b-256	`a6391d83aeacb40fc094c8151734d923d4f8f10277df762dd8df1ab00cffdd05`

File details

Details for the file imohash-1.1.0-py2.py3-none-any.whl.

File metadata

Download URL: imohash-1.1.0-py2.py3-none-any.whl
Upload date: Sep 5, 2024
Size: 6.6 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.1

File hashes

Hashes for imohash-1.1.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`e93d70e5cbd7a4356df6289a0f3a5b44cded86d7ce6c1566bd215cebfb3e332a`
MD5	`87d73c27886e84f0ad797007d998e2e2`
BLAKE2b-256	`93a7d961461048db0564d03909ca266aa9c0716b0651b404ea3f68b16d399d52`
