FastCDC (content defined chunking) in pure Python.

FastCDC

This package implements the "FastCDC" content defined chunking algorithm in pure Python. The chunker is deterministic: the same input always yields exactly the same chunk boundaries. To learn more about content defined chunking and its applications, see the reference material linked below.

Requirements

  • Python 3.6 or later.

Installing

$ pip3 install fastcdc

Example Usage

The examples directory of the source repository contains an example that reads files of arbitrary size into a memory-mapped buffer, passes them through the chunker, and computes the SHA256 hash digest of each chunk.

$ fastcdc -s 32768 tests/SekienAkashita.jpg
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201
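
The per-chunk hashing step described above can be sketched independently of the chunker. The following is a minimal illustration, not the code from the examples directory: `hash_chunks` is a hypothetical helper, and it assumes the chunk boundaries are already known as (offset, size) pairs.

```python
import hashlib
import mmap


def hash_chunks(path, boundaries):
    """Yield (offset, size, sha256 hex digest) for each (offset, size) pair.

    Hypothetical helper for illustration; not part of the fastcdc package.
    """
    with open(path, "rb") as f:
        # Map the whole file read-only so large files are not loaded eagerly.
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            for offset, size in boundaries:
                digest = hashlib.sha256(buf[offset:offset + size]).hexdigest()
                yield offset, size, digest
```

Called with the boundaries printed above, for instance `hash_chunks("tests/SekienAkashita.jpg", [(0, 32857), (32857, 16408), (49265, 60201)])`, this would be expected to produce one digest per chunk in the same order.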

The tests also contain short examples of using the chunker, such as this snippet:

from fastcdc import FastCDC

# The three sizes are the minimum, average, and maximum chunk size in bytes.
chunker = FastCDC.new("SekienAkashita.jpg", 16384, 32768, 65536)
results = list(chunker)

# Same boundaries as the command-line output above.
assert len(results) == 3
assert results[0].offset == 0
assert results[0].length == 32857
assert results[1].offset == 32857
assert results[1].length == 16408
assert results[2].offset == 49265
assert results[2].length == 60201
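
A useful property to check on any chunker output is that the chunks are contiguous: each chunk starts where the previous one ends. The sketch below uses the offsets and lengths from the snippet above; `Chunk` and `is_contiguous` are hypothetical names introduced only for this illustration and are not part of the package.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects yielded by the chunker.
Chunk = namedtuple("Chunk", ["offset", "length"])

chunks = [Chunk(0, 32857), Chunk(32857, 16408), Chunk(49265, 60201)]


def is_contiguous(chunks):
    """Return True if every chunk begins exactly where the previous one ended."""
    expected = 0
    for chunk in chunks:
        if chunk.offset != expected:
            return False
        expected += chunk.length
    return True


# Sum of all chunk lengths; this should equal the size of the input file.
total = sum(chunk.length for chunk in chunks)
```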

Reference Material

The algorithm is as described in "FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication"; see the paper and presentation for details. There are some minor differences, as described below.

Differences with the FastCDC paper

The explanation below is copied from ronomon/deduplication since this codebase is little more than a translation of that implementation:

The following optimizations and variations on FastCDC are involved in the chunking algorithm:

  • 31-bit integers instead of 64-bit integers, for the sake of the JavaScript reference implementation.
  • A right shift instead of a left shift to remove the need for an additional modulus operator, which would otherwise have been necessary to prevent overflow.
  • Masks are no longer zero-padded since a right shift is used instead of a left shift.
  • A more adaptive threshold based on a combination of average and minimum chunk size (rather than just average chunk size) to decide the pivot point at which to switch masks. A larger minimum chunk size now switches from the strict mask to the eager mask earlier.
  • Masks use 1 bit of chunk size normalization instead of 2 bits of chunk size normalization.
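
The variations listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the package's code: the gear table here is generated from a seeded pseudo-random generator (the real implementation ships a fixed table), the pivot formula is an assumed way of combining average and minimum size, and `find_cut` and `split` are hypothetical names. The boundaries it produces will not match those of the package.

```python
import random

# Assumption: 256 pseudo-random 31-bit integers stand in for the real gear table.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(31) for _ in range(256)]


def find_cut(data, min_size, avg_size, max_size):
    """Return the length of the first chunk of `data` (illustrative sketch)."""
    n = min(len(data), max_size)
    if n <= min_size:
        return n
    bits = avg_size.bit_length() - 1       # floor(log2(avg_size))
    mask_strict = (1 << (bits + 1)) - 1    # one extra bit: harder to match
    mask_eager = (1 << (bits - 1)) - 1     # one fewer bit: easier to match
    # Assumed pivot: a larger minimum size moves the switch to the eager mask earlier.
    pivot = avg_size - min(avg_size, min_size + (min_size + 1) // 2)
    h = 0
    for i in range(min_size, n):
        # Right shift plus a 31-bit table entry keeps h below 2**32,
        # so no modulus is needed to prevent overflow.
        h = (h >> 1) + GEAR[data[i]]
        mask = mask_strict if i < pivot else mask_eager
        if (h & mask) == 0:
            return i + 1
    return n


def split(data, min_size, avg_size, max_size):
    """Split `data` into a list of (offset, length) pairs."""
    view = memoryview(data)
    chunks = []
    offset = 0
    while offset < len(view):
        length = find_cut(view[offset:], min_size, avg_size, max_size)
        chunks.append((offset, length))
        offset += length
    return chunks
```

Because the masks are not zero-padded, a cut point is declared when the low bits of the hash are all zero. The strict mask tests two more bits than the eager mask, which corresponds to one bit of normalization on either side of the average.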

The primary objective of this codebase was to have a Python implementation with a permissive license, which could be used for new projects, without concern for data parity with existing implementations.

Prior Art

This package is little more than a rewrite of the implementation by Joran Dirk Greef (see the ronomon/deduplication reference above), in Python, and greatly simplified in usage. One significant difference is that the chunker in this package does not calculate a hash digest of the chunks.

Change Log

[1.0.0] - 2019-05-07

Added
