
FastCDC (content defined chunking) in pure Python.


FastCDC


This package implements the "FastCDC" content-defined chunking algorithm in Python, with optional Cython support. To learn more about content-defined chunking and its applications, see the reference material linked below.

Requirements

  • Python 3.6 or later. Tested on Linux, macOS, and Windows.

Installing

$ pip install fastcdc

To enable support for the additional hash algorithms (xxhash and blake3), use:

$ pip install fastcdc[hashes]

Example Usage

An example can be found in the examples directory of the source repository, which demonstrates reading files of arbitrary size into a memory-mapped buffer and passing them through the chunker (and computing the SHA256 hash digest of each chunk).
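The memory-mapping and hashing half of that workflow can be sketched with the standard library alone. This is a minimal illustration, not the repository's example: `hash_slices` is a hypothetical helper, and the chunk boundaries are assumed to come from elsewhere (for instance from the chunker) as `(offset, length)` pairs.

```python
import hashlib
import mmap


def hash_slices(path, boundaries):
    """Return (offset, length, sha256 hex digest) for each (offset, length) pair.

    Hypothetical helper: the boundaries are supplied by the caller,
    this function does not compute them.
    """
    results = []
    with open(path, "rb") as f:
        # Map the whole file read-only; the OS pages data in on demand.
        # Note that mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            with memoryview(buf) as view:
                for offset, length in boundaries:
                    # Slicing a memoryview does not copy the bytes.
                    with view[offset:offset + length] as chunk:
                        digest = hashlib.sha256(chunk).hexdigest()
                    results.append((offset, length, digest))
    return results
```

The memoryview slices are released before the map is closed, because closing an mmap while a buffer export is still alive raises `BufferError`.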

Calculate chunks with default settings:

$ fastcdc tests/SekienAkashita.jpg
hash=103159aa68bb1ea98f64248c647b8fe9a303365d80cb63974a73bba8bc3167d7 offset=0 size=22366
hash=3f2b58dc77982e763e75db76c4205aaab4e18ff8929e298ca5c58500fee5530d offset=22366 size=10491
hash=fcfb2f49ccb2640887a74fad1fb8a32368b5461a9dccc28f29ddb896b489b913 offset=32857 size=14094
hash=bd1198535cdb87c5571378db08b6e886daf810873f5d77000a54795409464138 offset=46951 size=18696
hash=d6347a2e5bf586d42f2d80559d4f4a2bf160dce8f77eede023ad2314856f3086 offset=65647 size=43819

Customize min-size, avg-size, max-size, and hash function:

$ fastcdc -mi 16384 -s 32768 -ma 65536 -hf sha256 tests/SekienAkashita.jpg
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201
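Each output line is a run of space-separated `key=value` fields, which makes it simple to post-process. The sketch below assumes the format stays exactly as shown above; `ChunkRecord`, `parse_chunk_line`, and `is_contiguous` are hypothetical names, not part of the package. It parses the three lines from the customized run and checks that the chunks cover the file without gaps or overlaps.

```python
from typing import NamedTuple


class ChunkRecord(NamedTuple):
    digest: str
    offset: int
    size: int


def parse_chunk_line(line):
    """Parse one 'hash=... offset=... size=...' line into a ChunkRecord."""
    fields = dict(part.split("=", 1) for part in line.split())
    return ChunkRecord(fields["hash"], int(fields["offset"]), int(fields["size"]))


def is_contiguous(records):
    """True if each chunk starts exactly where the previous one ended."""
    expected = 0
    for record in records:
        if record.offset != expected:
            return False
        expected = record.offset + record.size
    return True


OUTPUT = """\
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201
"""

records = [parse_chunk_line(line) for line in OUTPUT.splitlines()]
print(is_contiguous(records), sum(r.size for r in records))  # True 109466
```

The total of 109466 bytes matches the sum of the five chunk sizes in the default-settings run, as expected for the same input file.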

Scan files in a directory and report duplication:

$ fastcdc scan ~/Downloads
[####################################]  100%
Files:          1,332
Chunk Sizes:    min 4096 - avg 16384 - max 131072
Unique Chunks:  506,077
Total Data:     9.3 GB
Dupe Data:      873.8 MB
DeDupe Ratio:   9.36 %
Throughput:     135.2 MB/s
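The report does not state how the ratio is derived. The figures are consistent with duplicate bytes divided by total bytes: 873.8 MB / 9.3 GB is roughly 9.4 %, and the small gap to 9.36 % is plausibly rounding in the displayed sizes. Under that assumption, the bookkeeping can be sketched as follows. `dedupe_stats` is a hypothetical helper that takes chunks as `bytes` objects and identifies them by SHA256; it is not the scan command's actual code.

```python
import hashlib


def dedupe_stats(chunks):
    """Return (total_bytes, duplicate_bytes, ratio_percent) for an iterable of chunks."""
    seen = set()
    total = 0
    dupes = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        total += len(chunk)
        if digest in seen:
            # Every repeat of an already-seen chunk counts as duplicate data.
            dupes += len(chunk)
        else:
            seen.add(digest)
    # Guard against empty input to avoid dividing by zero.
    ratio = dupes / total * 100 if total else 0.0
    return total, dupes, ratio
```

For example, `dedupe_stats([b"aaaa", b"bbbb", b"aaaa"])` counts 12 total bytes, 4 of them duplicates, for a ratio of one third.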

Show help:

$ fastcdc
Usage: fastcdc [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  chunkify*  Find variable sized chunks for FILE and compute hashes.
  benchmark  Benchmark chunking performance.
  scan       Scan files in directory and report duplication.

Use from your Python code

The tests also contain short examples of using the chunker, such as this snippet:

from fastcdc import fastcdc

results = list(fastcdc("tests/SekienAkashita.jpg", 16384, 32768, 65536))
assert len(results) == 3
assert results[0].offset == 0
assert results[0].length == 32857
assert results[1].offset == 32857
assert results[1].length == 16408
assert results[2].offset == 49265
assert results[2].length == 60201

Reference Material

The algorithm is as described in "FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication"; see the paper and presentation for details. There are some minor differences, as described below.

Differences with the FastCDC paper

The explanation below is copied from ronomon/deduplication since this codebase is little more than a translation of that implementation:

The following optimizations and variations on FastCDC are involved in the chunking algorithm:

  • 31 bit integers to avoid 64 bit integers for the sake of the Javascript reference implementation.
  • A right shift instead of a left shift to remove the need for an additional modulus operator, which would otherwise have been necessary to prevent overflow.
  • Masks are no longer zero-padded since a right shift is used instead of a left shift.
  • A more adaptive threshold based on a combination of average and minimum chunk size (rather than just average chunk size) to decide the pivot point at which to switch masks. A larger minimum chunk size now switches from the strict mask to the eager mask earlier.
  • Masks use 1 bit of chunk size normalization instead of 2 bits of chunk size normalization.
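To make the points above concrete, here is a rough sketch of a cut-point search in that style: a gear hash updated with a right shift, a strict mask before the pivot, and an eager mask after it. It is an illustration under stated assumptions, not this package's code. The `GEAR` table is generated from a fixed seed rather than taken from any implementation, the exact `center_size` formula is an assumption based on the description above, and all function names are hypothetical. Chunk boundaries will therefore not match those produced by the package.

```python
import random

# Assumption: 256 pseudo-random 31-bit integers from a fixed seed.
# Real implementations ship their own table.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(31) for _ in range(256)]


def center_size(avg_size, min_size, source_size):
    """Pivot at which the strict mask gives way to the eager mask.

    Assumed formula: a larger minimum pushes the pivot earlier.
    """
    offset = min(min_size + (min_size + 1) // 2, avg_size)
    return min(avg_size - offset, source_size)


def cut_point(data, min_size, avg_size, max_size):
    """Return the length of the first chunk in data."""
    size = min(len(data), max_size)
    if size <= min_size:
        return size
    bits = avg_size.bit_length() - 1   # log2, for power-of-two averages
    mask_s = (1 << (bits + 1)) - 1     # strict mask: one bit more than log2(avg)
    mask_l = (1 << (bits - 1)) - 1     # eager mask: one bit fewer
    pivot = center_size(avg_size, min_size, size)
    pattern = 0
    i = min_size                       # bytes before min_size are never tested
    while i < size:
        # Right shift, so old bytes age out without a modulus.
        pattern = (pattern >> 1) + GEAR[data[i]]
        mask = mask_s if i < pivot else mask_l
        if not pattern & mask:
            return i + 1
        i += 1
    return i                           # no boundary found: cut at size


def chunk_lengths(data, min_size=64, avg_size=256, max_size=1024):
    """Split data into content-defined chunks and return their lengths."""
    view = memoryview(data)
    lengths = []
    while len(view):
        n = cut_point(view, min_size, avg_size, max_size)
        lengths.append(n)
        view = view[n:]
    return lengths
```

With these definitions every chunk except possibly the last is longer than `min_size`, no chunk exceeds `max_size`, and the lengths always sum to the input size.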

The primary objective of this codebase was to have a Python implementation with a permissive license, which could be used for new projects, without concern for data parity with existing implementations.

Prior Art

This package started as a Python port of the implementation by Nathan Fiedler (see the nlfiedler link below).

Change Log

[1.4.2] - 2020-11-25

  • add binary releases to PyPI (Xie Yanbo)
  • update dependencies

[1.4.1] - 2020-09-30

  • fix issue with fat option in cython version
  • updated dependencies

[1.4.0] - 2020-08-08

  • add support for multiple paths with scan command
  • fix issue with building cython extension
  • fix issue with fat option
  • fix zero-division error

[1.3.0] - 2020-06-26

  • add new scan command to calculate deduplication ratio for directories

[1.2.0] - 2020-05-23

Added

  • faster optional cython implementation
  • benchmark command

[1.1.0] - 2020-05-09

Added

  • high-level API
  • support for streams
  • support for custom hash functions

[1.0.0] - 2020-05-07

Added

