FastCDC (content defined chunking) in pure Python.

FastCDC

This package implements the "FastCDC" content-defined chunking algorithm in Python, with an optional Cython implementation for speed. To learn more about content-defined chunking and its applications, see the reference material linked below.

Requirements

  • Python 3.8 or later (support for Python 3.7 was removed in 1.6.0). Tested on Linux, macOS, and Windows.

Installing

$ pip install fastcdc

To enable support for the additional hash algorithms (xxhash and blake3), use:

$ pip install fastcdc[hashes]

Usage

Calculate chunks with default settings:

$ fastcdc tests/SekienAkashita.jpg
hash=103159aa68bb1ea98f64248c647b8fe9a303365d80cb63974a73bba8bc3167d7 offset=0 size=22366
hash=3f2b58dc77982e763e75db76c4205aaab4e18ff8929e298ca5c58500fee5530d offset=22366 size=10491
hash=fcfb2f49ccb2640887a74fad1fb8a32368b5461a9dccc28f29ddb896b489b913 offset=32857 size=14094
hash=bd1198535cdb87c5571378db08b6e886daf810873f5d77000a54795409464138 offset=46951 size=18696
hash=d6347a2e5bf586d42f2d80559d4f4a2bf160dce8f77eede023ad2314856f3086 offset=65647 size=43819

Customize the minimum, average, and maximum chunk sizes and the hash function:

$ fastcdc -mi 16384 -s 32768 -ma 65536 -hf sha256 tests/SekienAkashita.jpg
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201
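
The output lines follow a regular `hash=<hex> offset=<int> size=<int>` shape, so they are easy to post-process. The following is a minimal sketch and not part of the package: `ChunkRecord`, `parse_chunk_lines`, and `is_contiguous` are hypothetical helper names, and the sample text is copied from the run above. It parses the lines and checks that the chunks tile the file without gaps.

```python
import re
from typing import List, NamedTuple


class ChunkRecord(NamedTuple):
    """Hypothetical record type for one line of CLI output."""
    hash: str
    offset: int
    size: int


LINE_RE = re.compile(
    r"hash=(?P<hash>[0-9a-f]+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def parse_chunk_lines(text: str) -> List[ChunkRecord]:
    """Parse 'hash=... offset=... size=...' lines, skipping anything else."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match:
            records.append(
                ChunkRecord(match["hash"], int(match["offset"]), int(match["size"]))
            )
    return records


def is_contiguous(records: List[ChunkRecord]) -> bool:
    """Return True if each chunk starts where the previous one ended."""
    expected = 0
    for rec in records:
        if rec.offset != expected:
            return False
        expected += rec.size
    return True


SAMPLE = """\
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201
"""

records = parse_chunk_lines(SAMPLE)
print(is_contiguous(records), sum(r.size for r in records))  # True 109466
```

The total of 109466 bytes also matches the default-settings run, whose last chunk ends at 65647 + 43819.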

Scan files in a directory and report duplication:

$ fastcdc scan ~/Downloads
[####################################]  100%
Files:          1,332
Chunk Sizes:    min 4096 - avg 16384 - max 131072
Unique Chunks:  506,077
Total Data:     9.3 GB
Dupe Data:      873.8 MB
DeDupe Ratio:   9.36 %
Throughput:     135.2 MB/s
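
The report does not spell out how the ratio is derived. One plausible reading, consistent with the sample figures within rounding, is duplicate bytes divided by total bytes. The sketch below illustrates that reading only: `dedupe_stats` is a hypothetical helper, SHA-256 is an arbitrary choice of chunk hash, and the input is a plain list of byte strings rather than real chunker output.

```python
import hashlib
from typing import Dict, Iterable


def dedupe_stats(chunks: Iterable[bytes]) -> Dict[str, float]:
    """Count total and duplicate bytes across chunks, keyed by chunk hash."""
    seen = set()
    total = 0
    dupe = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        total += len(chunk)
        if digest in seen:
            # A chunk with this hash was already stored once.
            dupe += len(chunk)
        else:
            seen.add(digest)
    # Assumed definition: share of all bytes that belong to repeated chunks.
    ratio = dupe / total * 100 if total else 0.0
    return {
        "unique_chunks": len(seen),
        "total_bytes": total,
        "dupe_bytes": dupe,
        "ratio_percent": ratio,
    }


stats = dedupe_stats([b"aaaa", b"bbbb", b"aaaa", b"cc"])
print(stats)
```

Here the second `b"aaaa"` is the only repeated chunk, so 4 of the 14 bytes count as duplicate data.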

Show help:

$ fastcdc
Usage: fastcdc [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  chunkify*  Find variable sized chunks for FILE and compute hashes.
  benchmark  Benchmark chunking performance.
  scan       Scan files in directory and report duplication.

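Use from your Python code

The tests contain short examples of using the chunker, such as the snippet below. The three numeric arguments are the minimum, average, and maximum chunk sizes, and the resulting offsets and lengths match the customized command-line run shown earlier.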

from fastcdc import fastcdc

results = list(fastcdc("tests/SekienAkashita.jpg", 16384, 32768, 65536))
assert len(results) == 3
assert results[0].offset == 0
assert results[0].length == 32857
assert results[1].offset == 32857
assert results[1].length == 16408
assert results[2].offset == 49265
assert results[2].length == 60201

Reference Material

The algorithm is as described in "FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication"; see the paper and presentation for details. There are some minor differences, as described below.

Differences with the FastCDC paper

The explanation below is copied from ronomon/deduplication since this codebase is little more than a translation of that implementation:

The following optimizations and variations on FastCDC are involved in the chunking algorithm:

  • 31-bit integers to avoid 64-bit integers for the sake of the JavaScript reference implementation.
  • A right shift instead of a left shift to remove the need for an additional modulus operator, which would otherwise have been necessary to prevent overflow.
  • Masks are no longer zero-padded since a right shift is used instead of a left shift.
  • A more adaptive threshold based on a combination of average and minimum chunk size (rather than just average chunk size) to decide the pivot point at which to switch masks. A larger minimum chunk size now switches from the strict mask to the eager mask earlier.
  • Masks use 1 bit of chunk size normalization instead of 2 bits of chunk size normalization.
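
The variations listed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the package's actual code: the gear table here is generated from a seeded random generator (real implementations ship a fixed table), the threshold formula in `center_size` is an assumed reading of the "adaptive threshold" bullet, `avg_size` is assumed to be a power of two, and the names `GEAR`, `center_size`, `cut_point`, and `chunk` are chosen for illustration. The default sizes are taken from the scan report shown earlier.

```python
import random

# Hypothetical gear table: 256 pseudo-random 31-bit integers (illustration only).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(31) for _ in range(256)]


def mask(bits):
    """Low-bit mask with no zero padding, e.g. mask(3) == 0b111."""
    return (1 << bits) - 1


def center_size(avg_size, min_size, source_size):
    """Assumed pivot: a larger min_size moves the switch to the eager mask earlier."""
    offset = min(avg_size, min_size + (min_size + 1) // 2)
    return min(avg_size - offset, source_size)


def cut_point(data, min_size, avg_size, max_size):
    """Return the length of the first chunk in data."""
    size = len(data)
    if size <= min_size:
        return size
    bits = avg_size.bit_length() - 1   # log2 of a power-of-two average size
    mask_strict = mask(bits + 1)       # 1 bit of normalization: harder to match
    mask_eager = mask(bits - 1)        # 1 bit of normalization: easier to match
    pivot = center_size(avg_size, min_size, size)
    h = 0
    i = min_size                       # bytes before min_size are never cut points
    phases = ((min(pivot, size), mask_strict), (min(max_size, size), mask_eager))
    for barrier, m in phases:
        while i < barrier:
            # Right shift instead of left shift, so no modulus is needed.
            h = (h >> 1) + GEAR[data[i]]
            if not h & m:
                return i + 1
            i += 1
    return i                           # no match found: cut at max_size or end of data


def chunk(data, min_size=4096, avg_size=16384, max_size=131072):
    """Yield (offset, length) pairs that cover data."""
    offset = 0
    while offset < len(data):
        window = data[offset:offset + max_size]
        length = cut_point(window, min_size, avg_size, max_size)
        yield offset, length
        offset += length


_data_rng = random.Random(1)
demo_data = bytes(_data_rng.getrandbits(8) for _ in range(300_000))
cuts = list(chunk(demo_data))
for offset, length in cuts:
    print(offset, length)
```

Because the table and threshold here are stand-ins, the cut points this sketch produces will not match those of the package or of any other implementation.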

The primary objective of this codebase was to have a Python implementation with a permissive license, which could be used for new projects, without concern for data parity with existing implementations.

Prior Art

This package started as a Python port of the implementation by Nathan Fiedler (see the nlfiedler link below).

Change Log

[1.6.0] - 2024-05-09

  • added python 3.12 support
  • removed python 3.7 support
  • updated dependencies

[1.5.0] - 2023-03-13

  • added python 3.10/3.11 support
  • removed python 3.6 support
  • update dependencies

[1.4.2] - 2020-11-25

  • add binary releases to PyPI (Xie Yanbo)
  • update dependencies

[1.4.1] - 2020-09-30

  • fix issue with fat option in cython version
  • updated dependencies

[1.4.0] - 2020-08-08

  • add support for multiple path with scan command
  • fix issue with building cython extension
  • fix issue with fat option
  • fix zero-division error

[1.3.0] - 2020-06-26

  • add new scan command to calculate deduplication ratio for directories

[1.2.0] - 2020-05-23

Added

  • faster optional cython implementation
  • benchmark command

[1.1.0] - 2020-05-09

Added

  • high-level API
  • support for streams
  • support for custom hash functions

[1.0.0] - 2020-05-07

Added
