FastCDC (content defined chunking) in pure Python.

FastCDC

This package implements the "FastCDC" content-defined chunking algorithm in Python, with an optional, faster Cython implementation. To learn more about content-defined chunking and its applications, see the reference material linked below.

Requirements

  • Python 3.7 or later. Tested on Linux, macOS, and Windows.

Installing

$ pip install fastcdc

To enable support for the additional hash algorithms (xxhash and blake3), use:

$ pip install fastcdc[hashes]

Usage

Calculate chunks with default settings:

$ fastcdc tests/SekienAkashita.jpg
hash=103159aa68bb1ea98f64248c647b8fe9a303365d80cb63974a73bba8bc3167d7 offset=0 size=22366
hash=3f2b58dc77982e763e75db76c4205aaab4e18ff8929e298ca5c58500fee5530d offset=22366 size=10491
hash=fcfb2f49ccb2640887a74fad1fb8a32368b5461a9dccc28f29ddb896b489b913 offset=32857 size=14094
hash=bd1198535cdb87c5571378db08b6e886daf810873f5d77000a54795409464138 offset=46951 size=18696
hash=d6347a2e5bf586d42f2d80559d4f4a2bf160dce8f77eede023ad2314856f3086 offset=65647 size=43819

Customize min-size, avg-size, max-size, and hash function:

$ fastcdc -mi 16384 -s 32768 -ma 65536 -hf sha256 tests/SekienAkashita.jpg
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201
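
Because each output line carries a hash, an offset, and a size, the output is easy to post-process. The following is a minimal sketch, not part of the package: the helper names parse_chunks and is_contiguous are hypothetical, and the line format is assumed to be exactly what the example above shows. It parses the three lines from the customized run and checks that the chunks cover the file without gaps or overlaps.

```python
import re

# Matches one line of the form shown above: hash=<hex> offset=<int> size=<int>
LINE_RE = re.compile(
    r"hash=(?P<hash>[0-9a-f]+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def parse_chunks(text):
    """Parse chunk lines into (hash, offset, size) tuples; other lines are skipped."""
    chunks = []
    for line in text.splitlines():
        m = LINE_RE.fullmatch(line.strip())
        if m:
            chunks.append((m["hash"], int(m["offset"]), int(m["size"])))
    return chunks


def is_contiguous(chunks):
    """True if the first chunk starts at 0 and each chunk starts where the last ended."""
    expected = 0
    for _digest, offset, size in chunks:
        if offset != expected:
            return False
        expected = offset + size
    return True


# The three lines printed by the customized run above.
output = """\
hash=5a80871bad4588c7278d39707fe68b8b174b1aa54c59169d3c2c72f1e16ef46d offset=0 size=32857
hash=13f6a4c6d42df2b76c138c13e86e1379c203445055c2b5f043a5f6c291fa520d offset=32857 size=16408
hash=0fe7305ba21a5a5ca9f89962c5a6f3e29cd3e2b36f00e565858e0012e5f8df36 offset=49265 size=60201
"""

chunks = parse_chunks(output)
total_size = sum(size for _digest, _offset, size in chunks)
```

Here total_size is the sum of the three chunk sizes. The default-settings run above ends at the same byte position (65647 + 43819), as expected for two chunkings of the same file.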

Scan files in a directory and report duplication:

$ fastcdc scan ~/Downloads
[####################################]  100%
Files:          1,332
Chunk Sizes:    min 4096 - avg 16384 - max 131072
Unique Chunks:  506,077
Total Data:     9.3 GB
Dupe Data:      873.8 MB
DeDupe Ratio:   9.36 %
Throughput:     135.2 MB/s
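
To make the report figures concrete, here is a minimal sketch of how such numbers could be derived from a stream of chunk records. It is an illustration under assumptions, not the scan command's actual code: the function name dedupe_stats is hypothetical, and the ratio is assumed to be duplicate bytes divided by total bytes, which is roughly consistent with the sample report (873.8 MB out of about 9.3 GB).

```python
def dedupe_stats(chunks):
    """Summarize duplication for an iterable of (hash, size) pairs.

    Returns (total_bytes, dupe_bytes, unique_chunks, ratio_percent).
    A chunk counts as duplicate data if its hash has been seen before.
    """
    seen = set()
    total = 0
    dupe = 0
    for digest, size in chunks:
        total += size
        if digest in seen:
            dupe += size
        else:
            seen.add(digest)
    # Guard against empty input so the ratio never divides by zero.
    ratio = 100.0 * dupe / total if total else 0.0
    return total, dupe, len(seen), ratio


# Toy input: the chunk with hash "a" appears twice.
sample = [("a", 100), ("b", 50), ("a", 100), ("c", 250)]
total, dupe, unique, ratio = dedupe_stats(sample)
```

With the toy input, 100 of the 500 bytes are duplicates across 3 unique chunks, giving a ratio of 20 %.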

Show help:

$ fastcdc
Usage: fastcdc [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  chunkify*  Find variable sized chunks for FILE and compute hashes.
  benchmark  Benchmark chunking performance.
  scan       Scan files in directory and report duplication.

Use from your Python code

The tests include short examples of using the chunker. For example:

from fastcdc import fastcdc

results = list(fastcdc("tests/SekienAkashita.jpg", 16384, 32768, 65536))
assert len(results) == 3
assert results[0].offset == 0
assert results[0].length == 32857
assert results[1].offset == 32857
assert results[1].length == 16408
assert results[2].offset == 49265
assert results[2].length == 60201

Reference Material

The algorithm is as described in "FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication"; see the paper and presentation for details. There are some minor differences, as described below.

Differences with the FastCDC paper

The explanation below is copied from ronomon/deduplication since this codebase is little more than a translation of that implementation:

The following optimizations and variations on FastCDC are involved in the chunking algorithm:

  • 31 bit integers to avoid 64 bit integers for the sake of the Javascript reference implementation.
  • A right shift instead of a left shift to remove the need for an additional modulus operator, which would otherwise have been necessary to prevent overflow.
  • Masks are no longer zero-padded since a right shift is used instead of a left shift.
  • A more adaptive threshold based on a combination of average and minimum chunk size (rather than just average chunk size) to decide the pivot point at which to switch masks. A larger minimum chunk size now switches from the strict mask to the eager mask earlier.
  • Masks use 1 bit of chunk size normalization instead of 2 bits of chunk size normalization.
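
The variations listed above can be sketched as a small gear-style cut-point search. This is an illustration only, not the package's code: the gear table (seeded pseudo-random 31-bit values), the function names, and the exact pivot formula are assumptions made for this example, so the chunk boundaries it produces will not match those of fastcdc.

```python
import random

# Hypothetical gear table: 256 pseudo-random 31-bit integers from a fixed seed.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(31) for _ in range(256)]


def low_mask(bits):
    """Mask with the lowest `bits` bits set; no zero padding is needed."""
    return (1 << bits) - 1


def pivot(min_size, avg_size):
    """Assumed switch point; it shrinks as min_size grows relative to avg_size."""
    return avg_size - min(min_size + (min_size + 1) // 2, avg_size)


def next_cut(data, min_size, avg_size, max_size):
    """Return the length of the first chunk of `data` (a bytes object)."""
    size = len(data)
    if size <= min_size:
        return size
    bits = avg_size.bit_length() - 1   # log2 for a power-of-two average
    strict = low_mask(bits + 1)        # one extra bit must be zero: cuts are rarer
    eager = low_mask(bits - 1)         # one fewer bit must be zero: cuts are likelier
    switch = pivot(min_size, avg_size)
    limit = min(max_size, size)
    h = 0
    for i in range(min_size, limit):
        # Right shift plus add: h stays below 2**32, so no modulus is required.
        h = (h >> 1) + GEAR[data[i]]
        mask = strict if i < switch else eager
        if (h & mask) == 0:
            return i + 1
    return limit


def chunk_lengths(data, min_size=64, avg_size=256, max_size=1024):
    """Split `data` into chunk lengths by calling next_cut repeatedly."""
    lengths = []
    pos = 0
    while pos < len(data):
        n = next_cut(data[pos:pos + max_size], min_size, avg_size, max_size)
        lengths.append(n)
        pos += n
    return lengths
```

In this sketch the strict and eager masks differ from the average-size mask by one bit each, mirroring the "1 bit of chunk size normalization" point, and a larger min_size yields a smaller pivot, so the eager mask takes over earlier.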

The primary objective of this codebase was to have a Python implementation with a permissive license, which could be used for new projects, without concern for data parity with existing implementations.

Prior Art

This package started as a Python port of the implementation by Nathan Fiedler (see the nlfiedler link below).

Change Log

[1.5.0] - 2023-03-13

  • added python 3.10/3.11 support
  • removed python 3.6 support
  • update dependencies

[1.4.2] - 2020-11-25

  • add binary releases to PyPI (Xie Yanbo)
  • update dependencies

[1.4.1] - 2020-09-30

  • fix issue with fat option in cython version
  • updated dependencies

[1.4.0] - 2020-08-08

  • add support for multiple path with scan command
  • fix issue with building cython extension
  • fix issue with fat option
  • fix zero-division error

[1.3.0] - 2020-06-26

  • add new scan command to calculate deduplication ratio for directories

[1.2.0] - 2020-05-23

Added

  • faster optional cython implementation
  • benchmark command

[1.1.0] - 2020-05-09

Added

  • high-level API
  • support for streams
  • support for custom hash functions

[1.0.0] - 2020-05-07

Added
