FastCDC for large git files

git-fastcdc

Split certain files using content-defined chunking for faster deduplication. The use case is similar to git-lfs, but the blobs stay in the repository. git-fastcdc mitigates some of the speed penalties of keeping the blobs in-repository. For most use cases you are probably better off with git-lfs. If your focus is archival and deduplication, git-fastcdc might be right for you.

Enable

git fastcdc install

Config

Edit .gitattributes:

*.wav binary filter=git_fastcdc
/.gitattributes text -binary -filter
/.gitignore text -binary -filter

By default git-fastcdc runs in-memory. Switch to on-disk:

git config --local fastcdc.ondisk true

If you have a pure git-fastcdc repository, you probably want to disable delta-compression to benefit from the speedups through fastcdc.

git fastcdc delta disable

This sets core.bigFileThreshold to 200k, which isn't an exact science. It means most of the history and metadata is delta-compressed, while most of the CDC blobs aren't.
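To make the effect of the threshold concrete, here is a minimal sketch of how a 200k limit separates objects. It relies on git's documented behaviour that size suffixes scale by 1024 and that files larger than core.bigFileThreshold are stored without attempting delta compression. The object names and sizes are hypothetical, chosen only for illustration; they are not measurements from git-fastcdc.

```python
def parse_git_size(value: str) -> int:
    """Parse a git config size such as '200k' (git scales k, m, g by 1024)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def skips_delta(size: int, threshold: int) -> bool:
    """Objects larger than the threshold are stored without delta attempts."""
    return size > threshold


threshold = parse_git_size("200k")  # 200 * 1024 bytes

# Hypothetical object sizes in bytes (assumptions, not taken from git-fastcdc).
objects = {
    "tree or commit": 4_000,
    "chunk-list blob": 12_000,
    "cdc chunk blob": 1_000_000,
}

for name, size in objects.items():
    verdict = "no delta" if skips_delta(size, threshold) else "delta-compressed"
    print(f"{name:16} {size:>9} bytes -> {verdict}")
```

Small objects stay below the limit and are still delta-compressed, while anything above it is only deflated, which is where the repack time is saved.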

Results

For my repository - 800GB of music collection:

  • Without git-fastcdc, delta-compression took over 5 hours (actually it took all night)
  • With git-fastcdc, delta-compression takes about 2 minutes
  • With git-fastcdc, the repository got slightly smaller: about 1%

So the repack is much faster, with the same delta-compression.

Methodology: I took one state of my repository from 2 years ago and one state from today. A lot of metadata has changed between those two states, because I am constantly fixing it using beaTunes. In both tests I created two commits and ran git repack -a -d -f at the end.

How

Files are split by the filter when you add them. The resulting chunks go into the git-fastcdc branch. You need to push this branch to remotes too!

In the working copy you see the actual file contents (the *.wav files in the example above), but the blobs git stores for these files are just lists of chunks.
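The idea behind the splitting can be sketched in a few lines of Python. This is not git-fastcdc's implementation: it is a simplified gear-hash chunker in the spirit of FastCDC, and the function names, mask, and size limits are assumptions chosen for illustration. It shows how a file becomes a list of chunk hashes, and why a small metadata edit near the start of a file leaves almost all chunks unchanged.

```python
import hashlib
import random

# Gear table: one pseudo-random 32-bit value per byte value (fixed seed).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

# Cut when the top 10 bits of the rolling hash are zero (about 1 in 1024 bytes).
MASK = 0x3FF << 22


def chunks(data: bytes, min_size: int = 256, max_size: int = 8192):
    """Yield content-defined chunks of data (illustrative parameters)."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear rolling hash: older bytes are shifted out of the 32-bit state.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]


def chunk_list(data: bytes):
    """The 'list of chunks' stored in place of the file: one hash per chunk."""
    return [hashlib.sha1(c).hexdigest() for c in chunks(data)]


# Stand-in for an audio file, plus a copy with a small edit near the start.
data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(200_000))
edited = original[:100] + b"new tag" + original[100:]

before, after = chunk_list(original), chunk_list(edited)
shared = len(set(before) & set(after))
print(f"{shared} of {len(after)} chunks are shared with the original")
```

Because cut points depend on the content rather than on fixed offsets, the boundaries realign shortly after the edit, so only the chunks around the change need to be stored again.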
