FastCDC for large git files
git-fastcdc
Split certain files using content-defined chunking for faster deduplication. The use case is similar to git-lfs, but the blobs stay in the repository. git-fastcdc mitigates some of the speed penalties that come with keeping large blobs in the repository. For most use cases you are probably better off with git-lfs. If your focus is on archival and deduplication, git-fastcdc might be right for you.
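To illustrate what content-defined chunking means, here is a minimal sketch of a gear-hash chunker in Python. It is a simplified stand-in and not git-fastcdc's actual implementation: the function name cdc_chunks, the gear table seed, and the size parameters are all assumptions made for this example. The point is that cut points depend on the bytes near the cut, not on absolute offsets, so an insertion early in a file leaves most later chunks unchanged.

```python
import random

# Gear table: one pseudo-random 32-bit value per possible byte value.
# The fixed seed is an arbitrary choice so the example is reproducible.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, min_size=256, avg_bits=10, max_size=8192):
    """Split data into chunks at content-defined boundaries.

    Simplified gear-hash chunker (hypothetical parameters). A cut is made
    when the top avg_bits bits of the rolling hash are all zero, which
    happens about once every 2**avg_bits bytes after min_size is reached.
    """
    mask = ((1 << avg_bits) - 1) << (32 - avg_bits)
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # After 32 shifts a byte no longer influences h, so the hash only
        # "sees" the last 32 bytes before the current position.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks
```

Chunking a file and then chunking the same file with a few bytes prepended should yield two chunk lists that share almost all of their chunks, which is what makes deduplication across versions work.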
Enable
git fastcdc install
Config
Edit .gitattributes:
*.wav binary filter=git_fastcdc
/.gitattributes text -binary -filter
/.gitignore text -binary -filter
By default git-fastcdc runs in-memory. Switch to on-disk:
git config --local fastcdc.ondisk true
If you have a pure git-fastcdc repository, you probably want to disable delta-compression to benefit from the speedups through fastcdc.
git fastcdc delta disable
This sets core.bigFileThreshold to 200k, which isn't an exact science. It means most of the history and metadata is delta-compressed while most of the CDC blobs aren't.
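As a rough illustration of what the threshold does: git stores blobs larger than core.bigFileThreshold deflated, without attempting delta compression. The Python sketch below only models that rule. The helper names parse_git_size and skips_delta are made up for this example, and the blob sizes in the usage note are arbitrary, not measurements of git-fastcdc chunks.

```python
def parse_git_size(value):
    """Parse a git config size such as '200k' into bytes.

    git treats the k, m and g suffixes as powers of 1024.
    """
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def skips_delta(blob_size, threshold="200k"):
    """Return True if a blob of blob_size bytes is above the threshold.

    Blobs above core.bigFileThreshold are not considered for
    delta compression during a repack.
    """
    return blob_size > parse_git_size(threshold)
```

With the 200k setting, skips_delta(300 * 1024) is true for a hypothetical 300 KiB blob, while a small text file such as a 4 KiB .gitattributes stays eligible for delta compression.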
Results
For my repository, an 800 GB music collection:
- Without git-fastcdc, delta-compression took over 5 hours (actually it took all night)
- With git-fastcdc, delta-compression takes about 2 minutes
- With git-fastcdc, the repository got slightly smaller: about 1%
So: a much faster repack, with the same delta-compression.
Methodology: I took one state of my repository from 2 years ago and one state from today. A lot of metadata has changed between those two states, because I am constantly fixing it using beaTunes. In both tests I created two commits and ran git repack -a -d -f at the end.
How
The filter splits files when you add them. The resulting chunks go into the git-fastcdc branch. You need to push this branch to remotes too!
In the working copy you see the actual data in the files (*.wav in the example above), but the blobs stored for these files are just a list of chunks.
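The idea of a blob that is "just a list of chunks" can be sketched in Python as follows. This is an illustration only: the manifest format (one SHA-256 hex digest per line), the dict used as a chunk store, and the function names store_chunks and restore are assumptions for this example and do not describe git-fastcdc's real storage format.

```python
import hashlib


def store_chunks(chunks, store):
    """Save each chunk under its SHA-256 digest and return a manifest.

    The manifest is plain text with one digest per line (hypothetical
    format). Identical chunks are stored only once.
    """
    lines = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        lines.append(digest)
    return "".join(line + "\n" for line in lines)


def restore(manifest, store):
    """Rebuild the original bytes by concatenating the listed chunks."""
    return b"".join(store[line] for line in manifest.splitlines())
```

Storing two versions of a file that share most of their chunks adds only the changed chunks to the store, while each version keeps its own small manifest. Restoring from a manifest gives back the full file content, which is what you see in the working copy.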