hashget deduplication and compression tool

hashget

Network deduplication tool for archiving (backing up) virtual machines, mostly Debian ones. For example, it is very useful for backing up LXC containers before uploading them to Amazon Glacier.

When compressing, hashget replaces indexed static files (those that can be downloaded from a static URL) with their hashes and URLs. This can compress a 600 MB Debian root filesystem with MySQL, Apache and other software to just 4 MB!
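
The idea can be illustrated with a minimal Python sketch. This is not hashget's actual implementation: the `index` dictionary (hex digest to URL), the function names, the choice of SHA-256 and the example.org URL are all assumptions made for illustration. Files whose digest is found in the index are recorded as (hash, URL) pairs and left out of the archive; everything else is packed as usual.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_files(root, index):
    """Split files under root into (restorable, to_pack).

    index: hypothetical mapping of hex digest -> download URL.
    restorable: relative path -> (digest, url), kept as metadata only.
    to_pack: sorted relative paths that must be stored in the archive.
    """
    restorable, to_pack = {}, []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            if os.path.islink(path):
                to_pack.append(rel)  # symlinks are packed as-is
                continue
            digest = sha256_of(path)
            if digest in index:
                restorable[rel] = (digest, index[digest])
            else:
                to_pack.append(rel)
    return restorable, sorted(to_pack)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "libfoo.so"), "wb") as f:
            f.write(b"bytes shipped in a public package")
        with open(os.path.join(root, "local.conf"), "wb") as f:
            f.write(b"site-specific settings")
        # Hypothetical index entry; example.org is a placeholder URL.
        index = {sha256_of(os.path.join(root, "libfoo.so")):
                 "https://example.org/pool/foo_1.0.deb"}
        restorable, to_pack = split_files(root, index)
        print("restorable:", restorable)
        print("to pack:", to_pack)
```

Only the files in `to_pack` would need to go into the tarball, which is why the archive can end up far smaller than the filesystem it describes.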

When decompressing, hashget downloads these files, verifies their hashsums and places them on the target system with the same permissions, ownership, atime and mtime.
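
The verify-and-place step might look roughly like the sketch below. It is an illustration, not hashget's code: `restore_file` and its parameters are hypothetical, SHA-256 is an assumed hash, the download itself is omitted (`fetched_path` stands in for an already downloaded file), and ownership is skipped because `os.chown` normally requires root.

```python
import hashlib
import os
import shutil
import tempfile


def restore_file(fetched_path, target_path, expected_sha256, mode, atime, mtime):
    """Verify a fetched file, then place it at target_path with saved metadata.

    Raises ValueError if the digest does not match the expected one.
    """
    with open(fetched_path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch for {target_path}")
    os.makedirs(os.path.dirname(target_path) or ".", exist_ok=True)
    shutil.copyfile(fetched_path, target_path)
    os.chmod(target_path, mode)              # restore permissions
    os.utime(target_path, (atime, mtime))    # restore access/modification times


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fetched = os.path.join(tmp, "fetched")
        with open(fetched, "wb") as f:
            f.write(b"payload")
        expected = hashlib.sha256(b"payload").hexdigest()
        target = os.path.join(tmp, "rootfs", "usr", "bin", "tool")
        restore_file(fetched, target, expected, 0o755, 1553540000, 1553540001)
        st = os.stat(target)
        print(oct(st.st_mode & 0o777), int(st.st_mtime))
```

Checking the digest before copying means a corrupted or changed download never reaches the restored system.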

A hashget archive (in contrast to incremental and differential archives) is 'self-sufficient in the same world' (one where the Debian or Linux kernel projects are still alive and their files can still be downloaded).

Installation

Pip (recommended):

pip3 install hashget

or clone from git:

git clone https://gitlab.com/yaroslaff/hashget.git

QuickStart

Compressing

Compressing test machine:

# hashget -zf /tmp/mydebvm.tar.gz --pack /var/lib/lxc/mydebvm/rootfs/ --exclude var/cache/apt var/lib/apt/lists 
STEP 1/3 Crawling...
Total: 222 packages
Crawling done in 0.01s. 222 total, 0 new, 0 already in db.
STEP 2/3 prepare exclude list for packing...
saved: 8515 files, 219 pkgs, size: 445.8M
STEP 3/3 tarring...
/var/lib/lxc/mydebvm/rootfs/ (687.2M) packed into /tmp/mydebvm.tar.gz (4.0M)

Now let's compare the results with ordinary tarring:

# du -sh --apparent-size /var/lib/lxc/mydebvm/rootfs/
693M	/var/lib/lxc/mydebvm/rootfs/

# tar -czf /tmp/mydebvm-orig.tar.gz  --exclude=var/cache/apt --exclude=var/lib/apt/lists -C /var/lib/lxc/mydebvm/rootfs/ .

# ls -lh mydebvm*
-rw-r--r-- 1 root root 165M Mar 25 19:58 mydebvm-orig.tar.gz
-rw-r--r-- 1 root root 4.1M Mar 25 19:54 mydebvm.tar.gz

The optimized backup is 40 times smaller!
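
The factor follows directly from the two sizes in the listing above; a quick check in Python (the variable names are ours):

```python
plain_tar_mb = 165.0  # size of mydebvm-orig.tar.gz from the ls output
hashget_mb = 4.1      # size of mydebvm.tar.gz from the ls output

ratio = plain_tar_mb / hashget_mb
print(f"{ratio:.1f}x smaller")  # prints "40.2x smaller"
```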

Decompressing

Untarring:

# tar -xzf mydebvm.tar.gz -C rootfs
# du -sh --apparent-size rootfs/
130M	rootfs/

After untarring, we have just 130 MB. Now, get all the missing files with hashget:

root@mir:/tmp# hashget -u rootfs/
Recovered 8534/8534 files 450.0M bytes (49.9M downloaded, 49.1M cached) in 242.68s

Now we have a fully working Debian system. Some files are still missing (e.g. the APT list files), but they can be recreated with the 'apt update' command.

Adding custom files for deduplication

Now, let's add some files to our test machine:

mydebvm# wget -q https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.4.tar.xz
mydebvm# tar -xf linux-5.0.4.tar.xz 
mydebvm# du -sh --apparent-size .
893M	.

We added almost 900 MB of files to the system. Let's see how it compresses:

# hashget -zf /tmp/mydebvm.tar.gz --pack /var/lib/lxc/mydebvm/rootfs/ --exclude var/cache/apt var/lib/apt/lists 
STEP 1/3 Crawling...
Total: 222 packages
Crawling done in 0.01s. 222 total, 0 new, 0 already in db.
STEP 2/3 prepare exclude list for packing...
saved: 8515 files, 219 pkgs, size: 445.8M
STEP 3/3 tarring...
/var/lib/lxc/mydebvm/rootfs/ (1.5G) packed into /tmp/mydebvm.tar.gz (265.0M)

Still very good, but 265M is not as impressive as 4M. Let's fix it!

# hashget --project my --submit https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.4.tar.xz

We created our own project, 'my', and indexed the file https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.4.tar.xz
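
Conceptually, indexing a package means recording a hash for every file inside it together with the URL the package came from. The sketch below shows one way to do that with Python's standard library. It is only an illustration: `index_tarball`, the returned dictionary layout, the use of SHA-256 and the example.org URL are assumptions, not hashget's real index format.

```python
import hashlib
import io
import tarfile


def index_tarball(fileobj, url):
    """Map the SHA-256 of every regular file in a tar archive to (url, member name)."""
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, devices
            data = tar.extractfile(member).read()
            index[hashlib.sha256(data).hexdigest()] = (url, member.name)
    return index


if __name__ == "__main__":
    # Build a tiny archive in memory so the example needs no network access.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("pkg/README")
        payload = b"hello"
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    # example.org is a placeholder URL.
    for digest, (url, name) in index_tarball(buf, "https://example.org/pkg.tar.gz").items():
        print(digest[:12], name, url)
```

With such an index in place, any file on the machine whose hash matches an entry can be left out of the backup and fetched again from the package URL on restore.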

We can look at our project's details:

# hashget-admin --status --project my
my DirHashDB(path:/var/cache/hashget/hashdb/my stor:basename pkgtype:generic packages:0)
  size: 4.1M
  packages: 1
  first crawled: 2019-03-25 21:54:54
  last_crawled: 2019-03-25 21:54:54
  files: 50579
  anchors: 767
  packages size: 100.4M
  files size: 774.9M
  indexed size: 768.9M (99.23%)
  noanchor packages: 0
  noanchor size: 0
  no anchor link: 0
  bad anchor link: 0

The project takes just 4.1M on disk and has 1 package indexed (100.4M), with over 50K files in total.

You can list the contents of the project:

# hashget-admin --list --project my
linux-5.0.4.tar.xz (767/50579)

Here you see the list of all indexed packages (just one). This package has over 50K files and 700+ 'anchors' (large files, over 100K).
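
To get a feel for what counts as an anchor by the description above, the following sketch lists the files in a directory tree that exceed a size threshold. The name `find_anchors` is hypothetical and reading '100K' as 100 KiB is an assumption; how hashget itself selects and uses anchors is not covered here.

```python
import os
import tempfile

ANCHOR_MIN_SIZE = 100 * 1024  # assumption: "100K" read as 100 KiB


def find_anchors(root, min_size=ANCHOR_MIN_SIZE):
    """Return sorted relative paths of regular files larger than min_size bytes."""
    anchors = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path) and os.path.getsize(path) > min_size:
                anchors.append(os.path.relpath(path, root))
    return sorted(anchors)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "small.txt"), "wb") as f:
            f.write(b"tiny")
        with open(os.path.join(root, "big.bin"), "wb") as f:
            f.write(b"\0" * (200 * 1024))
        print(find_anchors(root))  # prints ['big.bin']
```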

Now, let's compress again, with the same command:

STEP 1/3 Crawling...
Total: 222 packages
Crawling done in 0.00s. 222 total, 0 new, 0 already in db.
STEP 2/3 prepare exclude list for packing...
saved: 59095 files, 220 pkgs, size: 1.3G
STEP 3/3 tarring...
/var/lib/lxc/mydebvm/rootfs/ (1.5G) packed into /tmp/mydebvm.tar.gz (8.6M)

Great! We packed 1.5G into just 8.6 MB!

Hashget packs this into 8.6 MB in 28 seconds (on my Core i5 computer), versus 426 MB in 48 seconds with plain tar -czf.

Documentation

For more detailed documentation, see the Wiki.


