
hashget

Network deduplication tool for archiving (backing up) Debian virtual machines, mostly. For example, it is very useful for backing up LXC containers before uploading them to Amazon Glacier.

When compressing, hashget replaces indexed static files (files which can be downloaded from a static URL) with their hashes and URLs. This can compress a 600 MB Debian root filesystem with MySQL, Apache and other software down to just 4 MB!

When decompressing, hashget downloads these files, verifies their hashsums and places them on the target system with the same permissions, ownership, atime and mtime.
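
To illustrate the restore side of this idea, here is a minimal Python sketch. The manifest format is hypothetical (a list of records with url, sha256, path and metadata fields); the real .hashget-restore.json layout differs, and real hashget fetches whole indexed packages rather than individual files, but the verify-and-place logic is the same in spirit.

import hashlib
import json
import os
import urllib.request

def restore_files(manifest_path, root):
    # Hypothetical manifest: a list of {"url", "sha256", "path", "mode",
    # "uid", "gid", "atime", "mtime"} records describing files that were
    # left out of the archive because they can be re-downloaded.
    with open(manifest_path) as f:
        records = json.load(f)
    for rec in records:
        data = urllib.request.urlopen(rec["url"]).read()
        # Refuse to place the file if the downloaded content does not match.
        if hashlib.sha256(data).hexdigest() != rec["sha256"]:
            raise ValueError("hash mismatch for " + rec["url"])
        target = os.path.join(root, rec["path"])
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as out:
            out.write(data)
        os.chmod(target, rec["mode"])                   # permissions
        os.chown(target, rec["uid"], rec["gid"])        # ownership
        os.utime(target, (rec["atime"], rec["mtime"]))  # atime / mtime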

A hashget archive (in contrast to incremental and differential archives) is 'self-sufficient in the same world' (one where the Debian or Linux kernel projects are still alive).

Installation

Pip (recommended):

pip3 install hashget

or clone from git:

git clone https://gitlab.com/yaroslaff/hashget.git

QuickStart

Compressing

Compressing a test machine:

# hashget -zf /tmp/mydebian.tar.gz --pack /var/lib/lxc/mydebvm/rootfs/ --exclude var/cache/apt var/lib/apt/lists
STEP 1/3 Indexing debian packages...
Total: 222 packages
Indexing done in 0.02s. 222 local + 0 pulled + 0 new = 222 total.
STEP 2/3 prepare exclude list for packing...
saved: 8515 files, 216 pkgs, size: 445.8M. Download: 98.7M
STEP 3/3 tarring...
/var/lib/lxc/mydebvm/rootfs/ (687.2M) packed into /tmp/mydebian.tar.gz (4.0M)

The --exclude directive tells hashget and tar to skip directories which are not necessary in the backup. (You can omit it; the backup will just be larger.)

Now let's compare the result with plain tarring:

# du -sh --apparent-size /var/lib/lxc/mydebvm/rootfs/
693M	/var/lib/lxc/mydebvm/rootfs/

# tar -czf /tmp/mydebvm-orig.tar.gz  --exclude=var/cache/apt --exclude=var/lib/apt/lists -C /var/lib/lxc/mydebvm/rootfs/ .

# ls -lh mydebvm*
-rw-r--r-- 1 root root 165M Mar 29 00:27 mydebvm-orig.tar.gz
-rw-r--r-- 1 root root 4.1M Mar 29 00:24 mydebvm.tar.gz

The optimized backup is 40 times smaller!
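
Conceptually, the packing step boils down to hashing every file, looking the hash up in the local HashDB and splitting the tree into files that must go into the tarball and files that can simply be re-downloaded on restore. A rough Python sketch of that idea (the hashdb dict and the plan_pack helper are hypothetical, not hashget's actual internals):

import hashlib
import os

def plan_pack(root, hashdb):
    # hashdb: hypothetical {sha256: url} mapping built from indexed packages.
    exclude = []    # paths tar can skip
    manifest = []   # records for the restore file
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            url = hashdb.get(digest)
            if url is None:
                continue                  # not indexed: keep it in the tarball
            st = os.stat(path)
            manifest.append({"url": url, "sha256": digest,
                             "path": os.path.relpath(path, root),
                             "mode": st.st_mode, "uid": st.st_uid,
                             "gid": st.st_gid, "atime": st.st_atime,
                             "mtime": st.st_mtime})
            exclude.append(path)
    return exclude, manifest

The exclude list is handed to tar, and the manifest is what ends up (in some form) as .hashget-restore.json inside the archive.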

Decompressing

Untarring:

# mkdir rootfs
# tar -xzf mydebvm.tar.gz -C rootfs
# du -sh --apparent-size rootfs/
130M	rootfs/

After untarring, we have just 130 MB. Now, get all the missing files with hashget:

# hashget -u rootfs/
Recovered 8534/8534 files 450.0M bytes (49.9M downloaded, 49.1M cached) in 242.68s

(You can run with -v for verbose output.)

Now we have a fully working Debian system. Some files are still missing (e.g. the APT list files in /var/lib/apt/lists, which we explicitly --exclude'd; hashget doesn't miss anything on its own), but they can be recreated with the 'apt update' command.

Heuristics

Heuristics are small subprograms (part of the hashget package) which auto-detect non-indexed files that could be indexed.
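
For example, the kernel.org heuristic conceptually does something like the following (a simplified illustration, not hashget's real plugin API): recognize an unpacked Linux kernel source tree and derive the static URL of the original tarball, so it can be submitted for indexing.

import os
import re

def guess_kernel_url(directory):
    # Detect a kernel source tree by its top-level Makefile and read the
    # VERSION / PATCHLEVEL / SUBLEVEL fields to rebuild the tarball URL.
    makefile = os.path.join(directory, "Makefile")
    if not os.path.isfile(makefile):
        return None
    fields = {}
    with open(makefile) as f:
        for line in f:
            m = re.match(r"(VERSION|PATCHLEVEL|SUBLEVEL)\s*=\s*(\d+)", line)
            if m:
                fields[m.group(1)] = m.group(2)
            if len(fields) == 3:
                break
    if len(fields) < 3:
        return None
    version = "{VERSION}.{PATCHLEVEL}.{SUBLEVEL}".format(**fields)
    return ("https://cdn.kernel.org/pub/linux/kernel/"
            "v{}.x/linux-{}.tar.xz".format(fields["VERSION"], version))

# e.g. guess_kernel_url("linux-5.0.4")
#   -> "https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.4.tar.xz"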

Now let's add some files to our test machine:

mydebvm# wget -q https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.4.tar.xz
mydebvm# tar -xf linux-5.0.4.tar.xz 
mydebvm# du -sh --apparent-size .
893M	.

If we pack this machine the same way as before, we will see this:

root@braconnier:~# hashget -zf /tmp/mydebian.tar.gz --pack /var/lib/lxc/mydebvm/rootfs/ --exclude var/cache/apt var/lib/apt/lists
STEP 1/3 Indexing debian packages...
Total: 222 packages
Indexing done in 0.03s. 222 local + 0 pulled + 0 new = 222 total.
submitting https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.5.tar.xz
STEP 2/3 prepare exclude list for packing...
saved: 59095 files, 217 pkgs, size: 1.3G. Download: 199.1M
STEP 3/3 tarring...
/var/lib/lxc/mydebvm/rootfs/ (1.5G) packed into /tmp/mydebian.tar.gz (8.7M)

You see one more interesting line here:

submitting https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.5.tar.xz

Hashget detected the Linux kernel source package, then downloaded and indexed it. And we got a fantastic result again: 1.5G packed into just 8.7M!

Hashget packs this into 8 MB in 28 seconds (on my Core i5 computer) vs 426 MB in 48 seconds with plain tar -czf (and 3 minutes with hashget/tar/gz vs 4 minutes with plain tar on a slower notebook). Hashget packs faster and often much more effectively.

If you run hashget-admin --status you will see the kernel.org project. hashget-admin --list -p PROJECT will show the contents of a project:

# hashget-admin --list -p kernel.org
linux-5.0.5.tar.xz (767/50579)

Even when a new kernel package is released (and is not yet indexed anywhere), hashget will detect it and index it automatically.

Manually indexing files to local HashDB

Now let's make a test directory for packing:

# mkdir /tmp/test
# cd /tmp/test/
# wget -q https://ru.wordpress.org/wordpress-5.1.1-ru_RU.zip
# unzip wordpress-5.1.1-ru_RU.zip 
Archive:  wordpress-5.1.1-ru_RU.zip
   creating: wordpress/
  inflating: wordpress/wp-login.php  
  inflating: wordpress/wp-cron.php   
....
# du -sh --apparent-size .
54M	.

And now we will pack it:

# hashget -zf /tmp/test.tar.gz --pack /tmp/test/
STEP 1/3 Indexing...
STEP 2/3 prepare exclude list for packing...
saved: 4 files, 3 pkgs, size: 104.6K. Download: 3.8M
STEP 3/3 tarring...
/tmp/test/ (52.3M) packed into /tmp/test.tar.gz (22.1M)

That's the same result plain tar would give. Only ~100K was saved (you can see it in the .hashget-restore.json file; these are the usual license files). Still OK, but not as impressive as before. Let's bring the miracle back and make it impressive again!

# hashget --project my --submit https://ru.wordpress.org/wordpress-5.1.1-ru_RU.zip
# hashget -zf /tmp/test.tar.gz --pack /tmp/test/
STEP 1/3 Indexing...
STEP 2/3 prepare exclude list for packing...
saved: 1396 files, 1 pkgs, size: 52.2M. Download: 11.7M
STEP 3/3 tarring...
/tmp/test/ (52.3M) packed into /tmp/test.tar.gz (157.9K)

50M packed into 150K, over 300 times smaller! What other archiver can achieve such compression?

We can look at our project details:

root@braconnier:/tmp/test# hashget-admin --status -p my
my DirHashDB(path:/var/cache/hashget/hashdb/my stor:basename pkgtype:generic packages:0)
  size: 119.4K
  packages: 1
  first crawled: 2019-04-01 01:45:45
  last_crawled: 2019-04-01 01:45:45
  files: 1395
  anchors: 72
  packages size: 11.7M
  files size: 40.7M
  indexed size: 40.5M (99.61%)
  noanchor packages: 0
  noanchor size: 0
  no anchor link: 0
  bad anchor link: 0

It takes just over 100K on disk and has 1 package indexed (11.7M) with 1395 files in total.
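
Conceptually, indexing a package like this boils down to downloading the archive, hashing every file inside it, and remembering "this hash is available from this URL". A rough sketch with a hypothetical index_zip helper (not hashget's actual API):

import hashlib
import io
import urllib.request
import zipfile

def index_zip(url, hashdb):
    # hashdb: hypothetical {sha256: url} mapping (the local index).
    data = urllib.request.urlopen(url).read()
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        for info in z.infolist():
            if info.is_dir():
                continue
            digest = hashlib.sha256(z.read(info)).hexdigest()
            hashdb[digest] = url
    return hashdb

# index_zip("https://ru.wordpress.org/wordpress-5.1.1-ru_RU.zip", {})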

Hint files

If our package is indexed (as we just did with wordpress), it will be deduplicated very effectively when packing. But what if it's not indexed? For example, if you cleaned the hashdb cache, or if you restore this backup on another machine and pack it again, it will take its full space again.

Let's delete the index for this file:

# hashget-admin --purge wordpress-5.1.1-ru_RU.zip

Now, if you run hashget --pack, it will take a huge 50M again; our magic is lost...

Now create a special hint file hashget-hint.json (or .hashget-hint.json, if you want it to be hidden) in /tmp/test with this content:

{
	"project": "wordpress.org",
	"url": "https://ru.wordpress.org/wordpress-5.1.1-ru_RU.zip"
}

And now try to compress it again:

# hashget -zf /tmp/test.tar.gz --pack /tmp/test
STEP 1/3 Indexing...
submitting https://ru.wordpress.org/wordpress-5.1.1-ru_RU.zip
STEP 2/3 prepare exclude list for packing...
saved: 1396 files, 1 pkgs, size: 52.2M. Download: 11.7M
STEP 3/3 tarring...
/tmp/test (52.3M) packed into /tmp/test.tar.gz (157.9K)

Great! Hashget used the hint file and automatically indexed the package, so we got our fantastic compression rate again.
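
Conceptually, what happens here is simple: during the indexing step, the tree being packed is scanned for hint files, and each URL they reference is submitted to the named project. A rough sketch (the find_hints helper is hypothetical; hashget's real scanner may differ in details):

import json
import os

HINT_NAMES = ("hashget-hint.json", ".hashget-hint.json")

def find_hints(root):
    # Walk the tree being packed and yield (project, url) for every hint file.
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name in HINT_NAMES:
                with open(os.path.join(dirpath, name)) as f:
                    hint = json.load(f)
                yield hint["project"], hint["url"]

# for project, url in find_hints("/tmp/test"):
#     ...submit url into the local HashDB under `project`...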

What you should NOT index

You should index ONLY static and permanent files, which will remain available at the same URL with the same content. Not all projects provide such files. Usual Linux package repositories keep only the latest files, so they are not good for this purpose, but Debian has the great snapshot.debian.org repository, which makes Debian great for hashget compression.

Do not index 'latest' files, because their content will change later (they are not static). E.g. you may index https://wordpress.org/wordpress-5.1.1.zip, but you should not index https://wordpress.org/latest.zip.

Not only Debian, not only virtual machines

For now, the development hashserver has index files (HashPackages) for Debian only. But this does not mean you can use the power of hashget only for Debian VMs. In the previous example you added the Linux kernel to the local HashDB. You can pack anything which has indexed files in the HashDB:

# wget -q https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.4.tar.xz
# tar -xf linux-5.0.4.tar.xz
# hashget -zf /tmp/mykernel.tar.gz --pack .
STEP 1/3 Crawling [skipped]...
STEP 2/3 prepare exclude list for packing...
saved: 50580 files, 1 pkgs, size: 869.3M
STEP 3/3 tarring...
. (875.3M) packed into /tmp/mykernel.tar.gz (4.6M)

Documentation

For more detailed documentation, see the Wiki.
