
Split tarballs into smaller pieces along file boundaries.

Project description

Tarsplit

A utility to split tarballs into smaller pieces along file boundaries.

This is useful for gigantic tarballs that need to be split up so that they can fit on USB sticks, be turned into more reasonably sized Docker layers, or whatever.

Installation

Manually

python3 -m pip install git+https://github.com/dmuth/tarsplit.git

Usage

tarsplit [ --dry-run ] tarball num_files

Example run:
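
(The tarball name below is just for illustration.)

tarsplit huge.tar.gz 10

Adding --dry-run should report how the files would be divided up without actually writing any chunks.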

FAQ

How does it work?

This script is written in Python and uses the tarfile module to read and write tar files. This has the advantage of not having to extract the entire tarball, unlike the previous version of this app, which was written as a Bash shell script.
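
In broad strokes, the approach looks something like the sketch below. This is just an illustration of the idea, not the actual implementation; the function name, output naming, and defaults are made up.

import tarfile

def split_tarball(path, num_chunks, prefix="chunk"):
    """Split a tarball into num_chunks smaller tarballs along file boundaries."""
    with tarfile.open(path, "r") as src:
        members = src.getmembers()
        # Budget each output chunk roughly an equal share of the total payload.
        target = sum(m.size for m in members) / num_chunks

        chunk_num, written = 1, 0
        dst = tarfile.open(f"{prefix}-{chunk_num}.tar", "w")
        try:
            for member in members:
                # Directories, symlinks, etc. carry no file payload.
                fileobj = src.extractfile(member) if member.isfile() else None
                dst.addfile(member, fileobj)
                written += member.size

                # Roll over to the next chunk once this one is "full";
                # individual files are never split across chunks.
                if written >= target and chunk_num < num_chunks:
                    dst.close()
                    chunk_num, written = chunk_num + 1, 0
                    dst = tarfile.open(f"{prefix}-{chunk_num}.tar", "w")
        finally:
            dst.close()

split_tarball("huge.tar.gz", 10)

Because members are streamed straight from the source archive into the new ones, nothing ever needs to be extracted to disk.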

Why?

While working on Splunk Lab, I kept running into an issue where a particular layer in the Docker image was a gigabyte in size. This was a challenge because a number of wall-clock seconds were wasted processing that large layer after every push or pull. If only there were a way to split that layer up into multiple smaller layers, which Docker would then transfer in parallel...

Some investigating revealed that the culprit was a very large tarball. I wanted a way to split that tarball into multiple smaller tarballs, each containing a portion of the filesystem. Then I could build multiple Docker containers, each with a portion of the original tarball's files, with each container inheriting from the previous one. This would leverage one of the things Docker is good at: layered filesystems.

This is slow on large files. Ever hear of multithreading?

Yeah, I tried that after release 1.0. It turns out that even when using every trick I knew, a multithreaded approach with one thread per chunk to be written was slower than just doing everything in a single thread. I observed this on a 10-core machine with an SSD, so I'm just gonna go ahead and point the finger at the GIL and remind myself that threading in Python is cursed.

What about asyncio?

I used asyncio successfully for another project and haven't ruled it out. I am, however, skeptical because of the very high level of disk usage: async I/O is more appropriate for dozens or hundreds of writers hitting the disk occasionally, and that is not the case here.

Development

Support scripts

  • bin/create-test-tarball.sh - Create a test tarball with directories and files inside.
  • sha1-from-directory.sh - Get a recursive list of all files in a directory, sort it, SHA1 each file, then concatenate all of the SHA1s and SHA1 that! (See the Python sketch after this list for the gist.)
  • sha1-from-tarball.sh - Extract a tarball, then do the same thing to the contents as sha1-from-directory.sh.
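
For the curious, the hashing scheme used by sha1-from-directory.sh boils down to something like the following Python rendition. This is only a sketch of the idea; the real thing is a shell script, and the function name here is made up.

import hashlib
import os

def sha1_of_directory(root):
    # Collect every file under root and sort the list so the result is deterministic.
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))

    # SHA1 each file, concatenate the hex digests, then SHA1 that concatenation.
    digests = []
    for path in sorted(paths):
        with open(path, "rb") as f:
            digests.append(hashlib.sha1(f.read()).hexdigest())
    return hashlib.sha1("".join(digests).encode()).hexdigest()

Running this over the original directory and over the contents of an extracted tarball should yield the same fingerprint, which is presumably how the test scripts check that nothing was lost in the split.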

Tests

Tests can be run with tests.sh. A successful run looks something like this:


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tarsplit-1.0.post2.tar.gz (4.7 kB)

Uploaded Source

File details

Details for the file tarsplit-1.0.post2.tar.gz.

File metadata

  • Download URL: tarsplit-1.0.post2.tar.gz
  • Upload date:
  • Size: 4.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.9.0

File hashes

Hashes for tarsplit-1.0.post2.tar.gz:

  • SHA256: e5c6169429c650491c2fa98cb8aedea82fc81f59fa27a3f650d7b499ea8ade59
  • MD5: abe5afc5102713227b3a555e83d6dc33
  • BLAKE2b-256: 7a5b3c0fa9835fbe5a826f5e8253d98b91d51ef22eb21773cfb9d44551673241

See more details on using hashes here.
