Small utility to inspect/extract Zip files over HTTP

zipinspect

Zip files on the network: a niche use case, but still an interesting one to cover. This tool extracts individual files from Zip files over HTTP without downloading the whole archive. It relies on HTTP range requests being available (i.e. the download can be resumed); otherwise it refuses to operate, as downloading everything would defeat the purpose of the tool.
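To illustrate the range-request requirement, here is a minimal sketch of how a client could check for support and build a request header. The function names are hypothetical and not part of zipinspect; the check only inspects the Accept-Ranges response header, and a stricter client would also send a ranged GET and confirm a 206 Partial Content status.

```python
def supports_range_requests(headers: dict[str, str]) -> bool:
    """Return True if the response headers advertise byte-range support."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return normalised.get("accept-ranges", "none") == "bytes"


def range_header(start: int, length: int) -> dict[str, str]:
    """Build a Range header asking for `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends, hence the "- 1".
    return {"Range": f"bytes={start}-{start + length - 1}"}
```

The headers from a HEAD request on the archive URL would be passed to supports_range_requests, and range_header would then be used for every partial read.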

Installation

$ pip install zipinspect

uv

$ uv tool install zipinspect

Demo

# If uvx (astral.sh/uv) is available, an installation is not required; just use
#
# 	uvx zipinspect 'https://example.com/ArthurRimbaud-OnlyFans.zip'
$ zipinspect 'https://example.com/ArthurRimbaud-OnlyFans.zip'
> list
  #  entry                    size    modified date
---  -----------------------  ------  -------------------
  0  ArthurRimbaudOF_001.jpg  2.2M    2024-11-07T18:41:46
  1  ArthurRimbaudOF_002.jpg  2.4M    2024-11-07T18:41:48
  2  ArthurRimbaudOF_003.jpg  2.4M    2024-11-07T18:41:50
  3  ArthurRimbaudOF_004.jpg  2.5M    2024-11-07T18:41:50
  4  ArthurRimbaudOF_005.jpg  2.3M    2024-11-07T18:41:52
  5  ArthurRimbaudOF_006.jpg  2.4M    2024-11-07T18:41:52
  6  ArthurRimbaudOF_007.jpg  2.2M    2024-11-07T18:41:54
  7  ArthurRimbaudOF_008.jpg  2.4M    2024-11-07T18:41:56
  8  ArthurRimbaudOF_009.jpg  2.4M    2024-11-07T18:41:56
  9  ArthurRimbaudOF_010.jpg  2.3M    2024-11-07T18:41:58
 10  ArthurRimbaudOF_011.jpg  2.5M    2024-11-07T18:41:58
 11  ArthurRimbaudOF_012.jpg  1.5M    2024-11-07T18:42:00
 12  ArthurRimbaudOF_013.jpg  2.4M    2024-11-07T18:42:00
 13  ArthurRimbaudOF_014.jpg  2.6M    2024-11-07T18:42:02
 14  ArthurRimbaudOF_015.jpg  2.8M    2024-11-07T18:42:02
 15  ArthurRimbaudOF_016.jpg  2.8M    2024-11-07T18:42:04
 16  ArthurRimbaudOF_017.jpg  2.3M    2024-11-07T18:42:04
 17  ArthurRimbaudOF_018.jpg  2.9M    2024-11-07T18:42:06
 18  ArthurRimbaudOF_019.jpg  3.1M    2024-11-07T18:42:08
 19  ArthurRimbaudOF_020.jpg  2.9M    2024-11-07T18:42:08
 20  ArthurRimbaudOF_021.jpg  3.1M    2024-11-07T18:42:10
 21  ArthurRimbaudOF_022.jpg  3.1M    2024-11-07T18:42:10
 22  ArthurRimbaudOF_023.jpg  3.1M    2024-11-07T18:42:12
 23  ArthurRimbaudOF_024.jpg  3.0M    2024-11-07T18:42:14
 24  ArthurRimbaudOF_025.jpg  2.9M    2024-11-07T18:42:14
(Page 1/14)
> extract 8

 |#######################################################################| 100%

> extract 8,9,16

 |#######################################################################| 100%

> extract 20,...,24
 
 |#######################################################################| 100%

>

First, the entries in the archive are loaded; then the user is presented with a REPL in which the files can be browsed and extracted. Multiple entries can be downloaded concurrently using the list or range syntax, and the downloads are fast thanks to the tool's asynchronous design.

Features & Limitations

  • Multiple parallel extractions.
  • HTTP/2 for better download performance.
  • Zip files over 4GiB (Zip64) supported.
  • DEFLATE, BZip2, LZMA and Zstd compression supported.
  • ZipCrypto and WinZip AES encryption aren't supported.
  • Multi-part (spanned) files aren't supported.
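As a rough illustration of what supporting several compression methods involves, the sketch below dispatches on the compression method number stored with each entry (0 for stored, 8 for DEFLATE, 12 for BZip2, as assigned in APPNOTE.txt). The decompress function is hypothetical and is not zipinspect's code; LZMA and Zstd are left out because they need extra handling.

```python
import bz2
import zlib


def decompress(method: int, payload: bytes) -> bytes:
    """Decompress one entry's raw data according to its Zip method number."""
    if method == 0:  # stored: the data is kept as-is
        return payload
    if method == 8:  # DEFLATE: a raw stream without the zlib header
        return zlib.decompress(payload, wbits=-15)
    if method == 12:  # BZip2
        return bz2.decompress(payload)
    raise NotImplementedError(f"compression method {method} is not handled here")


# Round-trip a raw DEFLATE stream, which is how Zip stores method 8 data.
compressor = zlib.compressobj(wbits=-15)
raw = compressor.compress(b"x" * 1000) + compressor.flush()
restored = decompress(8, raw)
```

A negative wbits value tells zlib to expect a headerless stream, which is why the standard zlib.decompress call works on Zip entry data only with that argument.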

Help

In the REPL, the help command lists all the available commands and their arguments.

> help
This is the REPL, and the following commands are available.

list                            List entries in the current page
prev                            Go backward one page and show entries
next                            Go forward one page and show entries
extract <index> [dir]           Extract entry with index <index>
extract <start>,...,<end> [dir] Extract entries from <start> to <end>
extract <i0>,<i1>,...<in> [dir] Extract entries with specified indices

NOTE: The extract command accepts an optional path to the directory to extract into.
If not provided, it extracts into the current working directory
  1. If an argument contains a space, wrap it in double quotes; if it contains a double quote, wrap it in double quotes and backslash-escape the inner quote.
  2. If the index of a directory is passed to extract, the files and folders within it are downloaded recursively.
  3. If a file or folder already exists on the filesystem, extract overwrites it without asking for permission.
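The three forms of the extract command and the quoting rule can be sketched as follows. This is an assumption about how such a command line could be parsed, not the tool's actual parser: parse_indices and parse_extract are hypothetical names, the range is assumed to include both ends, and shlex is used because its POSIX rules match the double-quote and backslash behaviour described above.

```python
import shlex


def parse_indices(spec: str) -> list[int]:
    """Turn '8', '8,9,16' or '20,...,24' into a list of entry indices."""
    parts = [part.strip() for part in spec.split(",")]
    # Range form: exactly <start>,...,<end>, assumed inclusive on both ends.
    if len(parts) == 3 and parts[1] == "...":
        start, end = int(parts[0]), int(parts[2])
        if start > end:
            raise ValueError("range start must not exceed range end")
        return list(range(start, end + 1))
    # Otherwise every part must be a plain index; int() rejects anything else.
    return [int(part) for part in parts]


def parse_extract(line: str) -> tuple[list[int], str]:
    """Split an extract command into its indices and target directory."""
    args = shlex.split(line)  # honours double quotes and backslash escapes
    if not args or args[0] != "extract" or len(args) not in (2, 3):
        raise ValueError("usage: extract <indices> [dir]")
    directory = args[2] if len(args) == 3 else "."
    return parse_indices(args[1]), directory
```

With this sketch, a directory name containing a space only needs to be wrapped in double quotes, as in extract 8,9,16 "my photos".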

How it works

The Zip format compresses each file individually, unlike some other formats (e.g. tarballs), and stores the offsets of the file entries, along with all the necessary metadata, in its central directory, which is located at the end of the file. The central directory lets us extract just the file we're interested in rather than the whole archive. This doesn't matter much for Zip files available locally, but it is a great advantage when the Zip file is remote.
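To make the role of the end of the file concrete, the sketch below locates the end of central directory record in the last bytes of an archive and reads where the central directory starts. It builds a small archive in memory as a stand-in for the remote file; parse_eocd is a hypothetical helper, and Zip64 archives (which use an additional record) are not handled.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4sHHHHIIH"  # little-endian, fixed-size part of the record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes
MAX_COMMENT = 0xFFFF  # the trailing archive comment is at most 65535 bytes


def parse_eocd(tail: bytes) -> dict[str, int]:
    """Find the end of central directory record in the archive's last bytes."""
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(tail):
        raise ValueError("end of central directory record not found")
    fields = struct.unpack_from(EOCD_FORMAT, tail, pos)
    # fields: signature, disk numbers, entry counts, size, offset, comment length
    return {"entries": fields[4], "cd_size": fields[5], "cd_offset": fields[6]}


# Build a tiny archive in memory to stand in for the remote file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "hello")
    archive.writestr("b.txt", "world")
data = buffer.getvalue()

# Over HTTP this slice would be a single suffix request ("Range: bytes=-65557").
tail = data[-(EOCD_SIZE + MAX_COMMENT):]
info = parse_eocd(tail)
print(info)
```

Once cd_offset and cd_size are known, one more ranged read fetches the whole central directory without touching any file data.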

An Example

For instance, assume you have a large Zip archive full of pictures, weighing 42 GiB, stored on a remote server far away, and you need to fetch a few images worth 21 MiB. Now, would you rather download the entire archive and then extract the files you need, or would you rather download just the files you need? It depends on the speed of your connection, your storage capacity, and your patience. Not everyone has the luxury of these.

The procedure

Since most people are unable to afford such lavish lifestyles, this tool comes in handy. When you run it with a URL,

  1. The list of entries in the Zip archive is loaded first by reading the central directory record.
  2. The user can interactively browse the list of entries using the prev, next and list commands.
  3. File(s) can be extracted using the extract command with fearless concurrency and blazingly fast speeds.

The result? You don't waste bandwidth beyond the size of the files you asked for.
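The first step of the procedure can be sketched as a walk over the central directory's file headers, each of which names an entry and records where its data lives. The layout follows APPNOTE.txt, but Entry and iter_entries are hypothetical names, Zip64 extra fields are ignored, and the archive is built in memory rather than fetched over HTTP.

```python
import io
import struct
import zipfile
from typing import Iterator, NamedTuple

CDFH_SIGNATURE = b"PK\x01\x02"
CDFH_FORMAT = "<4sHHHHHHIIIHHHHHII"  # fixed-size part of each header
CDFH_SIZE = struct.calcsize(CDFH_FORMAT)  # 46 bytes


class Entry(NamedTuple):
    name: str
    method: int           # compression method number
    compressed_size: int  # bytes to download for this entry's data
    size: int             # size after decompression
    header_offset: int    # where the entry's local header starts


def iter_entries(central_directory: bytes) -> Iterator[Entry]:
    """Yield one Entry per file header found in the central directory."""
    pos = 0
    while pos + CDFH_SIZE <= len(central_directory):
        fields = struct.unpack_from(CDFH_FORMAT, central_directory, pos)
        if fields[0] != CDFH_SIGNATURE:
            break
        flags, method = fields[3], fields[4]
        name_len, extra_len, comment_len = fields[10:13]
        raw_name = central_directory[pos + CDFH_SIZE : pos + CDFH_SIZE + name_len]
        # Flag bit 11 marks UTF-8 names; the legacy default is CP437.
        encoding = "utf-8" if flags & 0x800 else "cp437"
        yield Entry(raw_name.decode(encoding), method, fields[8], fields[9], fields[16])
        # Skip the variable-length name, extra field and comment.
        pos += CDFH_SIZE + name_len + extra_len + comment_len


# Stand-in for the remote archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("a.txt", "hello" * 100)
    archive.writestr("dir/b.txt", "world")
data = buffer.getvalue()

# Read the central directory's size and offset from the trailing record.
eocd = data.rfind(b"PK\x05\x06")
cd_size, cd_offset = struct.unpack_from("<II", data, eocd + 12)
entries = list(iter_entries(data[cd_offset : cd_offset + cd_size]))
for entry in entries:
    print(entry)
```

Each Entry carries enough information to plan the ranged read for that file: the header offset says where to start, and the compressed size says how much to fetch.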

Remarks

The initial implementation combined zipfile with a seekable file-object wrapper around the remote file. Though the prototype worked, its performance was abysmal. The major bottleneck of this naive approach was that, through the abstract file interface, sequential accesses couldn't be distinguished from random accesses.

Technically speaking, only sequential access is possible via HTTP, because it's a stateless protocol; to support our needs, random access into the file is implemented using HTTP range requests. This incurs a performance penalty that is quite noticeable even with a single transfer: for each GET request the server has to set up a handler (CGI, FastCGI, et al.), open the file, seek to a position, and perform various other tasks before it can respond. Given this overhead, we have to minimise the number of requests whenever the amount of data to be read is known in advance. Fortunately it is known, but the zipfile API is oblivious to what lies behind the file interface, so it performs lots of unnecessary seeks that prevent such optimisations.
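One possible way to cut the number of requests, shown here purely as an illustration and not as what zipinspect actually does, is to merge byte ranges that touch or lie close together before fetching them. The coalesce function and its max_gap parameter are hypothetical; ranges are half-open (start, end) pairs.

```python
def coalesce(ranges: list[tuple[int, int]], max_gap: int = 0) -> list[tuple[int, int]]:
    """Merge half-open (start, end) ranges separated by at most max_gap bytes."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing another request.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged
```

A larger max_gap trades a few wasted bytes for fewer round trips, which pays off when the per-request overhead dominates.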

The solution?

Implement the Zip specification from scratch, preferably with an asynchronous API to allow concurrent extractions. That's what was done here: an HTTP-aware Zip extractor. Much of the information about the format was derived from the Wikipedia page and PKWARE's original APPNOTE.txt (the Zip specification). The implementation is not entirely on par with the specification or with more mature implementations, but it should work in the majority of cases. If you need a feature, open a GitHub issue.
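The benefit of an asynchronous API can be sketched with asyncio: several extractions run concurrently while a semaphore caps how many are in flight. The names below are hypothetical, and fetch_entry is a stand-in that sleeps instead of performing the ranged HTTP request and decompression a real extractor would do.

```python
import asyncio


async def fetch_entry(index: int) -> bytes:
    """Stand-in for a ranged HTTP GET followed by decompression."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"entry-{index}".encode()


async def extract_many(indices: list[int], limit: int = 4) -> dict[int, bytes]:
    """Fetch several entries concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(index: int) -> tuple[int, bytes]:
        async with semaphore:
            return index, await fetch_entry(index)

    # gather() runs the workers concurrently and keeps the input order.
    results = await asyncio.gather(*(worker(i) for i in indices))
    return dict(results)


extracted = asyncio.run(extract_many([8, 9, 16]))
print(sorted(extracted))
```

Because the workers spend most of their time waiting on the network, running them concurrently shortens the total time without needing threads.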
