
Flardl - Adaptive Multi-Site Downloading of Lists


Who would flardls bear?


Features

Flardl downloads lists of files from one or more servers using a novel adaptive asynchronous approach. Download rates are typically more than 300X higher than synchronous utilities such as curl, while use of multiple servers provides better robustness in the face of varying network and server loads. Download rates depend on network bandwidth, latencies, list length, file sizes, and the HTTP protocol used, but even a single server on another continent can usually saturate a gigabit connection after about 50 files using flardl.

Fishing Theory

Collections of files generated by natural or human activity such as natural-language writing, protein structure determination, or genome sequencing tend to have size distributions with long tails. For collections with long-tail distributions, one finds many more examples of big files than of small files at a given additive distance above or below the peak (modal) value. Examples of analytical forms of long-tail distributions include Zipf, power-law, and log-normal distributions. A real-world example of a long-tail distribution is shown in the figure below, which plots the file-size histogram for 1000 randomly-sampled CIF structure files from the Protein Data Bank, along with a kernel-density estimate and fits to log-normal and normal distributions.

[Figure: file-size histogram of 1000 randomly-sampled PDB CIF files, with kernel-density estimate and log-normal and normal fits]

The big files in the long tail have large effects on overall statistics, effects that are frequently ignored in the queueing literature and in many queueing algorithms that treat collections as normal-ish. The biggest single issue, which can be seen in the difference between normal-distribution fits to a randomly-selected 5% subsample and to the full 1000 points in the figure above, is that mean values are neither stable nor characteristic of the distribution: unlike with normal distributions, the means of runs drawn from long-tail distributions grow larger with the size of the run. Because of the appreciable likelihood of drawing a really large file to be downloaded, the total download time $t_{\rm tot}$ and therefore the mean per-file download rate $\overline{k_{\rm file}}$ both depend strongly on how many big-file outliers are included in your sample. If you are downloading multiple files simultaneously, the overall download time may also depend strongly on where in the list the large files happen to occur, because a large file near the end can cause an "overhang" of a single stream left waiting for that file.
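This instability of the mean can be demonstrated in a few lines of Python. The log-normal parameters below are hypothetical, chosen only to illustrate the effect, and are not fitted to the PDB data:

```python
import random
import statistics

random.seed(0)

# Illustrative log-normal parameters (sizes in MB); the heavy tail
# (large sigma) is what makes the sample mean unstable.
MU, SIGMA = 1.0, 2.0

def run_stats(n):
    """Mean and median file size for one random draw of n files."""
    sizes = [random.lognormvariate(MU, SIGMA) for _ in range(n)]
    return statistics.mean(sizes), statistics.median(sizes)

means, medians = zip(*(run_stats(200) for _ in range(100)))

# Relative spread (stdev / mean) across 100 repeated runs: the sample
# mean is far noisier than the sample median, because a run's mean is
# dominated by whether it happened to draw a big-file outlier.
def rel_spread(xs):
    return statistics.stdev(xs) / statistics.mean(xs)

print(f"mean   spread across runs: {rel_spread(means):.2f}")
print(f"median spread across runs: {rel_spread(medians):.2f}")
```

With a normal distribution, the two spreads would be comparable; with a long tail, the mean is markedly noisier.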

While the mean per-file download rate varies a lot between runs, the most-common (modal) per-file download rate $\tilde{k}_{\rm file}$ can be more consistent, at least on the timescale of days. If you are downloading a long list of files at the same time that someone else on your LAN is watching a video, you may not achieve the same saturation bit rate $b_{\rm sat}$ as when you are the only network user. But the modal file size of a collection can be quite stable over time, so we have hope that if we formulate download times in terms of the modal file size, that day's estimated server latencies, and the achievable download bit rate, the situation might be more tractable still.
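Under those assumptions, the per-file time for a modal-size file can be sketched as one round-trip of server latency plus the transfer time at the saturation bit rate. This is a rough illustrative model, not flardl's internal formula, and all the numbers below are hypothetical:

```python
def modal_file_time(size_modal_bytes, latency_s, b_sat_bps):
    """Rough time to fetch one modal-size file on a saturated link:
    one round-trip of latency plus the transfer time at b_sat."""
    return latency_s + 8 * size_modal_bytes / b_sat_bps

# e.g. a 100 kB modal file, 150 ms transcontinental latency, 1 Gb/s link:
t = modal_file_time(100_000, 0.150, 1_000_000_000)
print(f"{t * 1000:.1f} ms per modal file")  # → 150.8 ms
```

Note that for small modal files the latency term dominates, which is why deep request queues (many files in flight per server) are needed to saturate the link.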

Even more than maximizing download rates, the highest priority must be to avoid black-listing by a server. Most public-facing servers have policies to recognize and defend against Denial-Of-Service (DOS) attacks. The response to a DOS event, at the very least, causes the server to dump your latest request, which is usually a minor nuisance as it can be retried later. Far worse is if the server responds by severely throttling further requests from your IP address for hours or sometimes days. Worst of all, your IP address can get the "death penalty" and be put on a permanent blacklist that may require manual intervention for removal. You generally don't know the trigger levels for these policies. Blacklisting might not even be your personal fault, but a collective problem. I have seen a practical class of 20 students brought to a complete halt by a server's 24-hour black-listing of the institution's public IP address.

An analogy might help us here. Let's say you are a person who enjoys keeping track of statistics, and you decide to try fishing. At first, you have a single fishing rod and you go fishing at a series of local lakes where your catch consists of small bony fishes called "crappies". Your records reveal that while the rate of catching fishes can vary from day to day--fish might be hungry or not--the average size of your catch is pretty stable. Bigger ponds tend to have bigger fish in them, and it might take slightly longer to reel in a bigger crappie than a small one, but big and small average out over a given pond.

Then one day you decide you love fishing so much, you drive to the coast and charter a fishing boat. On that boat, you can set out as many lines as you want (up to some limit) and fish in parallel. At first, you seem to be catching the ocean-going equivalent of crappies, small bony fishes. But then you hook a small shark, which not only takes a lot of your time and attention to reel in, but which totally skews your estimate of the average weight of your catch. You know that if you can catch a small shark, then maybe if you fish for long enough you might catch a big shark, or even a small whale. But you and your crew can only effectively reel in so many hooked lines at once. Putting out more lines than that effective limit of hooked plus waiting-to-be-hooked lines only results in fishes waiting on the line, where they may break the line or get partly eaten before you can reel them in.

Here I propose and implement a method called adaptilastic queueing that gives robust performance in real situations while being simple enough to be easily understood and coded. The basis of adaptilastic queueing is keeping the total request-queue depth just high enough to achieve saturation. The method launches a large number of requests at the most-likely per-file rate at saturation, up to some maximum permissible per-server queue depth $D_{i,{\rm max}}$ (set either by guess or from previous knowledge of individual servers), during the period before any transfers have completed. As transfers complete, the method estimates the total-over-all-servers depth at which saturation was achieved, and updates its estimates of the achievable line bit rate and the most-likely per-file return rate on a per-server basis as the bases for managing future requests. Servers that return modal-length files (crappies) more quickly are thus given a better chance at nabbing an open queue slot, without penalizing a server that happened to draw a big download (whale).
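The per-server bookkeeping behind that idea can be sketched in a few dozen lines. This is an illustration of the scheme described above, not flardl's actual implementation; the class, method names, and smoothing constant are all made up:

```python
class ServerState:
    """Per-server bookkeeping for an adaptilastic-style dispatcher
    (illustrative sketch, not flardl's internal code)."""

    def __init__(self, name, d_max):
        self.name = name
        self.d_max = d_max     # maximum permissible queue depth D_i,max
        self.in_flight = 0     # requests currently outstanding
        self.rate_est = None   # estimated per-file return rate, files/s

    def has_slot(self):
        return self.in_flight < self.d_max

    def launch(self):
        self.in_flight += 1

    def record_completion(self, elapsed_s):
        """Update the rate estimate when a transfer finishes."""
        self.in_flight -= 1
        rate = 1.0 / elapsed_s
        # An exponential moving average tracks the modal return rate
        # without being dragged far off by one big-file (whale) download.
        self.rate_est = rate if self.rate_est is None else (
            0.8 * self.rate_est + 0.2 * rate
        )

def pick_server(servers):
    """Fastest-looking server that still has a free queue slot.
    Servers with no estimate yet sort first, so every server gets
    probed during the startup period."""
    free = [s for s in servers if s.has_slot()]
    if not free:
        return None
    return max(
        free,
        key=lambda s: s.rate_est if s.rate_est is not None else float("inf"),
    )
```

Capping `in_flight` at `d_max` is what prevents the "fishes waiting on the line" problem, while routing new requests through `pick_server` is what rewards servers that return crappies quickly.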

Requirements

Flardl is tested under Python 3.11 on Linux, MacOS, and Windows, and under Python 3.9 and 3.10 on Linux. Under the hood, flardl relies on httpx and is supported on whatever platforms that library supports, for both HTTP/1.1 and HTTP/2. HTTP/3 support could easily be added via aioquic once enough servers are running HTTP/3 to make that worthwhile.

Installation

You can install Flardl via pip from PyPI:

$ pip install flardl

Usage

Flardl has no CLI and does no I/O other than downloading and writing files. See test examples for usage.

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the BSD 3-Clause license, Flardl is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

Flardl was written by Joel Berendzen.

