Flardl

Flardl - Adaptive Multi-Site Downloading of Lists

Who would flardls bear?

Features

Flardl downloads lists of files from one or more servers using a novel adaptive asynchronous approach. Download rates are typically more than 300X higher than those of synchronous utilities such as curl, while the use of multiple servers provides better robustness in the face of varying network and server loads. Download rates depend on network bandwidth, latencies, list length, file sizes, and the HTTP protocol used, but even a single server on another continent can usually saturate a gigabit connection after about 50 files using flardl.

Fishing Theory

Collections of files generated by natural or human activity, such as natural-language writing, protein structure determination, or genome sequencing, tend to have size distributions with long tails. In a collection with a long-tail distribution, one finds many more big files than small files at a given additive distance above or below the peak (modal) value. Examples of analytical forms of long-tail distributions include the Zipf, power-law, and log-normal distributions. A real-world example of a long-tail distribution is shown in the figure below, which plots the file-size histogram for 1000 randomly-sampled CIF structure files from the Protein Data Bank, along with a kernel-density estimate and fits to log-normal and normal distributions.

[Figure: file-size histogram of 1000 sampled PDB CIF files, with kernel-density estimate and log-normal and normal fits]

The effects of the big files in the long tail are frequently ignored in queuing algorithms.

The nature of long-tail distributions is such that mean values are nearly worthless, because (unlike with normal distributions) the means of runs drawn from them grow with the size of the run. Given the appreciable likelihood of drawing a really large file from a long-tail distribution, the total download time, and therefore the mean downloading rate, depends strongly on how many large-size outliers are included in your sample. If you are downloading multiple files simultaneously, the overall download time may also depend strongly on whether a large file happens to occur near the end of the list, causing an "overhang" in which everything waits on a single file. Theories and algorithms based on overall times or mean rates therefore won't work very well on the long-tail distributions that often characterize real collections.
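The growth of run means is easy to demonstrate with a heavy-tailed distribution whose true mean diverges. The sketch below is illustrative only (not flardl code): it draws Pareto samples with shape parameter 1 using the Python standard library, and the sample means tend to keep climbing as runs get longer because ever-larger outliers keep appearing.

```python
import random
import statistics

random.seed(17)  # reproducible demo

def run_means(run_lengths, alpha=1.0):
    """Return the sample mean of a Pareto(alpha) draw for each run
    length; for alpha <= 1 the true mean is infinite, so sample
    means grow without bound as runs get longer."""
    return {
        n: statistics.fmean(random.paretovariate(alpha) for _ in range(n))
        for n in run_lengths
    }

means = run_means([100, 10_000, 1_000_000])
```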

Modal values, unlike means, are a good statistic for power-law distributions. To put that another way, the average download time $\overline{t_{dl}}$ varies a lot between runs, but the most-common download time $\tilde{t}_{dl}$ can be pretty consistent. The modal file length and the modal download bit rate are both quantities that are easy to estimate for a collection and that rarely change. If one happens to select the biggest files for downloading, or happens to start a long download at the same time that someone is watching a high-bit-rate video on the same shared connection, it's easy to adjust a bit for just that run.
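The contrast between unstable means and stable modes can be sketched numerically. The following code is illustrative only (not part of flardl): it draws several independent log-normal runs and compares each run's sample mean against a crude modal estimate taken from the most-populated histogram bin; across runs, the modal estimates typically cluster much more tightly than the means.

```python
import random
import statistics
from collections import Counter

random.seed(3)  # reproducible demo

def mode_estimate(draws, bin_width=0.1):
    """Crude modal estimate: centre of the most-populated histogram bin."""
    bins = Counter(round(x / bin_width) for x in draws)
    return max(bins, key=bins.get) * bin_width

runs = [
    [random.lognormvariate(0.0, 1.5) for _ in range(2_000)]
    for _ in range(5)
]
means = [statistics.fmean(r) for r in runs]  # vary noticeably run to run
modes = [mode_estimate(r) for r in runs]     # comparatively stable
```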

Here I propose a heuristic called adaptive-depth queuing that gives robust performance in real situations while being simple enough to be easily understood and coded.

Even more than maximizing download rates, the highest priority must be avoiding black-listing by a server. Most public-facing servers have policies to recognize and defend against Denial-Of-Service (DOS) attacks. At the very least, the response to a DOS event causes the server to drop your latest request, which is usually a minor nuisance as it can be retried later. Far worse is if the server responds by severely throttling further requests from your IP address for hours or sometimes days. Worst of all, your IP address can get the "death penalty" and be put on a permanent blacklist that may require manual intervention for removal. You generally don't know the trigger levels for these policies. Worse still, the triggering traffic might not even be yours: I have seen a practical class of 20 students brought to a complete halt by a server's 24-hour black-listing of their institution's IP address.

Simply launching a large number of requests and letting the servers sort it out is the strategy that maximizes the chance of black-listing, for two reasons. First, this strategy results in equal division of transfers without regard to varying transfer sizes or server latencies. Second, a sudden flood of simultaneous requests is exactly the traffic pattern that DOS-detection policies are designed to flag.

Given that a single server can saturate a gigabit connection given enough simultaneous downloads, a better strategy is to keep the total request-queue depth just high enough to achieve saturation. This goal can be achieved by launching a large number of requests, up to some maximum permissible queue depth $Q_{\rm max}$ (set either by guess or from previous knowledge of the individual servers), during the server latency period before any transfers have completed. As transfers complete, one can then calculate the saturation bandwidth $B$ and the total-over-all-servers depth at which saturation was achieved, $Q_{\rm sat}$.
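The launch phase of that strategy can be sketched with asyncio and a semaphore. Everything below is an illustrative sketch, not flardl's implementation: `fetch()` is a stand-in that simulates transfers with sleeps, and `Q_MAX` is an assumed depth limit, not a value flardl uses.

```python
import asyncio
import time

Q_MAX = 32  # assumed maximum permissible queue depth

async def fetch(i: int) -> int:
    """Stand-in for one HTTP transfer: the sleep models latency plus
    transfer time, and the return value models bytes received."""
    await asyncio.sleep(0.01 + (i % 7) * 0.005)
    return 10_000 + i

async def download_all(n_files: int) -> float:
    """Keep at most Q_MAX requests in flight and return the aggregate
    bandwidth B (bytes/s) observed once all transfers complete."""
    sem = asyncio.Semaphore(Q_MAX)

    async def bounded(i: int) -> int:
        async with sem:  # caps the total request-queue depth
            return await fetch(i)

    start = time.monotonic()
    sizes = await asyncio.gather(*(bounded(i) for i in range(n_files)))
    return sum(sizes) / (time.monotonic() - start)

bandwidth = asyncio.run(download_all(50))
```

In a real downloader, the depth at which the aggregate rate stops growing as more requests are admitted would serve as the estimate of $Q_{\rm sat}$.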

For those lucky enough to be on a multi-gigabit connection, it is a good idea to limit bandwidth to a level you know the set of servers you are using won't complain about. It would be nice if one could query a server for an acceptable request-queue depth that would guarantee no DOS response or other throttling, but I have not seen such a mechanism implemented.
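One simple way to impose such a bandwidth cap is a token-bucket limiter that each transfer consults before sending a chunk. The sketch below is illustrative only and is not flardl's actual implementation; the class name and rate are made up for the example.

```python
import asyncio
import time

class BandwidthLimiter:
    """Token-bucket limiter: callers wait until enough byte-tokens
    have accrued to cover the chunk they are about to transfer."""

    def __init__(self, rate_bytes_per_s: float):
        self.rate = rate_bytes_per_s
        self.allowance = rate_bytes_per_s
        self.last = time.monotonic()

    async def throttle(self, nbytes: int) -> None:
        while True:
            now = time.monotonic()
            # Accrue tokens for the elapsed interval, capped at one
            # second's worth so bursts stay bounded.
            self.allowance = min(
                self.rate, self.allowance + (now - self.last) * self.rate
            )
            self.last = now
            if self.allowance >= nbytes:
                self.allowance -= nbytes
                return
            await asyncio.sleep((nbytes - self.allowance) / self.rate)

async def demo() -> int:
    limiter = BandwidthLimiter(1_000_000)  # cap at roughly 1 MB/s
    sent = 0
    for _ in range(3):
        await limiter.throttle(200_000)  # account for a 200 kB chunk
        sent += 1
    return sent

chunks = asyncio.run(demo())
```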

Requirements

Flardl is tested under Python 3.11 on Linux, MacOS, and Windows, and under Python 3.9 and 3.10 on Linux. Under the hood, flardl relies on httpx and runs on whatever platforms that library supports, for both HTTP/1.1 and HTTP/2. HTTP/3 support could easily be added via aioquic once enough servers are running HTTP/3 to make that worthwhile.

Installation

You can install Flardl via pip from PyPI:

$ pip install flardl

Usage

Flardl has no CLI and does no I/O other than downloading and writing files. See test examples for usage.

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the BSD 3-clause license, Flardl is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

Flardl was written by Joel Berendzen.

Download files

Source Distribution: flardl-0.0.7.tar.gz (20.3 kB)

Built Distribution: flardl-0.0.7-py3-none-any.whl (19.2 kB)
