Fast, pure Python S3 downloader

Project description

Hook-based retrieval library

Current State

Been moving code from exploration to a usable state, with a proper interface where we can run benchmarks.
The code is in a pretty solid place, but there is still a lot that could be done.
The first thing to do is to get EC2-based benchmarking code working, so that we can run a benchmark with the new codebase.
- Requirements:
  - Calculate effective throughput
  - Track CPU/Memory/Network usage and store it in an easily parseable format
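
As a rough illustration of the first requirement, effective throughput can be computed from the total bytes transferred and the wall-clock time. This is a minimal sketch, not the project's benchmarking code; the function name and the choice of 1 gbit = 1e9 bits are assumptions.

```python
def effective_throughput_gbit_s(total_bytes: int, elapsed_seconds: float) -> float:
    """Return end-to-end throughput in gigabits per second.

    Hypothetical helper: assumes 1 gbit = 1e9 bits and that
    elapsed_seconds is wall-clock time for the whole download.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    bits = total_bytes * 8
    return bits / elapsed_seconds / 1e9
```

Measuring wall-clock time around the whole download (rather than per request) is what makes the number "effective": it includes setup, scheduling, and reordering overhead.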
Did some benchmarking. Retriever uses way more CPU power, but it is also much faster when bumping up the concurrency: on an m5n.2xlarge, it gets about 11 gbit/s download speed, compared with about 2 gbit/s for download_file.
Saw much faster download speed when moving into the same region as the s3 bucket (curl -sI https://noaa-goes16.s3.amazonaws.com | grep bucket-region). About double the speed.
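
The same region lookup as the curl command above could be done from Python with the standard library. This is a sketch under assumptions: the helper names are made up for illustration, and it relies on S3 returning the `x-amz-bucket-region` response header (the one the grep above matches), which it typically also does on error responses.

```python
import urllib.error
import urllib.request
from typing import Mapping, Optional

REGION_HEADER = "x-amz-bucket-region"


def region_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Pick the bucket region out of response headers (names are case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == REGION_HEADER:
            return value
    return None


def bucket_region(bucket: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request to the bucket endpoint and read the region header.

    Hypothetical helper mirroring `curl -sI https://<bucket>.s3.amazonaws.com`.
    """
    url = f"https://{bucket}.s3.amazonaws.com"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return region_from_headers(dict(response.headers.items()))
    except urllib.error.HTTPError as err:
        # Error responses (e.g. 403) usually still carry the region header.
        return region_from_headers(dict(err.headers.items()))
```

For example, `bucket_region("noaa-goes16")` would perform the lookup for the bucket used in the benchmark.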
Probably more performance available with VPC endpoints - https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/
The background profiler is quite good, but need to port it into the actor system, probably add some additional profiled metrics, and decide on a serialization schema so the results can be loaded. Eventually will want to graph memory usage, CPU usage, and network throughput over time for various download approaches.
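
One possible serialization schema for profiler output is JSON Lines, with one sample per line. The schema is undecided, so this is only a sketch: the `ProfileSample` fields and function names are assumptions, not the profiler's actual data model.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class ProfileSample:
    # All field names are hypothetical; pick whatever the profiler records.
    timestamp: float             # seconds since the epoch
    cpu_percent: float           # CPU utilization at sample time
    memory_bytes: int            # resident memory at sample time
    network_bytes_received: int  # cumulative bytes received


def dump_samples(samples: Iterable[ProfileSample]) -> str:
    """Serialize samples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in samples)


def load_samples(text: str) -> List[ProfileSample]:
    """Parse JSON Lines text back into ProfileSample objects."""
    return [ProfileSample(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

A line-oriented format keeps the file appendable while a benchmark is running and easy to load for graphing afterwards.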
Want to do performance benchmarks across different instance types and configurations. Need to sweep various config values to tune params. Want to be able to write doc showing that performance is much better when using Retriever if you're willing to pay the mem/cpu cost.
ec2-cluster would be useful. But that project is a little rusty so it's a good time to improve it a bit - better UX, automatically determine more params like VPC id, subnet, keypair location, etc.
Would be nice to be able to combine actors and ec2-cluster to parallelize benchmarking.
Haven't defined the hooks yet.
Currently actors get references to other actors at creation time. Better to use a global registry (for extensibility), but need to manually overwrite the actor URNs since the current approach is UUIDs and not very usable.
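
A global registry keyed by human-readable URNs could look roughly like the sketch below. It does not reflect the current actor system; `ActorRegistry`, its methods, and the example URN format are hypothetical.

```python
from typing import Any, Dict


class ActorRegistry:
    """Hypothetical registry mapping readable URNs to actor references."""

    def __init__(self) -> None:
        self._actors: Dict[str, Any] = {}

    def register(self, urn: str, actor: Any) -> None:
        # Refuse silent overwrites so a name clash is caught early.
        if urn in self._actors:
            raise ValueError(f"actor already registered: {urn}")
        self._actors[urn] = actor

    def lookup(self, urn: str) -> Any:
        try:
            return self._actors[urn]
        except KeyError:
            raise LookupError(f"no actor registered as: {urn}") from None


# Module-level instance so actors can find each other after creation
# instead of receiving references in their constructors.
REGISTRY = ActorRegistry()
```

With this, an actor could call `REGISTRY.lookup("retriever/chunk_sequencer")` (an invented name) when it needs to send a message, rather than holding a reference passed in at creation time.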
The ParallelChunkDownloader actor implementation could be cleaner I think.

Stages of downloading a file from s3

1. init
    - Process pools, s3 clients, etc
2. file list collection
3. file start downloading
    - Check if cache is available, etc
4. chunk task generation
5. fan_out
    - Parallelize downloading of chunks across processes/threads
6. fan_in
    - Send chunks back to master thread and reorder
7. post_download
8. done

Retriever Architecture

The architecture is made up of multiple pipeline stages. Each stage should take inputs in through a queue and send data out through a queue.

Retriever is the interface to the overall pipeline
FileListGenerator takes in a DownloadRequest, finds the files that match, and outputs FileDownloadRequests. [parallel - for HEAD requests]
- The FileDownloadRequests each contain the location on s3, the size of the file (and plugin specific information?)
FileChunker splits up FileDownloadRequests into ChunkGetTasks, batching multiple FileDownloadRequests [serial, although it needs info from chunk sequencer about progress somehow]
- Should only parallelize a few FileDownloadRequests at the same time. Should have a limit on WIP ChunkGetTasks so that batching is based on size of files.
ParallelChunkDownloader takes in ChunkGetTasks and distributes them to workers. The downloaded chunks are output as they finish, not necessarily in order [parallel]
ChunkSequencer takes in Chunks and outputs in-order chunks [serial]
On top of the in-order chunks, we will have different consumption methods depending on the use case.
- Load into memory
- Save into file
- Load into memory and save into file
- Iterator of chunks [easiest initial option]
- Iterator of files?
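
To make the FileChunker step described above more concrete, here is a minimal sketch of splitting one FileDownloadRequest into ChunkGetTasks by byte range. The class names come from the notes above, but every field name, the `chunk_file` function, and the inclusive-range convention are assumptions for illustration; batching across files and the WIP limit are left out.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class FileDownloadRequest:
    bucket: str  # assumed field: S3 bucket name
    key: str     # assumed field: object key
    size: int    # file size in bytes


@dataclass(frozen=True)
class ChunkGetTask:
    bucket: str
    key: str
    seq: int    # position of the chunk within the file
    start: int  # first byte, inclusive
    end: int    # last byte, inclusive (same convention as an HTTP Range header)


def chunk_file(request: FileDownloadRequest, chunk_size: int) -> Iterator[ChunkGetTask]:
    """Yield one ChunkGetTask per chunk_size-sized byte range of the file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for seq, start in enumerate(range(0, request.size, chunk_size)):
        end = min(start + chunk_size, request.size) - 1
        yield ChunkGetTask(request.bucket, request.key, seq, start, end)
```

Each task maps directly onto a ranged GET such as `Range: bytes=<start>-<end>`, which is what allows the chunks to be downloaded in parallel.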

The final user-facing abstraction should be an iterator.
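
The reordering done by ChunkSequencer, exposed as an iterator, could be sketched as a generator that buffers out-of-order chunks until the next expected one arrives. This is an illustration rather than the actual implementation; it assumes each chunk arrives as a `(sequence_number, data)` pair numbered from 0 with no gaps.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sequence_chunks(chunks: Iterable[Tuple[int, bytes]]) -> Iterator[bytes]:
    """Yield chunk data in sequence order from an out-of-order stream."""
    pending: Dict[int, bytes] = {}  # chunks that arrived early
    next_seq = 0                    # sequence number we need to emit next
    for seq, data in chunks:
        pending[seq] = data
        # Emit as many consecutive chunks as are now available.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1
```

The buffer only holds chunks that arrived ahead of their turn, so memory use depends on how far out of order the downloader runs.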

Pipeline Stages

Each stage should take data in through a queue and send data out through a queue. What is the execution model for Pipeline stages - actor-like?

Stages will usually want to have background thread/processes to process data as quickly as it comes in.
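
One way a queue-in/queue-out stage with a background thread might look is sketched below. The execution model is still an open question, so the `Stage` class, the `_STOP` sentinel, and `run_stage` are all hypothetical, and error handling is deliberately omitted.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # sentinel that tells a stage to shut down


class Stage:
    """Hypothetical serial stage: applies fn to each item from inbox."""

    def __init__(self, fn: Callable[[Any], Any], inbox: queue.Queue, outbox: queue.Queue) -> None:
        self.fn = fn
        self.inbox = inbox
        self.outbox = outbox
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def join(self) -> None:
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self.inbox.get()
            if item is _STOP:
                # Forward the sentinel so downstream stages stop too.
                self.outbox.put(_STOP)
                return
            self.outbox.put(self.fn(item))


def run_stage(fn: Callable[[Any], Any], items: Iterable[Any]) -> List[Any]:
    """Feed items through a single stage and collect its output."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    stage = Stage(fn, inbox, outbox)
    stage.start()
    for item in items:
        inbox.put(item)
    inbox.put(_STOP)
    results = []
    while True:
        out = outbox.get()
        if out is _STOP:
            break
        results.append(out)
    stage.join()
    return results
```

Stages can be chained by passing one stage's outbox as the next stage's inbox; the forwarded sentinel then shuts the whole chain down in order.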

How does fan-out/fan-in work? How do we expose a unified callback API when some stages are serial and some are parallelized? Do we change the Pipeline abstraction to unify it (each fan-out is composed of many stages running with the same input and output pipes, or every pipeline stage has fan-in and fan-out)? For now, we will make callbacks custom instead of being automatic features of pipeline stages.
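
As one possible shape for fan-out/fan-in with a custom callback, the sketch below uses a thread pool from the standard library. It is not the Retriever implementation; `fan_out_fan_in` and the `on_result` callback signature are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List, Optional, Sequence


def fan_out_fan_in(
    tasks: Sequence[Any],
    worker: Callable[[Any], Any],
    max_workers: int = 4,
    on_result: Optional[Callable[[int, Any], None]] = None,
) -> List[Any]:
    """Run worker over tasks in parallel and return results in task order."""
    results: Dict[int, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fan out: remember which task index each future belongs to.
        futures = {pool.submit(worker, task): i for i, task in enumerate(tasks)}
        # Fan in: results arrive in completion order, not submission order.
        for future in as_completed(futures):
            index = futures[future]
            value = future.result()
            if on_result is not None:
                # The custom callback runs on the collecting thread.
                on_result(index, value)
            results[index] = value
    return [results[i] for i in range(len(tasks))]
```

Running the callback on the collecting thread keeps it serial even though the workers are parallel, which sidesteps the question of sharing callback state across workers.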

Hooks

TODO: Define the relevant user-facing hooks. Will also put developer-focused hooks all over the place

Probably two types of hooks - Class hooks that are stateful and functional hooks that are pure. Pipeline stages that are parallel can only use functional hooks (although maybe give the functional hook access to queues?).
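
Since the hooks are not defined yet, the following is only a guess at how the two kinds could differ: a functional hook is a pure function over a chunk, while a class hook carries state between calls. All names here are hypothetical.

```python
from typing import Callable, Iterable

# A hook takes a chunk of bytes and returns the (possibly transformed) chunk.
Hook = Callable[[bytes], bytes]


def uppercase_hook(chunk: bytes) -> bytes:
    """Functional hook: pure, so it is safe to run inside parallel workers."""
    return chunk.upper()


class ByteCounterHook:
    """Class hook: stateful, so it should only run in a serial stage."""

    def __init__(self) -> None:
        self.total = 0

    def __call__(self, chunk: bytes) -> bytes:
        self.total += len(chunk)
        return chunk


def apply_hooks(chunk: bytes, hooks: Iterable[Hook]) -> bytes:
    """Pass the chunk through each hook in order."""
    for hook in hooks:
        chunk = hook(chunk)
    return chunk
```

Giving both kinds the same call signature lets a stage treat them uniformly, while the serial-versus-parallel restriction is enforced by where the hook is allowed to be registered.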

Communication channels

Reference

Optimizing s3 performance with request parallelization - https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-parallelization

Several existing projects referenced in this thread - https://news.ycombinator.com/item?id=26764067. Good stuff and some impressive projects.


retriever-research 0.0.5
