Fast, pure Python S3 downloader

Hook-based retrieval library

Current State

  • Been moving code from exploratory to usable, with a proper interface that we can run benchmarks against.

  • The code is in a pretty solid place, but there is still a lot that could be done.

  • The first thing to do is to get EC2-based benchmarking code working. Be able to run a benchmark with the new codebase.

    • Requirements:
      • Calculate effective throughput
      • Track CPU/Memory/Network usage and store it in an easily parseable format
  • Did some benchmarking. Retriever uses way more CPU power. It is also much faster when bumping up the concurrency: on an m5n.2xlarge, getting about 11 Gbit/s download speed, compared with about 2 Gbit/s for download_file.

  • Saw much faster download speed when moving into the same region as the S3 bucket (to check the bucket's region: curl -sI https://noaa-goes16.s3.amazonaws.com | grep bucket-region). About double the speed.

  • Probably more performance available with VPC endpoints - https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/

  • The background profiler is quite good, but need to port it into the actor system, probably add some additional profilers, and decide on a serialization schema so the results can be loaded. Eventually will want to graph memory usage, CPU usage, and network throughput over time for various download approaches.

  • Want to do performance benchmarks across different instance types and configurations. Need to sweep various config values to tune params. Want to be able to write a doc showing that performance is much better when using Retriever if you're willing to pay the mem/CPU cost.

  • ec2-cluster would be useful. But that project is a little rusty so it's a good time to improve it a bit - better UX, automatically determine more params like VPC id, subnet, keypair location, etc.

  • Would be nice to be able to combine actors and ec2-cluster to parallelize benchmarking.

  • Haven't defined the hooks yet.

  • Currently actors get references to other actors at creation time. Better to use a global registry (for extensibility), but need to manually overwrite the actor URNs since the current approach is UUIDs and not very usable.

  • The ParallelChunkDownloader actor implementation could be cleaner I think.
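
The benchmarking requirements listed earlier (effective throughput, plus CPU/memory/network usage stored in an easily parseable format) could be sketched roughly as follows. This is only an illustration: the names `effective_throughput_gbit_s`, `ResourceSample`, and `to_jsonl` are hypothetical and not taken from the Retriever codebase, and JSON Lines is just one possible serialization choice.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


def effective_throughput_gbit_s(total_bytes: int, elapsed_seconds: float) -> float:
    """Effective throughput in gigabits per second (1 Gbit = 1e9 bits)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_bytes * 8 / elapsed_seconds / 1e9


@dataclass
class ResourceSample:
    """One profiler sample (hypothetical schema, fields are assumptions)."""
    timestamp: float
    cpu_percent: float
    memory_bytes: int
    network_bytes_received: int


def to_jsonl(samples: Iterable[ResourceSample]) -> str:
    """Serialize samples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in samples)
```

One JSON object per line keeps the output appendable while a benchmark is running and easy to load later for graphing.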

Stages of downloading a file from S3

1. init
    - Process pools, s3 clients, etc
2. file list collection
3. file start downloading
    - Check if cache is available, etc
4. chunk task generation
5. fan_out
    - Parallelize downloading of chunks across processes/threads
6. fan_in
    - Send chunks back to master thread and reorder
7. post_download
8. done
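
As an illustration of the chunk task generation stage, here is a minimal sketch that splits a file of known size into byte ranges suitable for HTTP `Range` requests. The function names `chunk_ranges` and `range_header` are hypothetical, and the chunk size is an arbitrary parameter, not a value used by Retriever.

```python
from typing import Iterator, Tuple


def chunk_ranges(file_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte offsets that cover file_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, file_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, file_size) - 1
        yield start, end


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"
```

Each (start, end) pair can then be handed to a worker in the fan_out stage, and the pair's position in the sequence gives the order needed for reassembly in fan_in.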

Retriever Architecture

The architecture is made up of multiple pipeline stages. Each stage should take inputs in through a queue and send data out through a queue.

  1. Retriever is the interface to the overall pipeline
  2. FileListGenerator takes in a DownloadRequest, finds the files that match, and outputs FileDownloadRequests. [parallel - for HEAD requests]
    • The FileDownloadRequests each contain the location on s3, the size of the file (and plugin specific information?)
  3. FileChunker splits up FileDownloadRequests into ChunkGetTasks, batching multiple FileDownloadRequests [serial, although it needs info from chunk sequencer about progress somehow]
    • Should only parallelize a few FileDownloadRequests at the same time. Should have a limit on WIP ChunkGetTasks so that batching is based on size of files.
  4. ParallelChunkDownloader takes in ChunkGetTasks and distributes them to workers. The downloaded chunks are output (not in order) [parallel]
  5. ChunkSequencer takes in Chunks and outputs in-order chunks [serial]
  6. On top of the in-order chunks, we will have different consumption methods depending on the use case.
    • Load into memory
    • Save into file
    • Load into memory and save into file
    • Iterator of chunks [easiest initial option]
    • Iterator of files?

The final user-facing abstraction should be an iterator.
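
A minimal sketch of what the ChunkSequencer step could look like when exposed as an iterator: it accepts (index, payload) pairs in arbitrary order and yields payloads in index order, buffering any chunk that arrives early. The function name `sequence_chunks` and the (index, payload) tuple shape are assumptions for illustration, not the actual Retriever types.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def sequence_chunks(chunks: Iterable[Tuple[int, bytes]]) -> Iterator[bytes]:
    """Yield chunk payloads in index order from out-of-order (index, payload) pairs."""
    pending: List[Tuple[int, bytes]] = []  # min-heap keyed on chunk index
    next_index = 0
    for item in chunks:
        heapq.heappush(pending, item)
        # Release every buffered chunk that is now contiguous with the output.
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)[1]
            next_index += 1
    if pending:
        raise ValueError(f"missing chunk {next_index}; {len(pending)} chunks left buffered")
```

Memory use here grows with how far ahead of the slowest chunk the downloader gets, which is one reason to limit the number of in-progress ChunkGetTasks upstream.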

Pipeline Stages

Each stage should take data in through a queue and send data out through a queue. What is the execution model for pipeline stages: actor-like?

Stages will usually want to have background threads/processes to process data as quickly as it comes in.

How does fan-out/fan-in work? How do we expose a unified callback API when some stages are serial and some are parallelized? Do we change the Pipeline abstraction to unify it (is each fan-out composed of many stages running with the same input and output pipes, or does every pipeline stage have fan-in and fan-out)? For now, we will make callbacks custom instead of being automatic features of pipeline stages.
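
To make the first option concrete (a fan-out composed of many workers sharing the same input and output queues), here is a rough standard-library sketch. The function name `run_fan_out` is hypothetical, threads stand in for whatever workers the real design uses, and results come back in completion order, so a separate fan-in step still has to reorder them.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # one sentinel is sent per worker to shut it down


def run_fan_out(fn: Callable[[Any], Any], items: Iterable[Any], num_workers: int) -> List[Any]:
    """Run fn over items with num_workers threads sharing one inbox and one outbox."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            item = inbox.get()
            if item is _STOP:
                return
            outbox.put(fn(item))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        inbox.put(item)
    for _ in threads:
        inbox.put(_STOP)
    for t in threads:
        t.join()

    # All workers have exited, so draining until empty is safe here.
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results
```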

Hooks

TODO: Define the relevant user-facing hooks. Will also put developer-focused hooks all over the place.

Probably two types of hooks - Class hooks that are stateful and functional hooks that are pure. Pipeline stages that are parallel can only use functional hooks (although maybe give the functional hook access to queues?).
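
Since the hooks are not defined yet, the following is purely a hypothetical sketch of the distinction between the two kinds. `chunk_size_hook` and `ByteCounterHook` are invented names, and the idea that a hook receives raw chunk bytes is an assumption.

```python
def chunk_size_hook(chunk: bytes) -> int:
    """Functional hook: pure, depends only on its argument, so any parallel worker can call it."""
    return len(chunk)


class ByteCounterHook:
    """Class hook: keeps state between calls, so it belongs in a serial stage."""

    def __init__(self) -> None:
        self.total_bytes = 0

    def __call__(self, chunk: bytes) -> None:
        self.total_bytes += len(chunk)
```

The stateful hook would give wrong totals if copied into several worker processes, since each copy would count only its own chunks; that is the reason parallel stages are restricted to functional hooks.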

Communication channels

Reference

Optimizing s3 performance with request parallelization - https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-parallelization

Several existing projects referenced in this thread - https://news.ycombinator.com/item?id=26764067. Good stuff and some impressive projects.
