S3Fetch
Simple & fast multi-threaded S3 download tool.
Source: https://github.com/rxvt/s3fetch
Features
- Fast.
- Simple to use.
- Multi-threaded, allowing you to download multiple objects concurrently.
- Quickly download a subset of objects under a prefix without listing all objects first.
- Object listing runs in a separate thread; downloads start as soon as the first object key is returned, while listing completes in the background.
- Filter list of objects using regular expressions.
- Uses standard Boto3 AWS SDK and standard AWS credential locations.
- List only mode if you just want to see what would be downloaded.
- Implemented as a simple API you can use in your own projects.
Why use S3Fetch?
Tools such as the AWS CLI and s4cmd are great and offer a lot of features, but S3Fetch outperforms them when downloading a subset of objects from a large S3 bucket.
S3Fetch begins downloading objects immediately while listing is still in progress, so you never wait for a full bucket listing before the first byte lands on disk. This makes a dramatic difference when your prefix matches a small subset of a bucket containing millions of objects.
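This is not S3Fetch's actual implementation, but the general listing-while-downloading pattern can be sketched as a producer-consumer pipeline: one thread feeds keys onto a queue as they are listed (e.g. from a boto3 paginator), and worker threads pull from the queue and download immediately. All names here are illustrative.

```python
import queue
import threading

SENTINEL = object()  # signals workers that listing has finished


def stream_download(list_keys, download_one, workers=4):
    """Sketch: start downloading as soon as the first key is listed,
    while the listing continues in a background thread."""
    q = queue.Queue(maxsize=1000)
    done = []

    def producer():
        for key in list_keys():  # e.g. iterate a boto3 list_objects_v2 paginator
            q.put(key)
        for _ in range(workers):
            q.put(SENTINEL)  # one stop signal per worker

    def consumer():
        while True:
            key = q.get()
            if key is SENTINEL:
                break
            download_one(key)
            done.append(key)  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because workers block on the queue rather than on a completed listing, the first download begins after the first page of results, not after the last.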
Installation
Requirements
- Python >= 3.10
- AWS credentials in one of the standard locations
S3Fetch is available on PyPI and can be installed via one of the following methods.
uv (recommended)
Ensure you have uv installed, then:
uv tool install s3fetch
pip
pip3 install s3fetch
Development Installation
For development work on S3Fetch:
1. Clone the repository:
   git clone https://github.com/rxvt/s3fetch.git
   cd s3fetch
2. Install Hatch with the pip-compile plugin:
   uv tool install hatch --with hatch-pip-compile
3. Set up the development environment using Hatch:
   hatch env create
4. Run S3Fetch from source:
   hatch run s3fetch --help
5. (Optional) Populate a test S3 bucket with data for development:
   # First create your own test bucket (use a unique name!)
   aws s3 mb s3://your-unique-s3fetch-test-bucket-name --region us-east-1
   # Then populate it with test data
   hatch run python scripts/populate_test_bucket.py --bucket your-unique-s3fetch-test-bucket-name --dry-run  # See what would be created
   hatch run python scripts/populate_test_bucket.py --bucket your-unique-s3fetch-test-bucket-name  # Actually populate
Usage
Usage: s3fetch [OPTIONS] S3_URI
Concurrently download objects from S3 buckets.
Examples:
s3fetch s3://my-bucket/
s3fetch s3://my-bucket/photos/ --regex ".*\.jpg$"
s3fetch s3://my-bucket/data/ --dry-run --threads 10
Options:
--version Show the version and exit.
--region TEXT AWS region for the S3 bucket (e.g., us-
east-1, eu-west-1). Defaults to 'us-east-1'.
-d, --debug Enable verbose debug output.
--download-dir PATH Local directory to save downloaded files.
Must already exist. Defaults to current
directory.
-r, --regex TEXT Filter objects using regular expressions
(e.g., '.*\.jpg$' for JPEG files).
-t, --threads INTEGER Number of concurrent download threads
(minimum 1, warns above 1000). Defaults to
CPU core count.
--dry-run, --list-only Show what would be downloaded without
actually downloading files.
--delimiter TEXT Object key delimiter for path structure.
Defaults to '/'.
-q, --quiet Suppress all stdout; errors still go to
stderr. Mutually exclusive with --progress.
--progress [simple|detailed|live-update|fancy]
Progress display mode. 'simple' (default)
prints each object key as it downloads.
'detailed' adds a summary at the end.
'live-update' shows a real-time status line
and summary (no per-object output).
'fancy' shows a Rich progress bar and summary
(requires: pip install s3fetch[fancy]).
--help Show this message and exit.
Examples
Full example
Download using 100 threads into ~/Downloads/tmp, only downloading objects that end in .dmg.
$ s3fetch s3://my-test-bucket --download-dir ~/Downloads/tmp/ --threads 100 --regex '\.dmg$'
test-1.dmg...done
test-2.dmg...done
test-3.dmg...done
test-4.dmg...done
test-5.dmg...done
Download all objects from a bucket
s3fetch s3://my-test-bucket/
Download objects with a specific prefix
Download all objects that start with birthday-photos/2020-01-01.
s3fetch s3://my-test-bucket/birthday-photos/2020-01-01
Download objects to a specific directory
Download objects to the ~/Downloads directory.
s3fetch s3://my-test-bucket/ --download-dir ~/Downloads
Download multiple objects concurrently
Download 100 objects concurrently.
s3fetch s3://my-test-bucket/ --threads 100
Filter objects using regular expressions
Download objects ending in .dmg.
s3fetch s3://my-test-bucket/ --regex '\.dmg$'
Library Usage
S3Fetch can be used as a library in your Python projects.
Basic Library Usage
from s3fetch import download
success_count, failures = download("s3://my-bucket/data/2023/")
print(f"Downloaded {success_count} objects successfully")
if failures:
    print(f"{len(failures)} objects failed to download")
Common Options
from s3fetch import download
success_count, failures = download(
"s3://my-bucket/data/",
download_dir="./downloads", # local destination (default: cwd)
regex=r"\.csv$", # only download .csv files
threads=20, # concurrent downloads (default: CPU count)
dry_run=False, # set True to list without downloading
)
Configuring Logging
When using S3Fetch as a library, you can configure its logging behavior:
import logging
# Option 1: Reduce S3Fetch output
logging.getLogger("s3fetch").setLevel(logging.WARNING)
# Option 2: Disable S3Fetch logging completely
logging.getLogger("s3fetch").disabled = True
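A third option (illustrative, using only the standard logging module and the "s3fetch" logger name shown above) is to attach your own handler, for example to capture S3Fetch's log output in a buffer or a file:

```python
import io
import logging

# Capture log output from the "s3fetch" logger in an in-memory buffer.
# Swap StreamHandler(buffer) for FileHandler("s3fetch.log") to log to disk.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
logger = logging.getLogger("s3fetch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("starting download")  # anything logged under "s3fetch" now lands in buffer
```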
Progress Tracking
from s3fetch import download
from s3fetch.utils import ProgressTracker
tracker = ProgressTracker()
success_count, failures = download(
"s3://my-bucket/data/",
progress_tracker=tracker,
)
stats = tracker.get_stats()
print(f"Found: {stats['objects_found']} objects")
print(f"Downloaded: {stats['objects_downloaded']} objects")
print(f"Total size: {stats['bytes_downloaded'] / (1024 * 1024):.1f} MB")
print(f"Speed: {stats['download_speed_mbps']:.2f} MB/s")
The ProgressTracker is thread-safe and can be polled from a separate thread
for real-time updates while download() is running.
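The shape of such a polling loop is sketched below. A stand-in tracker is used here because a real run needs a live bucket; with S3Fetch you would pass a real ProgressTracker to download() and start the poller thread before calling it. The poll_stats helper is an illustration, not part of the S3Fetch API.

```python
import threading
import time


def poll_stats(tracker, stop_event, interval=0.05, report=print):
    """Periodically report tracker stats until stop_event is set."""
    while not stop_event.is_set():
        stats = tracker.get_stats()
        report(f"found={stats['objects_found']} "
               f"downloaded={stats['objects_downloaded']}")
        stop_event.wait(interval)


# Stand-in for s3fetch.utils.ProgressTracker, which exposes get_stats()
class StubTracker:
    def get_stats(self):
        return {"objects_found": 10, "objects_downloaded": 4}


stop = threading.Event()
lines = []
poller = threading.Thread(target=poll_stats, args=(StubTracker(), stop),
                          kwargs={"report": lines.append})
poller.start()
time.sleep(0.12)   # with s3fetch, call download(..., progress_tracker=tracker) here
stop.set()
poller.join()
```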
Advanced Usage — Custom boto3 Client
Pass a pre-built boto3 client to use a custom session, role, or region:
import boto3
from s3fetch import download
session = boto3.Session(profile_name="production")
client = session.client("s3", region_name="us-west-2")
success_count, failures = download(
"s3://my-bucket/data/",
client=client,
)
Download Callbacks
Use the on_complete parameter to receive a callback for each successfully
downloaded object:
from s3fetch import download
def on_object_complete(key: str) -> None:
    print(f"Finished: {key}")
success_count, failures = download(
"s3://my-bucket/data/",
on_complete=on_object_complete,
)
DownloadResult fields (available when using create_completed_objects_thread
for lower-level access):
| Field | Type | Description |
|---|---|---|
| key | str | S3 object key |
| dest_filename | Path | Absolute local destination path |
| success | bool | True on success, False on failure |
| file_size | int | Bytes written (0 on failure or dry-run) |
| error | Exception \| None | Exception that caused the failure, or None |
Custom Progress Tracker
Implement ProgressProtocol to receive aggregate counts during listing and
downloading:
from s3fetch import download, ProgressProtocol
class MyTracker:
    """Minimal tracker that satisfies ProgressProtocol."""

    def increment_found(self) -> None:
        print(".", end="", flush=True)  # one dot per object listed

    def increment_downloaded(self, bytes_count: int) -> None:
        print(f" +{bytes_count // 1024}KB", end="", flush=True)
success_count, failures = download(
"s3://my-bucket/data/",
progress_tracker=MyTracker(),
)
Troubleshooting
Existing files are silently overwritten
S3Fetch does not check whether a file already exists before downloading. If you run S3Fetch twice against the same download directory, existing files will be silently overwritten with the latest version from S3.
If you want to avoid overwriting files, use a fresh download directory or move previously downloaded files before re-running.
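One way to guarantee a fresh directory per run (a sketch using only the standard library; the helper name and layout are arbitrary) is to create a timestamped subdirectory and pass it as download_dir:

```python
from datetime import datetime
from pathlib import Path


def fresh_download_dir(base="downloads"):
    """Create a new timestamped directory so repeated runs never
    overwrite each other's files."""
    path = Path(base) / datetime.now().strftime("%Y%m%d-%H%M%S")
    path.mkdir(parents=True, exist_ok=False)  # fail loudly if it somehow exists
    return path


# Usage with the library API shown above:
#   from s3fetch import download
#   download("s3://my-bucket/data/", download_dir=str(fresh_download_dir()))
```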
macOS hangs when downloading with a high number of threads
In my testing this is caused by Spotlight on macOS trying to index a large number of new files at once.
You can exclude the directory you're using to store your downloads via the Spotlight system preference control panel.