bulkboto3

A Python package, built on boto3, for fast parallel transfer of large numbers of files to and from S3.

Bulk Boto3 (bulkboto3)


Table of Contents

About bulkboto3
Getting Started
- Prerequisites
- Installation
Usage
Blog Posts
Contributing
Contributors
Contact
License

About bulkboto3

Boto3 is the official Python SDK for accessing and managing AWS resources such as Amazon Simple Storage Service (S3). It works well for transferring a small number of files. Transferring a large number of small files, however, is slow: although each file takes only a few milliseconds, moving hundreds of thousands or millions of files sequentially can take hours. Moreover, because Amazon S3 has no folders or directories, managing the hierarchy of directories and files by hand is tedious, especially when many files are spread across different folders.

The bulkboto3 package addresses both issues. It speeds up the transfer of many small files to and from Amazon S3 by running multiple upload/download operations in parallel using the Python multiprocessing module. Depending on the number of cores on your machine, Bulk Boto3 can make S3 transfers up to 100x faster than sequential transfers with plain Boto3. Bulk Boto3 can also preserve the original folder structure when transferring files and directories. Its main features are listed below.

Main Functionalities

- Multi-threaded upload/download of a directory to/from S3 object storage, preserving the directory structure
- Deleting all objects in an S3 bucket
- Checking whether an object exists in an S3 bucket
- Listing all objects in an S3 bucket
- Creating a new S3 bucket
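
The parallel-transfer idea described above can be illustrated without any S3 access. The following is a minimal sketch, not bulkboto3's actual implementation: `fake_transfer` is a hypothetical stand-in that sleeps for a few milliseconds to imitate one network round trip, and `transfer_all` (also hypothetical) fans the calls out over a thread pool from the standard multiprocessing package.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_transfer(name):
    """Hypothetical stand-in for one upload/download: waits, then echoes the name."""
    time.sleep(0.01)  # imitate the latency of a single small-file transfer
    return name


def transfer_all(names, n_threads=1):
    """Run fake_transfer over all names using n_threads worker threads."""
    with ThreadPool(n_threads) as pool:
        # map() blocks until every call has finished and preserves input order
        return pool.map(fake_transfer, names)


if __name__ == "__main__":
    files = [f"f{i}" for i in range(100)]
    for n in (1, 50):
        start = time.perf_counter()
        done = transfer_all(files, n_threads=n)
        elapsed = time.perf_counter() - start
        print(f"{n:>2} thread(s): {len(done)} transfers in {elapsed:.2f} s")
```

Because each call spends its time waiting rather than computing, the waits overlap when several threads are used. The speedup seen against real storage will depend on the network, the endpoint, and the machine.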

Getting Started

Prerequisites

- Python 3.6+
- pip
- API credentials for an S3 (or S3-compatible) storage service

Note: You can deploy a free S3 server using MinIO on your local machine by following the steps explained in: Deploy Standalone MinIO using Docker Compose on Linux.

Installation

Use the package manager pip to install bulkboto3.

pip install bulkboto3

Usage

The following snippets are also available in examples.py and in the examples.ipynb notebook.

Import and instantiate a `BulkBoto3` object with your credentials

from bulkboto3 import BulkBoto3
TARGET_BUCKET = "test-bucket"
NUM_TRANSFER_THREADS = 50
TRANSFER_VERBOSITY = True

bulkboto_agent = BulkBoto3(
    resource_type="s3",
    endpoint_url="<Your storage endpoint>",
    aws_access_key_id="<Your access key>",
    aws_secret_access_key="<Your secret key>",
    max_pool_connections=300,
    verbose=TRANSFER_VERBOSITY,
)
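
Hard-coded keys are easy to leak. As one possible alternative, the connection settings can be read from environment variables and passed to the constructor as keyword arguments. In this sketch the helper `load_s3_settings` and the variable name `S3_ENDPOINT_URL` are hypothetical; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` follow the usual AWS naming, and the returned keys mirror the constructor arguments shown above.

```python
import os


def load_s3_settings(env=None):
    """Collect connection settings from environment variables (hypothetical helper).

    Raises KeyError naming the first missing variable.
    """
    env = os.environ if env is None else env
    return {
        "endpoint_url": env["S3_ENDPOINT_URL"],  # assumed variable name
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


# Intended use, mirroring the constructor call above:
# bulkboto_agent = BulkBoto3(resource_type="s3", max_pool_connections=300,
#                            verbose=True, **load_s3_settings())
```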

Create a new bucket

bulkboto_agent.create_new_bucket(bucket_name=TARGET_BUCKET)

Upload a whole directory with its structure to an S3 bucket in multi-thread mode

Suppose that there is a directory with the following structure on your local machine:

test_dir
├── first_subdir
│   ├── f1
│   ├── f2
│   └── f3
└── second_subdir
    └── f4

To upload the directory (with its subdirectories) to the bucket under a new directory name called my_storage_dir, use the following command:

bulkboto_agent.upload_dir_to_storage(
     bucket_name=TARGET_BUCKET,
     local_dir="test_dir",
     storage_dir="my_storage_dir",
     n_threads=NUM_TRANSFER_THREADS,
)
# output:
# 2022-03-26 18:12:40 — INFO — Start uploading from local 'test_dir' to 'my_storage_dir' on the object storage with 50 threads.
# 100%|██████████| 4/4 [00:00<00:00,  4.00s/it]
# 2022-03-26 18:12:41 — INFO — Successfully uploaded 4 files to bucket 'test-bucket' in 0.07 seconds.
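
Since S3 has no real directories, the folder structure survives only as a prefix in each object key. The sketch below shows one way the mapping from local files to keys could be computed. `plan_upload_keys` is a hypothetical helper, not part of bulkboto3, and it assumes a key is formed from `storage_dir` plus the file's path relative to `local_dir`, joined with forward slashes, which is consistent with the object names listed later in this README.

```python
import os


def plan_upload_keys(local_dir, storage_dir):
    """Return (local_path, object_key) pairs for every file under local_dir, sorted by key."""
    prefix = storage_dir.strip("/")
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            # Path relative to the uploaded directory, e.g. "first_subdir/f1"
            rel = os.path.relpath(local_path, local_dir)
            # Object keys use "/" regardless of the local path separator
            key = "/".join(([prefix] if prefix else []) + rel.split(os.sep))
            pairs.append((local_path, key))
    return sorted(pairs, key=lambda pair: pair[1])
```

For the test_dir tree above and storage_dir="my_storage_dir", this yields keys such as my_storage_dir/first_subdir/f1 and my_storage_dir/second_subdir/f4.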

Download a whole directory with its structure to a local directory in multi-thread mode

bulkboto_agent.download_dir_from_storage(
    bucket_name=TARGET_BUCKET,
    storage_dir="my_storage_dir",
    local_dir="new_test_dir",
    n_threads=NUM_TRANSFER_THREADS,
)
# output: 
# 2022-03-26 18:14:08 — INFO — Start downloading from 'my_storage_dir' on storage to local 'new_test_dir' with 50 threads.
# 100%|██████████| 4/4 [00:00<00:00,  4.00it/s]
# 2022-03-26 18:14:09 — INFO — Successfully downloaded 4 files from bucket: 'test-bucket' in 0.04 seconds.

The downloaded files will have the following structure on the local machine:

new_test_dir
└── my_storage_dir
    ├── first_subdir
    │   ├── f1
    │   ├── f2
    │   └── f3
    └── second_subdir
        └── f4

Set local_dir='' (the default) to avoid creating the new_test_dir directory.
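
The layout shown above follows from joining local_dir with each object key. A minimal sketch of that rule, with `local_path_for` as a hypothetical helper (bulkboto3's internals may differ):

```python
import posixpath


def local_path_for(object_key, local_dir=""):
    """Return the local destination for an object key, using "/" separators."""
    # posixpath.join("", key) returns key unchanged, so an empty local_dir
    # adds no extra top-level directory in front of the storage directory.
    return posixpath.join(local_dir, object_key)


print(local_path_for("my_storage_dir/second_subdir/f4", "new_test_dir"))
# new_test_dir/my_storage_dir/second_subdir/f4
print(local_path_for("my_storage_dir/second_subdir/f4"))
# my_storage_dir/second_subdir/f4
```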

Upload/Download arbitrary files to/from an S3 bucket

To transfer a list of arbitrary files to or from a bucket, create a StorageTransferPath object for each file to specify its storage (S3) path and its local path, then pass the list to the .upload() or .download() method. Here is an example:

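# StorageTransferPath is assumed here to be importable from the package top level
from bulkboto3 import StorageTransferPath

# upload arbitrary files from local to an S3 bucket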
upload_paths = [
    StorageTransferPath(
        local_path="test_dir/first_subdir/f2",
        storage_path="f2",
    ),
    StorageTransferPath(
        local_path="test_dir/second_subdir/f4",
        storage_path="my_storage_dir/f4",
    ),
]
bulkboto_agent.upload(bucket_name=TARGET_BUCKET, upload_paths=upload_paths)
# output:
# 100%|██████████| 2/2 [00:00<00:00,  2.44it/s]
# 2022-04-05 13:40:10 — INFO — Successfully uploaded 2 files to bucket: 'test-bucket'.

# download arbitrary files from an S3 bucket to local
download_paths = [
    StorageTransferPath(
        storage_path="f2",
        local_path="f2",
    ),
    StorageTransferPath(
        storage_path="my_storage_dir/f4",
        local_path="f5",
    ),
]
bulkboto_agent.download(bucket_name=TARGET_BUCKET, download_paths=download_paths)
# output:
# 100%|██████████| 2/2 [00:00<00:00,  2.44it/s]
# 2022-04-05 13:34:10 — INFO — Successfully downloaded 2 files from bucket: 'test-bucket'.

Delete all objects on a bucket

bulkboto_agent.empty_bucket(TARGET_BUCKET)
# output: 
# 2022-03-26 22:23:23 — INFO — Successfully deleted objects on: 'test-bucket'.

Check if a file exists in a bucket

print(
    bulkboto_agent.check_object_exists(
        bucket_name=TARGET_BUCKET, object_path="my_storage_dir/first_subdir/test_file.txt"
    )
)
# output: False 

print(
    bulkboto_agent.check_object_exists(
        bucket_name=TARGET_BUCKET, object_path="my_storage_dir/first_subdir/f1"
    )
)
# output: True

Get the list of objects in a bucket (with prefix)

print(
    bulkboto_agent.list_objects(
        bucket_name=TARGET_BUCKET, storage_dir="my_storage_dir"
    )
)
# output: 
# ['my_storage_dir/first_subdir/f1', 'my_storage_dir/first_subdir/f2', 'my_storage_dir/first_subdir/f3', 'my_storage_dir/second_subdir/f4']

print(
    bulkboto_agent.list_objects(
        bucket_name=TARGET_BUCKET, storage_dir="my_storage_dir/second_subdir"
    )
)
# output: 
# ['my_storage_dir/second_subdir/f4']
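
Listing with a prefix is essentially a string match on object keys. The sketch below imitates that on an in-memory list. `keys_under` is a hypothetical helper, and the trailing-slash handling is an assumption made here so that my_storage_dir does not also match a sibling such as my_storage_dir_old; it is not a description of how list_objects is implemented.

```python
def keys_under(all_keys, storage_dir=""):
    """Return the keys that live under storage_dir, in sorted order."""
    prefix = storage_dir.strip("/")
    if prefix:
        prefix += "/"  # treat the prefix as a whole path segment
    return sorted(key for key in all_keys if key.startswith(prefix))


bucket = [
    "f2",
    "my_storage_dir/first_subdir/f1",
    "my_storage_dir/second_subdir/f4",
    "my_storage_dir_old/f9",
]
print(keys_under(bucket, "my_storage_dir/second_subdir"))
# ['my_storage_dir/second_subdir/f4']
```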

Benchmark

Uploading 88,800 small files (about 7 GB in total) with 100 threads took 505 seconds, about 72x faster than the non-parallel mode.
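
To put these figures in perspective, the implied throughput and the implied sequential time can be worked out directly. This is plain arithmetic on the numbers quoted above; it assumes 1 GB = 1000 MB and treats "about 7 GB" and "about 72x" as exact, so the results are rough estimates only.

```python
n_files = 88_800
total_mb = 7 * 1000        # "about 7 GB", taking 1 GB = 1000 MB
parallel_seconds = 505
speedup = 72               # "about 72x"

files_per_second = n_files / parallel_seconds
mb_per_second = total_mb / parallel_seconds
avg_file_kb = total_mb * 1000 / n_files
sequential_hours = parallel_seconds * speedup / 3600

print(f"{files_per_second:.0f} files/s, {mb_per_second:.1f} MB/s")
# 176 files/s, 13.9 MB/s
print(f"average file size: {avg_file_kb:.0f} kB")
# average file size: 79 kB
print(f"implied sequential time: {sequential_hours:.1f} hours")
# implied sequential time: 10.1 hours
```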

Blog Posts

Contributing

Any contributions you make are greatly appreciated. If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". To contribute to bulkboto3, follow these steps:

1. Fork this repository
2. Create a feature branch (git checkout -b feature/AmazingFeature)
3. Make your changes and commit them (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a pull request

Alternatively, see the GitHub documentation on creating a pull request.

Contributors

Thanks to the following people who have contributed to this project:

Amir Masoud Sefidian 📖

Contact

You can reach me at a.m.sefidian@gmail.com.

License

Distributed under the MIT License. See LICENSE for more information.

Release history

- 1.1.3 (Jun 10, 2022), this version
- 1.1.2 (Apr 8, 2022)
- 1.1.1 (Apr 7, 2022)
- 1.1.0 (Apr 7, 2022)

Download files

Source Distribution

bulkboto3-1.1.3.tar.gz (12.8 kB)

- Upload date: Jun 10, 2022
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.10.4

Hashes for bulkboto3-1.1.3.tar.gz

Algorithm	Hash digest
SHA256	`4d33da2410b898ad2e0754e9221dbc702dafa7275a3a853026776bf776aefe53`
MD5	`c68dfc0328174271153db027108a0529`
BLAKE2b-256	`b04d5d4b4568ac64662c4fc14b4598f04eb6e4fe2279e9175a715e5a7d18d7f7`
