Python S3 Concat
S3 Concat is used to concatenate many small files in an S3 bucket into fewer, larger files.
Install
pip install s3-concat
Usage
Command Line
$ s3-concat -h
Import
from s3_concat import S3Concat
bucket = "YOUR_BUCKET_NAME"
path_to_concat = "PATH_TO_FILES_TO_CONCAT"
concatenated_file = "FILE_TO_SAVE_TO.json"
# Setting this to a size will always add a part number at the end of the file name
min_file_size = "50MB" # ex: FILE_TO_SAVE_TO-1.json, FILE_TO_SAVE_TO-2.json, ...
# Setting this to None will concat all files into a single file
# min_file_size = None ex: FILE_TO_SAVE_TO.json
# Init the job
job = S3Concat(bucket, concatenated_file, min_file_size,
content_type="application/json",
# source_bucket="SOURCE_BUCKET_NAME", # For copying files from another bucket
# session=boto3.session.Session(), # For custom aws session
# s3_client_kwargs={} # Use to pass arguments allowed by the s3 client: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
# delimiter="\n", # Inserts this delimiter between each file when concatenating. Warning: adding a delimiter requires downloading every file, regardless of size
)
# Add files, can call multiple times to add files from other directories
job.add_files(path_to_concat)
# Add a single file at a time
job.add_file("some/file_key.json")
# Only use small_parts_threads if you need to. See Advanced Usage section below.
job.concat(small_parts_threads=4, main_threads=2)
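As a concrete example, here is a minimal sketch of passing a custom boto3 session and then running a default concat (the profile name "my-profile" is a placeholder for this illustration, not something the library defines):
import boto3
from s3_concat import S3Concat

# "my-profile" is a hypothetical AWS profile name; substitute your own
session = boto3.session.Session(profile_name="my-profile")

job = S3Concat("YOUR_BUCKET_NAME", "FILE_TO_SAVE_TO.json", None,
               content_type="application/json",
               session=session)
job.add_files("PATH_TO_FILES_TO_CONCAT")
job.concat()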
Advanced Usage
Depending on your use case, you may want to use more than one thread.
- main_threads is the number of threads to use when uploading files to S3. This helps when many of the files are already over the min_file_size that is set.
- small_parts_threads is only used when the files you are trying to concat are less than 5MB. These threads are spawned from inside the main_threads. Due to the limitations of the S3 multipart upload API (see Limitations below), any files less than 5MB need to be downloaded locally, concatenated together, then re-uploaded. Setting this thread count downloads those parts in parallel for faster concatenation.
The right values for these arguments depend on your use case and the system you are running this on.
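As a rough sketch (our own heuristic, not a recommendation from the library), you might scale the thread counts to the machine, reusing the job object from the Import example above:
import os

# Heuristic only: cap the upload threads at the CPU count and give the
# download-bound small-part threads a bit more headroom
cpus = os.cpu_count() or 1
job.concat(main_threads=min(4, cpus), small_parts_threads=min(8, cpus * 2))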
Limitations
This library uses the S3 multipart upload API, whose limits are documented at https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html. In particular, every part except the last must be at least 5MB, which is why smaller files must be downloaded and concatenated locally.
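To make the constraint concrete, here is a sketch of the kind of server-side upload_part_copy call the library builds on (our illustration of the underlying boto3 API, not the library's internal code):
import boto3

s3 = boto3.client("s3")

# Start a multipart upload for the concatenated object
upload = s3.create_multipart_upload(Bucket="YOUR_BUCKET_NAME",
                                    Key="FILE_TO_SAVE_TO.json")

# Copy an existing object in as part 1 without downloading it;
# S3 rejects copied parts under 5MB unless they are the final part
part = s3.upload_part_copy(
    Bucket="YOUR_BUCKET_NAME",
    Key="FILE_TO_SAVE_TO.json",
    UploadId=upload["UploadId"],
    PartNumber=1,
    CopySource={"Bucket": "YOUR_BUCKET_NAME", "Key": "some/file_key.json"},
)

# Finish the upload by listing each part's ETag
s3.complete_multipart_upload(
    Bucket="YOUR_BUCKET_NAME",
    Key="FILE_TO_SAVE_TO.json",
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": [
        {"ETag": part["CopyPartResult"]["ETag"], "PartNumber": 1},
    ]},
)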