Concat files in S3

Project description

Python S3 Concat

S3 Concat is used to concatenate many small files in an S3 bucket into fewer larger files.

Install

pip install s3-concat

Usage

Command Line

$ s3-concat -h

Import

from s3_concat import S3Concat

bucket = "YOUR_BUCKET_NAME"
path_to_concat = "PATH_TO_FILES_TO_CONCAT"
concatenated_file = "FILE_TO_SAVE_TO.json"
# Setting this to a size will always add a part number at the end of the file name
min_file_size = "50MB"  # ex: FILE_TO_SAVE_TO-1.json, FILE_TO_SAVE_TO-2.json, ...
# Setting this to None will concat all files into a single file
# min_file_size = None  # ex: FILE_TO_SAVE_TO.json

# Init the job
job = S3Concat(
    bucket, concatenated_file, min_file_size,
    content_type="application/json",
    # source_bucket="SOURCE_BUCKET_NAME",  # For copying files from another bucket
    # session=boto3.session.Session(),  # For a custom AWS session (requires `import boto3`)
    # s3_client_kwargs={},  # Pass arguments allowed by the S3 client:
    #     https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
    # delimiter="\n",  # Inserted between each file when concatenating.
    #     Warning: this requires downloading every file, regardless of size,
    #     in order to add the delimiter.
)
# Add files; can be called multiple times to add files from other directories
job.add_files(path_to_concat)
# Add a single file at a time
job.add_file("some/file_key.json")
# Only use small_parts_threads if you need to. See the Advanced Usage section below.
job.concat(small_parts_threads=4, main_threads=2)
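
As the comments above note, passing None as min_file_size concats everything into a single file with no part number appended. A minimal sketch of that variant, reusing the placeholders from the example above:

# min_file_size=None -> one output file, named exactly FILE_TO_SAVE_TO.json
single_job = S3Concat(bucket, concatenated_file, None,
                      content_type="application/json")
single_job.add_files(path_to_concat)
single_job.concat()  # thread counts default to 1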

Advanced Usage

Depending on your use case, you may want to use more than one thread.

  • main_threads is the number of threads used to upload files to S3. This helps when many of the files are already over the configured min_file_size.

  • small_parts_threads is only used when the files being concatenated are smaller than 5MB; these threads are spawned from within the main_threads. Due to the limits of the S3 multipart upload API (see Limitations below), any file smaller than 5MB must be downloaded locally, concatenated with others, and re-uploaded. Raising this thread count downloads those parts in parallel, speeding up the concatenation process.

The right values for these arguments depend on your use case and the system you are running on, as shown in the sketch below.
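
A minimal sketch (bucket name, key names, and paths here are placeholders) of tuning these thread counts for two common workloads:

from s3_concat import S3Concat

# Workload A: most files are already over min_file_size, so uploads
# dominate; raise main_threads and leave small_parts_threads at its default.
big_job = S3Concat("YOUR_BUCKET_NAME", "big-merged.json", "50MB",
                   content_type="application/json")
big_job.add_files("logs/large/")
big_job.concat(main_threads=4)

# Workload B: thousands of files under 5MB. Each must be downloaded and
# concatenated locally (see Limitations below), so raise small_parts_threads
# to parallelize those downloads.
small_job = S3Concat("YOUR_BUCKET_NAME", "small-merged.json", "50MB",
                     content_type="application/json")
small_job.add_files("logs/small/")
small_job.concat(main_threads=2, small_parts_threads=8)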

Limitations

This uses S3's multipart upload, so its limits apply: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html. The most relevant one here is the 5MB minimum part size (for every part except the last), which is why files under 5MB must be downloaded and combined locally.
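
To see why that limit forces the local-download path for small files, here is a minimal boto3 sketch of the mechanism (placeholder bucket and keys; an illustration of the technique, not this library's actual implementation):

import boto3

MIN_PART = 5 * 1024 * 1024  # S3 minimum size for every part except the last

s3 = boto3.client("s3")
bucket = "YOUR_BUCKET_NAME"            # placeholder
dest_key = "merged.json"               # placeholder
source_keys = ["a.json", "b.json"]     # hypothetical inputs, in order

mpu = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
upload_id = mpu["UploadId"]
parts, part_no, buffer = [], 1, b""

for key in source_keys:
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    if size >= MIN_PART and not buffer:
        # Large enough to be its own part: server-side copy, nothing downloaded
        resp = s3.upload_part_copy(
            Bucket=bucket, Key=dest_key, PartNumber=part_no,
            UploadId=upload_id, CopySource={"Bucket": bucket, "Key": key})
        parts.append({"PartNumber": part_no,
                      "ETag": resp["CopyPartResult"]["ETag"]})
        part_no += 1
    else:
        # Under 5MB (or queued behind buffered data): download and buffer
        # locally until the buffer itself is big enough to be a part
        buffer += s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if len(buffer) >= MIN_PART:
            resp = s3.upload_part(Bucket=bucket, Key=dest_key,
                                  PartNumber=part_no, UploadId=upload_id,
                                  Body=buffer)
            parts.append({"PartNumber": part_no, "ETag": resp["ETag"]})
            part_no, buffer = part_no + 1, b""

if buffer:  # only the final part may be under 5MB
    resp = s3.upload_part(Bucket=bucket, Key=dest_key,
                          PartNumber=part_no, UploadId=upload_id, Body=buffer)
    parts.append({"PartNumber": part_no, "ETag": resp["ETag"]})

s3.complete_multipart_upload(Bucket=bucket, Key=dest_key, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})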

Download files

Download the file for your platform.

Source Distribution

s3_concat-0.3.0.tar.gz (10.6 kB)

Built Distribution

s3_concat-0.3.0-py3-none-any.whl (9.7 kB)

File details

Details for the file s3_concat-0.3.0.tar.gz.

File metadata

  • Download URL: s3_concat-0.3.0.tar.gz
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.3

File hashes

Hashes for s3_concat-0.3.0.tar.gz:

  • SHA256: 287af84d4020d8ac5241abfa6a20ae3e7e94c3721bb7659e3f6ea45562117ac1
  • MD5: dae6c31d0062a17d434b9b7720fd43b1
  • BLAKE2b-256: d6d27e361e7046a16cb9f6f16bcc311107d3fcb73220637b4d78ba8fb30c756e

File details

Details for the file s3_concat-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: s3_concat-0.3.0-py3-none-any.whl
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.3

File hashes

Hashes for s3_concat-0.3.0-py3-none-any.whl:

  • SHA256: 544f7bc7c1016a3aa98179ab39068640d4c003eca11060fc2629fdaf8c2ed054
  • MD5: b55c89f448e7a7d5dc895bca41a30b92
  • BLAKE2b-256: 63ccc2fabe743b13a70e8a75d3088564899b4eb807577bf9148989cc00a51ee0
