Concat files in S3

Project description

Python S3 Concat

S3 Concat is used to concatenate many small files in an S3 bucket into fewer larger files.

Install

pip install s3-concat

Usage

Command Line

$ s3-concat -h

Import

from s3_concat import S3Concat

bucket = 'YOUR_BUCKET_NAME'
path_to_concat = 'PATH_TO_FILES_TO_CONCAT'
concatenated_file = 'FILE_TO_SAVE_TO.json'
# Setting this to a size will always add a part number at the end of the file name
min_file_size = '50MB'  # ex: FILE_TO_SAVE_TO-1.json, FILE_TO_SAVE_TO-2.json, ...
# Setting this to None will concat all files into a single file
# min_file_size = None  ex: FILE_TO_SAVE_TO.json

# Init the job
job = S3Concat(bucket, concatenated_file, min_file_size,
               content_type='application/json',
               # session=boto3.session.Session(),  # For a custom AWS session
               # s3_client_kwargs={}  # Arguments allowed by the s3 client: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
               )
# Add files; can be called multiple times to add files from other directories
job.add_files(path_to_concat)
# Add a single file at a time
job.add_file('some/file_key.json')
# Only pass the thread arguments if you need to. See the Advanced Usage section below.
job.concat(small_parts_threads=4, main_threads=2)
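
For example, a minimal job that merges every file under a prefix into a single output object might look like this (a sketch; the bucket, prefix, and key names are placeholders):

from s3_concat import S3Concat

# min_file_size=None merges everything into a single output file
job = S3Concat('my-bucket', 'merged/all_logs.json', None,
               content_type='application/json')
job.add_files('logs/2023/')
job.concat()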

Advanced Usage

Depending on your use case, you may want to use more than one thread.

  • main_threads is the number of threads used when uploading files to S3. This helps when many of the files are already over the min_file_size that is set.

  • small_parts_threads is only used when the files you are trying to concat are less than 5MB. These threads are spawned from inside the main_threads. Due to a limitation of the S3 multipart upload API (see Limitations below), any files less than 5MB must be downloaded locally, concatenated together, then re-uploaded. Setting this thread count downloads those parts in parallel, which speeds up the concatenation.

The right values for these arguments depend on your use case and the system you are running this on, as sketched below.
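
As a rough guide, one might first check how many of the source objects fall under the 5MB multipart minimum and tune the thread counts accordingly. The helper below is not part of s3-concat; it is a sketch using plain boto3, and it reuses the bucket, path_to_concat, and job variables from the Usage example above. The thread values are illustrative, not recommendations.

import boto3

def count_small_objects(bucket, prefix, threshold=5 * 1024 * 1024):
    # Count objects under the prefix that are below the 5MB multipart minimum
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    small = total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            total += 1
            if obj['Size'] < threshold:
                small += 1
    return small, total

small, total = count_small_objects(bucket, path_to_concat)
if small > total / 2:
    # Mostly small files: parallelize the local download/merge step
    job.concat(small_parts_threads=8, main_threads=2)
else:
    # Mostly files already over min_file_size: parallelize the uploads
    job.concat(small_parts_threads=2, main_threads=4)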

Limitations

This library uses the S3 multipart upload API, whose limits are documented at https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
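
At the time of writing, the most relevant of those limits are the 5MB minimum part size (the reason files under 5MB must be downloaded and merged locally, as described above), the 10,000-part maximum per upload, and the 5TB maximum object size.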

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

s3-concat-0.2.4.tar.gz (7.4 kB)

Built Distribution

s3_concat-0.2.4-py3-none-any.whl (8.6 kB)

File details

Details for the file s3-concat-0.2.4.tar.gz.

File metadata

  • Download URL: s3-concat-0.2.4.tar.gz
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for s3-concat-0.2.4.tar.gz
Algorithm Hash digest
SHA256 597d3d679e8c29532a907b698de2fb918f13dfbc12f65b7687b6350f01665e0b
MD5 cdd0b6721aba3327be7edb91af9d6e66
BLAKE2b-256 eef6857d1ca1405bf9a78f658791d9b065ee83e2f778d99e409be7747579ea65

File details

Details for the file s3_concat-0.2.4-py3-none-any.whl.

File metadata

  • Download URL: s3_concat-0.2.4-py3-none-any.whl
  • Size: 8.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for s3_concat-0.2.4-py3-none-any.whl
Algorithm Hash digest
SHA256 179762a98ebd1a085901bb90c41cf030759e3aecc45e0dc56a4084ce11554ffa
MD5 a590a7ea3387ce6f8d9ade075b7dd5d4
BLAKE2b-256 f12770fa1dcbd92a293b3d3f73c33fd29128a5dea82c12d833fd02377c916947
