Welcome to s3splitmerge Documentation
Features
Split:
split a big data file (>= 500MB) in a common data format (CSV, TSV, JSON) into smaller files
Install
pip install awswrangler==2.10.0 --no-deps
s3splitmerge is released on PyPI, so all you need is:
$ pip install s3splitmerge
To upgrade to the latest version:
$ pip install --upgrade s3splitmerge
Merge Multiple AWS S3 JSON Files into One Big File
1. Input Data
Files:
s3://my-bucket/input/date=2000-01-01/a.json # 6MB
s3://my-bucket/input/date=2000-01-01/b.json # 600MB
s3://my-bucket/input/date=2000-01-01/c.json # 120MB
s3://my-bucket/input/date=2000-01-02/...
...
Content:
{"id": 1, "value": "a", ...}
{"id": 2, "value": "b", ...}
{"id": 3, "value": "c", ...}
Normalize file sizes to approximately 6MB. Files that are already 6MB or smaller are kept as they are:
s3://my-bucket/input-normalized/date=2000-01-01/a-1.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/b-1.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/b-2.json # 6MB
...
s3://my-bucket/input-normalized/date=2000-01-01/b-100.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/c-1.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/c-2.json # 6MB
...
s3://my-bucket/input-normalized/date=2000-01-01/c-20.json # 6MB
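The normalization step can be sketched in plain Python. This is a minimal illustration, not the s3splitmerge API: the function names `split_lines_by_size` and `part_key` are hypothetical, and the sketch works on an in-memory sequence of JSON Lines records rather than on S3 objects.

```python
from typing import Iterable, Iterator, List


def split_lines_by_size(lines: Iterable[str], target_bytes: int) -> Iterator[List[str]]:
    """Group JSON Lines records into chunks of roughly target_bytes each."""
    chunk: List[str] = []
    chunk_size = 0
    for line in lines:
        # +1 accounts for the newline that separates records in the file.
        line_size = len(line.encode("utf-8")) + 1
        # Close the current chunk before it would exceed the target size.
        if chunk and chunk_size + line_size > target_bytes:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(line)
        chunk_size += line_size
    if chunk:
        yield chunk


def part_key(key: str, part: int) -> str:
    """Hypothetical naming scheme: 'dir/b.json' -> 'dir/b-1.json'."""
    stem, dot, ext = key.rpartition(".")
    if not dot:
        # No extension: just append the part number.
        return f"{key}-{part}"
    return f"{stem}-{part}{dot}{ext}"
```

Records are never cut in the middle, so each chunk stays a valid JSON Lines file. A record larger than the target ends up alone in its own chunk.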
Perform per-file ETL using AWS Lambda:
s3://my-bucket/output-normalized/date=2000-01-01/a-1.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/b-1.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/b-2.parquet # 6MB
...
s3://my-bucket/output-normalized/date=2000-01-01/b-100.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/c-1.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/c-2.parquet # 6MB
...
s3://my-bucket/output-normalized/date=2000-01-01/c-20.parquet # 6MB
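Each Lambda invocation needs to derive its output location from the input object it was given. The sketch below shows one way to do that mapping for the layout above. The function name `output_uri` is hypothetical, and the JSON to Parquet conversion itself is left out.

```python
def output_uri(input_uri: str) -> str:
    """Map a normalized input URI to its Parquet output URI.

    Assumes the prefix and extension conventions shown in the listing above.
    """
    if "/input-normalized/" not in input_uri or not input_uri.endswith(".json"):
        raise ValueError(f"unexpected input URI: {input_uri}")
    # Swap only the first occurrence of the prefix, then the file extension.
    swapped = input_uri.replace("/input-normalized/", "/output-normalized/", 1)
    return swapped[: -len(".json")] + ".parquet"
```

Because every output name is a pure function of the input name, re-running a failed invocation overwrites the same object instead of creating a duplicate.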
Merge files into bigger ones for better Athena query performance:
s3://my-bucket/output-normalized/date=2000-01-01/part-1.parquet # 500MB
s3://my-bucket/output-normalized/date=2000-01-01/part-2.parquet # 500MB
...
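Deciding which small files go into which merged part can be planned before any data is read. The following is a minimal sketch of such a plan, assuming object sizes are already known (for example from an S3 listing). The function name `plan_merge` is hypothetical and not part of s3splitmerge.

```python
from typing import List, Tuple


def plan_merge(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Group (key, size) pairs, in order, into batches of at most target_bytes.

    A single file larger than target_bytes gets a batch of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in files:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

With one hundred 6MB files and a 500MB target, this yields one batch of 83 files (498MB) and one batch of the remaining 17. Each batch would then be written out as one `part-N.parquet` object.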