Package short description.
Welcome to s3splitmerge Documentation
Features
Split:
Split big data files on S3 (>= 500MB) in common data formats (CSV, TSV, JSON) into
Install
pip install awswrangler==2.10.0 --no-deps
s3splitmerge is released on PyPI, so all you need is:
$ pip install s3splitmerge
To upgrade to the latest version:
$ pip install --upgrade s3splitmerge
Merge Multiple AWS S3 JSON Files into One Big File
1. Input Data
Files:
s3://my-bucket/input/date=2000-01-01/a.json # 6MB
s3://my-bucket/input/date=2000-01-01/b.json # 600MB
s3://my-bucket/input/date=2000-01-01/c.json # 120MB
s3://my-bucket/input/date=2000-01-02/...
...
Content:
{"id": 1, "value": "a", ...}
{"id": 2, "value": "b", ...}
{"id": 3, "value": "c", ...}
Normalize file size to approximately 6MB. If a file is smaller than 6MB, keep it as it is:
s3://my-bucket/input-normalized/date=2000-01-01/a-1.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/b-1.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/b-2.json # 6MB
...
s3://my-bucket/input-normalized/date=2000-01-01/b-100.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/c-1.json # 6MB
s3://my-bucket/input-normalized/date=2000-01-01/c-2.json # 6MB
...
s3://my-bucket/input-normalized/date=2000-01-01/c-20.json # 6MB
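The normalization step can be illustrated with a minimal sketch in plain Python. It is not the s3splitmerge API: `split_lines_by_size` and `chunk_key` are hypothetical helpers, and the sketch assumes the input is line-delimited JSON that has already been read as an iterable of byte lines.

```python
from typing import Iterable, Iterator, List


def split_lines_by_size(lines: Iterable[bytes], target_size: int) -> Iterator[List[bytes]]:
    """Group whole lines into chunks of roughly ``target_size`` bytes.

    A line is never cut in half, so every chunk is still valid
    line-delimited JSON. A single oversized line becomes its own chunk.
    """
    chunk: List[bytes] = []
    chunk_size = 0
    for line in lines:
        # Close the current chunk before it would exceed the target.
        if chunk and chunk_size + len(line) > target_size:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(line)
        chunk_size += len(line)
    if chunk:
        yield chunk


def chunk_key(key: str, index: int) -> str:
    """Turn 'b.json' into 'b-1.json', 'b-2.json', ... (hypothetical naming)."""
    stem, dot, ext = key.rpartition(".")
    if not dot:  # no extension: just append the index
        return f"{key}-{index}"
    return f"{stem}-{index}{dot}{ext}"


if __name__ == "__main__":
    lines = [b'{"id": %d}\n' % i for i in range(1, 6)]
    for i, chunk in enumerate(split_lines_by_size(lines, target_size=20), start=1):
        print(chunk_key("input-normalized/date=2000-01-01/b.json", i), len(chunk))
```

A file smaller than the target produces a single chunk, which matches the "keep it as it is" rule above.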
Perform per-file ETL using AWS Lambda:
s3://my-bucket/output-normalized/date=2000-01-01/a-1.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/b-1.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/b-2.parquet # 6MB
...
s3://my-bucket/output-normalized/date=2000-01-01/b-100.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/c-1.parquet # 6MB
s3://my-bucket/output-normalized/date=2000-01-01/c-2.parquet # 6MB
...
s3://my-bucket/output-normalized/date=2000-01-01/c-20.parquet # 6MB
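As a rough sketch of how a Lambda function might decide where each converted file goes, the snippet below maps a normalized JSON key to its Parquet output key. It assumes the function is triggered by S3 event notifications, which the text above does not state. `output_key_for` and `targets_from_event` are hypothetical names, and the actual JSON-to-Parquet conversion is left out.

```python
from typing import Iterator, Tuple
from urllib.parse import unquote_plus


def output_key_for(
    input_key: str,
    src_prefix: str = "input-normalized/",
    dst_prefix: str = "output-normalized/",
) -> str:
    """Map '<src_prefix>.../b-2.json' to '<dst_prefix>.../b-2.parquet'."""
    if not (input_key.startswith(src_prefix) and input_key.endswith(".json")):
        raise ValueError(f"unexpected input key: {input_key!r}")
    relative = input_key[len(src_prefix):-len(".json")]
    return f"{dst_prefix}{relative}.parquet"


def targets_from_event(event: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (bucket, input_key, output_key) for each record of an S3 event."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications,
        # so 'date=2000-01-01' shows up as 'date%3D2000-01-01'.
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key, output_key_for(key)


def lambda_handler(event, context):
    # Assumption: the real handler would read each input object,
    # convert it to Parquet, and write it to the output key.
    return [
        {"bucket": bucket, "input": src, "output": dst}
        for bucket, src, dst in targets_from_event(event)
    ]
```

Keeping the key mapping in a pure function makes it easy to test without touching S3.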
Merge files into bigger ones for better Athena query performance:
s3://my-bucket/output-normalized/date=2000-01-01/part-1.parquet # 500MB
s3://my-bucket/output-normalized/date=2000-01-01/part-2.parquet # 500MB
...
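One way to plan such a merge is to group the small files greedily until each group reaches the target size. The sketch below only computes the grouping from `(key, size)` pairs. It is not the s3splitmerge API: `group_by_target_size` is a hypothetical helper, and listing objects and writing the merged files are assumed to happen elsewhere.

```python
from typing import Iterable, Iterator, List, Tuple

MB = 1024 * 1024


def group_by_target_size(
    objects: Iterable[Tuple[str, int]], target_size: int
) -> Iterator[List[str]]:
    """Greedily pack (key, size_in_bytes) pairs into groups of at most
    ``target_size`` bytes, preserving input order.

    An object larger than the target ends up in a group of its own.
    """
    group: List[str] = []
    group_size = 0
    for key, size in objects:
        # Start a new group when adding this object would overshoot.
        if group and group_size + size > target_size:
            yield group
            group, group_size = [], 0
        group.append(key)
        group_size += size
    if group:
        yield group


if __name__ == "__main__":
    prefix = "output-normalized/date=2000-01-01/"
    small_files = [(f"{prefix}b-{i}.parquet", 6 * MB) for i in range(1, 101)]
    for n, keys in enumerate(group_by_target_size(small_files, 500 * MB), start=1):
        print(f"{prefix}part-{n}.parquet <- {len(keys)} files")
```

Each resulting group corresponds to one merged `part-N.parquet` file.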