Welcome to s3splitmerge Documentation

Features

Split:

  • split big data files (>=500MB) in common data formats (CSV, TSV, JSON) into ...

Install

$ pip install awswrangler==2.10.0 --no-deps

s3splitmerge is released on PyPI, so all you need is:

$ pip install s3splitmerge

To upgrade to the latest version:

$ pip install --upgrade s3splitmerge

Merge Multiple AWS S3 JSON Files into One Big File

1. Input Data

Files:

s3://my-bucket/input/date=2000-01-01/a.json # 6MB
s3://my-bucket/input/date=2000-01-01/b.json # 600MB
s3://my-bucket/input/date=2000-01-01/c.json # 120MB
s3://my-bucket/input/date=2000-01-02/...
...

Content:

{"id": 1, "value": "a", ...}
{"id": 2, "value": "b", ...}
{"id": 3, "value": "c", ...}
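Because every line in this layout is a self-contained JSON record, a file can be cut on line boundaries without breaking any record. The following is a minimal sketch of that idea using only the standard library. The function name split_lines_by_size and the sample records are hypothetical and are not part of the s3splitmerge API; reading from and writing to S3 is left out on purpose.

```python
from typing import Iterable, Iterator, List


def split_lines_by_size(lines: Iterable[bytes], target_size: int) -> Iterator[List[bytes]]:
    """Group newline-terminated records into chunks of roughly target_size bytes.

    A record is never cut in half. A chunk is closed as soon as adding the
    next record would push it past target_size, so a single oversized record
    still ends up in a chunk of its own.
    """
    chunk: List[bytes] = []
    chunk_size = 0
    for line in lines:
        if chunk and chunk_size + len(line) > target_size:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(line)
        chunk_size += len(line)
    if chunk:
        yield chunk


# Tiny in-memory stand-in for one JSON-lines file (hypothetical data).
# Each record is 24 bytes long, including the trailing newline.
records = [b'{"id": %d, "value": "x"}\n' % i for i in range(1, 8)]

# With a 60-byte target, each chunk holds at most two records.
chunks = list(split_lines_by_size(records, target_size=60))
for number, chunk in enumerate(chunks, start=1):
    print(number, len(chunk), sum(len(line) for line in chunk))
```

In a real pipeline the target would be on the order of 6MB rather than 60 bytes, and each chunk would be written to its own object such as b-1.json, b-2.json, and so on.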
  1. Normalize file size to approximately 6MB. If a file is smaller than 6MB, keep it as is:

    s3://my-bucket/input-normalized/date=2000-01-01/a-1.json # 6MB
    
    s3://my-bucket/input-normalized/date=2000-01-01/b-1.json # 6MB
    s3://my-bucket/input-normalized/date=2000-01-01/b-2.json # 6MB
    ...
    s3://my-bucket/input-normalized/date=2000-01-01/b-100.json # 6MB
    
    s3://my-bucket/input-normalized/date=2000-01-01/c-1.json # 6MB
    s3://my-bucket/input-normalized/date=2000-01-01/c-2.json # 6MB
    ...
    s3://my-bucket/input-normalized/date=2000-01-01/c-20.json # 6MB
  2. Perform per-file ETL using AWS Lambda:

    s3://my-bucket/output-normalized/date=2000-01-01/a-1.parquet # 6MB
    
    s3://my-bucket/output-normalized/date=2000-01-01/b-1.parquet # 6MB
    s3://my-bucket/output-normalized/date=2000-01-01/b-2.parquet # 6MB
    ...
    s3://my-bucket/output-normalized/date=2000-01-01/b-100.parquet # 6MB
    
    s3://my-bucket/output-normalized/date=2000-01-01/c-1.parquet # 6MB
    s3://my-bucket/output-normalized/date=2000-01-01/c-2.parquet # 6MB
    ...
    s3://my-bucket/output-normalized/date=2000-01-01/c-20.parquet # 6MB
  3. Merge files into bigger ones for better Athena query performance:

    s3://my-bucket/output-normalized/date=2000-01-01/part-1.parquet # 500MB
    s3://my-bucket/output-normalized/date=2000-01-01/part-2.parquet # 500MB
    ...
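
The merge step can be thought of as a grouping problem: walk the small files in order and start a new output part whenever the next file would push the current part past the target size. The sketch below only plans the groups from (key, size) pairs. The name plan_merge_groups is hypothetical and is not the s3splitmerge API; the file sizes are taken from the listing above, and the actual Parquet concatenation and S3 I/O are not shown.

```python
from typing import Iterable, Iterator, List, Tuple

MB = 1000 * 1000


def plan_merge_groups(
    files: Iterable[Tuple[str, int]], target_size: int
) -> Iterator[List[str]]:
    """Yield lists of keys whose combined size stays within target_size.

    Files are taken in the order given. A file larger than target_size
    is placed in a group of its own.
    """
    group: List[str] = []
    group_size = 0
    for key, size in files:
        if group and group_size + size > target_size:
            yield group
            group, group_size = [], 0
        group.append(key)
        group_size += size
    if group:
        yield group


# Hypothetical listing that mirrors the example: 1 + 100 + 20 files of 6MB each.
prefix = "output-normalized/date=2000-01-01/"
small_files = (
    [(prefix + "a-1.parquet", 6 * MB)]
    + [(prefix + f"b-{i}.parquet", 6 * MB) for i in range(1, 101)]
    + [(prefix + f"c-{i}.parquet", 6 * MB) for i in range(1, 21)]
)

groups = list(plan_merge_groups(small_files, target_size=500 * MB))
for number, group in enumerate(groups, start=1):
    print(f"part-{number}.parquet <- {len(group)} files")
```

With 6MB inputs and a 500MB target, the first part takes 83 files (498MB) and the remaining 38 files go into the second part. A greedy, order-preserving grouping like this is simple and predictable; it does not try to pack parts as tightly as possible.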

Download files

Source Distribution

s3splitmerge-0.0.1.tar.gz (24.2 kB)

Uploaded Source

Built Distribution

s3splitmerge-0.0.1-py2.py3-none-any.whl (27.7 kB)

Uploaded Python 2 Python 3

File details

Details for the file s3splitmerge-0.0.1.tar.gz.

File metadata

  • Download URL: s3splitmerge-0.0.1.tar.gz
  • Size: 24.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.8.6

File hashes

Hashes for s3splitmerge-0.0.1.tar.gz:

Algorithm    Hash digest
SHA256       fd19359ff8bd31ad64b626b131b71c9e8e217c2412efa477f5019e68ff9b6ab6
MD5          db1ced6ec5e8c5201d6f4e161b7874d0
BLAKE2b-256  c5618cd4146d802824eec017f79cb6965ed5ad68c2473d0a3f413a4466d81aed

File details

Details for the file s3splitmerge-0.0.1-py2.py3-none-any.whl.

File metadata

  • Download URL: s3splitmerge-0.0.1-py2.py3-none-any.whl
  • Size: 27.7 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.8.6

File hashes

Hashes for s3splitmerge-0.0.1-py2.py3-none-any.whl:

Algorithm    Hash digest
SHA256       6a40ab311e40af111412ec27a2abdce1a358e174ca1699fbd14a5400a5fe9a61
MD5          d50d007c042cdad7021f7fa38aff5803
BLAKE2b-256  f93d812408eabd26c1cf73317731660fa0a557309952841428a466c35598848b
