



Welcome to s3splitmerge Documentation

Features

Split:

  • split a big data file (>= 500 MB) in a common data format (CSV, TSV, JSON) into many smaller chunks of roughly equal size (a minimal sketch of the idea follows)

Merge:

  • merge many small files back into bigger ones for better query performance (see the workflow below)
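A minimal sketch of the split idea in plain Python, assuming a JSON-lines input (this is not s3splitmerge's actual API; the file names and the 6 MB target are illustrative): chunks are cut on line boundaries, so each piece is itself a valid JSON-lines file.

    TARGET_SIZE = 6 * 1024 * 1024  # ~6 MB per chunk, matching the workflow below

    def split_json_lines(stream, target_size=TARGET_SIZE):
        """Yield byte chunks of roughly target_size, each ending on a line boundary."""
        buf = bytearray()
        for line in stream:  # iterating a binary file yields complete lines
            buf.extend(line)
            if len(buf) >= target_size:
                yield bytes(buf)
                buf.clear()
        if buf:  # emit the trailing partial chunk
            yield bytes(buf)

    # Usage: split b.json into b-1.json, b-2.json, ...
    with open("b.json", "rb") as f:
        for i, chunk in enumerate(split_json_lines(f), start=1):
            with open(f"b-{i}.json", "wb") as out:
                out.write(chunk)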

Install

s3splitmerge depends on a pinned build of awswrangler, installed without its transitive dependencies:

$ pip install awswrangler==2.10.0 --no-deps

s3splitmerge is released on PyPI, so all you need is:

$ pip install s3splitmerge

To upgrade to the latest version:

$ pip install --upgrade s3splitmerge
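
A quick, generic sanity check that the install worked:

$ python -c "import s3splitmerge"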

Merge Multiple AWS S3 JSON Files into One Big File

1. Input Data

Files:

s3://my-bucket/input/date=2000-01-01/a.json # 6MB
s3://my-bucket/input/date=2000-01-01/b.json # 600MB
s3://my-bucket/input/date=2000-01-01/c.json # 120MB
s3://my-bucket/input/date=2000-01-02/...
...

Content:

{"id": 1, "value": "a", ...}
{"id": 2, "value": "b", ...}
{"id": 3, "value": "c", ...}
  2. Normalize file sizes to approximately 6MB; files already smaller than 6MB are kept as they are (this is the split operation sketched under Features):

    s3://my-bucket/input-normalized/date=2000-01-01/a-1.json # 6MB
    
    s3://my-bucket/input-normalized/date=2000-01-01/b-1.json # 6MB
    s3://my-bucket/input-normalized/date=2000-01-01/b-2.json # 6MB
    ...
    s3://my-bucket/input-normalized/date=2000-01-01/b-100.json # 6MB
    
    s3://my-bucket/input-normalized/date=2000-01-01/c-1.json # 6MB
    s3://my-bucket/input-normalized/date=2000-01-01/c-2.json # 6MB
    ...
    s3://my-bucket/input-normalized/date=2000-01-01/c-20.json # 6MB
  3. Perform per-file ETL using AWS Lambda, converting each JSON chunk to parquet (see the first sketch after this list):

    s3://my-bucket/output-normalized/date=2000-01-01/a-1.parquet # 6MB
    
    s3://my-bucket/output-normalized/date=2000-01-01/b-1.parquet # 6MB
    s3://my-bucket/output-normalized/date=2000-01-01/b-2.parquet # 6MB
    ...
    s3://my-bucket/output-normalized/date=2000-01-01/b-100.parquet # 6MB
    
    s3://my-bucket/output-normalized/date=2000-01-01/c-1.parquet # 6MB
    s3://my-bucket/output-normalized/date=2000-01-01/c-2.parquet # 6MB
    ...
    s3://my-bucket/output-normalized/date=2000-01-01/c-20.parquet # 6MB
  4. Merge the small files into bigger ones for better Athena query performance (see the second sketch after this list):

    s3://my-bucket/output-normalized/date=2000-01-01/part-1.parquet # 500MB
    s3://my-bucket/output-normalized/date=2000-01-01/part-2.parquet # 500MB
    ...
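
A hedged sketch of step 3, the per-file ETL, using awswrangler (the pinned dependency from the Install section). This is an illustration rather than s3splitmerge's own API; the bucket and key names are the example values above, and lines=True is simply passed through to pandas.read_json:

    import awswrangler as wr

    date = "2000-01-01"

    # Read one ~6MB normalized JSON-lines chunk into a DataFrame ...
    df = wr.s3.read_json(
        f"s3://my-bucket/input-normalized/date={date}/b-1.json",
        lines=True,
    )

    # ... and rewrite it as a single parquet file of about the same size.
    wr.s3.to_parquet(
        df=df,
        path=f"s3://my-bucket/output-normalized/date={date}/b-1.parquet",
    )

Because each chunk is small, one invocation easily fits in an AWS Lambda function's memory and time limits.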
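
Step 4 can be sketched the same way: read every small parquet file under the per-date prefix and rewrite the rows as a few large files. The output-merged prefix and the max_rows_by_file value are assumptions; tune the row cap so each output file lands near the 500MB target. Note that this sketch loads a whole day into memory, which is fine as an illustration but worth batching for very large days:

    import awswrangler as wr

    date = "2000-01-01"

    # Read all small parquet files under the per-date prefix into one DataFrame.
    df = wr.s3.read_parquet(f"s3://my-bucket/output-normalized/date={date}/")

    # Rewrite as a dataset of a few large files; max_rows_by_file caps the
    # rows per output file, which indirectly controls the file size.
    wr.s3.to_parquet(
        df=df,
        path=f"s3://my-bucket/output-merged/date={date}/",  # assumed destination prefix
        dataset=True,
        max_rows_by_file=5_000_000,  # illustrative; depends on row width
    )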


