
avroconvert


Utility to convert avro files to csv, json and parquet formats

  • Installation

Using pip:

pip install avroconvert

Using git:

git clone https://github.com/shrinivdeshmukh/avroconvert
cd avroconvert
make install
  • Usage

Using CLI

The CLI can be used to interact with the tool. The first argument is the source, which can be gs (Google Cloud Storage bucket), s3 (Amazon S3 bucket), or fs (local filesystem).

To read from cloud bucket (google cloud or amazon s3):

Google Cloud Storage example:

avroconvert gs -b <BUCKET_NAME> -f <FORMAT> -o <OUTPUT_FOLDER>

Amazon S3 example:

avroconvert s3 -b <BUCKET_NAME> -f <FORMAT> -o <OUTPUT_FOLDER>

With the above command, the tool reads all avro files from the bucket specified by the -b parameter, converts them to the format specified by the -f parameter, and writes the converted files to the output folder specified by the -o parameter.
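For example, the following hypothetical invocation (the bucket name and output path are placeholders of our choosing) converts every avro file in my-avro-bucket to parquet and writes the results under ./converted:

avroconvert gs -b my-avro-bucket -f parquet -o ./converted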

The CLI accepts a few additional parameters to authenticate with the cloud providers. These parameters are only required if you haven't already authenticated.

For Google Cloud, use --auth-file:

avroconvert gs -b <BUCKET_NAME> -f <FORMAT> -o <OUTPUT_FOLDER> --auth-file <SERVICE_ACCOUNT_FILE_PATH>.json (or .p12)

For Amazon S3, use --access-key, --secret-key, and --session-token:

avroconvert s3 -b <BUCKET_NAME> -f <FORMAT> -o <OUTPUT_FOLDER> --access-key <AWS_ACCESS_KEY_ID> --secret-key <AWS_SECRET_ACCESS_KEY> --session-token <AWS_SESSION_TOKEN> 

To read from the local filesystem:

avroconvert fs -i <INPUT_DATA_FOLDER> -o <OUTPUT_FOLDER> -f <OUTPUT_FORMAT>

With the above command, the tool reads all avro files from the input folder specified by the -i parameter, converts them to the format specified by the -f parameter, and writes the converted files to the output folder specified by the -o parameter.
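For example, assuming the avro files live under ./avro_data (a hypothetical path), the following converts them all to csv:

avroconvert fs -i ./avro_data -o ./converted -f csv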

Output folder structure

The tool replicates the cloud bucket's or local filesystem's directory structure in the output folder. For example, suppose the output format is parquet and the cloud bucket (or local filesystem) has the following structure:

BUCKET
├── 2021-06-17
│   └── file1.avro
│   └── file2.avro
│ 
├── 2021-06-16
│   └── data
│       └── file3.avro
│       └── file4.avro

The output files will then be saved as:

OUTPUT_FOLDER
├── 2021-06-17
│   └── file1.parquet
│   └── file2.parquet
│ 
├── 2021-06-16
│   └── data
│       └── file3.parquet
│       └── file4.parquet
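
In other words, each file keeps its path relative to the input root, and only the extension changes. The sketch below is a minimal illustration of that mapping, not the tool's actual implementation; map_output_path is a hypothetical helper:

    from pathlib import Path

    def map_output_path(src: Path, input_root: Path, output_root: Path, fmt: str) -> Path:
        # Keep the path relative to the input root; swap only the extension
        relative = src.relative_to(input_root)
        return output_root / relative.with_suffix(f'.{fmt}')

    print(map_output_path(Path('BUCKET/2021-06-16/data/file3.avro'),
                          Path('BUCKET'), Path('OUTPUT_FOLDER'), 'parquet'))
    # prints: OUTPUT_FOLDER/2021-06-16/data/file3.parquet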

Filter files to read

A -p (or --prefix) parameter can also be passed. It is shared by all three data sources: gs, s3, and fs. Only files whose names begin with the given prefix will be read; all other files are filtered out.

Google Cloud Storage example with -p:

avroconvert gs -b <BUCKET_NAME> -f <FORMAT> -o <OUTPUT_FOLDER> -p 2021-06-17/file

Amazon S3 example with -p:

avroconvert s3 -b <BUCKET_NAME> -f <FORMAT> -o <OUTPUT_FOLDER> -p 2021-06-17/file

Local filesystem example with -p:

avroconvert fs -i <INPUT_DATA_FOLDER> -o <OUTPUT_FOLDER> -f <OUTPUT_FORMAT> -p 2021-06-17/file

Using the API in code

    from avroconvert import Execute

    # Amazon S3 bucket reader
    output = Execute(source='s3', bucket='<S3_BUCKET>', dst_format='parquet',
                     outfolder='<OUTPUT_FOLDER>', access_key='<AWS ACCESS KEY>',
                     secret_key='<AWS SECRET KEY>', session_token='<AWS SESSION TOKEN> (if any)',
                     prefix='<FILE PREFIX>').run()

    # Google Cloud Storage bucket reader
    output = Execute(source='gs', bucket='<BUCKET_NAME>', dst_format='parquet',
                     auth_file='<SERVICE_ACCOUNT.json>', outfolder='<OUTPUT_FOLDER>').run()

    # Local filesystem reader
    output = Execute(source='fs', bucket='<LOCAL_FOLDER_NAME>', dst_format='parquet',
                     outfolder='<OUTPUT_FOLDER>').run()
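
After a run, you can sanity-check the output with pyarrow (a minimal sketch, assuming a parquet run and the example layout shown earlier; the exact path is hypothetical):

    import pyarrow.parquet as pq

    # Read one converted file back and confirm it has rows and a schema
    table = pq.read_table('OUTPUT_FOLDER/2021-06-17/file1.parquet')
    print(table.num_rows, table.schema)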

For more details on using the API, please visit readthedocs.

  • Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distribution

avroconvert-0.1.1-py2.py3-none-any.whl (16.6 kB)

File details

Details for the file avroconvert-0.1.1-py2.py3-none-any.whl.

File metadata

  • Download URL: avroconvert-0.1.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.0 CPython/3.7.9

File hashes

Hashes for avroconvert-0.1.1-py2.py3-none-any.whl

Algorithm    Hash digest
SHA256       8b87d740075a9910b8c3b385f1400c6e7cca37f3b9cb28820428998b550634f7
MD5          4380c9e5026c30c1233ebb96cc5199a9
BLAKE2b-256  97a6456644eee60f8a7dd52d51d5427df51cc14831adc66e6f79382a2caa1df3

