data-tools(et)

data-toolset is designed to simplify your data processing tasks by providing a more user-friendly alternative to traditional JAR utilities such as avro-tools and parquet-tools. With this Python package, you can handle various data file formats, including Avro and Parquet, through a simple and intuitive command-line interface.

Installation

Python 3.8, 3.9, and 3.10 are supported and tested (to some extent).

python -m pip install --user data-toolset

Usage

$ data-toolset -h
usage: data-toolset [-h] {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample} ...

positional arguments:
  {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample}
                        commands
    head                Print the first N records from a file
    tail                Print the last N records from a file
    meta                Print a file's metadata
    schema              Print the Avro schema for a file
    stats               Print statistics about a file
    query               Query a file
    validate            Validate a file
    merge               Merge multiple files into one
    count               Count the number of records in a file
    to_json             Convert a file to JSON format
    to_csv              Convert a file to CSV format
    to_avro             Convert a file to Avro format
    to_parquet          Convert a file to Parquet format
    random_sample       Randomly sample records from a file

Examples

Print the first 10 records of a Parquet file:

$ data-toolset head my_data.parquet -n 10
shape: (1, 7)
┌───────────┬─────┬──────────┬────────┬──────────────────────────┬────────────────────────────┬──────────────────┐
│ character │ age │ is_human │ height │ quote                    │ friends                    │ appearance       │
│ ---       │ --- │ ---      │ ---    │ ---                      │ ---                        │ ---              │
│ str       │ i64 │ bool     │ f64    │ str                      │ list[str]                  │ struct[2]        │
╞═══════════╪═════╪══════════╪════════╪══════════════════════════╪════════════════════════════╪══════════════════╡
│ Alice     │ 10  │ true     │ 150.5  │ Curiouser and curiouser! │ ["Rabbit", "Cheshire Cat"] │ {"blue","small"} │
└───────────┴─────┴──────────┴────────┴──────────────────────────┴────────────────────────────┴──────────────────┘

Query a Parquet file using a SQL-like expression:

$ data-toolset query my_data.parquet "SELECT * FROM 'my_data.parquet' WHERE height > 165"
shape: (2, 7)
┌─────────────────┬─────┬──────────┬────────┬───────────────────────┬────────────────────────────────────┬───────────────────┐
│ character       │ age │ is_human │ height │ quote                 │ friends                            │ appearance        │
│ ---             │ --- │ ---      │ ---    │ ---                   │ ---                                │ ---               │
│ str             │ i64 │ bool     │ f64    │ str                   │ list[str]                          │ struct[2]         │
╞═════════════════╪═════╪══════════╪════════╪═══════════════════════╪════════════════════════════════════╪═══════════════════╡
│ Mad Hatter      │ 35  │ true     │ 175.2  │ I'm late!             │ ["Alice"]                          │ {"green","tall"}  │
│ Queen of Hearts │ 50  │ false    │ 165.8  │ Off with their heads! │ ["White Rabbit", "King of Hearts"] │ {"red","average"} │
└─────────────────┴─────┴──────────┴────────┴───────────────────────┴────────────────────────────────────┴───────────────────┘

Merge multiple Avro files into one:

$ data-toolset merge file1.avro file2.avro file3.avro merged_file.avro

Convert an Avro file to Parquet:

$ data-toolset to_parquet my_data.avro output.parquet

Convert a Parquet file to JSON:

$ data-toolset to_json my_data.parquet output.json

Contributing

Contributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on GitHub.

TODO

  • optimizations [TBD]
  • benchmarking

