data-toolset

data-toolset is a Python package that offers a more user-friendly alternative to traditional JAR utilities such as avro-tools and parquet-tools. It provides a simple command-line interface for working with data files in formats including Avro and Parquet.

Installation

Python 3.9 and 3.10 are supported and partially tested.

pip install poetry
pip install git+https://github.com/luminousmen/data-toolset.git

Usage

$ data-toolset -h
usage: data-toolset [-h] {head,tail,meta,schema,stats,query,validate,merge,count} ...

positional arguments:
  {head,tail,meta,schema,stats,query,validate,merge,count}
                        commands
    head                Print the first N records from a file
    tail                Print the last N records from a file
    meta                Print a file's metadata
    schema              Print the Avro schema for a file
    stats               Print statistics about a file
    query               Query a file
    validate            Validate a file
    merge               Merge multiple files into one
    count               Count the number of records in a file

optional arguments:
  -h, --help            show this help message and exit
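
The help text above has the shape that Python's argparse produces for a parser with sub-commands. As an illustration only, a command-line interface of this shape could be declared roughly as follows. This is a sketch, not the project's actual source: the function name build_parser, the subset of commands shown, and the default of 10 for -n are all assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a few of the sub-commands from the help text.

    Hypothetical sketch: names and defaults here are assumptions,
    not taken from the data-toolset source code.
    """
    parser = argparse.ArgumentParser(prog="data-toolset")
    subparsers = parser.add_subparsers(dest="command", help="commands")

    # head and tail both take a file plus an optional record count.
    for name, text in (
        ("head", "Print the first N records from a file"),
        ("tail", "Print the last N records from a file"),
    ):
        sub = subparsers.add_parser(name, help=text)
        sub.add_argument("file")
        sub.add_argument("-n", type=int, default=10)  # default is assumed

    # count only needs the file to inspect.
    count = subparsers.add_parser(
        "count", help="Count the number of records in a file"
    )
    count.add_argument("file")
    return parser


args = build_parser().parse_args(["head", "my_data.parquet", "-n", "5"])
print(args.command, args.file, args.n)  # prints: head my_data.parquet 5
```

Each sub-command gets its own parser, so options such as -n apply only to the commands that declare them.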

Examples

Print the first 10 records of a Parquet file:

data-toolset head my_data.parquet -n 10

Query a Parquet file using a SQL-like expression:

data-toolset query my_data.parquet "SELECT * FROM 'my_data.parquet' WHERE age > 25"

Merge multiple Avro files into one:

data-toolset merge file1.avro file2.avro file3.avro merged_file.avro
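
The commands above can also be driven from a Python script. The sketch below builds the same argument lists and hands them to subprocess, which avoids shell-quoting problems with the SQL string. The helper names (head_cmd, query_cmd, merge_cmd, run) are hypothetical, and the argument order is inferred from the examples above (for merge, input files first and the output file last); data-toolset must be installed and on the PATH for run to succeed.

```python
import shlex
import subprocess
from typing import List, Sequence


def head_cmd(path: str, n: int = 10) -> List[str]:
    """Argument list for: data-toolset head <path> -n <n>."""
    return ["data-toolset", "head", path, "-n", str(n)]


def query_cmd(path: str, sql: str) -> List[str]:
    """Argument list for: data-toolset query <path> "<sql>"."""
    return ["data-toolset", "query", path, sql]


def merge_cmd(inputs: Sequence[str], output: str) -> List[str]:
    """Argument list for merge; output-last order is assumed from the example."""
    return ["data-toolset", "merge", *inputs, output]


def run(cmd: List[str]) -> str:
    """Run the command and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Show the shell-safe form of the query command without executing it.
cmd = query_cmd(
    "my_data.parquet",
    "SELECT * FROM 'my_data.parquet' WHERE age > 25",
)
print(shlex.join(cmd))
```

Passing a list rather than a single string means the SQL expression reaches the program as one argument, whatever quotes it contains.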

Contributing

Contributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on GitHub.

TODO

  • proper online documentation
  • update README
  • proper method docstrings
  • add tests for validate, merge, and count
  • create an artifact on PyPI
  • create random_sample function
  • create schema_evolution function
  • mature create_sample function
  • to/from csv and json functionality
  • optimizations TBD
  • test coverage
  • support 3.11+
