data-tools(et)
data-toolset is a Python package that simplifies data processing tasks by offering a more user-friendly alternative to traditional JAR utilities such as avro-tools and parquet-tools. It handles data file formats including Avro and Parquet through a simple command-line interface.
Installation
Python 3.8, 3.9, and 3.10 are supported and tested (to some extent).
python -m pip install --user data-toolset
Usage
$ data-toolset -h
usage: data-toolset [-h] {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample} ...
positional arguments:
{head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample}
commands
head Print the first N records from a file
tail Print the last N records from a file
meta Print a file's metadata
schema Print the Avro schema for a file
stats Print statistics about a file
query Query a file
validate Validate a file
merge Merge multiple files into one
count Count the number of records in a file
to_json Convert a file to JSON format
to_csv Convert a file to CSV format
to_avro Convert a file to Avro format
to_parquet Convert a file to Parquet format
random_sample Randomly sample records from a file
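The help text above has the shape that Python's `argparse` produces for a parser with subcommands. As a rough sketch only, and not the project's actual source, a small subset of that interface could be declared as follows. The function name `build_parser`, the attribute names `command` and `file`, and the default record count are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a small subset of the subcommands listed in the help text."""
    parser = argparse.ArgumentParser(prog="data-toolset")
    subparsers = parser.add_subparsers(dest="command", required=True, help="commands")

    # head and tail: one input file plus a record count.
    for name, text in (
        ("head", "Print the first N records from a file"),
        ("tail", "Print the last N records from a file"),
    ):
        sub = subparsers.add_parser(name, help=text)
        sub.add_argument("file")
        # Hypothetical default; the real default is not documented here.
        sub.add_argument("-n", type=int, default=20)

    # count: a single input file and no options.
    count = subparsers.add_parser("count", help="Count the number of records in a file")
    count.add_argument("file")

    return parser


# Parse the same arguments used in the first example of this README.
args = build_parser().parse_args(["head", "my_data.parquet", "-n", "10"])
print(args.command, args.file, args.n)
```

Each subcommand gets its own sub-parser, which is why `data-toolset head -h` style help can differ per command in tools built this way.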
Examples
Print the first 10 records of a Parquet file:
$ data-toolset head my_data.parquet -n 10
shape: (1, 7)
┌───────────┬─────┬──────────┬────────┬──────────────────────────┬────────────────────────────┬──────────────────┐
│ character ┆ age ┆ is_human ┆ height ┆ quote ┆ friends ┆ appearance │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool ┆ f64 ┆ str ┆ list[str] ┆ struct[2] │
╞═══════════╪═════╪══════════╪════════╪══════════════════════════╪════════════════════════════╪══════════════════╡
│ Alice ┆ 10 ┆ true ┆ 150.5 ┆ Curiouser and curiouser! ┆ ["Rabbit", "Cheshire Cat"] ┆ {"blue","small"} │
└───────────┴─────┴──────────┴────────┴──────────────────────────┴────────────────────────────┴──────────────────┘
Query a Parquet file using a SQL-like expression:
$ data-toolset query my_data.parquet "SELECT * FROM 'my_data.parquet' WHERE height > 165"
shape: (2, 7)
┌─────────────────┬─────┬──────────┬────────┬───────────────────────┬────────────────────────────────────┬───────────────────┐
│ character ┆ age ┆ is_human ┆ height ┆ quote ┆ friends ┆ appearance │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool ┆ f64 ┆ str ┆ list[str] ┆ struct[2] │
╞═════════════════╪═════╪══════════╪════════╪═══════════════════════╪════════════════════════════════════╪═══════════════════╡
│ Mad Hatter ┆ 35 ┆ true ┆ 175.2 ┆ I'm late! ┆ ["Alice"] ┆ {"green","tall"} │
│ Queen of Hearts ┆ 50 ┆ false ┆ 165.8 ┆ Off with their heads! ┆ ["White Rabbit", "King of Hearts"] ┆ {"red","average"} │
└─────────────────┴─────┴──────────┴────────┴───────────────────────┴────────────────────────────────────┴───────────────────┘
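To make the filter semantics concrete, the sketch below replays the same `WHERE height > 165` condition over the three sample rows shown in this README. It uses the standard-library `sqlite3` module purely as a stand-in; it is not the engine data-toolset uses, and the table name `my_data` and the reduced column set are assumptions for illustration.

```python
import sqlite3

# Rows copied from the sample output in this README (three columns only).
rows = [
    ("Alice", 10, 150.5),
    ("Mad Hatter", 35, 175.2),
    ("Queen of Hearts", 50, 165.8),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE my_data ("character" TEXT, age INTEGER, height REAL)')
conn.executemany("INSERT INTO my_data VALUES (?, ?, ?)", rows)

# Same predicate as the CLI example; ORDER BY keeps the result deterministic.
matches = conn.execute(
    'SELECT "character", height FROM my_data WHERE height > 165 ORDER BY "character"'
).fetchall()
conn.close()

print(matches)
```

Alice (150.5) is filtered out, while 165.8 passes because the comparison is strictly greater than 165, which matches the two-row result above.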
Merge multiple Avro files into one:
$ data-toolset merge file1.avro file2.avro file3.avro merged_file.avro
Convert an Avro file to Parquet:
$ data-toolset to_parquet my_data.avro output.parquet
Convert a Parquet file to JSON:
$ data-toolset to_json my_data.parquet output.json
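The commands in this section can also be chained from a Python script. The following is a minimal sketch, assuming `data-toolset` is installed and on `PATH`; the helper names `toolset_command` and `run_pipeline` are hypothetical, and only the subcommands and argument order shown in the examples above are used.

```python
import shutil
import subprocess
from typing import List, Sequence


def toolset_command(subcommand: str, *args: str) -> List[str]:
    """Build the argument list for one data-toolset invocation."""
    return ["data-toolset", subcommand, *args]


def run_pipeline(commands: Sequence[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    if shutil.which("data-toolset") is None:
        raise RuntimeError("data-toolset is not on PATH; install it first")
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)


# The merge and conversion examples from this README, chained together.
pipeline = [
    toolset_command("merge", "file1.avro", "file2.avro", "file3.avro", "merged_file.avro"),
    toolset_command("to_parquet", "merged_file.avro", "output.parquet"),
]
```

Calling `run_pipeline(pipeline)` would merge the three Avro files and then convert the merged file to Parquet. Passing arguments as a list avoids shell quoting issues with file names.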
Contributing
Contributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on GitHub.
TODO
- optimizations [TBD]
- benchmarking