Metadata generator using polars as backend.
PyMetaGen
PyMetaGen is a fast data quality tool based on Polars, designed for generating metadata and extracting useful information from various data file formats. It provides both a Python API and a command-line interface (CLI) to inspect, filter, and extract data from files such as CSV, JSON, Parquet, and Excel.
Key Features
- Metadata Generation: Automatically generates metadata for your datasets, including statistics such as min, max, standard deviation, and more.
- Data Extraction: Easily extract specific rows from your datasets using head, tail, or random sampling.
- Command Line Interface: Perform operations like metadata generation, data inspection, and filtering using an intuitive CLI.
- Multiple File Format Support: Import and export data in various formats, including CSV, Parquet, Excel, and JSON.
- SQL Query Support: Filter data using SQL queries directly on the command line.
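To illustrate the idea behind SQL-based filtering, here is a minimal sketch that runs the same kind of query used in the CLI example in this README. It is not PyMetaGen's implementation: the standard library's sqlite3 module stands in for the Polars backend, and the sample rows and the filter_rows helper are hypothetical. Only the table name data and the imdb_score column come from the README's own query.

```python
import sqlite3

# Hypothetical sample rows; only the imdb_score column name comes from the README.
rows = [
    ("Movie A", 9.3),
    ("Movie B", 8.1),
    ("Movie C", 9.1),
]

# Assumption: an in-memory SQLite database stands in for the Polars-backed engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (title TEXT, imdb_score REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)


def filter_rows(connection, query):
    """Run a SQL query against the connection and return all matching rows."""
    return connection.execute(query).fetchall()


filtered = filter_rows(conn, "SELECT * FROM data WHERE imdb_score > 9")
print(filtered)
```

The point of the sketch is that the data set is exposed as a table called data, so any WHERE clause over its columns can be used to select rows.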
Installation
To install the package, use the following command:
pip install pymetagen
Local Installation
To install the development version directly from the Git repository (this requires SSH access to GitHub), use the following command:
python -m pip install -U git+ssh://git@github.com/itsbigspark/dotdda.git@dev/main
Usage
Python API
You can use the Python API to load a data file and generate metadata:
from pymetagen import MetaGen
# Create an instance of the MetaGen class reading a data file
metagen = MetaGen.from_path("tests/data/testdata.csv", loading_mode="eager")
# Display the first few rows of the data
metagen.data.head()
# Generate metadata and reset the index
metadata = metagen.compute_metadata().reset_index()
# Save the metadata to a file
metagen.write_metadata("tests/data/testdata_metadata.csv")
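To give a rough sense of what per-column metadata can look like, the following sketch computes a few of the statistics mentioned under Key Features (min, max, standard deviation) using only the Python standard library. It is an illustration under stated assumptions, not the output format or logic of compute_metadata: the CSV content, the column_metadata helper, and the returned field names are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical CSV content standing in for a real data file.
raw = io.StringIO("title,imdb_score\nMovie A,9.3\nMovie B,8.1\nMovie C,9.1\n")
records = list(csv.DictReader(raw))


def column_metadata(records, column):
    """Summarise one numeric column: count, min, max, mean and sample std."""
    # Skip empty cells so missing values do not break the float conversion.
    values = [float(r[column]) for r in records if r[column] != ""]
    return {
        "column": column,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


meta = column_metadata(records, "imdb_score")
print(meta)
```

A real metadata table would hold one such record per column; the field names PyMetaGen actually emits may differ from the ones assumed here.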
Command Line Interface
- Metadata Generation: Generate metadata for a tabular data file:
$ metagen metadata -i tests/data/testdata.csv -o tests/data/testdata_metadata.csv
>>> Generating metadata for tests/data/testdata.csv...
- Data Inspection: Inspect a data file (e.g., a partitioned Parquet file):
$ metagen inspect -i tests/data/input_ab_partition.parquet
- Data Filtering: Filter a data set using an SQL query:
$ metagen filter -i tests/data/testdata.csv -q "SELECT * FROM data WHERE imdb_score > 9"
- Data Extraction: Extract a specific number of rows from a data set:
$ metagen extracts -i tests/data/testdata.csv -o tests.csv -n 3
>>> Writing extract in: tests-head.csv
>>> Writing extract in: tests-tail.csv
>>> Writing extract in: tests-sample.csv
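The extraction output above suggests that a single output path is expanded into three files, one each for the head, tail, and random sample. The sketch below mirrors that behaviour with the standard library. The naming rule is inferred from the sample output only, and extract_paths, extract_rows, and the seed parameter are hypothetical helpers, not part of PyMetaGen.

```python
import random
from pathlib import Path


def extract_paths(output):
    """Derive the three extract file names from one output path."""
    path = Path(output)
    # Assumption: "<stem>-<kind><suffix>", as suggested by the CLI output.
    return {
        kind: path.with_name(f"{path.stem}-{kind}{path.suffix}")
        for kind in ("head", "tail", "sample")
    }


def extract_rows(rows, n, seed=None):
    """Return the first n rows, the last n rows, and n randomly sampled rows."""
    rng = random.Random(seed)
    k = min(n, len(rows))  # never request more samples than rows available
    return {
        "head": rows[:n],
        "tail": rows[-n:] if n > 0 else [],
        "sample": rng.sample(rows, k),
    }


rows = list(range(10))  # hypothetical stand-in for the rows of a table
extracts = extract_rows(rows, 3, seed=0)
paths = extract_paths("tests.csv")
print(paths["head"].name, extracts["head"], extracts["tail"])
```

Passing a seed makes the random sample reproducible, which is convenient when extracts are checked into a test suite.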
Available Output Formats
- CSV
- Parquet
- JSON
- Excel