Swiss Army knife of scripts for transforming and processing datasets for machine learning.
mldatasets
conversion
Currently, mldataforge provides space- and time-efficient conversions between JSONL (with or without compression), MosaicML Dataset (MDS) format, and Parquet. The implementations convert individual samples or small batches of samples at a time and use multiple cores where possible. This makes mldataforge well suited to transforming TB-scale datasets on data processing nodes with many cores.
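The batch-wise approach described above can be illustrated with a minimal, standard-library-only sketch. It is not mldataforge's implementation: the helper names `open_text`, `iter_batches`, and `convert_jsonl` are hypothetical, and the example only re-encodes JSONL to (optionally gzip-compressed) JSONL. The point is that memory use stays bounded by the batch size rather than the dataset size.

```python
import gzip
import json
from itertools import islice
from pathlib import Path


def open_text(path, mode):
    """Open a JSONL file as text; a .gz suffix selects gzip (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")


def iter_batches(path, batch_size=65536):
    """Yield lists of at most batch_size decoded samples from a JSONL file."""
    with open_text(path, "r") as f:
        lines = (line for line in f if line.strip())  # skip blank lines
        while True:
            batch = [json.loads(line) for line in islice(lines, batch_size)]
            if not batch:
                return
            yield batch


def convert_jsonl(src, dst, batch_size=65536):
    """Copy src to dst one batch at a time; return the number of samples written."""
    count = 0
    with open_text(dst, "w") as out:
        for batch in iter_batches(src, batch_size):
            out.writelines(json.dumps(sample) + "\n" for sample in batch)
            count += len(batch)
    return count
```

Calling `convert_jsonl("data.jsonl", "data.jsonl.gz")` (file names assumed for illustration) would stream the input and write a gzip-compressed copy without ever holding more than one batch in memory.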
splitting
Currently, mldataforge provides space- and time-efficient splitting of JSONL (with or without compression). The implementations process individual samples or small batches of samples and make efficient use of multi-core architectures where possible. The splitting function can also take an already split dataset and re-split it at a different granularity.
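To make the idea of re-splitting concrete, here is a small sketch using only the standard library. It is an illustration under stated assumptions, not the package's own code: `iter_lines` and `split_jsonl` are hypothetical names, the `part-0000` naming scheme is invented for the example, and granularity is expressed as a fixed number of samples per file.

```python
import gzip
from pathlib import Path


def open_text(path, mode):
    """Open a JSONL file as text; a .gz suffix selects gzip (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")


def iter_lines(paths):
    """Stream non-empty lines from several JSONL files as one sequence."""
    for path in paths:
        with open_text(path, "r") as f:
            for line in f:
                if line.strip():
                    # Normalize a missing final newline so parts concatenate cleanly.
                    yield line if line.endswith("\n") else line + "\n"


def split_jsonl(paths, out_dir, samples_per_file, suffix=".jsonl"):
    """Write all samples from paths into parts of at most samples_per_file lines."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts, out, written = [], None, 0
    try:
        for line in iter_lines(paths):
            if out is None or written == samples_per_file:
                if out is not None:
                    out.close()
                part = out_dir / f"part-{len(parts):04d}{suffix}"  # assumed naming
                out = open_text(part, "w")
                parts.append(part)
                written = 0
            out.write(line)
            written += 1
    finally:
        if out is not None:
            out.close()
    return parts
```

Because the input is treated as one continuous stream of samples, passing the parts of an earlier split back in as `paths` re-splits the dataset with a new `samples_per_file` while preserving sample order.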
installation and general usage
pip install mldataforge
python -m mldataforge --help
usage example: converting MosaicML Dataset (MDS) to Parquet format
Usage: python -m mldataforge convert mds parquet [OPTIONS] OUTPUT_FILE
                                                 MDS_DIRECTORIES...

Options:
  --compression [snappy|gzip|zstd]
                        Compress the output file (default: snappy).
  --overwrite           Overwrite existing path.
  --yes                 Assume yes to all prompts. Use with caution
                        as it will remove files or even entire
                        directories without confirmation.
  --batch-size INTEGER  Batch size for loading data and writing
                        files (default: 65536).
  --no-bulk             Use a custom space and time-efficient bulk
                        reader (only gzip and no compression).
  --help                Show this message and exit.
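As a sketch of how the options above combine, the following snippet assembles a command line for the `convert mds parquet` subcommand from Python. Only flags listed in the help text are used; the helper `build_convert_command` and the paths `train.parquet`, `mds/shard-a`, and `mds/shard-b` are assumptions made for illustration.

```python
import shlex
import sys


def build_convert_command(output_file, mds_directories, compression="snappy",
                          batch_size=65536, overwrite=False):
    """Build the argument list for `mldataforge convert mds parquet` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "mldataforge", "convert", "mds", "parquet",
           "--compression", compression, "--batch-size", str(batch_size)]
    if overwrite:
        cmd.append("--overwrite")
    cmd.append(output_file)          # OUTPUT_FILE
    cmd.extend(mds_directories)      # MDS_DIRECTORIES...
    return cmd


# Example paths are placeholders, not files shipped with the package.
cmd = build_convert_command("train.parquet", ["mds/shard-a", "mds/shard-b"],
                            compression="zstd")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where mldataforge is installed. `--yes` is deliberately left out of the sketch, since the help text warns that it removes files or directories without confirmation.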
Download files
Source Distribution
File details
Details for the file mldataforge-0.1.5.tar.gz.
File metadata
- Download URL: mldataforge-0.1.5.tar.gz
- Upload date:
- Size: 10.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5f10eb66caaa5135b5815af3f82b5a0a5008859d1a65544251e78b0fdb9e7a9b |
| MD5 | 484ad36ea5d4b68aee7cdc442b2142ca |
| BLAKE2b-256 | 9159880c0dda1ba36ac56dd0ffd66b52e3ea469a9981a5ff986e13405d1e95e7 |